Re: [mm/debug_vm_pgtable] a97a171093: BUG:unable_to_handle_page_fault_for_address

2020-07-09 Thread Anshuman Khandual



On 07/09/2020 11:41 AM, kernel test robot wrote:
> [   94.349598] BUG: unable to handle page fault for address: ed10a7ffddff
> [   94.351039] #PF: supervisor read access in kernel mode
> [   94.352172] #PF: error_code(0x) - not-present page
> [   94.353256] PGD 43ffed067 P4D 43ffed067 PUD 43fdee067 PMD 0 
> [   94.354484] Oops:  [#1] SMP KASAN
> [   94.355238] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 
> 5.8.0-rc4-2-ga97a17109332c #1
> [   94.360456] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.12.0-1 04/01/2014
> [   94.361950] RIP: 0010:hugetlb_advanced_tests+0x137/0x699
> [   94.363026] Code: 8b 13 4d 85 f6 75 0b 48 ff 05 2c e4 6a 01 31 ed eb 41 bf 
> f8 ff ff ff ba ff ff 37 00 4c 01 f7 48 c1 e2 2a 48 89 f9 48 c1 e9 03 <80> 3c 
> 11 00 74 05 e8 cd c0 67 fa ba f8 ff ff ff 49 8b 2c 16 48 85
> [   94.366592] RSP: :c9047d30 EFLAGS: 00010a06
> [   94.367693] RAX: 11049b80 RBX: 888380525308 RCX: 
> 1110a7ffddff
> [   94.369215] RDX: dc00 RSI: 111087ffdc00 RDI: 
> 88853ffeeff8
> [   94.370693] RBP: 0018e510 R08: 0025 R09: 
> 0001
> [   94.372165] R10: 888380523c07 R11: ed10700a4780 R12: 
> 88843208e510
> [   94.373674] R13: 0025 R14: 88843ffef000 R15: 
> 31e01ae61000
> [   94.375147] FS:  () GS:8883a380() 
> knlGS:
> [   94.376883] CS:  0010 DS:  ES:  CR0: 80050033
> [   94.378051] CR2: ed10a7ffddff CR3: 04e15000 CR4: 
> 000406a0
> [   94.379522] Call Trace:
> [   94.380073]  debug_vm_pgtable+0xd81/0x2029
> [   94.380871]  ? pmd_advanced_tests+0x621/0x621
> [   94.381819]  do_one_initcall+0x1eb/0xbd0
> [   94.382551]  ? trace_event_raw_event_initcall_finish+0x240/0x240
> [   94.383634]  ? rcu_read_lock_sched_held+0xb9/0x110
> [   94.388727]  ? rcu_read_lock_held+0xd0/0xd0
> [   94.389604]  ? __kasan_check_read+0x1d/0x30
> [   94.390485]  kernel_init_freeable+0x430/0x4f8
> [   94.391416]  ? rest_init+0x3f8/0x3f8
> [   94.392185]  kernel_init+0x14/0x1e8
> [   94.392918]  ret_from_fork+0x22/0x30
> [   94.393662] Modules linked in:
> [   94.394289] CR2: ed10a7ffddff
> [   94.395000] ---[ end trace 8ca5a1655dfb8c39 ]---

This bug is triggered here:

static inline struct mem_section *__nr_to_section(unsigned long nr)
{
#ifdef CONFIG_SPARSEMEM_EXTREME
        if (!mem_section)
                return NULL;
#endif
        if (!mem_section[SECTION_NR_TO_ROOT(nr)])        <-- BUG
                return NULL;
        return &mem_section[SECTION_NR_TO_ROOT(nr)][nr & SECTION_ROOT_MASK];
}

static inline struct mem_section *__pfn_to_section(unsigned long pfn)
{
        return __nr_to_section(pfn_to_section_nr(pfn));
}

#define __pfn_to_page(pfn)                              \
({      unsigned long __pfn = (pfn);                    \
        struct mem_section *__sec = __pfn_to_section(__pfn);   \
        __section_mem_map_addr(__sec) + __pfn;          \
})

which is called via hugetlb_advanced_tests().

paddr = (__pfn_to_phys(pfn) | RANDOM_ORVALUE) & PMD_MASK;
pte = pte_mkhuge(mk_pte(pfn_to_page(PHYS_PFN(paddr)), prot));

The primary reason is RANDOM_ORVALUE, which gets ORed into the paddr before
it is masked with PMD_MASK. This clobbers the pfn value, which then cannot
be looked up in the relevant memory sections. The problem stays hidden on
other configs where pfn_to_page() does not go through a memory section
lookup. Dropping RANDOM_ORVALUE solves the problem. I probably meant to
drop it during the V2 series (https://lkml.org/lkml/2020/4/8/997) but
don't remember why it ended up being kept.
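A minimal sketch of the fix in hugetlb_advanced_tests(), assuming we simply
drop RANDOM_ORVALUE from the paddr calculation (untested):

        /* keep paddr tied to a real pfn so the sparsemem lookup in pfn_to_page() succeeds */
        paddr = __pfn_to_phys(pfn) & PMD_MASK;
        pte = pte_mkhuge(mk_pte(pfn_to_page(PHYS_PFN(paddr)), prot));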


Re: PowerNV PCI & SR-IOV cleanups

2020-07-09 Thread Christoph Hellwig
On Fri, Jul 10, 2020 at 03:23:25PM +1000, Oliver O'Halloran wrote:
> This is largely prep work for supporting VFs in the 32bit MMIO window.
> This is an unfortunate necessity due to how the Linux BAR allocator
> handles BARs marked as non-prefetchable. The distinction between
> prefetchable and non-prefetchable BARs was made largely irrelevant with
> the introduction of PCIe, but the BAR allocator is overly conservative.
> It will always place non-prefetchable BARs in the non-prefetchable
> window, which is 32bit only. This results in us being unable to use VFs
> from NVMe drives and a few different RAID cards.

How about fixing that in the core PCI code?

(nothing against this series though, as it seems like a massive
cleanup)


Re: [PATCH] powerpc/perf: Add kernel support for new MSR[HV PR] bits in trace-imc.

2020-07-09 Thread Madhavan Srinivasan




On 7/3/20 12:06 PM, Anju T Sudhakar wrote:

IMC trace-mode record has MSR[HV PR] bits added in the third DW.
These bits can be used to set the cpumode for the instruction pointer
captured in each sample.

Add support in the kernel to use these bits to set the cpumode for
each sample.


Changes look fine to me.
Reviewed-by: Madhavan Srinivasan 

   
Signed-off-by: Anju T Sudhakar 

---
  arch/powerpc/include/asm/imc-pmu.h |  5 +
  arch/powerpc/perf/imc-pmu.c| 29 -
  2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/imc-pmu.h 
b/arch/powerpc/include/asm/imc-pmu.h
index 4da4fcba0684..4f897993b710 100644
--- a/arch/powerpc/include/asm/imc-pmu.h
+++ b/arch/powerpc/include/asm/imc-pmu.h
@@ -99,6 +99,11 @@ struct trace_imc_data {
   */
  #define IMC_TRACE_RECORD_TB1_MASK  0x3ffULL

+/*
+ * Bit 0:1 in third DW of IMC trace record
+ * specifies the MSR[HV PR] values.
+ */
+#define IMC_TRACE_RECORD_VAL_HVPR(x)   ((x) >> 62)

  /*
   * Device tree parser code detects IMC pmu support and
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index cb50a9e1fd2d..310922fed9eb 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -1178,11 +1178,30 @@ static int trace_imc_prepare_sample(struct 
trace_imc_data *mem,
header->size = sizeof(*header) + event->header_size;
header->misc = 0;

-   if (is_kernel_addr(data->ip))
-   header->misc |= PERF_RECORD_MISC_KERNEL;
-   else
-   header->misc |= PERF_RECORD_MISC_USER;
-
+   if (cpu_has_feature(CPU_FTRS_POWER9)) {
+   if (is_kernel_addr(data->ip))
+   header->misc |= PERF_RECORD_MISC_KERNEL;
+   else
+   header->misc |= PERF_RECORD_MISC_USER;
+   } else {
+   switch (IMC_TRACE_RECORD_VAL_HVPR(mem->val)) {
+   case 0:/* when MSR HV and PR not set in the trace-record */
+   header->misc |= PERF_RECORD_MISC_GUEST_KERNEL;
+   break;
+   case 1: /* MSR HV is 0 and PR is 1 */
+   header->misc |= PERF_RECORD_MISC_GUEST_USER;
+   break;
+   case 2: /* MSR Hv is 1 and PR is 0 */
+   header->misc |= PERF_RECORD_MISC_HYPERVISOR;
+   break;
+   case 3: /* MSR HV is 1 and PR is 1 */
+   header->misc |= PERF_RECORD_MISC_USER;
+   break;
+   default:
+   pr_info("IMC: Unable to set the flag based on MSR 
bits\n");
+   break;
+   }
+   }
perf_event_header__init_id(header, data, event);

return 0;




Re: [PATCH v5 2/2] powerpc/hv-24x7: Add sysfs files inside hv-24x7 device to show cpumask

2020-07-09 Thread Madhavan Srinivasan




On 7/9/20 10:48 AM, Kajol Jain wrote:

This patch adds a cpumask attribute to the hv_24x7 pmu, along with ABI
documentation.

The primary reason for exposing the cpumask is the perf tool, which can
parse the driver's sysfs folder and understand the cpumask file. Having a
cpumask file will reduce the number of perf command line parameters (it
avoids the "-C" option in the perf tool command line). It can also tell
the user which cpu is currently used to retrieve the counter data.

command:# cat /sys/devices/hv_24x7/interface/cpumask
0



Reviewed-by: Madhavan Srinivasan 


Signed-off-by: Kajol Jain 
---
  .../ABI/testing/sysfs-bus-event_source-devices-hv_24x7| 7 +++
  arch/powerpc/perf/hv-24x7.c   | 8 
  2 files changed, 15 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7 
b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
index e8698afcd952..f7e32f218f73 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-hv_24x7
@@ -43,6 +43,13 @@ Description: read only
This sysfs interface exposes the number of cores per chip
present in the system.

+What:  /sys/devices/hv_24x7/interface/cpumask
+Date:  July 2020
+Contact:   Linux on PowerPC Developer List 
+Description:   read only
+   This sysfs file exposes the cpumask which is designated to make
+   HCALLs to retrieve hv-24x7 pmu event counter data.
+
  What: /sys/bus/event_source/devices/hv_24x7/event_descs/
  Date: February 2014
  Contact:  Linux on PowerPC Developer List 
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 93b4700dcf8c..acc34148ad09 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -448,6 +448,12 @@ static ssize_t device_show_string(struct device *dev,
return sprintf(buf, "%s\n", (char *)d->var);
  }

+static ssize_t cpumask_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   return cpumap_print_to_pagebuf(true, buf, &hv_24x7_cpumask);
+}
+
  static ssize_t sockets_show(struct device *dev,
struct device_attribute *attr, char *buf)
  {
@@ -1115,6 +1121,7 @@ static DEVICE_ATTR_RO(domains);
  static DEVICE_ATTR_RO(sockets);
  static DEVICE_ATTR_RO(chipspersocket);
  static DEVICE_ATTR_RO(coresperchip);
+static DEVICE_ATTR_RO(cpumask);

  static struct bin_attribute *if_bin_attrs[] = {
&bin_attr_catalog,
@@ -1128,6 +1135,7 @@ static struct attribute *if_attrs[] = {
&dev_attr_sockets.attr,
&dev_attr_chipspersocket.attr,
&dev_attr_coresperchip.attr,
+   &dev_attr_cpumask.attr,
NULL,
  };





[PATCH 15/15] powerpc/powernv/sriov: Make single PE mode a per-BAR setting

2020-07-09 Thread Oliver O'Halloran
Using single PE BARs to map an SR-IOV BAR is really a choice about what
strategy to use when mapping a BAR. It doesn't make much sense for this to
be a global setting since a device might have one large BAR which needs to
be mapped with single PE windows and another smaller BAR that can be mapped
with a regular segmented window. Make the segmented vs single decision a
per-BAR setting and clean up the logic that decides which mode to use.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 131 +++--
 arch/powerpc/platforms/powernv/pci.h   |  10 +-
 2 files changed, 75 insertions(+), 66 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 8de03636888a..87377d95d648 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -146,10 +146,9 @@
 static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
 {
struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
-   const resource_size_t gate = phb->ioda.m64_segsize >> 2;
struct resource *res;
int i;
-   resource_size_t size, total_vf_bar_sz;
+   resource_size_t vf_bar_sz;
struct pnv_iov_data *iov;
int mul, total_vfs;
 
@@ -158,9 +157,9 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
goto disable_iov;
pdev->dev.archdata.iov_data = iov;
 
+   /* FIXME: totalvfs > phb->ioda.total_pe_num is going to be a problem */
total_vfs = pci_sriov_get_totalvfs(pdev);
mul = phb->ioda.total_pe_num;
-   total_vf_bar_sz = 0;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
@@ -173,50 +172,51 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
goto disable_iov;
}
 
-   total_vf_bar_sz += pci_iov_resource_size(pdev,
-   i + PCI_IOV_RESOURCES);
+   vf_bar_sz = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
 
/*
-* If bigger than quarter of M64 segment size, just round up
-* power of two.
+* Generally, one segmented M64 BAR maps one IOV BAR. However,
+* if a VF BAR is too large we end up wasting a lot of space.
+* If we've got a BAR that's bigger than 1/4 of the
+* default window's segment size then switch to using single PE
+* windows. This limits the total number of VFs we can support.
 *
-* Generally, one M64 BAR maps one IOV BAR. To avoid conflict
-* with other devices, IOV BAR size is expanded to be
-* (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
-* segment size , the expanded size would equal to half of the
-* whole M64 space size, which will exhaust the M64 Space and
-* limit the system flexibility.  This is a design decision to
-* set the boundary to quarter of the M64 segment size.
+* The 1/4 limit is arbitrary and can be tweaked.
 */
-   if (total_vf_bar_sz > gate) {
-   mul = roundup_pow_of_two(total_vfs);
-   dev_info(&pdev->dev,
-   "VF BAR Total IOV size %llx > %llx, roundup to 
%d VFs\n",
-   total_vf_bar_sz, gate, mul);
-   iov->m64_single_mode = true;
-   break;
-   }
-   }
+   if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
+   /*
+* On PHB3, the minimum size alignment of M64 BAR in
+* single mode is 32MB. If this VF BAR is smaller than
+* 32MB, but still too large for a segmented window
+* then we can't map it and need to disable SR-IOV for
+* this device.
+*/
+   if (vf_bar_sz < SZ_32M) {
+   pci_err(pdev, "VF BAR%d: %pR can't be mapped in 
single PE mode\n",
+   i, res);
+   goto disable_iov;
+   }
 
-   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
-   res = &pdev->resource[i + PCI_IOV_RESOURCES];
-   if (!res->flags || res->parent)
+   iov->m64_single_mode[i] = true;
continue;
+   }
+
 
-   size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
/*
-* On PHB3, the minimum size alignment of M64 BAR in single
-* mode is 32MB.
+* This BAR can be mapped with one segme

[PATCH 14/15] powerpc/powernv/sriov: Refactor M64 BAR setup

2020-07-09 Thread Oliver O'Halloran
Split up the logic so that we have one branch that handles setting up a
segmented window and another that handles setting up single PE windows for
each VF.

Signed-off-by: Oliver O'Halloran 
---
This patch could be folded into the previous one. I've kept it
separate mainly because the diff is *horrific* when they're merged.
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 57 ++
 1 file changed, 27 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 2f967aa4fbf5..8de03636888a 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -441,52 +441,49 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
struct resource   *res;
inti, j;
int64_trc;
-   inttotal_vfs;
resource_size_tsize, start;
-   intm64_bars;
+   intbase_pe_num;
 
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
-   total_vfs = pci_sriov_get_totalvfs(pdev);
-
-   if (iov->m64_single_mode)
-   m64_bars = num_vfs;
-   else
-   m64_bars = 1;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
if (!res->flags || !res->parent)
continue;
 
-   for (j = 0; j < m64_bars; j++) {
+   /* don't need single mode? map everything in one go! */
+   if (!iov->m64_single_mode) {
win = pnv_pci_alloc_m64_bar(phb, iov);
if (win < 0)
goto m64_failed;
 
-   if (iov->m64_single_mode) {
-   int pe_num = iov->vf_pe_arr[j].pe_number;
-
-   size = pci_iov_resource_size(pdev,
-   PCI_IOV_RESOURCES + i);
-   start = res->start + size * j;
-   rc = pnv_ioda_map_m64_single(phb, win,
-pe_num,
-start,
-size);
-   } else {
-   size = resource_size(res);
-   start = res->start;
-
-   rc = pnv_ioda_map_m64_accordion(phb, win, start,
-   size);
-   }
+   size = resource_size(res);
+   start = res->start;
 
-   if (rc != OPAL_SUCCESS) {
-   dev_err(&pdev->dev, "Failed to map M64 window 
#%d: %lld\n",
-   win, rc);
+   rc = pnv_ioda_map_m64_accordion(phb, win, start, size);
+   if (rc)
+   goto m64_failed;
+
+   continue;
+   }
+
+   /* otherwise map each VF with single PE BARs */
+   size = pci_iov_resource_size(pdev, PCI_IOV_RESOURCES + i);
+   base_pe_num = iov->vf_pe_arr[0].pe_number;
+
+   for (j = 0; j < num_vfs; j++) {
+   win = pnv_pci_alloc_m64_bar(phb, iov);
+   if (win < 0)
+   goto m64_failed;
+
+   start = res->start + size * j;
+   rc = pnv_ioda_map_m64_single(phb, win,
+base_pe_num + j,
+start,
+size);
+   if (rc)
goto m64_failed;
-   }
}
}
return 0;
-- 
2.26.2



[PATCH 13/15] powerpc/powernv/sriov: Move M64 BAR allocation into a helper

2020-07-09 Thread Oliver O'Halloran
I want to refactor the loop this code is currently inside of. Hoist it on
out.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 31 ++
 1 file changed, 20 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index d5699cd2ab7a..2f967aa4fbf5 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -416,6 +416,23 @@ static int64_t pnv_ioda_map_m64_single(struct pnv_phb *phb,
return rc;
 }
 
+static int pnv_pci_alloc_m64_bar(struct pnv_phb *phb, struct pnv_iov_data *iov)
+{
+   int win;
+
+   do {
+   win = find_next_zero_bit(&phb->ioda.m64_bar_alloc,
+   phb->ioda.m64_bar_idx + 1, 0);
+
+   if (win >= phb->ioda.m64_bar_idx + 1)
+   return -1;
+   } while (test_and_set_bit(win, &phb->ioda.m64_bar_alloc));
+
+   set_bit(win, iov->used_m64_bar_mask);
+
+   return win;
+}
+
 static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
struct pnv_iov_data   *iov;
@@ -443,17 +460,9 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 
num_vfs)
continue;
 
for (j = 0; j < m64_bars; j++) {
-
-   /* allocate a window ID for this BAR */
-   do {
-   win = 
find_next_zero_bit(&phb->ioda.m64_bar_alloc,
-   phb->ioda.m64_bar_idx + 1, 0);
-
-   if (win >= phb->ioda.m64_bar_idx + 1)
-   goto m64_failed;
-   } while (test_and_set_bit(win, 
&phb->ioda.m64_bar_alloc));
-   set_bit(win, iov->used_m64_bar_mask);
-
+   win = pnv_pci_alloc_m64_bar(phb, iov);
+   if (win < 0)
+   goto m64_failed;
 
if (iov->m64_single_mode) {
int pe_num = iov->vf_pe_arr[j].pe_number;
-- 
2.26.2



[PATCH 12/15] powerpc/powernv/sriov: De-indent setup and teardown

2020-07-09 Thread Oliver O'Halloran
Remove the IODA2 PHB checks. We already assume IODA2 in several places so
there's not much point in wrapping most of the setup and teardown process
in an if block.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 86 --
 1 file changed, 49 insertions(+), 37 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 08f88187d65a..d5699cd2ab7a 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -610,16 +610,18 @@ static void pnv_pci_sriov_disable(struct pci_dev *pdev)
num_vfs = iov->num_vfs;
base_pe = iov->vf_pe_arr[0].pe_number;
 
+   if (WARN_ON(!iov))
+   return;
+
/* Release VF PEs */
pnv_ioda_release_vf_PE(pdev);
 
-   if (phb->type == PNV_PHB_IODA2) {
-   if (!iov->m64_single_mode)
-   pnv_pci_vf_resource_shift(pdev, -base_pe);
+   /* Un-shift the IOV BAR resources */
+   if (!iov->m64_single_mode)
+   pnv_pci_vf_resource_shift(pdev, -base_pe);
 
-   /* Release M64 windows */
-   pnv_pci_vf_release_m64(pdev, num_vfs);
-   }
+   /* Release M64 windows */
+   pnv_pci_vf_release_m64(pdev, num_vfs);
 }
 
 static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
@@ -693,41 +695,51 @@ static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 
num_vfs)
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
 
-   if (phb->type == PNV_PHB_IODA2) {
-   if (!iov->vfs_expanded) {
-   dev_info(&pdev->dev, "don't support this SRIOV device"
-   " with non 64bit-prefetchable IOV BAR\n");
-   return -ENOSPC;
-   }
+   /*
+* There's a calls to IODA2 PE setup code littered throughout. We could
+* probably fix that, but we'd still have problems due to the
+* restriction inherent on IODA1 PHBs.
+*
+* NB: We class IODA3 as IODA2 since they're very similar.
+*/
+   if (phb->type != PNV_PHB_IODA2) {
+   pci_err(pdev, "SR-IOV is not supported on this PHB\n");
+   return -ENXIO;
+   }
 
-   /* allocate a contigious block of PEs for our VFs */
-   base_pe = pnv_ioda_alloc_pe(phb, num_vfs);
-   if (!base_pe) {
-   pci_err(pdev, "Unable to allocate PEs for %d VFs\n", 
num_vfs);
-   return -EBUSY;
-   }
+   if (!iov->vfs_expanded) {
+   dev_info(&pdev->dev, "don't support this SRIOV device"
+   " with non 64bit-prefetchable IOV BAR\n");
+   return -ENOSPC;
+   }
 
-   iov->vf_pe_arr = base_pe;
-   iov->num_vfs = num_vfs;
+   /* allocate a contigious block of PEs for our VFs */
+   base_pe = pnv_ioda_alloc_pe(phb, num_vfs);
+   if (!base_pe) {
+   pci_err(pdev, "Unable to allocate PEs for %d VFs\n", num_vfs);
+   return -EBUSY;
+   }
 
-   /* Assign M64 window accordingly */
-   ret = pnv_pci_vf_assign_m64(pdev, num_vfs);
-   if (ret) {
-   dev_info(&pdev->dev, "Not enough M64 window 
resources\n");
-   goto m64_failed;
-   }
+   iov->vf_pe_arr = base_pe;
+   iov->num_vfs = num_vfs;
 
-   /*
-* When using one M64 BAR to map one IOV BAR, we need to shift
-* the IOV BAR according to the PE# allocated to the VFs.
-* Otherwise, the PE# for the VF will conflict with others.
-*/
-   if (!iov->m64_single_mode) {
-   ret = pnv_pci_vf_resource_shift(pdev,
-   base_pe->pe_number);
-   if (ret)
-   goto shift_failed;
-   }
+   /* Assign M64 window accordingly */
+   ret = pnv_pci_vf_assign_m64(pdev, num_vfs);
+   if (ret) {
+   dev_info(&pdev->dev, "Not enough M64 window resources\n");
+   goto m64_failed;
+   }
+
+   /*
+* When using one M64 BAR to map one IOV BAR, we need to shift
+* the IOV BAR according to the PE# allocated to the VFs.
+* Otherwise, the PE# for the VF will conflict with others.
+*/
+   if (!iov->m64_single_mode) {
+   ret = pnv_pci_vf_resource_shift(pdev,
+   base_pe->pe_number);
+   if (ret)
+   goto shift_failed;
}
 
/* Setup VF PEs */
-- 
2.26.2



[PATCH 11/15] powerpc/powernv/sriov: Drop iov->pe_num_map[]

2020-07-09 Thread Oliver O'Halloran
Currently the iov->pe_num_map[] does one of two things depending on
whether single PE mode is being used or not. When it is, this contains an
array which maps a vf_index to the corresponding PE number. When single PE
mode is not being used this contains a scalar which is the base PE for the
set of enabled VFs (so the PE for VFn is base + n).

The array was necessary because when calling pnv_ioda_alloc_pe() there is
no guarantee that the allocated PEs would be contiguous. We can now
allocate contiguous blocks of PEs so this is no longer an issue. This
allows us to drop the if (single_mode) {} .. else {} blocks scattered
through the SR-IOV code, which is a nice cleanup.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 109 +
 arch/powerpc/platforms/powernv/pci.h   |   4 +-
 2 files changed, 25 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index d53a85ccb538..08f88187d65a 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -456,11 +456,13 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
 
if (iov->m64_single_mode) {
+   int pe_num = iov->vf_pe_arr[j].pe_number;
+
size = pci_iov_resource_size(pdev,
PCI_IOV_RESOURCES + i);
start = res->start + size * j;
rc = pnv_ioda_map_m64_single(phb, win,
-iov->pe_num_map[j],
+pe_num,
 start,
 size);
} else {
@@ -599,38 +601,24 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
 
 static void pnv_pci_sriov_disable(struct pci_dev *pdev)
 {
+   u16num_vfs, base_pe;
struct pnv_phb*phb;
-   struct pnv_ioda_pe*pe;
struct pnv_iov_data   *iov;
-   u16num_vfs, i;
 
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
num_vfs = iov->num_vfs;
+   base_pe = iov->vf_pe_arr[0].pe_number;
 
/* Release VF PEs */
pnv_ioda_release_vf_PE(pdev);
 
if (phb->type == PNV_PHB_IODA2) {
if (!iov->m64_single_mode)
-   pnv_pci_vf_resource_shift(pdev, -*iov->pe_num_map);
+   pnv_pci_vf_resource_shift(pdev, -base_pe);
 
/* Release M64 windows */
pnv_pci_vf_release_m64(pdev, num_vfs);
-
-   /* Release PE numbers */
-   if (iov->m64_single_mode) {
-   for (i = 0; i < num_vfs; i++) {
-   if (iov->pe_num_map[i] == IODA_INVALID_PE)
-   continue;
-
-   pe = &phb->ioda.pe_array[iov->pe_num_map[i]];
-   pnv_ioda_free_pe(pe);
-   }
-   } else
-   bitmap_clear(phb->ioda.pe_alloc, *iov->pe_num_map, 
num_vfs);
-   /* Releasing pe_num_map */
-   kfree(iov->pe_num_map);
}
 }
 
@@ -656,13 +644,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 
num_vfs)
int vf_bus = pci_iov_virtfn_bus(pdev, vf_index);
struct pci_dn *vf_pdn;
 
-   if (iov->m64_single_mode)
-   pe_num = iov->pe_num_map[vf_index];
-   else
-   pe_num = *iov->pe_num_map + vf_index;
-
-   pe = &phb->ioda.pe_array[pe_num];
-   pe->pe_number = pe_num;
+   pe = &iov->vf_pe_arr[vf_index];
pe->phb = phb;
pe->flags = PNV_IODA_PE_VF;
pe->pbus = NULL;
@@ -670,6 +652,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 
num_vfs)
pe->mve_number = -1;
pe->rid = (vf_bus << 8) | vf_devfn;
 
+   pe_num = pe->pe_number;
pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
pci_domain_nr(pdev->bus), pdev->bus->number,
PCI_SLOT(vf_devfn), PCI_FUNC(vf_devfn), pe_num);
@@ -701,9 +684,9 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 
num_vfs)
 
 static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
 {
+   struct pnv_ioda_pe*base_pe;
struct pnv_iov_data   *iov;
struct pnv_phb*phb;
-   struct pnv_ioda_pe*pe;
intret;
u16i;
 
@@ -717,55 +700,14 @@ static int pnv_pci_sriov_enable(str

[PATCH 10/15] powerpc/powernv/pci: Refactor pnv_ioda_alloc_pe()

2020-07-09 Thread Oliver O'Halloran
Rework the PE allocation logic to allow allocating blocks of PEs rather
than one at a time. We'll use this to allocate contiguous blocks of PEs for
the SR-IOV VFs.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 41 ++-
 arch/powerpc/platforms/powernv/pci.h  |  2 +-
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 2d36a9ebf0e9..c9c25fb0783c 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -145,23 +145,45 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int 
pe_no)
return;
}
 
+   mutex_lock(&phb->ioda.pe_alloc_mutex);
if (test_and_set_bit(pe_no, phb->ioda.pe_alloc))
pr_debug("%s: PE %x was reserved on PHB#%x\n",
 __func__, pe_no, phb->hose->global_number);
+   mutex_unlock(&phb->ioda.pe_alloc_mutex);
 
pnv_ioda_init_pe(phb, pe_no);
 }
 
-struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
+struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count)
 {
-   long pe;
+   struct pnv_ioda_pe *ret = NULL;
+   int run = 0, pe, i;
 
+   mutex_lock(&phb->ioda.pe_alloc_mutex);
+
+   /* scan backwards for a run of @count cleared bits */
for (pe = phb->ioda.total_pe_num - 1; pe >= 0; pe--) {
-   if (!test_and_set_bit(pe, phb->ioda.pe_alloc))
-   return pnv_ioda_init_pe(phb, pe);
+   if (test_bit(pe, phb->ioda.pe_alloc)) {
+   run = 0;
+   continue;
+   }
+
+   run++;
+   if (run == count)
+   break;
}
+   if (run != count)
+   goto out;
 
-   return NULL;
+   for (i = pe; i < pe + count; i++) {
+   set_bit(i, phb->ioda.pe_alloc);
+   pnv_ioda_init_pe(phb, i);
+   }
+   ret = &phb->ioda.pe_array[pe];
+
+out:
+   mutex_unlock(&phb->ioda.pe_alloc_mutex);
+   return ret;
 }
 
 void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
@@ -173,7 +195,10 @@ void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
WARN_ON(pe->npucomp); /* NPUs for nvlink are not supposed to be freed */
kfree(pe->npucomp);
memset(pe, 0, sizeof(struct pnv_ioda_pe));
+
+   mutex_lock(&phb->ioda.pe_alloc_mutex);
clear_bit(pe_num, phb->ioda.pe_alloc);
+   mutex_unlock(&phb->ioda.pe_alloc_mutex);
 }
 
 /* The default M64 BAR is shared by all PEs */
@@ -976,7 +1001,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct 
pci_dev *dev)
if (pdn->pe_number != IODA_INVALID_PE)
return NULL;
 
-   pe = pnv_ioda_alloc_pe(phb);
+   pe = pnv_ioda_alloc_pe(phb, 1);
if (!pe) {
pr_warn("%s: Not enough PE# available, disabling device\n",
pci_name(dev));
@@ -1047,7 +1072,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct 
pci_bus *bus, bool all)
 
/* The PE number isn't pinned by M64 */
if (!pe)
-   pe = pnv_ioda_alloc_pe(phb);
+   pe = pnv_ioda_alloc_pe(phb, 1);
 
if (!pe) {
pr_warn("%s: Not enough PE# available for PCI bus %04x:%02x\n",
@@ -3065,7 +3090,7 @@ static void __init pnv_pci_init_ioda_phb(struct 
device_node *np,
pnv_ioda_reserve_pe(phb, phb->ioda.root_pe_idx);
} else {
/* otherwise just allocate one */
-   root_pe = pnv_ioda_alloc_pe(phb);
+   root_pe = pnv_ioda_alloc_pe(phb, 1);
phb->ioda.root_pe_idx = root_pe->pe_number;
}
 
diff --git a/arch/powerpc/platforms/powernv/pci.h 
b/arch/powerpc/platforms/powernv/pci.h
index 58c97e60c3db..b4c9bdba7217 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -223,7 +223,7 @@ int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct 
pnv_ioda_pe *pe);
 void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe);
 void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
 
-struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb);
+struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb, int count);
 void pnv_ioda_free_pe(struct pnv_ioda_pe *pe);
 
 #ifdef CONFIG_PCI_IOV
-- 
2.26.2



[PATCH 09/15] powerpc/powernv/sriov: Factor out M64 BAR setup

2020-07-09 Thread Oliver O'Halloran
The sequence required to use the single PE BAR mode is kinda janky and
requires a little explanation. The API was designed with P7-IOC style
windows where the setup process is something like:

1. Configure the window start / end address
2. Enable the window
3. Map the segments of each window to the PE

For Single PE BARs the process is:

1. Set the PE for segment zero on a disabled window
2. Set the range
3. Enable the window

Move the OPAL calls into their own helper functions where the quirks can be
contained.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 132 -
 1 file changed, 103 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index e4c65cb49757..d53a85ccb538 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -320,6 +320,102 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
u16 num_vfs)
return 0;
 }
 
+
+/*
+ * PHB3 and beyond support "accordion" windows. The window's address range
+ * is subdivided into phb->ioda.total_pe_num segments and there's a 1-1
+ * mapping between PEs and segments.
+ *
+ * They're called that because as the window size changes the segment sizes
+ * change with it. Sort of like an accordion, sort of.
+ */
+static int64_t pnv_ioda_map_m64_accordion(struct pnv_phb *phb,
+ int window_id,
+ resource_size_t start,
+ resource_size_t size)
+{
+   int64_t rc;
+
+   rc = opal_pci_set_phb_mem_window(phb->opal_id,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+start,
+0, /* unused */
+size);
+   if (rc)
+   goto out;
+
+   rc = opal_pci_phb_mmio_enable(phb->opal_id,
+ OPAL_M64_WINDOW_TYPE,
+ window_id,
+ OPAL_ENABLE_M64_SPLIT);
+out:
+   if (rc)
+   pr_err("Failed to map M64 window #%d: %lld\n", window_id, rc);
+
+   return rc;
+}
+
+static int64_t pnv_ioda_map_m64_single(struct pnv_phb *phb,
+  int pe_num,
+  int window_id,
+  resource_size_t start,
+  resource_size_t size)
+{
+   int64_t rc;
+
+   /*
+* The API for setting up m64 mmio windows seems to have been designed
+* with P7-IOC in mind. For that chip each M64 BAR (window) had a fixed
+* split of 8 equally sized segments each of which could individually
+* assigned to a PE.
+*
+* The problem with this is that the API doesn't have any way to
+* communicate the number of segments we want on a BAR. This wasn't
+* a problem for p7-ioc since you didn't have a choice, but the
+* single PE windows added in PHB3 don't map cleanly to this API.
+*
+* As a result we've got this slightly awkward process where we
+* call opal_pci_map_pe_mmio_window() to put the single in single
+* PE mode, and set the PE for the window before setting the address
+* bounds. We need to do it this way because the single PE windows
+* for PHB3 have different alignment requirements on PHB3.
+*/
+   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+pe_num,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+0);
+   if (rc)
+   goto out;
+
+   /*
+* NB: In single PE mode the window needs to be aligned to 32MB
+*/
+   rc = opal_pci_set_phb_mem_window(phb->opal_id,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+start,
+0, /* ignored by FW, m64 is 1-1 */
+size);
+   if (rc)
+   goto out;
+
+   /*
+* Now actually enable it. We specified the BAR should be in "non-split"
+* mode so FW will validate that the BAR is in single PE mode.
+*/
+   rc = opal_pci_phb_mmio_enable(phb->opal_id,
+ OPAL_M64_WINDOW_TYPE,
+ window_id,
+ OPAL_ENABLE_M64_NON_SPLIT);
+out:
+   if (rc)
+   pr_err("Error mapping single PE BAR\n");
+
+   return rc;
+}
+
 static int pnv_pci_vf_assign_m64(struct pci_dev *pdev

[PATCH 08/15] powerpc/powernv/sriov: Simplify used window tracking

2020-07-09 Thread Oliver O'Halloran
No need for the multi-dimensional arrays, just use a bitmap.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 48 +++---
 arch/powerpc/platforms/powernv/pci.h   |  7 +++-
 2 files changed, 20 insertions(+), 35 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 216ceeff69b0..e4c65cb49757 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -303,28 +303,20 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
u16 num_vfs)
 {
struct pnv_iov_data   *iov;
struct pnv_phb*phb;
-   inti, j;
-   intm64_bars;
+   int window_id;
 
phb = pci_bus_to_pnvhb(pdev->bus);
iov = pnv_iov_get(pdev);
 
-   if (iov->m64_single_mode)
-   m64_bars = num_vfs;
-   else
-   m64_bars = 1;
+   for_each_set_bit(window_id, iov->used_m64_bar_mask, 64) {
+   opal_pci_phb_mmio_enable(phb->opal_id,
+OPAL_M64_WINDOW_TYPE,
+window_id,
+0);
 
-   for (i = 0; i < PCI_SRIOV_NUM_BARS; i++)
-   for (j = 0; j < m64_bars; j++) {
-   if (iov->m64_map[j][i] == IODA_INVALID_M64)
-   continue;
-   opal_pci_phb_mmio_enable(phb->opal_id,
-   OPAL_M64_WINDOW_TYPE, iov->m64_map[j][i], 0);
-   clear_bit(iov->m64_map[j][i], &phb->ioda.m64_bar_alloc);
-   iov->m64_map[j][i] = IODA_INVALID_M64;
-   }
+   clear_bit(window_id, &phb->ioda.m64_bar_alloc);
+   }
 
-   kfree(iov->m64_map);
return 0;
 }
 
@@ -350,23 +342,14 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
else
m64_bars = 1;
 
-   iov->m64_map = kmalloc_array(m64_bars,
-sizeof(*iov->m64_map),
-GFP_KERNEL);
-   if (!iov->m64_map)
-   return -ENOMEM;
-   /* Initialize the m64_map to IODA_INVALID_M64 */
-   for (i = 0; i < m64_bars ; i++)
-   for (j = 0; j < PCI_SRIOV_NUM_BARS; j++)
-   iov->m64_map[i][j] = IODA_INVALID_M64;
-
-
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
if (!res->flags || !res->parent)
continue;
 
for (j = 0; j < m64_bars; j++) {
+
+   /* allocate a window ID for this BAR */
do {
win = 
find_next_zero_bit(&phb->ioda.m64_bar_alloc,
phb->ioda.m64_bar_idx + 1, 0);
@@ -374,8 +357,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 
num_vfs)
if (win >= phb->ioda.m64_bar_idx + 1)
goto m64_failed;
} while (test_and_set_bit(win, 
&phb->ioda.m64_bar_alloc));
-
-   iov->m64_map[j][i] = win;
+   set_bit(win, iov->used_m64_bar_mask);
 
if (iov->m64_single_mode) {
size = pci_iov_resource_size(pdev,
@@ -391,12 +373,12 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
pe_num = iov->pe_num_map[j];
rc = opal_pci_map_pe_mmio_window(phb->opal_id,
pe_num, OPAL_M64_WINDOW_TYPE,
-   iov->m64_map[j][i], 0);
+   win, 0);
}
 
rc = opal_pci_set_phb_mem_window(phb->opal_id,
 OPAL_M64_WINDOW_TYPE,
-iov->m64_map[j][i],
+win,
 start,
 0, /* unused */
 size);
@@ -410,10 +392,10 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
if (iov->m64_single_mode)
rc = opal_pci_phb_mmio_enable(phb->opal_id,
-OPAL_M64_WINDOW_TYPE, iov->m64_map[j][i], 
2);
+OPAL_M64_WINDOW_TYPE, win, 2);
else
rc = opal_pci_phb_mmio_enable(phb->opal_id,
-OPAL_M64_WINDOW_TYPE, iov->m64_map[j][i], 
1);
+   

[PATCH 07/15] powerpc/powernv/sriov: Rename truncate_iov

2020-07-09 Thread Oliver O'Halloran
This prevents SR-IOV being used by making the SR-IOV BAR resources
unallocatable. Rename it to reflect what it actually does.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index f4c74ab1284d..216ceeff69b0 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -155,7 +155,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
 
iov = kzalloc(sizeof(*iov), GFP_KERNEL);
if (!iov)
-   goto truncate_iov;
+   goto disable_iov;
pdev->dev.archdata.iov_data = iov;
 
total_vfs = pci_sriov_get_totalvfs(pdev);
@@ -170,7 +170,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
dev_warn(&pdev->dev, "Don't support SR-IOV with"
" non M64 VF BAR%d: %pR. \n",
 i, res);
-   goto truncate_iov;
+   goto disable_iov;
}
 
total_vf_bar_sz += pci_iov_resource_size(pdev,
@@ -209,7 +209,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
 * mode is 32MB.
 */
if (iov->m64_single_mode && (size < SZ_32M))
-   goto truncate_iov;
+   goto disable_iov;
+
dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
res->end = res->start + size * mul - 1;
dev_dbg(&pdev->dev, "   %pR\n", res);
@@ -220,8 +221,8 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev 
*pdev)
 
return;
 
-truncate_iov:
-   /* To save MMIO space, IOV BAR is truncated. */
+disable_iov:
+   /* Save ourselves some MMIO space by disabling the unusable BARs */
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &pdev->resource[i + PCI_IOV_RESOURCES];
res->flags = 0;
-- 
2.26.2



[PATCH 05/15] powerpc/powernv/sriov: Move SR-IOV into a seperate file

2020-07-09 Thread Oliver O'Halloran
pci-ioda.c is getting a bit unwieldy due to the amount of stuff jammed in
there. The SR-IOV support can be extracted easily enough and is mostly
standalone, so move it into a separate file.

This patch also moves the PowerNV SR-IOV specific fields out of pci_dn and
into a platform specific structure. I'm not sure how they ended up in there
in the first place, but leaking platform specifics into common code has
proven to be a terrible idea so far, so let's stop doing that.
proven to be a terrible idea so far so lets stop doing that.

Signed-off-by: Oliver O'Halloran 
---
The pci_dn change and the pci-sriov.c changes were originally separate patches.
I accidentally squashed them together while rebasing, and fixing that seemed
like more pain than it was worth. I kind of like it this way though since
they did cause a lot of churn on the same set of functions.

I'll split them up again if you really want (please don't want this).
---
 arch/powerpc/include/asm/device.h  |   3 +
 arch/powerpc/platforms/powernv/Makefile|   1 +
 arch/powerpc/platforms/powernv/pci-ioda.c  | 673 +
 arch/powerpc/platforms/powernv/pci-sriov.c | 642 
 arch/powerpc/platforms/powernv/pci.h   |  74 +++
 5 files changed, 738 insertions(+), 655 deletions(-)
 create mode 100644 arch/powerpc/platforms/powernv/pci-sriov.c

diff --git a/arch/powerpc/include/asm/device.h 
b/arch/powerpc/include/asm/device.h
index 266542769e4b..4d8934db7ef5 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -49,6 +49,9 @@ struct dev_archdata {
 #ifdef CONFIG_CXL_BASE
struct cxl_context  *cxl_ctx;
 #endif
+#ifdef CONFIG_PCI_IOV
+   void *iov_data;
+#endif
 };
 
 struct pdev_archdata {
diff --git a/arch/powerpc/platforms/powernv/Makefile 
b/arch/powerpc/platforms/powernv/Makefile
index fe3f0fb5aeca..2eb6ae150d1f 100644
--- a/arch/powerpc/platforms/powernv/Makefile
+++ b/arch/powerpc/platforms/powernv/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_FA_DUMP) += opal-fadump.o
 obj-$(CONFIG_PRESERVE_FA_DUMP) += opal-fadump.o
 obj-$(CONFIG_OPAL_CORE)+= opal-core.o
 obj-$(CONFIG_PCI)  += pci.o pci-ioda.o npu-dma.o pci-ioda-tce.o
+obj-$(CONFIG_PCI_IOV)   += pci-sriov.o
 obj-$(CONFIG_CXL_BASE) += pci-cxl.o
 obj-$(CONFIG_EEH)  += eeh-powernv.o
 obj-$(CONFIG_MEMORY_FAILURE)   += opal-memory-errors.o
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 8fb17676d914..2d36a9ebf0e9 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -115,26 +115,6 @@ static int __init pci_reset_phbs_setup(char *str)
 
 early_param("ppc_pci_reset_phbs", pci_reset_phbs_setup);
 
-static inline bool pnv_pci_is_m64(struct pnv_phb *phb, struct resource *r)
-{
-   /*
-* WARNING: We cannot rely on the resource flags. The Linux PCI
-* allocation code sometimes decides to put a 64-bit prefetchable
-* BAR in the 32-bit window, so we have to compare the addresses.
-*
-* For simplicity we only test resource start.
-*/
-   return (r->start >= phb->ioda.m64_base &&
-   r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
-}
-
-static inline bool pnv_pci_is_m64_flags(unsigned long resource_flags)
-{
-   unsigned long flags = (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH);
-
-   return (resource_flags & flags) == flags;
-}
-
 static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
 {
s64 rc;
@@ -172,7 +152,7 @@ static void pnv_ioda_reserve_pe(struct pnv_phb *phb, int 
pe_no)
pnv_ioda_init_pe(phb, pe_no);
 }
 
-static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
+struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb *phb)
 {
long pe;
 
@@ -184,7 +164,7 @@ static struct pnv_ioda_pe *pnv_ioda_alloc_pe(struct pnv_phb 
*phb)
return NULL;
 }
 
-static void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
+void pnv_ioda_free_pe(struct pnv_ioda_pe *pe)
 {
struct pnv_phb *phb = pe->phb;
unsigned int pe_num = pe->pe_number;
@@ -816,7 +796,7 @@ static void pnv_ioda_unset_peltv(struct pnv_phb *phb,
pe_warn(pe, "OPAL error %lld remove self from PELTV\n", rc);
 }
 
-static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
struct pci_dev *parent;
uint8_t bcomp, dcomp, fcomp;
@@ -887,7 +867,7 @@ static int pnv_ioda_deconfigure_pe(struct pnv_phb *phb, 
struct pnv_ioda_pe *pe)
return 0;
 }
 
-static int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
+int pnv_ioda_configure_pe(struct pnv_phb *phb, struct pnv_ioda_pe *pe)
 {
struct pci_dev *parent;
uint8_t bcomp, dcomp, fcomp;
@@ -982,91 +962,6 @@ static int pnv_ioda_configure_pe(struct pnv_phb *phb, 
struct pnv_ioda_pe *pe)
return 0;
 }
 
-#ifdef CONFIG_PCI_IOV
-static int pnv_pci_vf_resource_s

[PATCH 06/15] powerpc/powernv/sriov: Explain how SR-IOV works on PowerNV

2020-07-09 Thread Oliver O'Halloran
SR-IOV support on PowerNV is a byzantine maze of hooks. I have no idea
how anyone is supposed to know how it works except through a lot of
suffering. Write up some docs about the overall story to help out
the next sucker^Wperson who needs to tinker with it.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-sriov.c | 130 +
 1 file changed, 130 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c 
b/arch/powerpc/platforms/powernv/pci-sriov.c
index 080ea39f5a83..f4c74ab1284d 100644
--- a/arch/powerpc/platforms/powernv/pci-sriov.c
+++ b/arch/powerpc/platforms/powernv/pci-sriov.c
@@ -12,6 +12,136 @@
 /* for pci_dev_is_added() */
 #include "../../../../drivers/pci/pci.h"
 
+/*
+ * The majority of the complexity in supporting SR-IOV on PowerNV comes from
+ * the need to put the MMIO space for each VF into a separate PE. Internally
+ * the PHB maps MMIO addresses to a specific PE using the "Memory BAR Table".
+ * The MBT historically only applied to the 64bit MMIO window of the PHB
+ * so it's common to see it referred to as the "M64BT".
+ *
+ * An MBT entry stores the mapped range as a <base>, <mask> pair. This forces
+ * the address range that we want to map to be power-of-two sized and aligned.
+ * For conventional PCI devices this isn't really an issue since PCI device 
BARs
+ * have the same requirement.
+ *
+ * For a SR-IOV BAR things are a little more awkward since size and alignment
+ * are not coupled. The alignment is set based on the per-VF BAR size, but
+ * the total BAR area is: number-of-vfs * per-vf-size. The number of VFs
+ * isn't necessarily a power of two, so neither is the total size. To fix that
+ * we need to finesse (read: hack) the Linux BAR allocator so that it will
+ * allocate the SR-IOV BARs in a way that lets us map them using the MBT.
+ *
+ * The changes to size and alignment that we need to do depend on the "mode"
+ * of MBT entry that we use. We only support SR-IOV on PHB3 (IODA2) and above,
+ * so as a baseline we can assume that we have the following BAR modes
+ * available:
+ *
+ *   NB: $PE_COUNT is the number of PEs that the PHB supports.
+ *
+ * a) A segmented BAR that splits the mapped range into $PE_COUNT equally sized
+ *segments. The n'th segment is mapped to the n'th PE.
+ * b) An un-segmented BAR that maps the whole address range to a specific PE.
+ *
+ *
+ * We prefer to use mode a) since it only requires one MBT entry per SR-IOV BAR
+ * For comparison b) requires one entry per-VF per-BAR, or:
+ * (num-vfs * num-sriov-bars) in total. To use a) we need the size of each 
segment
+ * to equal the size of the per-VF BAR area. So:
+ *
+ * new_size = per-vf-size * number-of-PEs
+ *
+ * The alignment for the SR-IOV BAR also needs to be changed from per-vf-size
+ * to "new_size", calculated above. Implementing this is a convoluted process
+ * which requires several hooks in the PCI core:
+ *
+ * 1. In pcibios_add_device() we call pnv_pci_ioda_fixup_iov().
+ *
+ *At this point the device has been probed and the device's BARs are sized,
+ *but no resource allocations have been done. The SR-IOV BARs are sized
+ *based on the maximum number of VFs supported by the device and we need
+ *to increase that to new_size.
+ *
+ * 2. Later, when Linux actually assigns resources it tries to make the 
resource
+ *allocations for each PCI bus as compact as possible. As a part of that it
+ *sorts the BARs on a bus by their required alignment, which is calculated
+ *using pci_resource_alignment().
+ *
+ *For IOV resources this goes:
+ *pci_resource_alignment()
+ *pci_sriov_resource_alignment()
+ *pcibios_sriov_resource_alignment()
+ *pnv_pci_iov_resource_alignment()
+ *
+ *Our hook overrides the default alignment, equal to the per-vf-size, with
+ *new_size computed above.
+ *
+ * 3. When userspace enables VFs for a device:
+ *
+ *sriov_enable()
+ *   pcibios_sriov_enable()
+ *   pnv_pcibios_sriov_enable()
+ *
+ *This is where we actually allocate PE numbers for each VF and setup the
+ *MBT mapping for each SR-IOV BAR. In steps 1) and 2) we setup an "arena"
+ *where each MBT segment is equal in size to the VF BAR so we can shift
+ *around the actual SR-IOV BAR location within this arena. We need this
+ *ability because the PE space is shared by all devices on the same PHB.
+ *When using mode a) described above, segment 0 maps to PE#0, which might
+ *be already being used by another device on the PHB.
+ *
+ *As a result we need to allocate a contiguous range of PE numbers, then shift
+ *the address programmed into the SR-IOV BAR of the PF so that the address
+ *of VF0 matches up with the segment corresponding to the first allocated
+ *PE number. This is handled in pnv_pci_vf_resource_shift().
+ *
+ *Once all that is done we return to the PCI core which then enables VFs,
+ *scans them and cr

[PATCH 04/15] powerpc/powernv/pci: Initialise M64 for IODA1 as a 1-1 window

2020-07-09 Thread Oliver O'Halloran
We pre-configure the m64 window for IODA1 as a 1-1 segment-PE mapping,
similar to PHB3. Currently the actual mapping of segments occurs in
pnv_ioda_pick_m64_pe(), but we can move it into pnv_ioda1_init_m64() and
drop the IODA1 specific code paths in the PE setup / teardown.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 55 +++
 1 file changed, 25 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index bb9c1cc60c33..8fb17676d914 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -311,6 +311,28 @@ static int pnv_ioda1_init_m64(struct pnv_phb *phb)
}
}
 
+   for (index = 0; index < phb->ioda.total_pe_num; index++) {
+   int64_t rc;
+
+   /*
+* P7IOC supports M64DT, which helps mapping M64 segment
+* to one particular PE#. However, PHB3 has fixed mapping
+* between M64 segment and PE#. In order to have same logic
+* for P7IOC and PHB3, we enforce fixed mapping between M64
+* segment and PE# on P7IOC.
+*/
+   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+   index, OPAL_M64_WINDOW_TYPE,
+   index / PNV_IODA1_M64_SEGS,
+   index % PNV_IODA1_M64_SEGS);
+   if (rc != OPAL_SUCCESS) {
+   pr_warn("%s: Error %lld mapping M64 for PHB#%x-PE#%x\n",
+   __func__, rc, phb->hose->global_number,
+   index);
+   goto fail;
+   }
+   }
+
/*
 * Exclude the segments for reserved and root bus PE, which
 * are first or last two PEs.
@@ -402,26 +424,6 @@ static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct 
pci_bus *bus, bool all)
pe->master = master_pe;
list_add_tail(&pe->list, &master_pe->slaves);
}
-
-   /*
-* P7IOC supports M64DT, which helps mapping M64 segment
-* to one particular PE#. However, PHB3 has fixed mapping
-* between M64 segment and PE#. In order to have same logic
-* for P7IOC and PHB3, we enforce fixed mapping between M64
-* segment and PE# on P7IOC.
-*/
-   if (phb->type == PNV_PHB_IODA1) {
-   int64_t rc;
-
-   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-   pe->pe_number, OPAL_M64_WINDOW_TYPE,
-   pe->pe_number / PNV_IODA1_M64_SEGS,
-   pe->pe_number % PNV_IODA1_M64_SEGS);
-   if (rc != OPAL_SUCCESS)
-   pr_warn("%s: Error %lld mapping M64 for 
PHB#%x-PE#%x\n",
-   __func__, rc, phb->hose->global_number,
-   pe->pe_number);
-   }
}
 
kfree(pe_alloc);
@@ -3354,14 +3356,8 @@ static void pnv_ioda_free_pe_seg(struct pnv_ioda_pe *pe,
if (map[idx] != pe->pe_number)
continue;
 
-   if (win == OPAL_M64_WINDOW_TYPE)
-   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-   phb->ioda.reserved_pe_idx, win,
-   idx / PNV_IODA1_M64_SEGS,
-   idx % PNV_IODA1_M64_SEGS);
-   else
-   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
-   phb->ioda.reserved_pe_idx, win, 0, idx);
+   rc = opal_pci_map_pe_mmio_window(phb->opal_id,
+   phb->ioda.reserved_pe_idx, win, 0, idx);
 
if (rc != OPAL_SUCCESS)
pe_warn(pe, "Error %lld unmapping (%d) segment#%d\n",
@@ -3380,8 +3376,7 @@ static void pnv_ioda_release_pe_seg(struct pnv_ioda_pe 
*pe)
 phb->ioda.io_segmap);
pnv_ioda_free_pe_seg(pe, OPAL_M32_WINDOW_TYPE,
 phb->ioda.m32_segmap);
-   pnv_ioda_free_pe_seg(pe, OPAL_M64_WINDOW_TYPE,
-phb->ioda.m64_segmap);
+   /* M64 is pre-configured by pnv_ioda1_init_m64() */
} else if (phb->type == PNV_PHB_IODA2) {
pnv_ioda_free_pe_seg(pe, OPAL_M32_WINDOW_TYPE,
 phb->ioda.m32_segmap);
-- 
2.26.2



[PATCH 03/15] powerpc/powernv/pci: Add explicit tracking of the DMA setup state

2020-07-09 Thread Oliver O'Halloran
There's an optimisation in the PE setup which skips performing DMA
setup for a PE if we only have bridges in a PE. The assumption being
that only "real" devices will DMA to system memory, which is probably
fair. However, if we start off with only bridge devices in a PE and then
add a non-bridge device, the new device won't be able to use DMA because
we never configured it.

Fix this (admittedly pretty weird) edge case by tracking whether we've done
the DMA setup for the PE or not. If a non-bridge device is added to the PE
(via rescan or hotplug, or whatever) we can set up DMA on demand.

This also means the only remaining user of the old "DMA Weight" code is
the IODA1 DMA setup code that it was originally added for, which is good.

Cc: Alexey Kardashevskiy 
Signed-off-by: Oliver O'Halloran 
---
Alexey, do we need to have the IOMMU API stuff set/clear this flag?
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 48 ++-
 arch/powerpc/platforms/powernv/pci.h  |  7 
 2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index bfb40607aa0e..bb9c1cc60c33 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -141,6 +141,7 @@ static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb 
*phb, int pe_no)
 
phb->ioda.pe_array[pe_no].phb = phb;
phb->ioda.pe_array[pe_no].pe_number = pe_no;
+   phb->ioda.pe_array[pe_no].dma_setup_done = false;
 
/*
 * Clear the PE frozen state as it might be put into frozen state
@@ -1685,6 +1686,12 @@ static int pnv_pcibios_sriov_enable(struct pci_dev 
*pdev, u16 num_vfs)
 }
 #endif /* CONFIG_PCI_IOV */
 
+static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
+  struct pnv_ioda_pe *pe);
+
+static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
+  struct pnv_ioda_pe *pe);
+
 static void pnv_pci_ioda_dma_dev_setup(struct pci_dev *pdev)
 {
struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
@@ -1713,6 +1720,24 @@ static void pnv_pci_ioda_dma_dev_setup(struct pci_dev 
*pdev)
pci_info(pdev, "Added to existing PE#%x\n", pe->pe_number);
}
 
+   /*
+* We assume that bridges *probably* don't need to do any DMA so we can
+* skip allocating a TCE table, etc unless we get a non-bridge device.
+*/
+   if (!pe->dma_setup_done && !pci_is_bridge(pdev)) {
+   switch (phb->type) {
+   case PNV_PHB_IODA1:
+   pnv_pci_ioda1_setup_dma_pe(phb, pe);
+   break;
+   case PNV_PHB_IODA2:
+   pnv_pci_ioda2_setup_dma_pe(phb, pe);
+   break;
+   default:
+   pr_warn("%s: No DMA for PHB#%x (type %d)\n",
+   __func__, phb->hose->global_number, phb->type);
+   }
+   }
+
if (pdn)
pdn->pe_number = pe->pe_number;
pe->device_count++;
@@ -,6 +2247,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb 
*phb,
pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
iommu_init_table(tbl, phb->hose->node, 0, 0);
 
+   pe->dma_setup_done = true;
return;
  fail:
/* XXX Failure: Try to fallback to 64-bit only ? */
@@ -2536,9 +2562,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
 {
int64_t rc;
 
-   if (!pnv_pci_ioda_pe_dma_weight(pe))
-   return;
-
/* TVE #1 is selected by PCI address bit 59 */
pe->tce_bypass_base = 1ull << 59;
 
@@ -2563,6 +2586,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb 
*phb,
iommu_register_group(&pe->table_group, phb->hose->global_number,
 pe->pe_number);
 #endif
+   pe->dma_setup_done = true;
 }
 
 int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
@@ -3136,7 +3160,6 @@ static void pnv_pci_fixup_bridge_resources(struct pci_bus 
*bus,
 
 static void pnv_pci_configure_bus(struct pci_bus *bus)
 {
-   struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
struct pci_dev *bridge = bus->self;
struct pnv_ioda_pe *pe;
bool all = (bridge && pci_pcie_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE);
@@ -3160,17 +3183,6 @@ static void pnv_pci_configure_bus(struct pci_bus *bus)
return;
 
pnv_ioda_setup_pe_seg(pe);
-   switch (phb->type) {
-   case PNV_PHB_IODA1:
-   pnv_pci_ioda1_setup_dma_pe(phb, pe);
-   break;
-   case PNV_PHB_IODA2:
-   pnv_pci_ioda2_setup_dma_pe(phb, pe);
-   break;
-   default:
-   pr_warn("%s: No DMA for PHB#%x (type %d)\n",
-   __func__, phb->hose->global_number, phb->type);
-   }
 }
 
 static resource_size_t pnv_

[PATCH 02/15] powerpc/powernv/pci: Always tear down DMA windows on PE release

2020-07-09 Thread Oliver O'Halloran
Currently we have these two functions:

pnv_pci_ioda2_release_dma_pe(), and
pnv_pci_ioda2_release_pe_dma()

The first is used when tearing down VF PEs and the other is used for normal
devices. There's very little difference between the two though. The latter
(non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless
CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the
two together.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++
 1 file changed, 3 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 687919db0347..bfb40607aa0e 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -1422,26 +1422,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
return -EBUSY;
 }
 
-static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
-   int num);
-
-static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct 
pnv_ioda_pe *pe)
-{
-   struct iommu_table*tbl;
-   int64_t   rc;
-
-   tbl = pe->table_group.tables[0];
-   rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
-   if (rc)
-   pe_warn(pe, "OPAL error %lld release DMA window\n", rc);
-
-   pnv_pci_ioda2_set_bypass(pe, false);
-   if (pe->table_group.group) {
-   iommu_group_put(pe->table_group.group);
-   BUG_ON(pe->table_group.group);
-   }
-   iommu_tce_table_put(tbl);
-}
+static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
 
 static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
 {
@@ -1455,11 +1436,12 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
if (!pdev->is_physfn)
return;
 
+   /* FIXME: Use pnv_ioda_release_pe()? */
list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
if (pe->parent_dev != pdev)
continue;
 
-   pnv_pci_ioda2_release_dma_pe(pdev, pe);
+   pnv_pci_ioda2_release_pe_dma(pe);
 
/* Remove from list */
mutex_lock(&phb->ioda.pe_list_mutex);
@@ -2429,7 +2411,6 @@ static long pnv_pci_ioda2_setup_default_config(struct 
pnv_ioda_pe *pe)
return 0;
 }
 
-#if defined(CONFIG_IOMMU_API) || defined(CONFIG_PCI_IOV)
 static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
int num)
 {
@@ -2453,7 +2434,6 @@ static long pnv_pci_ioda2_unset_window(struct 
iommu_table_group *table_group,
 
return ret;
 }
-#endif
 
 #ifdef CONFIG_IOMMU_API
 unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
@@ -3334,18 +3314,14 @@ static void pnv_pci_ioda2_release_pe_dma(struct 
pnv_ioda_pe *pe)
 {
struct iommu_table *tbl = pe->table_group.tables[0];
unsigned int weight = pnv_pci_ioda_pe_dma_weight(pe);
-#ifdef CONFIG_IOMMU_API
int64_t rc;
-#endif
 
if (!weight)
return;
 
-#ifdef CONFIG_IOMMU_API
rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
if (rc)
pe_warn(pe, "OPAL error %lld release DMA window\n", rc);
-#endif
 
pnv_pci_ioda2_set_bypass(pe, false);
if (pe->table_group.group) {
-- 
2.26.2



[PATCH 01/15] powernv/pci: Add pci_bus_to_pnvhb() helper

2020-07-09 Thread Oliver O'Halloran
Add a helper to go from a pci_bus structure to the pnv_phb that hosts that
bus. There are a lot of instances of the following pattern:

struct pci_controller *hose = pci_bus_to_host(pdev->bus);
struct pnv_phb *phb = hose->private_data;

with no other uses of the pci_controller inside the function. This is
hard to read, since it requires you to memorise the contents of the
private data fields, and somewhat error prone, since it involves blindly
assigning a void pointer. Add a helper to make it more concise and
explicit.
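
The helper itself goes into arch/powerpc/platforms/powernv/pci.h (that hunk is
not visible in the truncated excerpt below). A minimal sketch of what it
presumably looks like, assuming it simply wraps the existing two-step lookup:

	static inline struct pnv_phb *pci_bus_to_pnvhb(struct pci_bus *bus)
	{
		struct pci_controller *hose = pci_bus_to_host(bus);

		/* private_data is set to the pnv_phb when the PHB is registered */
		if (hose)
			return hose->private_data;

		return NULL;
	}

With this in place, each caller collapses the two locals into a single
assignment, as the hunks below show.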

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 88 +++
 arch/powerpc/platforms/powernv/pci.c  | 14 ++--
 arch/powerpc/platforms/powernv/pci.h  | 10 +++
 3 files changed, 38 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 31c3e6d58c41..687919db0347 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -252,8 +252,7 @@ static int pnv_ioda2_init_m64(struct pnv_phb *phb)
 static void pnv_ioda_reserve_dev_m64_pe(struct pci_dev *pdev,
 unsigned long *pe_bitmap)
 {
-   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
struct resource *r;
resource_size_t base, sgsz, start, end;
int segno, i;
@@ -351,8 +350,7 @@ static void pnv_ioda_reserve_m64_pe(struct pci_bus *bus,
 
 static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pci_bus *bus, bool all)
 {
-   struct pci_controller *hose = pci_bus_to_host(bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
struct pnv_ioda_pe *master_pe, *pe;
unsigned long size, *pe_alloc;
int i;
@@ -673,8 +671,7 @@ struct pnv_ioda_pe *pnv_pci_bdfn_to_pe(struct pnv_phb *phb, 
u16 bdfn)
 
 struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
 {
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
struct pci_dn *pdn = pci_get_pdn(dev);
 
if (!pdn)
@@ -1069,8 +1066,7 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, 
int offset)
 
 static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
 {
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
struct pci_dn *pdn = pci_get_pdn(dev);
struct pnv_ioda_pe *pe;
 
@@ -1129,8 +1125,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct 
pci_dev *dev)
  */
 static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, bool all)
 {
-   struct pci_controller *hose = pci_bus_to_host(bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
struct pnv_ioda_pe *pe = NULL;
unsigned int pe_num;
 
@@ -1196,8 +1191,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_npu_PE(struct 
pci_dev *npu_pdev)
struct pnv_ioda_pe *pe;
struct pci_dev *gpu_pdev;
struct pci_dn *npu_pdn;
-   struct pci_controller *hose = pci_bus_to_host(npu_pdev->bus);
-   struct pnv_phb *phb = hose->private_data;
+   struct pnv_phb *phb = pci_bus_to_pnvhb(npu_pdev->bus);
 
/*
 * Intentionally leak a reference on the npu device (for
@@ -1300,16 +1294,12 @@ static void pnv_pci_ioda_setup_nvlink(void)
 #ifdef CONFIG_PCI_IOV
 static int pnv_pci_vf_release_m64(struct pci_dev *pdev, u16 num_vfs)
 {
-   struct pci_bus*bus;
-   struct pci_controller *hose;
struct pnv_phb*phb;
struct pci_dn *pdn;
inti, j;
intm64_bars;
 
-   bus = pdev->bus;
-   hose = pci_bus_to_host(bus);
-   phb = hose->private_data;
+   phb = pci_bus_to_pnvhb(pdev->bus);
pdn = pci_get_pdn(pdev);
 
if (pdn->m64_single_mode)
@@ -1333,8 +1323,6 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, 
u16 num_vfs)
 
 static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
 {
-   struct pci_bus*bus;
-   struct pci_controller *hose;
struct pnv_phb*phb;
struct pci_dn *pdn;
unsigned int   win;
@@ -1346,9 +1334,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, 
u16 num_vfs)
intpe_num;
intm64_bars;
 
-   bus = pdev->bus;
-   hose = pci_bus_to_host(bus);
-   phb = hose->private_data;
+   phb = pci_bus_to_pnvhb(pdev->bus);
pdn = pci_get_pdn(pdev);
total_vfs = pci_sriov_get_totalvfs(pdev);
 
@@ -1459,15 +1445,11 @@ s

PowerNV PCI & SR-IOV cleanups

2020-07-09 Thread Oliver O'Halloran
Finally bit the bullet and learned how all the MMIO->PE mapping setup
actually works. As a side effect I found a bunch of oddities in how
PowerNV SR-IOV support is implemented. This series mostly sorts that
out with a few more generic cleanups along the way.

This is largely prep work for supporting VFs in the 32bit MMIO window.
This is an unfortunate necessity due to how the Linux BAR allocator
handles BARs marked as non-prefetchable. The distinction
between prefetchable and non-prefetchable BARs was made largely irrelevant
with the introduction of PCIe, but the BAR allocator is overly
conservative. It will always place non-prefetchable BARs in the
non-prefetchable window, which is 32bit only. This results in us being unable to use VFs
from NVMe drives and a few different RAID cards.

This series is based on top of these two:

https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187630
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187688

Rebases cleanly on top of the first, but I haven't tested that one plus
this extensively.

Oliver




[PATCH v2 1/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above

2020-07-09 Thread Pratik Rajesh Sampat
From POWER9 onwards, support for the HID1, HID4 and HID5 registers has been
removed. Although mfspr on these registers still worked on POWER9, the
POWER10 simulator does not recognise them. Move their reads under the
existing check for machines below POWER9.

Signed-off-by: Pratik Rajesh Sampat 
Reviewed-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/idle.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 2dd467383a88..19d94d021357 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
 */
uint64_t lpcr_val   = mfspr(SPRN_LPCR);
uint64_t hid0_val   = mfspr(SPRN_HID0);
-   uint64_t hid1_val   = mfspr(SPRN_HID1);
-   uint64_t hid4_val   = mfspr(SPRN_HID4);
-   uint64_t hid5_val   = mfspr(SPRN_HID5);
uint64_t hmeer_val  = mfspr(SPRN_HMEER);
uint64_t msr_val = MSR_IDLE;
uint64_t psscr_val = pnv_deepest_stop_psscr_val;
@@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
 
/* Only p8 needs to set extra HID regiters */
if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   uint64_t hid1_val = mfspr(SPRN_HID1);
+   uint64_t hid4_val = mfspr(SPRN_HID4);
+   uint64_t hid5_val = mfspr(SPRN_HID5);
 
rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
if (rc != 0)
-- 
2.25.4



[PATCH v2 3/3] powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

2020-07-09 Thread Pratik Rajesh Sampat
Rename the variable "pnv_first_spr_loss_level" to
"pnv_first_fullstate_loss_level".

pnv_first_spr_loss_level is meant to be the earliest stop state that has
OPAL_PM_LOSE_FULL_CONTEXT set; since shallower states also lose SPR values,
the old name was misleading.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index f2e2a6a4c274..d54e7ef234e3 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -48,7 +48,7 @@ static bool default_stop_found;
  * First stop state levels when SPR and TB loss can occur.
  */
 static u64 pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-static u64 pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+static u64 pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
 
 /*
  * psscr value and mask of the deepest stop idle state.
@@ -659,7 +659,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
  */
mmcr0   = mfspr(SPRN_MMCR0);
}
-   if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level) {
+   if ((psscr & PSSCR_RL_MASK) >= pnv_first_fullstate_loss_level) {
sprs.lpcr   = mfspr(SPRN_LPCR);
sprs.hfscr  = mfspr(SPRN_HFSCR);
sprs.fscr   = mfspr(SPRN_FSCR);
@@ -751,7 +751,7 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
 * just always test PSSCR for SPR/TB state loss.
 */
pls = (psscr & PSSCR_PLS) >> PSSCR_PLS_SHIFT;
-   if (likely(pls < pnv_first_spr_loss_level)) {
+   if (likely(pls < pnv_first_fullstate_loss_level)) {
if (sprs_saved)
atomic_stop_thread_idle();
goto out;
@@ -1098,7 +1098,7 @@ static void __init pnv_power9_idle_init(void)
 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
 */
pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
-   pnv_first_spr_loss_level = MAX_STOP_STATE + 1;
+   pnv_first_fullstate_loss_level = MAX_STOP_STATE + 1;
for (i = 0; i < nr_pnv_idle_states; i++) {
int err;
struct pnv_idle_states_t *state = &pnv_idle_states[i];
@@ -1109,8 +1109,8 @@ static void __init pnv_power9_idle_init(void)
pnv_first_tb_loss_level = psscr_rl;
 
if ((state->flags & OPAL_PM_LOSE_FULL_CONTEXT) &&
-(pnv_first_spr_loss_level > psscr_rl))
-   pnv_first_spr_loss_level = psscr_rl;
+(pnv_first_fullstate_loss_level > psscr_rl))
+   pnv_first_fullstate_loss_level = psscr_rl;
 
/*
 * The idle code does not deal with TB loss occurring
@@ -1121,8 +1121,8 @@ static void __init pnv_power9_idle_init(void)
 * compatibility.
 */
if ((state->flags & OPAL_PM_TIMEBASE_STOP) &&
-(pnv_first_spr_loss_level > psscr_rl))
-   pnv_first_spr_loss_level = psscr_rl;
+(pnv_first_fullstate_loss_level > psscr_rl))
+   pnv_first_fullstate_loss_level = psscr_rl;
 
err = validate_psscr_val_mask(&state->psscr_val,
  &state->psscr_mask,
@@ -1168,7 +1168,7 @@ static void __init pnv_power9_idle_init(void)
}
 
pr_info("cpuidle-powernv: First stop level that may lose SPRs = 
0x%llx\n",
-   pnv_first_spr_loss_level);
+   pnv_first_fullstate_loss_level);
 
pr_info("cpuidle-powernv: First stop level that may lose timebase = 
0x%llx\n",
pnv_first_tb_loss_level);
-- 
2.25.4



[PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0, DAWRX0 for P10

2020-07-09 Thread Pratik Rajesh Sampat
The additional registers DAWR0 and DAWRX0 may be lost on POWER10 for
stop levels < 4. Therefore save the values of these SPRs before entering
a "stop" state and restore them on wakeup.

Signed-off-by: Pratik Rajesh Sampat 
---
 arch/powerpc/platforms/powernv/idle.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 19d94d021357..f2e2a6a4c274 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -600,6 +600,8 @@ struct p9_sprs {
u64 iamr;
u64 amor;
u64 uamor;
+   u64 dawr0;
+   u64 dawrx0;
 };
 
 static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
@@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
sprs.iamr   = mfspr(SPRN_IAMR);
sprs.amor   = mfspr(SPRN_AMOR);
sprs.uamor  = mfspr(SPRN_UAMOR);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   sprs.dawr0 = mfspr(SPRN_DAWR0);
+   sprs.dawrx0 = mfspr(SPRN_DAWRX0);
+   }
 
srr1 = isa300_idle_stop_mayloss(psscr); /* go idle */
 
@@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
mtspr(SPRN_IAMR,sprs.iamr);
mtspr(SPRN_AMOR,sprs.amor);
mtspr(SPRN_UAMOR,   sprs.uamor);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_DAWR0, sprs.dawr0);
+   mtspr(SPRN_DAWRX0, sprs.dawrx0);
+   }
 
/*
 * Workaround for POWER9 DD2.0, if we lost resources, the ERAT
-- 
2.25.4



[PATCH v2 0/3] Power10 basic energy management

2020-07-09 Thread Pratik Rajesh Sampat
Changelog v1 --> v2:
1. Save-restore DAWR and DAWRX unconditionally as they are lost in
shallow idle states too
2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to
correct naming terminology

Pratik Rajesh Sampat (3):
  powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
  powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
  powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable

 arch/powerpc/platforms/powernv/idle.c | 34 +--
 1 file changed, 22 insertions(+), 12 deletions(-)

-- 
2.25.4



[RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option

2020-07-09 Thread Nicholas Piggin
On big systems, the mm refcount can become highly contended when doing
a lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme
that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
any remaining lazy ones.

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig | 16 
 arch/powerpc/Kconfig |  1 +
 include/linux/sched/mm.h |  6 +++---
 kernel/fork.c| 39 +++
 4 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 2daf8fe6146a..edf69437a971 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -418,6 +418,22 @@ config MMU_LAZY_TLB
help
  Enable "lazy TLB" mmu context switching for kernel threads.
 
+config MMU_LAZY_TLB_REFCOUNT
+   def_bool y
+   depends on MMU_LAZY_TLB
+   depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+config MMU_LAZY_TLB_SHOOTDOWN
+   bool
+   depends on MMU_LAZY_TLB
+   help
+ Instead of refcounting the "lazy tlb" mm struct, which can cause
+ contention with multi-threaded apps on large multiprocessor systems,
+ this option causes __mmdrop to IPI all CPUs in the mm_cpumask and
+ switch to init_mm if they were using the to-be-freed mm as the lazy
+ tlb. Architectures which do not track all possible lazy tlb CPUs in
+ mm_cpumask can not use this (without modification).
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 920c4e3ca4ef..24ac85c868db 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -225,6 +225,7 @@ config PPC
select HAVE_PERF_USER_STACK_DUMP
select MMU_GATHER_RCU_TABLE_FREE
select MMU_GATHER_PAGE_SIZE
+   select MMU_LAZY_TLB_SHOOTDOWN
select HAVE_REGS_AND_STACK_ACCESS_API
select HAVE_RELIABLE_STACKTRACE if PPC_BOOK3S_64 && 
CPU_LITTLE_ENDIAN
select HAVE_SYSCALL_TRACEPOINTS
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2c2b20e2ccc7..1067af8039bd 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -53,19 +53,19 @@ void mmdrop(struct mm_struct *mm);
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmdrop(mm);
 }
 
 static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm)
 {
-   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
mmdrop(mm); /* This depends on mmdrop providing a full smp_mb() 
*/
else
smp_mb();
diff --git a/kernel/fork.c b/kernel/fork.c
index 142b23645d82..da0fba9e6079 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -685,6 +685,40 @@ static void check_mm(struct mm_struct *mm)
 #define allocate_mm()  (kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   if (current->active_mm == mm) {
+   BUG_ON(current->mm);
+   switch_mm(mm, &init_mm, current);
+   current->active_mm = &init_mm;
+   }
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+   struct mm_struct *mm = arg;
+
+   BUG_ON(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+   smp_call_function_many(mm_cpumask(mm), do_shoot_lazy_tlb, (void 
*)mm, 1);
+   do_shoot_lazy_tlb(mm);
+   }
+}
+
+static void check_lazy_tlbs(struct mm_struct *mm)
+{
+   if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+   smp_call_function(do_check_lazy_tlb, (void *)mm, 1);
+   do_check_lazy_tlb(mm);
+   }
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -695,6 +729,11 @@ void __mmdrop(struct mm_struct *mm)
BUG_ON(mm == &init_mm);
WARN_ON_ONCE(mm == current->mm);
WARN_ON_ONCE(mm == current->active_mm);
+
+   /* Ensure no CPUs are using this as their lazy tlb mm */
+   shoot_lazy_tlbs(mm);
+   check_lazy_tlbs(mm);
+
  

[RFC PATCH 6/7] lazy tlb: allow lazy tlb mm switching to be configurable

2020-07-09 Thread Nicholas Piggin
NOMMU systems could easily go without this and save a bit of code
and the mm refcounting, because their mm switch is a no-op. I haven't
flipped them over because I haven't audited all the arch code to convert
it over to using the _lazy_tlb refcounting.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig |  7 +
 include/linux/sched/mm.h | 12 ++---
 kernel/sched/core.c  | 55 +++-
 kernel/sched/sched.h |  4 ++-
 4 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..2daf8fe6146a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -411,6 +411,13 @@ config MMU_GATHER_NO_GATHER
bool
depends on MMU_GATHER_TABLE_FREE
 
+# Would like to make this depend on MMU, because there is little use for lazy 
mm switching
+# with NOMMU, but have to audit NOMMU architecture code first.
+config MMU_LAZY_TLB
+   def_bool y
+   help
+ Enable "lazy TLB" mmu context switching for kernel threads.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 110d4ad21de6..2c2b20e2ccc7 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -53,18 +53,22 @@ void mmdrop(struct mm_struct *mm);
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-   mmgrab(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-   mmdrop(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   mmdrop(mm);
 }
 
 static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm)
 {
-   /* This depends on mmdrop providing a full smp_mb() */
-   mmdrop(mm);
+   if (IS_ENABLED(CONFIG_MMU_LAZY_TLB))
+   mmdrop(mm); /* This depends on mmdrop providing a full smp_mb() 
*/
+   else
+   smp_mb();
 }
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d19f2f517f6c..14b4fae6f6e3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3253,7 +3253,7 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
__releases(rq->lock)
 {
struct rq *rq = this_rq();
-   struct mm_struct *mm = rq->prev_mm;
+   struct mm_struct *mm = NULL;
long prev_state;
 
/*
@@ -3272,7 +3272,10 @@ static struct rq *finish_task_switch(struct task_struct 
*prev)
  current->comm, current->pid, preempt_count()))
preempt_count_set(FORK_PREEMPT_COUNT);
 
-   rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB
+   mm = rq->prev_lazy_mm;
+   rq->prev_lazy_mm = NULL;
+#endif
 
/*
 * A task struct has one reference for the use as "current".
@@ -3393,22 +3396,11 @@ asmlinkage __visible void schedule_tail(struct 
task_struct *prev)
calculate_sigpending();
 }
 
-/*
- * context_switch - switch to the new MM and the new thread's register state.
- */
-static __always_inline struct rq *
-context_switch(struct rq *rq, struct task_struct *prev,
-  struct task_struct *next, struct rq_flags *rf)
+static __always_inline void
+context_switch_mm(struct rq *rq, struct task_struct *prev,
+  struct task_struct *next)
 {
-   prepare_task_switch(rq, prev, next);
-
-   /*
-* For paravirt, this is coupled with an exit in switch_to to
-* combine the page table reload and the switch backend into
-* one hypercall.
-*/
-   arch_start_context_switch(prev);
-
+#ifdef CONFIG_MMU_LAZY_TLB
/*
 * kernel -> kernel   lazy + transfer active
 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
@@ -3440,10 +3432,37 @@ context_switch(struct rq *rq, struct task_struct *prev,
exit_lazy_tlb(prev->active_mm, next);
 
/* will mmdrop_lazy_tlb() in finish_task_switch(). */
-   rq->prev_mm = prev->active_mm;
+   rq->prev_lazy_mm = prev->active_mm;
prev->active_mm = NULL;
}
}
+#else
+   if (!next->mm)
+   next->active_mm = &init_mm;
+   membarrier_switch_mm(rq, prev->active_mm, next->active_mm);
+   switch_mm_irqs_off(prev->active_mm, next->active_mm, next);
+   if (!prev->mm)
+   prev->active_mm = NULL;
+#endif
+}
+
+/*
+ * context_switch - switch to the new MM and the new thread's register state.
+ */
+static __always_inline struct rq *
+context_switch(struct rq *rq, struct task_struct *prev,
+  struct task_struct *next, struct rq_flags *rf)
+{
+   prepare_task_switch(rq, prev, next);
+
+   /*
+* For paravirt, this is coupled with an exit in switch_to to
+* combine the page table reload and the switch backend into
+* one hypercall.
+*/
+   arch_start_context_switch(prev);
+
+   cont

[RFC PATCH 5/7] lazy tlb: introduce lazy mm refcount helper functions

2020-07-09 Thread Nicholas Piggin
Add explicit _lazy_tlb annotated functions for lazy mm refcounting.
This makes things a bit more explicit, and allows explicit refcounting
to be removed if it is not used.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/smp.c|  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |  4 ++--
 fs/exec.c|  2 +-
 include/linux/sched/mm.h | 17 +
 kernel/cpu.c |  2 +-
 kernel/exit.c|  2 +-
 kernel/kthread.c | 11 +++
 kernel/sched/core.c  | 13 +++--
 8 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 73199470c265..ad95812d2a3f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1253,7 +1253,7 @@ void start_secondary(void *unused)
unsigned int cpu = smp_processor_id();
struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
 
-   mmgrab(&init_mm);
+   mmgrab(&init_mm); /* XXX: where is the mmput for this? */
current->active_mm = &init_mm;
 
smp_store_cpu_info(cpu);
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index b5cc9b23cf02..52730629b3eb 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -652,10 +652,10 @@ static void do_exit_flush_lazy_tlb(void *arg)
 * Must be a kernel thread because sender is single-threaded.
 */
BUG_ON(current->mm);
-   mmgrab(&init_mm);
+   mmgrab_lazy_tlb(&init_mm);
switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
-   mmdrop(mm);
+   mmdrop_lazy_tlb(mm);
}
_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
diff --git a/fs/exec.c b/fs/exec.c
index e2ab71e88293..3a01b2751ea9 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1119,7 +1119,7 @@ static int exec_mmap(struct mm_struct *mm)
mmput(old_mm);
} else {
exit_lazy_tlb(active_mm, tsk);
-   mmdrop(active_mm);
+   mmdrop_lazy_tlb(active_mm);
}
return 0;
 }
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 9b026264b445..110d4ad21de6 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -50,6 +50,23 @@ static inline void mmdrop(struct mm_struct *mm)
 
 void mmdrop(struct mm_struct *mm);
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+   mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+   mmdrop(mm);
+}
+
+static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm)
+{
+   /* This depends on mmdrop providing a full smp_mb() */
+   mmdrop(mm);
+}
+
 /*
  * This has to be called after a get_task_mm()/mmget_not_zero()
  * followed by taking the mmap_lock for writing before modifying the
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 134688d79589..ff9fcbc4e76b 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -578,7 +578,7 @@ static int finish_cpu(unsigned int cpu)
 */
if (mm != &init_mm)
idle->active_mm = &init_mm;
-   mmdrop(mm);
+   mmdrop_lazy_tlb(mm);
return 0;
 }
 
diff --git a/kernel/exit.c b/kernel/exit.c
index 727150f28103..d535da9fd2f8 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -470,7 +470,7 @@ static void exit_mm(void)
__set_current_state(TASK_RUNNING);
mmap_read_lock(mm);
}
-   mmgrab(mm);
+   mmgrab_lazy_tlb(mm);
BUG_ON(mm != current->active_mm);
/* more a memory barrier than a real lock */
task_lock(current);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 6f93c649aa97..a7133cc2ddaf 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1238,12 +1238,12 @@ void kthread_use_mm(struct mm_struct *mm)
WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
WARN_ON_ONCE(tsk->mm);
 
+   mmgrab(mm);
+
task_lock(tsk);
active_mm = tsk->active_mm;
-   if (active_mm != mm) {
-   mmgrab(mm);
+   if (active_mm != mm)
tsk->active_mm = mm;
-   }
tsk->mm = mm;
switch_mm(active_mm, mm, tsk);
task_unlock(tsk);
@@ -1253,7 +1253,7 @@ void kthread_use_mm(struct mm_struct *mm)
 
exit_lazy_tlb(active_mm, tsk);
if (active_mm != mm)
-   mmdrop(active_mm);
+   mmdrop_lazy_tlb(active_mm);
 
to_kthread(tsk)->oldfs = get_fs();
set_fs(USER_DS);
@@ -1276,9 +1276,12 @@ void kthread_unuse_mm(struct mm_struct *mm)
task_lock(tsk);
sync_mm_rss(mm);
tsk->mm = NULL;
+   mmgrab_lazy_tlb(mm);
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
+
+   mmdrop(mm);
 }
 EXPORT_SYMBOL_G

[RFC PATCH 4/7] x86: use exit_lazy_tlb rather than membarrier_mm_sync_core_before_usermode

2020-07-09 Thread Nicholas Piggin
And get rid of the generic sync_core_before_usermode facility.

This helper is the wrong way around, I think. The idea that membarrier
state requires a core sync before returning to user is the easy one
that does not need hiding behind membarrier calls. The gap in core
synchronization due to x86's sysret/sysexit and lazy tlb mode is the
tricky detail that is better put in x86 lazy tlb code.

Consider if an arch did not synchronize the core in switch_mm either: then
membarrier_mm_sync_core_before_usermode would be in the wrong place,
but arch-specific mmu context functions would still be the right place.
There is also an exit_lazy_tlb case that is not covered by this call, which
could be a bug (a kthread uses the membarrier process's mm, then the CPU
context switches back to the process without switching mm or doing a lazy
mm switch).

This makes lazy tlb code a bit more modular.

Signed-off-by: Nicholas Piggin 
---
 .../membarrier-sync-core/arch-support.txt |  6 +++-
 arch/x86/include/asm/mmu_context.h| 35 +++
 arch/x86/include/asm/sync_core.h  | 28 ---
 include/linux/sched/mm.h  | 14 
 include/linux/sync_core.h | 21 ---
 kernel/cpu.c  |  4 ++-
 kernel/kthread.c  |  2 +-
 kernel/sched/core.c   | 16 -
 8 files changed, 51 insertions(+), 75 deletions(-)
 delete mode 100644 arch/x86/include/asm/sync_core.h
 delete mode 100644 include/linux/sync_core.h

diff --git a/Documentation/features/sched/membarrier-sync-core/arch-support.txt 
b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
index 52ad74a25f54..bd43fb1f5986 100644
--- a/Documentation/features/sched/membarrier-sync-core/arch-support.txt
+++ b/Documentation/features/sched/membarrier-sync-core/arch-support.txt
@@ -5,6 +5,10 @@
 #
 # Architecture requirements
 #
+# If your architecture returns to user-space through non-core-serializing
+# instructions, you need to ensure these are done in switch_mm and 
exit_lazy_tlb
+# (if lazy tlb switching is implemented).
+#
 # * arm/arm64/powerpc
 #
 # Rely on implicit context synchronization as a result of exception return
@@ -24,7 +28,7 @@
 # instead on write_cr3() performed by switch_mm() to provide core serialization
 # after changing the current mm, and deal with the special case of kthread ->
 # uthread (temporarily keeping current mm into active_mm) by issuing a
-# sync_core_before_usermode() in that specific case.
+# serializing instruction in exit_lazy_mm() in that specific case.
 #
 ---
 | arch |status|
diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index 255750548433..5263863a9be8 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -95,6 +96,40 @@ static inline void switch_ldt(struct mm_struct *prev, struct 
mm_struct *next)
 #define enter_lazy_tlb enter_lazy_tlb
 extern void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk);
 
+#ifdef CONFIG_MEMBARRIER
+/*
+ * Ensure that a core serializing instruction is issued before returning
+ * to user-mode, if a SYNC_CORE was requested. x86 implements return to
+ * user-space through sysexit, sysrel, and sysretq, which are not core
+ * serializing.
+ *
+ * See the membarrier comment in finish_task_switch as to why this is done
+ * in exit_lazy_tlb.
+ */
+#define exit_lazy_tlb exit_lazy_tlb
+static inline void exit_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
+{
+   /* Switching mm is serializing with write_cr3 */
+if (tsk->mm != mm)
+return;
+
+if (likely(!(atomic_read(&mm->membarrier_state) &
+ MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
+return;
+
+   /* With PTI, we unconditionally serialize before running user code. */
+   if (static_cpu_has(X86_FEATURE_PTI))
+   return;
+   /*
+* Return from interrupt and NMI is done through iret, which is core
+* serializing.
+*/
+   if (in_irq() || in_nmi())
+   return;
+   sync_core();
+}
+#endif
+
 /*
  * Init a new mm.  Used on mm copies, like at fork()
  * and on mm's that are brand-new, like at execve().
diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h
deleted file mode 100644
index c67caafd3381..
--- a/arch/x86/include/asm/sync_core.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_X86_SYNC_CORE_H
-#define _ASM_X86_SYNC_CORE_H
-
-#include 
-#include 
-#include 
-
-/*
- * Ensure that a core serializing instruction is issued before returning
- * to user-mode. x86 implements return to user-space through sysexit,
- * sysrel, and sysretq, which are not core serializing.
- */
-static inline void sync_co

[RFC PATCH 3/7] mm: introduce exit_lazy_tlb

2020-07-09 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---
 fs/exec.c |  5 +++--
 include/asm-generic/mmu_context.h | 20 
 kernel/kthread.c  |  1 +
 kernel/sched/core.c   |  2 ++
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index e6e8a9a70327..e2ab71e88293 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1117,9 +1117,10 @@ static int exec_mmap(struct mm_struct *mm)
setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
mm_update_next_owner(old_mm);
mmput(old_mm);
-   return 0;
+   } else {
+   exit_lazy_tlb(active_mm, tsk);
+   mmdrop(active_mm);
}
-   mmdrop(active_mm);
return 0;
 }
 
diff --git a/include/asm-generic/mmu_context.h 
b/include/asm-generic/mmu_context.h
index 86cea80a50df..3fc4c3879b79 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -24,6 +24,26 @@ static inline void enter_lazy_tlb(struct mm_struct *mm,
 }
 #endif
 
+/*
+ * exit_lazy_tlb - Called after switching away from a lazy TLB mode mm.
+ *
+ * mm:  the lazy mm context that was switched away from
+ * tsk: the task that was switched to non-lazy mm
+ *
+ * tsk->mm will not be NULL.
+ *
+ * Note this is not symmetrical to enter_lazy_tlb, this is not
+ * called when tasks switch into the lazy mm, it's called after the
+ * lazy mm becomes non-lazy (either switched to a different mm or the
+ * owner of the mm returns).
+ */
+#ifndef exit_lazy_tlb
+static inline void exit_lazy_tlb(struct mm_struct *mm,
+   struct task_struct *tsk)
+{
+}
+#endif
+
 /**
  * init_new_context - Initialize context of a new mm_struct.
  * @tsk: task struct for the mm
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 132f84a5fde3..e813d92f2eab 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -1253,6 +1253,7 @@ void kthread_use_mm(struct mm_struct *mm)
 
if (active_mm != mm)
mmdrop(active_mm);
+   exit_lazy_tlb(active_mm, tsk);
 
to_kthread(tsk)->oldfs = get_fs();
set_fs(USER_DS);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..debc917bc69b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3439,6 +3439,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
if (!prev->mm) {// from kernel
+   exit_lazy_tlb(prev->active_mm, next);
+
/* will mmdrop() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
-- 
2.23.0



[RFC PATCH 2/7] arch: use asm-generic mmu context for no-op implementations

2020-07-09 Thread Nicholas Piggin
This patch bunches all architectures together. If the general idea is
accepted I will split them individually. Some architectures can go
further e.g., with consolidating switch_mm and activate_mm but I
only did the more obvious ones.
---
 arch/alpha/include/asm/mmu_context.h | 12 ++---
 arch/arc/include/asm/mmu_context.h   | 16 +++
 arch/arm/include/asm/mmu_context.h   | 26 ++-
 arch/arm64/include/asm/mmu_context.h |  7 ++-
 arch/csky/include/asm/mmu_context.h  |  8 ++--
 arch/hexagon/include/asm/mmu_context.h   | 33 +++---
 arch/ia64/include/asm/mmu_context.h  | 17 ++-
 arch/m68k/include/asm/mmu_context.h  | 47 
 arch/microblaze/include/asm/mmu_context_mm.h |  8 ++--
 arch/microblaze/include/asm/processor.h  |  3 --
 arch/mips/include/asm/mmu_context.h  | 11 ++---
 arch/nds32/include/asm/mmu_context.h | 10 +
 arch/nios2/include/asm/mmu_context.h | 21 ++---
 arch/nios2/mm/mmu_context.c  |  1 +
 arch/openrisc/include/asm/mmu_context.h  |  8 ++--
 arch/openrisc/mm/tlb.c   |  2 +
 arch/parisc/include/asm/mmu_context.h| 12 ++---
 arch/powerpc/include/asm/mmu_context.h   | 22 +++--
 arch/riscv/include/asm/mmu_context.h | 22 +
 arch/s390/include/asm/mmu_context.h  |  9 ++--
 arch/sh/include/asm/mmu_context.h|  5 +--
 arch/sh/include/asm/mmu_context_32.h |  9 
 arch/sparc/include/asm/mmu_context_32.h  | 10 ++---
 arch/sparc/include/asm/mmu_context_64.h  | 10 ++---
 arch/um/include/asm/mmu_context.h| 12 +++--
 arch/unicore32/include/asm/mmu_context.h | 24 ++
 arch/x86/include/asm/mmu_context.h   |  6 +++
 arch/xtensa/include/asm/mmu_context.h| 11 ++---
 arch/xtensa/include/asm/nommu_context.h  | 26 +--
 29 files changed, 106 insertions(+), 302 deletions(-)

diff --git a/arch/alpha/include/asm/mmu_context.h 
b/arch/alpha/include/asm/mmu_context.h
index 6d7d9bc1b4b8..4eea7c616992 100644
--- a/arch/alpha/include/asm/mmu_context.h
+++ b/arch/alpha/include/asm/mmu_context.h
@@ -214,8 +214,6 @@ ev4_activate_mm(struct mm_struct *prev_mm, struct mm_struct 
*next_mm)
tbiap();
 }
 
-#define deactivate_mm(tsk,mm)  do { } while (0)
-
 #ifdef CONFIG_ALPHA_GENERIC
 # define switch_mm(a,b,c)  alpha_mv.mv_switch_mm((a),(b),(c))
 # define activate_mm(x,y)  alpha_mv.mv_activate_mm((x),(y))
@@ -229,6 +227,7 @@ ev4_activate_mm(struct mm_struct *prev_mm, struct mm_struct 
*next_mm)
 # endif
 #endif
 
+#define init_new_context init_new_context
 static inline int
 init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 {
@@ -242,12 +241,7 @@ init_new_context(struct task_struct *tsk, struct mm_struct 
*mm)
return 0;
 }
 
-extern inline void
-destroy_context(struct mm_struct *mm)
-{
-   /* Nothing to do.  */
-}
-
+#define enter_lazy_tlb enter_lazy_tlb
 static inline void
 enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 {
@@ -255,6 +249,8 @@ enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
*tsk)
  = ((unsigned long)mm->pgd - IDENT_ADDR) >> PAGE_SHIFT;
 }
 
+#include 
+
 #ifdef __MMU_EXTERN_INLINE
 #undef __EXTERN_INLINE
 #undef __MMU_EXTERN_INLINE
diff --git a/arch/arc/include/asm/mmu_context.h 
b/arch/arc/include/asm/mmu_context.h
index 3a5e6a5b9ed6..586d31902a99 100644
--- a/arch/arc/include/asm/mmu_context.h
+++ b/arch/arc/include/asm/mmu_context.h
@@ -102,6 +102,7 @@ static inline void get_new_mmu_context(struct mm_struct *mm)
  * Initialize the context related info for a new mm_struct
  * instance.
  */
+#define init_new_context init_new_context
 static inline int
 init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 {
@@ -113,6 +114,7 @@ init_new_context(struct task_struct *tsk, struct mm_struct 
*mm)
return 0;
 }
 
+#define destroy_context destroy_context
 static inline void destroy_context(struct mm_struct *mm)
 {
unsigned long flags;
@@ -153,13 +155,12 @@ static inline void switch_mm(struct mm_struct *prev, 
struct mm_struct *next,
 }
 
 /*
- * Called at the time of execve() to get a new ASID
- * Note the subtlety here: get_new_mmu_context() behaves differently here
- * vs. in switch_mm(). Here it always returns a new ASID, because mm has
- * an unallocated "initial" value, while in latter, it moves to a new ASID,
- * only if it was unallocated
+ * activate_mm defaults to switch_mm and is called at the time of execve() to
+ * get a new ASID Note the subtlety here: get_new_mmu_context() behaves
+ * differently here vs. in switch_mm(). Here it always returns a new ASID,
+ * because mm has an unallocated "initial" value, while in latter, it moves to
+ * a new ASID, only if it was unallocated
  */
-#define activate_mm(prev, next)switch_mm(prev, next, NULL)
 
 /* it seemed that deactivate_mm( ) is a reasonable place to do book-

[RFC PATCH 1/7] asm-generic: add generic MMU versions of mmu context functions

2020-07-09 Thread Nicholas Piggin
Many of these are no-ops on many architectures, so extend mmu_context.h
to cover MMU and NOMMU, and split the NOMMU bits out to nommu_context.h
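
The generic header only provides a default for a hook if the architecture has
not already defined a macro of the same name, so an arch keeps one non-trivial
hook and takes the no-op defaults for the rest. A minimal sketch of how an
architecture is expected to use the new guards (the arch and file names here
are illustrative, not taken from the series):

	/* arch/foo/include/asm/mmu_context.h -- hypothetical architecture */
	#ifndef _ASM_FOO_MMU_CONTEXT_H
	#define _ASM_FOO_MMU_CONTEXT_H

	/* keep an arch-specific implementation of this one hook ... */
	#define enter_lazy_tlb enter_lazy_tlb
	static inline void enter_lazy_tlb(struct mm_struct *mm,
					  struct task_struct *tsk)
	{
		/* arch-specific lazy-TLB bookkeeping goes here */
	}

	/* ... and pick up the generic no-op defaults for everything else */
	#include <asm-generic/mmu_context.h>

	#endif /* _ASM_FOO_MMU_CONTEXT_H */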

Cc: Arnd Bergmann 
Cc: Remis Lima Baima 
Signed-off-by: Nicholas Piggin 
---
 arch/microblaze/include/asm/mmu_context.h |  2 +-
 arch/sh/include/asm/mmu_context.h |  2 +-
 include/asm-generic/mmu_context.h | 57 +--
 include/asm-generic/nommu_context.h   | 19 
 4 files changed, 64 insertions(+), 16 deletions(-)
 create mode 100644 include/asm-generic/nommu_context.h

diff --git a/arch/microblaze/include/asm/mmu_context.h 
b/arch/microblaze/include/asm/mmu_context.h
index f74f9da07fdc..34004efb3def 100644
--- a/arch/microblaze/include/asm/mmu_context.h
+++ b/arch/microblaze/include/asm/mmu_context.h
@@ -2,5 +2,5 @@
 #ifdef CONFIG_MMU
 # include 
 #else
-# include 
+# include 
 #endif
diff --git a/arch/sh/include/asm/mmu_context.h 
b/arch/sh/include/asm/mmu_context.h
index 48e67d544d53..9470d17c71c2 100644
--- a/arch/sh/include/asm/mmu_context.h
+++ b/arch/sh/include/asm/mmu_context.h
@@ -134,7 +134,7 @@ static inline void switch_mm(struct mm_struct *prev,
 #define set_TTB(pgd)   do { } while (0)
 #define get_TTB()  (0)
 
-#include 
+#include 
 
 #endif /* CONFIG_MMU */
 
diff --git a/include/asm-generic/mmu_context.h 
b/include/asm-generic/mmu_context.h
index 6be9106fb6fb..86cea80a50df 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -3,44 +3,73 @@
 #define __ASM_GENERIC_MMU_CONTEXT_H
 
 /*
- * Generic hooks for NOMMU architectures, which do not need to do
- * anything special here.
+ * Generic hooks to implement no-op functionality.
  */
 
-#include 
-
 struct task_struct;
 struct mm_struct;
 
+/*
+ * enter_lazy_tlb - Called when "tsk" is about to enter lazy TLB mode.
+ *
+ * @mm:  the currently active mm context which is becoming lazy
+ * @tsk: task which is entering lazy tlb
+ *
+ * tsk->mm will be NULL
+ */
+#ifndef enter_lazy_tlb
 static inline void enter_lazy_tlb(struct mm_struct *mm,
struct task_struct *tsk)
 {
 }
+#endif
 
+/**
+ * init_new_context - Initialize context of a new mm_struct.
+ * @tsk: task struct for the mm
+ * @mm:  the new mm struct
+ */
+#ifndef init_new_context
 static inline int init_new_context(struct task_struct *tsk,
struct mm_struct *mm)
 {
return 0;
 }
+#endif
 
+/**
+ * destroy_context - Undo init_new_context when the mm is going away
+ * @mm: old mm struct
+ */
+#ifndef destroy_context
 static inline void destroy_context(struct mm_struct *mm)
 {
 }
+#endif
 
-static inline void deactivate_mm(struct task_struct *task,
-   struct mm_struct *mm)
-{
-}
-
-static inline void switch_mm(struct mm_struct *prev,
-   struct mm_struct *next,
-   struct task_struct *tsk)
+/**
+ * activate_mm - called after exec switches the current task to a new mm, to 
switch to it
+ * @prev_mm: previous mm of this task
+ * @next_mm: new mm
+ */
+#ifndef activate_mm
+static inline void activate_mm(struct mm_struct *prev_mm,
+  struct mm_struct *next_mm)
 {
+   switch_mm(prev_mm, next_mm, current);
 }
+#endif
 
-static inline void activate_mm(struct mm_struct *prev_mm,
-  struct mm_struct *next_mm)
+/**
+ * dectivate_mm - called when an mm is released after exit or exec switches 
away from it
+ * @tsk: the task
+ * @mm:  the old mm
+ */
+#ifndef deactivate_mm
+static inline void deactivate_mm(struct task_struct *tsk,
+   struct mm_struct *mm)
 {
 }
+#endif
 
 #endif /* __ASM_GENERIC_MMU_CONTEXT_H */
diff --git a/include/asm-generic/nommu_context.h 
b/include/asm-generic/nommu_context.h
new file mode 100644
index ..72b8d8b1d81e
--- /dev/null
+++ b/include/asm-generic/nommu_context.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_GENERIC_NOMMU_H
+#define __ASM_GENERIC_NOMMU_H
+
+/*
+ * Generic hooks for NOMMU architectures, which do not need to do
+ * anything special here.
+ */
+
+#include 
+#include 
+
+static inline void switch_mm(struct mm_struct *prev,
+   struct mm_struct *next,
+   struct task_struct *tsk)
+{
+}
+
+#endif /* __ASM_GENERIC_NOMMU_H */
-- 
2.23.0



[RFC PATCH 0/7] mmu context cleanup, lazy tlb cleanup,

2020-07-09 Thread Nicholas Piggin
This blew up a bit bigger than I thought, so I'd like to get some
comments as to whether people agree with the direction it's going.

The patches aren't cleanly split out by arch, but as it is now it's
probably easier to get a quick overview of the changes at a glance
anyway.

So there's a few different things here.

1. Clean up and use asm-generic for no-op mmu context functions (so
   not just for nommu architectures). This should be functionally a
   no-op for everybody. This allows exit_lazy_tlb to easily be added.

2. Add exit_lazy_tlb and use it for x86, so this is x86 and membarrier
   specific changes. I _may_ have spotted a small membarrier / core sync
   bug here when adding exit_lazy_tlb.

3. Tidy up lazy tlb a little bit, have its own refcount function and
   allow it to be selected out. We can audit the nommu archs and
   deselect it for those.

4. Add a non-refcounting lazy mmu mode, to help scalability when the
   same mm is used for a lot of lazy mmu switching.

Comments, questions on anything would be much appreciated.

Thanks,
Nick

Nicholas Piggin (7):
  asm-generic: add generic MMU versions of mmu context functions
  arch: use asm-generic mmu context for no-op implementations
  mm: introduce exit_lazy_tlb
  x86: use exit_lazy_tlb rather than
membarrier_mm_sync_core_before_usermode
  lazy tlb: introduce lazy mm refcount helper functions
  lazy tlb: allow lazy tlb mm switching to be configurable
  lazy tlb: shoot lazies, a non-refcounting lazy tlb option

 .../membarrier-sync-core/arch-support.txt |  6 +-
 arch/Kconfig  | 23 +
 arch/alpha/include/asm/mmu_context.h  | 12 +--
 arch/arc/include/asm/mmu_context.h| 16 ++--
 arch/arm/include/asm/mmu_context.h| 26 +-
 arch/arm64/include/asm/mmu_context.h  |  7 +-
 arch/csky/include/asm/mmu_context.h   |  8 +-
 arch/hexagon/include/asm/mmu_context.h| 33 ++--
 arch/ia64/include/asm/mmu_context.h   | 17 +---
 arch/m68k/include/asm/mmu_context.h   | 47 ++-
 arch/microblaze/include/asm/mmu_context.h |  2 +-
 arch/microblaze/include/asm/mmu_context_mm.h  |  8 +-
 arch/microblaze/include/asm/processor.h   |  3 -
 arch/mips/include/asm/mmu_context.h   | 11 +--
 arch/nds32/include/asm/mmu_context.h  | 10 +--
 arch/nios2/include/asm/mmu_context.h  | 21 +
 arch/nios2/mm/mmu_context.c   |  1 +
 arch/openrisc/include/asm/mmu_context.h   |  8 +-
 arch/openrisc/mm/tlb.c|  2 +
 arch/parisc/include/asm/mmu_context.h | 12 +--
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/mmu_context.h| 22 ++---
 arch/powerpc/kernel/smp.c |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c  |  4 +-
 arch/riscv/include/asm/mmu_context.h  | 22 +
 arch/s390/include/asm/mmu_context.h   |  9 +-
 arch/sh/include/asm/mmu_context.h |  7 +-
 arch/sh/include/asm/mmu_context_32.h  |  9 --
 arch/sparc/include/asm/mmu_context_32.h   | 10 +--
 arch/sparc/include/asm/mmu_context_64.h   | 10 +--
 arch/um/include/asm/mmu_context.h | 12 ++-
 arch/unicore32/include/asm/mmu_context.h  | 24 +-
 arch/x86/include/asm/mmu_context.h| 41 +
 arch/x86/include/asm/sync_core.h  | 28 ---
 arch/xtensa/include/asm/mmu_context.h | 11 +--
 arch/xtensa/include/asm/nommu_context.h   | 26 +-
 fs/exec.c |  5 +-
 include/asm-generic/mmu_context.h | 77 +
 include/asm-generic/nommu_context.h   | 19 +
 include/linux/sched/mm.h  | 35 
 include/linux/sync_core.h | 21 -
 kernel/cpu.c  |  6 +-
 kernel/exit.c |  2 +-
 kernel/fork.c | 39 +
 kernel/kthread.c  | 12 ++-
 kernel/sched/core.c   | 84 ---
 kernel/sched/sched.h  |  4 +-
 47 files changed, 388 insertions(+), 427 deletions(-)
 delete mode 100644 arch/x86/include/asm/sync_core.h
 create mode 100644 include/asm-generic/nommu_context.h
 delete mode 100644 include/linux/sync_core.h

-- 
2.23.0



Re: [RFC][PATCH] avoid refcounting the lazy tlb mm struct

2020-07-09 Thread Anton Blanchard
Hi Nick,

> On big systems, the mm refcount can become highly contented when doing
> a lot of context switching with threaded applications (particularly
> switching between the idle thread and an application thread).
> 
> Not doing lazy tlb at all slows switching down quite a bit, so I
> wonder if we can avoid the refcount for the lazy tlb, but have
> __mmdrop() IPI all CPUs that might be using this mm lazily.
> 
> This patch has only had light testing so far, but seems to work okay.

I tested this patch on a large POWER8 system with 1536 hardware threads.
I can create a worst case situation for mm refcounting by using
the threaded context switch test in will-it-scale set to half the
number of available CPUs (768).

With that workload the patch improves the context switch rate by 118x!

Tested-by: Anton Blanchard 

Thanks,
Anton

> diff --git a/arch/Kconfig b/arch/Kconfig
> index 8cc35dc556c7..69ea7172db3d 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -411,6 +411,16 @@ config MMU_GATHER_NO_GATHER
>   bool
>   depends on MMU_GATHER_TABLE_FREE
>  
> +config MMU_LAZY_TLB_SHOOTDOWN
> + bool
> + help
> +   Instead of refcounting the "lazy tlb" mm struct, which can
> cause
> +   contention with multi-threaded apps on large
> multiprocessor systems,
> +   this option causes __mmdrop to IPI all CPUs in the
> mm_cpumask and
> +   switch to init_mm if they were using the to-be-freed mm as
> the lazy
> +   tlb. Architectures which do not track all possible lazy
> tlb CPUs in
> +   mm_cpumask can not use this (without modification).
> +
>  config ARCH_HAVE_NMI_SAFE_CMPXCHG
>   bool
>  
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 920c4e3ca4ef..24ac85c868db 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -225,6 +225,7 @@ config PPC
>   select HAVE_PERF_USER_STACK_DUMP
>   select MMU_GATHER_RCU_TABLE_FREE
>   select MMU_GATHER_PAGE_SIZE
> + select MMU_LAZY_TLB_SHOOTDOWN
>   select HAVE_REGS_AND_STACK_ACCESS_API
>   select HAVE_RELIABLE_STACKTRACE if
> PPC_BOOK3S_64 && CPU_LITTLE_ENDIAN select HAVE_SYSCALL_TRACEPOINTS
> diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c
> b/arch/powerpc/mm/book3s64/radix_tlb.c index
> b5cc9b23cf02..52730629b3eb 100644 ---
> a/arch/powerpc/mm/book3s64/radix_tlb.c +++
> b/arch/powerpc/mm/book3s64/radix_tlb.c @@ -652,10 +652,10 @@ static
> void do_exit_flush_lazy_tlb(void *arg)
>* Must be a kernel thread because sender is
> single-threaded. */
>   BUG_ON(current->mm);
> - mmgrab(&init_mm);
> + mmgrab_lazy_tlb(&init_mm);
>   switch_mm(mm, &init_mm, current);
>   current->active_mm = &init_mm;
> - mmdrop(mm);
> + mmdrop_lazy_tlb(mm);
>   }
>   _tlbiel_pid(pid, RIC_FLUSH_ALL);
>  }
> diff --git a/fs/exec.c b/fs/exec.c
> index e6e8a9a70327..6c96c8feba1f 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1119,7 +1119,7 @@ static int exec_mmap(struct mm_struct *mm)
>   mmput(old_mm);
>   return 0;
>   }
> - mmdrop(active_mm);
> + mmdrop_lazy_tlb(active_mm);
>   return 0;
>  }
>  
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 480a4d1b7dd8..ef28059086a1 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -51,6 +51,25 @@ static inline void mmdrop(struct mm_struct *mm)
>  
>  void mmdrop(struct mm_struct *mm);
>  
> +static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
> +{
> + if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN))
> + mmgrab(mm);
> +}
> +
> +static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
> +{
> + if (!IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN))
> + mmdrop(mm);
> +}
> +
> +static inline void mmdrop_lazy_tlb_smp_mb(struct mm_struct *mm)
> +{
> + mmdrop_lazy_tlb(mm);
> + if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN))
> + smp_mb();
> +}
> +
>  /*
>   * This has to be called after a get_task_mm()/mmget_not_zero()
>   * followed by taking the mmap_lock for writing before modifying the
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 142b23645d82..e3f1039cee9f 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -685,6 +685,34 @@ static void check_mm(struct mm_struct *mm)
>  #define allocate_mm()(kmem_cache_alloc(mm_cachep,
> GFP_KERNEL)) #define free_mm(mm)  (kmem_cache_free(mm_cachep,
> (mm))) 
> +static void do_shoot_lazy_tlb(void *arg)
> +{
> + struct mm_struct *mm = arg;
> +
> + if (current->active_mm == mm) {
> + BUG_ON(current->mm);
> + switch_mm(mm, &init_mm, current);
> + current->active_mm = &init_mm;
> + }
> +}
> +
> +static void do_check_lazy_tlb(void *arg)
> +{
> + struct mm_struct *mm = arg;
> +
> + BUG_ON(current->active_mm == mm);
> +}
> +
> +void shoot_lazy_tlbs(struct mm_struct *mm)
> +{
> + if (IS_ENABLED(CONFIG_MM

RE: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure

2020-07-09 Thread Bulent Abali
copied verbatim from P9 DD2 Nest Accelerators Workbook Version 3.2

Table 4-36. CSB Non-zero CC Reported Error Types

CC=5, Error Type: Translation, 
Comment: Unused, defined by RFC02130 (footnote:  DMA controller uses this 
CC internally in translation fault handling. Do not reuse for other 
purposes.)

CC=240 through 251, reserved for future firmware use, 
Comment: Error codes 240 - 255 (0xF0 - 0xFF) are reserved for firmware use 
and are not signalled by the hardware. 
These CCs are written in the CSB by hypervisor to alert the partition to 
error conditions detected by the hypervisor. 
These codes have been used in past processors for this purpose and ought 
not be relocated.
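
For what it is worth, the practical difference shows up in how a user space
NX caller interprets the CSB once a request completes. A rough sketch, using
the CSB layout and CSB_* names from arch/powerpc/include/asm/icswx.h quoted
later in this thread; csb points at the caller's status block, and
touch_and_retry() / handle_nx_error() are hypothetical helpers, not part of
any API:

	if (csb->flags & CSB_V) {
		switch (csb->cc) {
		case 0:
			/* request completed successfully */
			break;
		case CSB_CC_ADDRESS_TRANSLATION:
			/* 250: written by the OS; csb->address holds the
			 * faulting address (big endian), touch it and
			 * resubmit the request */
			touch_and_retry(csb);
			break;
		default:
			/* everything else is reported by the NX hardware */
			handle_nx_error(csb->cc);
		}
	}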





From:   Haren Myneni/Beaverton/IBM
To: Michael Ellerman 
Cc: ab...@us.ibm.com, Haren Myneni , 
linuxppc-dev@lists.ozlabs.org, 
"Linuxppc-dev", 
rzin...@linux.ibm.com, tuli...@br.ibm.com, Haren 
Myneni/Beaverton/IBM@IBMUS
Date:   07/09/2020 04:01 PM
Subject:Re: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper 
error for address translation failure




"Linuxppc-dev"  
wrote on 07/09/2020 04:22:10 AM:

> From: Michael Ellerman 
> To: Haren Myneni 
> Cc: tuli...@br.ibm.com, ab...@us.ibm.com, linuxppc-
> d...@lists.ozlabs.org, rzin...@linux.ibm.com
> Date: 07/09/2020 04:21 AM
> Subject: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper error
> for address translation failure
> Sent by: "Linuxppc-dev"  +hbabu=us.ibm@lists.ozlabs.org>
> 
> Haren Myneni  writes:
> > DMA controller uses CC=5 internally for translation fault handling. So
> > OS should be using CC=250 and should report this error to the user 
space
> > when NX encounters address translation failure on the request buffer.
> 
> That doesn't really explain *why* the OS must use CC=250.
> 
> Is it documented somewhere that 5 is for hardware use, and 250 is for
> software?

Yes, mentioned in Table 4-36. CSB Non-zero CC Reported Error Types (P9 NX 
DD2 work book). Also footnote for CC=5 says "DMA controller uses this CC 
internally in translation fault handling. Do not reuse for other purposes"

I will add documentation reference for CC=250 comment. 

> 
> > This patch defines CSB_CC_ADDRESS_TRANSLATION(250) and updates
> > CSB.CC with this proper error code for user space.
> 
> We still have:
> 
> #define CSB_CC_TRANSLATION   (5)
> 
> And it's very unclear where one or the other should be used.
> 
> Can one or the other get a name that makes the distinction clear.

CSB_CC_TRANSLATION was added in the 842 driver (nx-common-powernv.c) when NX 
was introduced (P7+). NX will not see faults on kernel requests (CC=250, or 
even CC=5). 

Table 4-36: 
For CC=5: says Translation
CC=250: says "Address Translation Fault"

So I can say CRB_CC_ADDRESS_TRANSLATION_FAULT or CRN_CC_TRANSLATION_FAULT. 
This code path (and these CRBs) should be generic, so it should not use a 
name like CRB_CC_NX_FAULT. 

Thanks
Haren

> 
> cheers
> 
> 
> > diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/
> powerpc/vas-api.rst
> > index 1217c2f..78627cc 100644
> > --- a/Documentation/powerpc/vas-api.rst
> > +++ b/Documentation/powerpc/vas-api.rst
> > @@ -213,7 +213,7 @@ request buffers are not in memory. The 
> operating system handles the fault by
> >  updating CSB with the following data:
> > 
> > csb.flags = CSB_V;
> > -   csb.cc = CSB_CC_TRANSLATION;
> > +   csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> > csb.ce = CSB_CE_TERMINATION;
> > csb.address = fault_address;
> > 
> > diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/
> include/asm/icswx.h
> > index 965b1f3..b1c9a57 100644
> > --- a/arch/powerpc/include/asm/icswx.h
> > +++ b/arch/powerpc/include/asm/icswx.h
> > @@ -77,6 +77,8 @@ struct coprocessor_completion_block {
> >  #define CSB_CC_CHAIN  (37)
> >  #define CSB_CC_SEQUENCE  (38)
> >  #define CSB_CC_HW  (39)
> > +/* User space address traslation failure */
> > +#define   CSB_CC_ADDRESS_TRANSLATION   (250)
> > 
> >  #define CSB_SIZE  (0x10)
> >  #define CSB_ALIGN  CSB_SIZE
> > diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/
> powerpc/platforms/powernv/vas-fault.c
> > index 266a6ca..33e89d4 100644
> > --- a/arch/powerpc/platforms/powernv/vas-fault.c
> > +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> > @@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
> > csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> > 
> > memset(&csb, 0, sizeof(csb));
> > -   csb.cc = CSB_CC_TRANSLATION;
> > +   csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> > csb.ce = CSB_CE_TERMINATION;
> > csb.cs = 0;
> > csb.count = 0;
> > -- 
> > 1.8.3.1
> 






RE: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure

2020-07-09 Thread Haren Myneni



"Linuxppc-dev" 
wrote on 07/09/2020 04:22:10 AM:

> From: Michael Ellerman 
> To: Haren Myneni 
> Cc: tuli...@br.ibm.com, ab...@us.ibm.com, linuxppc-
> d...@lists.ozlabs.org, rzin...@linux.ibm.com
> Date: 07/09/2020 04:21 AM
> Subject: [EXTERNAL] Re: [PATCH 1/2] powerpc/vas: Report proper error
> for address translation failure
> Sent by: "Linuxppc-dev"  +hbabu=us.ibm@lists.ozlabs.org>
>
> Haren Myneni  writes:
> > DMA controller uses CC=5 internally for translation fault handling. So
> > OS should be using CC=250 and should report this error to the user
space
> > when NX encounters address translation failure on the request buffer.
>
> That doesn't really explain *why* the OS must use CC=250.
>
> Is it documented somewhere that 5 is for hardware use, and 250 is for
> software?

Yes, it is mentioned in Table 4-36, "CSB Non-zero CC Reported Error Types" (P9 NX
DD2 workbook). Also, the footnote for CC=5 says "DMA controller uses this CC
internally in translation fault handling. Do not reuse for other purposes".

I will add a documentation reference for the CC=250 comment.

>
> > This patch defines CSB_CC_ADDRESS_TRANSLATION(250) and updates
> > CSB.CC with this proper error code for user space.
>
> We still have:
>
> #define CSB_CC_TRANSLATION   (5)
>
> And it's very unclear where one or the other should be used.
>
> Can one or the other get a name that makes the distinction clear.

CSB_CC_TRANSLATION was added in the 842 driver (nx-common-powernv.c) when NX was
introduced (P7+). NX will not see faults on kernel requests, neither CC=250 nor
even CC=5.

Table 4-36:
For CC=5, it says "Translation"
For CC=250, it says "Address Translation Fault"

So I can say CRB_CC_ADDRESS_TRANSLATION_FAULT or CRB_CC_TRANSLATION_FAULT.
This code path (and the CRBs) should be generic, so it should not use something
like CRB_CC_NX_FAULT.

Thanks
Haren

>
> cheers
>
>
> > diff --git a/Documentation/powerpc/vas-api.rst b/Documentation/
> powerpc/vas-api.rst
> > index 1217c2f..78627cc 100644
> > --- a/Documentation/powerpc/vas-api.rst
> > +++ b/Documentation/powerpc/vas-api.rst
> > @@ -213,7 +213,7 @@ request buffers are not in memory. The
> operating system handles the fault by
> >  updating CSB with the following data:
> >
> > csb.flags = CSB_V;
> > -   csb.cc = CSB_CC_TRANSLATION;
> > +   csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> > csb.ce = CSB_CE_TERMINATION;
> > csb.address = fault_address;
> >
> > diff --git a/arch/powerpc/include/asm/icswx.h b/arch/powerpc/
> include/asm/icswx.h
> > index 965b1f3..b1c9a57 100644
> > --- a/arch/powerpc/include/asm/icswx.h
> > +++ b/arch/powerpc/include/asm/icswx.h
> > @@ -77,6 +77,8 @@ struct coprocessor_completion_block {
> >  #define CSB_CC_CHAIN  (37)
> >  #define CSB_CC_SEQUENCE  (38)
> >  #define CSB_CC_HW  (39)
> > +/* User space address translation failure */
> > +#define   CSB_CC_ADDRESS_TRANSLATION   (250)
> >
> >  #define CSB_SIZE  (0x10)
> >  #define CSB_ALIGN  CSB_SIZE
> > diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/
> powerpc/platforms/powernv/vas-fault.c
> > index 266a6ca..33e89d4 100644
> > --- a/arch/powerpc/platforms/powernv/vas-fault.c
> > +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> > @@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
> > csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
> >
> > memset(&csb, 0, sizeof(csb));
> > -   csb.cc = CSB_CC_TRANSLATION;
> > +   csb.cc = CSB_CC_ADDRESS_TRANSLATION;
> > csb.ce = CSB_CE_TERMINATION;
> > csb.cs = 0;
> > csb.count = 0;
> > --
> > 1.8.3.1
>


[PATCH v5] ima: move APPRAISE_BOOTPARAM dependency on ARCH_POLICY to runtime

2020-07-09 Thread Bruno Meneguele
APPRAISE_BOOTPARAM has been marked as dependent on !ARCH_POLICY at compile
time, enforcing appraisal whenever the kernel has the arch policy option
enabled.

However, this breaks systems where the option is set but the system didn't
boot on a "secure boot" platform. In this scenario, any time an appraisal
policy (e.g. ima_policy=appraise_tcb) is used it will be enforced, without
giving the user the opportunity to label the filesystem before integrity is
enforced.

Considering that ARCH_POLICY is only effective when secure boot is actually
enabled, this patch removes the compile-time dependency and moves it to a
runtime decision based on the platform's secure boot state.

With this patch:

- x86-64 with secure boot enabled

[0.004305] Secure boot enabled
...
[0.015651] Kernel command line: <...> ima_policy=appraise_tcb 
ima_appraise=fix
[0.015682] ima: appraise boot param ignored: secure boot enabled

- powerpc with secure boot disabled

[0.00] Kernel command line: <...> ima_policy=appraise_tcb 
ima_appraise=fix
[0.00] Secure boot mode disabled
...
< nothing about boot param ignored >

System working fine without secure boot and with both options set:

CONFIG_IMA_APPRAISE_BOOTPARAM=y
CONFIG_IMA_ARCH_POLICY=y

Audit logs point to "missing-hash", but execution is still allowed due to
ima_appraise=fix:

type=INTEGRITY_DATA msg=audit(07/09/2020 12:30:27.778:1691) : pid=4976
uid=root auid=root ses=2
subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 op=appraise_data
cause=missing-hash comm=bash name=/usr/bin/evmctl dev="dm-0" ino=493150
res=no

Cc: sta...@vger.kernel.org
Fixes: d958083a8f64 ("x86/ima: define arch_get_ima_policy() for x86")
Signed-off-by: Bruno Meneguele 
---
Changelog:
v5:
  - add pr_info() to inform user the ima_appraise= boot param is being
ignored due to secure boot enabled (Nayna)
  - add some testing results to commit log
v4:
  - instead of change arch_policy loading code, check secure boot state at
"ima_appraise=" parameter handler (Mimi)
v3:
  - extend secure boot arch checker to also consider trusted boot
  - enforce IMA appraisal when secure boot is effectively enabled (Nayna)
  - fix ima_appraise flag assignment by or'ing it (Mimi)
v2:
  - pr_info() message prefix correction

 security/integrity/ima/Kconfig| 2 +-
 security/integrity/ima/ima_appraise.c | 5 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index edde88dbe576..62dc11a5af01 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -232,7 +232,7 @@ config IMA_APPRAISE_REQUIRE_POLICY_SIGS
 
 config IMA_APPRAISE_BOOTPARAM
bool "ima_appraise boot parameter"
-   depends on IMA_APPRAISE && !IMA_ARCH_POLICY
+   depends on IMA_APPRAISE
default y
help
  This option enables the different "ima_appraise=" modes
diff --git a/security/integrity/ima/ima_appraise.c 
b/security/integrity/ima/ima_appraise.c
index a9649b04b9f1..884de471b38a 100644
--- a/security/integrity/ima/ima_appraise.c
+++ b/security/integrity/ima/ima_appraise.c
@@ -19,6 +19,11 @@
 static int __init default_appraise_setup(char *str)
 {
 #ifdef CONFIG_IMA_APPRAISE_BOOTPARAM
+   if (arch_ima_get_secureboot()) {
+   pr_info("appraise boot param ignored: secure boot enabled");
+   return 1;
+   }
+
if (strncmp(str, "off", 3) == 0)
ima_appraise = 0;
else if (strncmp(str, "log", 3) == 0)
-- 
2.26.2



Re: [PATCH 2/2] PCI/AER: Log correctable errors as warning, not error

2020-07-09 Thread Bjorn Helgaas
On Tue, Jul 07, 2020 at 07:14:01PM -0500, Bjorn Helgaas wrote:
> From: Matt Jolly 
> 
> PCIe correctable errors are recovered by hardware with no need for software
> intervention (PCIe r5.0, sec 6.2.2.1).
> 
> Reduce the log level of correctable errors from KERN_ERR to KERN_WARNING.
> 
> The bug reports below are for correctable error logging.  This doesn't fix
> the cause of those reports, but it may make the messages less alarming.
> 
> [bhelgaas: commit log, use pci_printk() to avoid code duplication]
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=201517
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=196183
> Link: https://lore.kernel.org/r/20200618155511.16009-1-Kangie@footclan.ninja
> Signed-off-by: Matt Jolly 
> Signed-off-by: Bjorn Helgaas 

I applied both of these to pci/error for v5.9.

> ---
>  drivers/pci/pcie/aer.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9176c8a968b9..ca886bf91fd9 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -673,20 +673,23 @@ static void __aer_print_error(struct pci_dev *dev,
>  {
>   const char **strings;
>   unsigned long status = info->status & ~info->mask;
> - const char *errmsg;
> + const char *level, *errmsg;
>   int i;
>  
> - if (info->severity == AER_CORRECTABLE)
> + if (info->severity == AER_CORRECTABLE) {
>   strings = aer_correctable_error_string;
> - else
> + level = KERN_WARNING;
> + } else {
>   strings = aer_uncorrectable_error_string;
> + level = KERN_ERR;
> + }
>  
>   for_each_set_bit(i, &status, 32) {
>   errmsg = strings[i];
>   if (!errmsg)
>   errmsg = "Unknown Error Bit";
>  
> - pci_err(dev, "   [%2d] %-22s%s\n", i, errmsg,
> + pci_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
>   info->first_error == i ? " (First)" : "");
>   }
>   pci_dev_aer_stats_incr(dev, info);
> @@ -696,6 +699,7 @@ void aer_print_error(struct pci_dev *dev, struct 
> aer_err_info *info)
>  {
>   int layer, agent;
>   int id = ((dev->bus->number << 8) | dev->devfn);
> + const char *level;
>  
>   if (!info->status) {
>   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
> (Unregistered Agent ID)\n",
> @@ -706,13 +710,14 @@ void aer_print_error(struct pci_dev *dev, struct 
> aer_err_info *info)
>   layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>   agent = AER_GET_AGENT(info->severity, info->status);
>  
> - pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> - aer_error_severity_string[info->severity],
> - aer_error_layer[layer], aer_agent_string[agent]);
> + level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> +
> + pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> +aer_error_severity_string[info->severity],
> +aer_error_layer[layer], aer_agent_string[agent]);
>  
> - pci_err(dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> - dev->vendor, dev->device,
> - info->status, info->mask);
> + pci_printk(level, dev, "  device [%04x:%04x] error 
> status/mask=%08x/%08x\n",
> +dev->vendor, dev->device, info->status, info->mask);
>  
>   __aer_print_error(dev, info);
>  
> -- 
> 2.25.1
> 


Re: /sys/kernel/debug/kmemleak empty despite kmemleak reports

2020-07-09 Thread Paul Menzel

Dear Catalin,


On 09.07.20 at 19:57, Catalin Marinas wrote:

On Thu, Jul 09, 2020 at 04:37:10PM +0200, Paul Menzel wrote:

Despite Linux 5.8-rc4 reporting memory leaks on the IBM POWER 8 S822LC, the
file does not contain more information.


$ dmesg
[…] > [48662.953323] perf: interrupt took too long (2570 > 2500), lowering 
kernel.perf_event_max_sample_rate to 77750
[48854.810636] perf: interrupt took too long (3216 > 3212), lowering 
kernel.perf_event_max_sample_rate to 62000
[52300.044518] perf: interrupt took too long (4244 > 4020), lowering 
kernel.perf_event_max_sample_rate to 47000
[52751.373083] perf: interrupt took too long (5373 > 5305), lowering 
kernel.perf_event_max_sample_rate to 37000
[53354.000363] perf: interrupt took too long (6793 > 6716), lowering 
kernel.perf_event_max_sample_rate to 29250
[53850.215606] perf: interrupt took too long (8672 > 8491), lowering 
kernel.perf_event_max_sample_rate to 23000
[57542.266099] perf: interrupt took too long (10940 > 10840), lowering 
kernel.perf_event_max_sample_rate to 18250
[57559.645404] perf: interrupt took too long (13714 > 13675), lowering 
kernel.perf_event_max_sample_rate to 14500
[61608.697728] Can't find PMC that caused IRQ
[71774.463111] kmemleak: 12 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
[92372.044785] process '@/usr/bin/gnatmake-5' started with executable stack
[92849.380672] FS-Cache: Loaded
[92849.417269] FS-Cache: Netfs 'nfs' registered for caching
[92849.595974] NFS: Registering the id_resolver key type
[92849.596000] Key type id_resolver registered
[92849.596000] Key type id_legacy registered
[101808.079143] kmemleak: 1 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
[106904.323471] Can't find PMC that caused IRQ
[129416.391456] kmemleak: 1 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
[158171.604221] kmemleak: 34 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
$ sudo cat /sys/kernel/debug/kmemleak


When they are no longer present, they are most likely false positives.


How can this be? Shouldn’t the false positive also be logged in 
`/sys/kernel/debug/kmemleak`?



Was this triggered during boot? Or under some workload?


From the timestamps it looks like under some load.


Kind regards,

Paul


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 4:46 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> On Thu, Jul 09, 2020 at 01:56:19PM -0400, Mathieu Desnoyers wrote:
>> > Just to make sure I understand your recommendation. So rather than
>> > hard coding r17 as the temporary registers, we could explicitly
>> > declare the temporary register as a C variable, pass it as an
>> > input operand to the inline asm, and then refer to it by operand
>> > name in the macros using it. This way the compiler would be free
>> > to perform its own register allocation.
>> > 
>> > If that is what you have in mind, then yes, I think it makes a
>> > lot of sense.
>> 
>> Except that asm goto have this limitation with gcc: those cannot
>> have any output operand, only inputs, clobbers and target labels.
>> We cannot modify a temporary register received as input operand. So I don't
>> see how to get a temporary register allocated by the compiler considering
>> this limitation.
> 
> Heh, yet another reason not to obfuscate your inline asm: it didn't
> register this is asm goto.
> 
> A clobber is one way, yes (those *are* allowed in asm goto).  Another
> way is to not actually change that register: move the original value
> back into there at the end of the asm!  (That isn't always easy to do,
> it depends on your code).  So something like
> 
>   long start = ...;
>   long tmp = start;
>   asm("stuff that modifies %0; ...; mr %0,%1" : : "r"(tmp), "r"(start));
> 
> is just fine: %0 isn't actually modified at all, as far as GCC is
> concerned, and this isn't lying to it!

It appears to come at the cost of adding one extra instruction on the fast path
to restore the register to its original value. I'll leave it to Boqun, who authored
the original rseq-ppc code, to figure out what works best performance-wise
(when he finds time).

Thanks for the pointers!

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Segher Boessenkool
On Thu, Jul 09, 2020 at 01:56:19PM -0400, Mathieu Desnoyers wrote:
> > Just to make sure I understand your recommendation. So rather than
> > hard coding r17 as the temporary registers, we could explicitly
> > declare the temporary register as a C variable, pass it as an
> > input operand to the inline asm, and then refer to it by operand
> > name in the macros using it. This way the compiler would be free
> > to perform its own register allocation.
> > 
> > If that is what you have in mind, then yes, I think it makes a
> > lot of sense.
> 
> Except that asm goto have this limitation with gcc: those cannot
> have any output operand, only inputs, clobbers and target labels.
> We cannot modify a temporary register received as input operand. So I don't
> see how to get a temporary register allocated by the compiler considering
> this limitation.

Heh, yet another reason not to obfuscate your inline asm: it didn't
register this is asm goto.

A clobber is one way, yes (those *are* allowed in asm goto).  Another
way is to not actually change that register: move the original value
back into there at the end of the asm!  (That isn't always easy to do,
it depends on your code).  So something like

long start = ...;
long tmp = start;
asm("stuff that modifies %0; ...; mr %0,%1" : : "r"(tmp), "r"(start));

is just fine: %0 isn't actually modified at all, as far as GCC is
concerned, and this isn't lying to it!
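
For completeness, a minimal sketch of the other route mentioned above, naming
the scratch register only in the clobber list of the asm goto. This is
illustrative only, not librseq code: the function and label names are made up,
and it assumes a 64-bit PowerPC target and a GCC version where asm goto takes
no output operands.

static inline int word_matches(unsigned long *p, unsigned long expect)
{
	/* The scratch register lives only in the clobber list, so the
	   compiler keeps nothing live in r17 or cr0 across the statement. */
	asm goto ("ld %%r17, 0(%0)\n\t"
		  "cmpd %%r17, %1\n\t"
		  "bne %l[mismatch]\n\t"
		  : /* no output operands allowed in asm goto */
		  : "b" (p), "r" (expect)
		  : "r17", "cr0", "memory"
		  : mismatch);
	return 1;
mismatch:
	return 0;
}

The trade-off is the register footprint discussed elsewhere in this thread:
r17 stays reserved around every such statement, whereas the restore-at-the-end
trick above leaves the register allocator free to reuse it.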


Segher


Re: Failure to build librseq on ppc

2020-07-09 Thread Segher Boessenkool
On Thu, Jul 09, 2020 at 01:42:56PM -0400, Mathieu Desnoyers wrote:
> > That works fine then, for a testcase.  Using r17 is not a great idea for
> > performance (it increases the active register footprint, and causes more
> > registers to be saved in the prologue of the functions, esp. on older
> > compilers), and it is easier to just let the compiler choose a good
> > register to use.  But maybe you want to see r17 in the generated
> > testcases, as eyecatcher or something, dunno :-)
> 
> Just to make sure I understand your recommendation. So rather than
> hard coding r17 as the temporary registers, we could explicitly
> declare the temporary register as a C variable, pass it as an
> input operand to the inline asm, and then refer to it by operand
> name in the macros using it. This way the compiler would be free
> to perform its own register allocation.
> 
> If that is what you have in mind, then yes, I think it makes a
> lot of sense.

You write to it as well, so an inout register ("+r" or such).  And yes,
you use a local var for it (like "long tmp;").  And then you can refer
to it like anything else in your asm, like "%3" or like
"%[a_long_name]"; and the compiler sees it as any other register,
exactly.
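
To make that concrete, a minimal sketch (illustrative only, not librseq code;
the names are made up and a 64-bit PowerPC target is assumed) of a plain asm
statement whose scratch register is an ordinary local variable referred to by
operand name:

static inline void add_to_word(unsigned long *p, unsigned long inc)
{
	unsigned long tmp;	/* hypothetical scratch, register chosen by the compiler */

	asm volatile ("ld %[scratch], 0(%[ptr])\n\t"
		      "add %[scratch], %[scratch], %[val]\n\t"
		      "std %[scratch], 0(%[ptr])\n\t"
		      : [scratch] "=&r" (tmp)
		      : [ptr] "b" (p), [val] "r" (inc)
		      : "memory");
}

The "=&r" (early-clobber) constraint tells the compiler the scratch is written
before the inputs are consumed, so it will never share a register with %[ptr]
or %[val]. The asm goto statements are the exception, since they cannot take
output operands on the GCC versions in question.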


Segher


Re: [PATCH 11/20] Documentation: leds/ledtrig-transient: eliminate duplicated word

2020-07-09 Thread Jacek Anaszewski

On 7/7/20 8:04 PM, Randy Dunlap wrote:

Drop the doubled word "for".

Signed-off-by: Randy Dunlap 
Cc: Jonathan Corbet 
Cc: linux-...@vger.kernel.org
Cc: Jacek Anaszewski 
Cc: Pavel Machek 
Cc: Dan Murphy 
Cc: linux-l...@vger.kernel.org
---
  Documentation/leds/ledtrig-transient.rst |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-next-20200701.orig/Documentation/leds/ledtrig-transient.rst
+++ linux-next-20200701/Documentation/leds/ledtrig-transient.rst
@@ -157,7 +157,7 @@ repeat the following step as needed::
echo 1 > activate - start timer = duration to run once
echo none > trigger
  
-This trigger is intended to be used for for the following example use cases:

+This trigger is intended to be used for the following example use cases:
  
   - Control of vibrate (phones, tablets etc.) hardware by user space app.

   - Use of LED by user space app as activity indicator.



Acked-by: Jacek Anaszewski 

--
Best regards,
Jacek Anaszewski


Re: /sys/kernel/debug/kmemleak empty despite kmemleak reports

2020-07-09 Thread Catalin Marinas
On Thu, Jul 09, 2020 at 04:37:10PM +0200, Paul Menzel wrote:
> Despite Linux 5.8-rc4 reporting memory leaks on the IBM POWER 8 S822LC, the
> file does not contain more information.
> 
> > $ dmesg
> > […] > [48662.953323] perf: interrupt took too long (2570 > 2500),
> > lowering
> kernel.perf_event_max_sample_rate to 77750
> > [48854.810636] perf: interrupt took too long (3216 > 3212), lowering 
> > kernel.perf_event_max_sample_rate to 62000
> > [52300.044518] perf: interrupt took too long (4244 > 4020), lowering 
> > kernel.perf_event_max_sample_rate to 47000
> > [52751.373083] perf: interrupt took too long (5373 > 5305), lowering 
> > kernel.perf_event_max_sample_rate to 37000
> > [53354.000363] perf: interrupt took too long (6793 > 6716), lowering 
> > kernel.perf_event_max_sample_rate to 29250
> > [53850.215606] perf: interrupt took too long (8672 > 8491), lowering 
> > kernel.perf_event_max_sample_rate to 23000
> > [57542.266099] perf: interrupt took too long (10940 > 10840), lowering 
> > kernel.perf_event_max_sample_rate to 18250
> > [57559.645404] perf: interrupt took too long (13714 > 13675), lowering 
> > kernel.perf_event_max_sample_rate to 14500
> > [61608.697728] Can't find PMC that caused IRQ
> > [71774.463111] kmemleak: 12 new suspected memory leaks (see 
> > /sys/kernel/debug/kmemleak)
> > [92372.044785] process '@/usr/bin/gnatmake-5' started with executable stack
> > [92849.380672] FS-Cache: Loaded
> > [92849.417269] FS-Cache: Netfs 'nfs' registered for caching
> > [92849.595974] NFS: Registering the id_resolver key type
> > [92849.596000] Key type id_resolver registered
> > [92849.596000] Key type id_legacy registered
> > [101808.079143] kmemleak: 1 new suspected memory leaks (see 
> > /sys/kernel/debug/kmemleak)
> > [106904.323471] Can't find PMC that caused IRQ
> > [129416.391456] kmemleak: 1 new suspected memory leaks (see 
> > /sys/kernel/debug/kmemleak)
> > [158171.604221] kmemleak: 34 new suspected memory leaks (see 
> > /sys/kernel/debug/kmemleak)
> > $ sudo cat /sys/kernel/debug/kmemleak

When they are no longer present, they are most likely false positives.
Was this triggered during boot? Or under some workload?

-- 
Catalin


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 1:42 PM, Mathieu Desnoyers 
mathieu.desnoy...@efficios.com wrote:

> - On Jul 9, 2020, at 1:37 PM, Segher Boessenkool 
> seg...@kernel.crashing.org
> wrote:
> 
>> On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote:
>>> > What protects r17 *after* this asm statement?
>>> 
>>> As discussed in the other leg of the thread (with the code example),
>>> r17 is in the clobber list of all asm statements using this macro, and
>>> is used as a temporary register within each inline asm.
>> 
>> That works fine then, for a testcase.  Using r17 is not a great idea for
>> performance (it increases the active register footprint, and causes more
>> registers to be saved in the prologue of the functions, esp. on older
>> compilers), and it is easier to just let the compiler choose a good
>> register to use.  But maybe you want to see r17 in the generated
>> testcases, as eyecatcher or something, dunno :-)
> 
> Just to make sure I understand your recommendation. So rather than
> hard coding r17 as the temporary registers, we could explicitly
> declare the temporary register as a C variable, pass it as an
> input operand to the inline asm, and then refer to it by operand
> name in the macros using it. This way the compiler would be free
> to perform its own register allocation.
> 
> If that is what you have in mind, then yes, I think it makes a
> lot of sense.

Except that asm goto have this limitation with gcc: those cannot
have any output operand, only inputs, clobbers and target labels.
We cannot modify a temporary register received as input operand. So I don't
see how to get a temporary register allocated by the compiler considering
this limitation.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


[Bug 208197] OF: /pci@f2000000/mac-io@17/gpio@50/...: could not find phandle

2020-07-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=208197

--- Comment #9 from Erhard F. (erhar...@mailbox.org) ---
(In reply to Michael Ellerman from comment #7)
> I couldn't really make sense of your bisect log, it doesn't have any
> good/bad commits in it.
> 
> Can you attach the output of "git bisect log".
Yeah, sorry, my fault... I thought that with e.g. "git bisect bad | tee -a
~/bisect.log" the bisect.log generated via tee would have the same output as
"git bisect log", but I was wrong. 

Please find the correct one attached now. I had to restart the bisect as I
had already reset the original one.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

[Bug 208197] OF: /pci@f2000000/mac-io@17/gpio@50/...: could not find phandle

2020-07-09 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=208197

Erhard F. (erhar...@mailbox.org) changed:

   What|Removed |Added

 Attachment #290097|0   |1
is obsolete||

--- Comment #8 from Erhard F. (erhar...@mailbox.org) ---
Created attachment 290191
  --> https://bugzilla.kernel.org/attachment.cgi?id=290191&action=edit
bisect.log

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 9, 2020, at 1:37 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote:
>> > What protects r17 *after* this asm statement?
>> 
>> As discussed in the other leg of the thread (with the code example),
>> r17 is in the clobber list of all asm statements using this macro, and
>> is used as a temporary register within each inline asm.
> 
> That works fine then, for a testcase.  Using r17 is not a great idea for
> performance (it increases the active register footprint, and causes more
> registers to be saved in the prologue of the functions, esp. on older
> compilers), and it is easier to just let the compiler choose a good
> register to use.  But maybe you want to see r17 in the generated
> testcases, as eyecatcher or something, dunno :-)

Just to make sure I understand your recommendation. So rather than
hard coding r17 as the temporary registers, we could explicitly
declare the temporary register as a C variable, pass it as an
input operand to the inline asm, and then refer to it by operand
name in the macros using it. This way the compiler would be free
to perform its own register allocation.

If that is what you have in mind, then yes, I think it makes a
lot of sense.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Segher Boessenkool
On Thu, Jul 09, 2020 at 09:43:47AM -0400, Mathieu Desnoyers wrote:
> > What protects r17 *after* this asm statement?
> 
> As discussed in the other leg of the thread (with the code example),
> r17 is in the clobber list of all asm statements using this macro, and
> is used as a temporary register within each inline asm.

That works fine then, for a testcase.  Using r17 is not a great idea for
performance (it increases the active register footprint, and causes more
registers to be saved in the prologue of the functions, esp. on older
compilers), and it is easier to just let the compiler choose a good
register to use.  But maybe you want to see r17 in the generated
testcases, as eyecatcher or something, dunno :-)


Segher


Re: Failure to build librseq on ppc

2020-07-09 Thread Segher Boessenkool
On Thu, Jul 09, 2020 at 09:33:18AM -0400, Mathieu Desnoyers wrote:
> > The way this all uses r17 will likely not work reliably.
> 
> r17 is only used as a temporary register within the inline assembler, and it 
> is
> in the clobber list. In which scenario would it not work reliably ?

This isn't clear at all, that is the problem.

> > The way multiple asm statements are used seems to have missing
> > dependencies between the statements.
> 
> I'm not sure I follow here. Note that we are injecting the CPP macros into
> a single inline asm statement as strings.

Yeah...  more trickiness.

> > And done macro-mess this, you want to be able to debug it, and you need
> > other people to be able to read it!
> 
> I understand that looking at macros can be cumbersome from the perspective
> of a reviewer only interested in a single architecture,

No, from the perspective of *any* reviewer.

> However, from my perspective, as a maintainer who must maintain similar code
> for x86 32/64, powerpc 32/64, arm, aarch64, s390, s390x, mips 32/64, and 
> likely
> other architectures in the future, the macros abstracting 32-bit and 64-bit
> allow to eliminate code duplication for each architecture with 32-bit and 
> 64-bit
> variants, which is better for maintainability.

IMNSHO it is MUCH better to just have simple separate implementations
for each.  They differ in *all* details.

Or have static inline functions, with proper dependencies, instead of
nasty text macros.

But it's your code, do what you want :-)


Segher


[PATCH v2] powerpc/pseries: Avoid using addr_to_pfn in realmode

2020-07-09 Thread Ganesh Goudar
When a UE or memory error exception is encountered, the MCE handler
tries to find the pfn using addr_to_pfn(), which takes an effective
address as an argument; the pfn is later used to poison the page where
the memory error occurred. Recent rework in this area made addr_to_pfn()
run in real mode, which can be fatal as it may try to access
memory outside the RMO region.

To fix this, call addr_to_pfn() after switching to virtual mode.

Signed-off-by: Ganesh Goudar 
---
V2: Leave bare metal code and save_mce_event as is.
---
 arch/powerpc/platforms/pseries/ras.c | 20 +++-
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index f3736fcd98fc..def875815e92 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -610,16 +610,8 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
if (mce_log->sub_err_type & UE_EFFECTIVE_ADDR_PROVIDED)
eaddr = be64_to_cpu(mce_log->effective_address);
 
-   if (mce_log->sub_err_type & UE_LOGICAL_ADDR_PROVIDED) {
+   if (mce_log->sub_err_type & UE_LOGICAL_ADDR_PROVIDED)
paddr = be64_to_cpu(mce_log->logical_address);
-   } else if (mce_log->sub_err_type & UE_EFFECTIVE_ADDR_PROVIDED) {
-   unsigned long pfn;
-
-   pfn = addr_to_pfn(regs, eaddr);
-   if (pfn != ULONG_MAX)
-   paddr = pfn << PAGE_SHIFT;
-   }
-
break;
case MC_ERROR_TYPE_SLB:
mce_err.error_type = MCE_ERROR_TYPE_SLB;
@@ -725,6 +717,16 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
 *   SLB multihit is done by now.
 */
mtmsr(mfmsr() | MSR_IR | MSR_DR);
+
+   /* Use addr_to_pfn after switching to virtual mode */
+   if (!paddr && error_type == MC_ERROR_TYPE_UE &&
+   mce_log->sub_err_type & UE_EFFECTIVE_ADDR_PROVIDED) {
+   unsigned long pfn;
+
+   pfn = addr_to_pfn(regs, eaddr);
+   if (pfn != ULONG_MAX)
+   paddr = pfn << PAGE_SHIFT;
+   }
save_mce_event(regs, disposition == RTAS_DISP_FULLY_RECOVERED,
&mce_err, regs->nip, eaddr, paddr);
 
-- 
2.17.2



Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-09 Thread Waiman Long

On 7/9/20 6:53 AM, Michael Ellerman wrote:

Nicholas Piggin  writes:


Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/paravirt.h   | 28 
  arch/powerpc/include/asm/qspinlock.h  | 66 +++
  arch/powerpc/include/asm/qspinlock_paravirt.h |  7 ++
  arch/powerpc/platforms/pseries/Kconfig|  5 ++
  arch/powerpc/platforms/pseries/setup.c|  6 +-
  include/asm-generic/qspinlock.h   |  2 +

Another ack?


I am OK with adding the #ifdef around queued_spin_lock().

Acked-by: Waiman Long 


diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 7a8546660a63..f2d51f929cf5 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
  {
___bad_yield_to_preempted(); /* This would be a bug */
  }
+
+extern void ___bad_yield_to_any(void);
+static inline void yield_to_any(void)
+{
+   ___bad_yield_to_any(); /* This would be a bug */
+}

Why do we do that rather than just not defining yield_to_any() at all
and letting the build fail on that?

There's a condition somewhere that we know will be false at compile time
and drop the call before linking?


diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h 
b/arch/powerpc/include/asm/qspinlock_paravirt.h
new file mode 100644
index ..750d1b5e0202
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __ASM_QSPINLOCK_PARAVIRT_H
+#define __ASM_QSPINLOCK_PARAVIRT_H

_ASM_POWERPC_QSPINLOCK_PARAVIRT_H please.


+
+EXPORT_SYMBOL(__pv_queued_spin_unlock);

Why's that in a header? Should that (eventually) go with the generic 
implementation?
The PV qspinlock implementation is not that generic at the moment. Even 
though native qspinlock is used by a number of archs, PV qspinlock is 
only currently used in x86. This is certainly an area that needs 
improvement.

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 24c18362e5ea..756e727b383f 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -25,9 +25,14 @@ config PPC_PSERIES
select SWIOTLB
default y
  
+config PARAVIRT_SPINLOCKS

+   bool
+   default n

default n is the default.


diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 2db8469e475f..747a203d9453 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -771,8 +771,12 @@ static void __init pSeries_setup_arch(void)
if (firmware_has_feature(FW_FEATURE_LPAR)) {
vpa_init(boot_cpuid);
  
-		if (lppaca_shared_proc(get_lppaca()))

+   if (lppaca_shared_proc(get_lppaca())) {
static_branch_enable(&shared_processor);
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+   pv_spinlocks_init();
+#endif
+   }

We could avoid the ifdef with this I think?

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 434615f1d761..6ec72282888d 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -10,5 +10,9 @@
  #include 
  #endif

+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+static inline void pv_spinlocks_init(void) { }
+#endif
+
  #endif /* __KERNEL__ */
  #endif /* __ASM_SPINLOCK_H */


cheers

We don't really need to do a pv_spinlocks_init() if pv_kick() isn't 
supported.


Cheers,
Longman



Re: [PATCH 10/20] Documentation: kbuild/kconfig-language: eliminate duplicated word

2020-07-09 Thread Masahiro Yamada
On Wed, Jul 8, 2020 at 3:06 AM Randy Dunlap  wrote:
>
> Drop the doubled word "the".
>
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Masahiro Yamada 



I guess this series will go in via the doc sub-system.

If so, please feel free to add:

Acked-by: Masahiro Yamada 






> Cc: Michal Marek 
> Cc: linux-kbu...@vger.kernel.org
> ---
>  Documentation/kbuild/kconfig-language.rst |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- linux-next-20200701.orig/Documentation/kbuild/kconfig-language.rst
> +++ linux-next-20200701/Documentation/kbuild/kconfig-language.rst
> @@ -681,7 +681,7 @@ translate Kconfig logic into boolean for
>  find dead code / features (always inactive), 114 dead features were found in
>  Linux using this methodology [1]_ (Section 8: Threats to validity).
>
> -Confirming this could prove useful as Kconfig stands as one of the the 
> leading
> +Confirming this could prove useful as Kconfig stands as one of the leading
>  industrial variability modeling languages [1]_ [2]_. Its study would help
>  evaluate practical uses of such languages, their use was only theoretical
>  and real world requirements were not well understood. As it stands though



--
Best Regards
Masahiro Yamada


/sys/kernel/debug/kmemleak empty despite kmemleak reports

2020-07-09 Thread Paul Menzel

Dear Linux folks,


Despite Linux 5.8-rc4 reporting memory leaks on the IBM POWER 8 S822LC, 
the file does not contain more information.



$ dmesg
[…]
[48662.953323] perf: interrupt took too long (2570 > 2500), lowering 
kernel.perf_event_max_sample_rate to 77750
[48854.810636] perf: interrupt took too long (3216 > 3212), lowering 
kernel.perf_event_max_sample_rate to 62000
[52300.044518] perf: interrupt took too long (4244 > 4020), lowering 
kernel.perf_event_max_sample_rate to 47000
[52751.373083] perf: interrupt took too long (5373 > 5305), lowering 
kernel.perf_event_max_sample_rate to 37000
[53354.000363] perf: interrupt took too long (6793 > 6716), lowering 
kernel.perf_event_max_sample_rate to 29250
[53850.215606] perf: interrupt took too long (8672 > 8491), lowering 
kernel.perf_event_max_sample_rate to 23000
[57542.266099] perf: interrupt took too long (10940 > 10840), lowering 
kernel.perf_event_max_sample_rate to 18250
[57559.645404] perf: interrupt took too long (13714 > 13675), lowering 
kernel.perf_event_max_sample_rate to 14500
[61608.697728] Can't find PMC that caused IRQ
[71774.463111] kmemleak: 12 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
[92372.044785] process '@/usr/bin/gnatmake-5' started with executable stack
[92849.380672] FS-Cache: Loaded
[92849.417269] FS-Cache: Netfs 'nfs' registered for caching
[92849.595974] NFS: Registering the id_resolver key type
[92849.596000] Key type id_resolver registered
[92849.596000] Key type id_legacy registered
[101808.079143] kmemleak: 1 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
[106904.323471] Can't find PMC that caused IRQ
[129416.391456] kmemleak: 1 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
[158171.604221] kmemleak: 34 new suspected memory leaks (see 
/sys/kernel/debug/kmemleak)
$ sudo cat /sys/kernel/debug/kmemleak
$



Kind regards,

Paul


Re: [PATCH v2 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Christophe Leroy




On 09/07/2020 at 15:51, Santosh Sivaraj wrote:

Subscribe to the MCE notification and add the physical address which
generated a memory error to nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 


Reviewed-by: Christophe Leroy 


---
  arch/powerpc/platforms/pseries/papr_scm.c | 96 ++-
  1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..90729029ca010 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
+#include 
  
  #define BIND_ANY_ADDR (~0ul)
  
@@ -80,6 +82,7 @@ struct papr_scm_priv {

struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
  
  	/* Protect dimm health data from concurrent read/writes */

struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
  };
  
+LIST_HEAD(papr_nd_regions);

+DEFINE_MUTEX(papr_ndr_lock);
+
  static int drc_pmem_bind(struct papr_scm_priv *p)
  {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online node 
%d",
 target_nid, online_nid);
  
+	mutex_lock(&papr_ndr_lock);

+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
  
  err:	nvdimm_bus_unregister(p->bus);

@@ -766,6 +776,68 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
  }
  
+static void papr_scm_add_badblock(struct nd_region *region,

+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   if (phys_addr >= p->res.start && phys_addr <= p->res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
  static int papr_scm_probe(struct platform_device *pdev)
  {
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +938,10 @@ static int papr_scm_remove(struct platform_device *pdev)
  {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
  
+	mutex_lock(&papr_ndr_lock);

+   list_del(&p->region_list);
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +964,25 @@ static struct platform_driver papr_scm_driver = {
},
  };
  
-module_platform_driver(papr_scm_driver);

+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
  MODULE_DEVICE_TABLE(of, papr_scm_match);
  MODULE_LICENSE("GPL");
 

Re: [PATCH v2 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Christophe Leroy




On 09/07/2020 at 15:51, Santosh Sivaraj wrote:

Introduce notification chain which lets us know about uncorrected memory
errors(UE). This would help prospective users in pmem or nvdimm subsystem
to track bad blocks for better handling of persistent memory allocations.

Signed-off-by: Santosh Sivaraj 
Signed-off-by: Ganesh Goudar 


Reviewed-by: Christophe Leroy 


---
  arch/powerpc/include/asm/mce.h |  2 ++
  arch/powerpc/kernel/mce.c  | 15 +++
  2 files changed, 17 insertions(+)

v2: Address comments from Christophe.

RESEND: Sending the two patches together so the dependencies are clear. The
earlier patch reviews are here [1]; rebase the patches on top on 5.8-rc4

[1]: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..7bdd0cd4f2de0 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
  unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
  extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
  #ifdef CONFIG_PPC_BOOK3S_64
  void flush_and_reload_slb(void);
  #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
  
  DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
  
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);

+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
  static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
  {
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
  #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but



[PATCH v2 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Santosh Sivaraj
Subscribe to the MCE notification and add the physical address which
generated a memory error to nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 96 ++-
 1 file changed, 95 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..90729029ca010 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -80,6 +82,7 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 
/* Protect dimm health data from concurrent read/writes */
struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -766,6 +776,68 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static void papr_scm_add_badblock(struct nd_region *region,
+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   if (phys_addr >= p->res.start && phys_addr <= p->res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +938,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&p->region_list);
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +964,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+   return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.26.2



[PATCH v2 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Santosh Sivaraj
Introduce a notification chain which lets us know about uncorrected memory
errors (UE). This would help prospective users in the pmem or nvdimm subsystem
to track bad blocks for better handling of persistent memory allocations.

Signed-off-by: Santosh Sivaraj 
Signed-off-by: Ganesh Goudar 
---
 arch/powerpc/include/asm/mce.h |  2 ++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 17 insertions(+)

v2: Address comments from Christophe.

RESEND: Sending the two patches together so the dependencies are clear. The
earlier patch reviews are here [1]; rebase the patches on top on 5.8-rc4

[1]: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..7bdd0cd4f2de0 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
 extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+int mce_register_notifier(struct notifier_block *nb);
+int mce_unregister_notifier(struct notifier_block *nb);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.26.2



Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 8:18 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> On Wed, Jul 08, 2020 at 08:01:23PM -0400, Mathieu Desnoyers wrote:
>> > > #define RSEQ_ASM_OP_CMPEQ(var, expect, label)
>> > > \
>> > > LOAD_WORD "%%r17, %[" __rseq_str(var) "]\n\t"
>> > >\
>> > 
>> > The way this hardcodes r17 *will* break, btw.  The compiler will not
>> > likely want to use r17 as long as your code (after inlining etc.!) stays
>> > small, but there is Murphy's law.
>> 
>> r17 is in the clobber list, so it should be ok.
> 
> What protects r17 *after* this asm statement?

As discussed in the other leg of the thread (with the code example),
r17 is in the clobber list of all asm statements using this macro, and
is used as a temporary register within each inline asm.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: Failure to build librseq on ppc

2020-07-09 Thread Mathieu Desnoyers
- On Jul 8, 2020, at 8:10 PM, Segher Boessenkool seg...@kernel.crashing.org 
wrote:

> Hi!
> 
> On Wed, Jul 08, 2020 at 10:00:01AM -0400, Mathieu Desnoyers wrote:
[...]
> 
>> -#define STORE_WORD "std "
>> -#define LOAD_WORD  "ld "
>> -#define LOADX_WORD "ldx "
>> +#define STORE_WORD(arg)"std%U[" __rseq_str(arg) "]%X[" 
>> __rseq_str(arg)
>> "] "/* To memory ("m" constraint) */
>> +#define LOAD_WORD(arg) "lwd%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) "] "
>> /* From memory ("m" constraint) */
> 
> That cannot work (you typoed "ld" here).

Indeed, I noticed it before pushing to master (lwd -> ld).

> 
> Some more advice about this code, pretty generic stuff:

Let's take an example to support the discussion here. I'm taking it from
master branch (after a cleanup changing e.g. LOAD_WORD into RSEQ_LOAD_LONG).
So for powerpc32 we have (code edited to remove testing instrumentation):

#define __rseq_str_1(x) #x
#define __rseq_str(x)   __rseq_str_1(x)

#define RSEQ_STORE_LONG(arg)"stw%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) 
"] "/* To memory ("m" constraint) */
#define RSEQ_STORE_INT(arg) RSEQ_STORE_LONG(arg)
/* To memory ("m" constraint) */
#define RSEQ_LOAD_LONG(arg) "lwz%U[" __rseq_str(arg) "]%X[" __rseq_str(arg) 
"] "/* From memory ("m" constraint) */
#define RSEQ_LOAD_INT(arg)  RSEQ_LOAD_LONG(arg) 
/* From memory ("m" constraint) */
#define RSEQ_LOADX_LONG "lwzx " 
/* From base register ("b" constraint) */
#define RSEQ_CMP_LONG   "cmpw "

#define __RSEQ_ASM_DEFINE_TABLE(label, version, flags,  
\
start_ip, post_commit_offset, abort_ip) 
\
".pushsection __rseq_cs, \"aw\"\n\t"
\
".balign 32\n\t"
\
__rseq_str(label) ":\n\t"   
\
".long " __rseq_str(version) ", " __rseq_str(flags) "\n\t"  
\
/* 32-bit only supported on BE */   
\
".long 0x0, " __rseq_str(start_ip) ", 0x0, " 
__rseq_str(post_commit_offset) ", 0x0, " __rseq_str(abort_ip) "\n\t" \
".popsection\n\t"   \
".pushsection __rseq_cs_ptr_array, \"aw\"\n\t"  \
".long 0x0, " __rseq_str(label) "b\n\t" \
".popsection\n\t"

/*
 * Exit points of a rseq critical section consist of all instructions outside
 * of the critical section where a critical section can either branch to or
 * reach through the normal course of its execution. The abort IP and the
 * post-commit IP are already part of the __rseq_cs section and should not be
 * explicitly defined as additional exit points. Knowing all exit points is
 * useful to assist debuggers stepping over the critical section.
 */
#define RSEQ_ASM_DEFINE_EXIT_POINT(start_ip, exit_ip)   
\
".pushsection __rseq_exit_point_array, \"aw\"\n\t"  
\
/* 32-bit only supported on BE */   
\
".long 0x0, " __rseq_str(start_ip) ", 0x0, " 
__rseq_str(exit_ip) "\n\t" \
".popsection\n\t"

#define RSEQ_ASM_STORE_RSEQ_CS(label, cs_label, rseq_cs)
\
"lis %%r17, (" __rseq_str(cs_label) ")@ha\n\t"  
\
"addi %%r17, %%r17, (" __rseq_str(cs_label) ")@l\n\t"   
\
RSEQ_STORE_INT(rseq_cs) "%%r17, %[" __rseq_str(rseq_cs) "]\n\t" 
\
__rseq_str(label) ":\n\t"

#define RSEQ_ASM_DEFINE_TABLE(label, start_ip, post_commit_ip, abort_ip)
\
__RSEQ_ASM_DEFINE_TABLE(label, 0x0, 0x0, start_ip,  
\
(post_commit_ip - start_ip), abort_ip)

#define RSEQ_ASM_CMP_CPU_ID(cpu_id, current_cpu_id, label)  
\
RSEQ_LOAD_INT(current_cpu_id) "%%r17, %[" 
__rseq_str(current_cpu_id) "]\n\t" \
"cmpw cr7, %[" __rseq_str(cpu_id) "], %%r17\n\t"
\
"bne- cr7, " __rseq_str(label) "\n\t"

#define RSEQ_ASM_DEFINE_ABORT(label, abort_label)   
\
".pushsection __rseq_failure, \"ax\"\n\t"   
\
".long " __rseq_str(RSEQ_SIG) "\n\t"
\
__rseq_str(label) ":\n\t"   
\
"b %l[" __rseq_str(abort_label) "]\n\t" 
\
".popsection\n\t"

#define RSEQ_ASM_OP_CMPEQ(var, expect, label)   
\
RSEQ_LOAD_LONG(var) "%%r17, %[" __rseq_str(var) "]\

[PATCH v3 4/4] powerpc/mm/radix: Create separate mappings for hot-plugged memory

2020-07-09 Thread Aneesh Kumar K.V
To enable memory unplug without splitting kernel page table
mapping, we force the max mapping size to the LMB size. LMB
size is the unit in which hypervisor will do memory add/remove
operation.

Pseries systems support a max LMB size of 256MB. Hence on pseries,
we now end up mapping memory with 2M page size instead of 1G. To improve
on that, we want the hypervisor to hint the kernel about the hotplug
memory range. That hint was added as part of

commit b6eca183e23e ("powerpc/kernel: Enables memory
hot-remove after reboot on pseries guests")

But PowerVM doesn't provide that hint yet. Once we get PowerVM
updated, we can then force the 2M mapping only for hot-pluggable
memory regions using memblock_is_hotpluggable(). Till then,
let's depend on the LMB size for finding the mapping page size
for the linear range.
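
To illustrate that future direction (a sketch only, not part of this patch;
the exact hook point and the 1G fallback are assumptions):

	/*
	 * Hypothetical follow-up: clamp the mapping size only for
	 * hot-pluggable memblock regions, keep 1G mappings elsewhere.
	 */
	for_each_memblock(memory, reg) {
		unsigned long max_size = PUD_SIZE;	/* 1G mappings by default */

		if (memblock_is_hotpluggable(reg))
			max_size = radix_mem_block_size;	/* LMB size, e.g. 256MB */

		WARN_ON(create_physical_mapping(reg->base,
						reg->base + reg->size,
						max_size, -1, PAGE_KERNEL));
	}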

With this change, KVM guests will also map the linear range with a
2M page size.

The actual TLB benefit of mapping guest page table entries with a
hugepage size can only materialize if the partition-scoped
entries also use the same or a larger page size. A guest whose
memory is backed by 1G hugetlbfs pages can therefore see a performance
impact from the above change.

Signed-off-by: Bharata B Rao 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  5 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c | 81 
 arch/powerpc/platforms/powernv/setup.c   | 10 ++-
 3 files changed, 84 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 5393a535240c..15aae924f41c 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -82,6 +82,11 @@ extern unsigned int mmu_pid_bits;
 /* Base PID to allocate from */
 extern unsigned int mmu_base_pid;
 
+/*
+ * memory block size used with radix translation.
+ */
+extern unsigned int __ro_after_init radix_mem_block_size;
+
 #define PRTB_SIZE_SHIFT(mmu_pid_bits + 4)
 #define PRTB_ENTRIES   (1ul << mmu_pid_bits)
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index d5a01b9aadc9..bba45fc0b7b2 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -33,6 +34,7 @@
 
 unsigned int mmu_pid_bits;
 unsigned int mmu_base_pid;
+unsigned int radix_mem_block_size __ro_after_init;
 
 static __ref void *early_alloc_pgtable(unsigned long size, int nid,
unsigned long region_start, unsigned long region_end)
@@ -265,6 +267,7 @@ static unsigned long next_boundary(unsigned long addr, 
unsigned long end)
 
 static int __meminit create_physical_mapping(unsigned long start,
 unsigned long end,
+unsigned long max_mapping_size,
 int nid, pgprot_t _prot)
 {
unsigned long vaddr, addr, mapping_size = 0;
@@ -278,6 +281,8 @@ static int __meminit create_physical_mapping(unsigned long 
start,
int rc;
 
gap = next_boundary(addr, end) - addr;
+   if (gap > max_mapping_size)
+   gap = max_mapping_size;
previous_size = mapping_size;
prev_exec = exec;
 
@@ -328,8 +333,9 @@ static void __init radix_init_pgtable(void)
 
/* We don't support slb for radix */
mmu_slb_size = 0;
+
/*
-* Create the linear mapping, using standard page size for now
+* Create the linear mapping
 */
for_each_memblock(memory, reg) {
/*
@@ -345,6 +351,7 @@ static void __init radix_init_pgtable(void)
 
WARN_ON(create_physical_mapping(reg->base,
reg->base + reg->size,
+   radix_mem_block_size,
-1, PAGE_KERNEL));
}
 
@@ -485,6 +492,47 @@ static int __init radix_dt_scan_page_sizes(unsigned long 
node,
return 1;
 }
 
+static int __init probe_memory_block_size(unsigned long node, const char 
*uname, int
+ depth, void *data)
+{
+   unsigned long *mem_block_size = (unsigned long *)data;
+   const __be64 *prop;
+   int len;
+
+   if (depth != 1)
+   return 0;
+
+   if (strcmp(uname, "ibm,dynamic-reconfiguration-memory"))
+   return 0;
+
+   prop = of_get_flat_dt_prop(node, "ibm,lmb-size", &len);
+   if (!prop || len < sizeof(__be64))
+   /*
+* Nothing in the device tree
+*/
+   *mem_block_size = MIN_MEMORY_BLOCK_SIZE;
+   else
+   *mem_block_size = be64_to_cpup(prop);
+   return 1;
+}
+
+static unsigned long radix_memory_block_size(void)
+{
+   unsigne

[PATCH v3 3/4] powerpc/mm/radix: Remove split_kernel_mapping()

2020-07-09 Thread Aneesh Kumar K.V
From: Bharata B Rao 

We split the page table mapping on memory unplug if the
linear range was mapped with a huge page mapping (for example, 1G).
The page table splitting code has a few issues:

1. Recursive locking

Memory unplug path takes cpu_hotplug_lock and calls stop_machine()
for splitting the mappings. However stop_machine() takes
cpu_hotplug_lock again causing deadlock.

2. BUG: sleeping function called from in_atomic() context
-
Memory unplug path (remove_pagetable) takes init_mm.page_table_lock
spinlock and later calls stop_machine() which does wait_for_completion()

3. Bad unlock unbalance
---
Memory unplug path takes the init_mm.page_table_lock spinlock and calls
stop_machine(). The stop_machine thread function runs in a different
thread context (migration thread) which tries to release and reacquire
the ptl. Releasing the ptl from a different thread than the one which
acquired it causes bad unlock unbalance.

These problems can be avoided if we avoid mapping hot-plugged memory
with 1G mappings, thereby removing the need to split them during
unplug. The kernel always makes sure the minimum unplug request is
SUBSECTION_SIZE for device memory and SECTION_SIZE for regular memory.

In preparation for such a change, remove page table splitting support.

This essentially is a revert of
commit 4dd5f8a99e791 ("powerpc/mm/radix: Split linear mapping on hot-unplug")

Signed-off-by: Bharata B Rao 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 95 +---
 1 file changed, 19 insertions(+), 76 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 46ad2da3087a..d5a01b9aadc9 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -15,7 +15,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include 
 #include 
@@ -722,32 +721,6 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
p4d_clear(p4d);
 }
 
-struct change_mapping_params {
-   pte_t *pte;
-   unsigned long start;
-   unsigned long end;
-   unsigned long aligned_start;
-   unsigned long aligned_end;
-};
-
-static int __meminit stop_machine_change_mapping(void *data)
-{
-   struct change_mapping_params *params =
-   (struct change_mapping_params *)data;
-
-   if (!data)
-   return -1;
-
-   spin_unlock(&init_mm.page_table_lock);
-   pte_clear(&init_mm, params->aligned_start, params->pte);
-   create_physical_mapping(__pa(params->aligned_start),
-   __pa(params->start), -1, PAGE_KERNEL);
-   create_physical_mapping(__pa(params->end), __pa(params->aligned_end),
-   -1, PAGE_KERNEL);
-   spin_lock(&init_mm.page_table_lock);
-   return 0;
-}
-
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
 unsigned long end)
 {
@@ -776,52 +749,6 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
}
 }
 
-/*
- * clear the pte and potentially split the mapping helper
- */
-static void __meminit split_kernel_mapping(unsigned long addr, unsigned long 
end,
-   unsigned long size, pte_t *pte)
-{
-   unsigned long mask = ~(size - 1);
-   unsigned long aligned_start = addr & mask;
-   unsigned long aligned_end = addr + size;
-   struct change_mapping_params params;
-   bool split_region = false;
-
-   if ((end - addr) < size) {
-   /*
-* We're going to clear the PTE, but not flushed
-* the mapping, time to remap and flush. The
-* effects if visible outside the processor or
-* if we are running in code close to the
-* mapping we cleared, we are in trouble.
-*/
-   if (overlaps_kernel_text(aligned_start, addr) ||
-   overlaps_kernel_text(end, aligned_end)) {
-   /*
-* Hack, just return, don't pte_clear
-*/
-   WARN_ONCE(1, "Linear mapping %lx->%lx overlaps kernel "
- "text, not splitting\n", addr, end);
-   return;
-   }
-   split_region = true;
-   }
-
-   if (split_region) {
-   params.pte = pte;
-   params.start = addr;
-   params.end = end;
-   params.aligned_start = addr & ~(size - 1);
-   params.aligned_end = min_t(unsigned long, aligned_end,
-   (unsigned long)__va(memblock_end_of_DRAM()));
-   stop_machine(stop_machine_change_mapping, ¶ms, NULL);
-   return;
-   }
-
-   pte_clear(&init_mm, addr, pte);
-}
-
 static void remove_pmd_table(pmd_t 

[PATCH v3 2/4] powerpc/mm/radix: Free PUD table when freeing pagetable

2020-07-09 Thread Aneesh Kumar K.V
From: Bharata B Rao 

remove_pagetable() isn't freeing the PUD table. This causes a memory
leak during memory unplug. Fix this.

Fixes: 4b5d62ca17a1 ("powerpc/mm: add radix__remove_section_mapping()")
Signed-off-by: Bharata B Rao 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 85806a6bed4d..46ad2da3087a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -707,6 +707,21 @@ static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
pud_clear(pud);
 }
 
+static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
+{
+   pud_t *pud;
+   int i;
+
+   for (i = 0; i < PTRS_PER_PUD; i++) {
+   pud = pud_start + i;
+   if (!pud_none(*pud))
+   return;
+   }
+
+   pud_free(&init_mm, pud_start);
+   p4d_clear(p4d);
+}
+
 struct change_mapping_params {
pte_t *pte;
unsigned long start;
@@ -881,6 +896,7 @@ static void __meminit remove_pagetable(unsigned long start, 
unsigned long end)
 
pud_base = (pud_t *)p4d_page_vaddr(*p4d);
remove_pud_table(pud_base, addr, next);
+   free_pud_table(pud_base, p4d);
}
 
spin_unlock(&init_mm.page_table_lock);
-- 
2.26.2



[PATCH v3 1/4] powerpc/mm/radix: Fix PTE/PMD fragment count for early page table mappings

2020-07-09 Thread Aneesh Kumar K.V
We can hit the following BUG_ON during memory unplug:

kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:342!
Oops: Exception in kernel mode, sig: 5 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
NIP [c0093308] pmd_fragment_free+0x48/0xc0
LR [c147bfec] remove_pagetable+0x578/0x60c
Call Trace:
0xc0805000 (unreliable)
remove_pagetable+0x384/0x60c
radix__remove_section_mapping+0x18/0x2c
remove_section_mapping+0x1c/0x3c
arch_remove_memory+0x11c/0x180
try_remove_memory+0x120/0x1b0
__remove_memory+0x20/0x40
dlpar_remove_lmb+0xc0/0x114
dlpar_memory+0x8b0/0xb20
handle_dlpar_errorlog+0xc0/0x190
pseries_hp_work_fn+0x2c/0x60
process_one_work+0x30c/0x810
worker_thread+0x98/0x540
kthread+0x1c4/0x1d0
ret_from_kernel_thread+0x5c/0x74

This occurs when unplug is attempted for memory which has
been mapped using memblock pages as part of the early kernel page
table setup. We wouldn't have initialized the PMD or PTE fragment
count for those PMD or PTE pages.

This can be fixed by allocating memory in PAGE_SIZE granularity
during early page table allocation. This makes sure a specific
page is not shared with another memblock allocation and that we can
free it correctly when removing page-table pages.

Since we now do PAGE_SIZE allocations for both the PUD table and the
PMD table (note that the PTE table allocation is already of PAGE_SIZE),
we end up allocating more memory for the same amount of system RAM.
Here is a comparison of how much more we need for a 64T and a 2G
system after this patch:

1. 64T system
-
64T RAM would need 64G for vmemmap with struct page size being 64B.

128 PUD tables for 64T memory (1G mappings)
1 PUD table and 64 PMD tables for 64G vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (128+1+64)*4K=772K
With PAGE_SIZE(64K) table allocations, (128+1+64)*64K=12352K

2. 2G system

2G RAM would need 2M for vmemmap with struct page size being 64B.

1 PUD table for 2G memory (1G mapping)
1 PUD table and 1 PMD table for 2M vmemmap (2M mappings)

With default PUD[PMD]_TABLE_SIZE(4K), (1+1+1)*4K=12K
With new PAGE_SIZE(64K) table allocations, (1+1+1)*64K=192K

Signed-off-by: Bharata B Rao 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgalloc.h | 16 +++-
 arch/powerpc/mm/book3s64/pgtable.c   |  5 -
 arch/powerpc/mm/book3s64/radix_pgtable.c | 15 +++
 arch/powerpc/mm/pgtable-frag.c   |  3 +++
 4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgalloc.h 
b/arch/powerpc/include/asm/book3s/64/pgalloc.h
index 69c5b051734f..e1af0b394ceb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgalloc.h
+++ b/arch/powerpc/include/asm/book3s/64/pgalloc.h
@@ -107,9 +107,23 @@ static inline pud_t *pud_alloc_one(struct mm_struct *mm, 
unsigned long addr)
return pud;
 }
 
+static inline void __pud_free(pud_t *pud)
+{
+   struct page *page = virt_to_page(pud);
+
+   /*
+* Early pud pages allocated via memblock allocator
+* can't be directly freed to slab
+*/
+   if (PageReserved(page))
+   free_reserved_page(page);
+   else
+   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+}
+
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), pud);
+   return __pud_free(pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index c58ad1049909..85de5b574dd7 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -339,6 +339,9 @@ void pmd_fragment_free(unsigned long *pmd)
 {
struct page *page = virt_to_page(pmd);
 
+   if (PageReserved(page))
+   return free_reserved_page(page);
+
BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
if (atomic_dec_and_test(&page->pt_frag_refcount)) {
pgtable_pmd_page_dtor(page);
@@ -356,7 +359,7 @@ static inline void pgtable_free(void *table, int index)
pmd_fragment_free(table);
break;
case PUD_INDEX:
-   kmem_cache_free(PGT_CACHE(PUD_CACHE_INDEX), table);
+   __pud_free(table);
break;
 #if defined(CONFIG_PPC_4K_PAGES) && defined(CONFIG_HUGETLB_PAGE)
/* 16M hugepd directory at pud level */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index bb00e0cba119..85806a6bed4d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -56,6 +56,13 @@ static __ref void *early_alloc_pgtable(unsigned long size, 
int nid,
return ptr;
 }
 
+/*
+ * When allocating pud or pmd pointers, we allocate a complete page
+ * of PAGE_SIZE rather than PUD_TABLE_SIZE or PMD_TABLE_SIZE. This
+ * is to ensure that the p

[PATCH v3 0/4] powerpc/mm/radix: Memory unplug fixes

2020-07-09 Thread Aneesh Kumar K.V
This is the next version of the fixes for memory unplug on radix.
The issues and the fix are described in the actual patches.

Changes from v2:
- Address review feedback

Changes from v1:
- Added back patch to drop split_kernel_mapping
- Most of the split_kernel_mapping related issues are now described
  in the removal patch
- drop pte fragment change
- use lmb size as the max mapping size.
- Radix baremetal now use memory block size of 1G.


Changes from v0:
- Rebased to latest kernel.
- Took care of p4d changes.
- Addressed Aneesh's review feedback:
 - Added comments.
 - Indentation fixed.
- Dropped the 1st patch (setting DRCONF_MEM_HOTREMOVABLE lmb flags) as
  it is debatable if this flag should be set in the device tree by OS
  and not by platform in case of hotplug. This can be looked at separately.
  (The fixes in this patchset remain valid without the dropped patch)
- Dropped the last patch that removed split_kernel_mapping() to ensure
  that the splitting code is available for any radix guest running on
  platforms that don't set DRCONF_MEM_HOTREMOVABLE.



Aneesh Kumar K.V (2):
  powerpc/mm/radix: Fix PTE/PMD fragment count for early page table
mappings
  powerpc/mm/radix: Create separate mappings for hot-plugged memory

Bharata B Rao (2):
  powerpc/mm/radix: Free PUD table when freeing pagetable
  powerpc/mm/radix: Remove split_kernel_mapping()

 arch/powerpc/include/asm/book3s/64/mmu.h |   5 +
 arch/powerpc/include/asm/book3s/64/pgalloc.h |  16 +-
 arch/powerpc/mm/book3s64/pgtable.c   |   5 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 197 +++
 arch/powerpc/mm/pgtable-frag.c   |   3 +
 arch/powerpc/platforms/powernv/setup.c   |  10 +-
 6 files changed, 147 insertions(+), 89 deletions(-)

-- 
2.26.2



[PATCH] powerpc/watchpoint/ptrace: Introduce PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31

2020-07-09 Thread Ravi Bangoria
PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31 can be used to determine
whether we are running on an ISA 3.1 compliant machine, which is
needed to determine DAR behaviour, the 512-byte boundary limit, etc.
This was requested by Pedro Miraglia Franco de Carvalho for
extending watchpoint features in gdb. Note that the availability of
the 2nd DAWR is independent of this flag and should be checked using
ppc_debug_info->num_data_bps.
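
For reference, a debugger would consume this roughly as follows (a sketch
only; error handling is trimmed and the exact gdb-side usage is an
assumption):

	#include <sys/types.h>
	#include <sys/ptrace.h>
	#include <asm/ptrace.h>	/* struct ppc_debug_info, PPC_DEBUG_FEATURE_* */

	static void query_dawr_caps(pid_t child)
	{
		struct ppc_debug_info dbginfo;

		/* Query the hardware debug capabilities of the traced task */
		if (ptrace(PPC_PTRACE_GETHWDBGINFO, child, NULL, &dbginfo) == -1)
			return;

		if (dbginfo.features & PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31) {
			/* ISA 3.1 DAWR: new DAR behaviour, 512-byte boundary limit */
		}

		/* The 2nd DAWR is reported independently of the new flag */
		if (dbginfo.num_data_bps > 1) {
			/* two hardware watchpoints can be used */
		}
	}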

Signed-off-by: Ravi Bangoria 
---
 arch/powerpc/include/uapi/asm/ptrace.h| 1 +
 arch/powerpc/kernel/ptrace/ptrace-noadv.c | 5 -
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/ptrace.h 
b/arch/powerpc/include/uapi/asm/ptrace.h
index f5f1ccc740fc..0a87bcd4300a 100644
--- a/arch/powerpc/include/uapi/asm/ptrace.h
+++ b/arch/powerpc/include/uapi/asm/ptrace.h
@@ -222,6 +222,7 @@ struct ppc_debug_info {
 #define PPC_DEBUG_FEATURE_DATA_BP_RANGE0x0004
 #define PPC_DEBUG_FEATURE_DATA_BP_MASK 0x0008
 #define PPC_DEBUG_FEATURE_DATA_BP_DAWR 0x0010
+#define PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31 0x0020
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/kernel/ptrace/ptrace-noadv.c 
b/arch/powerpc/kernel/ptrace/ptrace-noadv.c
index 697c7e4b5877..b2de874d650b 100644
--- a/arch/powerpc/kernel/ptrace/ptrace-noadv.c
+++ b/arch/powerpc/kernel/ptrace/ptrace-noadv.c
@@ -52,8 +52,11 @@ void ppc_gethwdinfo(struct ppc_debug_info *dbginfo)
dbginfo->sizeof_condition = 0;
if (IS_ENABLED(CONFIG_HAVE_HW_BREAKPOINT)) {
dbginfo->features = PPC_DEBUG_FEATURE_DATA_BP_RANGE;
-   if (dawr_enabled())
+   if (dawr_enabled()) {
dbginfo->features |= PPC_DEBUG_FEATURE_DATA_BP_DAWR;
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   dbginfo->features |= 
PPC_DEBUG_FEATURE_DATA_BP_DAWR_ARCH_31;
+   }
} else {
dbginfo->features = 0;
}
-- 
2.26.2



RE: [PATCH 14/20] Documentation: misc/xilinx_sdfec: eliminate duplicated word

2020-07-09 Thread Dragan Cvetic


> -Original Message-
> From: Randy Dunlap 
> Sent: Tuesday 7 July 2020 19:04
> To: linux-ker...@vger.kernel.org
> Subject: [PATCH 14/20] Documentation: misc/xilinx_sdfec: eliminate duplicated word
> 
> Drop the doubled word "the".
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Derek Kiernan 
> Cc: Dragan Cvetic 
> ---
>  Documentation/misc-devices/xilinx_sdfec.rst |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- linux-next-20200701.orig/Documentation/misc-devices/xilinx_sdfec.rst
> +++ linux-next-20200701/Documentation/misc-devices/xilinx_sdfec.rst
> @@ -78,7 +78,7 @@ application interfaces:
>- open: Implements restriction that only a single file descriptor can be 
> open per SD-FEC instance at any time
>- release: Allows another file descriptor to be open, that is after 
> current file descriptor is closed
>- poll: Provides a method to monitor for SD-FEC Error events
> -  - unlocked_ioctl: Provides the the following ioctl commands that allows 
> the application configure the SD-FEC core:
> +  - unlocked_ioctl: Provides the following ioctl commands that allows the 
> application configure the SD-FEC core:
> 
>   - :c:macro:`XSDFEC_START_DEV`
>   - :c:macro:`XSDFEC_STOP_DEV`

Acked-by: Dragan Cvetic 
Thanks Randy

Dragan


Re: [PATCH v2 04/10] powerpc/perf: Add power10_feat to dt_cpu_ftrs

2020-07-09 Thread Athira Rajeev


> On 08-Jul-2020, at 4:45 PM, Michael Ellerman  wrote:
> 
>> Athira Rajeev writes:
>> From: Madhavan Srinivasan 
>> 
>> Add power10 feature function to dt_cpu_ftrs.c along
>> with a power10 specific init() to initialize pmu sprs.
>> 
>> Signed-off-by: Madhavan Srinivasan 
>> ---
>> arch/powerpc/include/asm/reg.h|  3 +++
>> arch/powerpc/kernel/cpu_setup_power.S |  7 +++
>> arch/powerpc/kernel/dt_cpu_ftrs.c | 26 ++
>> 3 files changed, 36 insertions(+)
>> 
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 21a1b2d..900ada1 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -1068,6 +1068,9 @@
>> #define MMCR0_PMC2_LOADMISSTIME  0x5
>> #endif
>> 
>> +/* BHRB disable bit for PowerISA v3.10 */
>> +#define MMCRA_BHRB_DISABLE  0x0020
>> +
>> /*
>>  * SPRG usage:
>>  *
>> diff --git a/arch/powerpc/kernel/cpu_setup_power.S 
>> b/arch/powerpc/kernel/cpu_setup_power.S
>> index efdcfa7..e8b3370c 100644
>> --- a/arch/powerpc/kernel/cpu_setup_power.S
>> +++ b/arch/powerpc/kernel/cpu_setup_power.S
>> @@ -233,3 +233,10 @@ __init_PMU_ISA207:
>>  li  r5,0
>>  mtspr   SPRN_MMCRS,r5
>>  blr
>> +
>> +__init_PMU_ISA31:
>> +li  r5,0
>> +mtspr   SPRN_MMCR3,r5
>> +LOAD_REG_IMMEDIATE(r5, MMCRA_BHRB_DISABLE)
>> +mtspr   SPRN_MMCRA,r5
>> +blr
> 
> This doesn't seem like it belongs in this patch. It's not called?

Yes, you are right, this needs to be called from `__setup_cpu_power10`.
Since we didn't have the setup part for power10 in the tree initially, I missed it.
I will include this update in V3.

Thanks
Athira
> 
> cheers



Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-09 Thread Pratik Sampat




On 09/07/20 2:39 pm, Gautham R Shenoy wrote:

On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote:

Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
stop levels < 4.

Adding Ravi Bangoria  to the cc.


Therefore save the values of these SPRs before entering a  "stop"
state and restore their values on wakeup.

Signed-off-by: Pratik Rajesh Sampat 


The saving and restoration looks good to me.

---
  arch/powerpc/platforms/powernv/idle.c | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 19d94d021357..471d4a65b1fa 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -600,6 +600,8 @@ struct p9_sprs {
u64 iamr;
u64 amor;
u64 uamor;
+   u64 dawr0;
+   u64 dawrx0;
  };

  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
@@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
sprs.tscr   = mfspr(SPRN_TSCR);
if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
sprs.ldbar = mfspr(SPRN_LDBAR);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   sprs.dawr0 = mfspr(SPRN_DAWR0);
+   sprs.dawrx0 = mfspr(SPRN_DAWRX0);
+   }



But this is within the if condition which says

if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level)

This if condition is meant for stop4 and stop5 since these are stop
levels that have OPAL_PM_LOSE_HYP_CONTEXT set.

Since we can lose DAWR*, on states that lose limited hypervisor
context, such as stop0-2, we need to unconditionally save them
like AMR, IAMR etc.


Right, shallow states also lose DAWR/X. Thanks for pointing it out.
I'll fix this and resend.
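
For what it's worth, the resend would presumably just move those accesses out
of the deep-state-only block, along these lines (a sketch; the exact placement
in power9_idle_stop() is an assumption until the resend):

	/* Save DAWR0/DAWRX0 for any stop level, like AMR/IAMR/AMOR/UAMOR */
	if (cpu_has_feature(CPU_FTR_ARCH_31)) {
		sprs.dawr0 = mfspr(SPRN_DAWR0);
		sprs.dawrx0 = mfspr(SPRN_DAWRX0);
	}

	/* ... enter stop, wake up ... */

	/* Restore unconditionally on wakeup as well */
	if (cpu_has_feature(CPU_FTR_ARCH_31)) {
		mtspr(SPRN_DAWR0, sprs.dawr0);
		mtspr(SPRN_DAWRX0, sprs.dawrx0);
	}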


sprs_saved = true;

@@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, 
bool mmu_on)
mtspr(SPRN_MMCR2,   sprs.mmcr2);
if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
mtspr(SPRN_LDBAR, sprs.ldbar);
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_DAWR0, sprs.dawr0);
+   mtspr(SPRN_DAWRX0, sprs.dawrx0);
+   }


Likewise, we need to unconditionally restore these SPRs.



mtspr(SPRN_SPRG3,   local_paca->sprg_vdso);

--
2.25.4


Thanks
Pratik


Re: [PATCH 1/2] powerpc/vas: Report proper error for address translation failure

2020-07-09 Thread Michael Ellerman
Haren Myneni  writes:
> DMA controller uses CC=5 internally for translation fault handling. So
> OS should be using CC=250 and should report this error to the user space
> when NX encounters address translation failure on the request buffer.

That doesn't really explain *why* the OS must use CC=250.

Is it documented somewhere that 5 is for hardware use, and 250 is for
software?

> This patch defines CSB_CC_ADDRESS_TRANSLATION(250) and updates
> CSB.CC with this proper error code for user space.

We still have:

#define CSB_CC_TRANSLATION  (5)

And it's very unclear where one or the other should be used.

Can one or the other get a name that makes the distinction clear.

cheers


> diff --git a/Documentation/powerpc/vas-api.rst 
> b/Documentation/powerpc/vas-api.rst
> index 1217c2f..78627cc 100644
> --- a/Documentation/powerpc/vas-api.rst
> +++ b/Documentation/powerpc/vas-api.rst
> @@ -213,7 +213,7 @@ request buffers are not in memory. The operating system 
> handles the fault by
>  updating CSB with the following data:
>  
>   csb.flags = CSB_V;
> - csb.cc = CSB_CC_TRANSLATION;
> + csb.cc = CSB_CC_ADDRESS_TRANSLATION;
>   csb.ce = CSB_CE_TERMINATION;
>   csb.address = fault_address;
>  
> diff --git a/arch/powerpc/include/asm/icswx.h 
> b/arch/powerpc/include/asm/icswx.h
> index 965b1f3..b1c9a57 100644
> --- a/arch/powerpc/include/asm/icswx.h
> +++ b/arch/powerpc/include/asm/icswx.h
> @@ -77,6 +77,8 @@ struct coprocessor_completion_block {
>  #define CSB_CC_CHAIN (37)
>  #define CSB_CC_SEQUENCE  (38)
>  #define CSB_CC_HW(39)
> +/* User space address traslation failure */
> +#define  CSB_CC_ADDRESS_TRANSLATION  (250)
>  
>  #define CSB_SIZE (0x10)
>  #define CSB_ALIGNCSB_SIZE
> diff --git a/arch/powerpc/platforms/powernv/vas-fault.c 
> b/arch/powerpc/platforms/powernv/vas-fault.c
> index 266a6ca..33e89d4 100644
> --- a/arch/powerpc/platforms/powernv/vas-fault.c
> +++ b/arch/powerpc/platforms/powernv/vas-fault.c
> @@ -79,7 +79,7 @@ static void update_csb(struct vas_window *window,
>   csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
>  
>   memset(&csb, 0, sizeof(csb));
> - csb.cc = CSB_CC_TRANSLATION;
> + csb.cc = CSB_CC_ADDRESS_TRANSLATION;
>   csb.ce = CSB_CE_TERMINATION;
>   csb.cs = 0;
>   csb.count = 0;
> -- 
> 1.8.3.1


Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-09 Thread Alex Ghiti

Hi Palmer,

Le 7/9/20 à 1:05 AM, Palmer Dabbelt a écrit :

On Sun, 07 Jun 2020 00:59:46 PDT (-0700), a...@ghiti.fr wrote:

This is a preparatory patch for relocatable kernel.

The kernel used to be linked at PAGE_OFFSET address and used to be loaded
physically at the beginning of the main memory. Therefore, we could use
the linear mapping for the kernel mapping.

But the relocated kernel base address will be different from PAGE_OFFSET
and since in the linear mapping, two different virtual addresses cannot
point to the same physical address, the kernel mapping needs to lie outside
the linear mapping.


I know it's been a while, but I keep opening this up to review it and just
can't get over how ugly it is to put the kernel's linear map in the vmalloc
region.

I guess I don't understand why this is necessary at all.  Specifically: why
can't we just relocate the kernel within the linear map?  That would let the
bootloader put the kernel wherever it wants, modulo the physical memory size
we support.  We'd need to handle the regions that are coupled to the kernel's
execution address, but we could just put them in an explicit memory region
which is what we should probably be doing anyway.


Virtual relocation in the linear mapping requires moving the kernel
physically too. Zong implemented this physical move in his KASLR RFC
patchset, which is cumbersome since finding an available physical spot
is harder than just selecting a virtual range in the vmalloc range.


In addition, having the kernel mapping in the linear mapping prevents
the use of hugepages for the linear mapping, resulting in a performance loss
(at least for the GB that encompasses the kernel).


Why do you find this "ugly"? The vmalloc region is just a bunch of
available virtual addresses for whatever purpose we want, and as noted by
Zong, arm64 uses the same scheme.





In addition, because modules and BPF must be close to the kernel (inside
+-2GB window), the kernel is placed at the end of the vmalloc zone minus
2GB, which leaves room for modules and BPF. The kernel could not be
placed at the beginning of the vmalloc zone since other vmalloc
allocations from the kernel could get all the +-2GB window around the
kernel, which would prevent new modules and BPF programs from being loaded.


Well, that's not enough to make sure this doesn't happen -- it's just enough
to make sure it doesn't happen very quickly.  That's the same boat we're
already in, though, so it's not like it's worse.


Indeed, that's not worse. I haven't found a way to reserve the vmalloc area
without actually allocating it.





Signed-off-by: Alexandre Ghiti 
Reviewed-by: Zong Li 
---
 arch/riscv/boot/loader.lds.S |  3 +-
 arch/riscv/include/asm/page.h    | 10 +-
 arch/riscv/include/asm/pgtable.h | 38 ++---
 arch/riscv/kernel/head.S |  3 +-
 arch/riscv/kernel/module.c   |  4 +--
 arch/riscv/kernel/vmlinux.lds.S  |  3 +-
 arch/riscv/mm/init.c | 58 +---
 arch/riscv/mm/physaddr.c |  2 +-
 8 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
index 47a5003c2e28..62d94696a19c 100644
--- a/arch/riscv/boot/loader.lds.S
+++ b/arch/riscv/boot/loader.lds.S
@@ -1,13 +1,14 @@
 /* SPDX-License-Identifier: GPL-2.0 */

 #include 
+#include 

 OUTPUT_ARCH(riscv)
 ENTRY(_start)

 SECTIONS
 {
-    . = PAGE_OFFSET;
+    . = KERNEL_LINK_ADDR;

 .payload : {
 *(.payload)
diff --git a/arch/riscv/include/asm/page.h 
b/arch/riscv/include/asm/page.h

index 2d50f76efe48..48bb09b6a9b7 100644
--- a/arch/riscv/include/asm/page.h
+++ b/arch/riscv/include/asm/page.h
@@ -90,18 +90,26 @@ typedef struct page *pgtable_t;

 #ifdef CONFIG_MMU
 extern unsigned long va_pa_offset;
+extern unsigned long va_kernel_pa_offset;
 extern unsigned long pfn_base;
 #define ARCH_PFN_OFFSET    (pfn_base)
 #else
 #define va_pa_offset    0
+#define va_kernel_pa_offset    0
 #define ARCH_PFN_OFFSET    (PAGE_OFFSET >> PAGE_SHIFT)
 #endif /* CONFIG_MMU */

 extern unsigned long max_low_pfn;
 extern unsigned long min_low_pfn;
+extern unsigned long kernel_virt_addr;

 #define __pa_to_va_nodebug(x)    ((void *)((unsigned long) (x) + 
va_pa_offset))

-#define __va_to_pa_nodebug(x)    ((unsigned long)(x) - va_pa_offset)
+#define linear_mapping_va_to_pa(x)    ((unsigned long)(x) - 
va_pa_offset)

+#define kernel_mapping_va_to_pa(x)    \
+    ((unsigned long)(x) - va_kernel_pa_offset)
+#define __va_to_pa_nodebug(x)    \
+    (((x) >= PAGE_OFFSET) ?    \
+    linear_mapping_va_to_pa(x) : kernel_mapping_va_to_pa(x))

 #ifdef CONFIG_DEBUG_VIRTUAL
 extern phys_addr_t __virt_to_phys(unsigned long x);
diff --git a/arch/riscv/include/asm/pgtable.h 
b/arch/riscv/include/asm/pgtable.h

index 35b60035b6b0..94ef3b49dfb6 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -11,23 +11,29 @@

 #include 

-#ifndef __ASSEMBLY__
-

Re: [PATCH 2/2] KVM: PPC: Book3S HV: rework secure mem slot dropping

2020-07-09 Thread Michael Ellerman
Laurent Dufour  writes:
> Le 08/07/2020 à 13:25, Bharata B Rao a écrit :
>> On Fri, Jul 03, 2020 at 05:59:14PM +0200, Laurent Dufour wrote:
>>> When a secure memslot is dropped, all the pages backed in the secure device
>>> (aka really backed by secure memory by the Ultravisor) should be paged out
>>> to a normal page. Previously, this was achieved by triggering the page
>>> fault mechanism which is calling kvmppc_svm_page_out() on each pages.
>>>
>>> This can't work when hot unplugging a memory slot because the memory slot
>>> is flagged as invalid and gfn_to_pfn() is then not trying to access the
>>> page, so the page fault mechanism is not triggered.
>>>
>>> Since the final goal is to make a call to kvmppc_svm_page_out() it seems
>>> simpler to directly calling it instead of triggering such a mechanism. This
>>> way kvmppc_uvmem_drop_pages() can be called even when hot unplugging a
>>> memslot.
>> 
>> Yes, this appears much simpler.
>
> Thanks Bharata for reviewing this.
>
>> 
>>>
>>> Since kvmppc_uvmem_drop_pages() is already holding kvm->arch.uvmem_lock,
>>> the call to __kvmppc_svm_page_out() is made.
>>> As __kvmppc_svm_page_out needs the vma pointer to migrate the pages, the
>>> VMA is fetched in a lazy way, to not trigger find_vma() all the time. In
>>> addition, the mmap_sem is help in read mode during that time, not in write
>>> mode since the virual memory layout is not impacted, and
>>> kvm->arch.uvmem_lock prevents concurrent operation on the secure device.
>>>
>>> Cc: Ram Pai 
>>> Cc: Bharata B Rao 
>>> Cc: Paul Mackerras 
>>> Signed-off-by: Laurent Dufour 
>>> ---
>>>   arch/powerpc/kvm/book3s_hv_uvmem.c | 54 --
>>>   1 file changed, 37 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c 
>>> b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> index 852cc9ae6a0b..479ddf16d18c 100644
>>> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
>>> @@ -533,35 +533,55 @@ static inline int kvmppc_svm_page_out(struct 
>>> vm_area_struct *vma,
>>>* fault on them, do fault time migration to replace the device PTEs in
>>>* QEMU page table with normal PTEs from newly allocated pages.
>>>*/
>>> -void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>>> +void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *slot,
>>>  struct kvm *kvm, bool skip_page_out)
>>>   {
>>> int i;
>>> struct kvmppc_uvmem_page_pvt *pvt;
>>> -   unsigned long pfn, uvmem_pfn;
>>> -   unsigned long gfn = free->base_gfn;
>>> +   struct page *uvmem_page;
>>> +   struct vm_area_struct *vma = NULL;
>>> +   unsigned long uvmem_pfn, gfn;
>>> +   unsigned long addr, end;
>>> +
>>> +   down_read(&kvm->mm->mmap_sem);
>> 
>> You should be using mmap_read_lock(kvm->mm) with recent kernels.
>
> Absolutely, shame on me, I reviewed Michel's series about that!
>
> Paul, Michael, could you fix that when pulling this patch or should I send a
> whole new series?

Paul will take this series, so up to him.

cheers


Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-09 Thread Peter Zijlstra
On Thu, Jul 09, 2020 at 08:53:16PM +1000, Michael Ellerman wrote:
> Nicholas Piggin  writes:
> 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/include/asm/paravirt.h   | 28 
> >  arch/powerpc/include/asm/qspinlock.h  | 66 +++
> >  arch/powerpc/include/asm/qspinlock_paravirt.h |  7 ++
> >  arch/powerpc/platforms/pseries/Kconfig|  5 ++
> >  arch/powerpc/platforms/pseries/setup.c|  6 +-
> >  include/asm-generic/qspinlock.h   |  2 +
> 
> Another ack?

Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH 2/2] powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10

2020-07-09 Thread Gautham R Shenoy
On Fri, Jul 03, 2020 at 06:16:40PM +0530, Pratik Rajesh Sampat wrote:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.

Adding Ravi Bangoria  to the cc.

> Therefore save the values of these SPRs before entering a  "stop"
> state and restore their values on wakeup.
> 
> Signed-off-by: Pratik Rajesh Sampat 


The saving and restoration looks good to me. 
> ---
>  arch/powerpc/platforms/powernv/idle.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 19d94d021357..471d4a65b1fa 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -600,6 +600,8 @@ struct p9_sprs {
>   u64 iamr;
>   u64 amor;
>   u64 uamor;
> + u64 dawr0;
> + u64 dawrx0;
>  };
> 
>  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> @@ -677,6 +679,10 @@ static unsigned long power9_idle_stop(unsigned long 
> psscr, bool mmu_on)
>   sprs.tscr   = mfspr(SPRN_TSCR);
>   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>   sprs.ldbar = mfspr(SPRN_LDBAR);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + sprs.dawr0 = mfspr(SPRN_DAWR0);
> + sprs.dawrx0 = mfspr(SPRN_DAWRX0);
> + }
>


But this is within the if condition which says

if ((psscr & PSSCR_RL_MASK) >= pnv_first_spr_loss_level)

This if condition is meant for stop4 and stop5 since these are stop
levels that have OPAL_PM_LOSE_HYP_CONTEXT set.

Since we can lose DAWR*, on states that lose limited hypervisor
context, such as stop0-2, we need to unconditionally save them
like AMR, IAMR etc.


>   sprs_saved = true;
> 
> @@ -792,6 +798,10 @@ static unsigned long power9_idle_stop(unsigned long 
> psscr, bool mmu_on)
>   mtspr(SPRN_MMCR2,   sprs.mmcr2);
>   if (!firmware_has_feature(FW_FEATURE_ULTRAVISOR))
>   mtspr(SPRN_LDBAR, sprs.ldbar);
> + if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> + mtspr(SPRN_DAWR0, sprs.dawr0);
> + mtspr(SPRN_DAWRX0, sprs.dawrx0);
> + }


Likewise, we need to unconditionally restore these SPRs.


> 
>   mtspr(SPRN_SPRG3,   local_paca->sprg_vdso);
> 
> -- 
> 2.25.4
> 


Re: [PATCH 1/2] powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above

2020-07-09 Thread Gautham R Shenoy
On Fri, Jul 03, 2020 at 06:16:39PM +0530, Pratik Rajesh Sampat wrote:
> POWER9 onwards, support for the registers HID1, HID4 and HID5 has been
> removed.
> Although mfspr on the above registers worked on POWER9, the POWER10
> simulator does not recognize them. Move their assignment under the
> check for machines older than POWER9.
> 
> Signed-off-by: Pratik Rajesh Sampat 

Nice catch.

Reviewed-by: Gautham R. Shenoy 

> ---
>  arch/powerpc/platforms/powernv/idle.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c 
> b/arch/powerpc/platforms/powernv/idle.c
> index 2dd467383a88..19d94d021357 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
>*/
>   uint64_t lpcr_val   = mfspr(SPRN_LPCR);
>   uint64_t hid0_val   = mfspr(SPRN_HID0);
> - uint64_t hid1_val   = mfspr(SPRN_HID1);
> - uint64_t hid4_val   = mfspr(SPRN_HID4);
> - uint64_t hid5_val   = mfspr(SPRN_HID5);
>   uint64_t hmeer_val  = mfspr(SPRN_HMEER);
>   uint64_t msr_val = MSR_IDLE;
>   uint64_t psscr_val = pnv_deepest_stop_psscr_val;
> @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
> 
>   /* Only p8 needs to set extra HID regiters */
>   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
> + uint64_t hid1_val = mfspr(SPRN_HID1);
> + uint64_t hid4_val = mfspr(SPRN_HID4);
> + uint64_t hid5_val = mfspr(SPRN_HID5);
> 
>   rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>   if (rc != 0)
> -- 
> 2.25.4
> 
--
Thanks and Regards
gautham.


Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-09 Thread Michael Ellerman
Nicholas Piggin  writes:

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/paravirt.h   | 28 
>  arch/powerpc/include/asm/qspinlock.h  | 66 +++
>  arch/powerpc/include/asm/qspinlock_paravirt.h |  7 ++
>  arch/powerpc/platforms/pseries/Kconfig|  5 ++
>  arch/powerpc/platforms/pseries/setup.c|  6 +-
>  include/asm-generic/qspinlock.h   |  2 +

Another ack?

> diff --git a/arch/powerpc/include/asm/paravirt.h 
> b/arch/powerpc/include/asm/paravirt.h
> index 7a8546660a63..f2d51f929cf5 100644
> --- a/arch/powerpc/include/asm/paravirt.h
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 
> yield_count)
>  {
>   ___bad_yield_to_preempted(); /* This would be a bug */
>  }
> +
> +extern void ___bad_yield_to_any(void);
> +static inline void yield_to_any(void)
> +{
> + ___bad_yield_to_any(); /* This would be a bug */
> +}

Why do we do that rather than just not defining yield_to_any() at all
and letting the build fail on that?

There's a condition somewhere that we know will be false at compile time
so the call is dropped before linking?
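
The pattern relies on the "bad" helper never being defined: if the compiler
cannot prove the call dead and discard it, the build fails at link time. A
generic sketch of the idea (not the actual powerpc code):

	/* Never defined anywhere -- referencing it is a build-time assertion */
	extern void ___bad_yield_to_any(void);

	static inline void yield_to_any(void)
	{
		___bad_yield_to_any();	/* link error if this call survives */
	}

	static inline void wait_for_lock(void)
	{
		/*
		 * Constant-false when CONFIG_PPC_SPLPAR is not set, so the
		 * compiler drops the branch and the undefined call with it.
		 */
		if (IS_ENABLED(CONFIG_PPC_SPLPAR))
			yield_to_any();
	}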

> diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h 
> b/arch/powerpc/include/asm/qspinlock_paravirt.h
> new file mode 100644
> index ..750d1b5e0202
> --- /dev/null
> +++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef __ASM_QSPINLOCK_PARAVIRT_H
> +#define __ASM_QSPINLOCK_PARAVIRT_H

_ASM_POWERPC_QSPINLOCK_PARAVIRT_H please.

> +
> +EXPORT_SYMBOL(__pv_queued_spin_unlock);

Why's that in a header? Should that (eventually) go with the generic 
implementation?

> diff --git a/arch/powerpc/platforms/pseries/Kconfig 
> b/arch/powerpc/platforms/pseries/Kconfig
> index 24c18362e5ea..756e727b383f 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -25,9 +25,14 @@ config PPC_PSERIES
>   select SWIOTLB
>   default y
>  
> +config PARAVIRT_SPINLOCKS
> + bool
> + default n

default n is the default.

> diff --git a/arch/powerpc/platforms/pseries/setup.c 
> b/arch/powerpc/platforms/pseries/setup.c
> index 2db8469e475f..747a203d9453 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -771,8 +771,12 @@ static void __init pSeries_setup_arch(void)
>   if (firmware_has_feature(FW_FEATURE_LPAR)) {
>   vpa_init(boot_cpuid);
>  
> - if (lppaca_shared_proc(get_lppaca()))
> + if (lppaca_shared_proc(get_lppaca())) {
>   static_branch_enable(&shared_processor);
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
> + pv_spinlocks_init();
> +#endif
> + }

We could avoid the ifdef with this I think?

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 434615f1d761..6ec72282888d 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -10,5 +10,9 @@
 #include 
 #endif

+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+static inline void pv_spinlocks_init(void) { }
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* __ASM_SPINLOCK_H */


cheers


Re: [PATCH v3 4/6] powerpc/64s: implement queued spinlocks and rwlocks

2020-07-09 Thread Peter Zijlstra
On Thu, Jul 09, 2020 at 08:20:25PM +1000, Michael Ellerman wrote:
> Nicholas Piggin  writes:
> > These have shown significantly improved performance and fairness when
> > spinlock contention is moderate to high on very large systems.
> >
> >  [ Numbers hopefully forthcoming after more testing, but initial
> >results look good ]
> 
> Would be good to have something here, even if it's preliminary.
> 
> > Thanks to the fast path, single threaded performance is not noticably
> > hurt.
> >
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/Kconfig  | 13 
> >  arch/powerpc/include/asm/Kbuild   |  2 ++
> >  arch/powerpc/include/asm/qspinlock.h  | 25 +++
> >  arch/powerpc/include/asm/spinlock.h   |  5 +
> >  arch/powerpc/include/asm/spinlock_types.h |  5 +
> >  arch/powerpc/lib/Makefile |  3 +++
> 
> >  include/asm-generic/qspinlock.h   |  2 ++
> 
> Who's ack do we need for that part?

Mine I suppose would do, as discussed earlier, it probably isn't
required anymore, but I understand the paranoia of not wanting to change
too many things at once :-)


Acked-by: Peter Zijlstra (Intel) 


Re: [PATCH] powerpc: select ARCH_HAS_MEMBARRIER_SYNC_CORE

2020-07-09 Thread Nicholas Piggin
Excerpts from Mathieu Desnoyers's message of July 9, 2020 12:12 am:
> - On Jul 8, 2020, at 1:17 AM, Nicholas Piggin npig...@gmail.com wrote:
> 
>> Excerpts from Mathieu Desnoyers's message of July 7, 2020 9:25 pm:
>>> - On Jul 7, 2020, at 1:50 AM, Nicholas Piggin npig...@gmail.com wrote:
>>> 
> [...]
 I should actually change the comment for 64-bit because soft masked
 interrupt replay is an interesting case. I thought it was okay (because
 the IPI would cause a hard interrupt which does do the rfi) but that
 should at least be written.
>>> 
>>> Yes.
>>> 
 The context synchronisation happens before
 the Linux IPI function is called, but for the purpose of membarrier I
 think that is okay (the membarrier just needs to have caused a memory
 barrier + context synchronisation by the time it has done).
>>> 
>>> Can you point me to the code implementing this logic ?
>> 
>> It's mostly in arch/powerpc/kernel/exception-64s.S and
>> powerpc/kernel/irq.c, but a lot of asm so easier to explain.
>> 
>> When any Linux code does local_irq_disable(), we set interrupts as
>> software-masked in a per-cpu flag. When interrupts (including IPIs) come
>> in, the first thing we do is check that flag and if we are masked, then
>> record that the interrupt needs to be "replayed" in another per-cpu
>> flag. The interrupt handler then exits back using RFI (which is context
>> synchronising the CPU). Later, when the kernel code does
>> local_irq_enable(), it checks the replay flag to see if anything needs
>> to be done. At that point we basically just call the interrupt handler
>> code like a normal function, and when that returns there is no context
>> synchronising instruction.
> 
> AFAIU this can only happen for interrupts nesting over irqoff sections,
> therefore over kernel code, never userspace, right ?

Right.

>> So membarrier IPI will always cause target CPUs to perform a context
>> synchronising instruction, but sometimes it happens before the IPI
>> handler function runs.
> 
> If my understanding is correct, the replayed interrupt handler logic
> only nests over kernel code, which will eventually need to issue a
> context synchronizing instruction before returning to user-space.

Yes.

> All we care about is that starting from the membarrier, each core
> either:
> 
> - interrupt user-space to issue the context synchronizing instruction if
>   they were running userspace, or
> - _eventually_ issue a context synchronizing instruction before returning
>   to user-space if they were running kernel code.
> 
> So your earlier statement "the membarrier just needs to have caused a memory
> barrier + context synchronistaion by the time it has done" is not strictly
> correct: the context synchronizing instruction does not strictly need to
> happen on each core before membarrier returns. A similar line of thoughts
> can be followed for memory barriers.

Ah okay, that makes it simpler; then no such special comment is required
for the powerpc-specific interrupt handling.
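
For completeness, the replay flow described above in rough pseudo-code (the
field and function names are the real ones, but the control flow is heavily
simplified and is not the actual implementation):

	/* local_irq_disable(): only the software mask is set */
	local_paca->irq_soft_mask = IRQS_DISABLED;

	/* hard interrupt (e.g. the membarrier IPI) arriving while soft-masked */
	if (local_paca->irq_soft_mask) {
		local_paca->irq_happened |= PACA_IRQ_EE;	/* remember to replay */
		/* return with rfid: context synchronising */
	} else {
		/* normal path: run the handler, return with rfid */
	}

	/* local_irq_enable(): replay anything that came in while masked */
	local_paca->irq_soft_mask = IRQS_ENABLED;
	if (local_paca->irq_happened)
		replay_soft_interrupts();	/* plain function call, no rfid after it */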

Thanks,
Nick


Re: [PATCH v2 2/3] powerpc/64s: remove PROT_SAO support

2020-07-09 Thread Nicholas Piggin
Excerpts from Paul Mackerras's message of July 9, 2020 2:34 pm:
> On Fri, Jul 03, 2020 at 11:19:57AM +1000, Nicholas Piggin wrote:
>> ISA v3.1 does not support the SAO storage control attribute required to
>> implement PROT_SAO. PROT_SAO was used by specialised system software
>> (Lx86) that has been discontinued for about 7 years, and is not thought
>> to be used elsewhere, so removal should not cause problems.
>> 
>> We rather remove it than keep support for older processors, because
>> live migrating guest partitions to newer processors may not be possible
>> if SAO is in use (or worse allowed with silent races).
> 
> This is actually a real problem for KVM, because now we have the
> capabilities of the host affecting the characteristics of the guest
> virtual machine in a manner which userspace (e.g. QEMU) is unable to
> control.
> 
> It would probably be better to disallow SAO on all machines than have
> it available on some hosts and not others.  (Yes I know there is a
> check on CPU_FTR_ARCH_206 in there, but that has been a no-op since we
> removed the PPC970 KVM support.)

This change doesn't change the SAO difference on the host processors
though, just tries to slightly improve it from silently broken to
maybe complaining a bit.

I didn't want to stop some very old image that uses this and is running
okay on an existing host from working, but maybe the existence of such
a thing would contradict my reasoning. But then if we don't care about
it why care about this KVM behaviour difference at all?

> Solving this properly will probably require creating a new KVM host
> capability and associated machine parameter in QEMU, along with a new
> machine type.

Rather than answer any of these questions, I might take the KVM change
out and that can be dealt with separately from guest SAO removal.

Thanks,
Nick

> 
> [snip]
> 
>> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
>> b/arch/powerpc/include/asm/kvm_book3s_64.h
>> index 9bb9bb370b53..fac39ff659d4 100644
>> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
>> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
>> @@ -398,9 +398,10 @@ static inline bool hpte_cache_flags_ok(unsigned long 
>> hptel, bool is_ci)
>>  {
>>  unsigned int wimg = hptel & HPTE_R_WIMG;
>>  
>> -/* Handle SAO */
>> +/* Handle SAO for POWER7,8,9 */
>>  if (wimg == (HPTE_R_W | HPTE_R_I | HPTE_R_M) &&
>> -cpu_has_feature(CPU_FTR_ARCH_206))
>> +cpu_has_feature(CPU_FTR_ARCH_206) &&
>> +!cpu_has_feature(CPU_FTR_ARCH_31))
>>  wimg = HPTE_R_M;
> 
> Paul.
> 


Re: [PATCH v3 4/6] powerpc/64s: implement queued spinlocks and rwlocks

2020-07-09 Thread Michael Ellerman
Nicholas Piggin  writes:
> These have shown significantly improved performance and fairness when
> spinlock contention is moderate to high on very large systems.
>
>  [ Numbers hopefully forthcoming after more testing, but initial
>results look good ]

Would be good to have something here, even if it's preliminary.

> Thanks to the fast path, single threaded performance is not noticably
> hurt.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/Kconfig  | 13 
>  arch/powerpc/include/asm/Kbuild   |  2 ++
>  arch/powerpc/include/asm/qspinlock.h  | 25 +++
>  arch/powerpc/include/asm/spinlock.h   |  5 +
>  arch/powerpc/include/asm/spinlock_types.h |  5 +
>  arch/powerpc/lib/Makefile |  3 +++

>  include/asm-generic/qspinlock.h   |  2 ++

Who's ack do we need for that part?

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 24ac85c868db..17663ea57697 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -492,6 +494,17 @@ config HOTPLUG_CPU
>  
> Say N if you are unsure.
>  
> +config PPC_QUEUED_SPINLOCKS
> + bool "Queued spinlocks"
> + depends on SMP
> + default "y" if PPC_BOOK3S_64

Not sure about default y? At least until we've got a better idea of the
perf impact on a range of small/big new/old systems.

> + help
> +   Say Y here to use queued spinlocks, which are more complex
> +   but give better scalability and fairness on large SMP and NUMA
> +   systems.
> +
> +   If unsure, say "Y" if you have lots of cores, otherwise "N".

Would be nice if we could give a range for "lots".

> diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
> index dadbcf3a0b1e..1dd8b6adff5e 100644
> --- a/arch/powerpc/include/asm/Kbuild
> +++ b/arch/powerpc/include/asm/Kbuild
> @@ -6,5 +6,7 @@ generated-y += syscall_table_spu.h
>  generic-y += export.h
>  generic-y += local64.h
>  generic-y += mcs_spinlock.h
> +generic-y += qrwlock.h
> +generic-y += qspinlock.h

The 2nd line spits a warning about a redundant entry. I think you want
to just drop it.


cheers


Re: [PATCH v3 3/6] powerpc: move spinlock implementation to simple_spinlock

2020-07-09 Thread Michael Ellerman
Nicholas Piggin  writes:
> To prepare for queued spinlocks. This is a simple rename except to update
> preprocessor guard name and a file reference.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/simple_spinlock.h| 292 ++
>  .../include/asm/simple_spinlock_types.h   |  21 ++
>  arch/powerpc/include/asm/spinlock.h   | 285 +
>  arch/powerpc/include/asm/spinlock_types.h |  12 +-
>  4 files changed, 315 insertions(+), 295 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
>  create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h
>
> diff --git a/arch/powerpc/include/asm/simple_spinlock.h 
> b/arch/powerpc/include/asm/simple_spinlock.h
> new file mode 100644
> index ..e048c041c4a9
> --- /dev/null
> +++ b/arch/powerpc/include/asm/simple_spinlock.h
> @@ -0,0 +1,292 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef __ASM_SIMPLE_SPINLOCK_H
> +#define __ASM_SIMPLE_SPINLOCK_H

_ASM_POWERPC_SIMPLE_SPINLOCK_H

> +#ifdef __KERNEL__

Shouldn't be necessary.

> +/*
> + * Simple spin lock operations.  
> + *
> + * Copyright (C) 2001-2004 Paul Mackerras , IBM
> + * Copyright (C) 2001 Anton Blanchard , IBM
> + * Copyright (C) 2002 Dave Engebretsen , IBM
> + *   Rework to support virtual processors
> + *
> + * Type of int is used as a full 64b word is not necessary.
> + *
> + * (the type definitions are in asm/simple_spinlock_types.h)
> + */
> +#include 
> +#include 
> +#ifdef CONFIG_PPC64
> +#include 
> +#endif

I don't think paca.h needs a CONFIG_PPC64 guard around its include, since it
contains one itself. I know you're just moving the code, but it's still nice
to clean up slightly along the way.

cheers



Re: [PATCH v3 2/6] powerpc/pseries: move some PAPR paravirt functions to their own file

2020-07-09 Thread Michael Ellerman
Nicholas Piggin  writes:
>

Little bit of changelog would be nice :D

> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/paravirt.h | 61 +
>  arch/powerpc/include/asm/spinlock.h | 24 +---
>  arch/powerpc/lib/locks.c| 12 +++---
>  3 files changed, 68 insertions(+), 29 deletions(-)
>  create mode 100644 arch/powerpc/include/asm/paravirt.h
>
> diff --git a/arch/powerpc/include/asm/paravirt.h 
> b/arch/powerpc/include/asm/paravirt.h
> new file mode 100644
> index ..7a8546660a63
> --- /dev/null
> +++ b/arch/powerpc/include/asm/paravirt.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +#ifndef __ASM_PARAVIRT_H
> +#define __ASM_PARAVIRT_H

Should be _ASM_POWERPC_PARAVIRT_H

> +#ifdef __KERNEL__

We shouldn't need __KERNEL__ in here, it's not a uapi header.

cheers


Re: [RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-09 Thread Paul Mackerras
On Thu, Jul 09, 2020 at 02:38:51PM +0530, Bharata B Rao wrote:
> On Thu, Jul 09, 2020 at 03:18:03PM +1000, Paul Mackerras wrote:
> > On Fri, Jul 03, 2020 at 04:14:20PM +0530, Bharata B Rao wrote:
> > > In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
> > > H_RPT_INVALIDATE if available. The availability of this hcall
> > > is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
> > > DT property.
> > 
> > What are we going to use when nested KVM supports HPT guests at L2?
> > L1 will need to do partition-scoped tlbies with R=0 via a hypercall,
> > but H_RPT_INVALIDATE says in its name that it only handles radix
> > page tables (i.e. R=1).
> 
> For L2 HPT guests, the old hcall is expected to work after it adds
> support for R=0 case?

That was the plan.

> The new hcall should be advertised via ibm,hypertas-functions only
> for radix guests I suppose.

Well, the L1 hypervisor is a radix guest of L0, so it would have
H_RPT_INVALIDATE available to it?

I guess the question is whether H_RPT_INVALIDATE is supposed to do
everything, that is, radix process-scoped invalidations, radix
partition-scoped invalidations, and HPT partition-scoped
invalidations.  If that is the plan then we should call it something
different.

This patchset seems to imply that H_RPT_INVALIDATE is at least going
to be used for radix partition-scoped invalidations as well as radix
process-scoped invalidations.  If you are thinking that in future when
we need HPT partition-scoped invalidations for a radix L1 hypervisor
running a HPT L2 guest, we are going to define a new hypercall for
that, I suppose that is OK, though it doesn't really seem necessary.

Paul.


Re: [PATCH v3 1/6] powerpc/powernv: must include hvcall.h to get PAPR defines

2020-07-09 Thread Michael Ellerman
Nicholas Piggin  writes:
> An include goes away in future patches, which breaks compilation
> without this.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/platforms/powernv/pci-ioda-tce.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda-tce.c 
> b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> index f923359d8afc..8eba6ece7808 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda-tce.c
> @@ -15,6 +15,7 @@
>  
>  #include 
>  #include 
> +#include <asm/hvcall.h> /* share error returns with PAPR */
>  #include "pci.h"
>  
>  unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb)
> -- 
> 2.23.0

This isn't needed anymore AFAICS, since:

5f202c1a1d42 ("powerpc/powernv/ioda: Return correct error if TCE level 
allocation failed")

cheers


Re: [PATCH RESEND 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 09/07/2020 at 09:56, Santosh Sivaraj wrote:
>> Introduce a notification chain which lets users know about uncorrected memory
>> errors (UE). This would help prospective users in the pmem or nvdimm subsystem
>> to track bad blocks for better handling of persistent memory allocations.
>> 
>> Signed-off-by: Santosh S 
>> Signed-off-by: Ganesh Goudar 
>> ---
>>   arch/powerpc/include/asm/mce.h |  2 ++
>>   arch/powerpc/kernel/mce.c  | 15 +++
>>   2 files changed, 17 insertions(+)
>> 
>> Sending the two patches together so the dependencies are clear. The earlier 
>> patch reviews are
>> here: 
>> https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/
>> 
>> Rebased the patches on top of 5.8-rc4
>> 
>> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
>> index 376a395daf329..a57b0772702a9 100644
>> --- a/arch/powerpc/include/asm/mce.h
>> +++ b/arch/powerpc/include/asm/mce.h
>> @@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
>> machine_check_event *evt,
>>   unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
>>   extern void mce_common_process_ue(struct pt_regs *regs,
>>struct mce_error_info *mce_err);
>> +extern int mce_register_notifier(struct notifier_block *nb);
>> +extern int mce_unregister_notifier(struct notifier_block *nb);
>
> Using the 'extern' keyword on function declarations is pointless and 
> should be avoided in new patches. (checkpatch.pl --strict usually 
> complains about it).

I will remove that in v2, which I will be sending after addressing your
comments on the other patch.

Thanks,
Santosh

>
>>   #ifdef CONFIG_PPC_BOOK3S_64
>>   void flush_and_reload_slb(void);
>>   #endif /* CONFIG_PPC_BOOK3S_64 */
>> diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
>> index fd90c0eda2290..b7b3ed4e61937 100644
>> --- a/arch/powerpc/kernel/mce.c
>> +++ b/arch/powerpc/kernel/mce.c
>> @@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
>>   
>>   DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
>>   
>> +static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
>> +
>> +int mce_register_notifier(struct notifier_block *nb)
>> +{
>> +return blocking_notifier_chain_register(&mce_notifier_list, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(mce_register_notifier);
>> +
>> +int mce_unregister_notifier(struct notifier_block *nb)
>> +{
>> +return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
>> +}
>> +EXPORT_SYMBOL_GPL(mce_unregister_notifier);
>> +
>>   static void mce_set_error_info(struct machine_check_event *mce,
>> struct mce_error_info *mce_err)
>>   {
>> @@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
>> *work)
>>  while (__this_cpu_read(mce_ue_count) > 0) {
>>  index = __this_cpu_read(mce_ue_count) - 1;
>>  evt = this_cpu_ptr(&mce_ue_event_queue[index]);
>> +blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
>>   #ifdef CONFIG_MEMORY_FAILURE
>>  /*
>>   * This should probably queued elsewhere, but
>> 
>
> Christophe


Re: [PATCH RESEND 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Santosh Sivaraj
Christophe Leroy  writes:

> On 09/07/2020 at 09:56, Santosh Sivaraj wrote:
>> Subscribe to the MCE notification chain and add the physical address which
>> generated a memory error to the nvdimm bad range.
>> 
>> Reviewed-by: Mahesh Salgaonkar 
>> Signed-off-by: Santosh Sivaraj 
>> ---
>>   arch/powerpc/platforms/pseries/papr_scm.c | 98 ++-
>>   1 file changed, 97 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
>> b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 9c569078a09fd..5ebb1c797795d 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -13,9 +13,11 @@
>>   #include 
>>   #include 
>>   #include 
>> +#include 
>>   
>>   #include 
>>   #include 
>> +#include 
>>   
>>   #define BIND_ANY_ADDR (~0ul)
>>   
>> @@ -80,6 +82,7 @@ struct papr_scm_priv {
>>  struct resource res;
>>  struct nd_region *region;
>>  struct nd_interleave_set nd_set;
>> +struct list_head region_list;
>>   
>>  /* Protect dimm health data from concurrent read/writes */
>>  struct mutex health_mutex;
>> @@ -91,6 +94,9 @@ struct papr_scm_priv {
>>  u64 health_bitmap;
>>   };
>>   
>> +LIST_HEAD(papr_nd_regions);
>> +DEFINE_MUTEX(papr_ndr_lock);
>> +
>>   static int drc_pmem_bind(struct papr_scm_priv *p)
>>   {
>>  unsigned long ret[PLPAR_HCALL_BUFSIZE];
>> @@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
>>  dev_info(dev, "Region registered with target node %d and online 
>> node %d",
>>   target_nid, online_nid);
>>   
>> +mutex_lock(&papr_ndr_lock);
>> +list_add_tail(&p->region_list, &papr_nd_regions);
>> +mutex_unlock(&papr_ndr_lock);
>> +
>>  return 0;
>>   
>>   err:   nvdimm_bus_unregister(p->bus);
>> @@ -766,6 +776,70 @@ err:nvdimm_bus_unregister(p->bus);
>>  return -ENXIO;
>>   }
>>   
>> +static void papr_scm_add_badblock(struct nd_region *region,
>> +  struct nvdimm_bus *bus, u64 phys_addr)
>> +{
>> +u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
>> +
>> +if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
>> +pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
>> +return;
>> +}
>> +
>> +pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
>> + aligned_addr, aligned_addr + L1_CACHE_BYTES);
>> +
>> +nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
>> +}
>> +
>> +static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
>> + void *data)
>> +{
>> +struct machine_check_event *evt = data;
>> +struct papr_scm_priv *p;
>> +u64 phys_addr;
>> +bool found = false;
>> +
>> +if (evt->error_type != MCE_ERROR_TYPE_UE)
>> +return NOTIFY_DONE;
>> +
>> +if (list_empty(&papr_nd_regions))
>> +return NOTIFY_DONE;
>> +
>> +/*
>> + * The physical address obtained here is PAGE_SIZE aligned, so get the
>> + * exact address from the effective address
>> + */
>> +phys_addr = evt->u.ue_error.physical_address +
>> +(evt->u.ue_error.effective_address & ~PAGE_MASK);
>
> Not properly aligned

Will fix it.

>
>> +
>> +if (!evt->u.ue_error.physical_address_provided ||
>> +!is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
>> +return NOTIFY_DONE;
>> +
>> +/* mce notifier is called from a process context, so mutex is safe */
>> +mutex_lock(&papr_ndr_lock);
>> +list_for_each_entry(p, &papr_nd_regions, region_list) {
>> +struct resource res = p->res;
>
> Is this local struct really worth it ? Why not use p->res below directly ?
>

Right, not really needed. I can fix that in v2.

>> +
>> +if (phys_addr >= res.start && phys_addr <= res.end) {
>> +found = true;
>> +break;
>> +}
>> +}
>> +
>> +if (found)
>> +papr_scm_add_badblock(p->region, p->bus, phys_addr);
>> +
>> +mutex_unlock(&papr_ndr_lock);
>> +
>> +return found ? NOTIFY_OK : NOTIFY_DONE;
>> +}
>> +
>> +static struct notifier_block mce_ue_nb = {
>> +.notifier_call = handle_mce_ue
>> +};
>> +
>>   static int papr_scm_probe(struct platform_device *pdev)
>>   {
>>  struct device_node *dn = pdev->dev.of_node;
>> @@ -866,6 +940,10 @@ static int papr_scm_remove(struct platform_device *pdev)
>>   {
>>  struct papr_scm_priv *p = platform_get_drvdata(pdev);
>>   
>> +mutex_lock(&papr_ndr_lock);
>> +list_del(&(p->region_list));
>> +mutex_unlock(&papr_ndr_lock);
>> +
>>  nvdimm_bus_unregister(p->bus);
>>  drc_pmem_unbind(p);
>>  kfree(p->bus_desc.provider_name);
>> @@ -888,7 +966,25 @@ static struct platform_driver papr_scm_driver = {
>>  },
>>   };
>>   
>> -module_platform_driver(papr_scm_driver);
>> +static int __init papr_scm_init(void)
>> +

Re: [RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM

2020-07-09 Thread Bharata B Rao
On Thu, Jul 09, 2020 at 03:18:03PM +1000, Paul Mackerras wrote:
> On Fri, Jul 03, 2020 at 04:14:20PM +0530, Bharata B Rao wrote:
> > In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
> > H_RPT_INVALIDATE if available. The availability of this hcall
> > is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
> > DT property.
> 
> What are we going to use when nested KVM supports HPT guests at L2?
> L1 will need to do partition-scoped tlbies with R=0 via a hypercall,
> but H_RPT_INVALIDATE says in its name that it only handles radix
> page tables (i.e. R=1).

For L2 HPT guests, the old hcall is expected to work after it adds
support for the R=0 case?

The new hcall should be advertised via ibm,hypertas-functions only
for radix guests I suppose.

Regards,
Bharata.


Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-09 Thread Peter Zijlstra
On Wed, Jul 08, 2020 at 07:54:34PM -0400, Waiman Long wrote:
> On 7/8/20 4:41 AM, Peter Zijlstra wrote:
> > On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:
> > > Yes, powerpc could certainly get more performance out of the slow
> > > paths, and then there are a few parameters to tune.
> > Can you clarify? The slow path is already in use on ARM64 which is weak,
> > so I doubt there's superfluous serialization present. And Will spent a
> > fair amount of time on making that thing guarantee forward progress, so
> > there just isn't too much room to play.
> > 
> > > We don't have a good alternate patching for function calls yet, but
> > > that would be something to do for native vs pv.
> > Going by your jump_label implementation, support for static_call should
> > be fairly straightforward too, no?
> > 
> >https://lkml.kernel.org/r/20200624153024.794671...@infradead.org
> > 
> Speaking of static_call, I am also looking forward to it. Do you have an
> idea when that will be merged?

0day had one crash on the last round; I think Steve sent a fix for that
last night and I'll go look at it.

That said, the last posting got 0 feedback, so either everybody is
really happy with it, or not interested. So let us know in the thread,
with some review feedback.

Once I get through enough of the inbox to actually find the fix and test
it, I'll also update the thread, and maybe threaten to merge it if
everybody stays silent :-)
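
For reference, the kind of native-vs-pv switch being discussed would look
roughly like this with static_call() (illustrative sketch only; the powerpc
wiring and the pv_qspinlock_init()/powerpc_* names below are assumptions, not
code from either series):

#include <linux/init.h>
#include <linux/static_call.h>
#include <asm/firmware.h>
#include <asm/qspinlock.h>

/* default to the native slowpath; a shared-processor LPAR retargets it at boot */
DEFINE_STATIC_CALL(pv_qspin_slowpath, native_queued_spin_lock_slowpath);

void powerpc_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	static_call(pv_qspin_slowpath)(lock, val);
}

void __init pv_qspinlock_init(void)
{
	if (firmware_has_feature(FW_FEATURE_SPLPAR))
		static_call_update(pv_qspin_slowpath,
				   __pv_queued_spin_lock_slowpath);
}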


Re: [PATCH RESEND 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Christophe Leroy




On 09/07/2020 at 09:56, Santosh Sivaraj wrote:

Subscribe to the MCE notification chain and add the physical address which
generated a memory error to the nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
---
  arch/powerpc/platforms/pseries/papr_scm.c | 98 ++-
  1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..5ebb1c797795d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
  #include 
  #include 
  #include 
+#include 
  
  #include 

  #include 
+#include 
  
  #define BIND_ANY_ADDR (~0ul)
  
@@ -80,6 +82,7 @@ struct papr_scm_priv {

struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
  
  	/* Protect dimm health data from concurrent read/writes */

struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
  };
  
+LIST_HEAD(papr_nd_regions);

+DEFINE_MUTEX(papr_ndr_lock);
+
  static int drc_pmem_bind(struct papr_scm_priv *p)
  {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online node 
%d",
 target_nid, online_nid);
  
+	mutex_lock(&papr_ndr_lock);

+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
  
  err:	nvdimm_bus_unregister(p->bus);

@@ -766,6 +776,70 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
  }
  
+static void papr_scm_add_badblock(struct nd_region *region,

+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);


Not properly aligned


+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   struct resource res = p->res;


Is this local struct really worth it ? Why not use p->res below directly ?


+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
  static int papr_scm_probe(struct platform_device *pdev)
  {
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +940,10 @@ static int papr_scm_remove(struct platform_device *pdev)
  {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
  
+	mutex_lock(&papr_ndr_lock);

+   list_del(&(p->region_list));
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +966,25 @@ static struct platform_driver papr_scm_driver = {
},
  };
  
-module_platform_driver(papr_scm_driver);

+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+return ret;


Not properly aligned.


+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregiste

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-09 Thread Zong Li
On Thu, Jul 9, 2020 at 1:05 PM Palmer Dabbelt  wrote:
>
> On Sun, 07 Jun 2020 00:59:46 PDT (-0700), a...@ghiti.fr wrote:
> > This is a preparatory patch for relocatable kernel.
> >
> > The kernel used to be linked at PAGE_OFFSET address and used to be loaded
> > physically at the beginning of the main memory. Therefore, we could use
> > the linear mapping for the kernel mapping.
> >
> > But the relocated kernel base address will be different from PAGE_OFFSET
> > and since in the linear mapping, two different virtual addresses cannot
> > point to the same physical address, the kernel mapping needs to lie outside
> > the linear mapping.
>
> I know it's been a while, but I keep opening this up to review it and just
> can't get over how ugly it is to put the kernel's linear map in the vmalloc
> region.
>
> I guess I don't understand why this is necessary at all.  Specifically: why
> can't we just relocate the kernel within the linear map?  That would let the
> bootloader put the kernel wherever it wants, modulo the physical memory size 
> we
> support.  We'd need to handle the regions that are coupled to the kernel's
> execution address, but we could just put them in an explicit memory region
> which is what we should probably be doing anyway.

The original implementation of relocation doesn't move the kernel's linear map
to the vmalloc region, and I also sent the KASLR RFC patch [1] based on that.
In the original scheme we relocate the kernel within the linear map region: we
first calculate a random value as the offset, then move the kernel image to
the new target address obtained by adding this offset to its VA and PA. That
is enough for randomizing the kernel, but it seems to me that if we want to
decouple the kernel's linear mapping, the physical mapping of RAM and the
virtual mapping of RAM, it might be good to move the kernel's mapping out of
the linear region. Even so, it is still an intrusive change. As far as I know,
only arm64 does something like that.

[1]  https://patchwork.kernel.org/project/linux-riscv/list/?series=260615
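
Very roughly, the idea in [1] is just the following (simplified pseudo-code,
not the actual RFC; all names below are placeholders):

static void __init relocate_in_linear_map(unsigned long mem_start_pa)
{
	/* pick a page-aligned random offset and shift VA and PA together,
	 * so the kernel stays inside the linear mapping */
	unsigned long offset = get_kaslr_offset() & PAGE_MASK;

	kernel_virt_addr = PAGE_OFFSET + offset;
	copy_and_relocate_kernel(kernel_virt_addr, mem_start_pa + offset);
}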



>
> > In addition, because modules and BPF must be close to the kernel (inside
> > +-2GB window), the kernel is placed at the end of the vmalloc zone minus
> > 2GB, which leaves room for modules and BPF. The kernel could not be
> > placed at the beginning of the vmalloc zone since other vmalloc
> > allocations from the kernel could get all the +-2GB window around the
> > kernel which would prevent new modules and BPF programs to be loaded.
>
> Well, that's not enough to make sure this doesn't happen -- it's just enough 
> to
> make sure it doesn't happen very quickly.  That's the same boat we're already
> in, though, so it's not like it's worse.
>
> > Signed-off-by: Alexandre Ghiti 
> > Reviewed-by: Zong Li 
> > ---
> >  arch/riscv/boot/loader.lds.S |  3 +-
> >  arch/riscv/include/asm/page.h| 10 +-
> >  arch/riscv/include/asm/pgtable.h | 38 ++---
> >  arch/riscv/kernel/head.S |  3 +-
> >  arch/riscv/kernel/module.c   |  4 +--
> >  arch/riscv/kernel/vmlinux.lds.S  |  3 +-
> >  arch/riscv/mm/init.c | 58 +---
> >  arch/riscv/mm/physaddr.c |  2 +-
> >  8 files changed, 88 insertions(+), 33 deletions(-)
> >
> > diff --git a/arch/riscv/boot/loader.lds.S b/arch/riscv/boot/loader.lds.S
> > index 47a5003c2e28..62d94696a19c 100644
> > --- a/arch/riscv/boot/loader.lds.S
> > +++ b/arch/riscv/boot/loader.lds.S
> > @@ -1,13 +1,14 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >
> >  #include 
> > +#include 
> >
> >  OUTPUT_ARCH(riscv)
> >  ENTRY(_start)
> >
> >  SECTIONS
> >  {
> > - . = PAGE_OFFSET;
> > + . = KERNEL_LINK_ADDR;
> >
> >   .payload : {
> >   *(.payload)
> > diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
> > index 2d50f76efe48..48bb09b6a9b7 100644
> > --- a/arch/riscv/include/asm/page.h
> > +++ b/arch/riscv/include/asm/page.h
> > @@ -90,18 +90,26 @@ typedef struct page *pgtable_t;
> >
> >  #ifdef CONFIG_MMU
> >  extern unsigned long va_pa_offset;
> > +extern unsigned long va_kernel_pa_offset;
> >  extern unsigned long pfn_base;
> >  #define ARCH_PFN_OFFSET  (pfn_base)
> >  #else
> >  #define va_pa_offset 0
> > +#define va_kernel_pa_offset  0
> >  #define ARCH_PFN_OFFSET  (PAGE_OFFSET >> PAGE_SHIFT)
> >  #endif /* CONFIG_MMU */
> >
> >  extern unsigned long max_low_pfn;
> >  extern unsigned long min_low_pfn;
> > +extern unsigned long kernel_virt_addr;
> >
> >  #define __pa_to_va_nodebug(x)((void *)((unsigned long) (x) + 
> > va_pa_offset))
> > -#define __va_to_pa_nodebug(x)((unsigned long)(x) - va_pa_offset)
> > +#define linear_mapping_va_to_pa(x)   ((unsigned long)(x) - va_pa_offset)
> > +#define kernel_mapping_va_to_pa(x)   \
> > + ((unsigned long)(x) - va_kernel_pa_offset)
> > +#define __va_to_pa_nodebug(x)\
> > + (((x) >= PAGE_OFFSET) ? \
> > + li

Re: [PATCH RESEND 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Christophe Leroy




On 09/07/2020 at 09:56, Santosh Sivaraj wrote:

Introduce a notification chain which lets users know about uncorrected memory
errors (UE). This would help prospective users in the pmem or nvdimm subsystem
to track bad blocks for better handling of persistent memory allocations.

Signed-off-by: Santosh S 
Signed-off-by: Ganesh Goudar 
---
  arch/powerpc/include/asm/mce.h |  2 ++
  arch/powerpc/kernel/mce.c  | 15 +++
  2 files changed, 17 insertions(+)

Sending the two patches together so the dependencies are clear. The earlier patch 
reviews are
here: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/

Rebased the patches on top of 5.8-rc4

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..a57b0772702a9 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
  unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
  extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+extern int mce_register_notifier(struct notifier_block *nb);
+extern int mce_unregister_notifier(struct notifier_block *nb);


Using the 'extern' keyword on function declarations is pointless and 
should be avoided in new patches. (checkpatch.pl --strict usually 
complains about it).
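
i.e. the plain declarations are enough in the header:

int mce_register_notifier(struct notifier_block *nb);
int mce_unregister_notifier(struct notifier_block *nb);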



  #ifdef CONFIG_PPC_BOOK3S_64
  void flush_and_reload_slb(void);
  #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
  
  DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
  
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);

+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
  static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
  {
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
  #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but



Christophe


[PATCH RESEND 2/2] papr/scm: Add bad memory ranges to nvdimm bad ranges

2020-07-09 Thread Santosh Sivaraj
Subscribe to the MCE notification chain and add the physical address which
generated a memory error to the nvdimm bad range.

Reviewed-by: Mahesh Salgaonkar 
Signed-off-by: Santosh Sivaraj 
---
 arch/powerpc/platforms/pseries/papr_scm.c | 98 ++-
 1 file changed, 97 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index 9c569078a09fd..5ebb1c797795d 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -13,9 +13,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #define BIND_ANY_ADDR (~0ul)
 
@@ -80,6 +82,7 @@ struct papr_scm_priv {
struct resource res;
struct nd_region *region;
struct nd_interleave_set nd_set;
+   struct list_head region_list;
 
/* Protect dimm health data from concurrent read/writes */
struct mutex health_mutex;
@@ -91,6 +94,9 @@ struct papr_scm_priv {
u64 health_bitmap;
 };
 
+LIST_HEAD(papr_nd_regions);
+DEFINE_MUTEX(papr_ndr_lock);
+
 static int drc_pmem_bind(struct papr_scm_priv *p)
 {
unsigned long ret[PLPAR_HCALL_BUFSIZE];
@@ -759,6 +765,10 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
dev_info(dev, "Region registered with target node %d and online 
node %d",
 target_nid, online_nid);
 
+   mutex_lock(&papr_ndr_lock);
+   list_add_tail(&p->region_list, &papr_nd_regions);
+   mutex_unlock(&papr_ndr_lock);
+
return 0;
 
 err:   nvdimm_bus_unregister(p->bus);
@@ -766,6 +776,70 @@ err:   nvdimm_bus_unregister(p->bus);
return -ENXIO;
 }
 
+static void papr_scm_add_badblock(struct nd_region *region,
+ struct nvdimm_bus *bus, u64 phys_addr)
+{
+   u64 aligned_addr = ALIGN_DOWN(phys_addr, L1_CACHE_BYTES);
+
+   if (nvdimm_bus_add_badrange(bus, aligned_addr, L1_CACHE_BYTES)) {
+   pr_err("Bad block registration for 0x%llx failed\n", phys_addr);
+   return;
+   }
+
+   pr_debug("Add memory range (0x%llx - 0x%llx) as bad range\n",
+aligned_addr, aligned_addr + L1_CACHE_BYTES);
+
+   nvdimm_region_notify(region, NVDIMM_REVALIDATE_POISON);
+}
+
+static int handle_mce_ue(struct notifier_block *nb, unsigned long val,
+void *data)
+{
+   struct machine_check_event *evt = data;
+   struct papr_scm_priv *p;
+   u64 phys_addr;
+   bool found = false;
+
+   if (evt->error_type != MCE_ERROR_TYPE_UE)
+   return NOTIFY_DONE;
+
+   if (list_empty(&papr_nd_regions))
+   return NOTIFY_DONE;
+
+   /*
+* The physical address obtained here is PAGE_SIZE aligned, so get the
+* exact address from the effective address
+*/
+   phys_addr = evt->u.ue_error.physical_address +
+   (evt->u.ue_error.effective_address & ~PAGE_MASK);
+
+   if (!evt->u.ue_error.physical_address_provided ||
+   !is_zone_device_page(pfn_to_page(phys_addr >> PAGE_SHIFT)))
+   return NOTIFY_DONE;
+
+   /* mce notifier is called from a process context, so mutex is safe */
+   mutex_lock(&papr_ndr_lock);
+   list_for_each_entry(p, &papr_nd_regions, region_list) {
+   struct resource res = p->res;
+
+   if (phys_addr >= res.start && phys_addr <= res.end) {
+   found = true;
+   break;
+   }
+   }
+
+   if (found)
+   papr_scm_add_badblock(p->region, p->bus, phys_addr);
+
+   mutex_unlock(&papr_ndr_lock);
+
+   return found ? NOTIFY_OK : NOTIFY_DONE;
+}
+
+static struct notifier_block mce_ue_nb = {
+   .notifier_call = handle_mce_ue
+};
+
 static int papr_scm_probe(struct platform_device *pdev)
 {
struct device_node *dn = pdev->dev.of_node;
@@ -866,6 +940,10 @@ static int papr_scm_remove(struct platform_device *pdev)
 {
struct papr_scm_priv *p = platform_get_drvdata(pdev);
 
+   mutex_lock(&papr_ndr_lock);
+   list_del(&(p->region_list));
+   mutex_unlock(&papr_ndr_lock);
+
nvdimm_bus_unregister(p->bus);
drc_pmem_unbind(p);
kfree(p->bus_desc.provider_name);
@@ -888,7 +966,25 @@ static struct platform_driver papr_scm_driver = {
},
 };
 
-module_platform_driver(papr_scm_driver);
+static int __init papr_scm_init(void)
+{
+   int ret;
+
+   ret = platform_driver_register(&papr_scm_driver);
+   if (!ret)
+   mce_register_notifier(&mce_ue_nb);
+
+return ret;
+}
+module_init(papr_scm_init);
+
+static void __exit papr_scm_exit(void)
+{
+   mce_unregister_notifier(&mce_ue_nb);
+   platform_driver_unregister(&papr_scm_driver);
+}
+module_exit(papr_scm_exit);
+
 MODULE_DEVICE_TABLE(of, papr_scm_match);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR("IBM Corporation");
-- 
2.26.2



[PATCH RESEND 1/2] powerpc/mce: Add MCE notification chain

2020-07-09 Thread Santosh Sivaraj
Introduce a notification chain which lets users know about uncorrected memory
errors (UE). This would help prospective users in the pmem or nvdimm subsystem
to track bad blocks for better handling of persistent memory allocations.

Signed-off-by: Santosh S 
Signed-off-by: Ganesh Goudar 
---
 arch/powerpc/include/asm/mce.h |  2 ++
 arch/powerpc/kernel/mce.c  | 15 +++
 2 files changed, 17 insertions(+)

Sending the two patches together so the dependencies are clear. The earlier patch 
reviews are
here: 
https://lore.kernel.org/linuxppc-dev/20200330071219.12284-1-ganes...@linux.ibm.com/

Rebased the patches on top of 5.8-rc4

diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
index 376a395daf329..a57b0772702a9 100644
--- a/arch/powerpc/include/asm/mce.h
+++ b/arch/powerpc/include/asm/mce.h
@@ -220,6 +220,8 @@ extern void machine_check_print_event_info(struct 
machine_check_event *evt,
 unsigned long addr_to_pfn(struct pt_regs *regs, unsigned long addr);
 extern void mce_common_process_ue(struct pt_regs *regs,
  struct mce_error_info *mce_err);
+extern int mce_register_notifier(struct notifier_block *nb);
+extern int mce_unregister_notifier(struct notifier_block *nb);
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void);
 #endif /* CONFIG_PPC_BOOK3S_64 */
diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c
index fd90c0eda2290..b7b3ed4e61937 100644
--- a/arch/powerpc/kernel/mce.c
+++ b/arch/powerpc/kernel/mce.c
@@ -49,6 +49,20 @@ static struct irq_work mce_ue_event_irq_work = {
 
 DECLARE_WORK(mce_ue_event_work, machine_process_ue_event);
 
+static BLOCKING_NOTIFIER_HEAD(mce_notifier_list);
+
+int mce_register_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_register(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_register_notifier);
+
+int mce_unregister_notifier(struct notifier_block *nb)
+{
+   return blocking_notifier_chain_unregister(&mce_notifier_list, nb);
+}
+EXPORT_SYMBOL_GPL(mce_unregister_notifier);
+
 static void mce_set_error_info(struct machine_check_event *mce,
   struct mce_error_info *mce_err)
 {
@@ -278,6 +292,7 @@ static void machine_process_ue_event(struct work_struct 
*work)
while (__this_cpu_read(mce_ue_count) > 0) {
index = __this_cpu_read(mce_ue_count) - 1;
evt = this_cpu_ptr(&mce_ue_event_queue[index]);
+   blocking_notifier_call_chain(&mce_notifier_list, 0, evt);
 #ifdef CONFIG_MEMORY_FAILURE
/*
 * This should probably queued elsewhere, but
-- 
2.26.2



Re: [PATCH v3 1/4] iomap: Constify ioreadX() iomem argument (as in generic implementation)

2020-07-09 Thread Krzysztof Kozlowski
On Thu, Jul 09, 2020 at 09:28:34AM +0200, Krzysztof Kozlowski wrote:
> The ioreadX() and ioreadX_rep() helpers have an inconsistent interface.  On
> some architectures the void __iomem * address argument is a pointer to const,
> on others it is not.
> 
> Implementations of ioreadX() do not modify the memory under the address,
> so they can be converted to a "const" version for const-safety and
> consistency among architectures.
> 
> Suggested-by: Geert Uytterhoeven 
> Signed-off-by: Krzysztof Kozlowski 
> Reviewed-by: Geert Uytterhoeven 
> Reviewed-by: Arnd Bergmann 

I forgot to add one more Ack here, for PowerPC:
Acked-by: Michael Ellerman  (powerpc)

https://lore.kernel.org/lkml/87ftedj0zz@mpe.ellerman.id.au/

Best regards,
Krzysztof



[PATCH v3 4/4] virtio: pci: Constify ioreadX() iomem argument (as in generic implementation)

2020-07-09 Thread Krzysztof Kozlowski
The ioreadX() helpers have an inconsistent interface.  On some architectures
the void __iomem * address argument is a pointer to const, on others it is not.

Implementations of ioreadX() do not modify the memory under the address,
so they can be converted to a "const" version for const-safety and
consistency among architectures.

Signed-off-by: Krzysztof Kozlowski 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/virtio/virtio_pci_modern.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_pci_modern.c 
b/drivers/virtio/virtio_pci_modern.c
index db93cedd262f..90eff165a719 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -27,16 +27,16 @@
  * method, i.e. 32-bit accesses for 32-bit fields, 16-bit accesses
  * for 16-bit fields and 8-bit accesses for 8-bit fields.
  */
-static inline u8 vp_ioread8(u8 __iomem *addr)
+static inline u8 vp_ioread8(const u8 __iomem *addr)
 {
return ioread8(addr);
 }
-static inline u16 vp_ioread16 (__le16 __iomem *addr)
+static inline u16 vp_ioread16 (const __le16 __iomem *addr)
 {
return ioread16(addr);
 }
 
-static inline u32 vp_ioread32(__le32 __iomem *addr)
+static inline u32 vp_ioread32(const __le32 __iomem *addr)
 {
return ioread32(addr);
 }
-- 
2.17.1



[PATCH v3 3/4] ntb: intel: Constify ioreadX() iomem argument (as in generic implementation)

2020-07-09 Thread Krzysztof Kozlowski
The ioreadX() helpers have an inconsistent interface.  On some architectures
the void __iomem * address argument is a pointer to const, on others it is not.

Implementations of ioreadX() do not modify the memory under the address,
so they can be converted to a "const" version for const-safety and
consistency among architectures.

Signed-off-by: Krzysztof Kozlowski 
Reviewed-by: Geert Uytterhoeven 
Acked-by: Dave Jiang 
---
 drivers/ntb/hw/intel/ntb_hw_gen1.c  | 2 +-
 drivers/ntb/hw/intel/ntb_hw_gen3.h  | 2 +-
 drivers/ntb/hw/intel/ntb_hw_intel.h | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/ntb/hw/intel/ntb_hw_gen1.c 
b/drivers/ntb/hw/intel/ntb_hw_gen1.c
index 423f9b8fbbcf..3185efeab487 100644
--- a/drivers/ntb/hw/intel/ntb_hw_gen1.c
+++ b/drivers/ntb/hw/intel/ntb_hw_gen1.c
@@ -1205,7 +1205,7 @@ int intel_ntb_peer_spad_write(struct ntb_dev *ntb, int 
pidx, int sidx,
   ndev->peer_reg->spad);
 }
 
-static u64 xeon_db_ioread(void __iomem *mmio)
+static u64 xeon_db_ioread(const void __iomem *mmio)
 {
return (u64)ioread16(mmio);
 }
diff --git a/drivers/ntb/hw/intel/ntb_hw_gen3.h 
b/drivers/ntb/hw/intel/ntb_hw_gen3.h
index 2bc5d8356045..dea93989942d 100644
--- a/drivers/ntb/hw/intel/ntb_hw_gen3.h
+++ b/drivers/ntb/hw/intel/ntb_hw_gen3.h
@@ -91,7 +91,7 @@
 #define GEN3_DB_TOTAL_SHIFT33
 #define GEN3_SPAD_COUNT16
 
-static inline u64 gen3_db_ioread(void __iomem *mmio)
+static inline u64 gen3_db_ioread(const void __iomem *mmio)
 {
return ioread64(mmio);
 }
diff --git a/drivers/ntb/hw/intel/ntb_hw_intel.h 
b/drivers/ntb/hw/intel/ntb_hw_intel.h
index d61fcd91714b..05e2335c9596 100644
--- a/drivers/ntb/hw/intel/ntb_hw_intel.h
+++ b/drivers/ntb/hw/intel/ntb_hw_intel.h
@@ -103,7 +103,7 @@ struct intel_ntb_dev;
 struct intel_ntb_reg {
int (*poll_link)(struct intel_ntb_dev *ndev);
int (*link_is_up)(struct intel_ntb_dev *ndev);
-   u64 (*db_ioread)(void __iomem *mmio);
+   u64 (*db_ioread)(const void __iomem *mmio);
void (*db_iowrite)(u64 db_bits, void __iomem *mmio);
unsigned long   ntb_ctl;
resource_size_t db_size;
-- 
2.17.1



[PATCH v3 2/4] rtl818x: Constify ioreadX() iomem argument (as in generic implementation)

2020-07-09 Thread Krzysztof Kozlowski
The ioreadX() helpers have an inconsistent interface.  On some architectures
the void __iomem * address argument is a pointer to const, on others it is not.

Implementations of ioreadX() do not modify the memory under the address,
so they can be converted to a "const" version for const-safety and
consistency among architectures.

Signed-off-by: Krzysztof Kozlowski 
Reviewed-by: Geert Uytterhoeven 
Acked-by: Kalle Valo 
---
 drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h 
b/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h
index 7948a2da195a..2ff00800d45b 100644
--- a/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h
+++ b/drivers/net/wireless/realtek/rtl818x/rtl8180/rtl8180.h
@@ -150,17 +150,17 @@ void rtl8180_write_phy(struct ieee80211_hw *dev, u8 addr, 
u32 data);
 void rtl8180_set_anaparam(struct rtl8180_priv *priv, u32 anaparam);
 void rtl8180_set_anaparam2(struct rtl8180_priv *priv, u32 anaparam2);
 
-static inline u8 rtl818x_ioread8(struct rtl8180_priv *priv, u8 __iomem *addr)
+static inline u8 rtl818x_ioread8(struct rtl8180_priv *priv, const u8 __iomem 
*addr)
 {
return ioread8(addr);
 }
 
-static inline u16 rtl818x_ioread16(struct rtl8180_priv *priv, __le16 __iomem 
*addr)
+static inline u16 rtl818x_ioread16(struct rtl8180_priv *priv, const __le16 
__iomem *addr)
 {
return ioread16(addr);
 }
 
-static inline u32 rtl818x_ioread32(struct rtl8180_priv *priv, __le32 __iomem 
*addr)
+static inline u32 rtl818x_ioread32(struct rtl8180_priv *priv, const __le32 
__iomem *addr)
 {
return ioread32(addr);
 }
-- 
2.17.1



[PATCH v3 1/4] iomap: Constify ioreadX() iomem argument (as in generic implementation)

2020-07-09 Thread Krzysztof Kozlowski
The ioreadX() and ioreadX_rep() helpers have an inconsistent interface.  On
some architectures the void __iomem * address argument is a pointer to const,
on others it is not.

Implementations of ioreadX() do not modify the memory under the address,
so they can be converted to a "const" version for const-safety and
consistency among architectures.

Suggested-by: Geert Uytterhoeven 
Signed-off-by: Krzysztof Kozlowski 
Reviewed-by: Geert Uytterhoeven 
Reviewed-by: Arnd Bergmann 
---
 arch/alpha/include/asm/core_apecs.h   |  6 +--
 arch/alpha/include/asm/core_cia.h |  6 +--
 arch/alpha/include/asm/core_lca.h |  6 +--
 arch/alpha/include/asm/core_marvel.h  |  4 +-
 arch/alpha/include/asm/core_mcpcia.h  |  6 +--
 arch/alpha/include/asm/core_t2.h  |  2 +-
 arch/alpha/include/asm/io.h   | 12 ++---
 arch/alpha/include/asm/io_trivial.h   | 16 +++---
 arch/alpha/include/asm/jensen.h   |  2 +-
 arch/alpha/include/asm/machvec.h  |  6 +--
 arch/alpha/kernel/core_marvel.c   |  2 +-
 arch/alpha/kernel/io.c| 12 ++---
 arch/parisc/include/asm/io.h  |  4 +-
 arch/parisc/lib/iomap.c   | 72 +--
 arch/powerpc/kernel/iomap.c   | 28 +--
 arch/sh/kernel/iomap.c| 22 
 drivers/sh/clk/cpg.c  |  2 +-
 include/asm-generic/iomap.h   | 28 +--
 include/linux/io-64-nonatomic-hi-lo.h |  4 +-
 include/linux/io-64-nonatomic-lo-hi.h |  4 +-
 lib/iomap.c   | 30 +--
 21 files changed, 137 insertions(+), 137 deletions(-)

diff --git a/arch/alpha/include/asm/core_apecs.h 
b/arch/alpha/include/asm/core_apecs.h
index 0a07055bc0fe..2d9726fc02ef 100644
--- a/arch/alpha/include/asm/core_apecs.h
+++ b/arch/alpha/include/asm/core_apecs.h
@@ -384,7 +384,7 @@ struct el_apecs_procdata
}   \
} while (0)
 
-__EXTERN_INLINE unsigned int apecs_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int apecs_ioread8(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -420,7 +420,7 @@ __EXTERN_INLINE void apecs_iowrite8(u8 b, void __iomem 
*xaddr)
*(vuip) ((addr << 5) + base_and_type) = w;
 }
 
-__EXTERN_INLINE unsigned int apecs_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int apecs_ioread16(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -456,7 +456,7 @@ __EXTERN_INLINE void apecs_iowrite16(u16 b, void __iomem 
*xaddr)
*(vuip) ((addr << 5) + base_and_type) = w;
 }
 
-__EXTERN_INLINE unsigned int apecs_ioread32(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int apecs_ioread32(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
if (addr < APECS_DENSE_MEM)
diff --git a/arch/alpha/include/asm/core_cia.h 
b/arch/alpha/include/asm/core_cia.h
index c706a7f2b061..cb22991f6761 100644
--- a/arch/alpha/include/asm/core_cia.h
+++ b/arch/alpha/include/asm/core_cia.h
@@ -342,7 +342,7 @@ struct el_CIA_sysdata_mcheck {
 #define vuip   volatile unsigned int __force *
 #define vulp   volatile unsigned long __force *
 
-__EXTERN_INLINE unsigned int cia_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int cia_ioread8(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -374,7 +374,7 @@ __EXTERN_INLINE void cia_iowrite8(u8 b, void __iomem *xaddr)
*(vuip) ((addr << 5) + base_and_type) = w;
 }
 
-__EXTERN_INLINE unsigned int cia_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int cia_ioread16(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -404,7 +404,7 @@ __EXTERN_INLINE void cia_iowrite16(u16 b, void __iomem 
*xaddr)
*(vuip) ((addr << 5) + base_and_type) = w;
 }
 
-__EXTERN_INLINE unsigned int cia_ioread32(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int cia_ioread32(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
if (addr < CIA_DENSE_MEM)
diff --git a/arch/alpha/include/asm/core_lca.h 
b/arch/alpha/include/asm/core_lca.h
index 84d5e5b84f4f..ec86314418cb 100644
--- a/arch/alpha/include/asm/core_lca.h
+++ b/arch/alpha/include/asm/core_lca.h
@@ -230,7 +230,7 @@ union el_lca {
} while (0)
 
 
-__EXTERN_INLINE unsigned int lca_ioread8(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int lca_ioread8(const void __iomem *xaddr)
 {
unsigned long addr = (unsigned long) xaddr;
unsigned long result, base_and_type;
@@ -266,7 +266,7 @@ __EXTERN_INLINE void lca_iowrite8(u8 b, void __iomem *xaddr)
*(vuip) ((addr << 5) + base_and_type) = w;
 }
 
-__EXTERN_INLINE unsigned int lca_ioread16(void __iomem *xaddr)
+__EXTERN_INLINE unsigned int lca_ioread16(const void __iomem *xaddr)

[PATCH v3 0/4] iomap: Constify ioreadX() iomem argument

2020-07-09 Thread Krzysztof Kozlowski
Hi,

Multiple architectures are affected in the first patch and all further
patches depend on the first.

Maybe this could go in through Andrew Morton's tree?


Changes since v2

1. Drop all non-essential patches (cleanups),
2. Update also drivers/sh/clk/cpg.c .


Changes since v1

https://lore.kernel.org/lkml/1578415992-24054-1-git-send-email-k...@kernel.org/
1. Constify also ioreadX_rep() and mmio_insX(),
2. Squash lib+alpha+powerpc+parisc+sh into one patch for bisectability,
3. Add acks and reviews,
4. Re-order patches so all optional driver changes are at the end.


Description
===
The ioread8/16/32() and others have an inconsistent interface among the
architectures: some take the address as a pointer to const, some do not.

It seems there is nothing really stopping all of them from taking a
pointer to const.
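
As a purely hypothetical illustration of the const-safety this buys (not
taken from the series):

#include <linux/io.h>

static u32 read_chip_status(const void __iomem *regs)
{
	/* with a const-qualified ioread32() this needs no cast */
	return ioread32(regs + 0x10);
}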

The patchset was only compile-tested on the affected architectures.  No real
testing.


volatile

There is still an interface inconsistency between architectures around the
"volatile" qualifier:
 - include/asm-generic/io.h:static inline u32 ioread32(const volatile void 
__iomem *addr)
 - include/asm-generic/iomap.h:extern unsigned int ioread32(const void __iomem 
*);

This is still being discussed and is out of scope for this patchset.


Best regards,
Krzysztof


Krzysztof Kozlowski (4):
  iomap: Constify ioreadX() iomem argument (as in generic
implementation)
  rtl818x: Constify ioreadX() iomem argument (as in generic
implementation)
  ntb: intel: Constify ioreadX() iomem argument (as in generic
implementation)
  virtio: pci: Constify ioreadX() iomem argument (as in generic
implementation)

 arch/alpha/include/asm/core_apecs.h   |  6 +-
 arch/alpha/include/asm/core_cia.h |  6 +-
 arch/alpha/include/asm/core_lca.h |  6 +-
 arch/alpha/include/asm/core_marvel.h  |  4 +-
 arch/alpha/include/asm/core_mcpcia.h  |  6 +-
 arch/alpha/include/asm/core_t2.h  |  2 +-
 arch/alpha/include/asm/io.h   | 12 ++--
 arch/alpha/include/asm/io_trivial.h   | 16 ++---
 arch/alpha/include/asm/jensen.h   |  2 +-
 arch/alpha/include/asm/machvec.h  |  6 +-
 arch/alpha/kernel/core_marvel.c   |  2 +-
 arch/alpha/kernel/io.c| 12 ++--
 arch/parisc/include/asm/io.h  |  4 +-
 arch/parisc/lib/iomap.c   | 72 +--
 arch/powerpc/kernel/iomap.c   | 28 
 arch/sh/kernel/iomap.c| 22 +++---
 .../realtek/rtl818x/rtl8180/rtl8180.h |  6 +-
 drivers/ntb/hw/intel/ntb_hw_gen1.c|  2 +-
 drivers/ntb/hw/intel/ntb_hw_gen3.h|  2 +-
 drivers/ntb/hw/intel/ntb_hw_intel.h   |  2 +-
 drivers/sh/clk/cpg.c  |  2 +-
 drivers/virtio/virtio_pci_modern.c|  6 +-
 include/asm-generic/iomap.h   | 28 
 include/linux/io-64-nonatomic-hi-lo.h |  4 +-
 include/linux/io-64-nonatomic-lo-hi.h |  4 +-
 lib/iomap.c   | 30 
 26 files changed, 146 insertions(+), 146 deletions(-)

-- 
2.17.1



RE: [PATCH 05/20] Documentation: fpga: eliminate duplicated word

2020-07-09 Thread Wu, Hao
> Subject: [PATCH 05/20] Documentation: fpga: eliminate duplicated word
> 
> Drop the doubled word "this".
> 
> Signed-off-by: Randy Dunlap 
> Cc: Jonathan Corbet 
> Cc: linux-...@vger.kernel.org
> Cc: Wu Hao 
> Cc: linux-f...@vger.kernel.org

Acked-by: Wu Hao 

Thanks Randy.

Hao


Re: [PATCH v2 10/10] powerpc/perf: Add extended regs support for power10 platform

2020-07-09 Thread Athira Rajeev


> On 08-Jul-2020, at 5:34 PM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>> Include capability flag `PERF_PMU_CAP_EXTENDED_REGS` for power10
>> and expose MMCR3, SIER2, SIER3 registers as part of extended regs.
>> Also introduce `PERF_REG_PMU_MASK_31` to define extended mask
>> value at runtime for power10
>> 
>> Signed-off-by: Athira Rajeev 
>> ---
>> arch/powerpc/include/uapi/asm/perf_regs.h   |  6 ++
>> arch/powerpc/perf/perf_regs.c   | 10 +-
>> arch/powerpc/perf/power10-pmu.c |  6 ++
>> tools/arch/powerpc/include/uapi/asm/perf_regs.h |  6 ++
>> tools/perf/arch/powerpc/include/perf_regs.h |  3 +++
>> tools/perf/arch/powerpc/util/perf_regs.c|  6 ++
> 
> Please split into a kernel patch and a tools patch. And cc the tools people.

Ok sure
> 
>> 6 files changed, 36 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/include/uapi/asm/perf_regs.h 
>> b/arch/powerpc/include/uapi/asm/perf_regs.h
>> index 485b1d5..020b51c 100644
>> --- a/arch/powerpc/include/uapi/asm/perf_regs.h
>> +++ b/arch/powerpc/include/uapi/asm/perf_regs.h
>> @@ -52,6 +52,9 @@ enum perf_event_powerpc_regs {
>>  PERF_REG_POWERPC_MMCR0,
>>  PERF_REG_POWERPC_MMCR1,
>>  PERF_REG_POWERPC_MMCR2,
>> +PERF_REG_POWERPC_MMCR3,
>> +PERF_REG_POWERPC_SIER2,
>> +PERF_REG_POWERPC_SIER3,
>>  /* Max regs without the extended regs */
>>  PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>> };
>> @@ -62,4 +65,7 @@ enum perf_event_powerpc_regs {
>> #define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 
>> 1) \
>>  - PERF_REG_PMU_MASK)
>> 
>> +/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
>> +#define PERF_REG_PMU_MASK_31(((1ULL << (PERF_REG_POWERPC_SIER3 + 
>> 1)) - 1) \
>> +- PERF_REG_PMU_MASK)
> 
> Wrapping that provides no benefit, just let it be long.
> 

Ok,

>> #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
>> diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
>> index c8a7e8c..c969935 100644
>> --- a/arch/powerpc/perf/perf_regs.c
>> +++ b/arch/powerpc/perf/perf_regs.c
>> @@ -81,6 +81,12 @@ static u64 get_ext_regs_value(int idx)
>>  return mfspr(SPRN_MMCR1);
>>  case PERF_REG_POWERPC_MMCR2:
>>  return mfspr(SPRN_MMCR2);
>> +case PERF_REG_POWERPC_MMCR3:
>> +return mfspr(SPRN_MMCR3);
>> +case PERF_REG_POWERPC_SIER2:
>> +return mfspr(SPRN_SIER2);
>> +case PERF_REG_POWERPC_SIER3:
>> +return mfspr(SPRN_SIER3);
> 
> Indentation is wrong.
> 
>>  default: return 0;
>>  }
>> }
>> @@ -89,7 +95,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
>> {
>>  u64 PERF_REG_EXTENDED_MAX;
>> 
>> -if (cpu_has_feature(CPU_FTR_ARCH_300))
>> +if (cpu_has_feature(CPU_FTR_ARCH_31))
>> +PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_SIER3 + 1;
> 
> There's no way to know if that's correct other than going back to the
> header to look at the list of values.
> 
> So instead you should define it in the header, next to the other values,
> with a meaningful name, like PERF_REG_MAX_ISA_31 or something.
> 
>> +else if (cpu_has_feature(CPU_FTR_ARCH_300))
>>  PERF_REG_EXTENDED_MAX = PERF_REG_POWERPC_MMCR2 + 1;
> 
> Same.
> 

Ok, will make this change
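
Something along these lines, I suppose (macro names as per your suggestion,
values matching the enum in the uapi header):

#define PERF_REG_MAX_ISA_300	(PERF_REG_POWERPC_MMCR2 + 1)
#define PERF_REG_MAX_ISA_31	(PERF_REG_POWERPC_SIER3 + 1)

	if (cpu_has_feature(CPU_FTR_ARCH_31))
		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_31;
	else if (cpu_has_feature(CPU_FTR_ARCH_300))
		PERF_REG_EXTENDED_MAX = PERF_REG_MAX_ISA_300;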

>>  if (idx == PERF_REG_POWERPC_SIER &&
>> diff --git a/arch/powerpc/perf/power10-pmu.c 
>> b/arch/powerpc/perf/power10-pmu.c
>> index 07fb919..51082d6 100644
>> --- a/arch/powerpc/perf/power10-pmu.c
>> +++ b/arch/powerpc/perf/power10-pmu.c
>> @@ -86,6 +86,8 @@
>> #define POWER10_MMCRA_IFM3   0xC000UL
>> #define POWER10_MMCRA_BHRB_MASK  0xC000UL
>> 
>> +extern u64 mask_var;
> 
> Why is it extern? Also not a good name for a global.
> 
> Hang on, it's not even used? Is there some macro magic somewhere?

This is defined in patch 8, "powerpc/perf: Add support for outputting extended
regs in perf intr_regs", which adds the base support for extended regs in
powerpc. The current patch covers the changes to support it for power10.

`mask_var` is used to define `PERF_REG_EXTENDED_MASK` at run time.
`PERF_REG_EXTENDED_MASK` basically contains the mask value of the supported
extended registers, and since the supported registers may differ between
processor versions, we define this mask at runtime.

The #define is done in arch/powerpc/include/asm/perf_event_server.h (in
patch 8). In the PMU driver init, we will set the respective mask value (in
the quoted code below). Hence it is extern.
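
Roughly, the wiring looks like this (simplified; the actual hunks are in
patch 8 and in the power10 PMU init):

/* arch/powerpc/include/asm/perf_event_server.h */
extern u64 mask_var;
#define PERF_REG_EXTENDED_MASK	mask_var

/* and the power10 PMU init selects the ISA v3.1 register set: */
	mask_var = PERF_REG_PMU_MASK_31;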

Sorry for the confusion here. 

Thanks
Athira

> 
>> /* Table of alternatives, sorted by column 0 */
>> static const unsigned int power10_event_alternatives[][MAX_ALT] = {
>>  { PM_RUN_CYC_ALT,   PM_RUN_CYC },
>> @@ -397,6 +399,7 @@ static void power10_config_bhrb(u64 pmu_bhrb_filter)
>>  .cache
