Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

2021-10-21 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of October 22, 2021 10:19 am:
> Nathan Lynch  writes:
>> Athira Rajeev  writes:
>>> During Live Partition Migration (LPM), it is observed that perf
>>> counter values report zero after migration completes. However,
>>> 'perf stat' with a workload continues to show counts post migration
>>> since the PMU gets disabled/enabled during sched switches. But in case
>>> of system/cpu wide monitoring, zero counts were reported with 'perf
>>> stat' after migration completion.
>>>
>>> Example:
>>>  ./perf stat -e r1001e -I 1000
>>> time counts unit events
>>>  1.001010437 22,137,414  r1001e
>>>  2.002495447 15,455,821  r1001e
>>> <<>> As seen in the logs below, the counter values show zero
>>> after migration is completed.
>>> <<>>
>>> 86.142535370 129,392,333,440  r1001e
>>> 87.144714617  0  r1001e
>>> 88.146526636  0  r1001e
>>> 89.148085029  0  r1001e
>>
>> Confirmed in my environment:
>>
>> 51.099987985 300,338  cache-misses
>> 52.101839374 296,586  cache-misses
>> 53.116089796 263,150  cache-misses
>> 54.117949249 232,290  cache-misses
>> 55.602029375 68,700,421,711  cache-misses
>> 56.610073969  0  cache-misses
>> 57.614732000  0  cache-misses
>>
>> I wonder what it means that there is a very unlikely huge value before
>> the counter stops working -- I believe your example has this phenomenon
>> too.
> 
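> AFAICS the patch is not reading the PMC values before the migration, so
> I suspect we're losing some counts just before the migration and then
> the delta is going negative somewhere, leading to an implausibly large
> count.
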
My suggested change I think should take care of that.

Thanks,
Nick


Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.15-5 tag

2021-10-21 Thread pr-tracker-bot
The pull request you sent on Thu, 21 Oct 2021 22:32:45 +1100:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> tags/powerpc-5.15-5

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/0a3221b65874b5089f1742de59ef89f032b9f2ea

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

2021-10-21 Thread Madhavan Srinivasan



On 10/21/21 11:03 PM, Nathan Lynch wrote:

Nicholas Piggin  writes:

Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:

During Live Partition Migration (LPM), it is observed that perf
counter values report zero after migration completes. However,
'perf stat' with a workload continues to show counts post migration
since the PMU gets disabled/enabled during sched switches. But in case
of system/cpu wide monitoring, zero counts were reported with 'perf
stat' after migration completion.

Example:
  ./perf stat -e r1001e -I 1000
time counts unit events
  1.001010437 22,137,414  r1001e
  2.002495447 15,455,821  r1001e
<<>> As seen in the logs below, the counter values show zero
 after migration is completed.
<<>>
 86.142535370 129,392,333,440  r1001e
 87.144714617  0  r1001e
 88.146526636  0  r1001e
 89.148085029  0  r1001e

Here the PMU is enabled at the start of the perf session and counter
values are read at intervals. Counters are only disabled at the
end of the session. The powerpc mobility code presently does not
disable and re-enable the PMU counters during partition
migration. Also, since the PMU register values are not saved/restored
during migration, PMU registers like Monitor Mode Control Register 0
(MMCR0) and Monitor Mode Control Register 1 (MMCR1) will not contain
the values they were programmed with. Hence PMU counters will not be
enabled correctly post migration.

Fix this in mobility code by disabling and enabling the PMU on all
cpus before and after migration. The patch introduces two functions,
'mobility_pmu_disable' and 'mobility_pmu_enable'.
mobility_pmu_disable() is called before the processor threads go
to suspend state so as to disable the PMU counters. The disable is
done only if there are active events running on that cpu.
mobility_pmu_enable() is called after the processor threads are
back online to enable the PMU counters again.

Since the Performance Monitor Counters (PMCs) are not
saved/restored during LPM, the PMC value ends up being zero while
'event->hw.prev_count' holds a non-zero value. This causes problem

Interesting. Are they defined to not be migrated, or may not be
migrated?

PAPR may be silent on this... at least I haven't found anything yet. But
I'm not very familiar with perf counters.


IIUC, from the internal discussion with pHYP, migration of counters is
an OS thing.



How much assurance do we have that hardware events we've programmed on
the source can be reliably re-enabled on the destination, with the same
semantics? Aren't there some model-specific counters that don't make
sense to handle this way?


Migration to the same processor generation/model should be OK,
but not to a different generation/model (though that is a case
we need to handle). That said, this patch is to fix the issue of
the large value seen when migrating.





diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
index 9dc97d2..cea72d7 100644
--- a/arch/powerpc/include/asm/rtas.h
+++ b/arch/powerpc/include/asm/rtas.h
@@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
  static inline void read_24x7_sys_info(void) { }
  #endif
  
+#ifdef CONFIG_PPC_PERF_CTRS
+void mobility_pmu_disable(void);
+void mobility_pmu_enable(void);
+#else
+static inline void mobility_pmu_disable(void) { }
+static inline void mobility_pmu_enable(void) { }
+#endif
+
  #endif /* __KERNEL__ */
  #endif /* _POWERPC_RTAS_H */

It's not implemented in rtas, maybe consider putting this into a perf
header?

+1



Re: [PATCH 00/13] block: add_disk() error handling stragglers

2021-10-21 Thread Geoff Levand
Hi Luis,

On 10/18/21 9:15 AM, Luis Chamberlain wrote:
> On Sun, Oct 17, 2021 at 08:26:33AM -0700, Geoff Levand wrote:
>> Hi Luis,
>>
>> On 10/15/21 4:52 PM, Luis Chamberlain wrote:
>>> This patch set consists of all the straggler drivers for which we
>>> have had no patch reviews yet. I'd like to ask folks to please
>>> consider chiming in, especially if you're the maintainer for the driver.
>>> Additionally if you can specify if you'll take the patch in yourself or
>>> if you want Jens to take it, that'd be great too.
>>
>> Do you have a git repo with the patch set applied that I can use to test 
>> with?
> 
> Sure, although the second to last patch is in a state of flux given
> the ataflop driver currently is broken and so we're seeing how to fix
> that first:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20211011-for-axboe-add-disk-error-handling

That branch has so many changes applied on top of the base v5.15-rc4
that the patches I need to apply to test on PS3 with don't apply.

Do you have something closer to say v5.15-rc5?  Preferred would be
just your add_disk() error handling patches plus what they depend
on.

Thanks.

-Geoff


Re: [PATCH 2/2] soc: fsl: rcpm: Make use of the helper function devm_platform_ioremap_resource()

2021-10-21 Thread Li Yang
On Wed, Sep 8, 2021 at 2:20 AM Cai Huoqing  wrote:
>
> Use the devm_platform_ioremap_resource() helper instead of
> calling platform_get_resource() and devm_ioremap_resource()
> separately
>
> Signed-off-by: Cai Huoqing 

Applied for next.  Thanks.

> ---
>  drivers/soc/fsl/rcpm.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/drivers/soc/fsl/rcpm.c b/drivers/soc/fsl/rcpm.c
> index 90d3f4060b0c..3d0cae30c769 100644
> --- a/drivers/soc/fsl/rcpm.c
> +++ b/drivers/soc/fsl/rcpm.c
> @@ -146,7 +146,6 @@ static const struct dev_pm_ops rcpm_pm_ops = {
>  static int rcpm_probe(struct platform_device *pdev)
>  {
> struct device   *dev = &pdev->dev;
> -   struct resource *r;
> struct rcpm *rcpm;
> int ret;
>
> @@ -154,11 +153,7 @@ static int rcpm_probe(struct platform_device *pdev)
> if (!rcpm)
> return -ENOMEM;
>
> -   r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> -   if (!r)
> -   return -ENODEV;
> -
> -   rcpm->ippdexpcr_base = devm_ioremap_resource(&pdev->dev, r);
> +   rcpm->ippdexpcr_base = devm_platform_ioremap_resource(pdev, 0);
> if (IS_ERR(rcpm->ippdexpcr_base)) {
> ret =  PTR_ERR(rcpm->ippdexpcr_base);
> return ret;
> --
> 2.25.1
>


Re: [PATCH 1/2] soc: fsl: guts: Make use of the helper function devm_platform_ioremap_resource()

2021-10-21 Thread Li Yang
On Wed, Sep 8, 2021 at 2:19 AM Cai Huoqing  wrote:
>
> Use the devm_platform_ioremap_resource() helper instead of
> calling platform_get_resource() and devm_ioremap_resource()
> separately
>
> Signed-off-by: Cai Huoqing 

Applied for next.  Thanks.

> ---
>  drivers/soc/fsl/guts.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
> index d5e9a5f2c087..072473a16f4d 100644
> --- a/drivers/soc/fsl/guts.c
> +++ b/drivers/soc/fsl/guts.c
> @@ -140,7 +140,6 @@ static int fsl_guts_probe(struct platform_device *pdev)
>  {
> struct device_node *np = pdev->dev.of_node;
> struct device *dev = &pdev->dev;
> -   struct resource *res;
> const struct fsl_soc_die_attr *soc_die;
> const char *machine;
> u32 svr;
> @@ -152,8 +151,7 @@ static int fsl_guts_probe(struct platform_device *pdev)
>
> guts->little_endian = of_property_read_bool(np, "little-endian");
>
> -   res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> -   guts->regs = devm_ioremap_resource(dev, res);
> +   guts->regs = devm_platform_ioremap_resource(pdev, 0);
> if (IS_ERR(guts->regs))
> return PTR_ERR(guts->regs);
>
> --
> 2.25.1
>


Re: [PATCH] soc: fsl: guts: Fix a resource leak in the error handling path of 'fsl_guts_probe()'

2021-10-21 Thread Li Yang
On Wed, Aug 18, 2021 at 4:23 PM Christophe JAILLET
 wrote:
>
> If an error occurs after 'of_find_node_by_path()', the reference taken for
> 'root' will never be released and some memory will leak.

Thanks for finding this.  This truly is a problem.

>
> Instead of adding an error handling path and modifying all the
> 'return -SOMETHING' into 'goto errorpath', use 'devm_add_action_or_reset()'
> to release the reference when needed.
>
> Simplify the remove function accordingly.
>
> As an extra benefit, the 'root' global variable can now be removed as well.
>
> Fixes: 3c0d64e867ed ("soc: fsl: guts: reuse machine name from device tree")
> Signed-off-by: Christophe JAILLET 
> ---
> Compile tested only
> ---
>  drivers/soc/fsl/guts.c | 16 ++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c
> index d5e9a5f2c087..4d9476c7b87c 100644
> --- a/drivers/soc/fsl/guts.c
> +++ b/drivers/soc/fsl/guts.c
> @@ -28,7 +28,6 @@ struct fsl_soc_die_attr {
>  static struct guts *guts;
>  static struct soc_device_attribute soc_dev_attr;
>  static struct soc_device *soc_dev;
> -static struct device_node *root;
>
>
>  /* SoC die attribute definition for QorIQ platform */
> @@ -136,14 +135,23 @@ static u32 fsl_guts_get_svr(void)
> return svr;
>  }
>
> +static void fsl_guts_put_root(void *data)
> +{
> +   struct device_node *root = data;
> +
> +   of_node_put(root);
> +}
> +
>  static int fsl_guts_probe(struct platform_device *pdev)
>  {
> struct device_node *np = pdev->dev.of_node;
> struct device *dev = &pdev->dev;
> +   struct device_node *root;
> struct resource *res;
> const struct fsl_soc_die_attr *soc_die;
> const char *machine;
> u32 svr;
> +   int ret;
>
> /* Initialize guts */
> guts = devm_kzalloc(dev, sizeof(*guts), GFP_KERNEL);
> @@ -159,6 +167,10 @@ static int fsl_guts_probe(struct platform_device *pdev)
>
> /* Register soc device */
> root = of_find_node_by_path("/");
> +   ret = devm_add_action_or_reset(dev, fsl_guts_put_root, root);
> +   if (ret)
> +   return ret;

We probably only need to hold the reference when we do get "machine"
from the device tree, otherwise we can put it directly.

Or maybe we just maintain a local copy of string machine which means
we can release the reference right away?

> +
> if (of_property_read_string(root, "model", &machine))
> of_property_read_string_index(root, "compatible", 0, &machine);
> if (machine)
> @@ -197,7 +209,7 @@ static int fsl_guts_probe(struct platform_device *pdev)
>  static int fsl_guts_remove(struct platform_device *dev)
>  {
> soc_device_unregister(soc_dev);
> -   of_node_put(root);
> +
> return 0;
>  }
>
> --
> 2.30.2
>
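
The cleanup pattern discussed in this patch can be sketched outside the kernel. The following is a minimal user-space model, not the devres implementation: `sim_device`, `sim_add_action_or_reset()`, `sim_release_all()` and `sim_put_ref()` are hypothetical names invented for illustration, and the fixed-size action table is an assumption made to keep the sketch short. It shows the two properties the patch relies on: a registered action runs automatically when the device is released, and if registration itself fails the action runs immediately, so the reference is dropped either way.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_MAX_ACTIONS 4

typedef void (*sim_action_fn)(void *data);

/* Hypothetical stand-in for a device with a list of managed cleanup actions. */
struct sim_device {
	struct {
		sim_action_fn fn;
		void *data;
	} actions[SIM_MAX_ACTIONS];
	size_t nr_actions;
};

/* Register a cleanup action; if that fails, run it right away ("or reset"). */
static int sim_add_action_or_reset(struct sim_device *dev,
				   sim_action_fn fn, void *data)
{
	if (dev->nr_actions == SIM_MAX_ACTIONS) {
		fn(data);	/* registration failed: clean up immediately */
		return -1;
	}
	dev->actions[dev->nr_actions].fn = fn;
	dev->actions[dev->nr_actions].data = data;
	dev->nr_actions++;
	return 0;
}

/* Run the registered actions in reverse order of registration. */
static void sim_release_all(struct sim_device *dev)
{
	while (dev->nr_actions > 0) {
		dev->nr_actions--;
		dev->actions[dev->nr_actions].fn(dev->actions[dev->nr_actions].data);
	}
}

/* Stand-in for of_node_put(): drop one reference on a counted object. */
static void sim_put_ref(void *data)
{
	int *refcount = data;

	assert(*refcount > 0);
	(*refcount)--;
}
```

In this model the caller never needs a `goto` error path for the reference: every early return after a successful registration is covered by `sim_release_all()`, and a failed registration has already dropped the reference.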


Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

2021-10-21 Thread Michael Ellerman
Nathan Lynch  writes:
> Athira Rajeev  writes:
>> During Live Partition Migration (LPM), it is observed that perf
>> counter values report zero after migration completes. However,
>> 'perf stat' with a workload continues to show counts post migration
>> since the PMU gets disabled/enabled during sched switches. But in case
>> of system/cpu wide monitoring, zero counts were reported with 'perf
>> stat' after migration completion.
>>
>> Example:
>>  ./perf stat -e r1001e -I 1000
>> time counts unit events
>>  1.001010437 22,137,414  r1001e
>>  2.002495447 15,455,821  r1001e
>> <<>> As seen in the logs below, the counter values show zero
>> after migration is completed.
>> <<>>
>> 86.142535370 129,392,333,440  r1001e
>> 87.144714617  0  r1001e
>> 88.146526636  0  r1001e
>> 89.148085029  0  r1001e
>
> Confirmed in my environment:
>
> 51.099987985 300,338  cache-misses
> 52.101839374 296,586  cache-misses
> 53.116089796 263,150  cache-misses
> 54.117949249 232,290  cache-misses
> 55.602029375 68,700,421,711  cache-misses
> 56.610073969  0  cache-misses
> 57.614732000  0  cache-misses
>
> I wonder what it means that there is a very unlikely huge value before
> the counter stops working -- I believe your example has this phenomenon
> too.

AFAICS the patch is not reading the PMC values before the migration, so
I suspect we're losing some counts just before the migration and then
the delta is going negative somewhere, leading to an implausibly large
count.

cheers
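
The failure mode suspected above can be sketched in user space. This is a minimal model under stated assumptions, not the powerpc perf code (which has its own delta computation): `sim_event`, `sim_event_update()` and `sim_event_fold_and_reset()` are hypothetical names, and the counter is modelled as a plain 64-bit value. It shows how a stale saved value combined with a counter that reads zero after migration produces a "negative" delta that appears as a very large unsigned count, and how reading the counter before it is lost avoids that.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one perf event backed by a hardware counter. */
struct sim_event {
	uint64_t prev_count;	/* raw counter value at the last read */
	uint64_t total;		/* accumulated event count */
};

/* Accumulate the delta between the raw counter and the last saved value. */
static uint64_t sim_event_update(struct sim_event *ev, uint64_t raw)
{
	uint64_t delta = raw - ev->prev_count;	/* wraps if raw < prev_count */

	ev->prev_count = raw;
	ev->total += delta;
	return delta;
}

/*
 * One possible "read before migration" step: fold the pending counts
 * into the total, then restart from a known raw value of zero.
 */
static void sim_event_fold_and_reset(struct sim_event *ev, uint64_t raw)
{
	assert(raw >= ev->prev_count);	/* the counter is still valid here */
	sim_event_update(ev, raw);
	ev->prev_count = 0;
}
```

With `prev_count` at 1000 and a raw value of 0 after migration, `sim_event_update()` returns 2^64 - 1000 in this model. If `sim_event_fold_and_reset()` runs first, the next read of a zeroed counter yields a delta of 0 and counting resumes from there.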


[PATCH v3 18/18] powerpc/microwatt: Don't select the hash MMU code

2021-10-21 Thread Nicholas Piggin
Microwatt is radix-only, so it does not require hash MMU support.

This saves 20kB compressed dtbImage and 56kB vmlinux size.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/configs/microwatt_defconfig | 2 +-
 arch/powerpc/platforms/microwatt/Kconfig | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index 6e62966730d3..de1dcff05734 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -16,6 +16,7 @@ CONFIG_EMBEDDED=y
 # CONFIG_SLAB_MERGE_DEFAULT is not set
 CONFIG_PPC64=y
 CONFIG_POWER9_CPU=y
+# CONFIG_PPC_64S_HASH_MMU is not set
 # CONFIG_PPC_KUEP is not set
 # CONFIG_PPC_KUAP is not set
 CONFIG_CPU_LITTLE_ENDIAN=y
@@ -27,7 +28,6 @@ CONFIG_PPC_MICROWATT=y
 # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
 CONFIG_CPU_FREQ=y
 CONFIG_HZ_100=y
-# CONFIG_PPC_MEM_KEYS is not set
 # CONFIG_SECCOMP is not set
 # CONFIG_MQ_IOSCHED_KYBER is not set
 # CONFIG_COREDUMP is not set
diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
index 823192e9d38a..5e320f49583a 100644
--- a/arch/powerpc/platforms/microwatt/Kconfig
+++ b/arch/powerpc/platforms/microwatt/Kconfig
@@ -5,7 +5,6 @@ config PPC_MICROWATT
select PPC_XICS
select PPC_ICS_NATIVE
select PPC_ICP_NATIVE
-   select PPC_HASH_MMU_NATIVE if PPC_64S_HASH_MMU
select PPC_UDBG_16550
select ARCH_RANDOM
help
-- 
2.23.0



[PATCH v3 17/18] powerpc/configs/microwatt: add POWER9_CPU

2021-10-21 Thread Nicholas Piggin
Microwatt implements a subset of ISA v3.0 (which is equivalent to
the POWER9_CPU option).

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/configs/microwatt_defconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
index 9465209b8c5b..6e62966730d3 100644
--- a/arch/powerpc/configs/microwatt_defconfig
+++ b/arch/powerpc/configs/microwatt_defconfig
@@ -15,6 +15,7 @@ CONFIG_EMBEDDED=y
 # CONFIG_COMPAT_BRK is not set
 # CONFIG_SLAB_MERGE_DEFAULT is not set
 CONFIG_PPC64=y
+CONFIG_POWER9_CPU=y
 # CONFIG_PPC_KUEP is not set
 # CONFIG_PPC_KUAP is not set
 CONFIG_CPU_LITTLE_ENDIAN=y
-- 
2.23.0



[PATCH v3 16/18] powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU

2021-10-21 Thread Nicholas Piggin
Compiling out hash support code when CONFIG_PPC_64S_HASH_MMU=n saves
128kB kernel image size (90kB text) on powernv_defconfig minus KVM,
350kB on pseries_defconfig minus KVM, 40kB on a tiny config.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig                      |  2 +-
 arch/powerpc/include/asm/book3s/64/mmu.h  | 19 +--
 .../include/asm/book3s/64/tlbflush-hash.h |  7 
 arch/powerpc/include/asm/book3s/pgtable.h |  4 +++
 arch/powerpc/include/asm/mmu_context.h    |  2 ++
 arch/powerpc/include/asm/paca.h           |  8 +
 arch/powerpc/kernel/asm-offsets.c         |  2 ++
 arch/powerpc/kernel/entry_64.S            |  4 +--
 arch/powerpc/kernel/exceptions-64s.S      | 16 +
 arch/powerpc/kernel/mce.c                 |  2 +-
 arch/powerpc/kernel/mce_power.c           | 10 --
 arch/powerpc/kernel/paca.c                | 18 --
 arch/powerpc/kernel/process.c             | 13 +++
 arch/powerpc/kernel/prom.c                |  2 ++
 arch/powerpc/kernel/setup_64.c            |  5 +++
 arch/powerpc/kexec/core_64.c              |  4 +--
 arch/powerpc/kexec/ranges.c               |  4 +++
 arch/powerpc/mm/book3s64/Makefile         | 15 
 arch/powerpc/mm/book3s64/hugetlbpage.c    |  2 ++
 arch/powerpc/mm/book3s64/mmu_context.c    | 34 +++
 arch/powerpc/mm/book3s64/pgtable.c        |  2 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  4 +++
 arch/powerpc/mm/copro_fault.c             |  2 ++
 arch/powerpc/mm/ptdump/Makefile           |  2 +-
 arch/powerpc/platforms/powernv/idle.c     |  2 ++
 arch/powerpc/platforms/powernv/setup.c    |  2 ++
 arch/powerpc/platforms/pseries/lpar.c     | 11 --
 arch/powerpc/platforms/pseries/lparcfg.c  |  2 +-
 arch/powerpc/platforms/pseries/mobility.c |  6 
 arch/powerpc/platforms/pseries/ras.c      |  2 ++
 arch/powerpc/platforms/pseries/reconfig.c |  2 ++
 arch/powerpc/platforms/pseries/setup.c    |  6 ++--
 arch/powerpc/xmon/xmon.c                  |  8 +++--
 drivers/misc/lkdtm/Makefile               |  2 +-
 drivers/misc/lkdtm/core.c                 |  2 +-
 35 files changed, 177 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 94cdd1cb94a2..e1f8b4a24d5b 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,7 +129,7 @@ config PPC
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_MEMBARRIER_SYNC_CORE
-   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN   if PPC_BOOK3S_64
+   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN   if PPC_64S_HASH_MMU
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PHYS_TO_DMA
diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
index c02f42d1031e..d94ebae386b6 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -98,7 +98,9 @@ typedef struct {
 * from EA and new context ids to build the new VAs.
 */
mm_context_id_t id;
+#ifdef CONFIG_PPC_64S_HASH_MMU
mm_context_id_t extended_id[TASK_SIZE_USER64/TASK_CONTEXT_SIZE];
+#endif
};
 
/* Number of bits in the mm_cpumask */
@@ -110,7 +112,9 @@ typedef struct {
/* Number of user space windows opened in process mm_context */
atomic_t vas_windows;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
struct hash_mm_context *hash_context;
+#endif
 
void __user *vdso;
/*
@@ -133,6 +137,7 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#ifdef CONFIG_PPC_64S_HASH_MMU
 static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
 {
return ctx->hash_context->user_psize;
@@ -193,8 +198,15 @@ static inline struct subpage_prot_table *mm_ctx_subpage_prot(mm_context_t *ctx)
 extern int mmu_linear_psize;
 extern int mmu_virtual_psize;
 extern int mmu_vmalloc_psize;
-extern int mmu_vmemmap_psize;
 extern int mmu_io_psize;
+#else /* CONFIG_PPC_64S_HASH_MMU */
+#ifdef CONFIG_PPC_64K_PAGES
+#define mmu_virtual_psize MMU_PAGE_64K
+#else
+#define mmu_virtual_psize MMU_PAGE_4K
+#endif
+#endif
+extern int mmu_vmemmap_psize;
 
 /* MMU initialization */
 void mmu_early_init_devtree(void);
@@ -233,7 +245,8 @@ static inline void setup_initial_memory_limit(phys_addr_t first_memblock_base,
 * know which translations we will pick. Hence go with hash
 * restrictions.
 */
-   return hash__setup_initial_memory_limit(first_memblock_base,
+   if (!radix_enabled())
+   return hash__setup_initial_memory_limit(first_memblock_base,
   first_memblock_size);
 }
 
@@ -255,6 +268,7 @@ static inline void radix_init_pseries(void) { }
 void cleanup_cpu_mmu_context(void);
 #endif
 

[PATCH v3 15/18] powerpc/64s: Make hash MMU support configurable

2021-10-21 Thread Nicholas Piggin
This adds Kconfig selection which allows 64s hash MMU support to be
disabled. It can be disabled if radix support is enabled, the minimum
supported CPU type is POWER9 (or higher), and KVM is not selected.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig |  3 ++-
 arch/powerpc/include/asm/mmu.h   | 16 ++---
 arch/powerpc/kernel/dt_cpu_ftrs.c| 14 
 arch/powerpc/kvm/Kconfig |  1 +
 arch/powerpc/mm/init_64.c| 15 +---
 arch/powerpc/platforms/Kconfig.cputype   | 29 ++--
 arch/powerpc/platforms/cell/Kconfig  |  1 +
 arch/powerpc/platforms/maple/Kconfig |  1 +
 arch/powerpc/platforms/microwatt/Kconfig |  2 +-
 arch/powerpc/platforms/pasemi/Kconfig|  1 +
 arch/powerpc/platforms/powermac/Kconfig  |  1 +
 arch/powerpc/platforms/powernv/Kconfig   |  2 +-
 12 files changed, 71 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 653f1fa9f8cb..94cdd1cb94a2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -838,7 +838,7 @@ config FORCE_MAX_ZONEORDER
 config PPC_SUBPAGE_PROT
bool "Support setting protections for 4k subpages (subpage_prot syscall)"
default n
-   depends on PPC_BOOK3S_64 && PPC_64K_PAGES
+   depends on PPC_64S_HASH_MMU && PPC_64K_PAGES
help
  This option adds support for system call to allow user programs
  to set access permissions (read/write, readonly, or no access)
@@ -936,6 +936,7 @@ config PPC_MEM_KEYS
prompt "PowerPC Memory Protection Keys"
def_bool y
depends on PPC_BOOK3S_64
+   depends on PPC_64S_HASH_MMU
select ARCH_USES_HIGH_VMA_FLAGS
select ARCH_HAS_PKEYS
help
diff --git a/arch/powerpc/include/asm/mmu.h b/arch/powerpc/include/asm/mmu.h
index 8abe8e42e045..5f41565a1e5d 100644
--- a/arch/powerpc/include/asm/mmu.h
+++ b/arch/powerpc/include/asm/mmu.h
@@ -157,7 +157,7 @@ DECLARE_PER_CPU(int, next_tlbcam_idx);
 
 enum {
MMU_FTRS_POSSIBLE =
-#if defined(CONFIG_PPC_BOOK3S_64) || defined(CONFIG_PPC_BOOK3S_604)
+#if defined(CONFIG_PPC_BOOK3S_604)
MMU_FTR_HPTE_TABLE |
 #endif
 #ifdef CONFIG_PPC_8xx
@@ -184,15 +184,18 @@ enum {
MMU_FTR_USE_TLBRSRV | MMU_FTR_USE_PAIRED_MAS |
 #endif
 #ifdef CONFIG_PPC_BOOK3S_64
+   MMU_FTR_KERNEL_RO |
+#ifdef CONFIG_PPC_64S_HASH_MMU
MMU_FTR_NO_SLBIE_B | MMU_FTR_16M_PAGE | MMU_FTR_TLBIEL |
MMU_FTR_LOCKLESS_TLBIE | MMU_FTR_CI_LARGE_PAGE |
MMU_FTR_1T_SEGMENT | MMU_FTR_TLBIE_CROP_VA |
-   MMU_FTR_KERNEL_RO | MMU_FTR_68_BIT_VA |
+   MMU_FTR_68_BIT_VA | MMU_FTR_HPTE_TABLE |
 #endif
 #ifdef CONFIG_PPC_RADIX_MMU
MMU_FTR_TYPE_RADIX |
MMU_FTR_GTSE |
 #endif /* CONFIG_PPC_RADIX_MMU */
+#endif
 #ifdef CONFIG_PPC_KUAP
MMU_FTR_BOOK3S_KUAP |
 #endif /* CONFIG_PPC_KUAP */
@@ -224,6 +227,13 @@ enum {
 #define MMU_FTRS_ALWAYS MMU_FTR_TYPE_FSL_E
 #endif
 
+/* BOOK3S_64 options */
+#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU)
+#define MMU_FTRS_ALWAYS MMU_FTR_TYPE_RADIX
+#elif !defined(CONFIG_PPC_RADIX_MMU) && defined(CONFIG_PPC_64S_HASH_MMU)
+#define MMU_FTRS_ALWAYS MMU_FTR_HPTE_TABLE
+#endif
+
 #ifndef MMU_FTRS_ALWAYS
 #define MMU_FTRS_ALWAYS 0
 #endif
@@ -329,7 +339,7 @@ static __always_inline bool radix_enabled(void)
return mmu_has_feature(MMU_FTR_TYPE_RADIX);
 }
 
-static inline bool early_radix_enabled(void)
+static __always_inline bool early_radix_enabled(void)
 {
return early_mmu_has_feature(MMU_FTR_TYPE_RADIX);
 }
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 358aee7c2d79..f1b86d687095 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -271,6 +271,9 @@ static int __init feat_enable_mmu_hash(struct dt_cpu_feature *f)
 {
u64 lpcr;
 
+   if (!IS_ENABLED(CONFIG_PPC_64S_HASH_MMU))
+   return 0;
+
lpcr = mfspr(SPRN_LPCR);
lpcr &= ~LPCR_ISL;
 
@@ -290,6 +293,9 @@ static int __init feat_enable_mmu_hash_v3(struct dt_cpu_feature *f)
 {
u64 lpcr;
 
+   if (!IS_ENABLED(CONFIG_PPC_64S_HASH_MMU))
+   return 0;
+
lpcr = mfspr(SPRN_LPCR);
lpcr &= ~(LPCR_ISL | LPCR_UPRT | LPCR_HR);
mtspr(SPRN_LPCR, lpcr);
@@ -303,15 +309,15 @@ static int __init feat_enable_mmu_hash_v3(struct dt_cpu_feature *f)
 
 static int __init feat_enable_mmu_radix(struct dt_cpu_feature *f)
 {
-#ifdef CONFIG_PPC_RADIX_MMU
+   if (!IS_ENABLED(CONFIG_PPC_RADIX_MMU))
+   return 0;
+
+   cur_cpu_spec->mmu_features |= MMU_FTR_KERNEL_RO;
cur_cpu_spec->mmu_features |= MMU_FTR_TYPE_RADIX;
-   cur_cpu_spec->mmu_features |= MMU_FTRS_HASH_BASE;
cur_cpu_spec->mmu_features |= 
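
The `MMU_FTRS_ALWAYS` and `MMU_FTRS_POSSIBLE` masks touched by this patch allow feature checks to be resolved at compile time when only one MMU type is built in. The following user-space sketch models that idea under stated assumptions: the `SIM_` names and bit values are hypothetical, the masks are hard-coded for a radix-only build, and the final runtime test stands in for whatever mechanism the kernel uses for features that are possible but not guaranteed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values, chosen for this sketch only. */
#define SIM_FTR_HPTE_TABLE	0x1UL
#define SIM_FTR_TYPE_RADIX	0x2UL

/* Masks as they would look for a radix-only build (hash MMU compiled out). */
#define SIM_FTRS_POSSIBLE	SIM_FTR_TYPE_RADIX
#define SIM_FTRS_ALWAYS		SIM_FTR_TYPE_RADIX

/* Runtime feature word, standing in for cur_cpu_spec->mmu_features. */
static unsigned long sim_mmu_features;

static inline bool sim_mmu_has_feature(unsigned long feature)
{
	assert(feature != 0);

	if (SIM_FTRS_ALWAYS & feature)
		return true;	/* always present: folds to a constant */
	if (!(SIM_FTRS_POSSIBLE & feature))
		return false;	/* compiled out: folds to a constant */

	/* Reached only for features that are possible but not always set. */
	return (sim_mmu_features & feature) != 0;
}
```

In this model a check for the radix feature is true and a check for the hash page table feature is false regardless of the runtime feature word, so a compiler can drop the code guarded by the hash check.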

[PATCH v3 14/18] powerpc/64s: Clear MMU_FTR_HPTE_TABLE when booting in radix

2021-10-21 Thread Nicholas Piggin
---
 arch/powerpc/mm/init_64.c | 1 +
 arch/powerpc/mm/pgtable.c | 9 ++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 386be136026e..23fbb2b0277c 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -453,6 +453,7 @@ void __init mmu_early_init_devtree(void)
early_check_vec5();
 
if (early_radix_enabled()) {
+   cur_cpu_spec->mmu_features &= ~MMU_FTR_HPTE_TABLE;
radix__early_init_devtree();
/*
 * We have finalized the translation we are going to use by now.
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index cd16b407f47e..9e67472b50be 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -81,9 +81,6 @@ static struct page *maybe_pte_to_page(pte_t pte)
 
 static pte_t set_pte_filter_hash(pte_t pte)
 {
-   if (radix_enabled())
-   return pte;
-
pte = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
if (pte_looks_normal(pte) && !(cpu_has_feature(CPU_FTR_COHERENT_ICACHE) ||
   cpu_has_feature(CPU_FTR_NOEXECUTE))) {
@@ -112,6 +109,9 @@ static inline pte_t set_pte_filter(pte_t pte)
 {
struct page *pg;
 
+   if (radix_enabled())
+   return pte;
+
if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
return set_pte_filter_hash(pte);
 
@@ -144,6 +144,9 @@ static pte_t set_access_flags_filter(pte_t pte, struct vm_area_struct *vma,
 {
struct page *pg;
 
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64))
+   return pte;
+
if (mmu_has_feature(MMU_FTR_HPTE_TABLE))
return pte;
 
-- 
2.23.0



[PATCH v3 13/18] powerpc/64e: remove mmu_linear_psize

2021-10-21 Thread Nicholas Piggin
mmu_linear_psize is only set at boot once on 64e, is not necessarily
the correct size of the linear map pages, and is never used anywhere.
Remove it.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/nohash/tlb.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/arch/powerpc/mm/nohash/tlb.c b/arch/powerpc/mm/nohash/tlb.c
index 5872f69141d5..8c1523ae7f7f 100644
--- a/arch/powerpc/mm/nohash/tlb.c
+++ b/arch/powerpc/mm/nohash/tlb.c
@@ -150,7 +150,6 @@ static inline int mmu_get_tsize(int psize)
  */
 #ifdef CONFIG_PPC64
 
-int mmu_linear_psize;  /* Page size used for the linear mapping */
 int mmu_pte_psize; /* Page size used for PTE pages */
 int mmu_vmemmap_psize; /* Page size used for the virtual mem map */
 int book3e_htw_mode;   /* HW tablewalk?  Value is PPC_HTW_* */
@@ -655,14 +654,6 @@ static void early_init_this_mmu(void)
 
 static void __init early_init_mmu_global(void)
 {
-   /* XXX This will have to be decided at runtime, but right
-* now our boot and TLB miss code hard wires it. Ideally
-* we should find out a suitable page size and patch the
-* TLB miss code (either that or use the PACA to store
-* the value we want)
-*/
-   mmu_linear_psize = MMU_PAGE_1G;
-
/* XXX This should be decided at runtime based on supported
 * page sizes in the TLB, but for now let's assume 16M is
 * always there and a good fit (which it probably is)
-- 
2.23.0



[PATCH v3 12/18] powerpc: make memremap_compat_align 64s-only

2021-10-21 Thread Nicholas Piggin
memremap_compat_align is only relevant when ZONE_DEVICE is selected.
ZONE_DEVICE depends on ARCH_HAS_PTE_DEVMAP, which is only selected
by PPC_BOOK3S_64.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   |  2 +-
 arch/powerpc/mm/book3s64/pgtable.c | 20 
 arch/powerpc/mm/ioremap.c  | 20 
 3 files changed, 21 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index ba5b66189358..653f1fa9f8cb 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -129,7 +129,7 @@ config PPC
select ARCH_HAS_KCOV
select ARCH_HAS_MEMBARRIER_CALLBACKS
select ARCH_HAS_MEMBARRIER_SYNC_CORE
-   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN
+   select ARCH_HAS_MEMREMAP_COMPAT_ALIGN   if PPC_BOOK3S_64
select ARCH_HAS_MMIOWB  if PPC64
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PHYS_TO_DMA
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 06565b452cbc..7d556b5513e4 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -534,3 +534,23 @@ static int __init pgtable_debugfs_setup(void)
return 0;
 }
 arch_initcall(pgtable_debugfs_setup);
+
+#ifdef CONFIG_ZONE_DEVICE
+/*
+ * Override the generic version in mm/memremap.c.
+ *
+ * With hash translation, the direct-map range is mapped with just one
+ * page size selected by htab_init_page_sizes(). Consult
+ * mmu_psize_defs[] to determine the minimum page size alignment.
+*/
+unsigned long memremap_compat_align(void)
+{
+   if (!radix_enabled()) {
+   unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;
+   return max(SUBSECTION_SIZE, 1UL << shift);
+   }
+
+   return SUBSECTION_SIZE;
+}
+EXPORT_SYMBOL_GPL(memremap_compat_align);
+#endif
diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
index 57342154d2b0..4f12504fb405 100644
--- a/arch/powerpc/mm/ioremap.c
+++ b/arch/powerpc/mm/ioremap.c
@@ -98,23 +98,3 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t offset, unsigned long size,
 
return NULL;
 }
-
-#ifdef CONFIG_ZONE_DEVICE
-/*
- * Override the generic version in mm/memremap.c.
- *
- * With hash translation, the direct-map range is mapped with just one
- * page size selected by htab_init_page_sizes(). Consult
- * mmu_psize_defs[] to determine the minimum page size alignment.
-*/
-unsigned long memremap_compat_align(void)
-{
-   unsigned int shift = mmu_psize_defs[mmu_linear_psize].shift;
-
-   if (radix_enabled())
-   return SUBSECTION_SIZE;
-   return max(SUBSECTION_SIZE, 1UL << shift);
-
-}
-EXPORT_SYMBOL_GPL(memremap_compat_align);
-#endif
-- 
2.23.0
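
As a worked example of the alignment rule in the moved function, here is a small user-space model. It is a sketch, not the kernel code: the `sim_` names are hypothetical, the page shift is passed in as a parameter instead of being looked up in `mmu_psize_defs[]`, and the 2 MiB subsection size is an assumption based on `SUBSECTION_SHIFT` being 21 in the generic headers.

```c
#include <assert.h>

/* Assumed value: a subsection of 2 MiB (SUBSECTION_SHIFT == 21). */
#define SIM_SUBSECTION_SIZE (1UL << 21)

static unsigned long sim_max(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Model of memremap_compat_align() as written in the patch: only the
 * hash MMU consults the page shift of the linear mapping.
 */
static unsigned long sim_memremap_compat_align(int radix,
						unsigned int linear_shift)
{
	assert(linear_shift < sizeof(unsigned long) * 8);

	if (!radix)
		return sim_max(SIM_SUBSECTION_SIZE, 1UL << linear_shift);

	return SIM_SUBSECTION_SIZE;
}
```

Under these assumptions a hash configuration with a 16 MiB linear mapping (shift 24) needs 16 MiB alignment, a hash configuration with 4 KiB pages (shift 12) falls back to the 2 MiB subsection size, and radix always uses the subsection size.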



[PATCH v3 11/18] powerpc/64: pcpu setup avoid reading mmu_linear_psize on 64e or radix

2021-10-21 Thread Nicholas Piggin
Radix never sets mmu_linear_psize so it's always 4K, which causes pcpu
atom_size to always be PAGE_SIZE. 64e sets it to 1GB always.

Make paths for these platforms to be explicit about what value they set
atom_size to.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/setup_64.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index eaa79a0996d1..a6132b2fee9e 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -880,14 +880,23 @@ void __init setup_per_cpu_areas(void)
int rc = -EINVAL;
 
/*
-* Linear mapping is one of 4K, 1M and 16M.  For 4K, no need
-* to group units.  For larger mappings, use 1M atom which
-* should be large enough to contain a number of units.
+* BookE and BookS radix are historical values and should be revisited.
 */
-   if (mmu_linear_psize == MMU_PAGE_4K)
+   if (IS_ENABLED(CONFIG_PPC_BOOK3E)) {
+   atom_size = SZ_1M;
+   } else if (radix_enabled()) {
atom_size = PAGE_SIZE;
-   else
-   atom_size = 1 << 20;
+   } else {
+   /*
+* Linear mapping is one of 4K, 1M and 16M.  For 4K, no need
+* to group units.  For larger mappings, use 1M atom which
+* should be large enough to contain a number of units.
+*/
+   if (mmu_linear_psize == MMU_PAGE_4K)
+   atom_size = PAGE_SIZE;
+   else
+   atom_size = SZ_1M;
+   }
 
if (pcpu_chosen_fc != PCPU_FC_PAGE) {
rc = pcpu_embed_first_chunk(0, dyn_size, atom_size, pcpu_cpu_distance,
-- 
2.23.0
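
The three-way choice this patch makes explicit can be sketched as a pure function. Everything below is a user-space model with hypothetical names; the 64K page size is an assumption for illustration, and the boolean parameters stand in for IS_ENABLED(CONFIG_PPC_BOOK3E) and radix_enabled().

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_SZ_1M     ((size_t)1 << 20)
#define SKETCH_PAGE_SIZE ((size_t)1 << 16) /* assumption: 64K base pages */

/* Hypothetical subset of the MMU page size indices. */
enum sketch_psize { SKETCH_MMU_PAGE_4K, SKETCH_MMU_PAGE_64K, SKETCH_MMU_PAGE_16M };

/*
 * Model of the atom_size selection in setup_per_cpu_areas() after the
 * patch: Book3E and radix no longer consult linear_psize at all.
 */
size_t sketch_pcpu_atom_size(bool book3e, bool radix,
			     enum sketch_psize linear_psize)
{
	if (book3e)
		return SKETCH_SZ_1M;      /* historical 64e value */
	if (radix)
		return SKETCH_PAGE_SIZE;  /* historical radix value */

	/* Hash: a 4K linear mapping needs no grouping of units. */
	if (linear_psize == SKETCH_MMU_PAGE_4K)
		return SKETCH_PAGE_SIZE;
	return SKETCH_SZ_1M;
}
```

The point of the sketch is that the first two branches return before linear_psize is read, which is what lets a later patch drop mmu_linear_psize on 64e.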



[PATCH v3 10/18] powerpc/64s: Rename hash_hugetlbpage.c to hugetlbpage.c

2021-10-21 Thread Nicholas Piggin
This file contains functions and data common to radix, so rename it to
remove the hash_ prefix.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/Makefile  | 2 +-
 arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename arch/powerpc/mm/book3s64/{hash_hugetlbpage.c => hugetlbpage.c} (100%)

diff --git a/arch/powerpc/mm/book3s64/Makefile b/arch/powerpc/mm/book3s64/Makefile
index 1579e18e098d..501efadb287f 100644
--- a/arch/powerpc/mm/book3s64/Makefile
+++ b/arch/powerpc/mm/book3s64/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_PPC_HASH_MMU_NATIVE) += hash_native.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o
 obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o
 obj-$(CONFIG_PPC_64K_PAGES)+= hash_64k.o
-obj-$(CONFIG_HUGETLB_PAGE) += hash_hugetlbpage.o
+obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
 ifdef CONFIG_HUGETLB_PAGE
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_hugetlbpage.o
 endif
diff --git a/arch/powerpc/mm/book3s64/hash_hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
similarity index 100%
rename from arch/powerpc/mm/book3s64/hash_hugetlbpage.c
rename to arch/powerpc/mm/book3s64/hugetlbpage.c
-- 
2.23.0



[PATCH v3 09/18] powerpc/64s: move page size definitions from hash specific file

2021-10-21 Thread Nicholas Piggin
The radix code uses some of the psize variables. Move the common
ones from hash_utils.c to pgtable.c.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 5 -
 arch/powerpc/mm/book3s64/pgtable.c| 7 +++
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index ffc52ff0b3f0..1cd28e3cd3b5 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -99,8 +99,6 @@
  */
 
 static unsigned long _SDR1;
-struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
-EXPORT_SYMBOL_GPL(mmu_psize_defs);
 
 u8 hpte_page_sizes[1 << LP_BITS];
 EXPORT_SYMBOL_GPL(hpte_page_sizes);
@@ -114,9 +112,6 @@ EXPORT_SYMBOL_GPL(mmu_linear_psize);
 int mmu_virtual_psize = MMU_PAGE_4K;
 int mmu_vmalloc_psize = MMU_PAGE_4K;
 EXPORT_SYMBOL_GPL(mmu_vmalloc_psize);
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-int mmu_vmemmap_psize = MMU_PAGE_4K;
-#endif
 int mmu_io_psize = MMU_PAGE_4K;
 int mmu_kernel_ssize = MMU_SEGSIZE_256M;
 EXPORT_SYMBOL_GPL(mmu_kernel_ssize);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 049843c8c875..06565b452cbc 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -22,6 +22,13 @@
 
 #include "internal.h"
 
+struct mmu_psize_def mmu_psize_defs[MMU_PAGE_COUNT];
+EXPORT_SYMBOL_GPL(mmu_psize_defs);
+
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+int mmu_vmemmap_psize = MMU_PAGE_4K;
+#endif
+
 unsigned long __pmd_frag_nr;
 EXPORT_SYMBOL(__pmd_frag_nr);
 unsigned long __pmd_frag_size_shift;
-- 
2.23.0



[PATCH v3 08/18] powerpc/64s: Make flush_and_reload_slb a no-op when radix is enabled

2021-10-21 Thread Nicholas Piggin
The radix test can exclude slb_flush_all_realmode() from being called
because flush_and_reload_slb() is only expected to flush ERAT when
called by flush_erat(), which is only on pre-ISA v3.0 CPUs that do not
support radix.

This helps the later change to make hash support configurable to not
introduce runtime changes to radix mode behaviour.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kernel/mce_power.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index c2f55fe7092d..cf5263b648fc 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -80,12 +80,12 @@ static bool mce_in_guest(void)
 #ifdef CONFIG_PPC_BOOK3S_64
 void flush_and_reload_slb(void)
 {
-   /* Invalidate all SLBs */
-   slb_flush_all_realmode();
-
if (early_radix_enabled())
return;
 
+   /* Invalidate all SLBs */
+   slb_flush_all_realmode();
+
/*
 * This probably shouldn't happen, but it may be possible it's
 * called in early boot before SLB shadows are allocated.
-- 
2.23.0



[PATCH v3 07/18] powerpc/64s: move THP trace point creation out of hash specific file

2021-10-21 Thread Nicholas Piggin
In preparation for making hash MMU support configurable, move THP
trace point function definitions out of an otherwise hash specific
file.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/Makefile   | 2 +-
 arch/powerpc/mm/book3s64/hash_pgtable.c | 1 -
 arch/powerpc/mm/book3s64/pgtable.c  | 1 +
 arch/powerpc/mm/book3s64/trace.c| 8 
 4 files changed, 10 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/mm/book3s64/trace.c

diff --git a/arch/powerpc/mm/book3s64/Makefile b/arch/powerpc/mm/book3s64/Makefile
index 319f4b7f3357..1579e18e098d 100644
--- a/arch/powerpc/mm/book3s64/Makefile
+++ b/arch/powerpc/mm/book3s64/Makefile
@@ -5,7 +5,7 @@ ccflags-y   := $(NO_MINIMAL_TOC)
 CFLAGS_REMOVE_slb.o = $(CC_FLAGS_FTRACE)
 
 obj-y  += hash_pgtable.o hash_utils.o slb.o \
-  mmu_context.o pgtable.o hash_tlb.o
+  mmu_context.o pgtable.o hash_tlb.o trace.o
 obj-$(CONFIG_PPC_HASH_MMU_NATIVE)  += hash_native.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o
 obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index ad5eff097d31..7ce8914992e3 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -16,7 +16,6 @@
 
 #include 
 
-#define CREATE_TRACE_POINTS
 #include 
 
 #if H_PGTABLE_RANGE > (USER_VSID_RANGE * (TASK_SIZE_USER64 / TASK_CONTEXT_SIZE))
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 9e16c7b1a6c5..049843c8c875 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -28,6 +28,7 @@ unsigned long __pmd_frag_size_shift;
 EXPORT_SYMBOL(__pmd_frag_size_shift);
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
+
 /*
  * This is called when relaxing access to a hugepage. It's also called in the page
  * fault path when we don't hit any of the major fault cases, ie, a minor
diff --git a/arch/powerpc/mm/book3s64/trace.c b/arch/powerpc/mm/book3s64/trace.c
new file mode 100644
index ..b86e7b906257
--- /dev/null
+++ b/arch/powerpc/mm/book3s64/trace.c
@@ -0,0 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * This file is for defining trace points and trace related helpers.
+ */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define CREATE_TRACE_POINTS
+#include 
+#endif
-- 
2.23.0



[PATCH v3 06/18] powerpc/pseries: lparcfg don't include slb_size line in radix mode

2021-10-21 Thread Nicholas Piggin
This avoids a change in behaviour in the later patch making hash
support configurable. This is possibly a user interface change, so
the alternative would be a hard-coded slb_size=0 here.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/lparcfg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/lparcfg.c b/arch/powerpc/platforms/pseries/lparcfg.c
index f71eac74ea92..3354c00914fa 100644
--- a/arch/powerpc/platforms/pseries/lparcfg.c
+++ b/arch/powerpc/platforms/pseries/lparcfg.c
@@ -532,7 +532,8 @@ static int pseries_lparcfg_data(struct seq_file *m, void *v)
   lppaca_shared_proc(get_lppaca()));
 
 #ifdef CONFIG_PPC_BOOK3S_64
-   seq_printf(m, "slb_size=%d\n", mmu_slb_size);
+   if (!radix_enabled())
+   seq_printf(m, "slb_size=%d\n", mmu_slb_size);
 #endif
parse_em_data(m);
maxmem_data(m);
-- 
2.23.0



[PATCH v3 05/18] powerpc/pseries: move pseries_lpar_register_process_table() out from hash specific code

2021-10-21 Thread Nicholas Piggin
This reduces ifdefs in a later change making hash support configurable.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/lpar.c | 56 +--
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 3df6bdfea475..06d6a824c0dc 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -712,6 +712,34 @@ void vpa_init(int cpu)
 
 #ifdef CONFIG_PPC_BOOK3S_64
 
+static int pseries_lpar_register_process_table(unsigned long base,
+   unsigned long page_size, unsigned long table_size)
+{
+   long rc;
+   unsigned long flags = 0;
+
+   if (table_size)
+   flags |= PROC_TABLE_NEW;
+   if (radix_enabled()) {
+   flags |= PROC_TABLE_RADIX;
+   if (mmu_has_feature(MMU_FTR_GTSE))
+   flags |= PROC_TABLE_GTSE;
+   } else
+   flags |= PROC_TABLE_HPT_SLB;
+   for (;;) {
+   rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
+   page_size, table_size);
+   if (!H_IS_LONG_BUSY(rc))
+   break;
+   mdelay(get_longbusy_msecs(rc));
+   }
+   if (rc != H_SUCCESS) {
+   pr_err("Failed to register process table (rc=%ld)\n", rc);
+   BUG();
+   }
+   return rc;
+}
+
 static long pSeries_lpar_hpte_insert(unsigned long hpte_group,
 unsigned long vpn, unsigned long pa,
 unsigned long rflags, unsigned long vflags,
@@ -1680,34 +1708,6 @@ static int pseries_lpar_resize_hpt(unsigned long shift)
return 0;
 }
 
-static int pseries_lpar_register_process_table(unsigned long base,
-   unsigned long page_size, unsigned long table_size)
-{
-   long rc;
-   unsigned long flags = 0;
-
-   if (table_size)
-   flags |= PROC_TABLE_NEW;
-   if (radix_enabled()) {
-   flags |= PROC_TABLE_RADIX;
-   if (mmu_has_feature(MMU_FTR_GTSE))
-   flags |= PROC_TABLE_GTSE;
-   } else
-   flags |= PROC_TABLE_HPT_SLB;
-   for (;;) {
-   rc = plpar_hcall_norets(H_REGISTER_PROC_TBL, flags, base,
-   page_size, table_size);
-   if (!H_IS_LONG_BUSY(rc))
-   break;
-   mdelay(get_longbusy_msecs(rc));
-   }
-   if (rc != H_SUCCESS) {
-   pr_err("Failed to register process table (rc=%ld)\n", rc);
-   BUG();
-   }
-   return rc;
-}
-
 void __init hpte_init_pseries(void)
 {
mmu_hash_ops.hpte_invalidate = pSeries_lpar_hpte_invalidate;
-- 
2.23.0
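
As an aside for readers unfamiliar with the function being moved, its two parts (flag composition and the long-busy retry loop) can be modelled in user space as below. This is a sketch under stated assumptions: the flag bit values and the busy code are placeholders rather than the real hvcall.h constants, and the hcall is replaced by a hypothetical callback so the loop can be exercised without a hypervisor.

```c
#include <stdbool.h>

/* Placeholder bit values for illustration; not copied from hvcall.h. */
#define SK_PROC_TABLE_NEW     0x1UL
#define SK_PROC_TABLE_RADIX   0x2UL
#define SK_PROC_TABLE_GTSE    0x4UL
#define SK_PROC_TABLE_HPT_SLB 0x8UL

#define SK_H_SUCCESS   0L
#define SK_H_LONG_BUSY 9900L /* stand-in for the long-busy return range */

/* Build the flags argument the same way the moved function does. */
unsigned long sk_proc_table_flags(unsigned long table_size, bool radix, bool gtse)
{
	unsigned long flags = 0;

	if (table_size)
		flags |= SK_PROC_TABLE_NEW;
	if (radix) {
		flags |= SK_PROC_TABLE_RADIX;
		if (gtse)
			flags |= SK_PROC_TABLE_GTSE;
	} else {
		flags |= SK_PROC_TABLE_HPT_SLB;
	}
	return flags;
}

/* Retry a hypothetical hcall until it stops reporting long-busy. */
long sk_register_with_retry(long (*hcall)(unsigned long flags, void *ctx),
			    unsigned long flags, void *ctx)
{
	long rc;

	for (;;) {
		rc = hcall(flags, ctx);
		if (rc != SK_H_LONG_BUSY)
			break;
		/* The kernel sleeps here: mdelay(get_longbusy_msecs(rc)). */
	}
	return rc;
}

/* Test double: reports long-busy until the countdown in ctx hits zero. */
long sk_fake_hcall(unsigned long flags, void *ctx)
{
	int *busy_left = ctx;

	(void)flags;
	if (*busy_left > 0) {
		(*busy_left)--;
		return SK_H_LONG_BUSY;
	}
	return SK_H_SUCCESS;
}
```

Note that the flag composition already depends on radix_enabled(), which is why the function does not belong in the hash-only part of lpar.c.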



[PATCH v3 04/18] powerpc/64s: Move and rename do_bad_slb_fault as it is not hash specific

2021-10-21 Thread Nicholas Piggin
slb.c is hash-specific SLB management, but do_bad_slb_fault deals with
segment interrupts that occur with radix MMU as well.
---
 arch/powerpc/include/asm/interrupt.h |  2 +-
 arch/powerpc/kernel/exceptions-64s.S |  4 ++--
 arch/powerpc/mm/book3s64/slb.c   | 16 
 arch/powerpc/mm/fault.c  | 17 +
 4 files changed, 20 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
index a1d238255f07..3487aab12229 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -564,7 +564,7 @@ DECLARE_INTERRUPT_HANDLER(kernel_bad_stack);
 
 /* slb.c */
 DECLARE_INTERRUPT_HANDLER_RAW(do_slb_fault);
-DECLARE_INTERRUPT_HANDLER(do_bad_slb_fault);
+DECLARE_INTERRUPT_HANDLER(do_bad_segment_interrupt);
 
 /* hash_utils.c */
 DECLARE_INTERRUPT_HANDLER_RAW(do_hash_fault);
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index eaf1f72131a1..046c99e31d01 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1430,7 +1430,7 @@ MMU_FTR_SECTION_ELSE
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
addir3,r1,STACK_FRAME_OVERHEAD
-   bl  do_bad_slb_fault
+   bl  do_bad_segment_interrupt
b   interrupt_return_srr
 
 
@@ -1510,7 +1510,7 @@ MMU_FTR_SECTION_ELSE
 ALT_MMU_FTR_SECTION_END_IFCLR(MMU_FTR_TYPE_RADIX)
std r3,RESULT(r1)
addir3,r1,STACK_FRAME_OVERHEAD
-   bl  do_bad_slb_fault
+   bl  do_bad_segment_interrupt
b   interrupt_return_srr
 
 
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index f0037bcc47a0..31f4cef3adac 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -868,19 +868,3 @@ DEFINE_INTERRUPT_HANDLER_RAW(do_slb_fault)
return err;
}
 }
-
-DEFINE_INTERRUPT_HANDLER(do_bad_slb_fault)
-{
-   int err = regs->result;
-
-   if (err == -EFAULT) {
-   if (user_mode(regs))
-   _exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
-   else
-   bad_page_fault(regs, SIGSEGV);
-   } else if (err == -EINVAL) {
-   unrecoverable_exception(regs);
-   } else {
-   BUG();
-   }
-}
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index a8d0ce85d39a..53ddcae0ac9e 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -620,4 +621,20 @@ DEFINE_INTERRUPT_HANDLER(do_bad_page_fault_segv)
 {
bad_page_fault(regs, SIGSEGV);
 }
+
+DEFINE_INTERRUPT_HANDLER(do_bad_segment_interrupt)
+{
+   int err = regs->result;
+
+   if (err == -EFAULT) {
+   if (user_mode(regs))
+   _exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar);
+   else
+   bad_page_fault(regs, SIGSEGV);
+   } else if (err == -EINVAL) {
+   unrecoverable_exception(regs);
+   } else {
+   BUG();
+   }
+}
 #endif
-- 
2.23.0
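
The renamed handler is a small dispatch on regs->result. A user-space sketch of that decision table follows; the enum and function names are hypothetical, and each enumerator simply names the kernel call the real handler would make.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical labels for the four outcomes of the handler. */
enum sk_action {
	SK_SIGSEGV_USER,   /* _exception(SIGSEGV, regs, SEGV_BNDERR, regs->dar) */
	SK_BAD_PAGE_FAULT, /* bad_page_fault(regs, SIGSEGV) */
	SK_UNRECOVERABLE,  /* unrecoverable_exception(regs) */
	SK_BUG             /* BUG() */
};

/*
 * Model of do_bad_segment_interrupt(): err stands in for regs->result
 * and user_mode for user_mode(regs).
 */
enum sk_action sk_bad_segment_action(int err, bool user_mode)
{
	if (err == -EFAULT)
		return user_mode ? SK_SIGSEGV_USER : SK_BAD_PAGE_FAULT;
	if (err == -EINVAL)
		return SK_UNRECOVERABLE;
	return SK_BUG;
}
```

Nothing in this table refers to SLB state, which is consistent with moving the handler from slb.c to fault.c.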



[PATCH v3 03/18] powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE

2021-10-21 Thread Nicholas Piggin
The pseries platform does not use the native hash code but the PAPR
virtualised hash interfaces, so remove PPC_HASH_MMU_NATIVE.

This requires moving tlbiel code from hash_native.c to hash_utils.c.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/mm/book3s64/hash_native.c| 104 --
 arch/powerpc/mm/book3s64/hash_utils.c | 104 ++
 arch/powerpc/platforms/pseries/Kconfig|   1 -
 4 files changed, 104 insertions(+), 109 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 215973b4cb26..d2e80f178b6d 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -14,7 +14,6 @@ enum {
TLB_INVAL_SCOPE_LPID = 1,   /* invalidate TLBs for current LPID */
 };
 
-#ifdef CONFIG_PPC_NATIVE
 static inline void tlbiel_all(void)
 {
/*
@@ -30,9 +29,6 @@ static inline void tlbiel_all(void)
else
hash__tlbiel_all(TLB_INVAL_SCOPE_GLOBAL);
 }
-#else
-static inline void tlbiel_all(void) { BUG(); }
-#endif
 
 static inline void tlbiel_all_lpid(bool radix)
 {
diff --git a/arch/powerpc/mm/book3s64/hash_native.c b/arch/powerpc/mm/book3s64/hash_native.c
index d8279bfe68ea..d2a320828c0b 100644
--- a/arch/powerpc/mm/book3s64/hash_native.c
+++ b/arch/powerpc/mm/book3s64/hash_native.c
@@ -43,110 +43,6 @@
 
 static DEFINE_RAW_SPINLOCK(native_tlbie_lock);
 
-static inline void tlbiel_hash_set_isa206(unsigned int set, unsigned int is)
-{
-   unsigned long rb;
-
-   rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
-
-   asm volatile("tlbiel %0" : : "r" (rb));
-}
-
-/*
- * tlbiel instruction for hash, set invalidation
- * i.e., r=1 and is=01 or is=10 or is=11
- */
-static __always_inline void tlbiel_hash_set_isa300(unsigned int set, unsigned int is,
-   unsigned int pid,
-   unsigned int ric, unsigned int prs)
-{
-   unsigned long rb;
-   unsigned long rs;
-   unsigned int r = 0; /* hash format */
-
-   rb = (set << PPC_BITLSHIFT(51)) | (is << PPC_BITLSHIFT(53));
-   rs = ((unsigned long)pid << PPC_BITLSHIFT(31));
-
-   asm volatile(PPC_TLBIEL(%0, %1, %2, %3, %4)
-: : "r"(rb), "r"(rs), "i"(ric), "i"(prs), "i"(r)
-: "memory");
-}
-
-
-static void tlbiel_all_isa206(unsigned int num_sets, unsigned int is)
-{
-   unsigned int set;
-
-   asm volatile("ptesync": : :"memory");
-
-   for (set = 0; set < num_sets; set++)
-   tlbiel_hash_set_isa206(set, is);
-
-   ppc_after_tlbiel_barrier();
-}
-
-static void tlbiel_all_isa300(unsigned int num_sets, unsigned int is)
-{
-   unsigned int set;
-
-   asm volatile("ptesync": : :"memory");
-
-   /*
-* Flush the partition table cache if this is HV mode.
-*/
-   if (early_cpu_has_feature(CPU_FTR_HVMODE))
-   tlbiel_hash_set_isa300(0, is, 0, 2, 0);
-
-   /*
-* Now invalidate the process table cache. UPRT=0 HPT modes (what
-* current hardware implements) do not use the process table, but
-* add the flushes anyway.
-*
-* From ISA v3.0B p. 1078:
-* The following forms are invalid.
-*  * PRS=1, R=0, and RIC!=2 (The only process-scoped
-*HPT caching is of the Process Table.)
-*/
-   tlbiel_hash_set_isa300(0, is, 0, 2, 1);
-
-   /*
-* Then flush the sets of the TLB proper. Hash mode uses
-* partition scoped TLB translations, which may be flushed
-* in !HV mode.
-*/
-   for (set = 0; set < num_sets; set++)
-   tlbiel_hash_set_isa300(set, is, 0, 0, 0);
-
-   ppc_after_tlbiel_barrier();
-
-   asm volatile(PPC_ISA_3_0_INVALIDATE_ERAT "; isync" : : :"memory");
-}
-
-void hash__tlbiel_all(unsigned int action)
-{
-   unsigned int is;
-
-   switch (action) {
-   case TLB_INVAL_SCOPE_GLOBAL:
-   is = 3;
-   break;
-   case TLB_INVAL_SCOPE_LPID:
-   is = 2;
-   break;
-   default:
-   BUG();
-   }
-
-   if (early_cpu_has_feature(CPU_FTR_ARCH_300))
-   tlbiel_all_isa300(POWER9_TLB_SETS_HASH, is);
-   else if (early_cpu_has_feature(CPU_FTR_ARCH_207S))
-   tlbiel_all_isa206(POWER8_TLB_SETS, is);
-   else if (early_cpu_has_feature(CPU_FTR_ARCH_206))
-   tlbiel_all_isa206(POWER7_TLB_SETS, is);
-   else
-   WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
-}
-
 static inline unsigned long  ___tlbie(unsigned long vpn, int psize,
int apsize, int ssize)
 {
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index ebe3044711ce..ffc52ff0b3f0 
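
The RB encoding used by the tlbiel helpers that this patch moves can be checked with a short user-space sketch. It assumes a 64-bit register, so that IBM bit number be maps to a left shift of 63 - be, and it assumes TLB_INVAL_SCOPE_GLOBAL is 0 (only the LPID value of 1 is visible in the hunk above). All names prefixed with sk_ are hypothetical, and unlike the kernel helpers the sketch widens the operands to 64 bits before shifting.

```c
#include <stdint.h>

/* IBM numbering: bit 0 is the most significant bit of a 64-bit register. */
#define SK_PPC_BITLSHIFT(be) (63 - (be))

/* Compose RB with the set number ending at bit 51 and IS ending at bit 53. */
uint64_t sk_tlbiel_rb(unsigned int set, unsigned int is)
{
	return ((uint64_t)set << SK_PPC_BITLSHIFT(51)) |
	       ((uint64_t)is << SK_PPC_BITLSHIFT(53));
}

/* Assumed scope values; only LPID = 1 is shown in the patch context. */
#define SK_SCOPE_GLOBAL 0
#define SK_SCOPE_LPID   1

/* Map a scope to the IS field; -1 marks where the kernel would BUG(). */
int sk_scope_to_is(int scope)
{
	switch (scope) {
	case SK_SCOPE_GLOBAL:
		return 3;
	case SK_SCOPE_LPID:
		return 2;
	default:
		return -1;
	}
}
```

In this model the set number lands at a shift of 12 and the IS field at a shift of 10.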

[PATCH v3 02/18] powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE

2021-10-21 Thread Nicholas Piggin
PPC_NATIVE now only controls the native HPT code, so rename it to be
more descriptive. Restrict it to Book3S only.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/mm/book3s64/Makefile  | 2 +-
 arch/powerpc/mm/book3s64/hash_utils.c  | 2 +-
 arch/powerpc/platforms/52xx/Kconfig| 2 +-
 arch/powerpc/platforms/Kconfig | 4 ++--
 arch/powerpc/platforms/cell/Kconfig| 2 +-
 arch/powerpc/platforms/chrp/Kconfig| 2 +-
 arch/powerpc/platforms/embedded6xx/Kconfig | 2 +-
 arch/powerpc/platforms/maple/Kconfig   | 2 +-
 arch/powerpc/platforms/microwatt/Kconfig   | 2 +-
 arch/powerpc/platforms/pasemi/Kconfig  | 2 +-
 arch/powerpc/platforms/powermac/Kconfig| 2 +-
 arch/powerpc/platforms/powernv/Kconfig | 2 +-
 arch/powerpc/platforms/pseries/Kconfig | 2 +-
 13 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/Makefile b/arch/powerpc/mm/book3s64/Makefile
index 1b56d3af47d4..319f4b7f3357 100644
--- a/arch/powerpc/mm/book3s64/Makefile
+++ b/arch/powerpc/mm/book3s64/Makefile
@@ -6,7 +6,7 @@ CFLAGS_REMOVE_slb.o = $(CC_FLAGS_FTRACE)
 
 obj-y  += hash_pgtable.o hash_utils.o slb.o \
   mmu_context.o pgtable.o hash_tlb.o
-obj-$(CONFIG_PPC_NATIVE)   += hash_native.o
+obj-$(CONFIG_PPC_HASH_MMU_NATIVE)  += hash_native.o
 obj-$(CONFIG_PPC_RADIX_MMU)+= radix_pgtable.o radix_tlb.o
 obj-$(CONFIG_PPC_4K_PAGES) += hash_4k.o
 obj-$(CONFIG_PPC_64K_PAGES)+= hash_64k.o
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index c145776d3ae5..ebe3044711ce 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -1091,7 +1091,7 @@ void __init hash__early_init_mmu(void)
ps3_early_mm_init();
else if (firmware_has_feature(FW_FEATURE_LPAR))
hpte_init_pseries();
-   else if (IS_ENABLED(CONFIG_PPC_NATIVE))
+   else if (IS_ENABLED(CONFIG_PPC_HASH_MMU_NATIVE))
hpte_init_native();
 
if (!mmu_hash_ops.hpte_insert)
diff --git a/arch/powerpc/platforms/52xx/Kconfig b/arch/powerpc/platforms/52xx/Kconfig
index 99d60acc20c8..b72ed2950ca8 100644
--- a/arch/powerpc/platforms/52xx/Kconfig
+++ b/arch/powerpc/platforms/52xx/Kconfig
@@ -34,7 +34,7 @@ config PPC_EFIKA
bool "bPlan Efika 5k2. MPC5200B based computer"
depends on PPC_MPC52xx
select PPC_RTAS
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
 
 config PPC_LITE5200
bool "Freescale Lite5200 Eval Board"
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index e02d29a9d12f..d41dad227de8 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -40,9 +40,9 @@ config EPAPR_PARAVIRT
 
  In case of doubt, say Y
 
-config PPC_NATIVE
+config PPC_HASH_MMU_NATIVE
bool
-   depends on PPC_BOOK3S_32 || PPC64
+   depends on PPC_BOOK3S
help
  Support for running natively on the hardware, i.e. without
  a hypervisor. This option is not user-selectable but should
diff --git a/arch/powerpc/platforms/cell/Kconfig b/arch/powerpc/platforms/cell/Kconfig
index cb70c5f25bc6..db4465c51b56 100644
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -8,7 +8,7 @@ config PPC_CELL_COMMON
select PPC_DCR_MMIO
select PPC_INDIRECT_PIO
select PPC_INDIRECT_MMIO
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select PPC_RTAS
select IRQ_EDGE_EOI_HANDLER
 
diff --git a/arch/powerpc/platforms/chrp/Kconfig b/arch/powerpc/platforms/chrp/Kconfig
index 9b5c5505718a..ff30ed579a39 100644
--- a/arch/powerpc/platforms/chrp/Kconfig
+++ b/arch/powerpc/platforms/chrp/Kconfig
@@ -11,6 +11,6 @@ config PPC_CHRP
select RTAS_ERROR_LOGGING
select PPC_MPC106
select PPC_UDBG_16550
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select FORCE_PCI
default y
diff --git a/arch/powerpc/platforms/embedded6xx/Kconfig b/arch/powerpc/platforms/embedded6xx/Kconfig
index 4c6d703a4284..c54786f8461e 100644
--- a/arch/powerpc/platforms/embedded6xx/Kconfig
+++ b/arch/powerpc/platforms/embedded6xx/Kconfig
@@ -55,7 +55,7 @@ config MVME5100
select FORCE_PCI
select PPC_INDIRECT_PCI
select PPC_I8259
-   select PPC_NATIVE
+   select PPC_HASH_MMU_NATIVE
select PPC_UDBG_16550
help
  This option enables support for the Motorola (now Emerson) MVME5100
diff --git a/arch/powerpc/platforms/maple/Kconfig b/arch/powerpc/platforms/maple/Kconfig
index 86ae210bee9a..7fd84311ade5 100644
--- a/arch/powerpc/platforms/maple/Kconfig
+++ b/arch/powerpc/platforms/maple/Kconfig
@@ -9,7 +9,7 @@ config PPC_MAPLE
select GENERIC_TBSYNC
select PPC_UDBG_16550
select PPC_970_NAP
-   select PPC_NATIVE
+   select 

[PATCH v3 00/18] powerpc: Make hash MMU code build configurable

2021-10-21 Thread Nicholas Piggin
Now that there's a platform that can make good use of it, here's
a series that can prevent the hash MMU code being built for 64s
platforms that don't need it.

Since v2:
- Split MMU_FTR_HPTE_TABLE clearing for radix boot into its own patch.
- Remove memremap_compat_align from other sub archs entirely.
- Flip patch order of the 2 main patches to put Kconfig change first.
- Fixed Book3S/32 xmon segment dumping bug.
- Removed a few more ifdefs, changed numbers to use SZ_ definitions,
  etc.
- Fixed microwatt defconfig so it should actually disable hash MMU now.

Since v1:
- Split out most of the Kconfig change from the conditional compilation
  changes.
- Split out several more changes into preparatory patches.
- Reduced some ifdefs.
- Caught a few missing hash bits: pgtable dump, lkdtm,
  memremap_compat_align.

Since RFC:
- Split out large code movement from other changes.
- Used mmu ftr test constant folding rather than adding new constant
  true/false for radix_enabled().
- Restore tlbie trace point that had to be commented out in the
  previous version.
- Avoid minor (probably unreachable) behaviour change in machine check
  handler when hash was not compiled.
- Fix microwatt updates so !HASH is not enforced.
- Rebase, build fixes.

Thanks,
Nick

Nicholas Piggin (18):
  powerpc: Remove unused FW_FEATURE_NATIVE references
  powerpc: Rename PPC_NATIVE to PPC_HASH_MMU_NATIVE
  powerpc/pseries: Stop selecting PPC_HASH_MMU_NATIVE
  powerpc/64s: Move and rename do_bad_slb_fault as it is not hash
specific
  powerpc/pseries: move pseries_lpar_register_process_table() out from
hash specific code
  powerpc/pseries: lparcfg don't include slb_size line in radix mode
  powerpc/64s: move THP trace point creation out of hash specific file
  powerpc/64s: Make flush_and_reload_slb a no-op when radix is enabled
  powerpc/64s: move page size definitions from hash specific file
  powerpc/64s: Rename hash_hugetlbpage.c to hugetlbpage.c
  powerpc/64: pcpu setup avoid reading mmu_linear_psize on 64e or radix
  powerpc: make memremap_compat_align 64s-only
  powerpc/64e: remove mmu_linear_psize
  powerpc/64s: Clear MMU_FTR_HPTE_TABLE when booting in radix
  powerpc/64s: Make hash MMU support configurable
  powerpc/64s: Move hash MMU support code under CONFIG_PPC_64S_HASH_MMU
  powerpc/configs/microwatt: add POWER9_CPU
  powerpc/microwatt: Don't select the hash MMU code

 arch/powerpc/Kconfig  |   5 +-
 arch/powerpc/configs/microwatt_defconfig  |   3 +-
 arch/powerpc/include/asm/book3s/64/mmu.h  |  19 ++-
 .../include/asm/book3s/64/tlbflush-hash.h |   7 ++
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   4 -
 arch/powerpc/include/asm/book3s/pgtable.h |   4 +
 arch/powerpc/include/asm/firmware.h   |   8 --
 arch/powerpc/include/asm/interrupt.h  |   2 +-
 arch/powerpc/include/asm/mmu.h|  16 ++-
 arch/powerpc/include/asm/mmu_context.h|   2 +
 arch/powerpc/include/asm/paca.h   |   8 ++
 arch/powerpc/kernel/asm-offsets.c |   2 +
 arch/powerpc/kernel/dt_cpu_ftrs.c |  14 ++-
 arch/powerpc/kernel/entry_64.S|   4 +-
 arch/powerpc/kernel/exceptions-64s.S  |  20 +++-
 arch/powerpc/kernel/mce.c |   2 +-
 arch/powerpc/kernel/mce_power.c   |  16 ++-
 arch/powerpc/kernel/paca.c|  18 ++-
 arch/powerpc/kernel/process.c |  13 +-
 arch/powerpc/kernel/prom.c|   2 +
 arch/powerpc/kernel/setup_64.c|  26 +++-
 arch/powerpc/kexec/core_64.c  |   4 +-
 arch/powerpc/kexec/ranges.c   |   4 +
 arch/powerpc/kvm/Kconfig  |   1 +
 arch/powerpc/mm/book3s64/Makefile |  19 +--
 arch/powerpc/mm/book3s64/hash_native.c| 104 
 arch/powerpc/mm/book3s64/hash_pgtable.c   |   1 -
 arch/powerpc/mm/book3s64/hash_utils.c | 111 +-
 .../{hash_hugetlbpage.c => hugetlbpage.c} |   2 +
 arch/powerpc/mm/book3s64/mmu_context.c|  34 +-
 arch/powerpc/mm/book3s64/pgtable.c|  28 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |   4 +
 arch/powerpc/mm/book3s64/slb.c|  16 ---
 arch/powerpc/mm/book3s64/trace.c  |   8 ++
 arch/powerpc/mm/copro_fault.c |   2 +
 arch/powerpc/mm/fault.c   |  17 +++
 arch/powerpc/mm/init_64.c |  14 ++-
 arch/powerpc/mm/ioremap.c |  20 
 arch/powerpc/mm/nohash/tlb.c  |   9 --
 arch/powerpc/mm/pgtable.c |   9 +-
 arch/powerpc/mm/ptdump/Makefile   |   2 +-
 arch/powerpc/platforms/52xx/Kconfig   |   2 +-
 arch/powerpc/platforms/Kconfig|   4 +-
 arch/powerpc/platforms/Kconfig.cputype|  29 -
 arch/powerpc/platforms/cell/Kconfig   |   3 +-
 arch/powerpc/platforms/chrp/Kconfig 

[PATCH v3 01/18] powerpc: Remove unused FW_FEATURE_NATIVE references

2021-10-21 Thread Nicholas Piggin
FW_FEATURE_NATIVE_ALWAYS and FW_FEATURE_NATIVE_POSSIBLE are always
zero and never do anything. Remove them.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/firmware.h | 8 
 1 file changed, 8 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index 97a3bd9ffeb9..9b702d2b80fb 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -80,8 +80,6 @@ enum {
FW_FEATURE_POWERNV_ALWAYS = 0,
FW_FEATURE_PS3_POSSIBLE = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
FW_FEATURE_PS3_ALWAYS = FW_FEATURE_LPAR | FW_FEATURE_PS3_LV1,
-   FW_FEATURE_NATIVE_POSSIBLE = 0,
-   FW_FEATURE_NATIVE_ALWAYS = 0,
FW_FEATURE_POSSIBLE =
 #ifdef CONFIG_PPC_PSERIES
FW_FEATURE_PSERIES_POSSIBLE |
@@ -91,9 +89,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
FW_FEATURE_PS3_POSSIBLE |
-#endif
-#ifdef CONFIG_PPC_NATIVE
-   FW_FEATURE_NATIVE_ALWAYS |
 #endif
0,
FW_FEATURE_ALWAYS =
@@ -105,9 +100,6 @@ enum {
 #endif
 #ifdef CONFIG_PPC_PS3
FW_FEATURE_PS3_ALWAYS &
-#endif
-#ifdef CONFIG_PPC_NATIVE
-   FW_FEATURE_NATIVE_ALWAYS &
 #endif
FW_FEATURE_POSSIBLE,
 
-- 
2.23.0



Re: [PATCH v3 01/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions

2021-10-21 Thread Pali Rohár
On Thursday 21 October 2021 20:37:26 Naveen Naidu wrote:
> An MMIO read from a PCI device that doesn't exist or doesn't respond
> causes a PCI error.  There's no real data to return to satisfy the
> CPU read, so most hardware fabricates ~0 data.
> 
> Add a PCI_ERROR_RESPONSE definition for that and use it where
> appropriate to make these checks consistent and easier to find.
> 
> Also add helper definitions SET_PCI_ERROR_RESPONSE and
> RESPONSE_IS_PCI_ERROR to make the code more readable.
> 
> Suggested-by: Bjorn Helgaas 
> Signed-off-by: Naveen Naidu 

Reviewed-by: Pali Rohár 

> ---
>  include/linux/pci.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index cd8aa6fce204..689c8277c584 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -154,6 +154,15 @@ enum pci_interrupt_pin {
>  /* The number of legacy PCI INTx interrupts */
>  #define PCI_NUM_INTX 4
>  
> +/*
> + * Reading from a device that doesn't respond typically returns ~0.  A
> + * successful read from a device may also return ~0, so you need additional
> + * information to reliably identify errors.
> + */
> +#define PCI_ERROR_RESPONSE (~0ULL)
> +#define SET_PCI_ERROR_RESPONSE(val) (*(val) = ((typeof(*(val))) PCI_ERROR_RESPONSE))
> +#define RESPONSE_IS_PCI_ERROR(val) ((val) == ((typeof(val)) PCI_ERROR_RESPONSE))
> +
>  /*
>   * pci_power_t values must match the bits in the Capabilities PME_Support
>   * and Control/Status PowerState fields in the Power Management capability.
> -- 
> 2.25.1
> 
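
The way the quoted macros truncate ~0ULL to the width of the target can be tried out in user space. The sketch below repeats the three definitions from the patch but spells typeof as __typeof__ so that it also builds in strict ISO modes with GCC or Clang; the two sk_ helper functions are hypothetical and exist only to exercise the macros.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ERROR_RESPONSE (~0ULL)
#define SET_PCI_ERROR_RESPONSE(val) \
	(*(val) = ((__typeof__(*(val))) PCI_ERROR_RESPONSE))
#define RESPONSE_IS_PCI_ERROR(val) \
	((val) == ((__typeof__(val)) PCI_ERROR_RESPONSE))

/* Hypothetical 32-bit read: returns all ones when no device responds. */
uint32_t sk_read32_or_error(bool device_present, uint32_t reg_value)
{
	uint32_t val = reg_value;

	if (!device_present)
		SET_PCI_ERROR_RESPONSE(&val);
	return val;
}

/* The cast narrows ~0ULL to 0xff here, so an 8-bit value compares as expected. */
bool sk_is_error8(uint8_t v)
{
	return RESPONSE_IS_PCI_ERROR(v);
}
```

As the comment in the patch says, a device can legitimately return all ones, so a true result from RESPONSE_IS_PCI_ERROR() is a hint rather than proof of a failed read.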


Re: [PATCH v3 02/25] PCI: Set error response in config access defines when ops->read() fails

2021-10-21 Thread Pali Rohár
On Thursday 21 October 2021 20:37:27 Naveen Naidu wrote:
> Make PCI_OP_READ and PCI_USER_READ_CONFIG set the data value with error
> response (~0), when the PCI device read by a host controller fails.
> 
> This ensures that the controller drivers no longer need to fabricate
> a (~0) value when they detect an error. It also guarantees that the error
> response (~0) is always set when a controller driver fails to read a
> config register from a device.
> 
> This makes error response fabrication consistent and helps in removal of
> a lot of repeated code.
> 
> Suggested-by: Rob Herring 
> Reviewed-by: Rob Herring 
> Signed-off-by: Naveen Naidu 

Reviewed-by: Pali Rohár 

> ---
>  drivers/pci/access.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/access.c b/drivers/pci/access.c
> index 46935695cfb9..0f732ba2f71a 100644
> --- a/drivers/pci/access.c
> +++ b/drivers/pci/access.c
> @@ -42,7 +42,10 @@ int noinline pci_bus_read_config_##size \
>   if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;   \
>   pci_lock_config(flags); \
>   res = bus->ops->read(bus, devfn, pos, len, &data);  \
> - *value = (type)data;\
> + if (res)\
> + SET_PCI_ERROR_RESPONSE(value);  \
> + else\
> + *value = (type)data;\
>   pci_unlock_config(flags);   \
>   return res; \
>  }
> @@ -228,7 +231,10 @@ int pci_user_read_config_##size \
>   ret = dev->bus->ops->read(dev->bus, dev->devfn, \
>   pos, sizeof(type), &data);  \
>   raw_spin_unlock_irq(_lock); \
> - *val = (type)data;  \
> + if (ret)\
> + SET_PCI_ERROR_RESPONSE(val);\
> + else\
> + *val = (type)data;  \
>   return pcibios_err_to_errno(ret);   \
>  }\
>  EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
> -- 
> 2.25.1
> 
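
The behaviour the quoted hunks give the accessor macros can be modelled with a plain function. This is a sketch, not the kernel macro: the callback type and all sk_ names are hypothetical, locking is omitted, and 0x86 is used as the failure code on the assumption that it matches PCIBIOS_DEVICE_NOT_FOUND.

```c
#include <stdint.h>

#define SK_PCIBIOS_SUCCESSFUL       0x00
#define SK_PCIBIOS_DEVICE_NOT_FOUND 0x86 /* assumed to match the kernel value */

/* Hypothetical stand-in for bus->ops->read(). */
typedef int (*sk_read_fn)(int where, int size, uint32_t *data);

/* Model of a 16-bit config read after the patch. */
int sk_read_config_word(sk_read_fn read, int where, uint16_t *value)
{
	uint32_t data = 0;
	int res = read(where, 2, &data);

	if (res)
		*value = (uint16_t)~0U;  /* error response, set centrally */
	else
		*value = (uint16_t)data;
	return res;
}

/* Controller that succeeds and returns a fixed register value. */
int sk_read_ok(int where, int size, uint32_t *data)
{
	(void)where;
	(void)size;
	*data = 0x1234;
	return SK_PCIBIOS_SUCCESSFUL;
}

/* Controller that fails and leaves *data untouched. */
int sk_read_fail(int where, int size, uint32_t *data)
{
	(void)where;
	(void)size;
	(void)data;
	return SK_PCIBIOS_DEVICE_NOT_FOUND;
}
```

Because the wrapper fills in the error response itself, a controller callback like sk_read_fail() does not have to write ~0 before returning an error.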


[PATCH] powerpc: Enhance pmem DMA bypass handling

2021-10-21 Thread Brian King
If ibm,pmemory is installed in the system, it can appear anywhere
in the address space. This patch enhances how we handle DMA for devices when
ibm,pmemory is present. In the case where we have enough DMA space to
direct map all of RAM, but not ibm,pmemory, we use direct DMA for
I/O to RAM and use the default window to dynamically map ibm,pmemory.
In the case where we only have a single DMA window, this won't work,
so if the window is not big enough to map the entire address range,
we cannot direct map.

Signed-off-by: Brian King 
---
 arch/powerpc/platforms/pseries/iommu.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 269f61d519c2..d9ae985d10a4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1092,15 +1092,6 @@ static phys_addr_t ddw_memory_hotplug_max(void)
phys_addr_t max_addr = memory_hotplug_max();
struct device_node *memory;
 
-   /*
-* The "ibm,pmemory" can appear anywhere in the address space.
-* Assuming it is still backed by page structs, set the upper limit
-* for the huge DMA window as MAX_PHYSMEM_BITS.
-*/
-   if (of_find_node_by_type(NULL, "ibm,pmemory"))
-   return (sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
-   (phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS);
-
for_each_node_by_type(memory, "memory") {
unsigned long start, size;
int n_mem_addr_cells, n_mem_size_cells, len;
@@ -1341,6 +1332,16 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 */
len = max_ram_len;
if (pmem_present) {
+   if (default_win_removed) {
+   /*
+* If we only have one DMA window and have pmem present,
+* then we need to be able to map the entire address
+* range in order to be able to do direct DMA to RAM.
+*/
+   len = order_base_2((sizeof(phys_addr_t) * 8 <= MAX_PHYSMEM_BITS) ?
+   (phys_addr_t) -1 : (1ULL << MAX_PHYSMEM_BITS));
+   }
+
if (query.largest_available_block >=
(1ULL << (MAX_PHYSMEM_BITS - page_shift)))
len = MAX_PHYSMEM_BITS;
-- 
2.27.0
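
A minimal standalone sketch of the length calculation added in the
enable_ddw() hunk above, for trying the arithmetic outside the kernel.
order_base_2_sketch() and full_range_len() are hypothetical names, and
the type width and MAX_PHYSMEM_BITS are passed in as plain parameters
here, which is an assumption made only for illustration.

```c
#include <stdint.h>

/* Stand-in for the kernel's order_base_2(): ceil(log2(n)), 0 for n <= 1. */
unsigned int order_base_2_sketch(uint64_t n)
{
	unsigned int order = 0;

	if (n <= 1)
		return 0;
	for (n -= 1; n; n >>= 1)
		order++;
	return order;
}

/*
 * Hypothetical helper: number of address bits the single DMA window must
 * cover when pmem is present and the default window has been removed.
 * phys_addr_bits plays the role of sizeof(phys_addr_t) * 8 (at most 64).
 */
unsigned int full_range_len(unsigned int phys_addr_bits,
			    unsigned int max_physmem_bits)
{
	uint64_t limit;

	if (phys_addr_bits <= max_physmem_bits)
		/* (phys_addr_t) -1: all ones in a phys_addr_bits-wide type */
		limit = phys_addr_bits >= 64 ? ~0ULL
					     : (1ULL << phys_addr_bits) - 1;
	else
		limit = 1ULL << max_physmem_bits;

	return order_base_2_sketch(limit);
}
```

With a 64-bit phys_addr_t and a smaller MAX_PHYSMEM_BITS, the result is
simply MAX_PHYSMEM_BITS, which is the same window size the existing
largest_available_block check selects.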



Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

2021-10-21 Thread Nathan Lynch
Nicholas Piggin  writes:
> Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
>> During Live Partition Migration (LPM), it is observed that perf
>> counter values report zero after migration completes. However,
>> 'perf stat' with a workload continues to show counts after migration
>> since the PMU gets disabled/enabled during sched switches. But in the
>> case of system/cpu-wide monitoring, zero counts were reported with
>> 'perf stat' after migration completion.
>> 
>> Example:
>>  ./perf stat -e r1001e -I 1000
>>time counts unit events
>>  1.001010437 22,137,414  r1001e
>>  2.002495447 15,455,821  r1001e
>> <<>> As seen in the logs below, the counter values show zero
>> after migration is completed.
>> <<>>
>> 86.142535370 129,392,333,440  r1001e
>> 87.144714617  0  r1001e
>> 88.146526636  0  r1001e
>> 89.148085029  0  r1001e
>> 
>> Here PMU is enabled during start of perf session and counter
>> values are read at intervals. Counters are only disabled at the
>> end of session. The powerpc mobility code presently does not handle
>> disabling and enabling back of PMU counters during partition
>> migration. Also since the PMU register values are not saved/restored
>> during migration, PMU registers like Monitor Mode Control Register 0
>> (MMCR0), Monitor Mode Control Register 1 (MMCR1) will not contain
>> the value it was programmed with. Hence PMU counters will not be
>> enabled correctly post migration.
>> 
>> Fix this in the mobility code by disabling and re-enabling the
>> PMU on all CPUs before and after migration. The patch introduces two
>> functions, 'mobility_pmu_disable' and 'mobility_pmu_enable'.
>> mobility_pmu_disable() is called before the processor threads go
>> to suspend state so as to disable the PMU counters. The disable is
>> done only if there are active events running on that CPU.
>> mobility_pmu_enable() is called after the processor threads are
>> back online to re-enable the PMU counters.
>> 
>> Since the Performance Monitor Counters (PMCs) are not
>> saved/restored during LPM, the PMC value ends up being zero while
>> 'event->hw.prev_count' holds a non-zero value. This causes problem
>
> Interesting. Are they defined to not be migrated, or may not be 
> migrated?

PAPR may be silent on this... at least I haven't found anything yet. But
I'm not very familiar with perf counters.

How much assurance do we have that hardware events we've programmed on
the source can be reliably re-enabled on the destination, with the same
semantics? Aren't there some model-specific counters that don't make
sense to handle this way?


>> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
>> index 9dc97d2..cea72d7 100644
>> --- a/arch/powerpc/include/asm/rtas.h
>> +++ b/arch/powerpc/include/asm/rtas.h
>> @@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
>>  static inline void read_24x7_sys_info(void) { }
>>  #endif
>>  
>> +#ifdef CONFIG_PPC_PERF_CTRS
>> +void mobility_pmu_disable(void);
>> +void mobility_pmu_enable(void);
>> +#else
>> +static inline void mobility_pmu_disable(void) { }
>> +static inline void mobility_pmu_enable(void) { }
>> +#endif
>> +
>>  #endif /* __KERNEL__ */
>>  #endif /* _POWERPC_RTAS_H */
>
> It's not implemented in rtas, maybe consider putting this into a perf 
> header?

+1



Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

2021-10-21 Thread Nathan Lynch
Athira Rajeev  writes:
> During Live Partition Migration (LPM), it is observed that perf
> counter values report zero after migration completes. However,
> 'perf stat' with a workload continues to show counts after migration
> since the PMU gets disabled/enabled during sched switches. But in the
> case of system/cpu-wide monitoring, zero counts were reported with
> 'perf stat' after migration completion.
>
> Example:
>  ./perf stat -e r1001e -I 1000
>time counts unit events
>  1.001010437 22,137,414  r1001e
>  2.002495447 15,455,821  r1001e
> <<>> As seen in the logs below, the counter values show zero
> after migration is completed.
> <<>>
> 86.142535370 129,392,333,440  r1001e
> 87.144714617  0  r1001e
> 88.146526636  0  r1001e
> 89.148085029  0  r1001e

Confirmed in my environment:

51.099987985 300,338  cache-misses
52.101839374 296,586  cache-misses
53.116089796 263,150  cache-misses
54.117949249 232,290  cache-misses
55.602029375 68,700,421,711  cache-misses
56.610073969  0  cache-misses
57.614732000  0  cache-misses

I wonder what it means that there is a very unlikely huge value before
the counter stops working -- I believe your example has this phenomenon
too.
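
One way to see how a zeroed counter combined with a stale previous
value can yield a very large reading is the sketch below. It assumes,
purely for illustration, a 32-bit counter whose delta is computed modulo
2^32; counter_delta() is a hypothetical helper, not the kernel's code,
and whether this is what actually produces the spike seen above is not
established in this thread.

```c
#include <stdint.h>

/*
 * Hypothetical delta helper, assuming 32-bit hardware counters: the
 * subtraction is done modulo 2^32 so that a normal wrap-around still
 * yields a small positive delta.
 */
uint64_t counter_delta(uint64_t prev, uint64_t val)
{
	return (val - prev) & 0xffffffffULL;
}
```

In normal operation prev and val are close, so the delta is small. If
the hardware counter comes back as 0 while the saved previous value is
still large, the same formula returns nearly 2^32 for that one interval.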


> diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
> index e83e089..ff7a77c 100644
> --- a/arch/powerpc/platforms/pseries/mobility.c
> +++ b/arch/powerpc/platforms/pseries/mobility.c
> @@ -476,6 +476,8 @@ static int do_join(void *arg)
>  retry:
>   /* Must ensure MSR.EE off for H_JOIN. */
>   hard_irq_disable();
> + /* Disable PMU before suspend */
> + mobility_pmu_disable();
>   hvrc = plpar_hcall_norets(H_JOIN);
>  
>   switch (hvrc) {
> @@ -530,6 +532,8 @@ static int do_join(void *arg)
>* reset the watchdog.
>*/
>   touch_nmi_watchdog();
> + /* Enable PMU after resuming */
> + mobility_pmu_enable();
>   return ret;
>  }

We should minimize calls into other subsystems from this context (the
callback function we've passed to stop_machine); it's fairly sensitive.
Can this be moved out to pseries_migrate_partition() or similar?


Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

2021-10-21 Thread Bjorn Helgaas
On Thu, Oct 21, 2021 at 10:23:30PM +0530, Naveen Naidu wrote:
> On 20/10, Bjorn Helgaas wrote:
> > On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:

> > > In EDR path, AER status registers are cleared irrespective of whether
> > > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> > > AER status registers are cleared only when it's an unmasked uncorrectable
> > > error.
> > > 
> > > This leads to two different behaviours for the same task (handling of
> > > DPC errors) in FFS systems and when native OS has control.
> > 
> > FFS?
> 
> Firmware First Systems

I assumed that's what it was, but it's helpful to use the same terms
used by the specs to make things easier to find.  I don't think it's
actually the case that "Firmware First" necessarily applies to the
entire system, since the ACPI FIRMWARE_FIRST flag is a per-error
source thing, not a per-system thing.


Re: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

2021-10-21 Thread Bjorn Helgaas
On Thu, Oct 21, 2021 at 10:06:11PM +0530, Naveen Naidu wrote:
> On 20/10, Bjorn Helgaas wrote:
> > On Tue, Oct 05, 2021 at 10:48:11PM +0530, Naveen Naidu wrote:
> > > dpc_process_error() clears both AER fatal and non fatal status
> > > registers. Instead of clearing each status registers via a different
> > > function call use pci_aer_clear_status().
> > > 
> > > This helps clean up the code a bit.
> > > 
> > > Signed-off-by: Naveen Naidu 
> > > ---
> > >  drivers/pci/pcie/dpc.c | 3 +--
> > >  1 file changed, 1 insertion(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > > index df3f3a10f8bc..faf4a1e77fab 100644
> > > --- a/drivers/pci/pcie/dpc.c
> > > +++ b/drivers/pci/pcie/dpc.c
> > > @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
> > >dpc_get_aer_uncorrect_severity(pdev, &info) &&
> > >aer_get_device_error_info(pdev, &info)) {
> > >   aer_print_error(pdev, &info);
> > > - pci_aer_clear_nonfatal_status(pdev);
> > > - pci_aer_clear_fatal_status(pdev);
> > > + pci_aer_clear_status(pdev);
> > 
> > The commit log suggests that this is a simple cleanup that doesn't
> > change any behavior, but that's not quite true:
> > 
> >   - The new code would clear PCI_ERR_ROOT_STATUS, but the old code
> > does not.
> > 
> >   - The old code masks the status bits with the severity bits before
> > clearing, but the new code does not.
> > 
> > The commit log needs to show why these changes are what we want.
> >
> 
> Reading through the code again, I realize how wrong(stupid) I was when
> making this patch. I was thinking that:
> 
>   pci_aer_clear_status() = pci_aer_clear_fatal_status() + 
> pci_aer_clear_nonfatal_status()
> 
> Now I understand, that it is not at all the case. I apologize for the
> mistake. I'll make sure to be meticulous while reading functions and not
> just assume their behaviour just from their function names.

No problem, one could argue that the collection of pci_aer_clear_*()
functions that do slightly different things is itself a defect.


Re: [PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[]

2021-10-21 Thread Bjorn Helgaas
On Thu, Oct 21, 2021 at 10:00:21PM +0530, Naveen Naidu wrote:
> On 20/10, Bjorn Helgaas wrote:
> > On Tue, Oct 05, 2021 at 10:48:08PM +0530, Naveen Naidu wrote:
> > > Currently, we do not print the "id" field in the AER error logs. Yet the
> > > aer_agent_string[] has the word "id" in it. The AER error log looks
> > > like:
> > > 
> > >   pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> > > 
> > > Without the "id" field in the error log, The aer_agent_string[]
> > > (eg: "Receiver ID") does not make sense. A user reading the
> > > aer_agent_string[] in the log, might inadvertently look for an "id"
> > > field and not finding it might lead to confusion.
> > > 
> > > Remove the "ID" from the aer_agent_string[].
> > > 
> > > The following are sample dummy errors inject via aer-inject.
> > 
> > I like this, and the problem it fixes was my fault because
> > these "ID" strings should have been removed by 010caed4ccb6.
> > 
> > If it's straightforward enough, it would be nice to have the
> > aer-inject command line here in the commit log to make it easier
> > for people to play with this.
> >
> 
> Thank you for the review. Do you mean something like:
> 
> The following sample dummy errors are injected via aer-inject via the
> following steps:
> 
>   1. The steps to compile the aer-inject tool are described in Section 4
>  (Software error inject) of the document [1]
> 
>  [1]: https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt
> 
>  Make sure to place the aer-inject executable at the home directory
>  of the qemu system or at any other place.
> 
>   2. Emulate a PCIe architecture using qemu. A sample invocation looks
>  like the following:
>  
>   qemu-system-x86_64 -kernel ../linux/arch/x86_64/boot/bzImage \
> -initrd  buildroot-build/images/rootfs.cpio.gz \
> -append "console=ttyS0"  \
> -enable-kvm -nographic \
> -M q35 \
> -device pcie-root-port,bus=pcie.0,id=rp1,slot=1 \
> -device pcie-pci-bridge,id=br1,bus=rp1 \
> -device e1000,bus=br1,addr=8
>
> Note that the PCIe features are available only when using the 
> 'q35' Machine [2]
> [2]: https://github.com/qemu/qemu/blob/master/docs/pcie.txt
> 
>   3. Once the qemu system starts up, create a sample aer-file or use any
>  example aer file from [3]
> 
>  [3]: https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/tree/examples
> 
>   4. Inject any aer-error using
>   
>   ./aer-inject aer-file
> 
> This does look a tad bit longer for a commit log so I am unsure if you
> would like to have it there. If you are okay with it, I would be happy
> to add it to that :)

Yes, that's kind of long.  Something like this
https://git.kernel.org/linus/d95f20c4f070 would be enough for the
commit log, especially since you've now provided all the details in
the email thread, where we can find them via the Link: tag.

Bjorn


Re: [PATCH v4 5/8] PCI/DPC: Converge EDR and DPC Path of clearing AER registers

2021-10-21 Thread Naveen Naidu
On 20/10, Bjorn Helgaas wrote:
> [+cc Keith, Sinan, Oza]
> 
> On Tue, Oct 05, 2021 at 10:48:12PM +0530, Naveen Naidu wrote:
> > In the EDR path, AER registers are cleared *after* DPC error event is
> > processed. The process stack in EDR is:
> > 
> >   edr_handle_event()
> > dpc_process_error()
> > pci_aer_raw_clear_status()
> > pcie_do_recovery()
> > 
> > But in DPC path, AER status registers are cleared *while* processing
> > the error. The process stack in DPC is:
> > 
> >   dpc_handler()
> > dpc_process_error()
> >   pci_aer_clear_status()
> > pcie_do_recovery()
> 
> These are accurate but they both include dpc_process_error(), so we
> need a hint to show why the one here is different from the one in the
> EDR path, e.g.,
> 
>   dpc_handler
> dpc_process_error
>   if (reason == 0)
> pci_aer_clear_status# uncorrectable errors only
> pcie_do_recovery
> 
> > In EDR path, AER status registers are cleared irrespective of whether
> > the error was an RP PIO or unmasked uncorrectable error. But in DPC, the
> > AER status registers are cleared only when it's an unmasked uncorrectable
> > error.
> > 
> > This leads to two different behaviours for the same task (handling of
> > DPC errors) in FFS systems and when native OS has control.
> 
> FFS?
>

Firmware First Systems

> I'd really like to have a specific example of how a user would observe
> this difference.  I know you probably don't have two systems to
> compare like that, but maybe we can work it out manually.
> 

Apologies again! Reading through the code again and the specification, I
realize that my understanding was very incorrect at the time of making
this patch. I grossly oversimplified EDR and DPC when I was learning
about it.

I'll drop this patch when I send the v5 for the series.

Apologies again ^^'

> I guess you're saying the problem is in the native DPC handling, and
> we don't clear the AER status registers for ERR_NONFATAL,
> ERR_FATAL, etc., right?
> 

But yes, I did have this question (I wasn't able to find the answer
when reading the spec): why do we not clear the entire ERR_NONFATAL and
ERR_FATAL registers in the DPC path, just like EDR does using
pci_aer_raw_clear_status() before going to pcie_do_recovery()?

I am sure I might have missed something in the spec. I guess I'll
look/re-read these bits again.

Thanks for the review :)

> I think the current behavior is from 8aefa9b0d910 ("PCI/DPC: Print AER
> status in DPC event handling"), where Keith explicitly mentions those
> cases.  The commit log here should connect back to that and explain
> whether something has changed.
> 
> I cc'd Keith and the reviewers of that change in case any of them have
> time to dig into this again.
> 
> > Bring the same semantics for clearing the AER status register in EDR
> > path and DPC path.
> > 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/dpc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index faf4a1e77fab..68899a3db126 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -288,7 +288,6 @@ void dpc_process_error(struct pci_dev *pdev)
> > >  dpc_get_aer_uncorrect_severity(pdev, &info) &&
> > >  aer_get_device_error_info(pdev, &info)) {
> > > aer_print_error(pdev, &info);
> > -   pci_aer_clear_status(pdev);
> > }
> >  }
> >  
> > @@ -297,6 +296,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
> > struct pci_dev *pdev = context;
> >  
> > dpc_process_error(pdev);
> > +   pci_aer_clear_status(pdev);
> >  
> > /* We configure DPC so it only triggers on ERR_FATAL */
> > pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
> > -- 
> > 2.25.1
> > 
> > ___
> > Linux-kernel-mentees mailing list
> > linux-kernel-ment...@lists.linuxfoundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/linux-kernel-mentees


Re: [PATCH v4 4/8] PCI/DPC: Use pci_aer_clear_status() in dpc_process_error()

2021-10-21 Thread Naveen Naidu
On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:11PM +0530, Naveen Naidu wrote:
> > dpc_process_error() clears both AER fatal and non fatal status
> > registers. Instead of clearing each status registers via a different
> > function call use pci_aer_clear_status().
> > 
> > This helps clean up the code a bit.
> > 
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/dpc.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index df3f3a10f8bc..faf4a1e77fab 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -288,8 +288,7 @@ void dpc_process_error(struct pci_dev *pdev)
> >  dpc_get_aer_uncorrect_severity(pdev, &info) &&
> >  aer_get_device_error_info(pdev, &info)) {
> > aer_print_error(pdev, &info);
> > -   pci_aer_clear_nonfatal_status(pdev);
> > -   pci_aer_clear_fatal_status(pdev);
> > +   pci_aer_clear_status(pdev);
> 
> The commit log suggests that this is a simple cleanup that doesn't
> change any behavior, but that's not quite true:
> 
>   - The new code would clear PCI_ERR_ROOT_STATUS, but the old code
> does not.
> 
>   - The old code masks the status bits with the severity bits before
> clearing, but the new code does not.
> 
> The commit log needs to show why these changes are what we want.
>

Reading through the code again, I realize how wrong(stupid) I was when
making this patch. I was thinking that:

  pci_aer_clear_status() = pci_aer_clear_fatal_status() + 
pci_aer_clear_nonfatal_status()

Now I understand, that it is not at all the case. I apologize for the
mistake. I'll make sure to be meticulous while reading functions and not
just assume their behaviour just from their function names.

I'll drop this patch in the next version of the patch series I make.

Apologies again ^^'

> > }
> >  }
> >  
> > -- 
> > 2.25.1
> > 


Re: [PATCH v4 3/8] PCI/DPC: Initialize info->id in dpc_process_error()

2021-10-21 Thread Naveen Naidu
On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:10PM +0530, Naveen Naidu wrote:
> > In the dpc_process_error() path, info->id isn't initialized before being
> > passed to aer_print_error(). In the corresponding AER path, it is
> > initialized in aer_isr_one_error().
> > 
> > The error message shown during Coverity Scan is:
> > 
> >   Coverity #1461602
> >   CID 1461602 (#1 of 1): Uninitialized scalar variable (UNINIT)
> >   8. uninit_use_in_call: Using uninitialized value info.id when calling aer_print_error.
> > 
> > Initialize the "info->id" before passing it to aer_print_error()
> > 
> > Fixes: 8aefa9b0d910 ("PCI/DPC: Print AER status in DPC event handling")
> > Signed-off-by: Naveen Naidu 
> > ---
> >  drivers/pci/pcie/dpc.c | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index c556e7beafe3..df3f3a10f8bc 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -262,14 +262,14 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
> >  
> >  void dpc_process_error(struct pci_dev *pdev)
> >  {
> > -   u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> > +   u16 cap = pdev->dpc_cap, status, reason, ext_reason;
> > struct aer_err_info info;
> >  
> > pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> > -   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> > +   pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
> >  
> > pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> > -status, source);
> > +status, info.id);
> >  
> > reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
> 
> Per PCIe r5.0, sec 7.9.15.5, the Source ID is defined only when the
> Trigger Reason indicates ERR_NONFATAL or ERR_FATAL.  So I think we
> need to extract this reason before reading PCI_EXP_DPC_SOURCE_ID,
> e.g.,
> 
>   reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
>   if (reason == 1 || reason == 2)
> pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &info.id);
>   else
> info.id = 0;
>

Thank you for the review, I'll make this change when I send a v5 for the
patch series.
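
For reference, the suggested ordering can be sketched as standalone C.
This is only an illustration under stated assumptions: the mask value
0x0006 is my understanding of PCI_EXP_DPC_STATUS_TRIGGER_RSN and is
redefined locally, dpc_source_id() is a hypothetical helper, and the
source_reg argument stands in for the value that
pci_read_config_word() would fetch from PCI_EXP_DPC_SOURCE_ID.

```c
#include <stdint.h>

/* Assumed to match PCI_EXP_DPC_STATUS_TRIGGER_RSN (bits 2:1). */
#define DPC_STATUS_TRIGGER_RSN	0x0006

/*
 * Hypothetical helper: return the Source ID only when the Trigger
 * Reason says it is defined (1 or 2, i.e. triggered by an ERR_NONFATAL
 * or ERR_FATAL message), and 0 otherwise.
 */
uint16_t dpc_source_id(uint16_t status, uint16_t source_reg)
{
	uint16_t reason = (status & DPC_STATUS_TRIGGER_RSN) >> 1;

	if (reason == 1 || reason == 2)
		return source_reg;

	return 0;
}
```

The point of the ordering is that the trigger reason is decoded first,
so the Source ID register is consulted only in the cases where its
contents are meaningful.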

> > ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
> > -- 
> > 2.25.1
> > 


Re: [PATCH v4 1/8] PCI/AER: Remove ID from aer_agent_string[]

2021-10-21 Thread Naveen Naidu
On 20/10, Bjorn Helgaas wrote:
> On Tue, Oct 05, 2021 at 10:48:08PM +0530, Naveen Naidu wrote:
> > Currently, we do not print the "id" field in the AER error logs. Yet the
> > aer_agent_string[] has the word "id" in it. The AER error log looks
> > like:
> > 
> > >   pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> > 
> > Without the "id" field in the error log, The aer_agent_string[]
> > (eg: "Receiver ID") does not make sense. A user reading the
> > aer_agent_string[] in the log, might inadvertently look for an "id"
> > field and not finding it might lead to confusion.
> > 
> > Remove the "ID" from the aer_agent_string[].
> > 
> > The following are sample dummy errors inject via aer-inject.
> 
> I like this, and the problem it fixes was my fault because
> these "ID" strings should have been removed by 010caed4ccb6.
> 
> If it's straightforward enough, it would be nice to have the
> aer-inject command line here in the commit log to make it easier
> for people to play with this.
>

Thank you for the review. Do you mean something like:

The following sample dummy errors are injected via aer-inject via the
following steps:

  1. The steps to compile the aer-inject tool are described in Section 4
 (Software error inject) of the document [1]

 [1]: https://www.kernel.org/doc/Documentation/PCI/pcieaer-howto.txt

 Make sure to place the aer-inject executable at the home directory
 of the qemu system or at any other place.

  2. Emulate a PCIe architecture using qemu. A sample invocation looks
 like the following:
 
qemu-system-x86_64 -kernel ../linux/arch/x86_64/boot/bzImage \
-initrd  buildroot-build/images/rootfs.cpio.gz \
-append "console=ttyS0"  \
-enable-kvm -nographic \
-M q35 \
-device pcie-root-port,bus=pcie.0,id=rp1,slot=1 \
-device pcie-pci-bridge,id=br1,bus=rp1 \
-device e1000,bus=br1,addr=8
   
Note that the PCIe features are available only when using the 
'q35' Machine [2]
[2]: https://github.com/qemu/qemu/blob/master/docs/pcie.txt

  3. Once the qemu system starts up, create a sample aer-file or use any
 example aer file from [3]

 [3]: https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/tree/examples

  4. Inject any aer-error using
  
  ./aer-inject aer-file

This does look a tad bit longer for a commit log so I am unsure if you
would like to have it there. If you are okay with it, I would be happy
to add it to that :)

> > Before
> > ===
> > 
> > In 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID"),
> > the "id" field was removed from the AER error logs, so currently AER
> > logs look like:
> > 
> > >   pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03:0
> > >   pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID) <--- no id field
> > >   pcieport 0000:00:03.0:   device [1b36:000c] error status/mask=00000040/0000e000
> > >   pcieport 0000:00:03.0:    [ 6] BadTLP
> > 
> > After
> > ==
> > 
> > >   pcieport 0000:00:03.0: AER: Corrected error received: 0000:00:03.0
> > >   pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver)
> > >   pcieport 0000:00:03.0:   device [1b36:000c] error status/mask=00000040/0000e000
> > >   pcieport 0000:00:03.0:    [ 6] BadTLP
> > 
> > Signed-off-by: Naveen Naidu 
> > ---
> > >  drivers/pci/pcie/aer.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 9784fdcf3006..241ff361b43c 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -516,10 +516,10 @@ static const char *aer_uncorrectable_error_string[] = 
> > {
> >  };
> >  
> >  static const char *aer_agent_string[] = {
> > -   "Receiver ID",
> > -   "Requester ID",
> > -   "Completer ID",
> > -   "Transmitter ID"
> > +   "Receiver",
> > +   "Requester",
> > +   "Completer",
> > +   "Transmitter"
> >  };
> >  
> > >  #define aer_stats_dev_attr(name, stats_array, strings_array,   \
> > > @@ -703,7 +703,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> > const char *level;
> >  
> > if (!info->status) {
> > > -   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> > > +   pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent)\n",
> > aer_error_severity_string[info->severity]);
> > goto out;
> > }
> > -- 
> > 2.25.1
> > 


Re: [PATCH 07/20] signal/powerpc: On swapcontext failure force SIGSEGV

2021-10-21 Thread Kees Cook
On Wed, Oct 20, 2021 at 12:43:53PM -0500, Eric W. Biederman wrote:
> If the register state may be partial and corrupted instead of calling
> do_exit, call force_sigsegv(SIGSEGV).  Which properly kills the
> process with SIGSEGV and does not let any more userspace code execute,
> instead of just killing one thread of the process and potentially
> confusing everything.
> 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> History-tree: git://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
> Fixes: 756f1ae8a44e ("PPC32: Rework signal code and add a swapcontext system call.")
> Fixes: 04879b04bf50 ("[PATCH] ppc64: VMX (Altivec) support & signal32 rework, from Ben Herrenschmidt")
> Signed-off-by: "Eric W. Biederman" 

This looks right to me.

Reviewed-by: Kees Cook 

-- 
Kees Cook


[PATCH v3 19/25] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware

2021-10-21 Thread Naveen Naidu
An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Use RESPONSE_IS_PCI_ERROR() to check the response we get when we read
data from hardware.

This helps unify PCI error response checking and make error checks
consistent and easier to find.

Compile tested only.

Signed-off-by: Naveen Naidu 
---
 drivers/pci/pcie/dpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index c556e7beafe3..4a051a096075 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -79,7 +79,7 @@ static bool dpc_completed(struct pci_dev *pdev)
u16 status;
 
pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
-   if ((status != 0xffff) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
+   if ((!RESPONSE_IS_PCI_ERROR(status)) && (status & PCI_EXP_DPC_STATUS_TRIGGER))
return false;
 
if (test_bit(PCI_DPC_RECOVERING, &pdev->priv_flags))
@@ -312,7 +312,7 @@ static irqreturn_t dpc_irq(int irq, void *context)
 
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
 
-   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || status == (u16)(~0))
+   if (!(status & PCI_EXP_DPC_STATUS_INTERRUPT) || RESPONSE_IS_PCI_ERROR(status))
return IRQ_NONE;
 
pci_write_config_word(pdev, cap + PCI_EXP_DPC_STATUS,
-- 
2.25.1



[PATCH v3 02/25] PCI: Set error response in config access defines when ops->read() fails

2021-10-21 Thread Naveen Naidu
Make PCI_OP_READ and PCI_USER_READ_CONFIG set the data value with error
response (~0), when the PCI device read by a host controller fails.

This ensures that the controller drivers no longer need to fabricate
(~0) value when they detect an error. It also guarantees that the error
response (~0) is always set when a controller driver fails to read a
config register from a device.

This makes error response fabrication consistent and helps in removal of
a lot of repeated code.

Suggested-by: Rob Herring 
Reviewed-by: Rob Herring 
Signed-off-by: Naveen Naidu 
---
 drivers/pci/access.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 46935695cfb9..0f732ba2f71a 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -42,7 +42,10 @@ int noinline pci_bus_read_config_##size \
if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;   \
pci_lock_config(flags); \
res = bus->ops->read(bus, devfn, pos, len, &data);  \
-   *value = (type)data;\
+   if (res)\
+   SET_PCI_ERROR_RESPONSE(value);  \
+   else\
+   *value = (type)data;\
pci_unlock_config(flags);   \
return res; \
 }
@@ -228,7 +231,10 @@ int pci_user_read_config_##size \
ret = dev->bus->ops->read(dev->bus, dev->devfn, \
pos, sizeof(type), &data);  \
raw_spin_unlock_irq(&pci_lock); \
-   *val = (type)data;  \
+   if (ret)\
+   SET_PCI_ERROR_RESPONSE(val);\
+   else\
+   *val = (type)data;  \
return pcibios_err_to_errno(ret);   \
 }  \
 EXPORT_SYMBOL_GPL(pci_user_read_config_##size);
-- 
2.25.1
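
To illustrate the behaviour this patch gives the config accessors, here
is a small userspace sketch of the same pattern for a 16-bit read. It is
not the kernel macro itself: cfg_read_fn, fake_read_ok(),
fake_read_fail() and read_config_word() are hypothetical names standing
in for bus->ops->read() and the generated pci_bus_read_config_word().

```c
#include <stdint.h>

/* Hypothetical stand-in for ->read(): 0 on success, non-zero on error. */
typedef int (*cfg_read_fn)(int pos, uint32_t *data);

/* Pretend device that answers every read with the same dword. */
int fake_read_ok(int pos, uint32_t *data)
{
	(void)pos;
	*data = 0x12348086u;
	return 0;
}

/* Pretend device that does not respond; data is left untouched. */
int fake_read_fail(int pos, uint32_t *data)
{
	(void)pos;
	(void)data;
	return -1;
}

/* Same shape as the patched PCI_OP_READ body, specialised to 16 bits. */
int read_config_word(cfg_read_fn read, int pos, uint16_t *value)
{
	uint32_t data = 0;
	int res = read(pos, &data);

	if (res)
		*value = (uint16_t)~0ULL;	/* the error response, ~0 */
	else
		*value = (uint16_t)data;
	return res;
}
```

The caller sees 0xffff whenever the underlying read fails, regardless of
what the controller callback left in data, which is why individual
controller drivers would no longer have to fabricate the value.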



[PATCH v3 01/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions

2021-10-21 Thread Naveen Naidu
An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the
CPU read, so most hardware fabricates ~0 data.

Add a PCI_ERROR_RESPONSE definition for that and use it where
appropriate to make these checks consistent and easier to find.

Also add helper definitions SET_PCI_ERROR_RESPONSE and
RESPONSE_IS_PCI_ERROR to make the code more readable.

Suggested-by: Bjorn Helgaas 
Signed-off-by: Naveen Naidu 
---
 include/linux/pci.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index cd8aa6fce204..689c8277c584 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -154,6 +154,15 @@ enum pci_interrupt_pin {
 /* The number of legacy PCI INTx interrupts */
 #define PCI_NUM_INTX   4
 
+/*
+ * Reading from a device that doesn't respond typically returns ~0.  A
+ * successful read from a device may also return ~0, so you need additional
+ * information to reliably identify errors.
+ */
+#define PCI_ERROR_RESPONSE (~0ULL)
+#define SET_PCI_ERROR_RESPONSE(val)    (*(val) = ((typeof(*(val))) PCI_ERROR_RESPONSE))
+#define RESPONSE_IS_PCI_ERROR(val)     ((val) == ((typeof(val)) PCI_ERROR_RESPONSE))
+
 /*
  * pci_power_t values must match the bits in the Capabilities PME_Support
  * and Control/Status PowerState fields in the Power Management capability.
-- 
2.25.1
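
A small userspace sketch of why the helper macros above cast through
typeof, assuming a GNU C compiler (__typeof__). The helper functions
are hypothetical and only exist to exercise the macros at different
access widths; they are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ERROR_RESPONSE		(~0ULL)
#define SET_PCI_ERROR_RESPONSE(val)	\
	(*(val) = ((__typeof__(*(val))) PCI_ERROR_RESPONSE))
#define RESPONSE_IS_PCI_ERROR(val)	\
	((val) == ((__typeof__(val)) PCI_ERROR_RESPONSE))

/* Hypothetical helpers, one per access width. */
bool byte_is_error(uint8_t v)   { return RESPONSE_IS_PCI_ERROR(v); }
bool word_is_error(uint16_t v)  { return RESPONSE_IS_PCI_ERROR(v); }
bool dword_is_error(uint32_t v) { return RESPONSE_IS_PCI_ERROR(v); }

/*
 * Comparing without the cast: v is widened to 64 bits, so 0xffffffff
 * never equals the 64-bit all-ones constant.
 */
bool dword_equals_raw(uint32_t v)
{
	return v == PCI_ERROR_RESPONSE;
}

/* Fabricate an error response in a 32-bit variable. */
uint32_t fabricated_dword(void)
{
	uint32_t v = 0;

	SET_PCI_ERROR_RESPONSE(&v);
	return v;
}
```

The cast truncates the 64-bit constant to the width of the variable
being tested, so one definition serves byte, word and dword reads.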



[PATCH v3 00/25] Unify PCI error response checking

2021-10-21 Thread Naveen Naidu
An MMIO read from a PCI device that doesn't exist or doesn't respond
causes a PCI error.  There's no real data to return to satisfy the 
CPU read, so most hardware fabricates ~0 data.

This patch series adds a PCI_ERROR_RESPONSE definition, along with the
helper definitions SET_PCI_ERROR_RESPONSE and RESPONSE_IS_PCI_ERROR, and
uses them where appropriate. This unifies PCI error response checking and
makes the checks consistent and easier to find.

This series also ensures that the error response fabrication now happens
in the PCI_OP_READ and PCI_USER_READ_CONFIG. This removes the
responsibility from controller drivers to do the error response setting. 

Patch 1:
- Adds the PCI_ERROR_RESPONSE and other related definitions
- All other patches are dependent on this patch. This patch needs to
  be applied first, before the others

Patch 2:
- Error fabrication happens in PCI_OP_READ and PCI_USER_READ_CONFIG
  whenever the data read via the controller driver fails.
- This patch needs to be applied before Patch 4/24 to Patch 15/24 are
  applied.

Patch 3:
- Uses SET_PCI_ERROR_RESPONSE() when device is not found 

Patch 4 - 15:
- Removes redundant error fabrication that happens in controller 
  drivers when the read from a PCI device fails.
- These patches are dependent on Patch 2/24 of the series.
- These can be applied in any order.

Patch 16 - 22:
- Uses RESPONSE_IS_PCI_ERROR() to check the reads from hardware
- Patches can be applied in any order.

Patch 23 - 25:
- Edits the comments to include PCI_ERROR_RESPONSE along with
  0x, so that it becomes easier to grep for faulty
  hardware reads.

Changelog
=========
v3:
    - Change RESPONSE_IS_PCI_ERROR macro definition
    - Fix the macros, add () around macro parameters
    - Fix alignment issue in Patch 2/24
    - Add proper recipients for all the patches

v2:
- Instead of using SET_PCI_ERROR_RESPONSE in all controller drivers
  to fabricate error response, only use them in PCI_OP_READ and
  PCI_USER_READ_CONFIG

Naveen Naidu (25):
  [Patch 1/25] PCI: Add PCI_ERROR_RESPONSE and it's related definitions
  [Patch 2/25] PCI: Set error response in config access defines when ops->read() fails
  [Patch 3/25] PCI: Use SET_PCI_ERROR_RESPONSE() when device not found
  [Patch 4/25] PCI: Remove redundant error fabrication when device read fails
  [Patch 5/25] PCI: thunder: Remove redundant error fabrication when device read fails
  [Patch 6/25] PCI: iproc: Remove redundant error fabrication when device read fails
  [Patch 7/25] PCI: mediatek: Remove redundant error fabrication when device read fails
  [Patch 8/25] PCI: exynos: Remove redundant error fabrication when device read fails
  [Patch 9/25] PCI: histb: Remove redundant error fabrication when device read fails
  [Patch 10/25] PCI: kirin: Remove redundant error fabrication when device read fails
  [Patch 11/25] PCI: aardvark: Remove redundant error fabrication when device read fails
  [Patch 12/25] PCI: mvebu: Remove redundant error fabrication when device read fails
  [Patch 13/25] PCI: altera: Remove redundant error fabrication when device read fails
  [Patch 14/25] PCI: rcar: Remove redundant error fabrication when device read fails
  [Patch 15/25] PCI: rockchip: Remove redundant error fabrication when device read fails
  [Patch 16/25] PCI/ERR: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 17/25] PCI: vmd: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 18/25] PCI: pciehp: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 19/25] PCI/DPC: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 20/25] PCI/PME: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 21/25] PCI: cpqphp: Use RESPONSE_IS_PCI_ERROR() to check read from hardware
  [Patch 22/25] PCI: Use PCI_ERROR_RESPONSE to specify hardware error
  [Patch 23/25] PCI: keystone: Use PCI_ERROR_RESPONSE to specify hardware error
  [Patch 24/25] PCI: hv: Use PCI_ERROR_RESPONSE to specify hardware read error
  [Patch 25/25] PCI: xgene: Use PCI_ERROR_RESPONSE to specify hardware error

 drivers/pci/access.c| 32 +++---
 drivers/pci/controller/dwc/pci-exynos.c |  4 +-
 drivers/pci/controller/dwc/pci-keystone.c   |  4 +-
 drivers/pci/controller/dwc/pcie-histb.c |  4 +-
 drivers/pci/controller/dwc/pcie-kirin.c |  4 +-
 drivers/pci/controller/pci-aardvark.c   | 10 +
 drivers/pci/controller/pci-hyperv.c |  2 +-
 drivers/pci/controller/pci-mvebu.c  |  8 +---
 drivers/pci/controller/pci-thunder-ecam.c   | 46 +++--
 drivers/pci/controller/pci-thunder-pem.c|  4 +-
 drivers/pci/controller/pci-xgene.c  |  8 ++--
 drivers/pci/controller/pcie-altera.c|  4 +-
 drivers/pci/controller/pcie-iproc.c |  4 +-
 drivers/pci/controller/pcie-mediatek.c  | 11 

Re: coherency issue observed after hotplug on POWER8

2021-10-21 Thread Krzysztof Kozlowski
On 24/09/2021 19:17, Naveen N. Rao wrote:
> Hi Cascardo,
> Thanks for reporting this.
> 
> 
> Thadeu Lima de Souza Cascardo wrote:
>> Hi, there.
>>
>> We have been investigating an issue we have observed on POWER8 POWERNV 
>> systems.
>> When running the kernel selftests reuseport_bpf_cpu after a CPU hotplug, we 
>> see
>> crashes, in different forms. [1]
> 
> Just to re-confirm: you are only seeing this on P8 powernv, and not in a 
> P8 guest/LPAR? I haven't been able to reproduce this on a firestone -- 
> can you share more details about your power8 machine?
> 
> Also, do you only see this with ubuntu kernels, or are you also able to 
> reproduce this with the upstream tree?

Let me just cover this part of your email:

Upstream trees (5.11, 5.13, 5.14). See also:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1927076/comments/28

I could not reproduce it on a Power8 LPAR, nor on a Power9 QEMU guest.

Reproduced on a few machines:
IBM, POWER8NVL, 8335-GTB
POWER8, 8001-22C and 8335-GTA

lscpu for the last one:
https://bugs.launchpad.net/ubuntu-power-systems/+bug/1927076/comments/15


Best regards,
Krzysztof


Re: [PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)

2021-10-21 Thread Eric W. Biederman
Geert Uytterhoeven  writes:

> Hi Eric,
>
> Patch 21/20?

In reviewing another part of the patchset Linus asked if force_sigsegv
could go away.  It can't completely but I can get this far.

Given that it is just a cleanup it makes most sense to me as an
additional patch on top of what is already here.


> On Wed, Oct 20, 2021 at 11:52 PM Eric W. Biederman
>  wrote:
>> Now that force_fatal_sig exists it is unnecessary and a bit confusing
>> to use force_sigsegv in cases where the simpler force_fatal_sig is
>> wanted.  So change every instance we can to make the code clearer.
>>
>> Signed-off-by: "Eric W. Biederman" 
>
>>  arch/m68k/kernel/traps.c| 2 +-
>
> Acked-by: Geert Uytterhoeven 

Thank you.

Eric


Re: [PATCH v2] powerpc/pseries/mobility: ignore ibm, platform-facilities updates

2021-10-21 Thread Nathan Lynch
Daniel Axtens  writes:
>> On VMs with NX encryption, compression, and/or RNG offload, these
>> capabilities are described by nodes in the ibm,platform-facilities device
>> tree hierarchy:
>>
>>   $ tree -d /sys/firmware/devicetree/base/ibm,platform-facilities/
>>   /sys/firmware/devicetree/base/ibm,platform-facilities/
>>   ├── ibm,compression-v1
>>   ├── ibm,random-v1
>>   └── ibm,sym-encryption-v1
>>
>>   3 directories
>>
>> The acceleration functions that these nodes describe are not disrupted by
>> live migration, not even temporarily.
>>
>> But the post-migration ibm,update-nodes sequence firmware always sends
>> "delete" messages for this hierarchy, followed by an "add" directive to
>> reconstruct it via ibm,configure-connector (log with debugging statements
>> enabled in mobility.c):
>>
>>   mobility: removing node /ibm,platform-facilities/ibm,random-v1:4294967285
>>   mobility: removing node /ibm,platform-facilities/ibm,compression-v1:4294967284
>>   mobility: removing node /ibm,platform-facilities/ibm,sym-encryption-v1:4294967283
>>   mobility: removing node /ibm,platform-facilities:4294967286
>>   ...
>>   mobility: added node /ibm,platform-facilities:4294967286
>>
>> Note we receive a single "add" message for the entire hierarchy, and what
>> we receive from the ibm,configure-connector sequence is the top-level
>> platform-facilities node along with its three children. The debug message
>> simply reports the parent node and not the whole subtree.
>
> If I understand correctly, (and again, this is not my area at all!) we
> still have to go out to the firmware and call the
> ibm,configure-connector sequence in order to figure out that the node
> we're supposed to add is the ibm,platform-facilities node, right? We
> can't save enough information at delete time to avoid the trip out to
> firmware?

That is right... but maybe I don't understand your angle here. Unsure
what avoiding the configure-connector sequence for the nodes would buy
us.


>> Until that can be realized we have a confirmed use-after-free and the
>> possibility of memory corruption. So add a limited workaround that
>> discriminates on the node type, ignoring adds and removes. This should be
>> amenable to backporting in the meantime.
>
> Yeah it's an unpleasant situation to find ourselves in. It's a bit icky
> but as I think you said in a previous email, at least this isn't worse:
> in the common case it should now succeed and if properties change
> significantly it will still fail.
>
> My one question (from more of a security point of view) is:
>  1) Say you start using the facilities with a particular set of
> parameters.
>
>  2) Say you then get migrated and the parameters change.
>
>  3) If you keep using the platform facilities as if the original
> properties are still valid, can this cause any Interesting,
> unexpected or otherwise Bad consequences? Are we going to end up
> (for example) scribbling over random memory somehow?

If drivers are safely handling errors from H_COP_OP etc, then no. (I
know, this looks like a Well That Would Be a Driver Bug dismissal, but
that's not my attitude.) And again this is a case where the change
cannot make things worse.

In the current design of the pseries LPM implementation, user space and
other normal system activity resume as soon as we return from the
stop_machine() call which suspends the partition, executing concurrently
with any device tree updates. So even if we had code in place to
correctly resolve the DT changes and the drivers were able to respond to
the changes, there would still be a window of exposure to the kind of
problem you describe: the changed characteristics, if any, of the
destination obtain as soon as execution resumes, regardless of when the
OS initiates the update-nodes sequence.

The way out of that mess is to use the Linux suspend framework, or
otherwise prevent user space from executing until the destination
system's characteristics have been appropriately propagated out to the
necessary drivers etc. I'm trying to get there.


> Apart from that, the code seems to do what it says, it seems to solve a
> real problem, the error and memory handling makes sense, you _put the DT
> nodes that you _get, the comments are helpful and descriptive, and it
> passes the automated tests on patchwork/snowpatch.

I appreciate your review!
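
The limited workaround discussed above, ignoring update-nodes adds and
removes for the ibm,platform-facilities hierarchy, could be sketched
roughly as follows. This is an illustration only: the patch is said to
discriminate on the node type, whereas this sketch assumes a plain
path-prefix match, and classify_update() and the enum are hypothetical
names that do not come from mobility.c.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical outcome of inspecting one update-nodes entry. */
enum update_action { UPDATE_APPLY, UPDATE_IGNORE };

/* True for the platform-facilities node itself and anything below it. */
static bool under_platform_facilities(const char *path)
{
	static const char prefix[] = "/ibm,platform-facilities";
	const size_t n = sizeof(prefix) - 1;

	if (strncmp(path, prefix, n) != 0)
		return false;
	/* Reject siblings that merely share the prefix. */
	return path[n] == '\0' || path[n] == '/';
}

enum update_action classify_update(const char *path)
{
	return under_platform_facilities(path) ? UPDATE_IGNORE : UPDATE_APPLY;
}
```

Adds and removes for every other node are still applied, so the sketch
only narrows the set of updates that get processed.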


[GIT PULL] Please pull powerpc/linux.git powerpc-5.15-5 tag

2021-10-21 Thread Michael Ellerman
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull some more powerpc fixes for 5.15:

The following changes since commit cdeb5d7d890e14f3b70e8087e745c4a6a7d9f337:

  KVM: PPC: Book3S HV: Make idle_kvm_start_guest() return 0 if it went to guest (2021-10-16 00:40:03 +1100)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.15-5

for you to fetch changes up to 787252a10d9422f3058df9a4821f389e5326c440:

  powerpc/smp: do not decrement idle task preempt count in CPU offline (2021-10-20 21:38:01 +1100)

- ----------------------------------------------------------------
powerpc fixes for 5.15 #5

Fix a bug exposed by a previous fix, where running guests with certain SMT
topologies could crash the host on Power8.

Fix atomic sleep warnings when re-onlining CPUs, when PREEMPT is enabled.

Thanks to: Nathan Lynch, Srikar Dronamraju, Valentin Schneider.

- ----------------------------------------------------------------
Michael Ellerman (1):
  powerpc/idle: Don't corrupt back chain when going idle

Nathan Lynch (1):
  powerpc/smp: do not decrement idle task preempt count in CPU offline


 arch/powerpc/kernel/idle_book3s.S | 10 ++++++----
 arch/powerpc/kernel/smp.c         |  2 --
 2 files changed, 6 insertions(+), 6 deletions(-)
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAmFxT3IACgkQUevqPMjh
pYD0Mg//fh9BEVcfCtGJskqpbppkLJL8Pk/npKdiXZ1//ESTnSSXp0SfjHwnKW8H
R+FYUomAdB4pis6lfUUlVFHcADPrf1C55IF6f4pP7mLWGKuscMTjDRmgBOkgEreY
pciP+aGkNWu6Lmzoz1ZEqYr1mZW6TX3/Os9BabFUNze4gzTT6Y4U+/QOYrt5VQZB
SAnzyfjOq0c9HDP3OFVn9xUGkOpikRA2rT/0lKVFs5CPkqmLv82i/slz9SwE96kX
Zfi9CCJ3ule0RgysYg33QpAzZfQiLATAJBLk+Wlyl9SAFQ8w+cOhFtJmHryGFPz5
n5JopbE2lECJxw5fhasLwraDZzTd84xHvx1xpl2nIQEzrRlpV+Kq2c7SEbNROakL
rp/xmnBjfo9wMVwjo7x20arqj+o5XBs7yW04gMXV5yJVMUjgn288LAZzTe9nNegk
drzfrVHyvNKCoEWZr0egUDazSuh1kiWC9srmZ3IB2Cx2AGXYvXk+MsftoIqP8Nu6
gs9+NRwNiroaX3aujlcZHC4J7QwGMUZo6Dvs2e0VX+kvGlvnrP8+7pQ0eSoSJlVv
DfeTZa630yD0TdPVqWXHHeOUqbza/bpeHUxdFwSYdlr4DLrr3XkfaO/VQ6XJmTHa
9aZq3y7ALgQGvXHUIJYH4upZYoK1BhNqCJ0A2w+8uqKlunlXx8w=
=ES/7
-----END PGP SIGNATURE-----


Re: [PATCH v2] powerpc/smp: do not decrement idle task preempt count in CPU offline

2021-10-21 Thread Michael Ellerman
On Fri, 15 Oct 2021 12:39:02 -0500, Nathan Lynch wrote:
> With PREEMPT_COUNT=y, when a CPU is offlined and then onlined again, we
> get:
> 
> BUG: scheduling while atomic: swapper/1/0/0x
> no locks held by swapper/1/0.
> CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.15.0-rc2+ #100
> Call Trace:
>  dump_stack_lvl+0xac/0x108
>  __schedule_bug+0xac/0xe0
>  __schedule+0xcf8/0x10d0
>  schedule_idle+0x3c/0x70
>  do_idle+0x2d8/0x4a0
>  cpu_startup_entry+0x38/0x40
>  start_secondary+0x2ec/0x3a0
>  start_secondary_prolog+0x10/0x14
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/smp: do not decrement idle task preempt count in CPU offline
  https://git.kernel.org/powerpc/c/787252a10d9422f3058df9a4821f389e5326c440

cheers


Re: [PATCH] powerpc/idle: Don't corrupt back chain when going idle

2021-10-21 Thread Michael Ellerman
On Wed, 20 Oct 2021 20:48:26 +1100, Michael Ellerman wrote:
> In isa206_idle_insn_mayloss() we store various registers into the stack
> red zone, which is allowed.
> 
> However inside the IDLE_STATE_ENTER_SEQ_NORET macro we save r2 again,
> to 0(r1), which corrupts the stack back chain.
> 
> We used to do the same in isa206_idle_insn_mayloss() itself, but we
> fixed that in 73287caa9210 ("powerpc64/idle: Fix SP offsets when saving
> GPRs"), however we missed that the macro also corrupts the back chain.
> 
> [...]

Applied to powerpc/fixes.

[1/1] powerpc/idle: Don't corrupt back chain when going idle
  https://git.kernel.org/powerpc/c/496c5fe25c377ddb7815c4ce8ecfb676f051e9b6

cheers


[PATCH] powerpc: mpc8349emitx: Make mcu_gpiochip_remove() return void

2021-10-21 Thread Uwe Kleine-König
Up to now mcu_gpiochip_remove() returns zero unconditionally. Make it
return void instead which makes it easier to see in the callers that
there is no error to handle.

Also the return value of i2c remove callbacks is ignored anyway.

Signed-off-by: Uwe Kleine-König 
---
 arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c b/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c
index 409481016928..bb789f33c70e 100644
--- a/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c
+++ b/arch/powerpc/platforms/83xx/mcu_mpc8349emitx.c
@@ -135,11 +135,10 @@ static int mcu_gpiochip_add(struct mcu *mcu)
return gpiochip_add_data(gc, mcu);
 }
 
-static int mcu_gpiochip_remove(struct mcu *mcu)
+static void mcu_gpiochip_remove(struct mcu *mcu)
 {
kfree(mcu->gc.label);
gpiochip_remove(&mcu->gc);
-   return 0;
 }
 
 static int mcu_probe(struct i2c_client *client)
@@ -198,9 +197,7 @@ static int mcu_remove(struct i2c_client *client)
glob_mcu = NULL;
}
 
-   ret = mcu_gpiochip_remove(mcu);
-   if (ret)
-   return ret;
+   mcu_gpiochip_remove(mcu);
kfree(mcu);
return 0;
 }
-- 
2.30.2
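
A userspace sketch of the same cleanup, under the assumption that the
teardown helper genuinely cannot fail. struct fake_mcu and the fake_*
functions are hypothetical stand-ins for the driver's types; malloc()
and free() stand in for the kernel allocators.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the driver's private data. */
struct fake_mcu {
	char *label;
};

struct fake_mcu *fake_mcu_create(const char *label)
{
	struct fake_mcu *mcu = malloc(sizeof(*mcu));
	size_t len = strlen(label) + 1;

	if (!mcu)
		return NULL;
	mcu->label = malloc(len);
	if (!mcu->label) {
		free(mcu);
		return NULL;
	}
	memcpy(mcu->label, label, len);
	return mcu;
}

/* Returns void: there is no failure for the caller to handle. */
static void fake_gpiochip_remove(struct fake_mcu *mcu)
{
	free(mcu->label);
	mcu->label = NULL;
}

int fake_mcu_remove(struct fake_mcu *mcu)
{
	fake_gpiochip_remove(mcu);	/* no "if (ret) return ret;" needed */
	free(mcu);
	return 0;
}
```

Because the helper returns void, the caller has no error path that
could skip freeing mcu.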



Re: [PATCH V2] powerpc/perf: Enable PMU counters post partition migration if PMU is active

2021-10-21 Thread Nicholas Piggin
Excerpts from Athira Rajeev's message of July 11, 2021 10:25 pm:
> During Live Partition Migration (LPM), it is observed that perf
> counter values reports zero post migration completion. However
> 'perf stat' with workload continues to show counts post migration
> since PMU gets disabled/enabled during sched switches. But incase
> of system/cpu wide monitoring, zero counts were reported with 'perf
> stat' after migration completion.
> 
> Example:
>  ./perf stat -e r1001e -I 1000
>            time             counts unit events
>      1.001010437         22,137,414      r1001e
>      2.002495447         15,455,821      r1001e
> <<>> As seen in next below logs, the counter values shows zero
> after migration is completed.
> <<>>
>     86.142535370    129,392,333,440      r1001e
>     87.144714617                  0      r1001e
>     88.146526636                  0      r1001e
>     89.148085029                  0      r1001e
> 
> Here PMU is enabled during start of perf session and counter
> values are read at intervals. Counters are only disabled at the
> end of session. The powerpc mobility code presently does not handle
> disabling and enabling back of PMU counters during partition
> migration. Also since the PMU register values are not saved/restored
> during migration, PMU registers like Monitor Mode Control Register 0
> (MMCR0), Monitor Mode Control Register 1 (MMCR1) will not contain
> the value it was programmed with. Hence PMU counters will not be
> enabled correctly post migration.
> 
> Fix this in mobility code by handling disabling and enabling of
> PMU in all cpu's before and after migration. Patch introduces two
> functions 'mobility_pmu_disable' and 'mobility_pmu_enable'.
> mobility_pmu_disable() is called before the processor threads goes
> to suspend state so as to disable the PMU counters. And disable is
> done only if there are any active events running on that cpu.
> mobility_pmu_enable() is called after the processor threads are
> back online to enable back the PMU counters.
> 
> Since the performance Monitor counters ( PMCs) are not
> saved/restored during LPM, results in PMC value being zero and the
> 'event->hw.prev_count' being non-zero value. This causes problem

Interesting. Are they defined to not be migrated, or may not be 
migrated?

I wonder what QEMU migration does with PMU registers.

> during updation of event->count since we always accumulate
> (event->hw.prev_count - PMC value) in event->count.  If
> event->hw.prev_count is greater PMC value, event->count becomes
> negative. Fix this by re-initialising 'prev_count' also for all
> events while enabling back the events. A new variable 'migrate' is
> introduced in 'struct cpu_hw_event' to achieve this for LPM cases
> in power_pmu_enable. Use the 'migrate' value to clear the PMC
> index (stored in event->hw.idx) for all events so that event
> count settings will get re-initialised correctly.
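
A simplified model of the accounting problem described in the quoted
paragraph, for illustration only. It assumes the delta is taken as the
new counter reading minus the previous one and ignores the counter
width masking the real code has to deal with; struct sketch_event and
both functions are hypothetical names, not the ones in core-book3s.c.

```c
#include <stdint.h>

/* Hypothetical, stripped-down view of one perf event. */
struct sketch_event {
	uint64_t prev_count;	/* last raw counter value seen */
	int64_t count;		/* accumulated event count */
};

/* Fold the difference since the last reading into the running total. */
void sketch_event_update(struct sketch_event *ev, uint64_t pmc)
{
	int64_t delta = (int64_t)pmc - (int64_t)ev->prev_count;

	ev->count += delta;
	ev->prev_count = pmc;
}

/* What re-initialising prev_count after migration amounts to. */
void sketch_event_reinit(struct sketch_event *ev, uint64_t pmc)
{
	ev->prev_count = pmc;
}
```

If the hardware counter comes back as zero after migration while
prev_count still holds the old value, the next update subtracts the
stale value and the total drops; resetting prev_count first keeps the
total monotonic.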
> 
> Signed-off-by: Athira Rajeev 
> [ Fixed compilation error reported by kernel test robot ]
> Reported-by: kernel test robot 
> ---
> Change from v1 -> v2:
>  - Moved the mobility_pmu_enable and mobility_pmu_disable
>declarations under CONFIG_PPC_PERF_CTRS in rtas.h.
>Also included 'asm/rtas.h' in core-book3s to fix the
>compilation warning reported by kernel test robot.
> 
>  arch/powerpc/include/asm/rtas.h   |  8 ++
>  arch/powerpc/perf/core-book3s.c   | 44 
> ---
>  arch/powerpc/platforms/pseries/mobility.c |  4 +++
>  3 files changed, 53 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index 9dc97d2..cea72d7 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -380,5 +380,13 @@ static inline void rtas_initialize(void) { }
>  static inline void read_24x7_sys_info(void) { }
>  #endif
>  
> +#ifdef CONFIG_PPC_PERF_CTRS
> +void mobility_pmu_disable(void);
> +void mobility_pmu_enable(void);
> +#else
> +static inline void mobility_pmu_disable(void) { }
> +static inline void mobility_pmu_enable(void) { }
> +#endif
> +
>  #endif /* __KERNEL__ */
>  #endif /* _POWERPC_RTAS_H */

It's not implemented in rtas, maybe consider putting this into a perf 
header?

> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index bb0ee71..90da7fa 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #ifdef CONFIG_PPC64
>  #include "internal.h"
> @@ -58,6 +59,7 @@ struct cpu_hw_events {
>  
>   /* Store the PMC values */
>   unsigned long pmcs[MAX_HWEVENTS];
> + int migrate;
>  };
>  
>  static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events);
> @@ -1335,6 +1337,22 @@ static void power_pmu_disable(struct pmu *pmu)
>  }
>  
>  /*
> + * Called from powerpc mobility code
> + * before migration to disable counters
> + * if the PMU is 

Re: [PATCH v11 0/3] make hvc pass dma capable memory to its backend

2021-10-21 Thread Xianting Tian

I am very glad to get this reply :)

Thank you and the other experts for your kind help and guidance; I learned a lot from it.

On 2021/10/21 at 4:35 PM, Greg KH wrote:

On Fri, Oct 15, 2021 at 10:46:55AM +0800, Xianting Tian wrote:

Dear all,

This patch series make hvc framework pass DMA capable memory to
put_chars() of hvc backend(eg, virtio-console), and revert commit
c4baad5029 ("virtio-console: avoid DMA from stack”)

Thanks for sticking with this, looks much better now, all now queued up.

greg k-h


Re: [PATCH v11 0/3] make hvc pass dma capable memory to its backend

2021-10-21 Thread Greg KH
On Fri, Oct 15, 2021 at 10:46:55AM +0800, Xianting Tian wrote:
> Dear all,
> 
> This patch series make hvc framework pass DMA capable memory to
> put_chars() of hvc backend(eg, virtio-console), and revert commit
> c4baad5029 ("virtio-console: avoid DMA from stack”)

Thanks for sticking with this, looks much better now, all now queued up.

greg k-h


Re: [PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)

2021-10-21 Thread Philippe Mathieu-Daudé
On Wed, Oct 20, 2021 at 11:52 PM Eric W. Biederman
 wrote:
>
>
> Now that force_fatal_sig exists it is unnecessary and a bit confusing
> to use force_sigsegv in cases where the simpler force_fatal_sig is
> wanted.  So change every instance we can to make the code clearer.
>
> Signed-off-by: "Eric W. Biederman" 
> ---
>  arch/arc/kernel/process.c   | 2 +-
>  arch/m68k/kernel/traps.c| 2 +-
>  arch/powerpc/kernel/signal_32.c | 2 +-
>  arch/powerpc/kernel/signal_64.c | 4 ++--
>  arch/s390/kernel/traps.c| 2 +-
>  arch/um/kernel/trap.c   | 2 +-
>  arch/x86/kernel/vm86_32.c   | 2 +-
>  fs/exec.c   | 2 +-
>  8 files changed, 9 insertions(+), 9 deletions(-)

Reviewed-by: Philippe Mathieu-Daudé 


Re: [PATCH 21/20] signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)

2021-10-21 Thread Geert Uytterhoeven
Hi Eric,

Patch 21/20?

On Wed, Oct 20, 2021 at 11:52 PM Eric W. Biederman
 wrote:
> Now that force_fatal_sig exists it is unnecessary and a bit confusing
> to use force_sigsegv in cases where the simpler force_fatal_sig is
> wanted.  So change every instance we can to make the code clearer.
>
> Signed-off-by: "Eric W. Biederman" 

>  arch/m68k/kernel/traps.c| 2 +-

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH v2 16/16] powerpc/microwatt: Don't select the hash MMU code

2021-10-21 Thread Nicholas Piggin
Excerpts from Joel Stanley's message of October 21, 2021 3:19 pm:
> On Thu, 21 Oct 2021 at 04:04, Nicholas Piggin  wrote:
>>
>> Microwatt is radix-only, so it does not require hash MMU support.
>>
>> This saves 20kB compressed dtbImage and 56kB vmlinux size.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/configs/microwatt_defconfig | 1 -
>>  arch/powerpc/platforms/microwatt/Kconfig | 1 -
>>  2 files changed, 2 deletions(-)
>>
>> diff --git a/arch/powerpc/configs/microwatt_defconfig b/arch/powerpc/configs/microwatt_defconfig
>> index 6e62966730d3..7c8eb29d8afe 100644
>> --- a/arch/powerpc/configs/microwatt_defconfig
>> +++ b/arch/powerpc/configs/microwatt_defconfig
>> @@ -27,7 +27,6 @@ CONFIG_PPC_MICROWATT=y
>>  # CONFIG_PPC_OF_BOOT_TRAMPOLINE is not set
>>  CONFIG_CPU_FREQ=y
>>  CONFIG_HZ_100=y
>> -# CONFIG_PPC_MEM_KEYS is not set
>>  # CONFIG_SECCOMP is not set
>>  # CONFIG_MQ_IOSCHED_KYBER is not set
>>  # CONFIG_COREDUMP is not set
> 
> We still end up with CONFIG_PPC_64S_HASH_MMU=y in the config as it
> defaults to y.

If you make microwatt_defconfig? Hm, IIRC this came from savedefconfig 
after unselecting hash mmu so I'm not sure why that doesn't work.
> 
> We should disable in the defconfig it so your new changes are tested
> by that defconfig:
> 
> +# CONFIG_PPC_64S_HASH_MMU is not set
> 
> I boot tested your series on Microwatt with microwatt_defconfig (with
> and without that option set) and ppc64le_defconfig.

Nice.

Thanks,
Nick

> 
> Cheers,
> 
> Joel
> 
>> diff --git a/arch/powerpc/platforms/microwatt/Kconfig b/arch/powerpc/platforms/microwatt/Kconfig
>> index 823192e9d38a..5e320f49583a 100644
>> --- a/arch/powerpc/platforms/microwatt/Kconfig
>> +++ b/arch/powerpc/platforms/microwatt/Kconfig
>> @@ -5,7 +5,6 @@ config PPC_MICROWATT
>> select PPC_XICS
>> select PPC_ICS_NATIVE
>> select PPC_ICP_NATIVE
>> -   select PPC_HASH_MMU_NATIVE if PPC_64S_HASH_MMU
>> select PPC_UDBG_16550
>> select ARCH_RANDOM
>> help
>> --
>> 2.23.0
>>
> 


Re: [PATCH v2 13/16] powerpc/64s: Move hash MMU code under a new Kconfig name

2021-10-21 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 21, 2021 3:43 pm:
> 
> 
> Le 21/10/2021 à 05:54, Nicholas Piggin a écrit :
>> Introduce a new option CONFIG_PPC_64S_HASH_MMU, and make 64s hash
>> code depend on it.
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>   arch/powerpc/Kconfig  |  2 +-
>>   arch/powerpc/include/asm/book3s/64/mmu.h  | 19 +--
>>   .../include/asm/book3s/64/tlbflush-hash.h |  7 
>>   arch/powerpc/include/asm/book3s/pgtable.h |  4 +++
>>   arch/powerpc/include/asm/mmu.h| 16 +++--
>>   arch/powerpc/include/asm/mmu_context.h|  2 ++
>>   arch/powerpc/include/asm/paca.h   |  8 +
>>   arch/powerpc/kernel/asm-offsets.c |  2 ++
>>   arch/powerpc/kernel/dt_cpu_ftrs.c | 14 +---
>>   arch/powerpc/kernel/entry_64.S|  4 +--
>>   arch/powerpc/kernel/exceptions-64s.S  | 16 +
>>   arch/powerpc/kernel/mce.c |  2 +-
>>   arch/powerpc/kernel/mce_power.c   | 10 --
>>   arch/powerpc/kernel/paca.c| 18 --
>>   arch/powerpc/kernel/process.c | 13 +++
>>   arch/powerpc/kernel/prom.c|  2 ++
>>   arch/powerpc/kernel/setup_64.c|  5 +++
>>   arch/powerpc/kexec/core_64.c  |  4 +--
>>   arch/powerpc/kexec/ranges.c   |  4 +++
>>   arch/powerpc/mm/book3s64/Makefile | 15 
>>   arch/powerpc/mm/book3s64/hugetlbpage.c|  2 ++
>>   arch/powerpc/mm/book3s64/mmu_context.c| 34 +++
>>   arch/powerpc/mm/book3s64/radix_pgtable.c  |  4 +++
>>   arch/powerpc/mm/copro_fault.c |  2 ++
>>   arch/powerpc/mm/ioremap.c | 13 ---
>>   arch/powerpc/mm/pgtable.c | 10 --
>>   arch/powerpc/mm/ptdump/Makefile   |  2 +-
>>   arch/powerpc/platforms/Kconfig.cputype|  4 +++
>>   arch/powerpc/platforms/powernv/idle.c |  2 ++
>>   arch/powerpc/platforms/powernv/setup.c|  2 ++
>>   arch/powerpc/platforms/pseries/lpar.c | 11 --
>>   arch/powerpc/platforms/pseries/lparcfg.c  |  2 +-
>>   arch/powerpc/platforms/pseries/mobility.c |  6 
>>   arch/powerpc/platforms/pseries/ras.c  |  2 ++
>>   arch/powerpc/platforms/pseries/reconfig.c |  2 ++
>>   arch/powerpc/platforms/pseries/setup.c|  6 ++--
>>   arch/powerpc/xmon/xmon.c  |  8 +++--
>>   drivers/misc/lkdtm/Makefile   |  2 +-
>>   drivers/misc/lkdtm/core.c |  2 +-
>>   39 files changed, 219 insertions(+), 64 deletions(-)
> 
> I'm still uncomfortable with the quantity of files impacted in that commit.

Hmm. Splitting it into N partial patches that have the same result
doesn't seem better to me. There are a few other little things that would
be better split out, but the size of the patch will not shrink much.

>> diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h b/arch/powerpc/include/asm/book3s/64/mmu.h
>> index c02f42d1031e..d94ebae386b6 100644
>> --- a/arch/powerpc/include/asm/book3s/64/mmu.h
>> +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> 
>> @@ -193,8 +198,15 @@ static inline struct subpage_prot_table *mm_ctx_subpage_prot(mm_context_t *ctx)
>>   extern int mmu_linear_psize;
>>   extern int mmu_virtual_psize;
>>   extern int mmu_vmalloc_psize;
>> -extern int mmu_vmemmap_psize;
>>   extern int mmu_io_psize;
>> +#else /* CONFIG_PPC_64S_HASH_MMU */
>> +#ifdef CONFIG_PPC_64K_PAGES
> 
> Avoid nested #ifdefs and do
> 
> #elif defined(CONFIG_PPC_64K_PAGES)

I sort of like the nesting because this is the !HASH block. But I don't 
care much so I can do it your way if you prefer.

>> @@ -223,6 +226,13 @@ enum {
>>   #ifdef CONFIG_E500
>>   #define MMU_FTRS_ALWAYS        MMU_FTR_TYPE_FSL_E
>>   #endif
>> +#ifdef CONFIG_PPC_BOOK3S_64
> 
> No need of this CONFIG_PPC_BOOK3S_64 ifdef, it is necessarily defined if 
> either CONFIG_PPC_RADIX_MMU or CONFIG_PPC_64S_HASH_MMU is defined

Good point.

>> +#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU)
>> +#define MMU_FTRS_ALWAYS MMU_FTR_TYPE_RADIX
>> +#elif !defined(CONFIG_PPC_RADIX_MMU) && defined(CONFIG_PPC_64S_HASH_MMU)
>> +#define MMU_FTRS_ALWAYS MMU_FTR_HPTE_TABLE
>> +#endif
>> +#endif
>>   
>>   #ifndef MMU_FTRS_ALWAYS
>>   #define MMU_FTRS_ALWAYS        0
> 
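
A standalone sketch of what the MMU_FTRS_ALWAYS selection quoted above
buys, assuming a radix-only build such as Microwatt. The feature bit
values and sketch_mmu_has_feature() are made up for this illustration,
and the kernel's feature test has more to it than shown here.

```c
#include <stdbool.h>

/* Pretend this is a radix-only configuration. */
#define CONFIG_PPC_RADIX_MMU 1
/* CONFIG_PPC_64S_HASH_MMU is intentionally left undefined. */

/* Bit values are invented for this sketch. */
#define MMU_FTR_HPTE_TABLE	0x1UL
#define MMU_FTR_TYPE_RADIX	0x2UL

#if defined(CONFIG_PPC_RADIX_MMU) && !defined(CONFIG_PPC_64S_HASH_MMU)
#define MMU_FTRS_ALWAYS		MMU_FTR_TYPE_RADIX
#elif !defined(CONFIG_PPC_RADIX_MMU) && defined(CONFIG_PPC_64S_HASH_MMU)
#define MMU_FTRS_ALWAYS		MMU_FTR_HPTE_TABLE
#endif

#ifndef MMU_FTRS_ALWAYS
#define MMU_FTRS_ALWAYS		0
#endif

/*
 * Features in MMU_FTRS_ALWAYS are answered at compile time; anything
 * else falls back to the runtime feature word.
 */
bool sketch_mmu_has_feature(unsigned long runtime_features,
			    unsigned long feature)
{
	if (MMU_FTRS_ALWAYS & feature)
		return true;
	return (runtime_features & feature) != 0;
}
```

When only one MMU type is configured, the test for that type becomes a
constant, which lets the compiler drop the branches for the other one.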
>> diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
>> index 046c99e31d01..65b695e9401e 100644
>> --- a/arch/powerpc/kernel/exceptions-64s.S
>> +++ b/arch/powerpc/kernel/exceptions-64s.S
>> @@ -1369,11 +1369,15 @@ EXC_COMMON_BEGIN(data_access_common)
>>  addi    r3,r1,STACK_FRAME_OVERHEAD
>>  andis.  r0,r4,DSISR_DABRMATCH@h
>>  bne-    1f
>> +#ifdef CONFIG_PPC_64S_HASH_MMU
>>   BEGIN_MMU_FTR_SECTION
>>  bl  do_hash_fault
>>   

Re: [PATCH v2 12/16] powerpc/64e: remove mmu_linear_psize

2021-10-21 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 21, 2021 3:03 pm:
> 
> 
> Le 21/10/2021 à 05:54, Nicholas Piggin a écrit :
>> mmu_linear_psize is only set at boot once on 64e, is not necessarily
>> the correct size of the linear map pages, and is never used anywhere
>> except memremap_compat_align.
>> 
>> Remove mmu_linear_psize and hard code the 1GB value instead in
>> memremap_compat_align.
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>   arch/powerpc/mm/ioremap.c| 6 +-
>>   arch/powerpc/mm/nohash/tlb.c | 9 -
>>   2 files changed, 5 insertions(+), 10 deletions(-)
>> 
>> diff --git a/arch/powerpc/mm/ioremap.c b/arch/powerpc/mm/ioremap.c
>> index 57342154d2b0..730c3bbe4759 100644
>> --- a/arch/powerpc/mm/ioremap.c
>> +++ b/arch/powerpc/mm/ioremap.c
>> @@ -109,12 +109,16 @@ void __iomem *do_ioremap(phys_addr_t pa, phys_addr_t 
>> offset, unsigned long size,
>>   */
>>   unsigned long memremap_compat_align(void)
>>   {
>> +#ifdef CONFIG_PPC_BOOK3E_64
> 
> I don't think this function really belongs in ioremap.c
> 
> You could avoid the #ifdef by moving it into:
> 
> arch/powerpc/mm/nohash/book3e_pgtable.c
> 
> and
> 
> arch/powerpc/mm/book3s64/pgtable.c

Yeah that might work.

Thanks,
Nick


Re: [PATCH v2 11/16] powerpc/64: pcpu setup avoid reading mmu_linear_psize on 64e or radix

2021-10-21 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of October 21, 2021 2:52 pm:
> 
> 
> On 21/10/2021 at 05:54, Nicholas Piggin wrote:
>> Radix never sets mmu_linear_psize so it's always 4K, which causes pcpu
>> atom_size to always be PAGE_SIZE. 64e sets it to 1GB always.
>> 
>> Give these platforms explicit paths that state what value they set
>> atom_size to.
>> 
>> Signed-off-by: Nicholas Piggin 
>> ---
>>   arch/powerpc/kernel/setup_64.c | 21 +++--
>>   1 file changed, 15 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
>> index eaa79a0996d1..7d4bcbc3124e 100644
>> --- a/arch/powerpc/kernel/setup_64.c
>> +++ b/arch/powerpc/kernel/setup_64.c
>> @@ -880,14 +880,23 @@ void __init setup_per_cpu_areas(void)
>>  	int rc = -EINVAL;
>>   
>>  	/*
>> -	 * Linear mapping is one of 4K, 1M and 16M.  For 4K, no need
>> -	 * to group units.  For larger mappings, use 1M atom which
>> -	 * should be large enough to contain a number of units.
>> +	 * BookE and BookS radix are historical values and should be revisited.
>>  	 */
>> -	if (mmu_linear_psize == MMU_PAGE_4K)
>> -		atom_size = PAGE_SIZE;
>> -	else
>> +	if (IS_ENABLED(CONFIG_PPC_BOOK3E)) {
>>  		atom_size = 1 << 20;
>> +	} else if (radix_enabled()) {
>> +		atom_size = PAGE_SIZE;
>> +	} else {
> 
> You can keep a single level (and remove the brackets)
> 
> 	if (IS_ENABLED(CONFIG_PPC_BOOK3E))
> 		atom_size = SZ_1M;
> 	else if (radix_enabled())
> 		atom_size = PAGE_SIZE;
> 	else if (mmu_linear_psize == MMU_PAGE_4K)
> 		atom_size = PAGE_SIZE;
> 	else
> 		atom_size = SZ_1M;

I could, but the comment and logic below apply to BookS hash, so I'm 
happier to put them in their own block. Radix might later also inherit a
comment and size from x86-64.

>> +		/*
>> +		 * Linear mapping is one of 4K, 1M and 16M.  For 4K, no need
>> +		 * to group units.  For larger mappings, use 1M atom which
>> +		 * should be large enough to contain a number of units.
>> +		 */
>> +		if (mmu_linear_psize == MMU_PAGE_4K)
>> +			atom_size = PAGE_SIZE;
>> +		else
>> +			atom_size = 1 << 20;
> 
> Use SZ_1M instead of hardcoding.

Sure.
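
For reference, the selection being discussed can be sketched as a standalone userspace function. This is only an illustration under assumptions: demo_pick_atom_size(), enum demo_psize and DEMO_SZ_1M are hypothetical names, and the boolean parameters stand in for IS_ENABLED(CONFIG_PPC_BOOK3E) and radix_enabled().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SZ_1M ((size_t)1 << 20)	/* stand-in for the kernel's SZ_1M */

/* Hypothetical stand-in for the linear mapping page size index. */
enum demo_psize { DEMO_PAGE_4K, DEMO_PAGE_1M, DEMO_PAGE_16M };

size_t demo_pick_atom_size(bool book3e, bool radix,
			   enum demo_psize linear_psize, size_t page_size)
{
	assert(page_size != 0);

	/* BookE and BookS radix keep their historical values. */
	if (book3e)
		return DEMO_SZ_1M;
	if (radix)
		return page_size;

	/*
	 * BookS hash: with a 4K linear mapping there is no need to group
	 * units; for larger mappings use a 1M atom.
	 */
	if (linear_psize == DEMO_PAGE_4K)
		return page_size;
	return DEMO_SZ_1M;
}
```

Keeping the hash case in its own trailing block mirrors the structure of the patch, so the comment about linear mapping sizes sits next to the only code that reads the linear page size.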

Thanks,
Nick



[next-20211019][PPC] kernel panics with lspci -vvnn command

2021-10-21 Thread Abdul Haleem

Greetings,

Today's linux-next kernel panics when the lspci -vvnn command is executed 
on my powerpc machine.


# lspci -vvnn
0012:01:00.0 Fibre Channel [0c04]: QLogic Corp. ISP2722-based 16/32Gb 
Fibre Channel to PCIe Adapter [1077:2261] (rev 01)

    Subsystem: IBM Device [1014:0650]
    Physical Slot: U78D8.ND0.FGD004S-P0-C2-C0
    Device tree node: 
/sys/firmware/devicetree/base/pci@8002012/fibre-channel@0
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ 
Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
SERR- 
    Latency: 0, Cache Line Size: 128 bytes
    Interrupt: pin A routed to IRQ 48
    NUMA node: 2
    IOMMU group: 0
    Region 0: Memory at 4285000 (64-bit, prefetchable) [size=4K]
    Region 2: Memory at 4282000 (64-bit, prefetchable) [size=8K]
    Region 4: Memory at 410 (64-bit, prefetchable) [size=1M]
    Expansion ROM at 424 [disabled] [size=256K]
    Capabilities: [44] Power Management version 3
    Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
PME(D0-,D1-,D2-,D3hot-,D3cold-)

    Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [4c] Express (v2) Endpoint, MSI 00
    DevCap:    MaxPayload 2048 bytes, PhantFunc 0, Latency L0s 
<4us, L1 <1us
        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ 
SlotPowerLimit 0.000W

    DevCtl:    CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
        MaxPayload 512 bytes, MaxReadReq 4096 bytes
    DevSta:    CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
TransPend-
    LnkCap:    Port #0, Speed 8GT/s, Width x8, ASPM L0s L1, Exit 
Latency L0s <2us, L1 <2us

        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
    LnkCtl:    ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
    LnkSta:    Speed 8GT/s (ok), Width x8 (ok)
        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
    DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR-
         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- 
EETLPPrefix-
         EmergencyPowerReduction Not Supported, 
EmergencyPowerReductionInit-

         FRS- TPHComp- ExtTPHComp-
         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
    DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+ LTR- 
OBFF Disabled,

         AtomicOpsCtl: ReqEn-
    LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 
2Retimers- DRS-

    LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
         Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-

         Compliance De-emphasis: -6dB
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ 
EqualizationPhase1+
         EqualizationPhase2+ EqualizationPhase3+ 
LinkEqualizationRequest-

         Retimer- 2Retimers- CrosslinkRes: unsupported
    Capabilities: [88] Vital Product Data
BUG: Kernel NULL pointer dereference on read at 0x80a0
BUG: Unable to handle kernel data access on read at 0x394940920078
BUG: Unable to handle kernel data access on read at 0x694a0002e94d00f0
Faulting instruction address: 0xc06f4498
Faulting instruction address: 0xc01d3680
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
Faulting instruction address: 0xc01abcf0
Modules linked in:
Thread overran stack, or stack corrupted
 rpadlpar_io rpaphp nfnetlink tcp_diag udp_diag inet_diag unix_diag 
af_packet_diag netlink_diag bonding rfkill sunrpc raid456 async_raid6_recov
async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c 
pseries_rng xts vmx_crypto gf128mul binfmt_misc sch_fq_codel ip_tables ext4
mbcache jbd2 dm_service_time sd_mod sg qla2xxx ibmvfc ibmveth nvme_fc 
nvme_fabrics nvme_core t10_pi scsi_transport_fc dm_multipath dm_mirror

dm_region_hash dm_log dm_mod fuse
CPU: 24 PID: 0 Comm: swapper/24 Kdump: loaded Not tainted 
5.15.0-rc5-next-20211012-autotest #1

NIP:  c06f4498 LR: c06f9c18 CTR: c0026e60
REGS: c6797560 TRAP: 0380   Not tainted 
(5.15.0-rc5-next-20211012-autotest)

MSR:  80009033   CR: 42000824  XER: 
CFAR: c06f440c IRQMASK: 1
GPR00: c022434c c6797800 c19b2500 c0117db0ac28
GPR04: c0117db0a520  394940920078 0001
GPR08: c00063bd3cf0 c073a7a8 892100602e3f 7265677368657265
GPR12: c0026e60 c0117fb4be80  1eef2b00
GPR16:    
GPR20:   0003 0001
GPR24: 638695346493 0002 0003 c0117db0a480
GPR28: c0117db0a480  c0117db0a520 c0117db0ac28
NIP