Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

2021-04-13 Thread Vincent Guittot
On Mon, 12 Apr 2021 at 17:24, Mel Gorman  wrote:
>
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and 
> > > > the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have 
> > > > any
> > > > suggestions on how we could handle the same in the core scheduler?
> >
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> >
>
> From topology.c
>
> SD_SHARE_PKG_RESOURCES - describes shared caches
>
> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
>
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.

But the SMT4 level is still present here, with select_idle_core taking
care of the spreading

>
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> >
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re-adding another layer in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_branches.
> >
>
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.

What I mean is that you should directly do for_each_sched_domain in
the fast path, because that is what you are proposing in the end. It no
longer looks like a fast path but like a traditional load balance

>
> But still, your comment reminded me that different architectures have
> different requirements
>
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
> spillover to L3 when selecting an idle CPU.
>
> X86 varies, it might want the Power10 approach for some families and prefer
> L3 spilling over to a CPU on the same node in others.
>
> S390 cares about something called books and drawers although I've no
> idea what it means as such, or whether it has any preferences on
> search order.
>
> ARM has similar requirements again according to "scheduler: expose the
> topology of clusters and add cluster scheduler" and that one *does*
> add another domain.
>
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
>
> The only commonality I thought might exist is that architectures may
> like to define which domain to search first for an idle CPU and which
> to search second. Alternatively, architectures could specify a domain to
> search primarily but also search the next domain in the hierarchy if
> search depth permits. The default would be the existing behaviour --
> search CPUs sharing a last-level-cache.
>
> > SD_SHARE_PKG_RESOURCES should be set to the last level where we can
> > efficiently move task between CPUs at wakeup
> >
>
> The definition of "efficiently" varies. Moving tasks between CPUs sharing
> a cache is most efficient but moving the task to a CPU that at least has
> local memory channels is a reasonable option if there are no idle CPUs
> sharing cache and preferable to stacking.

That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me.
This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to
be considered in the fast path
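
For reference, the fast path today derives that cpumask from the highest
domain with SD_SHARE_PKG_RESOURCES, roughly like this (trimmed from
select_idle_sibling() in kernel/sched/fair.c):

	struct sched_domain *sd;
	int i;

	/* sd_llc caches the highest domain with SD_SHARE_PKG_RESOURCES */
	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (!sd)
		return target;

	i = select_idle_cpu(p, sd, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	return target;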

>
> > > allows within the node with the LLC CPUs masked out. While there would be
> > > a latency hit because cache is not shared, it would still be a CPU local
> > > to memory that is idle. That would potentially be beneficial on Zen*
> > > as well without having to introduce new domains in the topology hierarchy.
> >
> > What is the current sched_domain topology description for zen ?
> >
>
> The cache and NUMA topologies differ slightly between each generation
> of Zen. The common pattern is that a single NUMA node can have multiple
> L3 caches and at one point I thought it might be reasonable to allow
> spillover to select a local idle CPU instead of stacking multiple tasks
> on a CPU sharing cache. I never got as far as thinking how it could be
> done in a way that multiple architecture
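
For concreteness, a minimal sketch of the two-pass search discussed in
this thread (not actual kernel code: the function and its depth budget
are illustrative; for_each_cpu_wrap(), available_idle_cpu(),
sched_domain_span() and the per-cpu select_idle_mask are existing
kernel primitives):

static int select_idle_cpu_two_pass(struct task_struct *p,
				    struct sched_domain *sd_llc,
				    struct sched_domain *sd_node,
				    int target, int depth)
{
	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
	int cpu;

	/* Pass 1: CPUs sharing the last-level cache with the target. */
	cpumask_and(cpus, sched_domain_span(sd_llc), p->cpus_ptr);
	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu))
			return cpu;
		if (--depth <= 0)
			return -1;
	}

	/* Pass 2: the rest of the node, with the LLC CPUs masked out. */
	cpumask_andnot(cpus, sched_domain_span(sd_node),
		       sched_domain_span(sd_llc));
	cpumask_and(cpus, cpus, p->cpus_ptr);
	for_each_cpu_wrap(cpu, cpus, target) {
		if (available_idle_cpu(cpu))
			return cpu;
		if (--depth <= 0)
			return -1;
	}

	return -1;
}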

Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Alexey Kardashevskiy




On 13/04/2021 15:49, Leonardo Bras wrote:

Thanks for the feedback!

On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)



Please, forward declaration or a separate patch; this creates
unnecessary noise to the actual change.



Sure, done!




+   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, win_addr,
+ 1UL << len, page_shift, 0, &iommu_table_lpar_multi_ops);
+   iommu_init_table(tbl, pci->phb->node, 0, 0);



It is 0,0 only if win_addr>0 which is not the QEMU case.



Oh, ok.
I previously thought it was ok to use 0,0 here as any other usage in
this file was also 0,0.

What should I use to get the correct parameters? Is using the previous
tbl->it_reserved_start and tbl->it_reserved_end enough?


depends on whether you carry reserved start/end even if they are outside 
of the dma window.



--
Alexey


Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Leonardo Bras
On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:
> 
> On 13/04/2021 15:49, Leonardo Bras wrote:
> > Thanks for the feedback!
> > 
> > On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:
> > > > -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> > > > +static phys_addr_t ddw_memory_hotplug_max(void)
> > > 
> > > 
> > > Please, forward declaration or a separate patch; this creates
> > > unnecessary noise to the actual change.
> > > 
> > 
> > Sure, done!
> > 
> > > 
> > > > +   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, win_addr,
> > > > + 1UL << len, page_shift, 0, &iommu_table_lpar_multi_ops);
> > > > +   iommu_init_table(tbl, pci->phb->node, 0, 0);
> > > 
> > > 
> > > It is 0,0 only if win_addr>0 which is not the QEMU case.
> > > 
> > 
> > Oh, ok.
> > I previously thought it was ok to use 0,0 here as any other usage in
> > this file was also 0,0.
> > 
> > What should I use to get the correct parameters? Is using the previous
> > tbl->it_reserved_start and tbl->it_reserved_end enough?
> 
> depends on whether you carry reserved start/end even if they are outside 
> of the dma window.
> 

Oh, that makes sense.
On a previous patch (5/14 IIRC), I changed the behavior to only store
the valid range on tbl, but now I understand why it's important to
store the raw value.

Ok, I will change it back so the reserved range stays in tbl even if it
does not intersect with the DMA window. This way I can reuse the values
in case of indirect mapping with DDW.

Is that ok? Are the reserved values supposed to stay the same after
changing from the default DMA window to DDW?

Best regards,
Leonardo Bras



Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Alexey Kardashevskiy




On 13/04/2021 17:33, Leonardo Bras wrote:

On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:


On 13/04/2021 15:49, Leonardo Bras wrote:

Thanks for the feedback!

On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)



Please, forward declaration or a separate patch; this creates
unnecessary noise to the actual change.



Sure, done!




+   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, win_addr,
+ 1UL << len, page_shift, 0, &iommu_table_lpar_multi_ops);
+   iommu_init_table(tbl, pci->phb->node, 0, 0);



It is 0,0 only if win_addr>0 which is not the QEMU case.



Oh, ok.
I previously thought it was ok to use 0,0 here as any other usage in
this file was also 0,0.

What should I use to get the correct parameters? Is using the previous
tbl->it_reserved_start and tbl->it_reserved_end enough?


depends on whether you carry reserved start/end even if they are outside
of the dma window.



Oh, that makes sense.
On a previous patch (5/14 IIRC), I changed the behavior to only store
the valid range on tbl, but now I understand why it's important to
store the raw value.

Ok, I will change it back so the reserved range stays in tbl even if it
does not intersect with the DMA window. This way I can reuse the values
in case of indirect mapping with DDW.

Is that ok? Are the reserved values supposed to stay the same after
changing from the default DMA window to DDW?


I added them to know what bits in it_map to ignore when checking if
there is any active user of the table. If you have non-zero reserved
start/end but they do not affect it_map, then it is a rather weird way to
carry reserved start/end from DDW to no-DDW. Maybe do not set these at
all for DDW with window start at 1<<59, and when going back to no-DDW (or
if DDW starts at 0) just set them from MMIO32, just as they are
initialized in the first place.
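
A sketch of that suggestion (iommu_init_table()'s last two arguments are
the reserved range; mmio32_start and win_size are illustrative names and
the MMIO32 handling is modelled loosely on the powernv side, so this is
not the actual patch):

	unsigned long res_start = 0, res_end = 0;

	/* A DDW starting at 1ULL << 59 cannot overlap the 32-bit MMIO
	 * hole, so nothing needs to be reserved in it_map for it. */
	if (win_addr < SZ_4G) {
		res_start = mmio32_start >> tbl->it_page_shift;
		res_end = min_t(u64, win_size, SZ_4G) >> tbl->it_page_shift;
	}
	iommu_init_table(tbl, pci->phb->node, res_start, res_end);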




--
Alexey


Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Leonardo Bras
On Tue, 2021-04-13 at 17:41 +1000, Alexey Kardashevskiy wrote:
> 
> On 13/04/2021 17:33, Leonardo Bras wrote:
> > On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:
> > > 
> > > On 13/04/2021 15:49, Leonardo Bras wrote:
> > > > Thanks for the feedback!
> > > > 
> > > > On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:
> > > > > > -static bool find_existing_ddw(struct device_node *pdn, u64 
> > > > > > *dma_addr)
> > > > > > +static phys_addr_t ddw_memory_hotplug_max(void)
> > > > > 
> > > > > 
> > > > > Please, forward declaration or a separate patch; this creates
> > > > > unnecessary noise to the actual change.
> > > > > 
> > > > 
> > > > Sure, done!
> > > > 
> > > > > 
> > > > > > +   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, win_addr,
> > > > > > + 1UL << len, page_shift, 0, &iommu_table_lpar_multi_ops);
> > > > > > +   iommu_init_table(tbl, pci->phb->node, 0, 0);
> > > > > 
> > > > > 
> > > > > It is 0,0 only if win_addr>0 which is not the QEMU case.
> > > > > 
> > > > 
> > > > Oh, ok.
> > > > I previously thought it was ok to use 0,0 here as any other usage in
> > > > this file was also 0,0.
> > > > 
> > > > What should I use to get the correct parameters? Is using the previous
> > > > tbl->it_reserved_start and tbl->it_reserved_end enough?
> > > 
> > > depends on whether you carry reserved start/end even if they are outside
> > > of the dma window.
> > > 
> > 
> > Oh, that makes sense.
> > On a previous patch (5/14 IIRC), I changed the behavior to only store
> > the valid range on tbl, but now I understand why it's important to
> > store the raw value.
> > 
> > Ok, I will change it back so the reserved range stays in tbl even if it
> > does not intersect with the DMA window. This way I can reuse the values
> > in case of indirect mapping with DDW.
> > 
> > Is that ok? Are the reserved values supposed to stay the same after
> > changing from the default DMA window to DDW?
> 
> I added them to know what bits in it_map to ignore when checking if 
> there is any active user of the table. If you have non-zero reserved 
> start/end but they do not affect it_map, then it is a rather weird way to 
> carry reserved start/end from DDW to no-DDW.
> 

Ok, agreed.

>  Maybe do not set these at 
> all for DDW with window start at 1<<59, and when going back to no-DDW (or 
> if DDW starts at 0) just set them from MMIO32, just as they are 
> initialized in the first place.
> 

If I get it correctly from pci_of_scan.c, MMIO32 = {0, 32MB}, is that
correct?

So, if DDW starts at any value in this range (most probably at zero),
we should remove the rest, is that correct?

Could it always use iommu_init_table(..., 0, 32MB) here, so it always
reserves any part of the DMA window that's in this range? Or may there
be other reserved value ranges?

> and when going back to no-DDW 

After iommu_init_table() there should be no failure, so it looks like
there is no 'going back to no-DDW'. Am I missing something?

Thanks for helping!

Best regards,
Leonardo Bras



[V2 PATCH 00/16] Enable VAS and NX-GZIP support on powerVM

2021-04-13 Thread Haren Myneni


This patch series enables VAS / NX-GZIP on powerVM, which allows
user space to do copy/paste with the same existing interface
that is available on powerNV.

VAS Enablement:
- Get all VAS capabilities available in the hypervisor using
  H_QUERY_VAS_CAPABILITIES. These capabilities tell the OS which
  types of features are available (credit types such as Default and
  Quality of Service (QoS)), and also give specific capabilities for
  each credit type: maximum window credits, maximum LPAR credits,
  target credits in that partition (varies from max LPAR credits
  based on DLPAR operations), whether user mode COPY/PASTE is
  supported, etc.
- Register LPAR VAS operations such as open window, get paste
  address, and close window with the current VAS user space API.
- Open window operation - use the H_ALLOCATE_VAS_WINDOW HCALL to open
  a window and the H_MODIFY_VAS_WINDOW HCALL to set up the window with
  the LPAR PID, etc.
- mmap to the paste address returned in the H_ALLOCATE_VAS_WINDOW HCALL
- To close a window, the H_DEALLOCATE_VAS_WINDOW HCALL is used in the
  hypervisor (see the user space sketch after this list).
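
A hedged user space sketch of this flow (error handling omitted;
VAS_TX_WIN_OPEN and struct vas_tx_win_open_attr come from
arch/powerpc/include/uapi/asm/vas-api.h, patch 7 in this series):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <asm/vas-api.h>

static int open_gzip_window(void **paste_addr)
{
	struct vas_tx_win_open_attr attr = { .version = 1, .vas_id = -1 };
	int fd;

	fd = open("/dev/crypto/nx-gzip", O_RDWR);
	if (fd < 0)
		return -1;
	/* triggers H_ALLOCATE_VAS_WINDOW / H_MODIFY_VAS_WINDOW */
	if (ioctl(fd, VAS_TX_WIN_OPEN, &attr) < 0)
		return -1;
	/* map the paste address returned by the hypervisor */
	*paste_addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
	return fd;	/* closing fd leads to H_DEALLOCATE_VAS_WINDOW */
}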

NX Enablement:
- Get NX capabilities from the hypervisor, which provides the maximum
  buffer length in a single GZIP request and the recommended minimum
  compression / decompression lengths.
- Register with VAS to enable the user space VAS API

Main feature differences from the powerNV implementation:
- Each VAS window will be configured with a number of credits, which
  means that many requests can be issued simultaneously on that
  window. On powerNV, 1K credits are configured per window,
  whereas on powerVM the hypervisor allows 1 credit per window
  at present.
- The hypervisor introduced 2 different types of credits: Default -
  uses the normal priority FIFO - and Quality of Service (QoS) - uses
  the high priority FIFO. On powerVM, VAS/NX HW resources are shared
  across LPARs. The total number of credits available on a system
  depends on the cores configured. We may see more credits assigned
  across the system than the NX HW resources can handle. So, to avoid
  NX HW contention, pHyp introduced QoS credits, which can be
  configured by system administrators with the HMC API. The total
  number of available default credits on an LPAR then varies based
  on the QoS credits configured.
- On powerNV, windows are allocated on a specific VAS instance
  and user space can select a VAS instance with the open window
  ioctl. Since VAS instances can be shared across partitions on
  powerVM, the hypervisor manages window allocations on different
  VAS instances. So H_ALLOCATE_VAS_WINDOW allows selection by domain
  identifiers (H_HOME_NODE_ASSOCIATIVITY values by CPU). By default
  the hypervisor selects the VAS instance closest to the CPU resources
  that the partition uses. So vas_id in the ioctl interface is ignored
  on powerVM, except vas_id=-1, which is used to allocate a window
  based on the CPU on which the process is executing. This option is
  needed for process affinity to a NUMA node.

  The existing applications that linked with libnxz should work as
  long as the job request length is restricted to
  req_max_processed_len.

  Tested the following patches on P10 successfully with test cases
  given: https://github.com/libnxz/power-gzip

  Note: The hypervisor supports user mode NX from p10 onwards. Linux
supports user mode VAS/NX on P10 only with radix page tables.

Patches 1-4:    Make the code that is needed for both powerNV and
                powerVM powerpc platform independent.
Patch 5:        Modify the vas-window struct to support both, and the
                related changes.
Patch 6:        Define HCALLs and the related VAS/NXGZIP specific
                structs.
Patch 7:        Define QoS credit flag in window open ioctl
Patch 8:        Implement Allocate, Modify and Deallocate HCALLs
Patch 9:        Retrieve VAS capabilities from the hypervisor
Patch 10:       Implement window operations and integrate with API
Patch 11:       Setup IRQ and NX fault handling
Patch 12:       Add sysfs interface to expose VAS capabilities
Patches 13-14:  Make the code common to add NX-GZIP enablement
Patch 15:       Get NX capabilities from the hypervisor
Patch 16:       Add sysfs interface to expose NX capabilities

Changes in V2:
  - Rebase on 5.12-rc6
  - Moved VAS Kconfig changes to arch/powerpc/platform as suggested
by Christophe Leroy
  - build fix with allyesconfig (reported by kernel test build)

Haren Myneni (16):
  powerpc/powernv/vas: Rename register/unregister functions
  powerpc/vas: Make VAS API powerpc platform independent
  powerpc/vas: Create take/drop task reference functions
  powerpc/vas: Move update_csb and dump_crb to platform independent
  powerpc/vas:  Define and use common vas_window struct
  powerpc/pseries/vas: Define VAS/NXGZIP HCALLs and structs
  powerpc/vas: Define QoS credit flag to allocate window
  powerpc/pseries/VAS: Implement allocate/modify/deallocate HCALLS
  powerpc/pseries/vas: Implement to get all capabilities
  powerpc/pseries/vas: Integrate API with open/close windows
  p

[V2 PATCH 01/16] powerpc/powernv/vas: Rename register/unregister functions

2021-04-13 Thread Haren Myneni


The powerNV and pseries drivers register / unregister with the
corresponding VAS code separately. So rename the powerNV VAS API
register/unregister functions.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h   |  6 +++---
 arch/powerpc/platforms/powernv/vas-api.c | 10 +-
 drivers/crypto/nx/nx-common-powernv.c|  6 +++---
 3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index e33f80b0ea81..41f73fae7ab8 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -170,8 +170,8 @@ int vas_paste_crb(struct vas_window *win, int offset, bool re);
  * Only NX GZIP coprocessor type is supported now, but this API can be
  * used for others in future.
  */
-int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
-   const char *name);
-void vas_unregister_coproc_api(void);
+int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
+const char *name);
+void vas_unregister_api_powernv(void);
 
 #endif /* __ASM_POWERPC_VAS_H */
diff --git a/arch/powerpc/platforms/powernv/vas-api.c b/arch/powerpc/platforms/powernv/vas-api.c
index 98ed5d8c5441..72d8ce39e56c 100644
--- a/arch/powerpc/platforms/powernv/vas-api.c
+++ b/arch/powerpc/platforms/powernv/vas-api.c
@@ -207,8 +207,8 @@ static struct file_operations coproc_fops = {
  * Supporting only nx-gzip coprocessor type now, but this API code
  * extended to other coprocessor types later.
  */
-int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
-   const char *name)
+int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
+const char *name)
 {
int rc = -EINVAL;
dev_t devno;
@@ -262,9 +262,9 @@ int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
unregister_chrdev_region(coproc_device.devt, 1);
return rc;
 }
-EXPORT_SYMBOL_GPL(vas_register_coproc_api);
+EXPORT_SYMBOL_GPL(vas_register_api_powernv);
 
-void vas_unregister_coproc_api(void)
+void vas_unregister_api_powernv(void)
 {
dev_t devno;
 
@@ -275,4 +275,4 @@ void vas_unregister_coproc_api(void)
class_destroy(coproc_device.class);
unregister_chrdev_region(coproc_device.devt, 1);
 }
-EXPORT_SYMBOL_GPL(vas_unregister_coproc_api);
+EXPORT_SYMBOL_GPL(vas_unregister_api_powernv);
diff --git a/drivers/crypto/nx/nx-common-powernv.c b/drivers/crypto/nx/nx-common-powernv.c
index 13c65deda8e9..88d728415bb2 100644
--- a/drivers/crypto/nx/nx-common-powernv.c
+++ b/drivers/crypto/nx/nx-common-powernv.c
@@ -1090,8 +1090,8 @@ static __init int nx_compress_powernv_init(void)
 * normal FIFO priority is assigned for userspace.
 * 842 compression is supported only in kernel.
 */
-   ret = vas_register_coproc_api(THIS_MODULE, VAS_COP_TYPE_GZIP,
-   "nx-gzip");
+   ret = vas_register_api_powernv(THIS_MODULE, VAS_COP_TYPE_GZIP,
+  "nx-gzip");
 
/*
 * GZIP is not supported in kernel right now.
@@ -1127,7 +1127,7 @@ static void __exit nx_compress_powernv_exit(void)
 * use. So delete this API use for GZIP engine.
 */
if (!nx842_ct)
-   vas_unregister_coproc_api();
+   vas_unregister_api_powernv();
 
crypto_unregister_alg(&nx842_powernv_alg);
 
-- 
2.18.2




[V2 PATCH 02/16] powerpc/vas: Make VAS API powerpc platform independent

2021-04-13 Thread Haren Myneni


The same /dev/crypto/nx-gzip interface is used for both powerNV and
pseries. So this patch moves the VAS API to a powerpc platform
independent directory. The actual functionality is not changed in this
patch.

Common interface functions such as open, window open ioctl, mmap
and close are moved to arch/powerpc/kernel/vas-api.c. Added
hooks to call platform specific code, but the underlying powerNV
code in these functions is not changed.
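
For illustration, the hook pattern introduced here, using struct
vas_user_win_ops and the new vas_register_coproc_api() signature from
the diff below (the callback names are hypothetical):

static struct vas_user_win_ops pnv_user_win_ops = {
	.open_win   = pnv_open_win,	/* hypothetical callbacks */
	.paste_addr = pnv_paste_addr,
	.close_win  = pnv_close_win,
};

rc = vas_register_coproc_api(THIS_MODULE, VAS_COP_TYPE_GZIP,
			     "nx-gzip", &pnv_user_win_ops);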

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h| 22 ++-
 arch/powerpc/kernel/Makefile  |  1 +
 .../{platforms/powernv => kernel}/vas-api.c   | 64 ++
 arch/powerpc/platforms/Kconfig| 15 +
 arch/powerpc/platforms/powernv/Kconfig| 14 
 arch/powerpc/platforms/powernv/Makefile   |  2 +-
 arch/powerpc/platforms/powernv/vas-window.c   | 66 +++
 7 files changed, 140 insertions(+), 44 deletions(-)
 rename arch/powerpc/{platforms/powernv => kernel}/vas-api.c (83%)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 41f73fae7ab8..6bbade60d8f4 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -5,6 +5,8 @@
 
 #ifndef _ASM_POWERPC_VAS_H
 #define _ASM_POWERPC_VAS_H
+#include 
+
 
 struct vas_window;
 
@@ -48,6 +50,16 @@ enum vas_cop_type {
VAS_COP_TYPE_MAX,
 };
 
+/*
+ * User space window operations used for powernv and powerVM
+ */
+struct vas_user_win_ops {
+   struct vas_window * (*open_win)(struct vas_tx_win_open_attr *,
+   enum vas_cop_type);
+   u64 (*paste_addr)(void *);
+   int (*close_win)(void *);
+};
+
 /*
  * Receive window attributes specified by the (in-kernel) owner of window.
  */
@@ -161,6 +173,9 @@ int vas_copy_crb(void *crb, int offset);
  * assumed to be true for NX windows.
  */
 int vas_paste_crb(struct vas_window *win, int offset, bool re);
+int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
+const char *name);
+void vas_unregister_api_powernv(void);
 
 /*
  * Register / unregister coprocessor type to VAS API which will be exported
@@ -170,8 +185,9 @@ int vas_paste_crb(struct vas_window *win, int offset, bool re);
  * Only NX GZIP coprocessor type is supported now, but this API can be
  * used for others in future.
  */
-int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
-const char *name);
-void vas_unregister_api_powernv(void);
+int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
+   const char *name,
+   struct vas_user_win_ops *vops);
+void vas_unregister_coproc_api(void);
 
 #endif /* __ASM_POWERPC_VAS_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index f66b63e81c3b..e4a9ff11b23d 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -118,6 +118,7 @@ obj-$(CONFIG_PPC_UDBG_16550) += legacy_serial.o udbg_16550.o
 obj-$(CONFIG_STACKTRACE)   += stacktrace.o
 obj-$(CONFIG_SWIOTLB)  += dma-swiotlb.o
 obj-$(CONFIG_ARCH_HAS_DMA_SET_MASK) += dma-mask.o
+obj-$(CONFIG_PPC_VAS)  += vas-api.o
 
 pci64-$(CONFIG_PPC64)  += pci_dn.o pci-hotplug.o isa-bridge.o
 obj-$(CONFIG_PCI)  += pci_$(BITS).o $(pci64-y) \
diff --git a/arch/powerpc/platforms/powernv/vas-api.c b/arch/powerpc/kernel/vas-api.c
similarity index 83%
rename from arch/powerpc/platforms/powernv/vas-api.c
rename to arch/powerpc/kernel/vas-api.c
index 72d8ce39e56c..05d7b99acf41 100644
--- a/arch/powerpc/platforms/powernv/vas-api.c
+++ b/arch/powerpc/kernel/vas-api.c
@@ -4,15 +4,20 @@
  * Copyright (C) 2019 Haren Myneni, IBM Corp
  */
 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
+#include 
 #include 
-#include "vas.h"
 
 /*
  * The driver creates the device node that can be used as follows:
@@ -42,6 +47,7 @@ static struct coproc_dev {
dev_t devt;
struct class *class;
enum vas_cop_type cop_type;
+   struct vas_user_win_ops *vops;
 } coproc_device;
 
 struct coproc_instance {
@@ -72,11 +78,10 @@ static int coproc_open(struct inode *inode, struct file *fp)
 static int coproc_ioc_tx_win_open(struct file *fp, unsigned long arg)
 {
void __user *uptr = (void __user *)arg;
-   struct vas_tx_win_attr txattr = {};
struct vas_tx_win_open_attr uattr;
struct coproc_instance *cp_inst;
struct vas_window *txwin;
-   int rc, vasid;
+   int rc;
 
cp_inst = fp->private_data;
 
@@ -93,27 +98,20 @@ static int coproc_ioc_tx_win_open(struct file *fp, unsigned long arg)
}
 
if (uattr.version != 1) {
-   pr_err("Invalid version\n");
+   pr_err("Invalid window open API version\n");
return -EINVAL;
}
 
-   vasi

RE: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-13 Thread David Laight
From: Matthew Wilcox 
> Sent: 12 April 2021 19:24
> 
> On Sun, Apr 11, 2021 at 11:43:07AM +0200, Jesper Dangaard Brouer wrote:
> > Could you explain your intent here?
> > I worry about @index.
> >
> > As I mentioned in other thread[1] netstack use page_is_pfmemalloc()
> > (code copy-pasted below signature) which imply that the member @index
> > have to be kept intact. In above, I'm unsure @index is untouched.
> 
> Well, I tried three different approaches.  Here's the one I hated the least.
> 
> From: "Matthew Wilcox (Oracle)" 
> Date: Sat, 10 Apr 2021 16:12:06 -0400
> Subject: [PATCH] mm: Fix struct page layout on 32-bit systems
> 
> 32-bit architectures which expect 8-byte alignment for 8-byte integers
> and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
> page inadvertently expanded in 2019.  When the dma_addr_t was added,
> it forced the alignment of the union to 8 bytes, which inserted a 4 byte
> gap between 'flags' and the union.
> 
> We could fix this by telling the compiler to use a smaller alignment
> for the dma_addr, but that seems a little fragile.  Instead, move the
> 'flags' into the union.  That causes dma_addr to shift into the same
> bits as 'mapping', which causes problems with page_mapping() called from
> set_page_dirty() in the munmap path.  To avoid this, insert three words
> of padding and use the same bits as ->index and ->private, neither of
> which have to be cleared on free.
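
A minimal illustration of the hole being described, assuming an ILP32
target where 8-byte integers are 8-byte aligned (arc/arm/mips/ppc,
unlike i386); names are illustrative:

#include <stdint.h>

typedef uint64_t dma_addr_t;	/* 64-bit DMA addresses */

struct before_fix {
	unsigned long flags;	/* 4 bytes on ILP32 */
	/* compiler inserts a 4-byte hole here, because... */
	union {
		dma_addr_t dma_addr;	/* ...this forces 8-byte alignment */
		void *mapping;
	};	/* lands at offset 8, not 4 */
};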

This all looks horribly fragile and is bound to get broken again.
Are there two problems?
1) The 'folio' structure needs to match 'rcu' part of the page
   so that it can use the same rcu list to free items.
2) Various uses of 'struct page' need to overlay fields to save space.

For (1) the rcu bit should probably be a named structure in an
anonymous union - probably in both structures.

For (2) is it worth explicitly defining the word number for each field?
So you end up with something like:
#define F(offset, member) struct { long _pad_##offset[offset]; member; }
struct page {
union {
struct page_rcu;
unsigned long flags;
F(1, unsigned long xxx);
F(2, unsigned long yyy);
etc.
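
For what it's worth, a compilable version of that idea (assuming a
64-bit long; demo_page and its members are illustrative, and the
page_rcu part is left out):

#include <stddef.h>

#define F(off, member) struct { long _pad_##off[off]; member; }

struct demo_page {
	union {
		unsigned long flags;		/* word 0 */
		F(1, unsigned long xxx);	/* pinned to word 1 */
		F(2, unsigned long yyy);	/* pinned to word 2 */
	};
};

_Static_assert(offsetof(struct demo_page, xxx) == 1 * sizeof(long), "");
_Static_assert(offsetof(struct demo_page, yyy) == 2 * sizeof(long), "");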



...
>   struct {/* page_pool used by netstack */
> - /**
> -  * @dma_addr: might require a 64-bit value even on
> -  * 32-bit architectures.
> -  */
> - dma_addr_t dma_addr;
> + unsigned long _pp_flags;
> + unsigned long pp_magic;
> + unsigned long xmi;
> + unsigned long _pp_mapping_pad;
> + dma_addr_t dma_addr;/* might be one or two words */
>   };

Isn't that 6 words?

David




[V2 PATCH 03/16] powerpc/vas: Create take/drop task reference functions

2021-04-13 Thread Haren Myneni


Take a task reference when each window opens and drop it during close.
This functionality is needed for both powerNV and pseries, so this patch
turns the existing code into functions in the powerpc platform
independent vas-api.c.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h  | 20 
 arch/powerpc/kernel/vas-api.c   | 51 ++
 arch/powerpc/platforms/powernv/vas-fault.c  | 10 ++--
 arch/powerpc/platforms/powernv/vas-window.c | 57 ++---
 arch/powerpc/platforms/powernv/vas.h|  6 +--
 5 files changed, 83 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 6bbade60d8f4..2daaa1a2a9a9 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -5,6 +5,9 @@
 
 #ifndef _ASM_POWERPC_VAS_H
 #define _ASM_POWERPC_VAS_H
+#include 
+#include 
+#include 
 #include 
 
 
@@ -60,6 +63,22 @@ struct vas_user_win_ops {
int (*close_win)(void *);
 };
 
+struct vas_win_task {
+   struct pid *pid;/* PID of the window owner */
+   struct pid *tgid;   /* Thread group ID of owner */
+   struct mm_struct *mm;   /* Linux process mm_struct */
+};
+
+static inline void vas_drop_reference_task(struct vas_win_task *task)
+{
+   /* Drop references to pid and mm */
+   put_pid(task->pid);
+   if (task->mm) {
+   mm_context_remove_vas_window(task->mm);
+   mmdrop(task->mm);
+   }
+}
+
 /*
  * Receive window attributes specified by the (in-kernel) owner of window.
  */
@@ -190,4 +209,5 @@ int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
struct vas_user_win_ops *vops);
 void vas_unregister_coproc_api(void);
 
+int vas_reference_task(struct vas_win_task *vtask);
 #endif /* __ASM_POWERPC_VAS_H */
diff --git a/arch/powerpc/kernel/vas-api.c b/arch/powerpc/kernel/vas-api.c
index 05d7b99acf41..d98caa734154 100644
--- a/arch/powerpc/kernel/vas-api.c
+++ b/arch/powerpc/kernel/vas-api.c
@@ -60,6 +60,57 @@ static char *coproc_devnode(struct device *dev, umode_t *mode)
return kasprintf(GFP_KERNEL, "crypto/%s", dev_name(dev));
 }
 
+/*
+ * Take reference to pid and mm
+ */
+int vas_reference_task(struct vas_win_task *vtask)
+{
+   /*
+* Window opened by a child thread may not be closed when
+* it exits. So take reference to its pid and release it
+* when the window is freed by the parent thread.
+* Acquire a reference to the task's pid to make sure
+* pid will not be re-used - needed only for multithread
+* applications.
+*/
+   vtask->pid = get_task_pid(current, PIDTYPE_PID);
+   /*
+* Acquire a reference to the task's mm.
+*/
+   vtask->mm = get_task_mm(current);
+   if (!vtask->mm) {
+   put_pid(vtask->pid);
+   pr_err("VAS: pid(%d): mm_struct is not found\n",
+   current->pid);
+   return -EPERM;
+   }
+
+   mmgrab(vtask->mm);
+   mmput(vtask->mm);
+   mm_context_add_vas_window(vtask->mm);
+   /*
+* Process closes window during exit. In the case of
+* multithread application, the child thread can open
+* window and can exit without closing it. Expects parent
+* thread to use and close the window. So do not need
+* to take pid reference for parent thread.
+*/
+   vtask->tgid = find_get_pid(task_tgid_vnr(current));
+   /*
+* Even a process that has no foreign real address mapping can
+* use an unpaired COPY instruction (to no real effect). Issue
+* CP_ABORT to clear any pending COPY and prevent a covert
+* channel.
+*
+* __switch_to() will issue CP_ABORT on future context switches
+* if process / thread has any open VAS window (Use
+* current->mm->context.vas_windows).
+*/
+   asm volatile(PPC_CP_ABORT);
+
+   return 0;
+}
+
 static int coproc_open(struct inode *inode, struct file *fp)
 {
struct coproc_instance *cp_inst;
diff --git a/arch/powerpc/platforms/powernv/vas-fault.c b/arch/powerpc/platforms/powernv/vas-fault.c
index 3d21fce254b7..a4835cb82c09 100644
--- a/arch/powerpc/platforms/powernv/vas-fault.c
+++ b/arch/powerpc/platforms/powernv/vas-fault.c
@@ -73,7 +73,7 @@ static void update_csb(struct vas_window *window,
 * NX user space windows can not be opened for task->mm=NULL
 * and faults will not be generated for kernel requests.
 */
-   if (WARN_ON_ONCE(!window->mm || !window->user_win))
+   if (WARN_ON_ONCE(!window->task.mm || !window->user_win))
return;
 
csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
@@ -92,7 +92,7 @@ static void update_csb(struct vas_window *window,
csb.address = crb->stamp.nx.fault_storage_addr;
csb.flags = 0;
 
-   pid = window->pid;
+   

[V2 PATCH 04/16] powerpc/vas: Move update_csb and dump_crb to platform independent

2021-04-13 Thread Haren Myneni


NX issues an interrupt when it sees a fault on a user space buffer. The
kernel processes the fault by updating the CSB. This functionality is
the same for both powerNV and pseries, so this patch moves these
functions to vas-api.c; the actual functionality is not changed.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h |   3 +
 arch/powerpc/kernel/vas-api.c  | 146 ++-
 arch/powerpc/platforms/powernv/vas-fault.c | 155 ++---
 3 files changed, 157 insertions(+), 147 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 2daaa1a2a9a9..66bf8fb1a1be 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -210,4 +210,7 @@ int vas_register_coproc_api(struct module *mod, enum vas_cop_type cop_type,
 void vas_unregister_coproc_api(void);
 
 int vas_reference_task(struct vas_win_task *vtask);
+void vas_update_csb(struct coprocessor_request_block *crb,
+   struct vas_win_task *vtask);
+void vas_dump_crb(struct coprocessor_request_block *crb);
 #endif /* __ASM_POWERPC_VAS_H */
diff --git a/arch/powerpc/kernel/vas-api.c b/arch/powerpc/kernel/vas-api.c
index d98caa734154..dc131b2e4acd 100644
--- a/arch/powerpc/kernel/vas-api.c
+++ b/arch/powerpc/kernel/vas-api.c
@@ -111,6 +111,150 @@ int vas_reference_task(struct vas_win_task *vtask)
return 0;
 }
 
+/*
+ * Update the CSB to indicate a translation error.
+ *
+ * User space will be polling on CSB after the request is issued.
+ * If NX can handle the request without any issues, it updates CSB.
+ * Whereas if NX encounters page fault, the kernel will handle the
+ * fault and update CSB with translation error.
+ *
+ * If we are unable to update the CSB (meaning copy_to_user failed due
+ * to an invalid csb_addr), send a signal to the process.
+ */
+void vas_update_csb(struct coprocessor_request_block *crb,
+   struct vas_win_task *vtask)
+{
+   struct coprocessor_status_block csb;
+   struct kernel_siginfo info;
+   struct task_struct *tsk;
+   void __user *csb_addr;
+   struct pid *pid;
+   int rc;
+
+   /*
+* NX user space windows can not be opened for task->mm=NULL
+* and faults will not be generated for kernel requests.
+*/
+   if (WARN_ON_ONCE(!vtask->mm))
+   return;
+
+   csb_addr = (void __user *)be64_to_cpu(crb->csb_addr);
+
+   memset(&csb, 0, sizeof(csb));
+   csb.cc = CSB_CC_FAULT_ADDRESS;
+   csb.ce = CSB_CE_TERMINATION;
+   csb.cs = 0;
+   csb.count = 0;
+
+   /*
+* NX operates and returns in BE format as defined CRB struct.
+* So saves fault_storage_addr in BE as NX pastes in FIFO and
+* expects user space to convert to CPU format.
+*/
+   csb.address = crb->stamp.nx.fault_storage_addr;
+   csb.flags = 0;
+
+   pid = vtask->pid;
+   tsk = get_pid_task(pid, PIDTYPE_PID);
+   /*
+* Process closes send window after all pending NX requests are
+* completed. In multi-thread applications, a child thread can
+* open a window and can exit without closing it. Maybe some
+* requests are pending or this window can be used by other
+* threads later. We should handle faults if NX encounters
+* page faults on these requests. Update CSB with translation
+* error and fault address. If csb_addr passed by user space is
+* invalid, send SEGV signal to pid saved in window. If the
+* child thread is not running, send the signal to tgid.
+* Parent thread (tgid) will close this window upon its exit.
+*
+* pid and mm references are taken when window is opened by
+* process (pid). So tgid is used only when child thread opens
+* a window and exits without closing it.
+*/
+   if (!tsk) {
+   pid = vtask->tgid;
+   tsk = get_pid_task(pid, PIDTYPE_PID);
+   /*
+* Parent thread (tgid) will be closing window when it
+* exits. So should not get here.
+*/
+   if (WARN_ON_ONCE(!tsk))
+   return;
+   }
+
+   /* Return if the task is exiting. */
+   if (tsk->flags & PF_EXITING) {
+   put_task_struct(tsk);
+   return;
+   }
+
+   kthread_use_mm(vtask->mm);
+   rc = copy_to_user(csb_addr, &csb, sizeof(csb));
+   /*
+* User space polls on csb.flags (first byte). So add barrier
+* then copy first byte with csb flags update.
+*/
+   if (!rc) {
+   csb.flags = CSB_V;
+   /* Make sure update to csb.flags is visible now */
+   smp_mb();
+   rc = copy_to_user(csb_addr, &csb, sizeof(u8));
+   }
+   kthread_unuse_mm(vtask->mm);
+   put_task_struct(tsk);
+
+   /* Success */
+   if (!rc)
+   return;
+
+
+ 

[V2 PATCH 05/16] powerpc/vas: Define and use common vas_window struct

2021-04-13 Thread Haren Myneni


The same vas_window struct is used on powerNV and pseries. So this patch
changes struct vas_window to support both platforms and makes the
corresponding modifications in the powerNV VAS code.

On powerNV, vas_window is used for both TX and RX windows, whereas it is
used only for TX windows on powerVM. So some elements are specific to
each platform.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h  |  48 
 arch/powerpc/platforms/powernv/vas-debug.c  |  12 +-
 arch/powerpc/platforms/powernv/vas-fault.c  |   4 +-
 arch/powerpc/platforms/powernv/vas-trace.h  |   6 +-
 arch/powerpc/platforms/powernv/vas-window.c | 129 +++-
 arch/powerpc/platforms/powernv/vas.h|  38 +-
 6 files changed, 135 insertions(+), 102 deletions(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index 66bf8fb1a1be..f928bf4c7e98 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -69,6 +69,54 @@ struct vas_win_task {
struct mm_struct *mm;   /* Linux process mm_struct */
 };
 
+/*
+ * In-kernel state of a VAS window. One per window.
+ * powerVM: Used only for Tx windows.
+ * powerNV: Used for both Tx and Rx windows.
+ */
+struct vas_window {
+   u32 winid;
+   u32 wcreds_max; /* Window credits */
+   enum vas_cop_type cop;
+   struct vas_win_task task;
+   char *dbgname;
+   struct dentry *dbgdir;
+   union {
+   /* powerNV specific data */
+   struct {
+   void *vinst;/* points to VAS instance */
+   bool tx_win;/* True if send window */
+   bool nx_win;/* True if NX window */
+   bool user_win;  /* True if user space window */
+   void *hvwc_map; /* HV window context */
+   void *uwc_map;  /* OS/User window context */
+
+   /* Fields applicable only to send windows */
+   void *paste_kaddr;
+   char *paste_addr_name;
+   struct vas_window *rxwin;
+
+   atomic_t num_txwins;/* Only for receive windows */
+   } pnv;
+   struct {
+   u64 win_addr;   /* Physical paste address */
+   u8 win_type;/* QoS or Default window */
+   u8 status;
+   u32 complete_irq;   /* Completion interrupt */
+   u32 fault_irq;  /* Fault interrupt */
+   u64 domain[6];  /* Associativity domain Ids to which this window is allocated */
+   u64 util;
+
+   /* List of windows opened which is used for LPM */
+   struct list_head win_list;
+   u64 flags;
+   char *name;
+   int fault_virq;
+   } lpar;
+   };
+};
+
 static inline void vas_drop_reference_task(struct vas_win_task *task)
 {
/* Drop references to pid and mm */
diff --git a/arch/powerpc/platforms/powernv/vas-debug.c b/arch/powerpc/platforms/powernv/vas-debug.c
index 41fa90d2f4ab..80f735449ab8 100644
--- a/arch/powerpc/platforms/powernv/vas-debug.c
+++ b/arch/powerpc/platforms/powernv/vas-debug.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "vas.h"
 
 static struct dentry *vas_debugfs;
@@ -33,11 +34,11 @@ static int info_show(struct seq_file *s, void *private)
mutex_lock(&vas_mutex);
 
/* ensure window is not unmapped */
-   if (!window->hvwc_map)
+   if (!window->pnv.hvwc_map)
goto unlock;
 
seq_printf(s, "Type: %s, %s\n", cop_to_str(window->cop),
-   window->tx_win ? "Send" : "Receive");
+   window->pnv.tx_win ? "Send" : "Receive");
seq_printf(s, "Pid : %d\n", vas_window_pid(window));
 
 unlock:
@@ -60,7 +61,7 @@ static int hvwc_show(struct seq_file *s, void *private)
mutex_lock(&vas_mutex);
 
/* ensure window is not unmapped */
-   if (!window->hvwc_map)
+   if (!window->pnv.hvwc_map)
goto unlock;
 
print_reg(s, window, VREG(LPID));
@@ -115,9 +116,10 @@ void vas_window_free_dbgdir(struct vas_window *window)
 
 void vas_window_init_dbgdir(struct vas_window *window)
 {
+   struct vas_instance *vinst = window->pnv.vinst;
struct dentry *d;
 
-   if (!window->vinst->dbgdir)
+   if (!vinst->dbgdir)
return;
 
window->dbgname = kzalloc(16, GFP_KERNEL);
@@ -126,7 +128,7 @@ void vas_window_init_dbgdir(struct vas_window *window)
 
snprintf(window->dbgname, 16, "w%d", window->winid);
 
-   d = debugfs_create_dir(window->dbgname, window->vinst->dbgdir);
+   d = debugfs_create_dir(window->dbgname, vinst->dbgdir);
window->dbgdir = d;
 
  

[V2 PATCH 06/16] powerpc/pseries/vas: Define VAS/NXGZIP HCALLs and structs

2021-04-13 Thread Haren Myneni


This patch adds HCALLs and other definitions. It also defines structs
that are used in the VAS implementation on powerVM.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/hvcall.h|  7 ++
 arch/powerpc/include/asm/vas.h   | 28 
 arch/powerpc/platforms/pseries/vas.h | 96 
 3 files changed, 131 insertions(+)
 create mode 100644 arch/powerpc/platforms/pseries/vas.h

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index ed6086d57b22..accbb7f6f272 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -294,6 +294,13 @@
 #define H_RESIZE_HPT_COMMIT0x370
 #define H_REGISTER_PROC_TBL0x37C
 #define H_SIGNAL_SYS_RESET 0x380
+#define H_ALLOCATE_VAS_WINDOW  0x388
+#define H_MODIFY_VAS_WINDOW0x38C
+#define H_DEALLOCATE_VAS_WINDOW0x390
+#define H_QUERY_VAS_WINDOW 0x394
+#define H_QUERY_VAS_CAPABILITIES   0x398
+#define H_QUERY_NX_CAPABILITIES0x39C
+#define H_GET_NX_FAULT 0x3A0
 #define H_INT_GET_SOURCE_INFO   0x3A8
 #define H_INT_SET_SOURCE_CONFIG 0x3AC
 #define H_INT_GET_SOURCE_CONFIG 0x3B0
diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index f928bf4c7e98..d15784506a54 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -179,6 +179,7 @@ struct vas_tx_win_attr {
bool rx_win_ord_mode;
 };
 
+#ifdef CONFIG_PPC_POWERNV
 /*
  * Helper to map a chip id to VAS id.
  * For POWER9, this is a 1:1 mapping. In the future this maybe a 1:N
@@ -243,6 +244,33 @@ int vas_paste_crb(struct vas_window *win, int offset, bool re);
 int vas_register_api_powernv(struct module *mod, enum vas_cop_type cop_type,
 const char *name);
 void vas_unregister_api_powernv(void);
+#endif
+
+#ifdef CONFIG_PPC_PSERIES
+
+/* VAS Capabilities */
+#define VAS_GZIP_QOS_FEAT  0x1
+#define VAS_GZIP_DEF_FEAT  0x2
+#define VAS_GZIP_QOS_FEAT_BIT  (1UL << (63 - VAS_GZIP_QOS_FEAT)) /* Bit 1 */
+#define VAS_GZIP_DEF_FEAT_BIT  (1UL << (63 - VAS_GZIP_DEF_FEAT)) /* Bit 2 */
+
+/* NX Capabilities */
+#define VAS_NX_GZIP_FEAT   0x1
+#define VAS_NX_GZIP_FEAT_BIT   (1UL << (63 - VAS_NX_GZIP_FEAT)) /* Bit 1 */
+#define VAS_DESCR_LEN  8
+
+struct vas_all_capabs_be {
+   __be64  descriptor;
+   __be64  feat_type;
+} __packed __aligned(0x1000);
+
+struct vas_all_capabs {
+   charname[VAS_DESCR_LEN + 1];
+   u64 descriptor;
+   u64 feat_type;
+};
+
+#endif
 
 /*
  * Register / unregister coprocessor type to VAS API which will be exported
diff --git a/arch/powerpc/platforms/pseries/vas.h b/arch/powerpc/platforms/pseries/vas.h
new file mode 100644
index ..208682fffa57
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/vas.h
@@ -0,0 +1,96 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright 2020-21 IBM Corp.
+ */
+
+#ifndef _VAS_H
+#define _VAS_H
+#include 
+#include 
+#include 
+
+/*
+ * VAS window modify flags
+ */
+#define VAS_MOD_WIN_CLOSE  (1UL << 63)
+#define VAS_MOD_WIN_JOBS_KILL  (1UL << (63 - 1))
+#define VAS_MOD_WIN_DR (1UL << (63 - 3))
+#define VAS_MOD_WIN_PR (1UL << (63 - 4))
+#define VAS_MOD_WIN_SF (1UL << (63 - 5))
+#define VAS_MOD_WIN_TA (1UL << (63 - 6))
+#define VAS_MOD_WIN_FLAGS  (VAS_MOD_WIN_JOBS_KILL | VAS_MOD_WIN_DR | \
+   VAS_MOD_WIN_PR | VAS_MOD_WIN_SF)
+
+#define VAS_WIN_ACTIVE 0x0
+#define VAS_WIN_CLOSED 0x1
+#define VAS_WIN_INACTIVE   0x2 /* Inactive due to HW failure */
+/* Process of being modified, deallocated, or quiesced */
+#define VAS_WIN_MOD_IN_PROCESS 0x3
+
+#define VAS_COPY_PASTE_USER_MODE   0x0001
+#define VAS_COP_OP_USER_MODE   0x0010
+
+/*
+ * Co-processor feature - GZIP QoS windows or GZIP default windows
+ */
+enum vas_cop_feat_type {
+   VAS_GZIP_QOS_FEAT_TYPE,
+   VAS_GZIP_DEF_FEAT_TYPE,
+   VAS_MAX_FEAT_TYPE,
+};
+
+struct vas_ct_capabs_be {
+   __be64  descriptor;
+   u8  win_type;   /* Default or QoS type */
+   u8  user_mode;
+   __be16  max_lpar_creds;
+   __be16  max_win_creds;
+   union {
+   __be16  reserved;
+   __be16  def_lpar_creds; /* Used for default capabilities */
+   };
+   __be16  target_lpar_creds;
+} __packed __aligned(0x1000);
+
+struct vas_ct_capabs {
+   charname[VAS_DESCR_LEN + 1];
+   u64 descriptor;
+   u8  win_type;   /* Default or QoS type */
+   u8  user_mode;  /* User mode copy/paste or COP HCALL */
+   u16 max_lpar_creds; /* Max credits available in LPAR */
+   /* Max credits can be assigned per win

[V2 PATCH 07/16] powerpc/vas: Define QoS credit flag to allocate window

2021-04-13 Thread Haren Myneni


pHyp introduces two different types of credits: Default and Quality
of Service (QoS).

The total number of default credits available on each LPAR depends
on the CPU resources configured. But these credits can be shared or
over-committed across LPARs in shared mode, which can result in
paste command failures (RMA_busy). To avoid NX HW contention, pHyp
introduces the QoS credit type, which guarantees access to NX
resources. The system admins can assign QoS credits for each LPAR
via HMC.

The default credit type is used to allocate a VAS window by default
in the powerVM implementation. But the process can pass the
VAS_WIN_QOS_CREDITS flag with the VAS_TX_WIN_OPEN ioctl to open a
VAS QoS type window.
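
For example, requesting a QoS window could look like this (a sketch;
fd is a descriptor for the /dev/crypto/nx-gzip device node):

	struct vas_tx_win_open_attr attr = {
		.version = 1,
		.vas_id  = -1,			/* let the hypervisor choose */
		.flags   = VAS_WIN_QOS_CREDITS,	/* QoS, not default credits */
	};

	rc = ioctl(fd, VAS_TX_WIN_OPEN, &attr);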

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/uapi/asm/vas-api.h | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/uapi/asm/vas-api.h 
b/arch/powerpc/include/uapi/asm/vas-api.h
index ebd4b2424785..eb7c8694174f 100644
--- a/arch/powerpc/include/uapi/asm/vas-api.h
+++ b/arch/powerpc/include/uapi/asm/vas-api.h
@@ -13,11 +13,15 @@
 #define VAS_MAGIC  'v'
 #define VAS_TX_WIN_OPEN _IOW(VAS_MAGIC, 0x20, struct vas_tx_win_open_attr)
 
+/* Flags to VAS TX open window ioctl */
+/* To allocate a window with QoS credit, otherwise default credit is used */
+#define VAS_WIN_QOS_CREDITS    0x0001
+
 struct vas_tx_win_open_attr {
__u32   version;
__s16   vas_id; /* specific instance of vas or -1 for default */
__u16   reserved1;
-   __u64   flags;  /* Future use */
+   __u64   flags;
__u64   reserved2[6];
 };
 
-- 
2.18.2




[V2 PATCH 08/16] powerpc/pseries/VAS: Implement allocate/modify/deallocate HCALLS

2021-04-13 Thread Haren Myneni


This patch adds the following HCALLs which are used to allocate,
modify and deallocate VAS windows.

H_ALLOCATE_VAS_WINDOW: Allocate VAS window
H_DEALLOCATE_VAS_WINDOW: Close VAS window
H_MODIFY_VAS_WINDOW: Setup window before using

Also adds phyp call (H_QUERY_VAS_CAPABILITIES) to get all VAS
capabilities that phyp provides.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/pseries/vas.c | 217 +++
 1 file changed, 217 insertions(+)
 create mode 100644 arch/powerpc/platforms/pseries/vas.c

diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
new file mode 100644
index ..06960151477c
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -0,0 +1,217 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2020-21 IBM Corp.
+ */
+
+#define pr_fmt(fmt) "vas: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "vas.h"
+
+#define VAS_INVALID_WIN_ADDRESS    0xFFFFFFFFFFFFFFFFul
+#define VAS_DEFAULT_DOMAIN_ID  0xFFFFFFFFFFFFFFFFul
+/* Authority Mask Register (AMR) value is not supported in */
+/* linux implementation. So pass '0' to modify window HCALL */
+#defineVAS_AMR_VALUE   0
+/* phyp allows one credit per window right now */
+#define DEF_WIN_CREDS  1
+
+static int64_t hcall_return_busy_check(int64_t rc)
+{
+   /* Check if we are stalled for some time */
+   if (H_IS_LONG_BUSY(rc)) {
+   msleep(get_longbusy_msecs(rc));
+   rc = H_BUSY;
+   } else if (rc == H_BUSY) {
+   cond_resched();
+   }
+
+   return rc;
+}
+
+/*
+ * Allocate VAS window HCALL
+ */
+static int plpar_vas_allocate_window(struct vas_window *win, u64 *domain,
+u8 wintype, u16 credits)
+{
+   long retbuf[PLPAR_HCALL9_BUFSIZE] = {0};
+   int64_t rc;
+
+   do {
+   rc = plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, wintype,
+ credits, domain[0], domain[1], domain[2],
+ domain[3], domain[4], domain[5]);
+
+   rc = hcall_return_busy_check(rc);
+   } while (rc == H_BUSY);
+
+   switch (rc) {
+   case H_SUCCESS:
+   win->winid = retbuf[0];
+   win->lpar.win_addr = retbuf[1];
+   win->lpar.complete_irq = retbuf[2];
+   win->lpar.fault_irq = retbuf[3];
+   if (win->lpar.win_addr == VAS_INVALID_WIN_ADDRESS) {
+   pr_err("HCALL(%x): COPY/PASTE is not supported\n",
+   H_ALLOCATE_VAS_WINDOW);
+   return -ENOTSUPP;
+   }
+   return 0;
+   case H_PARAMETER:
+   pr_err("HCALL(%x): Invalid window type (%u)\n",
+   H_ALLOCATE_VAS_WINDOW, wintype);
+   return -EINVAL;
+   case H_P2:
+   pr_err("HCALL(%x): Credits(%u) exceed maximum window credits\n",
+   H_ALLOCATE_VAS_WINDOW, credits);
+   return -EINVAL;
+   case H_COP_HW:
+   pr_err("HCALL(%x): User-mode COPY/PASTE is not supported\n",
+   H_ALLOCATE_VAS_WINDOW);
+   return -ENOTSUPP;
+   case H_RESOURCE:
+   pr_err("HCALL(%x): LPAR credit limit exceeds window limit\n",
+   H_ALLOCATE_VAS_WINDOW);
+   return -EPERM;
+   case H_CONSTRAINED:
+   pr_err("HCALL(%x): Credits (%u) are not available\n",
+   H_ALLOCATE_VAS_WINDOW, credits);
+   return -EPERM;
+   default:
+   pr_err("HCALL(%x): Unexpected error %lld\n",
+   H_ALLOCATE_VAS_WINDOW, rc);
+   return -EIO;
+   }
+}
+
+/*
+ * Deallocate VAS window HCALL.
+ */
+static int plpar_vas_deallocate_window(u64 winid)
+{
+   int64_t rc;
+
+   do {
+   rc = plpar_hcall_norets(H_DEALLOCATE_VAS_WINDOW, winid);
+
+   rc = hcall_return_busy_check(rc);
+   } while (rc == H_BUSY);
+
+   switch (rc) {
+   case H_SUCCESS:
+   return 0;
+   case H_PARAMETER:
+   pr_err("HCALL(%x): Invalid window ID %llu\n",
+   H_DEALLOCATE_VAS_WINDOW, winid);
+   return -EINVAL;
+   case H_STATE:
+   pr_err("HCALL(%x): Window(%llu): Invalid page table entries\n",
+   H_DEALLOCATE_VAS_WINDOW, winid);
+   return -EPERM;
+   default:
+   pr_err("HCALL(%x): Unexpected error %lld for window(%llu)\n",
+   H_DEALLOCATE_VAS_WINDOW, rc, winid);
+   return -EIO;
+   }
+}
+
+/*
+ * Modify VAS window.
+ * After the window is opened with allocate window HCALL, configure it
+ * with flags and LPAR PID before using.

Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Alexey Kardashevskiy




On 13/04/2021 17:58, Leonardo Bras wrote:

On Tue, 2021-04-13 at 17:41 +1000, Alexey Kardashevskiy wrote:


On 13/04/2021 17:33, Leonardo Bras wrote:

On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:


On 13/04/2021 15:49, Leonardo Bras wrote:

Thanks for the feedback!

On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:

-static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
+static phys_addr_t ddw_memory_hotplug_max(void)



Please, forward declaration or a separate patch; this creates
unnecessary noise to the actual change.



Sure, done!




+   _iommu_table_setparms(tbl, pci->phb->bus->number, create.liobn, win_addr,
+ 1UL << len, page_shift, 0, &iommu_table_lpar_multi_ops);
+   iommu_init_table(tbl, pci->phb->node, 0, 0);



It is 0,0 only if win_addr>0 which is not the QEMU case.



Oh, ok.
I previously thought it was ok to use 0,0 here as any other usage in
this file was also 0,0.

What should I use to get the correct parameters? Is using the previous
tbl->it_reserved_start and tbl->it_reserved_end enough?


depends on whether you carry reserved start/end even if they are outside
of the dma window.



Oh, that makes sense.
On a previous patch (5/14 IIRC), I changed the behavior to only store
the valid range on tbl, but now I understand why it's important to
store the raw value.

Ok, I will change it back so the reserved range stays in tbl even if it
does not intersect with the DMA window. This way I can reuse the values
in case of indirect mapping with DDW.

Is that ok? Are the reserved values supposed to stay the same after
changing from the default DMA window to DDW?


I added them to know what bits in it_map to ignore when checking if
there is any active user of the table. If you have non-zero reserved
start/end but they do not affect it_map, then it is a rather weird way to
carry reserved start/end from DDW to no-DDW.



Ok, agreed.


  Maybe do not set these at
all for DDW with window start at 1<<59, and when going back to no-DDW (or
if DDW starts at 0) just set them from MMIO32, just as they are
initialized in the first place.



If I get it correctly from pci_of_scan.c, MMIO32 = {0, 32MB}, is that
correct?


No, under QEMU it is 0x8000.-0x1..:

/proc/device-tree/pci@8002000/ranges

7 cells for each resource, the second one is MMIO32 (the first is IO 
ports, the last is 64bit MMIO).




So, if DDW starts at any value in this range (most probably at zero),
we should remove the rest, is that correct?

Could it always use iommu_init_table(..., 0, 32MB) here, so it always
reserves any part of the DMA window that's in this range? Or may there
be other reserved value ranges?


and when going back to no-DDW


After iommu_init_table() there should be no failure, so it looks like
there is no 'going back to no-DDW'. Am I missing something?


Well, a random driver could request 32-bit DMA and if the new window is
1:1 then it would break, but this does not seem to happen and we do not
support it anyway, so no loss here.



--
Alexey


[V2 PATCH 09/16] powerpc/pseries/vas: Implement to get all capabilities

2021-04-13 Thread Haren Myneni


pHyp provides various VAS capabilities such as GZIP default and QoS
capabilities, which are used to determine the total number of credits
available in the LPAR, maximum window credits, maximum LPAR credits,
whether usermode copy/paste is supported, etc.

So first retrieve overall vas capabilities using
H_QUERY_VAS_CAPABILITIES HCALL which tells the specific features that
are available. Then retrieve the specific capabilities by using the
feature type in H_QUERY_VAS_CAPABILITIES HCALL.

pHyp supports only GZIP default and GZIP QoS capabilities right now.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/pseries/vas.c | 130 +++
 1 file changed, 130 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/vas.c b/arch/powerpc/platforms/pseries/vas.c
index 06960151477c..35946fb02995 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -30,6 +30,13 @@
 /* phyp allows one credit per window right now */
 #define DEF_WIN_CREDS  1
 
+static struct vas_all_capabs capabs_all;
+static int copypaste_feat;
+
+struct vas_capabs vcapabs[VAS_MAX_FEAT_TYPE];
+
+DEFINE_MUTEX(vas_pseries_mutex);
+
 static int64_t hcall_return_busy_check(int64_t rc)
 {
/* Check if we are stalled for some time */
@@ -215,3 +222,126 @@ int plpar_vas_query_capabilities(const u64 hcall, u8 query_type,
return -EIO;
}
 }
+
+/*
+ * Get the specific capabilities based on the feature type.
+ * Right now supports GZIP default and GZIP QoS capabilities.
+ */
+static int get_vas_capabilities(u8 feat, enum vas_cop_feat_type type,
+   struct vas_ct_capabs_be *capab_be)
+{
+   struct vas_ct_capabs *capab;
+   struct vas_capabs *vcapab;
+   int rc = 0;
+
+   vcapab = &vcapabs[type];
+   memset(vcapab, 0, sizeof(*vcapab));
+   INIT_LIST_HEAD(&vcapab->list);
+
+   capab = &vcapab->capab;
+
+   rc = plpar_vas_query_capabilities(H_QUERY_VAS_CAPABILITIES, feat,
+ (u64)virt_to_phys(capab_be));
+   if (rc)
+   return rc;
+
+   capab->user_mode = capab_be->user_mode;
+   if (!(capab->user_mode & VAS_COPY_PASTE_USER_MODE)) {
+   pr_err("User space COPY/PASTE is not supported\n");
+   return -ENOTSUPP;
+   }
+
+   snprintf(capab->name, VAS_DESCR_LEN + 1, "%.8s",
+(char *)&capab_be->descriptor);
+   capab->descriptor = be64_to_cpu(capab_be->descriptor);
+   capab->win_type = capab_be->win_type;
+   if (capab->win_type >= VAS_MAX_FEAT_TYPE) {
+   pr_err("Unsupported window type %u\n", capab->win_type);
+   return -EINVAL;
+   }
+   capab->max_lpar_creds = be16_to_cpu(capab_be->max_lpar_creds);
+   capab->max_win_creds = be16_to_cpu(capab_be->max_win_creds);
+   atomic_set(&capab->target_lpar_creds,
+  be16_to_cpu(capab_be->target_lpar_creds));
+   if (feat == VAS_GZIP_DEF_FEAT) {
+   capab->def_lpar_creds = be16_to_cpu(capab_be->def_lpar_creds);
+
+   if (capab->max_win_creds < DEF_WIN_CREDS) {
+   pr_err("Window creds(%u) > max allowed window 
creds(%u)\n",
+  DEF_WIN_CREDS, capab->max_win_creds);
+   return -EINVAL;
+   }
+   }
+
+   copypaste_feat = 1;
+
+   return 0;
+}
+
+static int __init pseries_vas_init(void)
+{
+   struct vas_ct_capabs_be *ct_capabs_be;
+   struct vas_all_capabs_be *capabs_be;
+   int rc;
+
+   /*
+* Linux supports user space COPY/PASTE only with Radix
+*/
+   if (!radix_enabled()) {
+   pr_err("API is supported only with radix page tables\n");
+   return -ENOTSUPP;
+   }
+
+   capabs_be = kmalloc(sizeof(*capabs_be), GFP_KERNEL);
+   if (!capabs_be)
+   return -ENOMEM;
+   /*
+* Get VAS overall capabilities by passing 0 to feature type.
+*/
+   rc = plpar_vas_query_capabilities(H_QUERY_VAS_CAPABILITIES, 0,
+ (u64)virt_to_phys(capabs_be));
+   if (rc)
+   goto out;
+
+   snprintf(capabs_all.name, VAS_DESCR_LEN, "%.7s",
+(char *)&capabs_be->descriptor);
+   capabs_all.descriptor = be64_to_cpu(capabs_be->descriptor);
+   capabs_all.feat_type = be64_to_cpu(capabs_be->feat_type);
+
+   ct_capabs_be = kmalloc(sizeof(*ct_capabs_be), GFP_KERNEL);
+   if (!ct_capabs_be) {
+   rc = -ENOMEM;
+   goto out;
+   }
+   /*
+* QOS capabilities available
+*/
+   if (capabs_all.feat_type & VAS_GZIP_QOS_FEAT_BIT) {
+   rc = get_vas_capabilities(VAS_GZIP_QOS_FEAT,
+ VAS_GZIP_QOS_FEAT_TYPE, ct_capabs_be);
+
+   if (rc)
+   goto out_ct;
+   }
+   /*
+* Def

[V2 PATCH 10/16] powerpc/pseries/vas: Integrate API with open/close windows

2021-04-13 Thread Haren Myneni


This patch adds VAS window allocation/close with the corresponding
HCALLs. It also integrates with the existing user space VAS
API and provides register/unregister functions to the NX pseries driver.

The driver register function is used to create the user space
interface (/dev/crypto/nx-gzip) and unregister to remove this entry.

The user space process opens this device node and makes an ioctl to
allocate a VAS window. The close interface is used to deallocate the
window, as sketched below.
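
For reference, the expected user space usage looks roughly like this
(a sketch only; the ioctl and the attribute struct are the ones the
existing powerNV API already exposes in asm/vas-api.h, assumed unchanged
here):

	int fd = open("/dev/crypto/nx-gzip", O_RDWR);
	struct vas_tx_win_open_attr attr = {
		.version = 1,
		.vas_id = -1,	/* let the kernel pick */
	};

	ioctl(fd, VAS_TX_WIN_OPEN, &attr);
	/* ... mmap() the paste address and issue copy/paste ... */
	close(fd);	/* deallocates the window */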

Signed-off-by: Haren Myneni 
---
 arch/powerpc/include/asm/vas.h  |   5 +
 arch/powerpc/platforms/Kconfig  |   2 +-
 arch/powerpc/platforms/pseries/Makefile |   1 +
 arch/powerpc/platforms/pseries/vas.c| 212 
 4 files changed, 219 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/vas.h b/arch/powerpc/include/asm/vas.h
index d15784506a54..aa1974aba27e 100644
--- a/arch/powerpc/include/asm/vas.h
+++ b/arch/powerpc/include/asm/vas.h
@@ -270,6 +270,11 @@ struct vas_all_capabs {
u64 feat_type;
 };
 
+int plpar_vas_query_capabilities(const u64 hcall, u8 query_type,
+u64 result);
+int vas_register_api_pseries(struct module *mod,
+enum vas_cop_type cop_type, const char *name);
+void vas_unregister_api_pseries(void);
 #endif
 
 /*
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 1b14dc7343f3..778b9bbc11af 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -71,7 +71,7 @@ config PPC_DT_CPU_FTRS
 
 config PPC_VAS
bool "IBM Virtual Accelerator Switchboard (VAS)"
-   depends on PPC_POWERNV && PPC_64K_PAGES
+   depends on (PPC_POWERNV || PPC_PSERIES) && PPC_64K_PAGES
default y
help
  This enables support for IBM Virtual Accelerator Switchboard (VAS).
diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index c8a2b0b05ac0..4cda0ef87be0 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -30,3 +30,4 @@ obj-$(CONFIG_PPC_SVM) += svm.o
 obj-$(CONFIG_FA_DUMP)  += rtas-fadump.o
 
 obj-$(CONFIG_SUSPEND)  += suspend.o
+obj-$(CONFIG_PPC_VAS)  += vas.o
diff --git a/arch/powerpc/platforms/pseries/vas.c 
b/arch/powerpc/platforms/pseries/vas.c
index 35946fb02995..0ade0d6d728f 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -222,6 +222,218 @@ int plpar_vas_query_capabilities(const u64 hcall, u8 
query_type,
return -EIO;
}
 }
+EXPORT_SYMBOL_GPL(plpar_vas_query_capabilities);
+
+/*
+ * Allocate window and setup IRQ mapping.
+ */
+static int allocate_setup_window(struct vas_window *txwin,
+u64 *domain, u8 wintype)
+{
+   int rc;
+
+   rc = plpar_vas_allocate_window(txwin, domain, wintype, DEF_WIN_CREDS);
+   if (rc)
+   return rc;
+
+   txwin->wcreds_max = DEF_WIN_CREDS;
+
+   return 0;
+}
+
+static struct vas_window *vas_allocate_window(struct vas_tx_win_open_attr 
*uattr,
+ enum vas_cop_type cop_type)
+{
+   long domain[PLPAR_HCALL9_BUFSIZE] = {VAS_DEFAULT_DOMAIN_ID};
+   struct vas_ct_capabs *ct_capab;
+   struct vas_capabs *capabs;
+   struct vas_window *txwin;
+   int rc;
+
+   txwin = kzalloc(sizeof(*txwin), GFP_KERNEL);
+   if (!txwin)
+   return ERR_PTR(-ENOMEM);
+
+   /*
+* A VAS window can have many credits which means that many
+* requests can be issued simultaneously. But phyp restricts
+* one credit per window.
+* phyp introduces 2 different types of credits:
+* Default credit type (Uses normal priority FIFO):
+*  A limited number of credits are assigned to partitions
+*  based on processor entitlement. But these credits may be
+*  over-committed on a system depending on whether the CPUs
+*  are in shared or dedicated modes - that is, more requests
+*  may be issued across the system than NX can service at
+*  once which can result in paste command failure (RMA_busy).
+*  Then the process has to resend requests or fall back to
+*  SW compression.
+* Quality of Service (QoS) credit type (Uses high priority FIFO):
+*  To avoid NX HW contention, the system admins can assign
+*  QoS credits for each LPAR so that this partition is
+*  guaranteed access to NX resources. These credits are
+*  assigned to partitions via the HMC.
+*  Refer PAPR for more information.
+*
+* Allocate window with QoS credits if user requested. Otherwise
+* default credits are used.
+*/
+   if (uattr->flags & VAS_WIN_QOS_CREDITS)
+   capabs = &vcapabs[VAS_GZIP_QOS_

[V2 PATCH 11/16] powerpc/pseries/vas: Setup IRQ and fault handling

2021-04-13 Thread Haren Myneni


When NX sees a fault on the user space buffer, it generates a fault
interrupt and pHyp forwards that interrupt to the OS. The kernel then
makes the H_GET_NX_FAULT HCALL to retrieve the fault CRB information.

This patch adds changes to set up an IRQ per window and to handle the
fault by updating the CSB.

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/pseries/vas.c | 111 ++-
 1 file changed, 110 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/vas.c 
b/arch/powerpc/platforms/pseries/vas.c
index 0ade0d6d728f..2106eca0862a 100644
--- a/arch/powerpc/platforms/pseries/vas.c
+++ b/arch/powerpc/platforms/pseries/vas.c
@@ -224,6 +224,62 @@ int plpar_vas_query_capabilities(const u64 hcall, u8 
query_type,
 }
 EXPORT_SYMBOL_GPL(plpar_vas_query_capabilities);
 
+/*
+ * HCALL to get fault CRB from pHyp.
+ */
+static int plpar_get_nx_fault(u32 winid, u64 buffer)
+{
+   int64_t rc;
+
+   rc = plpar_hcall_norets(H_GET_NX_FAULT, winid, buffer);
+
+   switch (rc) {
+   case H_SUCCESS:
+   return 0;
+   case H_PARAMETER:
+   pr_err("HCALL(%x): Invalid window ID %u\n", H_GET_NX_FAULT,
+  winid);
+   return -EINVAL;
+   case H_STATE:
+   pr_err("HCALL(%x): No outstanding faults for window ID %u\n",
+  H_GET_NX_FAULT, winid);
+   return -EINVAL;
+   case H_PRIVILEGE:
+   pr_err("HCALL(%x): Window(%u): Invalid fault buffer 0x%llx\n",
+  H_GET_NX_FAULT, winid, buffer);
+   return -EACCES;
+   default:
+   pr_err("HCALL(%x): Unexpected error %lld for window(%u)\n",
+  H_GET_NX_FAULT, rc, winid);
+   return -EIO;
+   }
+}
+
+/*
+ * Handle the fault interrupt.
+ * When the fault interrupt is received for each window, query pHyp to get
+ * the fault CRB on the specific fault. Then process the CRB by updating
+ * CSB or send signal if the user space CSB is invalid.
+ * Note: pHyp forwards an interrupt for each fault request. So one fault
+ * CRB to process for each H_GET_NX_FAULT HCALL.
+ */
+irqreturn_t pseries_vas_fault_thread_fn(int irq, void *data)
+{
+   struct vas_window *txwin = data;
+   struct coprocessor_request_block crb;
+   struct vas_win_task *tsk;
+   int rc;
+
+   rc = plpar_get_nx_fault(txwin->winid, (u64)virt_to_phys(&crb));
+   if (!rc) {
+   tsk = &txwin->task;
+   vas_dump_crb(&crb);
+   vas_update_csb(&crb, tsk);
+   }
+
+   return IRQ_HANDLED;
+}
+
 /*
  * Allocate window and setup IRQ mapping.
  */
@@ -235,10 +291,51 @@ static int allocate_setup_window(struct vas_window *txwin,
rc = plpar_vas_allocate_window(txwin, domain, wintype, DEF_WIN_CREDS);
if (rc)
return rc;
+   /*
+* On PowerVM, pHyp sets up and forwards the fault interrupt per
+* window. So the IRQ setup and fault handling are done for
+* each open window separately.
+*/
+   txwin->lpar.fault_virq = irq_create_mapping(NULL,
+   txwin->lpar.fault_irq);
+   if (!txwin->lpar.fault_virq) {
+   pr_err("Failed irq mapping %d\n", txwin->lpar.fault_irq);
+   rc = -EINVAL;
+   goto out_win;
+   }
+
+   txwin->lpar.name = kasprintf(GFP_KERNEL, "vas-win-%d", txwin->winid);
+   if (!txwin->lpar.name) {
+   rc = -ENOMEM;
+   goto out_irq;
+   }
+
+   rc = request_threaded_irq(txwin->lpar.fault_virq, NULL,
+ pseries_vas_fault_thread_fn, IRQF_ONESHOT,
+ txwin->lpar.name, txwin);
+   if (rc) {
+   pr_err("VAS-Window[%d]: Request IRQ(%u) failed with %d\n",
+  txwin->winid, txwin->lpar.fault_virq, rc);
+   goto out_free;
+   }
 
txwin->wcreds_max = DEF_WIN_CREDS;
 
return 0;
+out_free:
+   kfree(txwin->lpar.name);
+out_irq:
+   irq_dispose_mapping(txwin->lpar.fault_virq);
+out_win:
+   plpar_vas_deallocate_window(txwin->winid);
+   return rc;
+}
+
+static inline void free_irq_setup(struct vas_window *txwin)
+{
+   free_irq(txwin->lpar.fault_virq, txwin);
+   irq_dispose_mapping(txwin->lpar.fault_virq);
+   kfree(txwin->lpar.name);
 }
 
 static struct vas_window *vas_allocate_window(struct vas_tx_win_open_attr 
*uattr,
@@ -346,6 +443,11 @@ static struct vas_window *vas_allocate_window(struct 
vas_tx_win_open_attr *uattr
return txwin;
 
 out_free:
+   /*
+* Window is not operational. Free IRQ before closing
+* window so that we do not have to hold the mutex.
+*/
+   free_irq_setup(txwin);
plpar_vas_deallocate_window(txwin->winid);
 out:
atomic_dec(&ct_capab->used_lpar_creds);
@@ -364,9 +466,16 @@ static int deallocate_free_window(s

[V2 PATCH 12/16] powerpc/pseries/vas: sysfs interface to export capabilities

2021-04-13 Thread Haren Myneni


pHyp provides GZIP default and GZIP QoS capabilities, which give
the total number of credits available in the LPAR. This patch
creates sysfs entries and exports the LPAR credits, the currently used
and the available credits for each feature.

/sys/kernel/vas/VasCaps/VDefGzip: (default GZIP capabilities)
avail_lpar_creds /* Available credits to use */
target_lpar_creds /* Total credits available, which can be
                     changed with a DLPAR operation */
used_lpar_creds  /* Used credits */

/sys/kernel/vas/VasCaps/VQosGzip (QoS GZIP capabilities)
avail_lpar_creds
target_lpar_creds
used_lpar_creds

Signed-off-by: Haren Myneni 
---
 arch/powerpc/platforms/pseries/Makefile|   2 +-
 arch/powerpc/platforms/pseries/vas-sysfs.c | 173 +
 arch/powerpc/platforms/pseries/vas.c   |   6 +
 arch/powerpc/platforms/pseries/vas.h   |   2 +
 4 files changed, 182 insertions(+), 1 deletion(-)
 create mode 100644 arch/powerpc/platforms/pseries/vas-sysfs.c

diff --git a/arch/powerpc/platforms/pseries/Makefile 
b/arch/powerpc/platforms/pseries/Makefile
index 4cda0ef87be0..e24093bebc0b 100644
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -30,4 +30,4 @@ obj-$(CONFIG_PPC_SVM) += svm.o
 obj-$(CONFIG_FA_DUMP)  += rtas-fadump.o
 
 obj-$(CONFIG_SUSPEND)  += suspend.o
-obj-$(CONFIG_PPC_VAS)  += vas.o
+obj-$(CONFIG_PPC_VAS)  += vas.o vas-sysfs.o
diff --git a/arch/powerpc/platforms/pseries/vas-sysfs.c 
b/arch/powerpc/platforms/pseries/vas-sysfs.c
new file mode 100644
index ..bf12f38cb9f9
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/vas-sysfs.c
@@ -0,0 +1,173 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright 2016-17 IBM Corp.
+ */
+
+#define pr_fmt(fmt) "vas: " fmt
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vas.h"
+
+#ifdef CONFIG_SYSFS
+static struct kobject *pseries_vas_kobj;
+static struct kobject *vas_capabs_kobj;
+
+struct vas_capabs_entry {
+   struct kobject kobj;
+   struct vas_ct_capabs *capabs;
+};
+
+#define to_capabs_entry(entry) container_of(entry, struct vas_capabs_entry, 
kobj)
+
+static ssize_t avail_lpar_creds_show(struct vas_ct_capabs *capabs, char *buf)
+{
+   int avail_creds = atomic_read(&capabs->target_lpar_creds) -
+   atomic_read(&capabs->used_lpar_creds);
+   return sprintf(buf, "%d\n", avail_creds);
+}
+
+#define sysfs_capbs_entry_read(_name)  \
+static ssize_t _name##_show(struct vas_ct_capabs *capabs, char *buf)   \
+{  \
+   return sprintf(buf, "%d\n", atomic_read(&capabs->_name));   \
+}
+
+struct vas_sysfs_entry {
+   struct attribute attr;
+   ssize_t (*show)(struct vas_ct_capabs *, char *);
+   ssize_t (*store)(struct vas_ct_capabs *, const char *, size_t);
+};
+
+#define VAS_ATTR_RO(_name) \
+   sysfs_capbs_entry_read(_name);  \
+   static struct vas_sysfs_entry _name##_attribute = __ATTR(_name, \
+   0444, _name##_show, NULL);
+
+VAS_ATTR_RO(target_lpar_creds);
+VAS_ATTR_RO(used_lpar_creds);
+
+static struct vas_sysfs_entry avail_lpar_creds_attribute =
+   __ATTR(avail_lpar_creds, 0444, avail_lpar_creds_show, NULL);
+
+static struct attribute *vas_capab_attrs[] = {
+   &target_lpar_creds_attribute.attr,
+   &used_lpar_creds_attribute.attr,
+   &avail_lpar_creds_attribute.attr,
+   NULL,
+};
+
+static ssize_t vas_type_show(struct kobject *kobj, struct attribute *attr,
+char *buf)
+{
+   struct vas_capabs_entry *centry;
+   struct vas_ct_capabs *capabs;
+   struct vas_sysfs_entry *entry;
+
+   centry = to_capabs_entry(kobj);
+   capabs = centry->capabs;
+   entry = container_of(attr, struct vas_sysfs_entry, attr);
+
+   if (!entry->show)
+   return -EIO;
+
+   return entry->show(capabs, buf);
+}
+
+static ssize_t vas_type_store(struct kobject *kobj, struct attribute *attr,
+ const char *buf, size_t count)
+{
+   struct vas_capabs_entry *centry;
+   struct vas_ct_capabs *capabs;
+   struct vas_sysfs_entry *entry;
+
+   centry = to_capabs_entry(kobj);
+   capabs = centry->capabs;
+   entry = container_of(attr, struct vas_sysfs_entry, attr);
+   if (!entry->store)
+   return -EIO;
+
+   return entry->store(capabs, buf, count);
+}
+
+static void vas_type_release(struct kobject *kobj)
+{
+   struct vas_capabs_entry *centry = to_capabs_entry(kobj);
+   kfree(centry);
+}
+
+static const struct sysfs_ops vas_sysfs_ops = {
+   .show   =   vas_type_show,
+   .store  =   vas_type_store,
+};
+
+static struct kobj_type vas_attr_type = {
+   .release=   vas_type_release,
+

[V2 PATCH 13/16] crypto/nx: Rename nx-842-pseries file name to nx-common-pseries

2021-04-13 Thread Haren Myneni


Rename nx-842-pseries.c to nx-common-pseries.c to add code for the new
GZIP compression type. The actual functionality is not changed in
this patch.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/Makefile  | 2 +-
 drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} | 0
 2 files changed, 1 insertion(+), 1 deletion(-)
 rename drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} (100%)

diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
index bc89a20e5d9d..d00181a26dd6 100644
--- a/drivers/crypto/nx/Makefile
+++ b/drivers/crypto/nx/Makefile
@@ -14,5 +14,5 @@ nx-crypto-objs := nx.o \
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o 
nx-compress.o
 obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o 
nx-compress.o
 nx-compress-objs := nx-842.o
-nx-compress-pseries-objs := nx-842-pseries.o
+nx-compress-pseries-objs := nx-common-pseries.o
 nx-compress-powernv-objs := nx-common-powernv.o
diff --git a/drivers/crypto/nx/nx-842-pseries.c 
b/drivers/crypto/nx/nx-common-pseries.c
similarity index 100%
rename from drivers/crypto/nx/nx-842-pseries.c
rename to drivers/crypto/nx/nx-common-pseries.c
-- 
2.18.2




[V2 PATCH 14/16] crypto/nx: Register and unregister VAS interface

2021-04-13 Thread Haren Myneni


Changes to create the /dev/crypto/nx-gzip interface on VAS register
and to remove this interface on VAS unregister.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/Kconfig | 1 +
 drivers/crypto/nx/nx-common-pseries.c | 9 +
 2 files changed, 10 insertions(+)

diff --git a/drivers/crypto/nx/Kconfig b/drivers/crypto/nx/Kconfig
index 23e3d0160e67..2a35e0e785bd 100644
--- a/drivers/crypto/nx/Kconfig
+++ b/drivers/crypto/nx/Kconfig
@@ -29,6 +29,7 @@ if CRYPTO_DEV_NX_COMPRESS
 config CRYPTO_DEV_NX_COMPRESS_PSERIES
tristate "Compression acceleration support on pSeries platform"
depends on PPC_PSERIES && IBMVIO
+   depends on PPC_VAS
default y
help
  Support for PowerPC Nest (NX) compression acceleration. This
diff --git a/drivers/crypto/nx/nx-common-pseries.c 
b/drivers/crypto/nx/nx-common-pseries.c
index cc8dd3072b8b..9a40fca8a9e6 100644
--- a/drivers/crypto/nx/nx-common-pseries.c
+++ b/drivers/crypto/nx/nx-common-pseries.c
@@ -9,6 +9,7 @@
  */
 
 #include 
+#include 
 
 #include "nx-842.h"
 #include "nx_csbcpb.h" /* struct nx_csbcpb */
@@ -1101,6 +1102,12 @@ static int __init nx842_pseries_init(void)
return ret;
}
 
+   ret = vas_register_api_pseries(THIS_MODULE, VAS_COP_TYPE_GZIP,
+  "nx-gzip");
+
+   if (ret)
+   pr_err("NX-GZIP is not supported. Returned=%d\n", ret);
+
return 0;
 }
 
@@ -,6 +1118,8 @@ static void __exit nx842_pseries_exit(void)
struct nx842_devdata *old_devdata;
unsigned long flags;
 
+   vas_unregister_api_pseries();
+
crypto_unregister_alg(&nx842_pseries_alg);
 
spin_lock_irqsave(&devdata_mutex, flags);
-- 
2.18.2




[V2 PATCH 15/16] crypto/nx: Get NX capabilities for GZIP coprocessor type

2021-04-13 Thread Haren Myneni


pHyp provides NX capabilities, which give the recommended minimum
compression / decompression length and the maximum request buffer size
in bytes.

Changes to get the overall NX capabilities, which point to the specific
features pHyp supports, and then retrieve the NX-GZIP specific capabilities.

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-common-pseries.c | 83 +++
 1 file changed, 83 insertions(+)

diff --git a/drivers/crypto/nx/nx-common-pseries.c 
b/drivers/crypto/nx/nx-common-pseries.c
index 9a40fca8a9e6..49224870d05e 100644
--- a/drivers/crypto/nx/nx-common-pseries.c
+++ b/drivers/crypto/nx/nx-common-pseries.c
@@ -9,6 +9,7 @@
  */
 
 #include 
+#include 
 #include 
 
 #include "nx-842.h"
@@ -20,6 +21,24 @@ MODULE_DESCRIPTION("842 H/W Compression driver for IBM Power 
processors");
 MODULE_ALIAS_CRYPTO("842");
 MODULE_ALIAS_CRYPTO("842-nx");
 
+struct nx_ct_capabs_be {
+   __be64  descriptor;
+   __be64  req_max_processed_len;  /* Max bytes in one GZIP request */
+   __be64  min_compress_len;   /* Min compression size in bytes */
+   __be64  min_decompress_len; /* Min decompression size in bytes */
+} __packed __aligned(0x1000);
+
+struct nx_ct_capabs {
+   charname[VAS_DESCR_LEN + 1];
+   u64 descriptor;
+   u64 req_max_processed_len;  /* Max bytes in one GZIP request */
+   u64 min_compress_len;   /* Min compression in bytes */
+   u64 min_decompress_len; /* Min decompression in bytes */
+};
+
+u64 capab_feat = 0;
+struct nx_ct_capabs nx_ct_capab;
+
 static struct nx842_constraints nx842_pseries_constraints = {
.alignment =DDE_BUFFER_ALIGN,
.multiple = DDE_BUFFER_LAST_MULT,
@@ -1066,6 +1085,66 @@ static void nx842_remove(struct vio_dev *viodev)
kfree(old_devdata);
 }
 
+/*
+ * Get NX capabilities from pHyp.
+ * Only NXGZIP capabilities are available right now and these values
+ * are available through sysfs.
+ */
+static void __init nxct_get_capabilities(void)
+{
+   struct vas_all_capabs_be *capabs_be;
+   struct nx_ct_capabs_be *nxc_be;
+   int rc;
+
+   capabs_be = kmalloc(sizeof(*capabs_be), GFP_KERNEL);
+   if (!capabs_be)
+   return;
+   /*
+* Get NX overall capabilities with feature type=0
+*/
+   rc = plpar_vas_query_capabilities(H_QUERY_NX_CAPABILITIES, 0,
+ (u64)virt_to_phys(capabs_be));
+   if (rc)
+   goto out;
+
+   capab_feat = be64_to_cpu(capabs_be->feat_type);
+   /*
+* NX-GZIP feature available
+*/
+   if (capab_feat & VAS_NX_GZIP_FEAT_BIT) {
+   nxc_be = kmalloc(sizeof(*nxc_be), GFP_KERNEL);
+   if (!nxc_be)
+   goto out;
+   /*
+* Get capabilities for NX-GZIP feature
+*/
+   rc = plpar_vas_query_capabilities(H_QUERY_NX_CAPABILITIES,
+ VAS_NX_GZIP_FEAT,
+ (u64)virt_to_phys(nxc_be));
+   } else {
+   pr_err("NX-GZIP feature is not available\n");
+   rc = -EINVAL;
+   }
+
+   if (!rc) {
+   snprintf(nx_ct_capab.name, VAS_DESCR_LEN + 1, "%.8s",
+(char *)&nxc_be->descriptor);
+   nx_ct_capab.descriptor = be64_to_cpu(nxc_be->descriptor);
+   nx_ct_capab.req_max_processed_len =
+   be64_to_cpu(nxc_be->req_max_processed_len);
+   nx_ct_capab.min_compress_len =
+   be64_to_cpu(nxc_be->min_compress_len);
+   nx_ct_capab.min_decompress_len =
+   be64_to_cpu(nxc_be->min_decompress_len);
+   } else {
+   capab_feat = 0;
+   }
+
+   kfree(nxc_be);
+out:
+   kfree(capabs_be);
+}
+
 static const struct vio_device_id nx842_vio_driver_ids[] = {
{"ibm,compression-v1", "ibm,compression"},
{"", ""},
@@ -1093,6 +1172,10 @@ static int __init nx842_pseries_init(void)
return -ENOMEM;
 
RCU_INIT_POINTER(devdata, new_devdata);
+   /*
+* Get NX capabilities from pHyp which is used for NX-GZIP.
+*/
+   nxct_get_capabilities();
 
ret = vio_register_driver(&nx842_vio_driver);
if (ret) {
-- 
2.18.2




[V2 PATCH 16/16] crypto/nx: Add sysfs interface to export NX capabilities

2021-04-13 Thread Haren Myneni


Changes to export the following NXGZIP capabilities through sysfs:

/sys/devices/vio/ibm,compression-v1/NxGzCaps:
min_compress_len      /* Recommended minimum compress length in bytes */
min_decompress_len    /* Recommended minimum decompress length in bytes */
req_max_processed_len /* Maximum number of bytes processed in one request */

Signed-off-by: Haren Myneni 
---
 drivers/crypto/nx/nx-common-pseries.c | 43 +++
 1 file changed, 43 insertions(+)

diff --git a/drivers/crypto/nx/nx-common-pseries.c 
b/drivers/crypto/nx/nx-common-pseries.c
index 49224870d05e..cc258d2c6475 100644
--- a/drivers/crypto/nx/nx-common-pseries.c
+++ b/drivers/crypto/nx/nx-common-pseries.c
@@ -962,6 +962,36 @@ static struct attribute_group nx842_attribute_group = {
.attrs = nx842_sysfs_entries,
 };
 
+#define nxct_capab_read(_name)					\
+static ssize_t nxct_##_name##_show(struct device *dev, \
+   struct device_attribute *attr, char *buf)   \
+{  \
+   return sprintf(buf, "%lld\n", nx_ct_capab._name);   \
+}
+
+#define NXCT_ATTR_RO(_name)\
+   nxct_capab_read(_name); \
+   static struct device_attribute dev_attr_##_name = __ATTR(_name, \
+   0444,   \
+   nxct_##_name##_show,\
+   NULL);
+
+NXCT_ATTR_RO(req_max_processed_len);
+NXCT_ATTR_RO(min_compress_len);
+NXCT_ATTR_RO(min_decompress_len);
+
+static struct attribute *nxct_capab_sysfs_entries[] = {
+   &dev_attr_req_max_processed_len.attr,
+   &dev_attr_min_compress_len.attr,
+   &dev_attr_min_decompress_len.attr,
+   NULL,
+};
+
+static struct attribute_group nxct_capab_attr_group = {
+   .name   =   nx_ct_capab.name,
+   .attrs  =   nxct_capab_sysfs_entries,
+};
+
 static struct nx842_driver nx842_pseries_driver = {
.name = KBUILD_MODNAME,
.owner =THIS_MODULE,
@@ -1051,6 +1081,16 @@ static int nx842_probe(struct vio_dev *viodev,
goto error;
}
 
+   if (capab_feat) {
+   if (sysfs_create_group(&viodev->dev.kobj,
+   &nxct_capab_attr_group)) {
+   dev_err(&viodev->dev,
+   "Could not create sysfs NX capability 
entries\n");
+   ret = -1;
+   goto error;
+   }
+   }
+
return 0;
 
 error_unlock:
@@ -1070,6 +1110,9 @@ static void nx842_remove(struct vio_dev *viodev)
pr_info("Removing IBM Power 842 compression device\n");
sysfs_remove_group(&viodev->dev.kobj, &nx842_attribute_group);
 
+   if (capab_feat)
+   sysfs_remove_group(&viodev->dev.kobj, &nxct_capab_attr_group);
+
crypto_unregister_alg(&nx842_pseries_alg);
 
spin_lock_irqsave(&devdata_mutex, flags);
-- 
2.18.2




Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-13 Thread Nathan Lynch
Srikar Dronamraju  writes:
> That leaves us with just 2 options for now.
> 1. Update numa_mem later and only update numa_node here.
> - Over a longer period of time, this would be more confusing since we
> may lose track of why we are splitting the set of numa_node and numa_mem.
>
> or
> 2. Use my earlier patch.
>
> My choice would be to go with my earlier patch.
> Please do let me know your thoughts on the same.

OK, agreed. Thanks.


Re: [PATCH 1/1] powerpc/smp: Set numa node before updating mask

2021-04-13 Thread Nathan Lynch
Srikar Dronamraju  writes:
>
> Some of the per-CPU masks use cpu_cpu_mask as a filter to limit the search
> for related CPUs. On a dlpar add of a CPU, update cpu_cpu_mask before
> updating the per-CPU masks. This will ensure the cpu_cpu_mask is updated
> correctly before it's used in setting the masks. Setting the numa_node will
> ensure that when cpu_cpu_mask() gets called, the correct node number is
> used. This code movement helped fix the above call trace.
>
> Reported-by: Geetika Moolchandani 
> Signed-off-by: Srikar Dronamraju 

Reviewed-by: Nathan Lynch 

Thanks.


Re: [PATCH 02/16] powerpc/vas: Make VAS API powerpc platform independent

2021-04-13 Thread Michael Ellerman
Christophe Leroy  writes:
> On 11/04/2021 at 02:31, Haren Myneni wrote:
>> 
>> Using the same /dev/crypto/nx-gzip interface for both powerNV and
>> pseries. So this patch moves the VAS API to a powerpc platform independent
>> directory. The actual functionality is not changed in this patch.
>
> This patch seems to do a lot more than moving the VAS API to an independent
> directory. A more detailed description would help.
>
> And it is not something defined in the powerpc architecture I think, so it 
> should
> remain in some common platform related directory.
>
>> 
>> Signed-off-by: Haren Myneni 
>> ---
>>   arch/powerpc/Kconfig  | 15 +
>>   arch/powerpc/include/asm/vas.h| 22 ++-
>>   arch/powerpc/kernel/Makefile  |  1 +
>>   .../{platforms/powernv => kernel}/vas-api.c   | 64 ++
>>   arch/powerpc/platforms/powernv/Kconfig| 14 
>>   arch/powerpc/platforms/powernv/Makefile   |  2 +-
>>   arch/powerpc/platforms/powernv/vas-window.c   | 66 +++
>>   7 files changed, 140 insertions(+), 44 deletions(-)
>>   rename arch/powerpc/{platforms/powernv => kernel}/vas-api.c (83%)
>> 
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 386ae12d8523..7aa1fbf7c1dc 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -478,6 +478,21 @@ config PPC_UV
>>   
>>If unsure, say "N".
>>   
>> +config PPC_VAS
>> +bool "IBM Virtual Accelerator Switchboard (VAS)"
>> +depends on PPC_POWERNV && PPC_64K_PAGES
>> +default y
>> +help
>> +  This enables support for IBM Virtual Accelerator Switchboard (VAS).
>
> IIUC it is a functionality in a coprocessor of some IBM processors. Something
> similar in principle to the communication coprocessors we find in Freescale
> processors.

It's not a coprocessor, it's a way you talk to coprocessors.

> It is not a generic functionality that is part of the powerpc architecture;
> I don't think this belongs in arch/powerpc/Kconfig.

But you're right it's not part of the ISA.

> I think it should go in arch/powerpc/platform/Kconfig

The problem with that is it's shared between two existing platforms, ie.
powernv and pseries. We don't want to put it in one or the other.

In the past we have put code like that in arch/powerpc/sysdev, but I am
not a big fan of it, because it's just a bit of a dumping ground.

A while back I created arch/powerpc/platforms/4xx for 40x and 44x
related things, even though there's no actual 4xx platform. I don't
think that's caused any problems.

So I'm inclined to say we should make a arch/powerpc/platforms/book3s
and put VAS in there.

The naming is a bit fishy, because not all book3s CPUs do or will have
VAS. But we would expect any future CPU with VAS to be book3s.

In contrast if we named it platforms/ibm, we could potentially have a
future non-IBM CPU that contains VAS, which would then make the ibm name
confusing.

cheers


[PATCH] powerpc/xive: Use the "ibm,chip-id" property only under PowerNV

2021-04-13 Thread Cédric Le Goater
The 'chip_id' field of the XIVE CPU structure is used to choose a
target for a source located on the same chip. For that, the XIVE
driver queries the chip identifier from the "ibm,chip-id" property
and compares it to a 'src_chip' field identifying the chip of a
source. This information is only available on the PowerNV platform,
'src_chip' being assigned to XIVE_INVALID_CHIP_ID under pSeries.

The "ibm,chip-id" property is also not available on all platforms. It
was first introduced on PowerNV and later, under QEMU for pSeries/KVM.
However, the property is not part of PAPR and does not exist under
pSeries/PowerVM.

Assign 'chip_id' to XIVE_INVALID_CHIP_ID by default and let the
PowerNV platform override the value with the "ibm,chip-id" property.

Signed-off-by: Cédric Le Goater 
---
 arch/powerpc/sysdev/xive/xive-internal.h | 1 +
 arch/powerpc/sysdev/xive/common.c| 9 +++--
 arch/powerpc/sysdev/xive/native.c| 6 ++
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/sysdev/xive/xive-internal.h 
b/arch/powerpc/sysdev/xive/xive-internal.h
index b3a456fdd3a5..504e7edce358 100644
--- a/arch/powerpc/sysdev/xive/xive-internal.h
+++ b/arch/powerpc/sysdev/xive/xive-internal.h
@@ -44,6 +44,7 @@ struct xive_ops {
  u32 *sw_irq);
int (*setup_queue)(unsigned int cpu, struct xive_cpu *xc, u8 prio);
void(*cleanup_queue)(unsigned int cpu, struct xive_cpu *xc, u8 
prio);
+   void(*prepare_cpu)(unsigned int cpu, struct xive_cpu *xc);
void(*setup_cpu)(unsigned int cpu, struct xive_cpu *xc);
void(*teardown_cpu)(unsigned int cpu, struct xive_cpu *xc);
bool(*match)(struct device_node *np);
diff --git a/arch/powerpc/sysdev/xive/common.c 
b/arch/powerpc/sysdev/xive/common.c
index 587738ec4229..5acd76403ee7 100644
--- a/arch/powerpc/sysdev/xive/common.c
+++ b/arch/powerpc/sysdev/xive/common.c
@@ -1414,17 +1414,14 @@ static int xive_prepare_cpu(unsigned int cpu)
 
xc = per_cpu(xive_cpu, cpu);
if (!xc) {
-   struct device_node *np;
-
xc = kzalloc_node(sizeof(struct xive_cpu),
  GFP_KERNEL, cpu_to_node(cpu));
if (!xc)
return -ENOMEM;
-   np = of_get_cpu_node(cpu, NULL);
-   if (np)
-   xc->chip_id = of_get_ibm_chip_id(np);
-   of_node_put(np);
xc->hw_ipi = XIVE_BAD_IRQ;
+   xc->chip_id = XIVE_INVALID_CHIP_ID;
+   if (xive_ops->prepare_cpu)
+   xive_ops->prepare_cpu(cpu, xc);
 
per_cpu(xive_cpu, cpu) = xc;
}
diff --git a/arch/powerpc/sysdev/xive/native.c 
b/arch/powerpc/sysdev/xive/native.c
index 1bb84febbaee..4fcd2dd1de71 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -382,6 +382,11 @@ static void xive_native_update_pending(struct xive_cpu *xc)
}
 }
 
+static void xive_native_prepare_cpu(unsigned int cpu, struct xive_cpu *xc)
+{
+   xc->chip_id = cpu_to_chip_id(cpu);
+}
+
 static void xive_native_setup_cpu(unsigned int cpu, struct xive_cpu *xc)
 {
s64 rc;
@@ -464,6 +469,7 @@ static const struct xive_ops xive_native_ops = {
.match  = xive_native_match,
.shutdown   = xive_native_shutdown,
.update_pending = xive_native_update_pending,
+   .prepare_cpu= xive_native_prepare_cpu,
.setup_cpu  = xive_native_setup_cpu,
.teardown_cpu   = xive_native_teardown_cpu,
.sync_source= xive_native_sync_source,
-- 
2.26.3



[PATCH v1] KVM: PPC: Book3S HV P9: implement kvmppc_xive_pull_vcpu in C

2021-04-13 Thread Nicholas Piggin
This is more symmetric with kvmppc_xive_push_vcpu, and has the advantage
that it runs with the MMU on.

The extra test added to the asm will go away with a future change.

Reviewed-by: Cédric Le Goater 
Reviewed-by: Alexey Kardashevskiy 
Signed-off-by: Nicholas Piggin 
---
Another bit that came from the KVM C-ify series.

Thanks,
Nick

 arch/powerpc/include/asm/kvm_ppc.h  |  2 ++
 arch/powerpc/kvm/book3s_hv.c|  2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |  5 
 arch/powerpc/kvm/book3s_xive.c  | 31 +
 4 files changed, 40 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 9531b1c1b190..73b1ca5a6471 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -672,6 +672,7 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 
icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
   int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu);
 
 static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
 {
@@ -712,6 +713,7 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu 
*vcpu, u64 icpval) { retur
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 
irq,
  int level, bool line_status) { return 
-ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu) { }
 
 static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
{ return 0; }
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 4a532410e128..981bcaf787a8 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3570,6 +3570,8 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
 
trap = __kvmhv_vcpu_entry_p9(vcpu);
 
+   kvmppc_xive_pull_vcpu(vcpu);
+
/* Advance host PURR/SPURR by the amount used by guest */
purr = mfspr(SPRN_PURR);
spurr = mfspr(SPRN_SPURR);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 75405ef53238..c11597f815e4 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1442,6 +1442,11 @@ guest_exit_cont: /* r9 = vcpu, r12 = trap, r13 = 
paca */
bl  kvmhv_accumulate_time
 #endif
 #ifdef CONFIG_KVM_XICS
+   /* If we came in through the P9 short path, xive pull is done in C */
+   lwz r0, STACK_SLOT_SHORT_PATH(r1)
+   cmpwi   r0, 0
+   bne 1f
+
/* We are exiting, pull the VP from the XIVE */
lbz r0, VCPU_XIVE_PUSHED(r9)
cmpwi   cr0, r0, 0
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e7219b6f5f9a..741bf1f4387a 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -127,6 +127,37 @@ void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvmppc_xive_push_vcpu);
 
+/*
+ * Pull a vcpu's context from the XIVE on guest exit.
+ * This assumes we are in virtual mode (MMU on)
+ */
+void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu)
+{
+   void __iomem *tima = local_paca->kvm_hstate.xive_tima_virt;
+
+   if (!vcpu->arch.xive_pushed)
+   return;
+
+   /*
+* Should not have been pushed if there is no tima
+*/
+   if (WARN_ON(!tima))
+   return;
+
+   eieio();
+   /* First load to pull the context, we ignore the value */
+   __raw_readl(tima + TM_SPC_PULL_OS_CTX);
+   /* Second load to recover the context state (Words 0 and 1) */
+   vcpu->arch.xive_saved_state.w01 = __raw_readq(tima + TM_QW1_OS);
+
+   /* Fixup some of the state for the next load */
+   vcpu->arch.xive_saved_state.lsmfb = 0;
+   vcpu->arch.xive_saved_state.ack = 0xff;
+   vcpu->arch.xive_pushed = 0;
+   eieio();
+}
+EXPORT_SYMBOL_GPL(kvmppc_xive_pull_vcpu);
+
 /*
  * This is a simple trigger for a generic XIVE IRQ. This must
  * only be called for interrupts that support a trigger page
-- 
2.23.0



[PATCH v1 0/2] KVM: PPC: timer improvements

2021-04-13 Thread Nicholas Piggin
These are a small timer improvement and a buglet fix, taken from the
KVM C-ify series.

This is the last one for the upcoming merge window.

Thanks,
Nick

Nicholas Piggin (2):
  KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest
LPCR
  KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

 arch/powerpc/include/asm/time.h | 12 
 arch/powerpc/kernel/time.c  | 10 --
 arch/powerpc/kvm/book3s_hv.c| 34 +
 3 files changed, 34 insertions(+), 22 deletions(-)

-- 
2.23.0



[PATCH v1 1/2] KVM: PPC: Book3S HV P9: Move setting HDEC after switching to guest LPCR

2021-04-13 Thread Nicholas Piggin
LPCR[HDICE]=0 suppresses hypervisor decrementer exceptions on some
processors, so it must be enabled before HDEC is set.

Rather than setting it in the host LPCR and then setting HDEC, move the
HDEC update to after the guest MMU context (including LPCR) is loaded.
There shouldn't be much concern with delaying HDEC by some 10s or 100s
of nanoseconds by setting it a bit later.

Reviewed-by: Alexey Kardashevskiy 
Reviewed-by: Fabiano Rosas 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 981bcaf787a8..48df339affdf 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3502,20 +3502,9 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
host_dawrx1 = mfspr(SPRN_DAWRX1);
}
 
-   /*
-* P8 and P9 suppress the HDEC exception when LPCR[HDICE] = 0,
-* so set HDICE before writing HDEC.
-*/
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr | LPCR_HDICE);
-   isync();
-
hdec = time_limit - mftb();
-   if (hdec < 0) {
-   mtspr(SPRN_LPCR, vcpu->kvm->arch.host_lpcr);
-   isync();
+   if (hdec < 0)
return BOOK3S_INTERRUPT_HV_DECREMENTER;
-   }
-   mtspr(SPRN_HDEC, hdec);
 
if (vc->tb_offset) {
u64 new_tb = mftb() + vc->tb_offset;
@@ -3563,6 +3552,12 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
mtspr(SPRN_LPCR, lpcr);
isync();
 
+   /*
+* P9 suppresses the HDEC exception when LPCR[HDICE] = 0,
+* so set guest LPCR (with HDICE) before writing HDEC.
+*/
+   mtspr(SPRN_HDEC, hdec);
+
kvmppc_xive_push_vcpu(vcpu);
 
mtspr(SPRN_SRR0, vcpu->arch.shregs.srr0);
-- 
2.23.0



[PATCH v1 2/2] KVM: PPC: Book3S HV P9: Reduce irq_work vs guest decrementer races

2021-04-13 Thread Nicholas Piggin
irq_work's use of the DEC SPR is racy with the guest<->host switch and
guest entry, which flips the DEC interrupt to the guest and could lose a
host work interrupt.

This patch closes one race, and attempts to comment another class of
races.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/time.h | 12 
 arch/powerpc/kernel/time.c  | 10 --
 arch/powerpc/kvm/book3s_hv.c| 15 +++
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/time.h b/arch/powerpc/include/asm/time.h
index 8dd3cdb25338..8c2c3dd4ddba 100644
--- a/arch/powerpc/include/asm/time.h
+++ b/arch/powerpc/include/asm/time.h
@@ -97,6 +97,18 @@ extern void div128_by_32(u64 dividend_high, u64 dividend_low,
 extern void secondary_cpu_time_init(void);
 extern void __init time_init(void);
 
+#ifdef CONFIG_PPC64
+static inline unsigned long test_irq_work_pending(void)
+{
+   unsigned long x;
+
+   asm volatile("lbz %0,%1(13)"
+   : "=r" (x)
+   : "i" (offsetof(struct paca_struct, irq_work_pending)));
+   return x;
+}
+#endif
+
 DECLARE_PER_CPU(u64, decrementers_next_tb);
 
 /* Convert timebase ticks to nanoseconds */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index b67d93a609a2..da995c5fb97d 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -508,16 +508,6 @@ EXPORT_SYMBOL(profile_pc);
  * 64-bit uses a byte in the PACA, 32-bit uses a per-cpu variable...
  */
 #ifdef CONFIG_PPC64
-static inline unsigned long test_irq_work_pending(void)
-{
-   unsigned long x;
-
-   asm volatile("lbz %0,%1(13)"
-   : "=r" (x)
-   : "i" (offsetof(struct paca_struct, irq_work_pending)));
-   return x;
-}
-
 static inline void set_irq_work_pending_flag(void)
 {
asm volatile("stb %0,%1(13)" : :
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 48df339affdf..bdc24e9cb096 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3708,6 +3708,18 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
if (!(vcpu->arch.ctrl & 1))
mtspr(SPRN_CTRLT, mfspr(SPRN_CTRLF) & ~1);
 
+   /*
+* When setting DEC, we must always deal with irq_work_raise via NMI vs
+* setting DEC. The problem occurs right as we switch into guest mode
+* if a NMI hits and sets pending work and sets DEC, then that will
+* apply to the guest and not bring us back to the host.
+*
+* irq_work_raise could check a flag (or possibly LPCR[HDICE] for
+* example) and set HDEC to 1? That wouldn't solve the nested hv
+* case which needs to abort the hcall or zero the time limit.
+*
+* XXX: Another day's problem.
+*/
mtspr(SPRN_DEC, vcpu->arch.dec_expires - mftb());
 
if (kvmhv_on_pseries()) {
@@ -3822,6 +3834,9 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
vc->in_guest = 0;
 
mtspr(SPRN_DEC, local_paca->kvm_hstate.dec_expires - mftb());
+   /* We may have raced with new irq work */
+   if (test_irq_work_pending())
+   set_dec(1);
mtspr(SPRN_SPRG_VDSO_WRITE, local_paca->sprg_vdso);
 
kvmhv_load_host_pmu();
-- 
2.23.0



Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-04-13 Thread Vaibhav Jain
Hi Shiva,

Apologies for a late review but something just caught my eye while
working on a different patch.

Shivaprasad G Bhat  writes:

> Add support for ND_REGION_ASYNC capability if the device tree
> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>
> If the flush request fails, the hypervisor is expected to
> reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>
> This patch prevents mmap of namespaces with MAP_SYNC flag if the
> nvdimm requires an explicit flush[1].
>
> References:
> [1] 
> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>
> Signed-off-by: Shivaprasad G Bhat 
> ---
> v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
> Changes from v2:
>- Fixed the commit message.
>- Add dev_dbg before the H_SCM_FLUSH hcall
>
> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
> Changes from v1:
>- Hcall semantics finalized, all changes are to accommodate them.
>
>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>  arch/powerpc/include/asm/hvcall.h |3 +-
>  arch/powerpc/platforms/pseries/papr_scm.c |   40 
> +
>  3 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/powerpc/papr_hcalls.rst 
> b/Documentation/powerpc/papr_hcalls.rst
> index 48fcf1255a33..648f278eea8f 100644
> --- a/Documentation/powerpc/papr_hcalls.rst
> +++ b/Documentation/powerpc/papr_hcalls.rst
> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>  Given a DRC Index collect the performance statistics for NVDIMM and copy them
>  to the resultBuffer.
>  
> +**H_SCM_FLUSH**
> +
> +| Input: *drcIndex, continue-token*
> +| Out: *continue-token*
> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
> +
> +Given a DRC Index, flush the data to the backend NVDIMM device.
> +
> +The hcall returns H_BUSY when the flush takes a longer time and the hcall
> +needs to be issued multiple times in order to be completely serviced. The
> +*continue-token* from the output is to be passed in the argument list of
> +subsequent hcalls to the hypervisor until the hcall is completely serviced,
> +at which point H_SUCCESS or another error is returned by the hypervisor.
> +
Does the hcall semantic also include measures to trigger a barrier/fence
(like pm_wmb()) so that all the stores before the hcall are guaranteed
to have hit the pmem controller?

If not then the papr_scm_pmem_flush() introduced below should issue a
fence like pm_wmb() before issuing the flush hcall.

If yes then this behaviour should also be documented with the hcall
semantics above.
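
Something like the below at the top of papr_scm_pmem_flush() is what I
have in mind (just a sketch, using the barrier named above and assuming
no fence is implied by the hcall itself):

	/* Ensure all prior stores to the pmem region are globally
	 * visible before asking the hypervisor to flush.
	 */
	pm_wmb();
	rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);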

>  References
>  ==
>  .. [1] "Power Architecture Platform Reference"
> diff --git a/arch/powerpc/include/asm/hvcall.h 
> b/arch/powerpc/include/asm/hvcall.h
> index ed6086d57b22..9f7729a97ebd 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -315,7 +315,8 @@
>  #define H_SCM_HEALTH0x400
>  #define H_SCM_PERFORMANCE_STATS 0x418
>  #define H_RPT_INVALIDATE 0x448
> -#define MAX_HCALL_OPCODE H_RPT_INVALIDATE
> +#define H_SCM_FLUSH  0x44C
> +#define MAX_HCALL_OPCODE H_SCM_FLUSH
>  
>  /* Scope args for H_SCM_UNBIND_ALL */
>  #define H_UNBIND_SCOPE_ALL (0x1)
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 835163f54244..b7a47fcc5aa5 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -93,6 +93,7 @@ struct papr_scm_priv {
>   uint64_t block_size;
>   int metadata_size;
>   bool is_volatile;
> + bool hcall_flush_required;
>  
>   uint64_t bound_addr;
>  
> @@ -117,6 +118,39 @@ struct papr_scm_priv {
>   size_t stat_buffer_len;
>  };
>  
> +static int papr_scm_pmem_flush(struct nd_region *nd_region,
> +struct bio *bio __maybe_unused)
> +{
> + struct papr_scm_priv *p = nd_region_provider_data(nd_region);
> + unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
> + uint64_t token = 0;
> + int64_t rc;
> +
> + dev_dbg(&p->pdev->dev, "flush drc 0x%x", p->drc_index);
> +
> + do {
> + rc = plpar_hcall(H_SCM_FLUSH, ret_buf, p->drc_index, token);
> + token = ret_buf[0];
> +
> + /* Check if we are stalled for some time */
> + if (H_IS_LONG_BUSY(rc)) {
> + msleep(get_longbusy_msecs(rc));
> + rc = H_BUSY;
> + } else if (rc == H_BUSY) {
> + cond_resched();
> + }
> + } while (rc == H_BUSY);
> +
> + if (rc) {
> + dev_err(&p->pdev->dev, "flush error: %lld", rc);
> + rc = -EIO;
> + } else {
> + dev_dbg(&p->pdev->dev, "flush drc 0x%x complete", p->drc_index);
> + }
> +
> + return rc;
> +}
> +
>  static LIST_HEAD(papr_nd_regions);

Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy




On 12/04/2021 at 23:54, Segher Boessenkool wrote:

Hi!

On Thu, Apr 08, 2021 at 03:33:44PM +, Christophe Leroy wrote:

For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' when
all bits to be cleared are consecutive.


Also on 64-bits, as long as both the top and bottom bits are in the low
32-bit half (for 32 bit mode, it can wrap as well).


Yes. But here we are talking about clearing a few bits while all the other
ones must remain unchanged. An rlwinm on PPC64 will always clear the upper
part, which is unlikely to be what we want.





For the time being only
handle the single bit case, which we detect by checking whether the
mask is a power of two.


You could look at rs6000_is_valid_mask in GCC:
   

used by rs6000_is_valid_and_mask immediately after it.  You probably
want to allow only rlwinm in your case, and please note this checks if
something is a valid mask, not the inverse of a valid mask (as you
want here).


This check looks more complex than what I need. It is used for both rlw...
and rld..., and it calculates the operands. The only thing I need is to
validate the mask.

I found a way: by ANDing the mask with the complement of itself rotated
left by 1 bit, we identify the transitions from 0 to 1. If the result is a
power of 2, it means there's only one transition, so the mask is as expected.
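
To illustrate with 32-bit values:

	mask = 0x0ff0: mask & ~((mask << 1) | (mask >> 31))
	             = 0x0ff0 & ~0x1fe0 = 0x0010 -> power of 2: valid
	mask = 0x0f0f: mask & ~((mask << 1) | (mask >> 31))
	             = 0x0f0f & ~0x1e1e = 0x0101 -> two bits set: rejected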


So I did that in v2.




So yes this is pretty involved :-)

Your patch looks good btw.  But please use "n", not "i", as constraint?


Done.

Christophe


Re: [PATCH v1 2/2] powerpc/atomics: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy




On 13/04/2021 at 00:08, Segher Boessenkool wrote:

Hi!

On Thu, Apr 08, 2021 at 03:33:45PM +, Christophe Leroy wrote:

+#define ATOMIC_OP(op, asm_op, dot, sign)   \
  static __inline__ void atomic_##op(int a, atomic_t *v)
\
  { \
int t;  \
\
__asm__ __volatile__(   \
  "1:  lwarx   %0,0,%3 # atomic_" #op "\n"  \
-   #asm_op " %0,%2,%0\n" \
+   #asm_op "%I2" dot " %0,%0,%2\n" \
  "stwcx.  %0,0,%3 \n"\
  "bne-1b\n"  \
-   : "=&r" (t), "+m" (v->counter)   \
-   : "r" (a), "r" (&v->counter) \
+   : "=&b" (t), "+m" (v->counter)   \
+   : "r"#sign (a), "r" (&v->counter)\
: "cc");  \
  } \


You need "b" (instead of "r") only for "addi".  You can use "addic"
instead, which clobbers XER[CA], but *all* inline asm does, so that is
not a downside here (it is also not slower on any CPU that matters).


@@ -238,14 +238,14 @@ static __inline__ int atomic_fetch_add_unless(atomic_t 
*v, int a, int u)
  "1:  lwarx   %0,0,%1 # atomic_fetch_add_unless\n\
cmpw0,%0,%3 \n\
beq 2f \n\
-   add %0,%2,%0 \n"
+   add%I2  %0,%0,%2 \n"
  "stwcx.  %0,0,%1 \n\
bne-1b \n"
PPC_ATOMIC_EXIT_BARRIER
-" subf%0,%2,%0 \n\
+" sub%I2  %0,%0,%2 \n\
  2:"
-   : "=&r" (t)
-   : "r" (&v->counter), "r" (a), "r" (u)
+   : "=&b" (t)
+   : "r" (&v->counter), "rI" (a), "r" (u)
: "cc", "memory");


Same here.


Yes, I thought about addic; I didn't find a solution earlier because I
forgot the matching 'addc'.

Now, with the addc/addic couple, it works well.
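
For the record, the reason "b" was needed with addi (and not with addic):
with addi, the rA field being r0 means the literal value 0, while addic
always reads the register, e.g.:

	addi  r9,r0,1	/* decodes as "li r9,1", not r9 = r0 + 1 */
	addic r9,r0,1	/* really computes r9 = r0 + 1, and sets XER[CA] */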

Thanks



Nice patches!

Acked-by: Segher Boessenkool 



Christophe


[PATCH v2 1/3] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy
Today we get the following code generation for bitops like
set or clear bit:

c0009fe0:   39 40 08 00 li  r10,2048
c0009fe4:   7c e0 40 28 lwarx   r7,0,r8
c0009fe8:   7c e7 53 78 or  r7,r7,r10
c0009fec:   7c e0 41 2d stwcx.  r7,0,r8

c000d568:   39 00 18 00 li  r8,6144
c000d56c:   7c c0 38 28 lwarx   r6,0,r7
c000d570:   7c c6 40 78 andcr6,r6,r8
c000d574:   7c c0 39 2d stwcx.  r6,0,r7

Most set-bit masks are constant within the lower 16 bits, so the
operation can easily be replaced by its "immediate" version. Allow
GCC to choose between the normal or the immediate form.

For clear bits, on 32 bits 'rlwinm' can be used instead of 'andc' for
when all bits to be cleared are consecutive. For this we detect
the number of transitions from 0 to 1 in the mask. This is done
by ANDing the mask with its complement rotated left by 1 bit. If this
operation provides a number which is a power of 2, it means there
is only one transition from 0 to 1 in the number, so all 1 bits are
consecutive.
Can't use rol32() which is not defined yet, so do a
raw ((x << 1) | (x >> 31)). For the power of 2, can't use
is_power_of_2() for the same reason, but it can also be easily encoded
as  (mask & (mask - 1)) and even the 0 case which is not a power of
two is acceptable for us.

On 64 bits we don't have any equivalent single operation, we'd need
two 'rldicl' so it is not worth it.

With this patch we get:

c0009fe0:   7d 00 50 28 lwarx   r8,0,r10
c0009fe4:   61 08 08 00 ori r8,r8,2048
c0009fe8:   7d 00 51 2d stwcx.  r8,0,r10

c000d558:   7c e0 40 28 lwarx   r7,0,r8
c000d55c:   54 e7 05 64 rlwinm  r7,r7,0,21,18
c000d560:   7c e0 41 2d stwcx.  r7,0,r8

On pmac32_defconfig, it reduces the text by approx 10 kbytes.

Signed-off-by: Christophe Leroy 
---
v2:
- Use "n" instead of "i" as constraint for the rlwinm mask
- Improve mask verification to handle more than single bit masks
---
 arch/powerpc/include/asm/bitops.h | 85 ---
 1 file changed, 77 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/include/asm/bitops.h 
b/arch/powerpc/include/asm/bitops.h
index 299ab33505a6..baa7666c1094 100644
--- a/arch/powerpc/include/asm/bitops.h
+++ b/arch/powerpc/include/asm/bitops.h
@@ -71,19 +71,57 @@ static inline void fn(unsigned long mask,   \
__asm__ __volatile__ (  \
prefix  \
 "1:"   PPC_LLARX(%0,0,%3,0) "\n"   \
-   stringify_in_c(op) "%0,%0,%2\n" \
+   #op "%I2 %0,%0,%2\n"\
PPC_STLCX "%0,0,%3\n"   \
"bne- 1b\n" \
: "=&r" (old), "+m" (*p)\
-   : "r" (mask), "r" (p)   \
+   : "rK" (mask), "r" (p)  \
: "cc", "memory");  \
 }
 
 DEFINE_BITOP(set_bits, or, "")
-DEFINE_BITOP(clear_bits, andc, "")
-DEFINE_BITOP(clear_bits_unlock, andc, PPC_RELEASE_BARRIER)
 DEFINE_BITOP(change_bits, xor, "")
 
+static __always_inline bool is_rlwinm_mask_valid(unsigned long x)
+{
+   x = x & ~((x << 1) | (x >> 31));/* Flag transitions from 0 to 1 */
+
+   return !(x & (x - 1));  /* Is there only one transition */
+}
+
+#define DEFINE_CLROP(fn, prefix)   \
+static inline void fn(unsigned long mask, volatile unsigned long *_p)  \
+{  \
+   unsigned long old;  \
+   unsigned long *p = (unsigned long *)_p; \
+   \
+   if (IS_ENABLED(CONFIG_PPC32) && \
+   __builtin_constant_p(mask) && is_rlwinm_mask_valid(mask)) { \
+   asm volatile (  \
+   prefix  \
+   "1:""lwarx  %0,0,%3\n"  \
+   "rlwinm %0,%0,0,%2\n"   \
+   "stwcx. %0,0,%3\n"  \
+   "bne- 1b\n" \
+   : "=&r" (old), "+m" (*p)\
+   : "n" (~mask), "r" (p)  \
+   : "cc", "memory");  \
+   } else {\
+   asm volatile (  \
+   prefix  \
+   "1:"PPC_LLARX(%0,0,%3,0) "\n"  

[PATCH v2 2/3] powerpc/atomics: Use immediate operand when possible

2021-04-13 Thread Christophe Leroy
Today we get the following code generation for atomic operations:

c001bb2c:   39 20 00 01 li  r9,1
c001bb30:   7d 40 18 28 lwarx   r10,0,r3
c001bb34:   7d 09 50 50 subfr8,r9,r10
c001bb38:   7d 00 19 2d stwcx.  r8,0,r3

c001c7a8:   39 40 00 01 li  r10,1
c001c7ac:   7d 00 18 28 lwarx   r8,0,r3
c001c7b0:   7c ea 42 14 add r7,r10,r8
c001c7b4:   7c e0 19 2d stwcx.  r7,0,r3

By allowing GCC to choose between immediate or regular operation,
we get:

c001bb2c:   7d 20 18 28 lwarx   r9,0,r3
c001bb30:   39 49 ff ff addir10,r9,-1
c001bb34:   7d 40 19 2d stwcx.  r10,0,r3
--
c001c7a4:   7d 40 18 28 lwarx   r10,0,r3
c001c7a8:   39 0a 00 01 addir8,r10,1
c001c7ac:   7d 00 19 2d stwcx.  r8,0,r3

For "and", the dot form has to be used because "andi" doesn't exist.

For logical operations we use an unsigned 16-bit immediate.
For arithmetic operations we use a signed 16-bit immediate.
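
To illustrate (a sketch, not from the generated diff):

	ori   r9,r9,0x8000	/* logical: unsigned 16-bit immediate */
	addi  r9,r9,-1		/* arithmetic: signed 16-bit immediate */
	andi. r9,r9,0x00ff	/* "and" only exists in the record (dot) form */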

On pmac32_defconfig, it reduces the text by approx another 8 kbytes.

Signed-off-by: Christophe Leroy 
Acked-by: Segher Boessenkool 
---
v2: Use "addc/addic"
---
 arch/powerpc/include/asm/atomic.h | 56 +++
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index 61c6e8b200e8..eb1bdf14f67c 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -37,62 +37,62 @@ static __inline__ void atomic_set(atomic_t *v, int i)
__asm__ __volatile__("stw%U0%X0 %1,%0" : "=m"UPD_CONSTR(v->counter) : 
"r"(i));
 }
 
-#define ATOMIC_OP(op, asm_op)  \
+#define ATOMIC_OP(op, asm_op, suffix, sign, ...)   \
 static __inline__ void atomic_##op(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "\n" \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3 \n" \
 "  bne-1b\n"   \
: "=&r" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (&v->counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (&v->counter)   \
+   : "cc", ##__VA_ARGS__); \
 }  \
 
-#define ATOMIC_OP_RETURN_RELAXED(op, asm_op)   \
+#define ATOMIC_OP_RETURN_RELAXED(op, asm_op, suffix, sign, ...)
\
 static inline int atomic_##op##_return_relaxed(int a, atomic_t *v) \
 {  \
int t;  \
\
__asm__ __volatile__(   \
 "1:lwarx   %0,0,%3 # atomic_" #op "_return_relaxed\n"  \
-   #asm_op " %0,%2,%0\n"   \
+   #asm_op "%I2" suffix " %0,%0,%2\n"  \
 "  stwcx.  %0,0,%3\n"  \
 "  bne-1b\n"   \
: "=&r" (t), "+m" (v->counter)  \
-   : "r" (a), "r" (&v->counter)\
-   : "cc");\
+   : "r"#sign (a), "r" (&v->counter)   \
+   : "cc", ##__VA_ARGS__); \
\
return t;   \
 }
 
-#define ATOMIC_FETCH_OP_RELAXED(op, asm_op)\
+#define ATOMIC_FETCH_OP_RELAXED(op, asm_op, suffix, sign, ...) \
 static inline int atomic_fetch_##op##_relaxed(int a, atomic_t *v)  \
 {  \
int res, t; \
\
__a

[PATCH v2 3/3] powerpc/atomics: Remove atomic_inc()/atomic_dec() and friends

2021-04-13 Thread Christophe Leroy
Now that atomic_add() and atomic_sub() handle immediate operands,
atomic_inc() and atomic_dec() have no added value compared to the
generic fallback which calls atomic_add(1) and atomic_sub(1).

Also remove atomic_inc_not_zero(), which falls back to
atomic_add_unless(), which itself falls back to
atomic_fetch_add_unless(), which now handles immediate operands.
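
For reference, the generic fallbacks that take over have this shape
(simplified from include/linux/atomic-fallback.h):

static inline void atomic_inc(atomic_t *v)
{
	atomic_add(1, v);
}

static inline bool atomic_inc_not_zero(atomic_t *v)
{
	return atomic_add_unless(v, 1, 0);
}

With atomic_add() now able to emit "addi %0,%0,1", these compile to the
same loop as the removed hand-written versions, without even the XER
clobber that addic required.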

Signed-off-by: Christophe Leroy 
---
v2: New
---
 arch/powerpc/include/asm/atomic.h | 95 ---
 1 file changed, 95 deletions(-)

diff --git a/arch/powerpc/include/asm/atomic.h 
b/arch/powerpc/include/asm/atomic.h
index eb1bdf14f67c..00ba5d9e837b 100644
--- a/arch/powerpc/include/asm/atomic.h
+++ b/arch/powerpc/include/asm/atomic.h
@@ -118,71 +118,6 @@ ATOMIC_OPS(xor, xor, "", K)
 #undef ATOMIC_OP_RETURN_RELAXED
 #undef ATOMIC_OP
 
-static __inline__ void atomic_inc(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc\n\
-   addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2 \n\
-   bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-}
-#define atomic_inc atomic_inc
-
-static __inline__ int atomic_inc_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_inc_return_relaxed\n"
-"  addic   %0,%0,1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-static __inline__ void atomic_dec(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec\n\
-   addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n\
-   bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-}
-#define atomic_dec atomic_dec
-
-static __inline__ int atomic_dec_return_relaxed(atomic_t *v)
-{
-   int t;
-
-   __asm__ __volatile__(
-"1:lwarx   %0,0,%2 # atomic_dec_return_relaxed\n"
-"  addic   %0,%0,-1\n"
-"  stwcx.  %0,0,%2\n"
-"  bne-1b"
-   : "=&r" (t), "+m" (v->counter)
-   : "r" (&v->counter)
-   : "cc", "xer");
-
-   return t;
-}
-
-#define atomic_inc_return_relaxed atomic_inc_return_relaxed
-#define atomic_dec_return_relaxed atomic_dec_return_relaxed
-
 #define atomic_cmpxchg(v, o, n) (cmpxchg(&((v)->counter), (o), (n)))
 #define atomic_cmpxchg_relaxed(v, o, n) \
cmpxchg_relaxed(&((v)->counter), (o), (n))
@@ -252,36 +187,6 @@ static __inline__ int atomic_fetch_add_unless(atomic_t *v, 
int a, int u)
 }
 #define atomic_fetch_add_unless atomic_fetch_add_unless
 
-/**
- * atomic_inc_not_zero - increment unless the number is zero
- * @v: pointer of type atomic_t
- *
- * Atomically increments @v by 1, so long as @v is non-zero.
- * Returns non-zero if @v was non-zero, and zero otherwise.
- */
-static __inline__ int atomic_inc_not_zero(atomic_t *v)
-{
-   int t1, t2;
-
-   __asm__ __volatile__ (
-   PPC_ATOMIC_ENTRY_BARRIER
-"1:lwarx   %0,0,%2 # atomic_inc_not_zero\n\
-   cmpwi   0,%0,0\n\
-   beq-2f\n\
-   addic   %1,%0,1\n"
-"  stwcx.  %1,0,%2\n\
-   bne-1b\n"
-   PPC_ATOMIC_EXIT_BARRIER
-   "\n\
-2:"
-   : "=&r" (t1), "=&r" (t2)
-   : "r" (&v->counter)
-   : "cc", "xer", "memory");
-
-   return t1;
-}
-#define atomic_inc_not_zero(v) atomic_inc_not_zero((v))
-
 /*
  * Atomically test *v and decrement if it is greater than 0.
  * The function returns the old value of *v minus 1, even if
-- 
2.25.0



[PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32

2021-04-13 Thread Christophe Leroy
powerpc BUG_ON() and WARN_ON() are based on the twnei instruction.

For catching simple conditions like a variable having value 0, this
is efficient because it does the test and the trap at the same time.
But most conditions used with BUG_ON or WARN_ON are more complex and
force GCC to turn the condition into a 0 or 1 value in a register.
This will usually require 2 to 3 instructions.

The most efficient solution would be to use __builtin_trap() because
GCC is able to optimise the use of the different trap instructions
based on the requested condition, but this is complex if not
impossible for the following reasons:
- __builtin_trap() is a non-recoverable instruction, so it can't be
used for WARN_ON
- Knowing which line of code generated the trap would require the
analysis of DWARF information. This is not a feature we have today.
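
To illustrate the first point, a hedged sketch (not kernel code):
__builtin_trap() is a noreturn builtin, so GCC treats everything after
it as unreachable.

	if (condition)
		__builtin_trap();	/* optimal trap, but no way back */
	/* a WARN_ON() must be able to continue executing here,
	 * which __builtin_trap() cannot express */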

As mentioned in commit 8d4fbcfbe0a4 ("Fix WARN_ON() on bitfield ops")
the way WARN_ON() is implemented is suboptimal. That commit also
mentions an issue with 'long long' condition. It fixed it for
WARN_ON() but the same problem still exists today with BUG_ON() on
PPC32. It will be fixed by using the generic implementation.

By using the generic implementation, gcc will naturally generate a
branch to the unconditional trap generated by BUG().

As modern powerpc cores implement zero-cycle branches,
that's even more efficient.

And for functions that test the return value of WARN_ON(), that
test now doubles as the WARN_ON() condition itself.

On PPC64 we don't want it because we want to be able to use CFAR
register to track how we entered the code that trapped. The CFAR
register would be clobbered by the branch.

A simple test function:

unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}

Before the patch:

046c <test9w>:
 46c:   7c 89 00 34 cntlzw  r9,r4
 470:   55 29 d9 7e rlwinm  r9,r9,27,5,31
 474:   0f 09 00 00 twnei   r9,0
 478:   2c 04 00 00 cmpwi   r4,0
 47c:   41 82 00 0c beq     488 <test9w+0x1c>
 480:   7c 63 23 96 divwu   r3,r3,r4
 484:   4e 80 00 20 blr

 488:   38 60 00 00 li      r3,0
 48c:   4e 80 00 20 blr

After the patch:

0468 <test9w>:
 468:   2c 04 00 00 cmpwi   r4,0
 46c:   41 82 00 0c beq     478 <test9w+0x10>
 470:   7c 63 23 96 divwu   r3,r3,r4
 474:   4e 80 00 20 blr

 478:   0f e0 00 00 twui    r0,0
 47c:   38 60 00 00 li      r3,0
 480:   4e 80 00 20 blr

So we see that before the patch we need 3 instructions on the likely
path to handle the WARN_ON(). With the patch the trap goes on the
unlikely path.

See below the difference at the entry of system_call_exception, where
we have several BUG_ON(), although less impressive.

With the patch:

00000000 <system_call_exception>:
   0:   81 6a 00 84 lwz     r11,132(r10)
   4:   90 6a 00 88 stw     r3,136(r10)
   8:   71 60 00 02 andi.   r0,r11,2
   c:   41 82 00 70 beq     7c <system_call_exception+0x7c>
  10:   71 60 40 00 andi.   r0,r11,16384
  14:   41 82 00 6c beq     80 <system_call_exception+0x80>
  18:   71 6b 80 00 andi.   r11,r11,32768
  1c:   41 82 00 68 beq     84 <system_call_exception+0x84>
  20:   94 21 ff e0 stwu    r1,-32(r1)
  24:   93 e1 00 1c stw     r31,28(r1)
  28:   7d 8c 42 e6 mftb    r12
...
  7c:   0f e0 00 00 twui    r0,0
  80:   0f e0 00 00 twui    r0,0
  84:   0f e0 00 00 twui    r0,0

Without the patch:

00000000 <system_call_exception>:
   0:   94 21 ff e0 stwu    r1,-32(r1)
   4:   93 e1 00 1c stw     r31,28(r1)
   8:   90 6a 00 88 stw     r3,136(r10)
   c:   81 6a 00 84 lwz     r11,132(r10)
  10:   69 60 00 02 xori    r0,r11,2
  14:   54 00 ff fe rlwinm  r0,r0,31,31,31
  18:   0f 00 00 00 twnei   r0,0
  1c:   69 60 40 00 xori    r0,r11,16384
  20:   54 00 97 fe rlwinm  r0,r0,18,31,31
  24:   0f 00 00 00 twnei   r0,0
  28:   69 6b 80 00 xori    r11,r11,32768
  2c:   55 6b 8f fe rlwinm  r11,r11,17,31,31
  30:   0f 0b 00 00 twnei   r11,0
  34:   7d 8c 42 e6 mftb    r12

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/bug.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index d1635ffbb179..101dea4eec8d 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -68,7 +68,11 @@
BUG_ENTRY("twi 31, 0, 0", 0);   \
unreachable();  \
 } while (0)
+#define HAVE_ARCH_BUG
+
+#define __WARN_FLAGS(flags) BUG_ENTRY("twi 31, 0, 0", BUGFLAG_WARNING | 
(flag

[PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto

2021-04-13 Thread Christophe Leroy
Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
flexibility to GCC.

For that, add an entry to the exception table so that
program_check_exception() knows where to resume execution
after a WARNING.
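
As background, a minimal standalone sketch of the asm goto mechanism
itself (my own illustration, compilable with GCC on powerpc64): the asm
template may branch to a C label instead of producing an output, which
is the degree of freedom __WARN_FLAGS() gains below.

static inline int is_zero(unsigned long v)
{
	asm goto("cmpdi	0,%0,0\n\t"
		 "beq	%l[zero]"
		 : /* asm goto takes no outputs (before GCC 11) */
		 : "r" (v)
		 : "cr0"
		 : zero);
	return 0;
zero:
	return 1;
}

In the kernel case the branch is not in the asm itself: the trap stays
inline, and the exception table entry added by this patch is what sends
program_check_exception() to the label.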

Here are two examples. The first one is done on PPC32 (which
benefits from the previous patch), the second is on PPC64.

unsigned long test(struct pt_regs *regs)
{
int ret;

WARN_ON(regs->msr & MSR_PR);

return regs->gpr[3];
}

unsigned long test9w(unsigned long a, unsigned long b)
{
if (WARN_ON(!b))
return 0;
return a / b;
}

Before the patch:

03a8 <test>:
 3a8:   81 23 00 84 lwz     r9,132(r3)
 3ac:   71 29 40 00 andi.   r9,r9,16384
 3b0:   40 82 00 0c bne     3bc <test+0x14>
 3b4:   80 63 00 0c lwz     r3,12(r3)
 3b8:   4e 80 00 20 blr

 3bc:   0f e0 00 00 twui    r0,0
 3c0:   80 63 00 0c lwz     r3,12(r3)
 3c4:   4e 80 00 20 blr

0bf0 <.test9w>:
 bf0:   7c 89 00 74 cntlzd  r9,r4
 bf4:   79 29 d1 82 rldicl  r9,r9,58,6
 bf8:   0b 09 00 00 tdnei   r9,0
 bfc:   2c 24 00 00 cmpdi   r4,0
 c00:   41 82 00 0c beq c0c <.test9w+0x1c>
 c04:   7c 63 23 92 divdu   r3,r3,r4
 c08:   4e 80 00 20 blr

 c0c:   38 60 00 00 li  r3,0
 c10:   4e 80 00 20 blr

After the patch:

03a8 <test>:
 3a8:   81 23 00 84 lwz     r9,132(r3)
 3ac:   71 29 40 00 andi.   r9,r9,16384
 3b0:   40 82 00 0c bne     3bc <test+0x14>
 3b4:   80 63 00 0c lwz     r3,12(r3)
 3b8:   4e 80 00 20 blr

 3bc:   0f e0 00 00 twui    r0,0

0c50 <.test9w>:
 c50:   7c 89 00 74 cntlzd  r9,r4
 c54:   79 29 d1 82 rldicl  r9,r9,58,6
 c58:   0b 09 00 00 tdnei   r9,0
 c5c:   7c 63 23 92 divdu   r3,r3,r4
 c60:   4e 80 00 20 blr

 c70:   38 60 00 00 li  r3,0
 c74:   4e 80 00 20 blr

In the first example, we see GCC doesn't need to duplicate what
happens after the trap.

In the second example, we see that GCC doesn't need to emit a test
and a branch in the likely path in addition to the trap.

We've got some WARN_ON()s in the .softirqentry.text section, so that
section needs to be added to OTHER_TEXT_SECTIONS in modpost.c.

Signed-off-by: Christophe Leroy 
---
v2: Fix build failure when CONFIG_BUG is not selected.
---
 arch/powerpc/include/asm/book3s/64/kup.h |  2 +-
 arch/powerpc/include/asm/bug.h   | 54 
 arch/powerpc/include/asm/extable.h   | 14 ++
 arch/powerpc/include/asm/ppc_asm.h   | 11 +
 arch/powerpc/kernel/entry_64.S   |  2 +-
 arch/powerpc/kernel/exceptions-64e.S |  2 +-
 arch/powerpc/kernel/misc_32.S|  2 +-
 arch/powerpc/kernel/traps.c  |  9 +++-
 scripts/mod/modpost.c|  2 +-
 9 files changed, 72 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 9700da3a4093..a22839cba32e 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -90,7 +90,7 @@
/* Prevent access to userspace using any key values */
LOAD_REG_IMMEDIATE(\gpr2, AMR_KUAP_BLOCKED)
 999:   tdne\gpr1, \gpr2
-   EMIT_BUG_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
+   EMIT_WARN_ENTRY 999b, __FILE__, __LINE__, (BUGFLAG_WARNING | 
BUGFLAG_ONCE)
END_MMU_FTR_SECTION_NESTED_IFSET(MMU_FTR_BOOK3S_KUAP, 67)
 #endif
 .endm
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 101dea4eec8d..e22dc503fb2f 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -4,6 +4,7 @@
 #ifdef __KERNEL__
 
 #include 
+#include 
 
 #ifdef CONFIG_BUG
 
@@ -30,6 +31,11 @@
 .endm
 #endif /* verbose */
 
+.macro EMIT_WARN_ENTRY addr,file,line,flags
+   EX_TABLE(\addr,\addr+4)
+   EMIT_BUG_ENTRY \addr,\file,\line,\flags
+.endm
+
 #else /* !__ASSEMBLY__ */
 /* _EMIT_BUG_ENTRY expects args %0,%1,%2,%3 to be FILE, LINE, flags and
sizeof(struct bug_entry), respectively */
@@ -58,6 +64,16 @@
  "i" (sizeof(struct bug_entry)),   \
  ##__VA_ARGS__)
 
+#define WARN_ENTRY(insn, flags, label, ...)\
+   asm_volatile_goto(  \
+   "1: " insn "\n" \
+   EX_TABLE(1b, %l[label]) \
+   _EMIT_BUG_ENTRY \
+   : : "i" (__FILE__), "i" (__LINE__), \
+ "i" (flags),  \
+ "i" (sizeof(struct bug_entry)),   \
+   

[PATCH v2 2/4] powerpc: Make probe_kernel_read_inst() common to PPC32 and PPC64

2021-04-13 Thread Christophe Leroy
We have two independent versions of probe_kernel_read_inst(), one for
PPC32 and one for PPC64.

The PPC32 one is identical to the first part of the PPC64 version.
The remaining part of the PPC64 version is not relevant for PPC32, but
does not conflict with it, so we can easily have a common function with
the PPC64 part opted out via an IS_ENABLED(CONFIG_PPC64) check.

The only need is to add a version of ppc_inst_prefix() for PPC32.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  2 ++
 arch/powerpc/lib/inst.c | 17 +
 2 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 2902d4e6a363..a40c3913a4a3 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -102,6 +102,8 @@ static inline bool ppc_inst_equal(struct ppc_inst x, struct 
ppc_inst y)
 
 #define ppc_inst(x) ((struct ppc_inst){ .val = x })
 
+#define ppc_inst_prefix(x, y) ppc_inst(x)
+
 static inline bool ppc_inst_prefixed(struct ppc_inst x)
 {
return false;
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index c57b3548de37..0dff3ac2d45f 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -8,7 +8,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_PPC64
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -18,7 +17,7 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
err = copy_from_kernel_nofault(&val, src, sizeof(val));
if (err)
return err;
-   if (get_op(val) == OP_PREFIX) {
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
*inst = ppc_inst_prefix(val, suffix);
} else {
@@ -26,17 +25,3 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
}
return err;
 }
-#else /* !CONFIG_PPC64 */
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-#endif /* CONFIG_PPC64 */
-- 
2.25.0



[PATCH v2 1/4] powerpc: Remove probe_user_read_inst()

2021-04-13 Thread Christophe Leroy
Its name comes from former probe_user_read() function.
That function is now called copy_from_user_nofault().

probe_user_read_inst() uses copy_from_user_nofault() to read only
a few bytes. It is suboptimal.

It does the same as get_user_inst() but in addition disables
page faults.

But on the other hand, it is not used for the time being, so remove it
for now. If one day it is really needed, we can give it a new name more
in line with today's naming and implement it using get_user_inst().
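
For the record, a hedged sketch of what that future replacement could
look like (the name follows today's convention; the __get_user_instr()
plus pagefault_disable() pairing is my assumption):

int copy_from_user_nofault_inst(struct ppc_inst *inst,
				struct ppc_inst __user *nip)
{
	int err;

	pagefault_disable();		/* the "nofault" part */
	err = __get_user_instr(*inst, nip);
	pagefault_enable();

	return err;
}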

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h |  3 ---
 arch/powerpc/lib/inst.c | 31 ---
 2 files changed, 34 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index 19e18af2fac9..2902d4e6a363 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -175,9 +175,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip);
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src);
 
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
index 9cc17eb62462..c57b3548de37 100644
--- a/arch/powerpc/lib/inst.c
+++ b/arch/powerpc/lib/inst.c
@@ -9,24 +9,6 @@
 #include 
 
 #ifdef CONFIG_PPC64
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (err)
-   return err;
-   if (get_op(val) == OP_PREFIX) {
-   err = copy_from_user_nofault(&suffix, (void __user *)nip + 4, 
4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
@@ -45,19 +27,6 @@ int probe_kernel_read_inst(struct ppc_inst *inst,
return err;
 }
 #else /* !CONFIG_PPC64 */
-int probe_user_read_inst(struct ppc_inst *inst,
-struct ppc_inst __user *nip)
-{
-   unsigned int val;
-   int err;
-
-   err = copy_from_user_nofault(&val, nip, sizeof(val));
-   if (!err)
-   *inst = ppc_inst(val);
-
-   return err;
-}
-
 int probe_kernel_read_inst(struct ppc_inst *inst,
   struct ppc_inst *src)
 {
-- 
2.25.0



[PATCH v2 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-13 Thread Christophe Leroy
When probe_kernel_read_inst() was created, it was to mimic
probe_kernel_read() function.

Since then, probe_kernel_read() has been renamed
copy_from_kernel_nofault().

Rename probe_kernel_read_inst() into copy_from_kernel_nofault_inst().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/inst.h|  3 +--
 arch/powerpc/kernel/align.c|  2 +-
 arch/powerpc/kernel/trace/ftrace.c | 22 +++---
 arch/powerpc/lib/inst.c|  3 +--
 4 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/include/asm/inst.h b/arch/powerpc/include/asm/inst.h
index a40c3913a4a3..a8ab0715f50e 100644
--- a/arch/powerpc/include/asm/inst.h
+++ b/arch/powerpc/include/asm/inst.h
@@ -177,7 +177,6 @@ static inline char *__ppc_inst_as_str(char 
str[PPC_INST_STR_LEN], struct ppc_ins
__str;  \
 })
 
-int probe_kernel_read_inst(struct ppc_inst *inst,
-  struct ppc_inst *src);
+int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src);
 
 #endif /* _ASM_POWERPC_INST_H */
diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index a97d5f1a3905..df3b55fec27d 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -311,7 +311,7 @@ int fix_alignment(struct pt_regs *regs)
CHECK_FULL_REGS(regs);
 
if (is_kernel_addr(regs->nip))
-   r = probe_kernel_read_inst(&instr, (void *)regs->nip);
+   r = copy_from_kernel_nofault_inst(&instr, (void *)regs->nip);
else
r = __get_user_instr(instr, (void __user *)regs->nip);
 
diff --git a/arch/powerpc/kernel/trace/ftrace.c 
b/arch/powerpc/kernel/trace/ftrace.c
index 42761ebec9f7..9daa4eb812ce 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -68,7 +68,7 @@ ftrace_modify_code(unsigned long ip, struct ppc_inst old, 
struct ppc_inst new)
 */
 
/* read the text we want to modify */
-   if (probe_kernel_read_inst(&replaced, (void *)ip))
+   if (copy_from_kernel_nofault_inst(&replaced, (void *)ip))
return -EFAULT;
 
/* Make sure it is what we expect it to be */
@@ -130,7 +130,7 @@ __ftrace_make_nop(struct module *mod,
struct ppc_inst op, pop;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -164,7 +164,7 @@ __ftrace_make_nop(struct module *mod,
/* When using -mkernel_profile there is no load to jump over */
pop = ppc_inst(PPC_INST_NOP);
 
-   if (probe_kernel_read_inst(&op, (void *)(ip - 4))) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)(ip - 4))) {
pr_err("Fetching instruction at %lx failed.\n", ip - 4);
return -EFAULT;
}
@@ -197,7 +197,7 @@ __ftrace_make_nop(struct module *mod,
 * Check what is in the next instruction. We can see ld r2,40(r1), but
 * on first pass after boot we will see mflr r0.
 */
-   if (probe_kernel_read_inst(&op, (void *)(ip + 4))) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)(ip + 4))) {
pr_err("Fetching op failed.\n");
return -EFAULT;
}
@@ -349,7 +349,7 @@ static int setup_mcount_compiler_tramp(unsigned long tramp)
return -1;
 
/* New trampoline -- read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)tramp)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)tramp)) {
pr_debug("Fetching opcode failed.\n");
return -1;
}
@@ -399,7 +399,7 @@ static int __ftrace_make_nop_kernel(struct dyn_ftrace *rec, 
unsigned long addr)
struct ppc_inst op;
 
/* Read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip)) {
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip)) {
pr_err("Fetching opcode failed.\n");
return -EFAULT;
}
@@ -526,10 +526,10 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
struct module *mod = rec->arch.mod;
 
/* read where this goes */
-   if (probe_kernel_read_inst(op, ip))
+   if (copy_from_kernel_nofault_inst(op, ip))
return -EFAULT;
 
-   if (probe_kernel_read_inst(op + 1, ip + 4))
+   if (copy_from_kernel_nofault_inst(op + 1, ip + 4))
return -EFAULT;
 
if (!expected_nop_sequence(ip, op[0], op[1])) {
@@ -592,7 +592,7 @@ __ftrace_make_call(struct dyn_ftrace *rec, unsigned long 
addr)
unsigned long ip = rec->ip;
 
/* read where this goes */
-   if (probe_kernel_read_inst(&op, (void *)ip))
+   if (copy_from_kernel_nofault_inst(&op, (void *)ip))
return -EFAULT;
 
/* It should be pointing to a no

[PATCH v2 4/4] powerpc: Move copy_from_kernel_nofault_inst()

2021-04-13 Thread Christophe Leroy
When probe_kernel_read_inst() was created, there was no good place to
put it, so a file called lib/inst.c was dedicated for it.

Since then, probe_kernel_read_inst() has been renamed
copy_from_kernel_nofault_inst(). And arch/powerpc/mm/maccess.c didn't
exist at that time. Today, that file is related to
copy_from_kernel_nofault().

Move copy_from_kernel_nofault_inst() into arch/powerpc/mm/maccess.c.

Signed-off-by: Christophe Leroy 
---
v2: Remove inst.o from Makefile
---
 arch/powerpc/lib/Makefile |  2 +-
 arch/powerpc/lib/inst.c   | 26 --
 arch/powerpc/mm/maccess.c | 21 +
 3 files changed, 22 insertions(+), 27 deletions(-)
 delete mode 100644 arch/powerpc/lib/inst.c

diff --git a/arch/powerpc/lib/Makefile b/arch/powerpc/lib/Makefile
index d4efc182662a..f2c690ee75d1 100644
--- a/arch/powerpc/lib/Makefile
+++ b/arch/powerpc/lib/Makefile
@@ -16,7 +16,7 @@ CFLAGS_code-patching.o += -DDISABLE_BRANCH_PROFILING
 CFLAGS_feature-fixups.o += -DDISABLE_BRANCH_PROFILING
 endif
 
-obj-y += alloc.o code-patching.o feature-fixups.o pmem.o inst.o 
test_code-patching.o
+obj-y += alloc.o code-patching.o feature-fixups.o pmem.o test_code-patching.o
 
 ifndef CONFIG_KASAN
 obj-y  +=  string.o memcmp_$(BITS).o
diff --git a/arch/powerpc/lib/inst.c b/arch/powerpc/lib/inst.c
deleted file mode 100644
index ec7f6bae8b3c..
--- a/arch/powerpc/lib/inst.c
+++ /dev/null
@@ -1,26 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-or-later
-/*
- *  Copyright 2020, IBM Corporation.
- */
-
-#include 
-#include 
-#include 
-#include 
-
-int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src)
-{
-   unsigned int val, suffix;
-   int err;
-
-   err = copy_from_kernel_nofault(&val, src, sizeof(val));
-   if (err)
-   return err;
-   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
-   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
-   *inst = ppc_inst_prefix(val, suffix);
-   } else {
-   *inst = ppc_inst(val);
-   }
-   return err;
-}
diff --git a/arch/powerpc/mm/maccess.c b/arch/powerpc/mm/maccess.c
index fa9a7a718fc6..e75e74c52a8a 100644
--- a/arch/powerpc/mm/maccess.c
+++ b/arch/powerpc/mm/maccess.c
@@ -3,7 +3,28 @@
 #include 
 #include 
 
+#include 
+#include 
+#include 
+
 bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
 {
return is_kernel_addr((unsigned long)unsafe_src);
 }
+
+int copy_from_kernel_nofault_inst(struct ppc_inst *inst, struct ppc_inst *src)
+{
+   unsigned int val, suffix;
+   int err;
+
+   err = copy_from_kernel_nofault(&val, src, sizeof(val));
+   if (err)
+   return err;
+   if (IS_ENABLED(CONFIG_PPC64) && get_op(val) == OP_PREFIX) {
+   err = copy_from_kernel_nofault(&suffix, (void *)src + 4, 4);
+   *inst = ppc_inst_prefix(val, suffix);
+   } else {
+   *inst = ppc_inst(val);
+   }
+   return err;
+}
-- 
2.25.0



Re: [PATCH] powerpc/pseries: extract host bridge from pci_bus prior to bus removal

2021-04-13 Thread Tyrel Datwyler
On 2/11/21 10:24 AM, Tyrel Datwyler wrote:
> The pci_bus->bridge reference may no longer be valid after
> pci_bus_remove() resulting in passing a bad value to device_unregister()
> for the associated bridge device.
> 
> Store the host_bridge reference in a separate variable prior to
> pci_bus_remove().
> 
> Fixes: 7340056567e3 ("powerpc/pci: Reorder pci bus/bridge unregistration 
> during PHB removal")
> Signed-off-by: Tyrel Datwyler 

Ping?


Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Segher Boessenkool
On Tue, Apr 13, 2021 at 06:33:19PM +0200, Christophe Leroy wrote:
> Le 12/04/2021 à 23:54, Segher Boessenkool a écrit :
> >On Thu, Apr 08, 2021 at 03:33:44PM +, Christophe Leroy wrote:
> >>For clear bits, on 32 bits 'rlwinm' can be used instead or 'andc' for
> >>when all bits to be cleared are consecutive.
> >
> >Also on 64-bits, as long as both the top and bottom bits are in the low
> >32-bit half (for 32 bit mode, it can wrap as well).
> 
> Yes. But here we are talking about clearing a few bits, all other ones must 
> remain unchanged. An rlwinm on PPC64 will always clear the upper part, 
> which is unlikely what we want.

No, it does not.  It takes the low 32 bits of the source reg, duplicated
to the top half as well, then rotated, then ANDed with the mask (which
can wrap around).  This isn't very often very useful, but :-)

(One useful operation is splatting 32 bits to both halves of a 64-bit
register, which is just rlwinm d,s,0,1,0).

If you only look at the low 32 bits, it does exactly the same as on
32-bit implementations.

> >>For the time being only
> >>handle the single bit case, which we detect by checking whether the
> >>mask is a power of two.
> >
> >You could look at rs6000_is_valid_mask in GCC:
> >   
> > 
> >used by rs6000_is_valid_and_mask immediately after it.  You probably
> >want to allow only rlwinm in your case, and please note this checks if
> >something is a valid mask, not the inverse of a valid mask (as you
> >want here).
> 
> This check looks more complex than what I need. It is used for both rlw... 
> and rld..., and it calculates the operands.  The only thing I need is to 
> validate the mask.

It has to do exactly the same thing for rlwinm as for all 64-bit
variants (rldicl, rldicr, rldic).

One side effect of calculating the bit positions with exact_log2 is
that it returns negative if the argument is not a power of two.

Here is a simpler way, that handles all cases:  input in "u32 val":

if (!val)
return nonono;
if (val & 1)
val = ~val; // make the mask non-wrapping
val += val & -val;  // adding the low set bit should result in
// at most one bit set
if (!(val & (val - 1)))
return okidoki_all_good;
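
A compilable version of the above, with names of my choosing (note that
-val on an unsigned type is well defined, modulo 2^32):

#include <stdbool.h>
#include <stdint.h>

/* True when 'val' is a mask rlwinm can AND with: a single
 * contiguous run of set bits, possibly wrapping around bit 0. */
static bool is_valid_rlwinm_mask(uint32_t val)
{
	if (!val)
		return false;
	if (val & 1)
		val = ~val;		/* make the mask non-wrapping */
	val += val & -val;		/* the carry collapses the low run */
	return !(val & (val - 1));	/* zero or one bit may remain */
}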

> I found a way: by ANDing the mask with the complement of itself rotated 
> left by one bit, we identify the transitions from 0 to 1. If the result is a 
> power of 2, it means there's only one transition, so the mask is as expected.

That does not handle all cases (it misses all bits set at least).  Which
isn't all that interesting of course, but is a valid mask (but won't
clear any bits, so not too interesting for your specific case :-) )


Segher


Re: [PATCH net-next v4 0/2] of: net: support non-platform devices in of_get_mac_address()

2021-04-13 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (refs/heads/master):

On Mon, 12 Apr 2021 19:47:16 +0200 you wrote:
> of_get_mac_address() is commonly used to fetch the MAC address
> from the device tree. It also supports reading it from a NVMEM
> provider. But the latter is only possible for platform devices,
> because only platform devices are searched for a matching device
> node.
> 
> Add a second method to fetch the NVMEM cell by a device tree node
> instead of a "struct device".
> 
> [...]

Here is the summary with links:
  - [net-next,v4,1/2] of: net: pass the dst buffer to of_get_mac_address()
https://git.kernel.org/netdev/net-next/c/83216e3988cd
  - [net-next,v4,2/2] of: net: fix of_get_mac_addr_nvmem() for non-platform 
devices
https://git.kernel.org/netdev/net-next/c/f10843e04a07

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




Re: [PATCH v2 13/14] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-13 Thread Leonardo Bras
On Tue, 2021-04-13 at 18:24 +1000, Alexey Kardashevskiy wrote:
> 
> On 13/04/2021 17:58, Leonardo Bras wrote:
> > On Tue, 2021-04-13 at 17:41 +1000, Alexey Kardashevskiy wrote:
> > > 
> > > On 13/04/2021 17:33, Leonardo Bras wrote:
> > > > On Tue, 2021-04-13 at 17:18 +1000, Alexey Kardashevskiy wrote:
> > > > > 
> > > > > On 13/04/2021 15:49, Leonardo Bras wrote:
> > > > > > Thanks for the feedback!
> > > > > > 
> > > > > > On Tue, 2020-09-29 at 13:56 +1000, Alexey Kardashevskiy wrote:
> > > > > > > > -static bool find_existing_ddw(struct device_node *pdn, u64 
> > > > > > > > *dma_addr)
> > > > > > > > +static phys_addr_t ddw_memory_hotplug_max(void)
> > > > > > > 
> > > > > > > 
> > > > > > > Please, forward declaration or a separate patch; this creates
> > > > > > > unnecessary noise to the actual change.
> > > > > > > 
> > > > > > 
> > > > > > Sure, done!
> > > > > > 
> > > > > > > 
> > > > > > > > +   _iommu_table_setparms(tbl, 
> > > > > > > > pci->phb->bus->number, create.liobn, win_addr,
> > > > > > > > + 1UL << len, page_shift, 
> > > > > > > > 0, &iommu_table_lpar_multi_ops);
> > > > > > > > +   iommu_init_table(tbl, pci->phb->node, 0, 0);
> > > > > > > 
> > > > > > > 
> > > > > > > It is 0,0 only if win_addr>0 which is not the QEMU case.
> > > > > > > 
> > > > > > 
> > > > > > Oh, ok.
> > > > > > I previously though it was ok to use 0,0 here as any other usage in
> > > > > > this file was also 0,0.
> > > > > > 
> > > > > > What should I use to get the correct parameters? Use the previous 
> > > > > > tbl
> > > > > > it_reserved_start and tbl->it_reserved_end is enough?
> > > > > 
> > > > > depends on whether you carry reserved start/end even if they are 
> > > > > outside
> > > > > of the dma window.
> > > > > 
> > > > 
> > > > Oh, that makes sense.
> > > > On a previous patch (5/14 IIRC), I changed the behavior to only store
> > > > the valid range on tbl, but now I understand why it's important to
> > > > store the raw value.
> > > > 
> > > > Ok, I will change it back so the reserved range stays in tbl even if it
> > > > does not intersect with the DMA window. This way I can reuse the values
> > > > in case of indirect mapping with DDW.
> > > > 
> > > > Is that ok? Are the reserved values are supposed to stay the same after
> > > > changing from Default DMA window to DDW?
> > > 
> > > I added them to know what bits in it_map to ignore when checking if
> > > there is any active user of the table. If you have non zero reserved
> > > start/end but they do not affect it_map, then it is rather weird way to
> > > carry reserved start/end from DDW to no-DDW.
> > > 
> > 
> > Ok, agreed.
> > 
> > >   May be do not set these at
> > > all for DDW with window start at 1<<59 and when going back to no-DDW (or
> > > if DDW starts at 0) - just set them from MMIO32, just as they are
> > > initialized in the first place.
> > > 
> > 
> > If I get it correctly from pci_of_scan.c, MMIO32 = {0, 32MB}, is that
> > correct?
> 
> No, under QEMU it is 0x8000.-0x1..:
> 
> /proc/device-tree/pci@8002000/ranges
> 
> 7 cells for each resource, the second one is MMIO32 (the first is IO 
> ports, the last is 64bit MMIO).
> > 
> > So, if DDW starts at any value in this range (most probably at zero),
> > we should remove the rest, is that correct?
> > 
> > Could it always use iommu_init_table(..., 0, 32MB) here, so it always
> > reserve any part of the DMA window that's in this range? Ot there may
> > be other reserved values range?
> > 
> > > and when going back to no-DDW
> > 
> > After iommu_init_table() there should be no failure, so it looks like
> > there is no 'going back to no-DDW'. Am I missing something?
> 
> Well, a random driver could request 32bit DMA and if the new window is 
> 1:1, then it would break but this does not seem to happen and we do not 
> support it anyway so no loss here.
> 

So you would recommend reading "ranges" with of_get_property() and
using the second entry (cells 7-13) at this point, getting base & size
to make sure it does not map anything here? (This should have no effect
if the value does not intersect with the DMA window.)
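
For what it's worth, a hedged sketch of that reading (the node
variable, cell layout and length check are my assumptions, to be
verified against the PHB's actual "ranges" encoding):

	const __be32 *ranges;
	int len;
	u64 mmio32_base, mmio32_size;

	/* each entry: 3 PCI address cells + 2 parent address cells
	 * + 2 size cells = 7 cells; MMIO32 assumed to be entry 1 */
	ranges = of_get_property(pdn, "ranges", &len);
	if (ranges && len >= 2 * 7 * (int)sizeof(u32)) {
		const __be32 *r = ranges + 7;	/* second entry */

		mmio32_base = of_read_number(r + 3, 2);
		mmio32_size = of_read_number(r + 5, 2);
	}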

Thank you for reviewing!
Leonardo Bras



Re: [PATCH v1 1/2] powerpc/bitops: Use immediate operand when possible

2021-04-13 Thread Nicholas Piggin
Excerpts from Segher Boessenkool's message of April 14, 2021 7:58 am:
> On Tue, Apr 13, 2021 at 06:33:19PM +0200, Christophe Leroy wrote:
>> Le 12/04/2021 à 23:54, Segher Boessenkool a écrit :
>> >On Thu, Apr 08, 2021 at 03:33:44PM +, Christophe Leroy wrote:
>> >>For clear bits, on 32 bits 'rlwinm' can be used instead or 'andc' for
>> >>when all bits to be cleared are consecutive.
>> >
>> >Also on 64-bits, as long as both the top and bottom bits are in the low
>> >32-bit half (for 32 bit mode, it can wrap as well).
>> 
>> Yes. But here we are talking about clearing a few bits, all other ones must 
>> remain unchanged. An rlwinm on PPC64 will always clear the upper part, 
>> which is unlikely what we want.
> 
> No, it does not.  It takes the low 32 bits of the source reg, duplicated
> to the top half as well, then rotated, then ANDed with the mask (which
> can wrap around).  This isn't very often very useful, but :-)
> 
> (One useful operation is splatting 32 bits to both halves of a 64-bit
> register, which is just rlwinm d,s,0,1,0).
> 
> If you only look at the low 32 bits, it does exactly the same as on
> 32-bit implementations.
> 
>> >>For the time being only
>> >>handle the single bit case, which we detect by checking whether the
>> >>mask is a power of two.
>> >
>> >You could look at rs6000_is_valid_mask in GCC:
>> >   
>> > 
>> >used by rs6000_is_valid_and_mask immediately after it.  You probably
>> >want to allow only rlwinm in your case, and please note this checks if
>> >something is a valid mask, not the inverse of a valid mask (as you
>> >want here).
>> 
>> This check looks more complex than what I need. It is used for both rlw... 
>> and rld..., and it calculates the operands.  The only thing I need is to 
>> validate the mask.
> 
> It has to do exactly the same thing for rlwinm as for all 64-bit
> variants (rldicl, rldicr, rldic).
> 
> One side effect of calculating the bit positions with exact_log2 is that
> it returns negative if the argument is not a power of two.
> 
> Here is a simpler way, that handles all cases:  input in "u32 val":
> 
>   if (!val)
>   return nonono;
>   if (val & 1)
>   val = ~val; // make the mask non-wrapping
>   val += val & -val;  // adding the low set bit should result in
>   // at most one bit set
>   if (!(val & (val - 1)))
>   return okidoki_all_good;
> 
>> I found a way: by ANDing the mask with the complement of itself rotated 
>> left by one bit, we identify the transitions from 0 to 1. If the result is a 
>> power of 2, it means there's only one transition, so the mask is as expected.
> 
> That does not handle all cases (it misses all bits set at least).  Which
> isn't all that interesting of course, but is a valid mask (but won't
> clear any bits, so not too interesting for your specific case :-) )

Would be nice if we could let the compiler deal with it all...

static inline unsigned long lr(unsigned long *mem)
{
unsigned long val;

/*
 * This doesn't clobber memory but want to avoid memory operations
 * moving ahead of it
 */
asm volatile("ldarx %0, %y1" : "=r"(val) : "Z"(*mem) : "memory");

return val;
}

static inline bool stc(unsigned long *mem, unsigned long val)
{
/*
 * This doesn't really clobber memory but same as above, also can't
 * specify output in asm goto.
 */
asm volatile goto(
"stdcx. %0, %y1 \n\t"
"bne-   %l[fail]\n\t"
: : "r"(val), "Z"(*mem) : "cr0", "memory" : fail);

return true;
fail: __attribute__((cold))
return false;
}

static inline void atomic_add(unsigned long *mem, unsigned long val)
{
unsigned long old, new;

do {
old = lr(mem);
new = old + val;
} while (unlikely(!stc(mem, new)));
}
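
A usage sketch of the above (my addition): with a constant argument the
compiler is free to fold the addition into an addi inside the loop, the
same effect the "%I2" constraint trick achieves, but with no special
constraints in the asm at all.

void inc_counter(unsigned long *counter)
{
	atomic_add(counter, 1);	/* may become ldarx; addi; stdcx. */
}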



[PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-13 Thread Anshuman Khandual
Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
for those platforms which would override generic default FIRST_USER_ADDRESS
value 0UL. This makes it much cleaner with reduced code.

Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Cc: x...@kernel.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
 arch/alpha/include/asm/pgtable.h | 1 -
 arch/arc/include/asm/pgtable.h   | 6 --
 arch/arm/Kconfig | 1 +
 arch/arm64/include/asm/pgtable.h | 2 --
 arch/csky/include/asm/pgtable.h  | 1 -
 arch/hexagon/include/asm/pgtable.h   | 3 ---
 arch/ia64/include/asm/pgtable.h  | 1 -
 arch/m68k/include/asm/pgtable_mm.h   | 1 -
 arch/microblaze/include/asm/pgtable.h| 2 --
 arch/mips/include/asm/pgtable-32.h   | 1 -
 arch/mips/include/asm/pgtable-64.h   | 1 -
 arch/nds32/Kconfig   | 1 +
 arch/nios2/include/asm/pgtable.h | 2 --
 arch/openrisc/include/asm/pgtable.h  | 1 -
 arch/parisc/include/asm/pgtable.h| 2 --
 arch/powerpc/include/asm/book3s/pgtable.h| 1 -
 arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
 arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
 arch/riscv/include/asm/pgtable.h | 2 --
 arch/s390/include/asm/pgtable.h  | 2 --
 arch/sh/include/asm/pgtable.h| 2 --
 arch/sparc/include/asm/pgtable_32.h  | 1 -
 arch/sparc/include/asm/pgtable_64.h  | 3 ---
 arch/um/include/asm/pgtable-2level.h | 1 -
 arch/um/include/asm/pgtable-3level.h | 1 -
 arch/x86/include/asm/pgtable_types.h | 2 --
 arch/xtensa/include/asm/pgtable.h| 1 -
 include/linux/mm.h   | 4 
 mm/Kconfig   | 4 
 29 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/arch/alpha/include/asm/pgtable.h b/arch/alpha/include/asm/pgtable.h
index 8d856c62e22a..1a2fb0dc905b 100644
--- a/arch/alpha/include/asm/pgtable.h
+++ b/arch/alpha/include/asm/pgtable.h
@@ -46,7 +46,6 @@ struct vm_area_struct;
 #define PTRS_PER_PMD   (1UL << (PAGE_SHIFT-3))
 #define PTRS_PER_PGD   (1UL << (PAGE_SHIFT-3))
 #define USER_PTRS_PER_PGD  (TASK_SIZE / PGDIR_SIZE)
-#define FIRST_USER_ADDRESS 0UL
 
 /* Number of pointers that fit on a page:  this will go away. */
 #define PTRS_PER_PAGE  (1UL << (PAGE_SHIFT-3))
diff --git a/arch/arc/include/asm/pgtable.h b/arch/arc/include/asm/pgtable.h
index 163641726a2b..a9fabfb70664 100644
--- a/arch/arc/include/asm/pgtable.h
+++ b/arch/arc/include/asm/pgtable.h
@@ -228,12 +228,6 @@
  */
 #defineUSER_PTRS_PER_PGD   (TASK_SIZE / PGDIR_SIZE)
 
-/*
- * No special requirements for lowest virtual address we permit any user space
- * mapping to be mapped at.
- */
-#define FIRST_USER_ADDRESS  0UL
-
 
 /
  * Bucket load of VM Helpers
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 5da96f5df48f..ad086e6d7155 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_DEBUG_VIRTUAL if MMU
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
+   select ARCH_HAS_FIRST_USER_ADDRESS
select ARCH_HAS_FORTIFY_SOURCE
select ARCH_HAS_KEEPINITRD
select ARCH_HAS_KCOV
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 47027796c2f9..f6ab8b64967e 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -26,8 +26,6 @@
 
 #define vmemmap((struct page *)VMEMMAP_START - 
(memstart_addr >> PAGE_SHIFT))
 
-#define FIRST_USER_ADDRESS 0UL
-
 #ifndef __ASSEMBLY__
 
 #include 
diff --git a/arch/csky/include/asm/pgtable.h b/arch/csky/include/asm/pgtable.h
index 0d60367b6bfa..151607ed5158 100644
--- a/arch/csky/include/asm/pgtable.h
+++ b/arch/csky/include/asm/pgtable.h
@@ -14,7 +14,6 @@
 #define PGDIR_MASK (~(PGDIR_SIZE-1))
 
 #define USER_PTRS_PER_PGD  (PAGE_OFFSET/PGDIR_SIZE)
-#define FIRST_USER_ADDRESS 0UL
 
 /*
  * C-SKY is two-level paging structure:
diff --git a/arch/hexagon/include/asm/pgtable.h 
b/arch/hexagon/include/asm/pgtable.h
index dbb22

Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2021-04-13 Thread Pingfan Liu
On Sat, Apr 10, 2021 at 12:33 AM Michal Suchánek  wrote:
>
> Hello,
>
> On Fri, Aug 28, 2020 at 04:10:09PM +0800, Pingfan Liu wrote:
> > On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour  
> > wrote:
> > >
> > > Le 10/08/2020 à 10:52, Pingfan Liu a écrit :
> > > > A bug is observed on pseries by taking the following steps on rhel:
> > > > -1. drmgr -c mem -r -q 5
> > > > -2. echo c > /proc/sysrq-trigger
> > > >
> > > > And then, the failure looks like:
> > > > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > > > kdump: saving vmcore-dmesg.txt
> > > > kdump: saving vmcore-dmesg.txt complete
> > > > kdump: saving vmcore
> > > >   Checking for memory holes : [  0.0 %] /   
> > > > Checking for memory holes : [100.0 
> > > > %] |   Excluding unnecessary pages  
> > > >  : [100.0 %] \   Copying data   
> > > >: [  0.3 %] -  eta: 38s[   44.337636] hash-mmu: mm: 
> > > > Hashing failure ! EA=0x7fffba40 access=0x8004 
> > > > current=makedumpfile
> > > > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base 
> > > > psize=2 psize 2 pte=0xc0005504
> > > > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 
> > > > access=0x8004 current=makedumpfile
> > > > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base 
> > > > psize=2 psize 2 pte=0xc0005504
> > > > [   44.337708] makedumpfile[469]: unhandled signal 7 at 
> > > > 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > > > [   44.338548] Core dump to |/bin/false pipe failed
> > > > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   
> > > > $CORE_COLLECTOR /proc/vmcore 
> > > > $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > > > kdump: saving vmcore failed
> > > >
> > > > * Root cause *
> > > >After analyzing, it turns out that in the current implementation,
> > > > when hot-removing lmb, the KOBJ_REMOVE event ejects before the dt 
> > > > updating as
> > > > the code __remove_memory() comes before drmem_update_dt().
> > > > So in kdump kernel, when read_from_oldmem() resorts to
> > > > pSeries_lpar_hpte_insert() to install hpte, but fails with -2 due to
> > > > non-exist pfn. And finally, low_hash_fault() raise SIGBUS to process, 
> > > > as it
> > > > can be observed "Bus error"
> > > >
> > > >  From a viewpoint of listener and publisher, the publisher notifies the
> > > > listener before data is ready.  This introduces a problem where udev
> > > > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > > > updating. And in capture kernel, makedumpfile will access the memory 
> > > > based
> > > > on the stale dt info, and hit a SIGBUS error due to an un-existed lmb.
> > > >
> > > > * Fix *
> > > > This bug is introduced by commit 063b8b1251fd
> > > > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> > > > request"), which tried to combine all the dt updating into one.
> > > >
> > > > To fix this issue, meanwhile not to introduce a quadratic runtime
> > > > complexity by the model:
> > > >dlpar_memory_add_by_count
> > > >  for_each_drmem_lmb <--
> > > >dlpar_add_lmb
> > > >  drmem_update_dt(_v1|_v2)
> > > >for_each_drmem_lmb   <--
> > > > The dt should still be only updated once, and just before the last 
> > > > memory
> > > > online/offline event is ejected to user space. Achieve this by tracing 
> > > > the
> > > > num of lmb added or removed.
> > > >
> > > > Signed-off-by: Pingfan Liu 
> > > > Cc: Michael Ellerman 
> > > > Cc: Hari Bathini 
> > > > Cc: Nathan Lynch 
> > > > Cc: Nathan Fontenot 
> > > > Cc: Laurent Dufour 
> > > > To: linuxppc-dev@lists.ozlabs.org
> > > > Cc: ke...@lists.infradead.org
> > > > ---
> > > > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
> > > >whether dt is updated successfully.
> > > >Fix a condition boundary check bug
> > > > v3 -> v4: resolve a quadratic runtime complexity issue.
> > > >This series is applied on next-test branch
> > > >   arch/powerpc/platforms/pseries/hotplug-memory.c | 102 
> > > > +++-
> > > >   1 file changed, 80 insertions(+), 22 deletions(-)
> > > >
> > > > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> > > > b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > > > index 46cbcd1..1567d9f 100644
> > > > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > > > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > > > @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb 
> > > > *lmb)
> > > >   return true;
> > > >   }
> > > >
> > > > -static int dlpar_add_lmb(struct drmem_lmb *);
> > > > +enum dt_update_status {
> > > > + DT_NOUPDATE,
> > > > + DT_TOUPDATE,
> > > > + DT

Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-13 Thread Leonardo Bras
On Mon, 2021-04-12 at 17:21 -0500, Segher Boessenkool wrote:
> On Fri, Apr 09, 2021 at 02:36:16PM +1000, Alexey Kardashevskiy wrote:
> > On 08/04/2021 19:04, Michael Ellerman wrote:
> > > > > > +#define QUERY_DDW_PGSIZE_4K0x01
> > > > > > +#define QUERY_DDW_PGSIZE_64K   0x02
> > > > > > +#define QUERY_DDW_PGSIZE_16M   0x04
> > > > > > +#define QUERY_DDW_PGSIZE_32M   0x08
> > > > > > +#define QUERY_DDW_PGSIZE_64M   0x10
> > > > > > +#define QUERY_DDW_PGSIZE_128M  0x20
> > > > > > +#define QUERY_DDW_PGSIZE_256M  0x40
> > > > > > +#define QUERY_DDW_PGSIZE_16G   0x80
> > > > > 
> > > > > I'm not sure the #defines really gain us much vs just putting the
> > > > > literal values in the array below?
> > > > 
> > > > Then someone says "u magic values" :) I do not mind either way. 
> > > > Thanks,
> > > 
> > > Yeah that's true. But #defining them doesn't make them less magic, if
> > > you only use them in one place :)
> > 
> > Defining them with "QUERY_DDW" in the names kinda tells where they are 
> > from. Can also grep QEMU using these to see how the other side handles 
> > it. Dunno.
> 
> And *not* defining anything reduces the mental load a lot.  You can add
> a comment at the single spot you use them, explaining what this is, in a
> much better way!
> 
> Comments are *good*.
> 
> 
> Segher

Thanks for the feedback Alexey, Michael and Segher!

I have sent a v3 for this patch. 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210408201915.174217-1-leobra...@gmail.com/

Please let me know of your feedback in it.

Best regards,
Leonardo Bras



Re: [PATCH v3] powerpc/papr_scm: Implement support for H_SCM_FLUSH hcall

2021-04-13 Thread Aneesh Kumar K.V
Vaibhav Jain  writes:

> Hi Shiva,
>
> Apologies for a late review but something just caught my eye while
> working on a different patch.
>
> Shivaprasad G Bhat  writes:
>
>> Add support for ND_REGION_ASYNC capability if the device tree
>> indicates 'ibm,hcall-flush-required' property in the NVDIMM node.
>> Flush is done by issuing H_SCM_FLUSH hcall to the hypervisor.
>>
>> If the flush request failed, the hypervisor is expected to
>> reflect the problem in the subsequent nvdimm H_SCM_HEALTH call.
>>
>> This patch prevents mmap of namespaces with MAP_SYNC flag if the
>> nvdimm requires an explicit flush[1].
>>
>> References:
>> [1] 
>> https://github.com/avocado-framework-tests/avocado-misc-tests/blob/master/memory/ndctl.py.data/map_sync.c
>>
>> Signed-off-by: Shivaprasad G Bhat 
>> ---
>> v2 - https://www.spinics.net/lists/kvm-ppc/msg18799.html
>> Changes from v2:
>>- Fixed the commit message.
>>- Add dev_dbg before the H_SCM_FLUSH hcall
>>
>> v1 - https://www.spinics.net/lists/kvm-ppc/msg18272.html
>> Changes from v1:
>>- Hcall semantics finalized, all changes are to accomodate them.
>>
>>  Documentation/powerpc/papr_hcalls.rst |   14 ++
>>  arch/powerpc/include/asm/hvcall.h |3 +-
>>  arch/powerpc/platforms/pseries/papr_scm.c |   40 
>> +
>>  3 files changed, 56 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/powerpc/papr_hcalls.rst 
>> b/Documentation/powerpc/papr_hcalls.rst
>> index 48fcf1255a33..648f278eea8f 100644
>> --- a/Documentation/powerpc/papr_hcalls.rst
>> +++ b/Documentation/powerpc/papr_hcalls.rst
>> @@ -275,6 +275,20 @@ Health Bitmap Flags:
>>  Given a DRC Index collect the performance statistics for NVDIMM and copy 
>> them
>>  to the resultBuffer.
>>  
>> +**H_SCM_FLUSH**
>> +
>> +| Input: *drcIndex, continue-token*
>> +| Out: *continue-token*
>> +| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*
>> +
>> +Given a DRC Index Flush the data to backend NVDIMM device.
>> +
>> +The hcall returns H_BUSY when the flush takes longer time and the hcall 
>> needs
>> +to be issued multiple times in order to be completely serviced. The
>> +*continue-token* from the output to be passed in the argument list of
>> +subsequent hcalls to the hypervisor until the hcall is completely serviced
>> +at which point H_SUCCESS or other error is returned by the hypervisor.
>> +
> Does the hcall semantic also include measures to trigger a barrier/fence
> (like pm_wmb()) so that all the stores before the hcall are guaranteed
> to have hit the pmem controller?

It is up to the hypervisor to implement the right sequence to ensure the
guarantee. The hcall doesn't expect the store to reach the platform
buffers.


>
> If not then the papr_scm_pmem_flush() introduced below should issue a
> fence like pm_wmb() before issuing the flush hcall.
>
> If yes then this behaviour should also be documented with the hcall
> semantics above.
>

IIUC fdatasync on the backend file is enough to ensure the
guarantee. Such a request will have REQ_PREFLUSH and REQ_FUA, which will
do the necessary barrier on the backing device holding the backend file.

-aneesh
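
To make the ordering question concrete, a hedged sketch of the flush
path being discussed (the fence placement is the one suggested above;
the hcall name comes from the patch; the rest of the sequence is an
assumption):

	static int papr_scm_flush_sketch(struct papr_scm_priv *p)
	{
		unsigned long ret_buf[PLPAR_HCALL_BUFSIZE];
		u64 token = 0;
		s64 rc;

		pm_wmb();	/* order prior stores before the flush request */

		do {
			rc = plpar_hcall(H_SCM_FLUSH, ret_buf,
					 p->drc_index, token);
			token = ret_buf[0];	/* continue-token */
		} while (rc == H_BUSY);

		return rc == H_SUCCESS ? 0 : -EIO;
	}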


Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-13 Thread Christophe Leroy




Le 14/04/2021 à 04:54, Anshuman Khandual a écrit :

Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
for those platforms which would override generic default FIRST_USER_ADDRESS
value 0UL. This makes it much cleaner with reduced code.

Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c...@vger.kernel.org
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Cc: x...@kernel.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
  arch/alpha/include/asm/pgtable.h | 1 -
  arch/arc/include/asm/pgtable.h   | 6 --
  arch/arm/Kconfig | 1 +
  arch/arm64/include/asm/pgtable.h | 2 --
  arch/csky/include/asm/pgtable.h  | 1 -
  arch/hexagon/include/asm/pgtable.h   | 3 ---
  arch/ia64/include/asm/pgtable.h  | 1 -
  arch/m68k/include/asm/pgtable_mm.h   | 1 -
  arch/microblaze/include/asm/pgtable.h| 2 --
  arch/mips/include/asm/pgtable-32.h   | 1 -
  arch/mips/include/asm/pgtable-64.h   | 1 -
  arch/nds32/Kconfig   | 1 +
  arch/nios2/include/asm/pgtable.h | 2 --
  arch/openrisc/include/asm/pgtable.h  | 1 -
  arch/parisc/include/asm/pgtable.h| 2 --
  arch/powerpc/include/asm/book3s/pgtable.h| 1 -
  arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
  arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
  arch/riscv/include/asm/pgtable.h | 2 --
  arch/s390/include/asm/pgtable.h  | 2 --
  arch/sh/include/asm/pgtable.h| 2 --
  arch/sparc/include/asm/pgtable_32.h  | 1 -
  arch/sparc/include/asm/pgtable_64.h  | 3 ---
  arch/um/include/asm/pgtable-2level.h | 1 -
  arch/um/include/asm/pgtable-3level.h | 1 -
  arch/x86/include/asm/pgtable_types.h | 2 --
  arch/xtensa/include/asm/pgtable.h| 1 -
  include/linux/mm.h   | 4 
  mm/Kconfig   | 4 
  29 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8ba434287387..47098ccd715e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
  
  void init_mm_internals(void);
  
+#ifndef ARCH_HAS_FIRST_USER_ADDRESS


I guess you didn't test it. :)

should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS


+#define FIRST_USER_ADDRESS 0UL
+#endif


But why do we need a config option at all for that?

Why not just:

#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS  0UL
#endif
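
With that, an overriding arch simply provides its own define (the
nonzero value below is hypothetical, purely for illustration), with no
Kconfig option needed, assuming the arch definition is already visible
wherever the generic #ifndef is evaluated; that ordering point is
exactly what Anshuman asks about further down the thread.

/* arch/xyz/include/asm/pgtable.h */
#define FIRST_USER_ADDRESS	PAGE_SIZE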


+
  #ifndef CONFIG_NEED_MULTIPLE_NODES/* Don't use mapnrs, do it properly */
  extern unsigned long max_mapnr;
  
diff --git a/mm/Kconfig b/mm/Kconfig

index 24c045b24b95..373fbe377075 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -806,6 +806,10 @@ config VMAP_PFN
  
  config ARCH_USES_HIGH_VMA_FLAGS

bool
+
+config ARCH_HAS_FIRST_USER_ADDRESS
+   bool
+
  config ARCH_HAS_PKEYS
bool
  



Re: [PATCH v2 3/4] powerpc: Rename probe_kernel_read_inst()

2021-04-13 Thread Aneesh Kumar K.V
Christophe Leroy  writes:

> When probe_kernel_read_inst() was created, it was to mimic
> probe_kernel_read() function.
>
> Since then, probe_kernel_read() has been renamed
> copy_from_kernel_nofault().
>
> Rename probe_kernel_read_inst() into copy_from_kernel_nofault_inst().

At first glance I read it as copy from kernel nofault instruction.
How about copy_inst_from_kernel_nofault()? 


-aneesh


Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-13 Thread Anshuman Khandual



On 4/14/21 10:52 AM, Christophe Leroy wrote:
> 
> 
> Le 14/04/2021 à 04:54, Anshuman Khandual a écrit :
>> Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
>> same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
>> for those platforms which would override generic default FIRST_USER_ADDRESS
>> value 0UL. This makes it much cleaner with reduced code.
>>
>> Cc: linux-al...@vger.kernel.org
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-c...@vger.kernel.org
>> Cc: linux-hexa...@vger.kernel.org
>> Cc: linux-i...@vger.kernel.org
>> Cc: linux-m...@lists.linux-m68k.org
>> Cc: linux-m...@vger.kernel.org
>> Cc: openr...@lists.librecores.org
>> Cc: linux-par...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-ri...@lists.infradead.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: sparcli...@vger.kernel.org
>> Cc: linux...@lists.infradead.org
>> Cc: linux-xte...@linux-xtensa.org
>> Cc: x...@kernel.org
>> Cc: linux...@kvack.org
>> Cc: linux-ker...@vger.kernel.org
>> Signed-off-by: Anshuman Khandual 
>> ---
>>   arch/alpha/include/asm/pgtable.h | 1 -
>>   arch/arc/include/asm/pgtable.h   | 6 --
>>   arch/arm/Kconfig | 1 +
>>   arch/arm64/include/asm/pgtable.h | 2 --
>>   arch/csky/include/asm/pgtable.h  | 1 -
>>   arch/hexagon/include/asm/pgtable.h   | 3 ---
>>   arch/ia64/include/asm/pgtable.h  | 1 -
>>   arch/m68k/include/asm/pgtable_mm.h   | 1 -
>>   arch/microblaze/include/asm/pgtable.h    | 2 --
>>   arch/mips/include/asm/pgtable-32.h   | 1 -
>>   arch/mips/include/asm/pgtable-64.h   | 1 -
>>   arch/nds32/Kconfig   | 1 +
>>   arch/nios2/include/asm/pgtable.h | 2 --
>>   arch/openrisc/include/asm/pgtable.h  | 1 -
>>   arch/parisc/include/asm/pgtable.h    | 2 --
>>   arch/powerpc/include/asm/book3s/pgtable.h    | 1 -
>>   arch/powerpc/include/asm/nohash/32/pgtable.h | 1 -
>>   arch/powerpc/include/asm/nohash/64/pgtable.h | 2 --
>>   arch/riscv/include/asm/pgtable.h | 2 --
>>   arch/s390/include/asm/pgtable.h  | 2 --
>>   arch/sh/include/asm/pgtable.h    | 2 --
>>   arch/sparc/include/asm/pgtable_32.h  | 1 -
>>   arch/sparc/include/asm/pgtable_64.h  | 3 ---
>>   arch/um/include/asm/pgtable-2level.h | 1 -
>>   arch/um/include/asm/pgtable-3level.h | 1 -
>>   arch/x86/include/asm/pgtable_types.h | 2 --
>>   arch/xtensa/include/asm/pgtable.h    | 1 -
>>   include/linux/mm.h   | 4 
>>   mm/Kconfig   | 4 
>>   29 files changed, 10 insertions(+), 43 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 8ba434287387..47098ccd715e 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
>>  void init_mm_internals(void);
>>
>> +#ifndef ARCH_HAS_FIRST_USER_ADDRESS
> 
> I guess you didn't test it. :)

In fact I did :) Though I just booted it on arm64 and cross compiled it on
multiple other platforms.

> 
> should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS

Right, meant that instead.

> 
>> +#define FIRST_USER_ADDRESS    0UL
>> +#endif
> 
> But why do we need a config option at all for that ?
> 
> Why not just:
> 
> #ifndef FIRST_USER_ADDRESS
> #define FIRST_USER_ADDRESS    0UL
> #endif

This sounds simpler. But just wondering, could there be any possibility of
build problems due to the compilation sequence between arch and generic code?


Re: [PATCH] mm: Define ARCH_HAS_FIRST_USER_ADDRESS

2021-04-13 Thread Christophe Leroy




Le 14/04/2021 à 07:59, Anshuman Khandual a écrit :
> On 4/14/21 10:52 AM, Christophe Leroy wrote:
>> Le 14/04/2021 à 04:54, Anshuman Khandual a écrit :
>>> Currently most platforms define FIRST_USER_ADDRESS as 0UL duplicating the
>>> same code all over. Instead define a new option ARCH_HAS_FIRST_USER_ADDRESS
>>> for those platforms which would override generic default FIRST_USER_ADDRESS
>>> value 0UL. This makes it much cleaner with reduced code.
>>>
>>> [Cc list and diffstat identical to the message quoted above, trimmed]
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 8ba434287387..47098ccd715e 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -46,6 +46,10 @@ extern int sysctl_page_lock_unfairness;
>>>
>>>  void init_mm_internals(void);
>>>
>>> +#ifndef ARCH_HAS_FIRST_USER_ADDRESS
>>
>> I guess you didn't test it. :)
>
> In fact I did :) Though I just booted it on arm64 and cross compiled it on
> multiple other platforms.
>
>> should be #ifndef CONFIG_ARCH_HAS_FIRST_USER_ADDRESS
>
> Right, meant that instead.
>
>>> +#define FIRST_USER_ADDRESS    0UL
>>> +#endif
>>
>> But why do we need a config option at all for that ?
>>
>> Why not just:
>>
>> #ifndef FIRST_USER_ADDRESS
>> #define FIRST_USER_ADDRESS    0UL
>> #endif
>
> This sounds simpler. But just wondering, could there be any possibility of
> build problems due to the compilation sequence between arch and generic code?



For sure it has to be addressed carefully, but there is already a lot of
stuff like that around pgtable.h.

For instance, pte_offset_kernel() has a generic definition in linux/pgtable.h
that applies only when the architecture has not already defined one.

Taking into account that FIRST_USER_ADDRESS today lives in the architectures'
asm/pgtable.h, I think putting the fallback definition in linux/pgtable.h
would do the trick.
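
[A minimal sketch of that fallback pattern, mirroring how
include/linux/pgtable.h already handles pte_offset_kernel(); illustrative,
not taken from the patch under discussion.]

/* include/linux/pgtable.h pulls in the arch header first ...        */
#include <asm/pgtable.h>	/* may or may not define FIRST_USER_ADDRESS */

/* ... so the generic default only applies when the arch stayed silent */
#ifndef FIRST_USER_ADDRESS
#define FIRST_USER_ADDRESS	0UL
#endif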


Re: [PATCH 1/5] uapi: remove the unused HAVE_ARCH_STRUCT_FLOCK64 define

2021-04-13 Thread Stephen Rothwell
Hi Arnd,

On Mon, 12 Apr 2021 11:55:41 +0200 Arnd Bergmann  wrote:
>
> On Mon, Apr 12, 2021 at 10:55 AM Christoph Hellwig  wrote:
> >
> > Signed-off-by: Christoph Hellwig   
> 
> The patch looks good, but I'd like to see a description for each one.
> How about:
> 
> | The check was added when Stephen Rothwell created the file, but
> | no architecture ever defined it.

Actually it was used by the xtensa architecture until December 2006.

-- 
Cheers,
Stephen Rothwell
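
[For context, the guard being removed follows the usual uapi override
pattern; the sketch below is from memory of include/uapi/asm-generic/fcntl.h
and the field layout should be treated as illustrative. An architecture
could in principle define HAVE_ARCH_STRUCT_FLOCK64 and supply its own
layout, but none has since xtensa stopped in 2006, hence the removal.]

#ifndef HAVE_ARCH_STRUCT_FLOCK64
#ifndef __ARCH_FLOCK64_PAD
#define __ARCH_FLOCK64_PAD	/* arches may append padding fields */
#endif

struct flock64 {
	short		l_type;
	short		l_whence;
	__kernel_loff_t	l_start;
	__kernel_loff_t	l_len;
	__kernel_pid_t	l_pid;
	__ARCH_FLOCK64_PAD
};
#endif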

