Re: [PATCH 1/1] powerpc/pseries/iommu: Fix window size for direct mapping with pmem

2021-04-19 Thread Alexey Kardashevskiy




On 20/04/2021 14:54, Leonardo Bras wrote:

As of today, if the DDW is big enough to fit (1 << MAX_PHYSMEM_BITS) it's
possible to use direct DMA mapping even with pmem region.

But, if that happens, the window size (len) is set to
(MAX_PHYSMEM_BITS - page_shift) instead of MAX_PHYSMEM_BITS, creating a
DDW that is pagesize times smaller than needed, which is insufficient for
correct usage.

Fix this so the correct window size is used in this case.


Good find indeed.

afaict this does not create a huge problem though as
query.largest_available_block is always smaller than
(1ULL << (MAX_PHYSMEM_BITS - page_shift)) where it matters (phyp).
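
(A rough worked example, assuming MAX_PHYSMEM_BITS = 51 and 64K IOMMU pages,
i.e. page_shift = 16: the buggy len = 51 - 16 = 35 creates a 2^35 = 32 GB
window, whereas the intended len = 51 creates a 2^51 = 2 PB window, i.e.
pagesize (2^16) times bigger, which is what is needed to cover the whole
(1 << MAX_PHYSMEM_BITS) range including pmem.)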



Reviewed-by: Alexey Kardashevskiy 



Fixes: bf6e2d562bbc4 ("powerpc/dma: Fallback to dma_ops when persistent memory present")
Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..836cbbe0ecc5 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1229,7 +1229,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
if (pmem_present) {
if (query.largest_available_block >=
(1ULL << (MAX_PHYSMEM_BITS - page_shift)))
-   len = MAX_PHYSMEM_BITS - page_shift;
+   len = MAX_PHYSMEM_BITS;
else
dev_info(&dev->dev, "Skipping ibm,pmemory");
}



--
Alexey


Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-08 Thread Alexey Kardashevskiy




On 08/04/2021 19:04, Michael Ellerman wrote:

Alexey Kardashevskiy  writes:

On 08/04/2021 15:37, Michael Ellerman wrote:

Leonardo Bras  writes:

According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.


Do we know of any hardware & hypervisor combination that will actually
give us bigger pages?



On P8 16MB host pages and 16MB hardware iommu pages worked.

On P9, VM's 16MB IOMMU pages worked on top of 2MB host pages + 2MB
hardware IOMMU pages.


The current code already tries 16MB though.

I'm wondering if we're going to ask for larger sizes that have never
been tested and possibly expose bugs. But it sounds like this is mainly
targeted at future platforms.



I tried for fun to pass through a PCI device to a guest with this patch as:

pbuild/qemu-killslof-aiku1904le-ppc64/qemu-system-ppc64 \
-nodefaults \
-chardev stdio,id=STDIO0,signal=off,mux=on \
-device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
-mon id=MON0,chardev=STDIO0,mode=readline \
-nographic \
-vga none \
-enable-kvm \
-m 16G \
-kernel ./vmldbg \
-initrd /home/aik/t/le.cpio \
-device vfio-pci,id=vfio0001_01_00_0,host=0001:01:00.0 \
-mem-prealloc \
-mem-path qemu_hp_1G_node0 \
-global spapr-pci-host-bridge.pgsz=0xff000 \
-machine cap-cfpc=broken,cap-ccf-assist=off \
-smp 1,threads=1 \
-L /home/aik/t/qemu-ppc64-bios/ \
-trace events=qemu_trace_events \
-d guest_errors,mmu \
-chardev socket,id=SOCKET0,server=on,wait=off,path=qemu.mon.1_1_0_0 \
-mon chardev=SOCKET0,mode=control


The guest created a huge window:

xhci_hcd :00:00.0: ibm,create-pe-dma-window(2027) 0 800 2000 
22 22 returned 0 (liobn = 0x8001 starting addr = 800 0)


The first "22" is page_shift in hex (16GB), the second "22" is 
window_shift (so we have 1 TCE).
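
(Spelled out: page_shift = 0x22 = 34, so each IOMMU page is 2^34 bytes = 16 GB;
window_shift is also 0x22 = 34, so the window is 2^34 bytes and holds
2^(34 - 34) = 1 TCE.)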


On the host side the window#1 was created with 1GB pages:
pci 0001:01 : [PE# fd] Setting up window#1 
800..80007ff pg=4000



The XHCI seems to be working. Without the patch, 16MB was the maximum.





diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..6cda1c92597d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,20 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
   };


A comment saying where the values come from would be good.


+#define QUERY_DDW_PGSIZE_4K0x01
+#define QUERY_DDW_PGSIZE_64K   0x02
+#define QUERY_DDW_PGSIZE_16M   0x04
+#define QUERY_DDW_PGSIZE_32M   0x08
+#define QUERY_DDW_PGSIZE_64M   0x10
+#define QUERY_DDW_PGSIZE_128M  0x20
+#define QUERY_DDW_PGSIZE_256M  0x40
+#define QUERY_DDW_PGSIZE_16G   0x80


I'm not sure the #defines really gain us much vs just putting the
literal values in the array below?


Then someone says "u magic values" :) I do not mind either way. Thanks,


Yeah that's true. But #defining them doesn't make them less magic, if
you only use them in one place :)


Defining them with "QUERY_DDW" in the names kinda tells where they are 
from. Can also grep QEMU using these to see how the other side handles 
it. Dunno.


btw the bot complained about __builtin_ctz(SZ_16G) which should be 
__builtin_ctzl(SZ_16G) so we have to ask Leonardo to repost anyway :)
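
(For reference: __builtin_ctz() operates on unsigned int, and SZ_16G = 1UL << 34
does not fit in 32 bits, so the constant would be truncated and the result is
undefined; __builtin_ctzl() takes unsigned long and returns the expected 34.)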




--
Alexey


Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-04-08 Thread Alexey Kardashevskiy




On 08/04/2021 15:37, Michael Ellerman wrote:

Leonardo Bras  writes:

According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.


Do we know of any hardware & hypervisor combination that will actually
give us bigger pages?



On P8 16MB host pages and 16MB hardware iommu pages worked.

On P9, VM's 16MB IOMMU pages worked on top of 2MB host pages + 2MB 
hardware IOMMU pages.






Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using less TCE entries.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 49 ++
  1 file changed, 42 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..6cda1c92597d 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,20 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
  };


A comment saying where the values come from would be good.


+#define QUERY_DDW_PGSIZE_4K0x01
+#define QUERY_DDW_PGSIZE_64K   0x02
+#define QUERY_DDW_PGSIZE_16M   0x04
+#define QUERY_DDW_PGSIZE_32M   0x08
+#define QUERY_DDW_PGSIZE_64M   0x10
+#define QUERY_DDW_PGSIZE_128M  0x20
+#define QUERY_DDW_PGSIZE_256M  0x40
+#define QUERY_DDW_PGSIZE_16G   0x80


I'm not sure the #defines really gain us much vs just putting the
literal values in the array below?



Then someone says "u magic values" :) I do not mind either way. Thanks,




+struct iommu_ddw_pagesize {
+   u32 mask;
+   int shift;
+};
+
  static struct iommu_table_group *iommu_pseries_alloc_group(int node)
  {
struct iommu_table_group *table_group;
@@ -1099,6 +1113,31 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+/* Returns page shift based on "IO Page Sizes" output at ibm,query-pe-dma-window. See LoPAR */

+static int iommu_get_page_shift(u32 query_page_size)
+{
+   const struct iommu_ddw_pagesize ddw_pagesize[] = {
+   { QUERY_DDW_PGSIZE_16G,  __builtin_ctz(SZ_16G)  },
+   { QUERY_DDW_PGSIZE_256M, __builtin_ctz(SZ_256M) },
+   { QUERY_DDW_PGSIZE_128M, __builtin_ctz(SZ_128M) },
+   { QUERY_DDW_PGSIZE_64M,  __builtin_ctz(SZ_64M)  },
+   { QUERY_DDW_PGSIZE_32M,  __builtin_ctz(SZ_32M)  },
+   { QUERY_DDW_PGSIZE_16M,  __builtin_ctz(SZ_16M)  },
+   { QUERY_DDW_PGSIZE_64K,  __builtin_ctz(SZ_64K)  },
+   { QUERY_DDW_PGSIZE_4K,   __builtin_ctz(SZ_4K)   }
+   };



cheers



--
Alexey


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-23 Thread Alexey Kardashevskiy




On 24/03/2021 06:32, Jason Gunthorpe wrote:


For NVIDIA GPU Max checked internally and we saw it looks very much
like how Intel GPU works. Only some PCI IDs trigger checking on the
feature the firmware thing is linked to.


And as Alexey noted, the table came up incomplete.  But also those same
devices exist on platforms where this extension is completely
irrelevant.


I understood he meant that NVIDIA GPUs *without* NVLINK can exist, but
the ID table we have here is supposed to be the NVLINK-compatible
IDs.



I also meant there are more (than in the proposed list)  GPUs with 
NVLink which will work on P9.



--
Alexey


Re: [PATCH 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR

2021-03-23 Thread Alexey Kardashevskiy




On 23/03/2021 06:09, Leonardo Bras wrote:

According to LoPAR, ibm,query-pe-dma-window output named "IO Page Sizes"
will let the OS know all possible pagesizes that can be used for creating a
new DDW.

Currently Linux will only try using 3 of the 8 available options:
4K, 64K and 16M. According to LoPAR, Hypervisor may also offer 32M, 64M,
128M, 256M and 16G.

Enabling bigger pages would be interesting for direct mapping systems
with a lot of RAM, while using less TCE entries.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/include/asm/iommu.h   |  8 
  arch/powerpc/platforms/pseries/iommu.c | 28 +++---
  2 files changed, 29 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..c170048b7a1b 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -19,6 +19,14 @@
  #include 
  #include 
  
+#define IOMMU_PAGE_SHIFT_16G	34

+#define IOMMU_PAGE_SHIFT_256M  28
+#define IOMMU_PAGE_SHIFT_128M  27
+#define IOMMU_PAGE_SHIFT_64M   26
+#define IOMMU_PAGE_SHIFT_32M   25
+#define IOMMU_PAGE_SHIFT_16M   24
+#define IOMMU_PAGE_SHIFT_64K   16



These are not very descriptive, these are just normal shifts, could be 
as simple as __builtin_ctz(SZ_4K) (gcc will optimize this) and so on.


OTOH the PAPR page sizes need macros as they are the ones which are 
weird and screaming for macros.


I'd steal/rework spapr_page_mask_to_query_mask() from QEMU. Thanks,
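
(Roughly this kind of helper, just a sketch of the suggestion; the
QUERY_DDW_PGSIZE_* masks are the ones from the v2 patch quoted earlier and
SZ_*/ARRAY_SIZE come from the usual kernel headers:)

static int iommu_get_page_shift(u32 query_page_size)
{
	/* One entry per LoPAR "IO Page Sizes" bit, largest page size first. */
	const struct { u32 mask; int shift; } ddw_pagesize[] = {
		{ QUERY_DDW_PGSIZE_16G,  __builtin_ctzl(SZ_16G)  },
		{ QUERY_DDW_PGSIZE_256M, __builtin_ctzl(SZ_256M) },
		{ QUERY_DDW_PGSIZE_128M, __builtin_ctzl(SZ_128M) },
		{ QUERY_DDW_PGSIZE_64M,  __builtin_ctzl(SZ_64M)  },
		{ QUERY_DDW_PGSIZE_32M,  __builtin_ctzl(SZ_32M)  },
		{ QUERY_DDW_PGSIZE_16M,  __builtin_ctzl(SZ_16M)  },
		{ QUERY_DDW_PGSIZE_64K,  __builtin_ctzl(SZ_64K)  },
		{ QUERY_DDW_PGSIZE_4K,   __builtin_ctzl(SZ_4K)   },
	};
	int i;

	/* Return the shift of the largest page size the hypervisor offers. */
	for (i = 0; i < ARRAY_SIZE(ddw_pagesize); i++)
		if (query_page_size & ddw_pagesize[i].mask)
			return ddw_pagesize[i].shift;

	return 0;	/* no supported page size in the mask */
}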





+
  #define IOMMU_PAGE_SHIFT_4K  12
  #define IOMMU_PAGE_SIZE_4K   (ASM_CONST(1) << IOMMU_PAGE_SHIFT_4K)
  #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 9fc5217f0c8e..02958e80aa91 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1099,6 +1099,24 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+/* Returns page shift based on "IO Page Sizes" output at ibm,query-pe-dma-window. See LoPAR */

+static int iommu_get_page_shift(u32 query_page_size)
+{
+	const int shift[] = {IOMMU_PAGE_SHIFT_4K,   IOMMU_PAGE_SHIFT_64K,  IOMMU_PAGE_SHIFT_16M,
+			     IOMMU_PAGE_SHIFT_32M,  IOMMU_PAGE_SHIFT_64M,  IOMMU_PAGE_SHIFT_128M,
+			     IOMMU_PAGE_SHIFT_256M, IOMMU_PAGE_SHIFT_16G};
+   int i = ARRAY_SIZE(shift) - 1;
+
+   /* Looks for the largest page size supported */
+   for (; i >= 0; i--) {
+   if (query_page_size & (1 << i))
+   return shift[i];
+   }
+
+   /* No valid page size found. */
+   return 0;
+}
+
  /*
   * If the PE supports dynamic dma windows, and there is space for a table
   * that can map all pages in a linear offset, then setup such a table,
@@ -1206,13 +1224,9 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
}
-   if (query.page_size & 4) {
-   page_shift = 24; /* 16MB */
-   } else if (query.page_size & 2) {
-   page_shift = 16; /* 64kB */
-   } else if (query.page_size & 1) {
-   page_shift = 12; /* 4kB */
-   } else {
+
+   page_shift = iommu_get_page_shift(query.page_size);
+   if (!page_shift) {
dev_dbg(&dev->dev, "no supported direct page size in mask %x",
  query.page_size);
goto out_failed;



--
Alexey


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Alexey Kardashevskiy




On 11/03/2021 13:00, Jason Gunthorpe wrote:

On Thu, Mar 11, 2021 at 12:42:56PM +1100, Alexey Kardashevskiy wrote:

btw can the id list have only vendor ids and not have device ids?


The PCI matcher is quite flexable, see the other patch from Max for
the igd
  
ah cool, do this for NVIDIA GPUs then please, I just discovered another P9

system sold with NVIDIA T4s which is not in your list.


I think it will make things easier down the road if you maintain an
exact list 



Then why don't you do the exact list for Intel IGD? The commit log does
not explain this detail.




But best practice is to be as narrow as possible as I hope this will
eventually impact module autoloading and other details.


The amount of device specific knowledge is too little to tie it up to device
ids, it is a generic PCI driver with quirks. We do not have separate
drivers for the hardware which requires quirks.


It provides its own capability structure exposed to userspace, that is
absolutely not a "quirk"


And how do you hope this should impact autoloading?


I would like to autoload the most specific vfio driver for the target
hardware.



Is there an idea how it is going to work? For example, the Intel IGD 
driver and vfio-pci-igd - how should the system pick one? If there is no 
MODULE_DEVICE_TABLE in vfio-pci-xxx, is the user supposed to try binding 
all vfio-pci-xxx drivers until some binds?




If you someday need to support new GPU HW that needs a different VFIO
driver then you are really stuck because things become indeterminate
if there are two devices claiming the ID. We don't have the concept of
"best match", driver core works on exact match.




--
Alexey


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Alexey Kardashevskiy




On 11/03/2021 12:34, Jason Gunthorpe wrote:

On Thu, Mar 11, 2021 at 12:20:33PM +1100, Alexey Kardashevskiy wrote:


It is supposed to match exactly the same match table as the pci_driver
above. We *don't* want different behavior from what the standrd PCI
driver matcher will do.


This is not a standard PCI driver though


It is now, that is what this patch makes it into. This is why it now
has a struct pci_driver.


and the main vfio-pci won't have a
list to match ever.


?? vfio-pci uses driver_override or new_id to manage its match list



Exactly, no list to update.



IBM NPU PCI id is unlikely to change ever but NVIDIA keeps making
new devices which work in those P9 boxes, are you going to keep
adding those ids to nvlink2gpu_vfio_pci_table?


Certainly, as needed. PCI list updates is normal for the kernel.


btw can the id list have only vendor ids and not have device ids?


The PCI matcher is quite flexable, see the other patch from Max for
the igd



ah cool, do this for NVIDIA GPUs then please, I just discovered another 
P9 system sold with NVIDIA T4s which is not in your list.




But best practice is to be as narrow as possible as I hope this will
eventually impact module autoloading and other details.


The amount of device specific knowledge is too little to tie it up to
device ids, it is a generic PCI driver with quirks. We do not have
separate drivers for the hardware which requires quirks.


And how do you hope this should impact autoloading?



--
Alexey


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Alexey Kardashevskiy




On 11/03/2021 06:40, Jason Gunthorpe wrote:

On Thu, Mar 11, 2021 at 01:24:47AM +1100, Alexey Kardashevskiy wrote:



On 11/03/2021 00:02, Jason Gunthorpe wrote:

On Wed, Mar 10, 2021 at 02:57:57PM +0200, Max Gurtovoy wrote:


+    .err_handler    = &vfio_pci_core_err_handlers,
+};
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
+{
+    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
+    return &nvlink2gpu_vfio_pci_driver;



Why do we need matching PCI ids here instead of looking at the FDT which
will work better?


what is FDT ? and is it better to use it instead of match_id ?


This is emulating the device_driver match for the pci_driver.


No it is not, it is device tree info which lets us skip the Linux PCI
discovery part (the firmware does it anyway) but it tells nothing about
which drivers to bind.


I mean get_nvlink2gpu_vfio_pci_driver() is emulating the PCI match.

Max added a pci driver for NPU here:

+static struct pci_driver npu2_vfio_pci_driver = {
+   .name   = "npu2-vfio-pci",
+   .id_table   = npu2_vfio_pci_table,
+   .probe  = npu2_vfio_pci_probe,


new userspace should use driver_override with "npu-vfio-pci" as the
string not "vfio-pci"

The point of the get_npu2_vfio_pci_driver() is only optional
compatibility to redirect old userspace using "vfio-pci" in the
driver_override to the now split driver code so userspace doesn't see
any change in behavior.

If we don't do this then the vfio-pci driver override will disable the
npu2 special stuff, since Max took it all out of vfio-pci's
pci_driver.

It is supposed to match exactly the same match table as the pci_driver
above. We *don't* want different behavior from what the standrd PCI
driver matcher will do.
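
(For context, a minimal userspace sketch of the driver_override flow being
described; the BDF is just an example, the device is assumed to be unbound, and
the name string is the "npu2-vfio-pci" one from Max's patch:)

#include <stdio.h>

int main(void)
{
	const char *bdf = "0001:01:00.0";	/* example device address */
	char path[128];
	FILE *f;

	/* Restrict which driver may claim this device. */
	snprintf(path, sizeof(path),
		 "/sys/bus/pci/devices/%s/driver_override", bdf);
	f = fopen(path, "w");
	if (!f)
		return 1;
	fprintf(f, "npu2-vfio-pci\n");	/* old userspace wrote "vfio-pci" here */
	fclose(f);

	/* Ask the PCI core to probe the device so the override takes effect. */
	f = fopen("/sys/bus/pci/drivers_probe", "w");
	if (!f)
		return 1;
	fprintf(f, "%s\n", bdf);
	fclose(f);
	return 0;
}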



This is not a standard PCI driver though and the main vfio-pci won't 
have a list to match ever. IBM NPU PCI id is unlikely to change ever but 
NVIDIA keeps making new devices which work in those P9 boxes, are you 
going to keep adding those ids to nvlink2gpu_vfio_pci_table? btw can the 
id list have only vendor ids and not have device ids?




Since we don't have any way to mix in FDT discovery to the standard
PCI driver match it will still attach the npu driver but not enable
any special support. This seems OK.




--
Alexey


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Alexey Kardashevskiy




On 11/03/2021 00:02, Jason Gunthorpe wrote:

On Wed, Mar 10, 2021 at 02:57:57PM +0200, Max Gurtovoy wrote:


+    .err_handler    = &vfio_pci_core_err_handlers,
+};
+
+#ifdef CONFIG_VFIO_PCI_DRIVER_COMPAT
+struct pci_driver *get_nvlink2gpu_vfio_pci_driver(struct pci_dev *pdev)
+{
+    if (pci_match_id(nvlink2gpu_vfio_pci_driver.id_table, pdev))
+    return &nvlink2gpu_vfio_pci_driver;



Why do we need matching PCI ids here instead of looking at the FDT which
will work better?


what is FDT ? and is it better to use it instead of match_id ?


This is emulating the device_driver match for the pci_driver.



No it is not, it is device tree info which lets us skip the Linux PCI
discovery part (the firmware does it anyway) but it tells nothing about
which drivers to bind.




I don't think we can combine FDT matching with pci_driver, can we?


It is a C function calling another C function, all within vfio-pci; this
is not called by the generic PCI code.




--
Alexey


Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-10 Thread Alexey Kardashevskiy




On 10/03/2021 23:57, Max Gurtovoy wrote:


On 3/10/2021 8:39 AM, Alexey Kardashevskiy wrote:



On 09/03/2021 19:33, Max Gurtovoy wrote:

The new drivers introduced are nvlink2gpu_vfio_pci.ko and
npu2_vfio_pci.ko.
The first will be responsible for providing special extensions for
NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the
future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host
bus adapter).

Also, preserve backward compatibility for users that were binding
NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
be dropped in the future

Signed-off-by: Max Gurtovoy 
---
  drivers/vfio/pci/Kconfig  |  28 +++-
  drivers/vfio/pci/Makefile |   7 +-
  .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 -
  drivers/vfio/pci/npu2_vfio_pci.h  |  24 +++
  ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +-
  drivers/vfio/pci/nvlink2gpu_vfio_pci.h    |  24 +++
  drivers/vfio/pci/vfio_pci.c   |  61 ++-
  drivers/vfio/pci/vfio_pci_core.c  |  18 ---
  drivers/vfio/pci/vfio_pci_core.h  |  14 --
  9 files changed, 422 insertions(+), 47 deletions(-)
  rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
  create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
  rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => 
nvlink2gpu_vfio_pci.c} (67%)

  create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 829e90a2e5a3..88c89863a205 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -48,8 +48,30 @@ config VFIO_PCI_IGD
      To enable Intel IGD assignment through vfio-pci, say Y.
  -config VFIO_PCI_NVLINK2
-    def_bool y
+config VFIO_PCI_NVLINK2GPU
+    tristate "VFIO support for NVIDIA NVLINK2 GPUs"
  depends on VFIO_PCI_CORE && PPC_POWERNV
  help
-  VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+  VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
+  for P9 Witherspoon machine.
+
+config VFIO_PCI_NPU2
+    tristate "VFIO support for IBM NPU host bus adapter on P9"
+    depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
+    help
+  VFIO PCI specific extensions for IBM NVLink2 host bus adapter 
on P9

+  Witherspoon machine.
+
+config VFIO_PCI_DRIVER_COMPAT
+    bool "VFIO PCI backward compatibility for vendor specific 
extensions"

+    default y
+    depends on VFIO_PCI
+    help
+  Say Y here if you want to preserve VFIO PCI backward
+  compatibility. vfio_pci.ko will continue to automatically use
+  the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
+  a compatible device.
+
+  When N is selected the user must bind explicity to the module
+  they want to handle the device and vfio_pci.ko will have no
+  device specific special behaviors.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f539f32c9296..86fb62e271fc 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,10 +2,15 @@
    obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
    vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o 
vfio_pci_rdwr.o vfio_pci_config.o

  vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o 
vfio_pci_npu2.o

  vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
    vfio-pci-y := vfio_pci.o
+
+npu2-vfio-pci-y := npu2_vfio_pci.o
+
+nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c 
b/drivers/vfio/pci/npu2_vfio_pci.c

similarity index 64%
rename from drivers/vfio/pci/vfio_pci_npu2.c
rename to drivers/vfio/pci/npu2_vfio_pci.c
index 717745256ab3..7071bda0f2b6 100644
--- a/drivers/vfio/pci/vfio_pci_npu2.c
+++ b/drivers/vfio/pci/npu2_vfio_pci.c
@@ -14,19 +14,28 @@
   *    Author: Alex Williamson 
   */
  +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include 
  #include 
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
    #include "vfio_pci_core.h"
+#include "npu2_vfio_pci.h"
    #define CREATE_TRACE_POINTS
  #include "npu2_trace.h"
  +#define DRIVER_VERSION  "0.1"
+#define DRIVER_AUTHOR   "Alexey Kardashevskiy "
+#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for 
POWER9 NPU NVLink2 HBA"

+
  EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
    struct vfio_pci_npu2_data {
@@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
  unsigned int link_speed; /* The link speed from DT's 
ibm,nvlink-speed */

  };
  +struct npu2_vfio_pci_device {
+    struct vfio_pci_core_device    vdev;

Re: [PATCH 8/9] vfio/pci: export nvlink2 support into vendor vfio_pci drivers

2021-03-09 Thread Alexey Kardashevskiy




On 09/03/2021 19:33, Max Gurtovoy wrote:

The new drivers introduced are nvlink2gpu_vfio_pci.ko and
npu2_vfio_pci.ko.
The first will be responsible for providing special extensions for
NVIDIA GPUs with NVLINK2 support for P9 platform (and others in the
future). The last will be responsible for POWER9 NPU2 unit (NVLink2 host
bus adapter).

Also, preserve backward compatibility for users that were binding
NVLINK2 devices to vfio_pci.ko. Hopefully this compatibility layer will
be dropped in the future

Signed-off-by: Max Gurtovoy 
---
  drivers/vfio/pci/Kconfig  |  28 +++-
  drivers/vfio/pci/Makefile |   7 +-
  .../pci/{vfio_pci_npu2.c => npu2_vfio_pci.c}  | 144 -
  drivers/vfio/pci/npu2_vfio_pci.h  |  24 +++
  ...pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} | 149 +-
  drivers/vfio/pci/nvlink2gpu_vfio_pci.h|  24 +++
  drivers/vfio/pci/vfio_pci.c   |  61 ++-
  drivers/vfio/pci/vfio_pci_core.c  |  18 ---
  drivers/vfio/pci/vfio_pci_core.h  |  14 --
  9 files changed, 422 insertions(+), 47 deletions(-)
  rename drivers/vfio/pci/{vfio_pci_npu2.c => npu2_vfio_pci.c} (64%)
  create mode 100644 drivers/vfio/pci/npu2_vfio_pci.h
  rename drivers/vfio/pci/{vfio_pci_nvlink2gpu.c => nvlink2gpu_vfio_pci.c} (67%)
  create mode 100644 drivers/vfio/pci/nvlink2gpu_vfio_pci.h

diff --git a/drivers/vfio/pci/Kconfig b/drivers/vfio/pci/Kconfig
index 829e90a2e5a3..88c89863a205 100644
--- a/drivers/vfio/pci/Kconfig
+++ b/drivers/vfio/pci/Kconfig
@@ -48,8 +48,30 @@ config VFIO_PCI_IGD
  
  	  To enable Intel IGD assignment through vfio-pci, say Y.
  
-config VFIO_PCI_NVLINK2

-   def_bool y
+config VFIO_PCI_NVLINK2GPU
+   tristate "VFIO support for NVIDIA NVLINK2 GPUs"
depends on VFIO_PCI_CORE && PPC_POWERNV
help
- VFIO PCI support for P9 Witherspoon machine with NVIDIA V100 GPUs
+ VFIO PCI driver for NVIDIA NVLINK2 GPUs with specific extensions
+ for P9 Witherspoon machine.
+
+config VFIO_PCI_NPU2
+   tristate "VFIO support for IBM NPU host bus adapter on P9"
+   depends on VFIO_PCI_NVLINK2GPU && PPC_POWERNV
+   help
+ VFIO PCI specific extensions for IBM NVLink2 host bus adapter on P9
+ Witherspoon machine.
+
+config VFIO_PCI_DRIVER_COMPAT
+   bool "VFIO PCI backward compatibility for vendor specific extensions"
+   default y
+   depends on VFIO_PCI
+   help
+ Say Y here if you want to preserve VFIO PCI backward
+ compatibility. vfio_pci.ko will continue to automatically use
+ the NVLINK2, NPU2 and IGD VFIO drivers when it is attached to
+ a compatible device.
+
+ When N is selected the user must bind explicity to the module
+ they want to handle the device and vfio_pci.ko will have no
+ device specific special behaviors.
diff --git a/drivers/vfio/pci/Makefile b/drivers/vfio/pci/Makefile
index f539f32c9296..86fb62e271fc 100644
--- a/drivers/vfio/pci/Makefile
+++ b/drivers/vfio/pci/Makefile
@@ -2,10 +2,15 @@
  
  obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o

  obj-$(CONFIG_VFIO_PCI) += vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NPU2) += npu2-vfio-pci.o
+obj-$(CONFIG_VFIO_PCI_NVLINK2GPU) += nvlink2gpu-vfio-pci.o
  
  vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o

  vfio-pci-core-$(CONFIG_VFIO_PCI_IGD) += vfio_pci_igd.o
-vfio-pci-core-$(CONFIG_VFIO_PCI_NVLINK2) += vfio_pci_nvlink2gpu.o 
vfio_pci_npu2.o
  vfio-pci-core-$(CONFIG_S390) += vfio_pci_zdev.o
  
  vfio-pci-y := vfio_pci.o

+
+npu2-vfio-pci-y := npu2_vfio_pci.o
+
+nvlink2gpu-vfio-pci-y := nvlink2gpu_vfio_pci.o
diff --git a/drivers/vfio/pci/vfio_pci_npu2.c b/drivers/vfio/pci/npu2_vfio_pci.c
similarity index 64%
rename from drivers/vfio/pci/vfio_pci_npu2.c
rename to drivers/vfio/pci/npu2_vfio_pci.c
index 717745256ab3..7071bda0f2b6 100644
--- a/drivers/vfio/pci/vfio_pci_npu2.c
+++ b/drivers/vfio/pci/npu2_vfio_pci.c
@@ -14,19 +14,28 @@
   *Author: Alex Williamson 
   */
  
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+
+#include 
  #include 
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
  
  #include "vfio_pci_core.h"

+#include "npu2_vfio_pci.h"
  
  #define CREATE_TRACE_POINTS

  #include "npu2_trace.h"
  
+#define DRIVER_VERSION  "0.1"

+#define DRIVER_AUTHOR   "Alexey Kardashevskiy "
+#define DRIVER_DESC "NPU2 VFIO PCI - User Level meta-driver for POWER9 NPU 
NVLink2 HBA"
+
  EXPORT_TRACEPOINT_SYMBOL_GPL(vfio_pci_npu2_mmap);
  
  struct vfio_pci_npu2_data {

@@ -36,6 +45,10 @@ struct vfio_pci_npu2_data {
unsigned int link_speed; /* The link speed from DT's ibm,nvlink-speed */
  };
  
+struct npu2_vfio_pci_device {

+   struct vfio_pci_core_device vdev;
+};
+
  static size_

Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-08 Thread Alexey Kardashevskiy




On 08/02/2021 23:44, Max Gurtovoy wrote:


On 2/5/2021 2:42 AM, Alexey Kardashevskiy wrote:



On 04/02/2021 23:51, Jason Gunthorpe wrote:

On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:


It is system firmware (==bios) which puts stuff in the device tree. The
stuff is:
1. emulated pci devices (custom pci bridges), one per nvlink, 
emulated by
the firmware, the driver is "ibmnpu" and it is a part of the nvidia
driver;

these are basically config space proxies to the cpu's side of nvlink.
2. interconnect information - which of 6 gpus nvlinks connected to 
which

nvlink on the cpu side, and memory ranges.


So what is this vfio_nvlink driver supposed to be bound to?

The "emulated pci devices"?


Yes.


A real GPU function?


Yes.


A real nvswitch function?


What do you mean by this exactly? The cpu side of nvlink is "emulated 
pci devices", the gpu side is not in pci space at all, the nvidia 
driver manages it via the gpu's mmio or/and cfg space.



Something else?


Nope :)
In this new scheme which you are proposing it should be 2 drivers, I 
guess.


I see.

So should it be nvidia_vfio_pci.ko ? and it will do the NVLINK stuff in 
case the class code matches and otherwise just work as simple vfio_pci 
GPU ?


"nvidia_vfio_pci" would be too generic, sounds like it is for every 
nvidia on every platform. powernv_nvidia_vfio_pci.ko may be.



What about the second driver ? should it be called ibmnpu_vfio_pci.ko ?


This will do.








Jason





--
Alexey


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-08 Thread Alexey Kardashevskiy




On 09/02/2021 05:13, Jason Gunthorpe wrote:

On Fri, Feb 05, 2021 at 11:42:11AM +1100, Alexey Kardashevskiy wrote:

A real nvswitch function?


What do you mean by this exactly? The cpu side of nvlink is "emulated pci
devices", the gpu side is not in pci space at all, the nvidia driver manages
it via the gpu's mmio or/and cfg space.


Some versions of the nvswitch chip have a PCI-E link too, that is what
I though this was all about when I first saw it.



So, it is really a special set of functions for NVIDIA GPU device
assignment only applicable to P9 systems, much like IGD is for Intel
on x86.




These GPUs are not P9 specific and they all have both PCIe and NVLink2
links. The special part is that some nvlinks are between P9 and GPU and
the rest are between GPUs; everywhere else (x86, maybe ARM) the nvlinks
are used between GPUs, but even there I do not think the nvlink logic is
presented to the host in PCI space.




--
Alexey


Re: [PATCH 8/9] vfio/pci: use x86 naming instead of igd

2021-02-04 Thread Alexey Kardashevskiy




On 04/02/2021 23:51, Jason Gunthorpe wrote:

On Thu, Feb 04, 2021 at 12:05:22PM +1100, Alexey Kardashevskiy wrote:


It is system firmware (==bios) which puts stuff in the device tree. The
stuff is:
1. emulated pci devices (custom pci bridges), one per nvlink, emulated by
the firmware, the driver is "ibmnpu" and it is a part of the nvidia driver;
these are basically config space proxies to the cpu's side of nvlink.
2. interconnect information - which of 6 gpus nvlinks connected to which
nvlink on the cpu side, and memory ranges.


So what is this vfio_nvlink driver supposed to be bound to?

The "emulated pci devices"?


Yes.


A real GPU function?


Yes.


A real nvswitch function?


What do you mean by this exactly? The cpu side of nvlink is "emulated 
pci devices", the gpu side is not in pci space at all, the nvidia driver 
manages it via the gpu's mmio or/and cfg space.



Something else?


Nope :)
In this new scheme which you are proposing it should be 2 drivers, I guess.



Jason



--
Alexey


Re: [PATCH v5] tracepoint: Do not fail unregistering a probe due to memory failure

2021-02-01 Thread Alexey Kardashevskiy




On 02/02/2021 04:10, Steven Rostedt wrote:

On Mon, 1 Feb 2021 12:18:34 +1100
Alexey Kardashevskiy  wrote:


Just curious, does the following patch fix it for v5?



Yes it does!


Thanks for verifying.






-- Steve

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 7261fa0f5e3c..cf3a6d104fdb 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -306,6 +306,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
tp->unregfunc();
   
		static_key_disable(&tp->key);

+   tracepoint_synchronize_unregister();
rcu_assign_pointer(tp->funcs, tp_funcs);
} else {
rcu_assign_pointer(tp->funcs, tp_funcs);
   


OK, since it would be expensive to do a synchronization on every removal
like that, but the tp->funcs should not be reset.

It appears that your check is still required, since the iterator has been
added.

The quick fix is the check you gave.

But I think we could optimize this (not having to dereference the array
twice, and do the check twice) by making the iterator part of the tp_funcs
array, and having the rest of the array as its argument. But that can be a
separate update.

The check you added should be a patch and marked for stable. Care to send
it, and mark it for stable as well as:

Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")

Thanks!



Posted as "[PATCH kernel] tracepoint: Fix race between tracing and 
removing tracepoint", hopefully I got it right. Thanks,




--
Alexey


[PATCH kernel] tracepoint: Fix race between tracing and removing tracepoint

2021-02-01 Thread Alexey Kardashevskiy
When executing a tracepoint, the tracepoint's func is dereferenced twice -
in __DO_TRACE() (where the returned pointer is checked) and later on in
__traceiter_##_name where the returned pointer is dereferenced without
checking which leads to races against tracepoint_removal_sync() and
crashes.

This adds a check before dereferencing the pointer in __traceiter_##_name().

Fixes: d25e37d89dd2f ("tracepoint: Optimize using static_call()")
Signed-off-by: Alexey Kardashevskiy 
---

This is in reply to https://lkml.org/lkml/2021/2/1/868

Feel free to change the commit log. Thanks!

Fixing it properly is rather scary :)
I tried passing it_func_ptr to it_func but this change triggered way too
many prototypes changes such as __bpf_trace_##call().

---
 include/linux/tracepoint.h | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 0f21617f1a66..966ed8980327 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -307,11 +307,13 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
\
it_func_ptr =   \
rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
-   do {\
-   it_func = (it_func_ptr)->func;  \
-   __data = (it_func_ptr)->data;   \
-   ((void(*)(void *, proto))(it_func))(__data, args); \
-   } while ((++it_func_ptr)->func);\
+   if (it_func_ptr) {  \
+   do {\
+   it_func = (it_func_ptr)->func;  \
+   __data = (it_func_ptr)->data;   \
+   ((void(*)(void *, proto))(it_func))(__data, 
args); \
+   } while ((++it_func_ptr)->func);\
+   }   \
return 0;   \
}   \
DEFINE_STATIC_CALL(tp_func_##_name, __traceiter_##_name);
-- 
2.17.1



Re: [PATCH v5] tracepoint: Do not fail unregistering a probe due to memory failure

2021-01-31 Thread Alexey Kardashevskiy




On 31/01/2021 01:42, Steven Rostedt wrote:

On Sat, 30 Jan 2021 09:36:26 -0500
Steven Rostedt  wrote:


Do you still have the same crash with v3 (that's the one I'm going to
go with for now.)

  https://lore.kernel.org/r/20201118093405.7a6d2...@gandalf.local.home


Just curious, does the following patch fix it for v5?



Yes it does!




-- Steve

diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 7261fa0f5e3c..cf3a6d104fdb 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c
@@ -306,6 +306,7 @@ static int tracepoint_remove_func(struct tracepoint *tp,
tp->unregfunc();
  
		static_key_disable(&tp->key);

+   tracepoint_synchronize_unregister();
rcu_assign_pointer(tp->funcs, tp_funcs);
} else {
rcu_assign_pointer(tp->funcs, tp_funcs);



--
Alexey


Re: [PATCH v5] tracepoint: Do not fail unregistering a probe due to memory failure

2021-01-30 Thread Alexey Kardashevskiy




On 28/01/2021 09:07, Steven Rostedt wrote:

From: "Steven Rostedt (VMware)" 

The list of tracepoint callbacks is managed by an array that is protected
by RCU. To update this array, a new array is allocated, the updates are
copied over to the new array, and then the list of functions for the
tracepoint is switched over to the new array. After a completion of an RCU
grace period, the old array is freed.

This process happens for both adding a callback as well as removing one.
But on removing a callback, if the new array fails to be allocated, the
callback is not removed, and may be used after it is freed by the clients
of the tracepoint.

The handling of a failed allocation for removing an event can break use
cases as the error report is not propagated up to the original callers. To
make matters worse, there's some paths that can not handle error cases.

Instead of allocating a new array for removing a tracepoint, allocate twice
the needed size when adding tracepoints to the array. On removing, use the
second half of the allocated array. This removes the need to allocate memory
for removing a tracepoint, as the allocation for removals will already have
been done.

Link: https://lkml.kernel.org/r/20201115055256.65625-1-mmull...@mmlx.us
Link: https://lkml.kernel.org/r/20201116175107.02db3...@gandalf.local.home
Link: https://lkml.kennel.org/r/20201118093405.7a6d2...@gandalf.local.home

Reported-by: Matt Mullins 
Signed-off-by: Steven Rostedt (VMware) 
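
(A standalone illustration of the allocation scheme described in the commit
message above; the names and layout here are assumptions for the sketch, not
the kernel's tracepoint code:)

#include <stdlib.h>
#include <string.h>

struct probe { void (*func)(void); void *data; };

/*
 * Add: allocate twice the room actually needed, so a later removal can
 * reuse the spare second half instead of allocating again.
 */
static struct probe *probes_add(const struct probe *old, int old_nr,
				struct probe new_probe, int *new_nr)
{
	int nr = old_nr + 1;
	struct probe *p = calloc(2 * nr, sizeof(*p));

	if (!p)
		return NULL;
	if (old_nr)
		memcpy(p, old, old_nr * sizeof(*old));
	p[old_nr] = new_probe;
	*new_nr = nr;
	return p;	/* caller publishes p (e.g. via RCU), then frees the old array */
}

/*
 * Remove: nothing can fail here, the survivors are copied into the spare
 * half that probes_add() already set aside in the same allocation.
 */
static struct probe *probes_remove(struct probe *cur, int nr,
				   void (*func)(void), int *new_nr)
{
	struct probe *spare = cur + nr;	/* second half of the 2 * nr block */
	int i, j = 0;

	for (i = 0; i < nr; i++)
		if (cur[i].func != func)
			spare[j++] = cur[i];
	*new_nr = j;
	return spare;	/* same allocation; freed as a whole later */
}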



I still need the following chunk (same "if (it_func_ptr)" as in the v2's 
reply) in order to stop crashes:




diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 82eba6a05a1c..b7cf7a5a4f43 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -311,6 +311,7 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)

\
it_func_ptr =   \

rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
+   if (it_func_ptr) \
do {\
it_func = (it_func_ptr)->func;  \
__data = (it_func_ptr)->data;   \




--
Alexey


Re: [PATCH v2] tracepoint: Do not fail unregistering a probe due to memory allocation

2021-01-26 Thread Alexey Kardashevskiy




On 18/11/2020 23:46, Steven Rostedt wrote:

On Tue, 17 Nov 2020 20:54:24 -0800
Alexei Starovoitov  wrote:


  extern int
@@ -310,7 +312,12 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
 do {\
 it_func = (it_func_ptr)->func;  \
 __data = (it_func_ptr)->data;   \
-   ((void(*)(void *, proto))(it_func))(__data, args); \
+   /*  \
+* Removed functions that couldn't be allocated \
+* are replaced with TRACEPOINT_STUB.   \
+*/ \
+   if (likely(it_func != TRACEPOINT_STUB)) \
+   ((void(*)(void *, proto))(it_func))(__data, 
args); \


I think you're overreacting to the problem.


I will disagree with that.


Adding run-time check to extremely unlikely problem seems wasteful.


Show me a real benchmark that you can notice a problem here. I bet that
check is even within the noise of calling an indirect function. Especially
on a machine with retpolines.


99.9% of the time allocate_probes() will do kmalloc from slab of small
objects.
If that slab is out of memory it means it cannot allocate a single page.
In such case so many things will be failing to alloc that system
is unlikely operational. oom should have triggered long ago.
Imo Matt's approach to add __GFP_NOFAIL to allocate_probes()


Looking at the GFP_NOFAIL comment:

  * %__GFP_NOFAIL: The VM implementation _must_ retry infinitely: the caller
  * cannot handle allocation failures. The allocation could block
  * indefinitely but will never return with failure. Testing for
  * failure is pointless.
  * New users should be evaluated carefully (and the flag should be
  * used only when there is no reasonable failure policy) but it is
  * definitely preferable to use the flag rather than opencode endless
  * loop around allocator.

I realized I made a mistake in my patch for using it, as my patch is a
failure policy. It looks like something we want to avoid in general.

Thanks, I'll send a v3 that removes it.


when it's called from func_remove() is much better.
The error was reported by syzbot that was using
memory fault injections. ENOMEM in allocate_probes() was
never seen in real life and highly unlikely will ever be seen.


And the biggest thing you are missing here, is that if you are running on a
machine that has static calls, this code is never hit unless you have more
than one callback on a single tracepoint. That's because when there's only
one callback, it gets called directly, and this loop is not involved.



I am running syzkaller and the kernel keeps crashing in
__traceiter_##_name. This patch makes these crashes happen a lot less
often (and so did the v1) but the kernel still crashes (examples below,
but the common thing is that they crash in tracepoints). Disasm points
to __DO_TRACE_CALL(name) and this fixes it:



--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -313,6 +313,7 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)

\
it_func_ptr =   \

rcu_dereference_raw((&__tracepoint_##_name)->funcs); \
+   if (it_func_ptr)\
do {\
it_func = (it_func_ptr)->func;  \


I am running on a powerpc box which does not have CONFIG_HAVE_STATIC_CALL.

I wonder if it is still the same problem which the mentioned v3 might fix,
or if it is something different? Thanks,




[  285.072538] Kernel attempted to read user page (0) - exploit attempt? 
(uid: 0)

[  285.073657] BUG: Kernel NULL pointer dereference on read at 0x
[  285.075129] Faulting instruction address: 0xc02edf48
cpu 0xd: Vector: 300 (Data Access) at [c000115db530]
pc: c02edf48: lock_acquire+0x2e8/0x5d0
lr: c06ee450: step_into+0x940/0xc20
sp: c000115db7d0
   msr: 80009033
   dar: 0
 dsisr: 4000
  current = 0xc000115af280
  paca= 0xc0005ff9fe00   irqmask: 0x03   irq_happened: 0x01
pid   = 182, comm = systemd-journal
Linux version 5.11.0-rc5-le_syzkaller_a+fstn1 (aik@fstn1-p1) (gcc 
(Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, GNU ld (GNU Binutils for Ubuntu) 
2.30) #65 SMP Wed Jan 27 16:46:46 AEDT 2021

enter ? for help
[c000115db8c0] c06ee450 step_into+0x940/0xc20
[c000115db950] c06efddc walk_component+0xbc/0x340
[c000115db9d0] c06f0418 link_path_walk.part.29+0x3b8/0x5b0

Re: BUG: MAX_LOCKDEP_KEYS too low!

2021-01-23 Thread Alexey Kardashevskiy




On 23/01/2021 21:29, Tetsuo Handa wrote:

On 2021/01/23 15:35, Alexey Kardashevskiy wrote:

this behaves quite differently but still produces the message (I have
show_workqueue_state() right after the bug message):


[   85.803991] BUG: MAX_LOCKDEP_KEYS too low!
[   85.804338] turning off the locking correctness validator.
[   85.804474] Showing busy workqueues and worker pools:
[   85.804620] workqueue events_unbound: flags=0x2
[   85.804764]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3
[   85.804965] in-flight: 81:bpf_map_free_deferred
[   85.805229] workqueue events_power_efficient: flags=0x80
[   85.805357]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[   85.805558] in-flight: 57:gc_worker
[   85.805877] pool 4: cpus=2 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 
82 24
[   85.806147] pool 16: cpus=0-7 flags=0x4 nice=0 hung=69s workers=3 idle: 7 251
^C[  100.129747] maxlockdep (5104) used greatest stack depth: 8032 bytes left

root@le-dbg:~# grep "lock-classes" /proc/lockdep_stats
  lock-classes: 8192 [max: 8192]



Right. Hillf's patch can reduce number of active workqueue's worker threads, for
only one worker thread can call bpf_map_free_deferred() (which is nice because
it avoids bloat of active= and refcnt= fields). But Hillf's patch is not for
fixing the cause of "BUG: MAX_LOCKDEP_KEYS too low!" message.

Like Dmitry mentioned, bpf syscall allows producing work items faster than
bpf_map_free_deferred() can consume. (And a similar problem is observed for
network namespaces.) Unless there is a bug that prevents bpf_map_free_deferred()
  from completing, the classical solution is to put pressure on producers (i.e.
slow down bpf syscall side) in a way that consumers (i.e. __bpf_map_put())
will not schedule thousands of backlog "struct bpf_map" works.



Should not the first 8192 from "grep lock-classes /proc/lockdep_stats"
decrease over time (it does not), or once it has failed, is it permanent?





--
Alexey


Re: BUG: MAX_LOCKDEP_KEYS too low!

2021-01-22 Thread Alexey Kardashevskiy




On 23/01/2021 17:01, Hillf Danton wrote:

On Sat, 23 Jan 2021 09:53:42 +1100  Alexey Kardashevskiy wrote:

On 23/01/2021 02:30, Tetsuo Handa wrote:

On 2021/01/22 22:28, Tetsuo Handa wrote:

On 2021/01/22 21:10, Dmitry Vyukov wrote:

On Fri, Jan 22, 2021 at 1:03 PM Alexey Kardashevskiy  wrote:




On 22/01/2021 21:30, Tetsuo Handa wrote:

On 2021/01/22 18:16, Dmitry Vyukov wrote:

The reproducer only does 2 bpf syscalls, so something is slowly leaking in bpf.
My first suspect would be one of these. Since workqueue is async, it
may cause such slow drain that happens only when tasks are spawned
fast. I don't know if there is a procfs/debugfs introspection file to
monitor workqueue lengths to verify this hypothesis.


If you can reproduce locally, you can call show_workqueue_state()
(part of SysRq-t) when hitting the limit.

--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1277,6 +1277,7 @@ register_lock_class(struct lockdep_map *lock, unsigned 
int subclass, int force)

   print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
   dump_stack();
+   show_workqueue_state();
   return NULL;
   }
   nr_lock_classes++;





Here is the result:
https://pastebin.com/rPn0Cytu


Do you mind posting this publicly?
Yes, it seems that bpf_map_free_deferred is the problem (11138
outstanding callbacks).


Need to set up a local queue for releasing bpf maps if 10,000 means
flooding.





Wow. Horribly stuck queue. I guess BPF wants to try WQ created by

alloc_workqueue("bpf_free_wq", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);

rather than system_wq . You can add Tejun Heo  for WQ.

Anyway, please post your result to ML.


https://pastebin.com/JfrmzguK is with the patch below applied. Seems
less output. Interestingly when I almost hit "send", OOM kicked in and
tried killing a bunch of "maxlockdep" processes (my test case):

[  891.037315] [  31007] 0 31007  2815491520
  1000 maxlockdep
[  891.037540] [  31009] 0 31009  2815491520
  1000 maxlockdep
[  891.037760] [  31012] 0 31012  2815491520
  1000 maxlockdep
[  891.037980] [  31013] 0 31013  2815471040
 0 maxlockdep
[  891.038210] [  31014] 0 31014  2815491520
  1000 maxlockdep
[  891.038429] [  31018] 0 31018  2815471040
 0 maxlockdep
[  891.038652] [  31019] 0 31019  2815491520
  1000 maxlockdep
[  891.038874] [  31020] 0 31020  2815491520
  1000 maxlockdep
[  891.039095] [  31021] 0 31021  2815491520
  1000 maxlockdep
[  891.039317] [  31022] 0 31022  2815471040
 0 maxlockdep



A local queue (a list head plus a spin lock) helps collapse the
map->work entities into a single work item, in order to cut the risk of
work flooding to the WQ.

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -448,16 +448,40 @@ static void bpf_map_release_memcg(struct
  }
  #endif
  
-/* called from workqueue */

+static int worker_idle = 1;
+static LIST_HEAD(bpf_map_work_list);
+static DEFINE_SPINLOCK(bpf_map_work_lock);
+
  static void bpf_map_free_deferred(struct work_struct *work)
  {
-   struct bpf_map *map = container_of(work, struct bpf_map, work);
+   struct bpf_map *map;
+
+   worker_idle = 0;
+again:
+   map = NULL;
+   spin_lock_irq(&bpf_map_work_lock);
+
+   if (!list_empty(&bpf_map_work_list)) {
+   map = list_first_entry(&bpf_map_work_list, struct bpf_map,
+   work_list);
+   list_del_init(&map->work_list);
+   } else
+   worker_idle = 1;
+
+   spin_unlock_irq(&bpf_map_work_lock);
+
+   if (!map)
+   return;
  
  	security_bpf_map_free(map);

bpf_map_release_memcg(map);
/* implementation dependent freeing */
map->ops->map_free(map);
+
+   cond_resched();
+   goto again;
  }
+static DECLARE_WORK(bpf_map_release_work, bpf_map_free_deferred);
  
  static void bpf_map_put_uref(struct bpf_map *map)

  {
@@ -473,11 +497,20 @@ static void bpf_map_put_uref(struct bpf_
  static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
  {
	if (atomic64_dec_and_test(&map->refcnt)) {
+   unsigned long flags;
+   int idle;
+
/* bpf_map_free_id() must be called first */
bpf_map_free_id(map, do_idr_lock);
btf_put(map->btf);
-   INIT_WORK(&map->work, bpf_map_free_deferred);
-   schedule_work(&map->work);
+
+   spin_lock_irqsave(&bpf_map_work_lock, flags);
+   idle = worker_idle;
+   list_add(&map->

Re: BUG: MAX_LOCKDEP_KEYS too low!

2021-01-22 Thread Alexey Kardashevskiy




On 23/01/2021 02:30, Tetsuo Handa wrote:

On 2021/01/22 22:28, Tetsuo Handa wrote:

On 2021/01/22 21:10, Dmitry Vyukov wrote:

On Fri, Jan 22, 2021 at 1:03 PM Alexey Kardashevskiy  wrote:




On 22/01/2021 21:30, Tetsuo Handa wrote:

On 2021/01/22 18:16, Dmitry Vyukov wrote:

The reproducer only does 2 bpf syscalls, so something is slowly leaking in bpf.
My first suspect would be one of these. Since workqueue is async, it
may cause such slow drain that happens only when tasks are spawned
fast. I don't know if there is a procfs/debugfs introspection file to
monitor workqueue lengths to verify this hypothesis.


If you can reproduce locally, you can call show_workqueue_state()
(part of SysRq-t) when hitting the limit.

--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -1277,6 +1277,7 @@ register_lock_class(struct lockdep_map *lock, unsigned 
int subclass, int force)

  print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
  dump_stack();
+   show_workqueue_state();
  return NULL;
  }
  nr_lock_classes++;





Here is the result:
https://pastebin.com/rPn0Cytu


Do you mind posting this publicly?
Yes, it seems that bpf_map_free_deferred is the problem (11138
outstanding callbacks).



Wow. Horribly stuck queue. I guess BPF wants to try WQ created by

   alloc_workqueue("bpf_free_wq", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);

rather than system_wq . You can add Tejun Heo  for WQ.

Anyway, please post your result to ML.




https://pastebin.com/JfrmzguK is with the patch below applied. Seems 
less output. Interestingly when I almost hit "send", OOM kicked in and 
tried killing a bunch of "maxlockdep" processes (my test case):


[  891.037315] [  31007] 0 31007  2815491520 
 1000 maxlockdep
[  891.037540] [  31009] 0 31009  2815491520 
 1000 maxlockdep
[  891.037760] [  31012] 0 31012  2815491520 
 1000 maxlockdep
[  891.037980] [  31013] 0 31013  2815471040 
0 maxlockdep
[  891.038210] [  31014] 0 31014  2815491520 
 1000 maxlockdep
[  891.038429] [  31018] 0 31018  2815471040 
0 maxlockdep
[  891.038652] [  31019] 0 31019  2815491520 
 1000 maxlockdep
[  891.038874] [  31020] 0 31020  2815491520 
 1000 maxlockdep
[  891.039095] [  31021] 0 31021  2815491520 
 1000 maxlockdep
[  891.039317] [  31022] 0 31022  2815471040 
0 maxlockdep





And (re)adding LKML and Tejun as suggested. Thanks,






Does this patch (which is only compile tested) reduce number of pending works
when hitting "BUG: MAX_LOCKDEP_KEYS too low!" ?

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 07cb5d15e743..c6c6902090f0 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -41,6 +41,7 @@ struct bpf_local_storage_map;
  struct kobject;
  struct mem_cgroup;
  
+extern struct workqueue_struct *bpf_free_wq;

  extern struct idr btf_idr;
  extern spinlock_t btf_idr_lock;
  extern struct kobject *btf_kobj;
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 1f8453343bf2..8b1cf6aab089 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -994,7 +994,7 @@ static void prog_array_map_clear(struct bpf_map *map)
struct bpf_array_aux *aux = container_of(map, struct bpf_array,
 map)->aux;
bpf_map_inc(map);
-   schedule_work(&aux->work);
+   queue_work(bpf_free_wq, &aux->work);
  }
  
  static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)

diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 96555a8a2c54..f272844163df 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -160,7 +160,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
  
	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);

-   queue_work(system_wq, &cgrp->bpf.release_work);
+   queue_work(bpf_free_wq, &cgrp->bpf.release_work);
  }
  
  /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through

diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 261f8692d0d2..9d76c0d77687 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -34,6 +34,15 @@
  #include 
  #include 
  
+struct workqueue_struct *bpf_free_wq;

+
+static int __init bpf_free_wq_init(void)
+{
+   bpf_free_wq = alloc_workqueue("bpf_free_wq", WQ_MEM_RECLAIM | 
WQ_HIGHPRI | WQ_UNBOUND, 0);
+   return 0;
+}
+subsys_initcall(bpf_free_wq_init);
+
  /* Registers */
  #define BPF_R0regs[BPF_REG_0]
  #define BPF_R1regs[BPF_REG_1]

BUG: MAX_LOCKDEP_KEYS too low!

2021-01-21 Thread Alexey Kardashevskiy

Hi!

Syzkaller found this bug and it has a repro (below). I googled a similar 
bug in 2019 which was fixed so this seems new.


The repro takes about half a minute to produce the message; "grep
lock-classes /proc/lockdep_stats" reports 8177 of 8192, while before running
the repro it is 702. It is a POWER8 box.


The offender is htab->lockdep_key. If I run the repro at a slow rate, no
problem appears; traces show lockdep_unregister_key() is called and the
leak is quite slow.


Is this something known? Any hints how to debug this further? I'd give 
it a try since I have an easy reproducer. Thanks,




root@le-dbg:~# egrep "BD.*htab->lockdep_key" /proc/lockdep | wc -l
7449
root@le-dbg:~# egrep "BD.*htab->lockdep_key" /proc/lockdep | tail -n 3
(ptrval) FD:1 BD:1 : &htab->lockdep_key#9531
(ptrval) FD:1 BD:1 : &htab->lockdep_key#9532
(ptrval) FD:1 BD:1 : &htab->lockdep_key#9533


// autogenerated by syzkaller (https://github.com/google/syzkaller)

#define __unix__ 1
#define __gnu_linux__ 1
#define __linux__ 1

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static unsigned long long procid;

static void sleep_ms(uint64_t ms)
{
usleep(ms * 1000);
}

static uint64_t current_time_ms(void)
{
struct timespec ts;
if (clock_gettime(CLOCK_MONOTONIC, &ts))
exit(1);
return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
}

static bool write_file(const char* file, const char* what, ...)
{
char buf[1024];
va_list args;
va_start(args, what);
vsnprintf(buf, sizeof(buf), what, args);
va_end(args);
buf[sizeof(buf) - 1] = 0;
int len = strlen(buf);
int fd = open(file, O_WRONLY | O_CLOEXEC);
if (fd == -1)
return false;
if (write(fd, buf, len) != len) {
int err = errno;
close(fd);
errno = err;
return false;
}
close(fd);
return true;
}

static void kill_and_wait(int pid, int* status)
{
kill(-pid, SIGKILL);
kill(pid, SIGKILL);
for (int i = 0; i < 100; i++) {
if (waitpid(-1, status, WNOHANG | __WALL) == pid)
return;
usleep(1000);
}
DIR* dir = opendir("/sys/fs/fuse/connections");
if (dir) {
for (;;) {
struct dirent* ent = readdir(dir);
if (!ent)
break;
if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, 
"..") == 0)
continue;
char abort[300];
			snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort", 
ent->d_name);

int fd = open(abort, O_WRONLY);
if (fd == -1) {
continue;
}
if (write(fd, abort, 1) < 0) {
}
close(fd);
}
closedir(dir);
} else {
}
while (waitpid(-1, status, __WALL) != pid) {
}
}

static void setup_test()
{
prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
setpgrp();
write_file("/proc/self/oom_score_adj", "1000");
}

static void execute_one(void);

#define WAIT_FLAGS __WALL

static void loop(void)
{
int iter = 0;
for (;; iter++) {
int pid = fork();
if (pid < 0)
exit(1);
if (pid == 0) {
setup_test();
execute_one();
exit(0);
}
int status = 0;
uint64_t start = current_time_ms();
for (;;) {
if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
break;
sleep_ms(1);
if (current_time_ms() - start < 5000) {
continue;
}
kill_and_wait(pid, &status);
break;
}
}
}

#ifndef __NR_bpf
#define __NR_bpf 361
#endif
#ifndef __NR_mmap
#define __NR_mmap 90
#endif

uint64_t r[1] = {0xffffffffffffffff};

void execute_one(void)
{
intptr_t res = 0;
*(uint32_t*)0x2280 = 9;
*(uint32_t*)0x2284 = 1;
*(uint32_t*)0x2288 = 6;
*(uint32_t*)0x228c = 5;
*(uint32_t*)0x2290 = 0;
*(uint32_t*)0x2294 = -1;
*(uint32_t*)0x2298 = 0;
*(uint8_t*)0x229c = 0;
*(uint8_t*)0x229d = 0;
*(uint8_t*)0x229e = 0;
*(uint8_t*)0x229f = 0;
*(uint8_t*)0x22a0 = 0;
*(uint8_t*)0x22a1 = 0;
*(uint8_t*)0x22a2 = 0;
*(uint8_t*)0x22a3 = 0;

Re: [RFC PATCH kernel] block: initialize block_device::bd_bdi for bdev_cache

2021-01-07 Thread Alexey Kardashevskiy




On 07/01/2021 18:48, Christoph Hellwig wrote:

On Thu, Jan 07, 2021 at 10:58:39AM +1100, Alexey Kardashevskiy wrote:

And AFAICT the root inode on
bdev superblock can get only to bdev_evict_inode() and bdev_free_inode().
Looking at bdev_evict_inode() the only thing that's used there from struct
block_device is really bd_bdi. bdev_free_inode() will also access
bdev->bd_stats and bdev->bd_meta_info. So we need to at least initialize
these to NULL as well.


These are all NULL.


IMO the most logical place for all these
initializations is in bdev_alloc_inode()...



This works. We can also check for NULL where it crashes. But I do not know
the code to make an informed decision...


The root inode is the special case, so I think moving the the initializers
for everything touched in ->evict_inode and ->free_inode to
bdev_alloc_inode makes most sense.

Alexey, do you want to respin or should I send a patch?


I really prefer you doing this as you will most likely end up fixing the 
commit log anyway :) Thanks,
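
For reference, a minimal sketch of what moving the initialization into 
bdev_alloc_inode() could look like (an illustration of the suggestion above, 
not necessarily the patch that gets sent; it assumes the existing bdev_cachep 
slab cache and bdev_alloc_inode() helper):

static struct inode *bdev_alloc_inode(struct super_block *sb)
{
	struct bdev_inode *ei = kmem_cache_alloc(bdev_cachep, GFP_KERNEL);

	if (!ei)
		return NULL;
	/* Zero the embedded block_device so bd_stats/bd_meta_info start as NULL */
	memset(&ei->bdev, 0, sizeof(ei->bdev));
	/* Point bd_bdi at noop_backing_dev_info so bdev_evict_inode() is safe
	 * even for the root inode allocated via new_inode() */
	ei->bdev.bd_bdi = &noop_backing_dev_info;
	return &ei->vfs_inode;
}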



--
Alexey


Re: [RFC PATCH kernel] block: initialize block_device::bd_bdi for bdev_cache

2021-01-06 Thread Alexey Kardashevskiy




On 06/01/2021 21:41, Jan Kara wrote:

On Wed 06-01-21 20:29:00, Alexey Kardashevskiy wrote:

This is a workaround to fix a null dereference crash:

[cb01f840] cb01f880 (unreliable)
[cb01f880] c0769a3c bdev_evict_inode+0x21c/0x370
[cb01f8c0] c070bacc evict+0x11c/0x230
[cb01f900] c070c138 iput+0x2a8/0x4a0
[cb01f970] c06ff030 dentry_unlink_inode+0x220/0x250
[cb01f9b0] c07001c0 __dentry_kill+0x190/0x320
[cb01fa00] c0701fb8 dput+0x5e8/0x860
[cb01fa80] c0705848 shrink_dcache_for_umount+0x58/0x100
[cb01fb00] c06cf864 generic_shutdown_super+0x54/0x200
[cb01fb80] c06cfd48 kill_anon_super+0x38/0x60
[cb01fbc0] c06d12cc deactivate_locked_super+0xbc/0x110
[cb01fbf0] c06d13bc deactivate_super+0x9c/0xc0
[cb01fc20] c071a340 cleanup_mnt+0x1b0/0x250
[cb01fc80] c0278fa8 task_work_run+0xf8/0x180
[cb01fcd0] c002b4ac do_notify_resume+0x4dc/0x5d0
[cb01fda0] c004ba0c syscall_exit_prepare+0x28c/0x370
[cb01fe10] c000e06c system_call_common+0xfc/0x27c
--- Exception: c00 (System Call) at 10034890

Is this fixed properly already somewhere? Thanks,

Fixes: e6cb53827ed6 ("block: initialize struct block_device in bdev_alloc")


I don't think it's fixed anywhere and I've seen the syzbot report and I was
wondering how this can happen when bdev_alloc() initializes bdev->bd_bdi
and it also wasn't clear to me whether bd_bdi is really the only field that
is problematic - if we can get to bdev_evict_inode() without going through
bdev_alloc(), we are probably missing initialization of other fields in
that place as well...

But now I've realized that probably the inode is a root inode for bdev
superblock which is allocated by VFS through new_inode() and thus doesn't
undergo the initialization in bdev_alloc(). 


yup, this is the case.


And AFAICT the root inode on
bdev superblock can get only to bdev_evict_inode() and bdev_free_inode().
Looking at bdev_evict_inode() the only thing that's used there from struct
block_device is really bd_bdi. bdev_free_inode() will also access
bdev->bd_stats and bdev->bd_meta_info. So we need to at least initialize
these to NULL as well.


These are all NULL.


IMO the most logical place for all these
initializations is in bdev_alloc_inode()...



This works. We can also check for NULL where it crashes. But I do not 
know the code to make an informed decision...




Honza


---
  fs/block_dev.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e5b02f6606c..86fdc28d565e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -792,8 +792,10 @@ static void bdev_free_inode(struct inode *inode)
  static void init_once(void *data)
  {
struct bdev_inode *ei = data;
+   struct block_device *bdev = &ei->bdev;
  
  	inode_init_once(&ei->vfs_inode);

+   bdev->bd_bdi = &noop_backing_dev_info;
  }
  
  static void bdev_evict_inode(struct inode *inode)

--
2.17.1



--
Alexey


[RFC PATCH kernel] block: initialize block_device::bd_bdi for bdev_cache

2021-01-06 Thread Alexey Kardashevskiy
This is a workaround to fix a null dereference crash:

[cb01f840] cb01f880 (unreliable)
[cb01f880] c0769a3c bdev_evict_inode+0x21c/0x370
[cb01f8c0] c070bacc evict+0x11c/0x230
[cb01f900] c070c138 iput+0x2a8/0x4a0
[cb01f970] c06ff030 dentry_unlink_inode+0x220/0x250
[cb01f9b0] c07001c0 __dentry_kill+0x190/0x320
[cb01fa00] c0701fb8 dput+0x5e8/0x860
[cb01fa80] c0705848 shrink_dcache_for_umount+0x58/0x100
[cb01fb00] c06cf864 generic_shutdown_super+0x54/0x200
[cb01fb80] c06cfd48 kill_anon_super+0x38/0x60
[cb01fbc0] c06d12cc deactivate_locked_super+0xbc/0x110
[cb01fbf0] c06d13bc deactivate_super+0x9c/0xc0
[cb01fc20] c071a340 cleanup_mnt+0x1b0/0x250
[cb01fc80] c0278fa8 task_work_run+0xf8/0x180
[cb01fcd0] c002b4ac do_notify_resume+0x4dc/0x5d0
[cb01fda0] c004ba0c syscall_exit_prepare+0x28c/0x370
[cb01fe10] c000e06c system_call_common+0xfc/0x27c
--- Exception: c00 (System Call) at 10034890

Is this fixed properly already somewhere? Thanks,

Fixes: e6cb53827ed6 ("block: initialize struct block_device in bdev_alloc")
---
 fs/block_dev.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 3e5b02f6606c..86fdc28d565e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -792,8 +792,10 @@ static void bdev_free_inode(struct inode *inode)
 static void init_once(void *data)
 {
struct bdev_inode *ei = data;
+   struct block_device *bdev = &ei->bdev;
 
	inode_init_once(&ei->vfs_inode);
+   bdev->bd_bdi = &noop_backing_dev_info;
 }
 
 static void bdev_evict_inode(struct inode *inode)
-- 
2.17.1



Re: WARN_ON_ONCE

2020-12-03 Thread Alexey Kardashevskiy




On 04/12/2020 12:25, Michael Ellerman wrote:

Dmitry Vyukov  writes:

On Thu, Dec 3, 2020 at 10:19 AM Dmitry Vyukov  wrote:

On Thu, Dec 3, 2020 at 10:10 AM Alexey Kardashevskiy  wrote:


Hi!

Syzkaller triggered WARN_ON_ONCE at

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/tracepoint.c?h=v5.10-rc6#n266


===
static int tracepoint_add_func(struct tracepoint *tp,
struct tracepoint_func *func, int prio)
{
 struct tracepoint_func *old, *tp_funcs;
 int ret;

 if (tp->regfunc && !static_key_enabled(&tp->key)) {
 ret = tp->regfunc();
 if (ret < 0)
 return ret;
 }

 tp_funcs = rcu_dereference_protected(tp->funcs,
 lockdep_is_held(&tracepoints_mutex));
 old = func_add(&tp_funcs, func, prio);
 if (IS_ERR(old)) {
 WARN_ON_ONCE(PTR_ERR(old) != -ENOMEM);
 return PTR_ERR(old);
 }

===

What is the common approach here? Syzkaller reacts to this as if it were
a bug, but the WARN_ON_ONCE here seems intentional. Do we still push for
removing such warnings?


AFAICS it is a bug if that fires.

See the commit that added it:
   d66a270be331 ("tracepoint: Do not warn on ENOMEM")

Which says:
   Tracepoint should only warn when a kernel API user does not respect the
   required preconditions (e.g. same tracepoint enabled twice,



This says that the userspace can trigger the warning if it does not use 
the API right.




or called
   to remove a tracepoint that does not exist).
   
   Silence warning in out-of-memory conditions, given that the error is

   returned to the caller.


So if you're seeing it then you've someone caused it to return something
other than ENOMEM, and that is a bug.



This is a userspace bug which registers the same thing twice; the 
kernel returns the correct error. The question is whether it should warn via 
WARN_ON() or pr_err(). The comment in bug.h suggests pr_err() is the right 
way, is it not?
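
For illustration, a hedged sketch of the pr_err() alternative being discussed 
(not a posted patch), applied to the func_add() error path quoted above:

	old = func_add(&tp_funcs, func, prio);
	if (IS_ERR(old)) {
		/* Report API misuse without tripping panic_on_warn/fuzzers,
		 * but stay silent on -ENOMEM as commit d66a270be331 intended */
		if (PTR_ERR(old) != -ENOMEM)
			pr_err("tracepoint %s: func_add failed: %ld\n",
			       tp->name, PTR_ERR(old));
		return PTR_ERR(old);
	}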



--
Alexey


[PATCH kernel v2] serial_core: Check for port state when tty is in error state

2020-12-02 Thread Alexey Kardashevskiy
At the moment opening a serial device node (such as /dev/ttyS3)
succeeds even if there is no actual serial device behind it.
Reading/writing/ioctls fail as expected because the uart port is not
initialized (the type is PORT_UNKNOWN) and the TTY_IO_ERROR error state
bit is set for the tty.

However, setting the line discipline does not have these checks in
8250_port.c (8250 is the default choice made by univ8250_console_init()).
As a result of PORT_UNKNOWN, uart_port::iobase is NULL, which
a platform translates into some address, and accessing it produces a crash
like the one below.

This adds tty_port_initialized() to uart_set_ldisc() to prevent the crash.

Found by syzkaller.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* changed to tty_port_initialized() as suggested in
https://www.spinics.net/lists/linux-serial/msg39942.html (sorry for delay)

---
The example of crash on PPC64/pseries:

BUG: Unable to handle kernel data access on write at 0xc00a0001
Faulting instruction address: 0xc0c9c9cc
cpu 0x0: Vector: 300 (Data Access) at [cc6d7800]
pc: c0c9c9cc: io_serial_out+0xcc/0xf0
lr: c0c9c9b4: io_serial_out+0xb4/0xf0
sp: cc6d7a90
   msr: 80009033
   dar: c00a0001
 dsisr: 4200
  current = 0xccd22500
  paca= 0xc35c   irqmask: 0x03   irq_happened: 0x01
pid   = 1371, comm = syz-executor.0
Linux version 5.8.0-rc7-le-guest_syzkaller_a+fstn1 (aik@fstn1-p1) (gcc (Ubunt
untu) 2.30) #660 SMP Tue Jul 28 22:29:22 AEST 2020
enter ? for help
[cc6d7a90] c18a8cc0 _raw_spin_lock_irq+0xb0/0xe0 (unreliable)
[cc6d7ad0] c0c9bdc0 serial8250_do_set_ldisc+0x140/0x180
[cc6d7b10] c0c9bea4 serial8250_set_ldisc+0xa4/0xb0
[cc6d7b50] c0c91138 uart_set_ldisc+0xb8/0x160
[cc6d7b90] c0c5a22c tty_set_ldisc+0x23c/0x330
[cc6d7c20] c0c4c220 tty_ioctl+0x990/0x12f0
[cc6d7d20] c056357c ksys_ioctl+0x14c/0x180
[cc6d7d70] c05635f0 sys_ioctl+0x40/0x60
[cc6d7db0] c003b814 system_call_exception+0x1a4/0x330
[cc6d7e20] c000d368 system_call_common+0xe8/0x214
---
 drivers/tty/serial/serial_core.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/tty/serial/serial_core.c b/drivers/tty/serial/serial_core.c
index f41cba10b86b..828f9ad1be49 100644
--- a/drivers/tty/serial/serial_core.c
+++ b/drivers/tty/serial/serial_core.c
@@ -1467,6 +1467,10 @@ static void uart_set_ldisc(struct tty_struct *tty)
 {
struct uart_state *state = tty->driver_data;
struct uart_port *uport;
+   struct tty_port *port = &state->port;
+
+   if (!tty_port_initialized(port))
+   return;
 
	mutex_lock(&state->port.mutex);
uport = uart_port_check(state);
-- 
2.17.1



Re: [PATCH kernel] fs/io_ring: Fix lockdep warnings

2020-11-30 Thread Alexey Kardashevskiy




On 01/12/2020 03:34, Pavel Begunkov wrote:

On 30/11/2020 02:00, Alexey Kardashevskiy wrote:

There are a few potential deadlocks reported by lockdep and triggered by
syzkaller (a syscall fuzzer). These are reported because timer interrupts can
execute softirq handlers, and if we were executing certain bits of io_uring,
a deadlock can occur. This fixes those bits by disabling soft interrupts.


Jens already fixed that, thanks

https://lore.kernel.org/io-uring/948d2d3b-5f36-034d-28e6-7490343a5...@kernel.dk/T/#t


Oh good! I assumed it must be fixed somewhere but could not find it 
quickly in the lists.



FYI, your email got into spam.


Not good :-/ Wonder why... Can you please forward my mail in attachment 
for debugging (there should be nothing private, I guess)? spf should be 
alright, not sure what else I can do. Thanks,



--
Alexey


[PATCH kernel] fs/io_ring: Fix lockdep warnings

2020-11-29 Thread Alexey Kardashevskiy
There are a few potential deadlocks reported by lockdep and triggered by
syzkaller (a syscall fuzzer). These are reported because timer interrupts can
execute softirq handlers, and if we were executing certain bits of io_uring,
a deadlock can occur. This fixes those bits by disabling soft interrupts.

Signed-off-by: Alexey Kardashevskiy 
---

There are 2 reports.

Warning#1:


WARNING: inconsistent lock state
5.10.0-rc5_irqs_a+fstn1 #5 Not tainted

inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/14/0 [HC0[0]:SC1[1]:HE0:SE0] takes:
cb76f4a8 (&file_data->lock){+.?.}-{2:2}, at: 
io_file_data_ref_zero+0x58/0x300
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x2c4/0x5c0
  _raw_spin_lock+0x54/0x80
  sys_io_uring_register+0x1de0/0x2100
  system_call_exception+0x160/0x240
  system_call_common+0xf0/0x27c
irq event stamp: 4011767
hardirqs last  enabled at (4011766): [] 
_raw_spin_unlock_irqrestore+0x54/0x90
hardirqs last disabled at (4011767): [] 
_raw_spin_lock_irqsave+0x48/0xb0
softirqs last  enabled at (4011754): [] 
irq_enter_rcu+0xbc/0xc0
softirqs last disabled at (4011755): [] irq_exit+0x1d4/0x1e0

other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&file_data->lock);
  <Interrupt>
    lock(&file_data->lock);

 *** DEADLOCK ***

2 locks held by swapper/14/0:
 #0: c21cc3e8 (rcu_callback){}-{0:0}, at: rcu_core+0x2b0/0xfe0
 #1: c21cc358 (rcu_read_lock){}-{1:2}, at: 
percpu_ref_switch_to_atomic_rcu+0x148/0x400

stack backtrace:
CPU: 14 PID: 0 Comm: swapper/14 Not tainted 5.10.0-rc5_irqs_a+fstn1 #5
Call Trace:
[c97672c0] [c02b0268] print_usage_bug+0x3e8/0x3f0
[c9767360] [c02b0e88] mark_lock.part.48+0xc18/0xee0
[c9767480] [c02b1fb8] __lock_acquire+0xac8/0x21e0
[c97675d0] [c02b4454] lock_acquire+0x2c4/0x5c0
[c97676c0] [c167a38c] _raw_spin_lock_irqsave+0x7c/0xb0
[c9767700] [c07321b8] io_file_data_ref_zero+0x58/0x300
[c9767770] [c0be93e4] 
percpu_ref_switch_to_atomic_rcu+0x3f4/0x400
[c9767800] [c02fe0d4] rcu_core+0x314/0xfe0
[c97678b0] [c167b5b8] __do_softirq+0x198/0x6c0
[c97679d0] [c020ba84] irq_exit+0x1d4/0x1e0
[c9767a00] [c00301c8] timer_interrupt+0x1e8/0x600
[c9767a70] [c0009d84] decrementer_common_virt+0x1e4/0x1f0
--- interrupt: 900 at snooze_loop+0xf4/0x300
LR = snooze_loop+0xe4/0x300
[c9767dc0] [c111b010] cpuidle_enter_state+0x520/0x910
[c9767e30] [c111b4c8] cpuidle_enter+0x58/0x80
[c9767e70] [c026da0c] call_cpuidle+0x4c/0x90
[c9767e90] [c026de80] do_idle+0x320/0x3d0
[c9767f10] [c026e308] cpu_startup_entry+0x38/0x50
[c9767f40] [c006f624] start_secondary+0x304/0x320
[c9767f90] [c000cc54] start_secondary_prolog+0x10/0x14
systemd[1]: systemd-udevd.service: Got notification message from PID 195 
(WATCHDOG=1)
systemd-journald[175]: Sent WATCHDOG=1 notification.



Warning#2:

WARNING: inconsistent lock state
5.10.0-rc5_irqs_a+fstn1 #7 Not tainted

inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
swapper/7/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
cc64b7a8 (&file_data->lock){+.?.}-{2:2}, at: 
io_file_data_ref_zero+0x54/0x2d0
{SOFTIRQ-ON-W} state was registered at:
  lock_acquire+0x2c4/0x5c0
  _raw_spin_lock+0x54/0x80
  io_sqe_files_unregister+0x5c/0x200
  io_ring_exit_work+0x230/0x640
  process_one_work+0x428/0xab0
  worker_thread+0x94/0x770
  kthread+0x204/0x210
  ret_from_kernel_thread+0x5c/0x6c
irq event stamp: 3250736
hardirqs last  enabled at (3250736): [] 
_raw_spin_unlock_irqrestore+0x54/0x90
hardirqs last disabled at (3250735): [] 
_raw_spin_lock_irqsave+0x48/0xb0
softirqs last  enabled at (3250722): [] 
irq_enter_rcu+0xbc/0xc0
softirqs last disabled at (3250723): [] irq_exit+0x1d4/0x1e0

other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&file_data->lock);
  <Interrupt>
    lock(&file_data->lock);

 *** DEADLOCK ***

2 locks held by swapper/7/0:
 #0: c21cc3e8 (rcu_callback){}-{0:0}, at: rcu_core+0x2b0/0xfe0
 #1: c21cc358 (rcu_read_lock){}-{1:2}, at: 
percpu_ref_switch_to_atomic_rcu+0x148/0x400

stack backtrace:
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 5.10.0-rc5_irqs_a+fstn1 #7
Call Trace:
[c974b280] [c02b0268] print_usage_bug+0x3e8/0x3f0
[c974b320] [c02b0e88] mark_lock.part.48+0xc18/0xee0
[c974b440] [c02b1fb8] __lock_acquire+0xac8/0x21e0
[c974b590] [c02b4454] lock_acquire+0x2c4/0x5c0
[c974b680] [c167a074] _raw_spin_lock+0x54/0x80
[c974b6b0] [c07321b4] io_file_data_ref_zero+0x54/0x2d0
[c97

Re: [PATCH kernel v4 2/8] genirq/irqdomain: Clean legacy IRQ allocation

2020-11-24 Thread Alexey Kardashevskiy




On 11/24/20 8:19 PM, Andy Shevchenko wrote:

On Tue, Nov 24, 2020 at 8:20 AM Alexey Kardashevskiy  wrote:


There are 10 users of __irq_domain_alloc_irqs() and only one - IOAPIC -
passes realloc==true. There is no obvious reason for handling this
specific case in the generic code.

This splits out __irq_domain_alloc_irqs_data() to make it clear what
IOAPIC does and makes __irq_domain_alloc_irqs() cleaner.

This should cause no behavioral change.



+   ret = __irq_domain_alloc_irqs_data(domain, virq, nr_irqs, node, arg, 
affinity);
+   if (ret <= 0)
 goto out_free_desc;


Was or wasn't 0 considered as error code previously?


Oh. I need to clean this up; the idea is that since this does not allocate 
IRQs, it should return an error code and not an irq. I'll make this explicit.
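
Something along these lines, i.e. treating the return value purely as an error 
code (a sketch of the intent, not the actual respin):

	/* __irq_domain_alloc_irqs_data() does not allocate descriptors, so it
	 * returns 0 on success or a negative errno - never a virq */
	ret = __irq_domain_alloc_irqs_data(domain, virq, nr_irqs, node, arg, affinity);
	if (ret < 0)
		goto out_free_desc;

	return virq;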





 return virq;



  out_free_desc:
 irq_free_descs(virq, nr_irqs);
 return ret;




--
Alexey


[PATCH kernel v4 8/8] powerpc/pci: Remove LSI mappings on device teardown

2020-11-23 Thread Alexey Kardashevskiy
From: Oliver O'Halloran 

When a passthrough IO adapter is removed from a pseries machine using hash
MMU and the XIVE interrupt mode, the POWER hypervisor expects the guest OS
to clear all page table entries related to the adapter. If some are still
present, the RTAS call which isolates the PCI slot returns error 9001
"valid outstanding translations" and the removal of the IO adapter fails.
This is because, when the PHBs are scanned, Linux automatically maps the
INTx interrupts into the Linux interrupt number space, but these are never
removed.

This problem can be fixed by adding the corresponding unmap operation when
the device is removed. There's no pcibios_* hook for the remove case, but
the same effect can be achieved using a bus notifier.

Signed-off-by: Oliver O'Halloran 
Reviewed-by: Cédric Le Goater 
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/pci-common.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index be108616a721..95f4e173368a 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -404,6 +404,27 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
return 0;
 }
 
+static int ppc_pci_unmap_irq_line(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct pci_dev *pdev = to_pci_dev(data);
+
+   if (action == BUS_NOTIFY_DEL_DEVICE)
+   irq_dispose_mapping(pdev->irq);
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block ppc_pci_unmap_irq_notifier = {
+   .notifier_call = ppc_pci_unmap_irq_line,
+};
+
+static int ppc_pci_register_irq_notifier(void)
+{
+   return bus_register_notifier(&pci_bus_type, &ppc_pci_unmap_irq_notifier);
+}
+arch_initcall(ppc_pci_register_irq_notifier);
+
 /*
  * Platform support for /proc/bus/pci/X/Y mmap()s.
  *  -- paulus.
-- 
2.17.1



[PATCH kernel v4 5/8] genirq: Add free_irq hook for IRQ descriptor and use for mapping disposal

2020-11-23 Thread Alexey Kardashevskiy
We want to make the irq_desc.kobj's release hook free associated resources
but we do not want to pollute the irqdesc code with domains.

This adds a free_irq hook which is called when the last reference to
the descriptor is dropped.

The first user is mapped irqs. This potentially can break the existing
users; however they seem to do the right thing and call dispose once
per mapping.

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/irqdesc.h|  1 +
 include/linux/irqdomain.h  |  2 --
 include/linux/irqhandler.h |  1 +
 kernel/irq/irqdesc.c   |  3 +++
 kernel/irq/irqdomain.c | 14 --
 5 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index 5745491303e0..6d44cb6a20ad 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -57,6 +57,7 @@ struct irq_desc {
struct irq_data irq_data;
unsigned int __percpu   *kstat_irqs;
irq_flow_handler_t  handle_irq;
+   irq_free_handler_t  free_irq;
	struct irqaction	*action;	/* IRQ action list */
	unsigned int		status_use_accessors;
	unsigned int		core_internal_state__do_not_mess_with_it;
diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index a353b93ddf9e..ccca87cd3d15 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -381,8 +381,6 @@ extern int irq_domain_associate(struct irq_domain *domain, 
unsigned int irq,
 extern void irq_domain_associate_many(struct irq_domain *domain,
  unsigned int irq_base,
  irq_hw_number_t hwirq_base, int count);
-extern void irq_domain_disassociate(struct irq_domain *domain,
-   unsigned int irq);
 
 extern unsigned int irq_create_mapping(struct irq_domain *host,
   irq_hw_number_t hwirq);
diff --git a/include/linux/irqhandler.h b/include/linux/irqhandler.h
index c30f454a9518..3dbc2bb764f3 100644
--- a/include/linux/irqhandler.h
+++ b/include/linux/irqhandler.h
@@ -10,5 +10,6 @@
 struct irq_desc;
 struct irq_data;
 typedef	void (*irq_flow_handler_t)(struct irq_desc *desc);
+typedef	void (*irq_free_handler_t)(struct irq_desc *desc);
 
 #endif
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 75374b7944b5..071363da8688 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -427,6 +427,9 @@ static void irq_kobj_release(struct kobject *kobj)
struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
unsigned int irq = desc->irq_data.irq;
 
+   if (desc->free_irq)
+   desc->free_irq(desc);
+
irq_remove_debugfs_entry(desc);
unregister_irq_proc(irq, desc);
 
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 805478f81d96..4779d912bb86 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -485,7 +485,7 @@ static void irq_domain_set_mapping(struct irq_domain 
*domain,
}
 }
 
-void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)
+static void irq_domain_disassociate(struct irq_domain *domain, unsigned int 
irq)
 {
struct irq_data *irq_data = irq_get_irq_data(irq);
irq_hw_number_t hwirq;
@@ -582,6 +582,13 @@ void irq_domain_associate_many(struct irq_domain *domain, 
unsigned int irq_base,
 }
 EXPORT_SYMBOL_GPL(irq_domain_associate_many);
 
+static void irq_mapped_free_desc(struct irq_desc *desc)
+{
+   unsigned int virq = desc->irq_data.irq;
+
+   irq_domain_disassociate(desc->irq_data.domain, virq);
+}
+
 /**
  * irq_create_direct_mapping() - Allocate an irq for direct mapping
  * @domain: domain to allocate the irq for or NULL for default domain
@@ -638,6 +645,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 {
struct device_node *of_node;
int virq;
+   struct irq_desc *desc;
 
pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq);
 
@@ -674,6 +682,9 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
pr_debug("irq %lu on domain %s mapped to virtual irq %u\n",
hwirq, of_node_full_name(of_node), virq);
 
+   desc = irq_to_desc(virq);
+   desc->free_irq = irq_mapped_free_desc;
+
return virq;
 }
 EXPORT_SYMBOL_GPL(irq_create_mapping);
@@ -865,7 +876,6 @@ void irq_dispose_mapping(unsigned int virq)
if (irq_domain_is_hierarchy(domain)) {
irq_domain_free_irqs(virq, 1);
} else {
-   irq_domain_disassociate(domain, virq);
irq_free_desc(virq);
}
 }
-- 
2.17.1



[PATCH kernel v4 4/8] genirq: Free IRQ descriptor via embedded kobject

2020-11-23 Thread Alexey Kardashevskiy
At the moment the IRQ descriptor is freed via the free_desc() helper.
We want to add reference counting to IRQ descriptors and there is already
kobj embedded into irq_desc which we want to reuse.

This shuffles free_desc()/etc to make it simply call kobject_put() and
moves all the cleanup into the kobject_release hook.

As a bonus, we do not need irq_sysfs_del() as kobj removes itself from
sysfs if it knows that it was added.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/irq/irqdesc.c | 42 --
 1 file changed, 12 insertions(+), 30 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..75374b7944b5 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -295,18 +295,6 @@ static void irq_sysfs_add(int irq, struct irq_desc *desc)
}
 }
 
-static void irq_sysfs_del(struct irq_desc *desc)
-{
-   /*
-* If irq_sysfs_init() has not yet been invoked (early boot), then
-* irq_kobj_base is NULL and the descriptor was never added.
-* kobject_del() complains about a object with no parent, so make
-* it conditional.
-*/
-   if (irq_kobj_base)
-   kobject_del(&desc->kobj);
-}
-
 static int __init irq_sysfs_init(void)
 {
struct irq_desc *desc;
@@ -337,7 +325,6 @@ static struct kobj_type irq_kobj_type = {
 };
 
 static void irq_sysfs_add(int irq, struct irq_desc *desc) {}
-static void irq_sysfs_del(struct irq_desc *desc) {}
 
 #endif /* CONFIG_SYSFS */
 
@@ -419,39 +406,34 @@ static struct irq_desc *alloc_desc(int irq, int node, 
unsigned int flags,
return NULL;
 }
 
-static void irq_kobj_release(struct kobject *kobj)
-{
-   struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
-
-   free_masks(desc);
-   free_percpu(desc->kstat_irqs);
-   kfree(desc);
-}
-
 static void delayed_free_desc(struct rcu_head *rhp)
 {
struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
 
+   free_masks(desc);
+   free_percpu(desc->kstat_irqs);
+   kfree(desc);
+}
+
+static void free_desc(unsigned int irq)
+{
+   struct irq_desc *desc = irq_to_desc(irq);
+
	kobject_put(&desc->kobj);
 }
 
-static void free_desc(unsigned int irq)
+static void irq_kobj_release(struct kobject *kobj)
 {
-   struct irq_desc *desc = irq_to_desc(irq);
+   struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+   unsigned int irq = desc->irq_data.irq;
 
irq_remove_debugfs_entry(desc);
unregister_irq_proc(irq, desc);
 
/*
-* sparse_irq_lock protects also show_interrupts() and
-* kstat_irq_usr(). Once we deleted the descriptor from the
-* sparse tree we can free it. Access in proc will fail to
-* lookup the descriptor.
-*
 * The sysfs entry must be serialized against a concurrent
 * irq_sysfs_init() as well.
 */
-   irq_sysfs_del(desc);
delete_irq_desc(irq);
 
/*
-- 
2.17.1



[PATCH kernel v4 7/8] genirq/irqdomain: Reference irq_desc for already mapped irqs

2020-11-23 Thread Alexey Kardashevskiy
This takes a reference on an irq_desc when an already mapped interrupt is
requested to be mapped again. This happens for PCI legacy interrupts where 4 interrupts are
shared among all devices on the same PCI host bus adapter.

From now on, the users shall call irq_dispose_mapping() for every
irq_create_fwspec_mapping(). Most (all?) users do not bother with
disposing, though, so it is not very likely to break many things.

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/irq/irqdomain.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index a0a81cc6c524..07f4bde87de5 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -663,7 +663,9 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
/* Check if mapping already exists */
virq = irq_find_mapping(domain, hwirq);
if (virq) {
+   desc = irq_to_desc(virq);
pr_debug("-> existing mapping on virq %d\n", virq);
+   kobject_get(&desc->kobj);
return virq;
}
 
@@ -762,6 +764,7 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec 
*fwspec)
irq_hw_number_t hwirq;
unsigned int type = IRQ_TYPE_NONE;
int virq;
+   struct irq_desc *desc;
 
if (fwspec->fwnode) {
domain = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_WIRED);
@@ -798,8 +801,11 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec 
*fwspec)
 * current trigger type then we are done so return the
 * interrupt number.
 */
-   if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq))
+   if (type == IRQ_TYPE_NONE || type == 
irq_get_trigger_type(virq)) {
+   desc = irq_to_desc(virq);
+   kobject_get(&desc->kobj);
return virq;
+   }
 
/*
 * If the trigger type has not been set yet, then set
@@ -811,6 +817,8 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec 
*fwspec)
return 0;
 
irqd_set_trigger_type(irq_data, type);
+   desc = irq_to_desc(virq);
+   kobject_get(&desc->kobj);
return virq;
}
 
-- 
2.17.1



[PATCH kernel v4 1/8] genirq/ipi: Simplify irq_reserve_ipi

2020-11-23 Thread Alexey Kardashevskiy
__irq_domain_alloc_irqs() can already handle virq==-1 and free
descriptors if it failed allocating hardware interrupts so let's skip
this extra step.

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/irq/ipi.c | 16 +++-
 1 file changed, 3 insertions(+), 13 deletions(-)

diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 43e3d1be622c..1b2807318ea9 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -75,18 +75,12 @@ int irq_reserve_ipi(struct irq_domain *domain,
}
}
 
-   virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
-   if (virq <= 0) {
-   pr_warn("Can't reserve IPI, failed to alloc descs\n");
-   return -ENOMEM;
-   }
-
-   virq = __irq_domain_alloc_irqs(domain, virq, nr_irqs, NUMA_NO_NODE,
-  (void *) dest, true, NULL);
+   virq = __irq_domain_alloc_irqs(domain, -1, nr_irqs, NUMA_NO_NODE,
+  (void *) dest, false, NULL);
 
if (virq <= 0) {
pr_warn("Can't reserve IPI, failed to alloc hw irqs\n");
-   goto free_descs;
+   return -EBUSY;
}
 
for (i = 0; i < nr_irqs; i++) {
@@ -96,10 +90,6 @@ int irq_reserve_ipi(struct irq_domain *domain,
irq_set_status_flags(virq + i, IRQ_NO_BALANCING);
}
return virq;
-
-free_descs:
-   irq_free_descs(virq, nr_irqs);
-   return -EBUSY;
 }
 
 /**
-- 
2.17.1



[PATCH kernel v4 6/8] genirq/irqdomain: Move hierarchical IRQ cleanup to kobject_release

2020-11-23 Thread Alexey Kardashevskiy
This moves hierarchical domain's irqs cleanup into the kobject release
hook to make irq_domain_free_irqs() as simple as kobject_put.

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/irq/irqdomain.c | 43 +-
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index 4779d912bb86..a0a81cc6c524 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -863,21 +863,9 @@ EXPORT_SYMBOL_GPL(irq_create_of_mapping);
  */
 void irq_dispose_mapping(unsigned int virq)
 {
-   struct irq_data *irq_data = irq_get_irq_data(virq);
-   struct irq_domain *domain;
+   struct irq_desc *desc = irq_to_desc(virq);
 
-   if (!virq || !irq_data)
-   return;
-
-   domain = irq_data->domain;
-   if (WARN_ON(domain == NULL))
-   return;
-
-   if (irq_domain_is_hierarchy(domain)) {
-   irq_domain_free_irqs(virq, 1);
-   } else {
-   irq_free_desc(virq);
-   }
+   kobject_put(&desc->kobj);
 }
 EXPORT_SYMBOL_GPL(irq_dispose_mapping);
 
@@ -1396,6 +1384,19 @@ int irq_domain_alloc_irqs_hierarchy(struct irq_domain 
*domain,
return domain->ops->alloc(domain, irq_base, nr_irqs, arg);
 }
 
+static void irq_domain_hierarchy_free_desc(struct irq_desc *desc)
+{
+   unsigned int virq = desc->irq_data.irq;
+   struct irq_data *data = irq_get_irq_data(virq);
+
+   mutex_lock(&irq_domain_mutex);
+   irq_domain_remove_irq(virq);
+   irq_domain_free_irqs_hierarchy(data->domain, virq, 1);
+   mutex_unlock(&irq_domain_mutex);
+
+   irq_domain_free_irq_data(virq, 1);
+}
+
 int __irq_domain_alloc_irqs_data(struct irq_domain *domain, int virq,
 unsigned int nr_irqs, int node, void *arg,
 const struct irq_affinity_desc *affinity)
@@ -1430,7 +1431,10 @@ int __irq_domain_alloc_irqs_data(struct irq_domain 
*domain, int virq,
}
 
for (i = 0; i < nr_irqs; i++) {
+   struct irq_desc *desc = irq_to_desc(virq + i);
+
irq_domain_insert_irq(virq + i);
+   desc->free_irq = irq_domain_hierarchy_free_desc;
}
	mutex_unlock(&irq_domain_mutex);
 
@@ -1675,14 +1679,11 @@ void irq_domain_free_irqs(unsigned int virq, unsigned 
int nr_irqs)
 "NULL pointer, cannot free irq\n"))
return;
 
-   mutex_lock(&irq_domain_mutex);
-   for (i = 0; i < nr_irqs; i++)
-   irq_domain_remove_irq(virq + i);
-   irq_domain_free_irqs_hierarchy(data->domain, virq, nr_irqs);
-   mutex_unlock(&irq_domain_mutex);
+   for (i = 0; i < nr_irqs; i++) {
+   struct irq_desc *desc = irq_to_desc(virq + i);
 
-   irq_domain_free_irq_data(virq, nr_irqs);
-   irq_free_descs(virq, nr_irqs);
+   kobject_put(&desc->kobj);
+   }
 }
 
 /**
-- 
2.17.1



[PATCH kernel v4 0/8] genirq/irqdomain: Add reference counting to IRQs

2020-11-23 Thread Alexey Kardashevskiy
This is another attempt to add reference counting to IRQ
descriptors; or - more to the point - reuse already existing
kobj from irq_desc. This allows the same IRQ to be used several
times (such as legacy PCI INTx) and when disposing those, only
the last reference drop clears the hardware mappings.
Domains do not add references to irq_desc as the whole point of
this exercise is to move actual cleanup in hardware to
the last reference drop. This only changes sparse interrupts
(no idea about the other case yet).
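
To illustrate the intended lifetime model (a hedged sketch of the usage 
pattern, not code from the series), two devices sharing one legacy INTx line 
would behave roughly like this:

	/* Both callers get the same virq; each mapping takes a kobj reference */
	unsigned int virq_a = irq_create_mapping(domain, hwirq);	/* refcount 1 */
	unsigned int virq_b = irq_create_mapping(domain, hwirq);	/* refcount 2, virq_b == virq_a */

	/* The first dispose only drops a reference, the mapping stays alive */
	irq_dispose_mapping(virq_a);
	/* The last dispose releases the kobj and only then clears the HW mapping */
	irq_dispose_mapping(virq_b);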

No changelog as it is all completely rewritten. I am still running
tests but I hope this demonstrates the idea.

Some context from Cedric:
The background context for such a need is that the POWER9 and POWER10
processors have a new XIVE interrupt controller which uses MMIO pages
for interrupt management. Each interrupt has a pair of pages which are
required to be unmapped in some environment, like PHB removal. And so,
all interrupts need to be unmapped.

1/8 .. 3/8 remove the confusing "realloc" which is not strictly required,
but I was touching this anyway and legacy interrupts should probably use
the new counting anyway;

4/8 .. 6/8 is reordering irq_desc disposal;

7/8 adds extra references (probably missed other places);

8/8 is the fix for the original XIVE bug; it is here for demonstration.

I am cc'ing ppc list so people can pull the patches from that patchworks.

This is based on sha1
418baf2c28f3 Linus Torvalds "Linux 5.10-rc5".

and pushed out to
https://github.com/aik/linux/commits/irqs
sha1 3955f97c448242f6a

Please comment. Thanks.


Alexey Kardashevskiy (7):
  genirq/ipi:  Simplify irq_reserve_ipi
  genirq/irqdomain: Clean legacy IRQ allocation
  genirq/irqdomain: Drop unused realloc parameter from
__irq_domain_alloc_irqs
  genirq: Free IRQ descriptor via embedded kobject
  genirq: Add free_irq hook for IRQ descriptor and use for mapping
disposal
  genirq/irqdomain: Move hierarchical IRQ cleanup to kobject_release
  genirq/irqdomain: Reference irq_desc for already mapped irqs

Oliver O'Halloran (1):
  powerpc/pci: Remove LSI mappings on device teardown

 include/linux/irqdesc.h |   1 +
 include/linux/irqdomain.h   |   9 +-
 include/linux/irqhandler.h  |   1 +
 arch/powerpc/kernel/pci-common.c|  21 
 arch/x86/kernel/apic/io_apic.c  |  13 ++-
 drivers/gpio/gpiolib.c  |   1 -
 drivers/irqchip/irq-armada-370-xp.c |   2 +-
 drivers/irqchip/irq-bcm2836.c   |   3 +-
 drivers/irqchip/irq-gic-v3.c|   3 +-
 drivers/irqchip/irq-gic-v4.c|   6 +-
 drivers/irqchip/irq-gic.c   |   3 +-
 drivers/irqchip/irq-ixp4xx.c|   1 -
 kernel/irq/ipi.c|  16 +--
 kernel/irq/irqdesc.c|  45 +++-
 kernel/irq/irqdomain.c  | 160 +---
 kernel/irq/msi.c|   2 +-
 16 files changed, 158 insertions(+), 129 deletions(-)

-- 
2.17.1



[PATCH kernel v4 2/8] genirq/irqdomain: Clean legacy IRQ allocation

2020-11-23 Thread Alexey Kardashevskiy
There are 10 users of __irq_domain_alloc_irqs() and only one - IOAPIC -
passes realloc==true. There is no obvious reason for handling this
specific case in the generic code.

This splits out __irq_domain_alloc_irqs_data() to make it clear what
IOAPIC does and makes __irq_domain_alloc_irqs() cleaner.

This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/irqdomain.h  |  3 ++
 arch/x86/kernel/apic/io_apic.c | 13 +++--
 kernel/irq/irqdomain.c | 89 --
 3 files changed, 65 insertions(+), 40 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 71535e87109f..6cc37bba9951 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -470,6 +470,9 @@ static inline struct irq_domain 
*irq_domain_add_hierarchy(struct irq_domain *par
   ops, host_data);
 }
 
+extern int __irq_domain_alloc_irqs_data(struct irq_domain *domain, int virq,
+   unsigned int nr_irqs, int node, void 
*arg,
+   const struct irq_affinity_desc 
*affinity);
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
   unsigned int nr_irqs, int node, void *arg,
   bool realloc,
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 7b3c7e0d4a09..df9c0ab3a119 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -970,9 +970,14 @@ static int alloc_irq_from_domain(struct irq_domain 
*domain, int ioapic, u32 gsi,
return -1;
}
 
-   return __irq_domain_alloc_irqs(domain, irq, 1,
-  ioapic_alloc_attr_node(info),
-  info, legacy, NULL);
+   if (irq == -1 || !legacy)
+   return __irq_domain_alloc_irqs(domain, irq, 1,
+  ioapic_alloc_attr_node(info),
+  info, false, NULL);
+
+   return __irq_domain_alloc_irqs_data(domain, irq, 1,
+   ioapic_alloc_attr_node(info),
+   info, NULL);
 }
 
 /*
@@ -1006,7 +1011,7 @@ static int alloc_isa_irq_from_domain(struct irq_domain 
*domain,
return -ENOMEM;
} else {
info->flags |= X86_IRQ_ALLOC_LEGACY;
-   irq = __irq_domain_alloc_irqs(domain, irq, 1, node, info, true,
+   irq = __irq_domain_alloc_irqs_data(domain, irq, 1, node, info,
  NULL);
if (irq >= 0) {
irq_data = irq_domain_get_irq_data(domain, irq);
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..ca5c78366c85 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -1386,6 +1386,51 @@ int irq_domain_alloc_irqs_hierarchy(struct irq_domain 
*domain,
return domain->ops->alloc(domain, irq_base, nr_irqs, arg);
 }
 
+int __irq_domain_alloc_irqs_data(struct irq_domain *domain, int virq,
+unsigned int nr_irqs, int node, void *arg,
+const struct irq_affinity_desc *affinity)
+{
+   int i, ret;
+
+   if (domain == NULL) {
+   domain = irq_default_domain;
+   if (WARN(!domain, "domain is NULL; cannot allocate IRQ\n"))
+   return -EINVAL;
+   }
+
+   if (irq_domain_alloc_irq_data(domain, virq, nr_irqs)) {
+   pr_debug("cannot allocate memory for IRQ%d\n", virq);
+   ret = -ENOMEM;
+   goto out_free_irq_data;
+   }
+
+   mutex_lock(&irq_domain_mutex);
+   ret = irq_domain_alloc_irqs_hierarchy(domain, virq, nr_irqs, arg);
+   if (ret < 0) {
+   mutex_unlock(&irq_domain_mutex);
+   goto out_free_irq_data;
+   }
+
+   for (i = 0; i < nr_irqs; i++) {
+   ret = irq_domain_trim_hierarchy(virq + i);
+   if (ret) {
+   mutex_unlock(&irq_domain_mutex);
+   goto out_free_irq_data;
+   }
+   }
+
+   for (i = 0; i < nr_irqs; i++) {
+   irq_domain_insert_irq(virq + i);
+   }
+   mutex_unlock(&irq_domain_mutex);
+
+   return virq;
+
+out_free_irq_data:
+   irq_domain_free_irq_data(virq, nr_irqs);
+   return ret;
+}
+
 /**
  * __irq_domain_alloc_irqs - Allocate IRQs from domain
  * @domain:domain to allocate from
@@ -1412,7 +1457,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, 
int irq_base,
unsigned int nr_irqs, int node, void *arg,
bool realloc, const struct irq_affinity_desc 
*affinity)
 {
- 

[PATCH kernel v4 3/8] genirq/irqdomain: Drop unused realloc parameter from __irq_domain_alloc_irqs

2020-11-23 Thread Alexey Kardashevskiy
The two previous patches made @realloc obsolete. This finishes removing it.

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/irqdomain.h   | 4 +---
 arch/x86/kernel/apic/io_apic.c  | 2 +-
 drivers/gpio/gpiolib.c  | 1 -
 drivers/irqchip/irq-armada-370-xp.c | 2 +-
 drivers/irqchip/irq-bcm2836.c   | 3 +--
 drivers/irqchip/irq-gic-v3.c| 3 +--
 drivers/irqchip/irq-gic-v4.c| 6 ++
 drivers/irqchip/irq-gic.c   | 3 +--
 drivers/irqchip/irq-ixp4xx.c| 1 -
 kernel/irq/ipi.c| 2 +-
 kernel/irq/irqdomain.c  | 4 +---
 kernel/irq/msi.c| 2 +-
 12 files changed, 11 insertions(+), 22 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 6cc37bba9951..a353b93ddf9e 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -475,7 +475,6 @@ extern int __irq_domain_alloc_irqs_data(struct irq_domain 
*domain, int virq,
const struct irq_affinity_desc 
*affinity);
 extern int __irq_domain_alloc_irqs(struct irq_domain *domain, int irq_base,
   unsigned int nr_irqs, int node, void *arg,
-  bool realloc,
   const struct irq_affinity_desc *affinity);
 extern void irq_domain_free_irqs(unsigned int virq, unsigned int nr_irqs);
 extern int irq_domain_activate_irq(struct irq_data *irq_data, bool early);
@@ -484,8 +483,7 @@ extern void irq_domain_deactivate_irq(struct irq_data 
*irq_data);
 static inline int irq_domain_alloc_irqs(struct irq_domain *domain,
unsigned int nr_irqs, int node, void *arg)
 {
-   return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, false,
-  NULL);
+   return __irq_domain_alloc_irqs(domain, -1, nr_irqs, node, arg, NULL);
 }
 
 extern int irq_domain_alloc_irqs_hierarchy(struct irq_domain *domain,
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index df9c0ab3a119..5b45f0874571 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -973,7 +973,7 @@ static int alloc_irq_from_domain(struct irq_domain *domain, 
int ioapic, u32 gsi,
if (irq == -1 || !legacy)
return __irq_domain_alloc_irqs(domain, irq, 1,
   ioapic_alloc_attr_node(info),
-  info, false, NULL);
+  info, NULL);
 
return __irq_domain_alloc_irqs_data(domain, irq, 1,
ioapic_alloc_attr_node(info),
diff --git a/drivers/gpio/gpiolib.c b/drivers/gpio/gpiolib.c
index 089ddcaa9bc6..b7cfecb5c701 100644
--- a/drivers/gpio/gpiolib.c
+++ b/drivers/gpio/gpiolib.c
@@ -1059,7 +1059,6 @@ static void gpiochip_set_hierarchical_irqchip(struct 
gpio_chip *gc,
  1,
  NUMA_NO_NODE,
  &fwspec,
- false,
  NULL);
if (ret < 0) {
chip_err(gc,
diff --git a/drivers/irqchip/irq-armada-370-xp.c 
b/drivers/irqchip/irq-armada-370-xp.c
index d7eb2e93db8f..bf17eb312669 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -431,7 +431,7 @@ static __init void armada_xp_ipi_init(struct device_node 
*node)
 
irq_domain_update_bus_token(ipi_domain, DOMAIN_BUS_IPI);
base_ipi = __irq_domain_alloc_irqs(ipi_domain, -1, IPI_DOORBELL_END,
-  NUMA_NO_NODE, NULL, false, NULL);
+  NUMA_NO_NODE, NULL, NULL);
if (WARN_ON(!base_ipi))
return;
 
diff --git a/drivers/irqchip/irq-bcm2836.c b/drivers/irqchip/irq-bcm2836.c
index cbc7c740e4dc..fe9ff90940d3 100644
--- a/drivers/irqchip/irq-bcm2836.c
+++ b/drivers/irqchip/irq-bcm2836.c
@@ -269,8 +269,7 @@ static void __init bcm2836_arm_irqchip_smp_init(void)
irq_domain_update_bus_token(ipi_domain, DOMAIN_BUS_IPI);
 
base_ipi = __irq_domain_alloc_irqs(ipi_domain, -1, BITS_PER_MBOX,
-  NUMA_NO_NODE, NULL,
-  false, NULL);
+  NUMA_NO_NODE, NULL, NULL);
 
if (WARN_ON(!base_ipi))
return;
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 16fecc0febe8..ff20fd54921f 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1163,8 +1163,7 @@ static void __init gic_smp_init(void)
 
/* Register all 8 non-secure SGIs */
base_sgi = __irq_domain_alloc_i

Re: [PATCH kernel v3] genirq/irqdomain: Add reference counting to IRQs

2020-11-13 Thread Alexey Kardashevskiy




On 14/11/2020 05:19, Cédric Le Goater wrote:

On 11/9/20 10:46 AM, Alexey Kardashevskiy wrote:

PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem with that is that these interrupts are
shared and when performing hot unplug, we need to unmap the interrupt
only when the last device is released.


The background context for such a need is that the POWER9 and POWER10
processors have a new XIVE interrupt controller which uses MMIO pages
for interrupt management. Each interrupt has a pair of pages which are
required to be unmapped in some environment, like PHB removal. And so,
all interrupts need to be unmapped.



This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

As kobject_put() is called directly now (not via RCU), it can also handle
the early boot case (irq_kobj_base==NULL) with the help of
the kobject::state_in_sysfs flag and without additional irq_sysfs_del().


Could this change be done in a following patch ?


No. Before this patch, we remove from sysfs -  call kobject_del() - 
before calling kobject_put() which we do via RCU. After the patch, this 
kobject_del() is called from the very last kobject_put() and when we get 
to this release handler - the sysfs node is already removed and we get a 
message about the missing parent.
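
In other words, a simplified sketch of the ordering (assuming kobject_del() is 
what removes the sysfs node):

	/* Before the patch:
	 *   free_desc(irq)
	 *     -> irq_sysfs_del(desc)          kobject_del(): sysfs node removed here
	 *     -> delayed_free_desc() via RCU  eventually drops the kobject
	 *
	 * After the patch:
	 *   free_desc(irq)
	 *     -> kobject_put(&desc->kobj)
	 *          -> irq_kobj_release()      runs only on the last reference drop;
	 *                                     the kobject core has already removed
	 *                                     the sysfs node, so calling kobject_del()
	 *                                     again would warn about a missing parent
	 */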




While at this, clean up the comment at where irq_sysfs_del() was called.>
Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;


Some ARM drivers call directly irq_alloc_descs() and irq_free_descs().
Is  that a problem ?


Kind of. I'll need to go through these places and replace 
irq_free_descs() with kobject_put() (maybe via some wrapper, or maybe 
change irq_free_descs() to do kobject_put() itself).
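
A rough sketch of the second option, i.e. making irq_free_descs() itself drop 
references (illustrative only - it ignores the allocated_irqs bookkeeping and 
locking the real function also does):

void irq_free_descs(unsigned int from, unsigned int cnt)
{
	unsigned int i;

	for (i = 0; i < cnt; i++) {
		struct irq_desc *desc = irq_to_desc(from + i);

		/* Drop the reference; irq_kobj_release() does the real cleanup
		 * once the last user is gone */
		if (desc)
			kobject_put(&desc->kobj);
	}
}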




platforms tend to dispose only if setup failed and the rest seems
calling one dispose per one mapping. Except (at least) PPC/pseries
which needs https://lkml.org/lkml/2020/10/27/259

Cc: Cédric Le Goater 
Cc: Marc Zyngier 
Cc: Michael Ellerman 
Cc: Qian Cai 
Cc: Rob Herring 
Cc: Frederic Barrat 
Cc: Michal Suchánek 
Cc: Thomas Gleixner 
Signed-off-by: Alexey Kardashevskiy 


I used this patch and the ppc one doing the LSI removal:

   
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20201027090655.14118-3-...@ozlabs.ru/

on different P10 and P9 systems, on a large system (>1K HW theads),
KVM guests and pSeries machines. Checked that PHB removal was OK.
  
Tested-by: Cédric Le Goater 


But IRQ subsystem covers much more than these systems.


Indeed. But doing our own powerpc-only reference counting on top of 
irq_desc is just ugly.





Some comments below,


---

This is what it is fixing for powerpc:

There was a comment about whether hierarchical IRQ domains should
contribute to this reference counter and I need some help here as
I cannot see why.
It is reverse now - IRQs contribute to domain->mapcount and
irq_domain_associate/irq_domain_disassociate take necessary steps to
keep this counter in order. What might be missing is that if we have
cascade of IRQs (as in the IOAPIC example from
Documentation/core-api/irq/irq-domain.rst ), then a parent IRQ should
contribute to the children IRQs and it is up to
irq_domain_ops::alloc/free hooks, and they all seem to be eventually
calling irq_domain_alloc_irqs_xxx/irq_domain_free_irqs_xxx which seems
right.

Documentation/core-api/irq/irq-domain.rst also suggests there is a lot
to see in debugfs about IRQs but on my thinkpad there is nothing about 
hierarchy.

So I'll ask again :)

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?



---
Changes:
v3:
* removed very wrong kobject_get/_put from irq_domain_associate/
irq_domain_disassociate as these are called from kobject_release so
irq_descs were never actually released
* removed irq_sysfs_del as 1) we do not seem to need it with changed
counting  2) produces a "no parent" warning as it would be called from
kobject_release which removes sysfs nodes itself

v2:
* added more get/put, including irq_domain_associate/irq_domain_disassociate
---
  kernel/irq/irqdesc.c   | 55 ++
  kernel/irq/irqdomain.c | 37 
  2 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..79c904ebfd5c 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -295,18 +295,6 @@ static void irq_sysfs

Re: [PATCH kernel v3] genirq/irqdomain: Add reference counting to IRQs

2020-11-13 Thread Alexey Kardashevskiy




On 14/11/2020 05:34, Marc Zyngier wrote:

Hi Alexey,

On 2020-11-09 09:46, Alexey Kardashevskiy wrote:

PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem with that is that these interrupts are
shared and when performing hot unplug, we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

As kobject_put() is called directly now (not via RCU), it can also handle
the early boot case (irq_kobj_base==NULL) with the help of
the kobject::state_in_sysfs flag and without additional irq_sysfs_del().
While at this, clean up the comment at where irq_sysfs_del() was called.

Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;
platforms tend to dispose only if setup failed and the rest seems
calling one dispose per one mapping. Except (at least) PPC/pseries
which needs https://lkml.org/lkml/2020/10/27/259

Cc: Cédric Le Goater 
Cc: Marc Zyngier 
Cc: Michael Ellerman 
Cc: Qian Cai 
Cc: Rob Herring 
Cc: Frederic Barrat 
Cc: Michal Suchánek 
Cc: Thomas Gleixner 
Signed-off-by: Alexey Kardashevskiy 
---

This is what it is fixing for powerpc:

There was a comment about whether hierarchical IRQ domains should
contribute to this reference counter and I need some help here as
I cannot see why.
It is reverse now - IRQs contribute to domain->mapcount and
irq_domain_associate/irq_domain_disassociate take necessary steps to
keep this counter in order. What might be missing is that if we have
cascade of IRQs (as in the IOAPIC example from
Documentation/core-api/irq/irq-domain.rst ), then a parent IRQ should
contribute to the children IRQs and it is up to
irq_domain_ops::alloc/free hooks, and they all seem to be eventually
calling irq_domain_alloc_irqs_xxx/irq_domain_free_irqs_xxx which seems
right.

Documentation/core-api/irq/irq-domain.rst also suggests there is a lot
to see in debugfs about IRQs but on my thinkpad there is nothing about
hierarchy.

So I'll ask again :)

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?


If your HW doesn't require an interrupt hierarchy, run VMs!
Booting an arm64 guest with virtual PCI devices will result in
hierarchies being created (PCI-MSI -> GIC MSI widget -> GIC).


Absolutely :) But the beauty of ARM is that one can buy an actual ARM 
device for $20. I have an "opi one+ allwinner h6 64bit cortex a53 1GB RAM" - 
is it worth using KVM on this device, or is it too small for that?



You can use KVM, or even bare QEMU on x86 if you are so inclined.


Have a QEMU command line handy for x86/tcg?


I'll try to go through this patch over the week-end (or more probably
early next week), and try to understand where our understandings
differ.


Great, thanks! Fred spotted a problem with irq_free_descs() not doing 
kobject_put() anymore, which is a problem for sa.c and the like, 
and I will go through these places anyway.



--
Alexey


Re: [PATCH] panic: Avoid dump_stack() twice

2020-11-12 Thread Alexey Kardashevskiy

Fixed already

https://ozlabs.org/~akpm/mmots/broken-out/panic-dont-dump-stack-twice-on-warn.patch

Sorry for breaking this :(


On 13/11/2020 16:47, Kefeng Wang wrote:

stacktrace will be dumped twice on ARM64 after commit 3f388f28639f
("panic: dump registers on panic_on_warn"), will not dump_stack
when no regs as before.

Fixes: 3f388f28639f ("panic: dump registers on panic_on_warn")
Signed-off-by: Kefeng Wang 
---
  kernel/panic.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index 396142ee43fd..332736a72a58 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -605,7 +605,8 @@ void __warn(const char *file, int line, void *caller, 
unsigned taint,
panic("panic_on_warn set ...\n");
}
  
-	dump_stack();

+   if (!regs)
+   dump_stack();
  
  	print_irqtrace_events(current);
  



--
Alexey


[PATCH kernel v3] genirq/irqdomain: Add reference counting to IRQs

2020-11-09 Thread Alexey Kardashevskiy
PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem with that is that these interrupts are
shared and when performing hot unplug, we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

As kobject_put() is called directly now (not via RCU), it can also handle
the early boot case (irq_kobj_base==NULL) with the help of
the kobject::state_in_sysfs flag and without additional irq_sysfs_del().
While at this, clean up the comment at where irq_sysfs_del() was called.

Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;
platforms tend to dispose only if setup failed and the rest seems
calling one dispose per one mapping. Except (at least) PPC/pseries
which needs https://lkml.org/lkml/2020/10/27/259

Cc: Cédric Le Goater 
Cc: Marc Zyngier 
Cc: Michael Ellerman 
Cc: Qian Cai 
Cc: Rob Herring 
Cc: Frederic Barrat 
Cc: Michal Suchánek 
Cc: Thomas Gleixner 
Signed-off-by: Alexey Kardashevskiy 
---

This is what it is fixing for powerpc:

There was a comment about whether hierarchical IRQ domains should
contribute to this reference counter and I need some help here as
I cannot see why.
It is reverse now - IRQs contribute to domain->mapcount and
irq_domain_associate/irq_domain_disassociate take necessary steps to
keep this counter in order. What might be missing is that if we have
cascade of IRQs (as in the IOAPIC example from
Documentation/core-api/irq/irq-domain.rst ), then a parent IRQ should
contribute to the children IRQs and it is up to
irq_domain_ops::alloc/free hooks, and they all seem to be eventually
calling irq_domain_alloc_irqs_xxx/irq_domain_free_irqs_xxx which seems
right.

Documentation/core-api/irq/irq-domain.rst also suggests there is a lot
to see in debugfs about IRQs but on my thinkpad there is nothing about
hierarchy.

So I'll ask again :)

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?



---
Changes:
v3:
* removed very wrong kobject_get/_put from irq_domain_associate/
irq_domain_disassociate as these are called from kobject_release so
irq_descs were never actually released
* removed irq_sysfs_del as 1) we do not seem to need it with changed
counting  2) produces a "no parent" warning as it would be called from
kobject_release which removes sysfs nodes itself

v2:
* added more get/put, including irq_domain_associate/irq_domain_disassociate
---
 kernel/irq/irqdesc.c   | 55 ++
 kernel/irq/irqdomain.c | 37 
 2 files changed, 46 insertions(+), 46 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..79c904ebfd5c 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -295,18 +295,6 @@ static void irq_sysfs_add(int irq, struct irq_desc *desc)
}
 }
 
-static void irq_sysfs_del(struct irq_desc *desc)
-{
-   /*
-* If irq_sysfs_init() has not yet been invoked (early boot), then
-* irq_kobj_base is NULL and the descriptor was never added.
-* kobject_del() complains about a object with no parent, so make
-* it conditional.
-*/
-   if (irq_kobj_base)
-   kobject_del(&desc->kobj);
-}
-
 static int __init irq_sysfs_init(void)
 {
struct irq_desc *desc;
@@ -337,7 +325,6 @@ static struct kobj_type irq_kobj_type = {
 };
 
 static void irq_sysfs_add(int irq, struct irq_desc *desc) {}
-static void irq_sysfs_del(struct irq_desc *desc) {}
 
 #endif /* CONFIG_SYSFS */
 
@@ -419,20 +406,40 @@ static struct irq_desc *alloc_desc(int irq, int node, 
unsigned int flags,
return NULL;
 }
 
+static void delayed_free_desc(struct rcu_head *rhp);
 static void irq_kobj_release(struct kobject *kobj)
 {
struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+#ifdef CONFIG_IRQ_DOMAIN
+   struct irq_domain *domain;
+   unsigned int virq = desc->irq_data.irq;
 
-   free_masks(desc);
-   free_percpu(desc->kstat_irqs);
-   kfree(desc);
+   domain = desc->irq_data.domain;
+   if (domain) {
+   if (irq_domain_is_hierarchy(domain)) {
+   irq_domain_free_irqs(virq, 1);
+   } else {
+   irq_domain_disassociate(domain, virq);
+   irq_free_desc(virq);
+   }
+   }
+#endif

Re: [PATCH kernel v2] irq: Add reference counting to IRQ mappings

2020-11-05 Thread Alexey Kardashevskiy

Hi,

This one seems to be broken in the domain associating part so please 
ignore it, I'll post v3 soon. Thanks,



On 29/10/2020 22:01, Alexey Kardashevskiy wrote:

PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem with that these interrupts are
shared and when performing hot unplug, we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

Quick grep shows no sign of irq reference counting in drivers. Drivers
typically request mapping when probing and dispose it when removing;
platforms tend to dispose only if setup failed and the rest seems
calling one dispose per one mapping. Except (at least) PPC/pseries
which needs https://lkml.org/lkml/2020/10/27/259

Signed-off-by: Alexey Kardashevskiy 
---

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?


---
Changes:
v2:
* added more get/put, including irq_domain_associate/irq_domain_disassociate
---
  kernel/irq/irqdesc.c   | 36 ---
  kernel/irq/irqdomain.c | 49 +-
  2 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..bc8f62157ffa 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -419,20 +419,40 @@ static struct irq_desc *alloc_desc(int irq, int node, 
unsigned int flags,
return NULL;
  }
  
+static void delayed_free_desc(struct rcu_head *rhp);

  static void irq_kobj_release(struct kobject *kobj)
  {
struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+#ifdef CONFIG_IRQ_DOMAIN
+   struct irq_domain *domain;
+   unsigned int virq = desc->irq_data.irq;
  
-	free_masks(desc);

-   free_percpu(desc->kstat_irqs);
-   kfree(desc);
+   domain = desc->irq_data.domain;
+   if (domain) {
+   if (irq_domain_is_hierarchy(domain)) {
+   irq_domain_free_irqs(virq, 1);
+   } else {
+   irq_domain_disassociate(domain, virq);
+   irq_free_desc(virq);
+   }
+   }
+#endif
+   /*
+* We free the descriptor, masks and stat fields via RCU. That
+* allows demultiplex interrupts to do rcu based management of
+* the child interrupts.
+* This also allows us to use rcu in kstat_irqs_usr().
+*/
+   call_rcu(&desc->rcu, delayed_free_desc);
  }
  
  static void delayed_free_desc(struct rcu_head *rhp)

  {
struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
  
-	kobject_put(&desc->kobj);

+   free_masks(desc);
+   free_percpu(desc->kstat_irqs);
+   kfree(desc);
  }
  
  static void free_desc(unsigned int irq)

@@ -453,14 +473,6 @@ static void free_desc(unsigned int irq)
 */
irq_sysfs_del(desc);
delete_irq_desc(irq);
-
-   /*
-* We free the descriptor, masks and stat fields via RCU. That
-* allows demultiplex interrupts to do rcu based management of
-* the child interrupts.
-* This also allows us to use rcu in kstat_irqs_usr().
-*/
-   call_rcu(&desc->rcu, delayed_free_desc);
  }
  
  static int alloc_descs(unsigned int start, unsigned int cnt, int node,

diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..5fb060e077e3 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -487,6 +487,7 @@ static void irq_domain_set_mapping(struct irq_domain 
*domain,
  
  void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)

  {
+   struct irq_desc *desc = irq_to_desc(irq);
struct irq_data *irq_data = irq_get_irq_data(irq);
irq_hw_number_t hwirq;
  
@@ -514,11 +515,14 @@ void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)
  
  	/* Clear reverse map for this hwirq */

irq_domain_clear_mapping(domain, hwirq);
+
+   kobject_put(&desc->kobj);
  }
  
  int irq_domain_associate(struct irq_domain *domain, unsigned int virq,

 irq_hw_number_t hwirq)
  {
+   struct irq_desc *desc = irq_to_desc(virq);
struct irq_data *irq_data = irq_get_irq_data(virq);
int ret;
  
@@ -530,6 +534,8 @@ int irq_domain_associate(struct irq_domain *domain, unsigned int virq,

if (WARN(irq_data->domain, "error: virq%i is already associated", virq))
   

[PATCH kernel v3 1/2] dma: Allow mixing bypass and mapped DMA operation

2020-11-03 Thread Alexey Kardashevskiy
At the moment we allow bypassing DMA ops only when we can do this for
the entire RAM. However, there are configs with mixed memory types where
we could still allow bypassing the IOMMU in most cases;
POWERPC with persistent memory is one example.

This adds an arch hook to determine where bypass can still work, and
we invoke the direct DMA API there. The following patch checks the bus
limit on POWERPC to allow or disallow direct mapping.

This adds a CONFIG_ARCH_HAS_DMA_MAP_DIRECT config option to make the
arch_* hooks no-ops by default.
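
As a rough sketch of the contract (editor's illustration; the real
POWERPC implementation is in the following patch), an architecture
selecting ARCH_HAS_DMA_MAP_DIRECT provides hooks that return true when
the generic code may use dma_direct_*() for a particular buffer even
though dma_ops are installed, e.g.:

#include <linux/device.h>
#include <linux/dma-direct.h>

/* The generic code passes the physical address of the end of the buffer. */
bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr)
{
	/* No per-device limit set: leave the decision to dma_map_direct(). */
	if (!dev->bus_dma_limit)
		return false;

	/* Bypass the IOMMU only when the buffer ends below the 1:1 window. */
	return phys_to_dma(dev, addr) <= dev->bus_dma_limit;
}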

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/dma/mapping.c | 24 
 kernel/dma/Kconfig   |  4 
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..a0bc9eb876ed 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -137,6 +137,18 @@ static inline bool dma_map_direct(struct device *dev,
return dma_go_direct(dev, *dev->dma_mask, ops);
 }
 
+#ifdef CONFIG_ARCH_HAS_DMA_MAP_DIRECT
+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr);
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle);
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg, int 
nents);
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg, int 
nents);
+#else
+#define arch_dma_map_page_direct(d, a) (0)
+#define arch_dma_unmap_page_direct(d, a) (0)
+#define arch_dma_map_sg_direct(d, s, n) (0)
+#define arch_dma_unmap_sg_direct(d, s, n) (0)
+#endif
+
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs)
@@ -149,7 +161,8 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct 
page *page,
if (WARN_ON_ONCE(!dev->dma_mask))
return DMA_MAPPING_ERROR;
 
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_map_page_direct(dev, page_to_phys(page) + offset + size))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
else
addr = ops->map_page(dev, page, offset, size, dir, attrs);
@@ -165,7 +178,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_unmap_page_direct(dev, addr + size))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
@@ -188,7 +202,8 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist 
*sg, int nents,
if (WARN_ON_ONCE(!dev->dma_mask))
return 0;
 
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_map_sg_direct(dev, sg, nents))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
@@ -207,7 +222,8 @@ void dma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 
BUG_ON(!valid_dma_direction(dir));
debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_unmap_sg_direct(dev, sg, nents))
dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
else if (ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..43d106598e82 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -20,6 +20,10 @@ config DMA_OPS
 config DMA_OPS_BYPASS
bool
 
+# Lets platform IOMMU driver choose between bypass and IOMMU
+config ARCH_HAS_DMA_MAP_DIRECT
+   bool
+
 config NEED_SG_DMA_LENGTH
bool
 
-- 
2.17.1



[PATCH kernel v3 0/2] DMA, powerpc/dma: Fallback to dma_ops when persistent memory present

2020-11-03 Thread Alexey Kardashevskiy
This allows mixing direct DMA (to/from RAM) and
IOMMU (to/from persistent memory) on the PPC64/pseries
platform.

This replaces https://lkml.org/lkml/2020/10/27/418
which replaces https://lkml.org/lkml/2020/10/20/1085


This is based on sha1
4525c8781ec0 Linus Torvalds "scsi: qla2xxx: remove incorrect sparse #ifdef".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  dma: Allow mixing bypass and mapped DMA operation
  powerpc/dma: Fallback to dma_ops when persistent memory present

 arch/powerpc/kernel/dma-iommu.c| 70 +-
 arch/powerpc/platforms/pseries/iommu.c | 44 
 kernel/dma/mapping.c   | 24 +++--
 arch/powerpc/Kconfig   |  1 +
 kernel/dma/Kconfig |  4 ++
 5 files changed, 127 insertions(+), 16 deletions(-)

-- 
2.17.1



[PATCH kernel v3 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-11-03 Thread Alexey Kardashevskiy
So far we have been using huge DMA windows to map all the RAM available.
The RAM is normally mapped to the VM address space contiguously, and
there is always a reasonable upper limit for possible future hot plugged
RAM which makes it easy to map all RAM via IOMMU.

Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
normal RAM) can be mapped anywhere in the VM space beyond the maximum RAM
size, and since it can be used for DMA, it requires extending the huge
window up to MAX_PHYSMEM_BITS, which requires hypervisor support for:
1. huge TCE tables;
2. multilevel TCE tables;
3. huge IOMMU pages.

Certain hypervisors cannot do all of these, so the only option left is
restricting the huge DMA window to include only RAM and falling back to
the default DMA window for persistent memory.

This defines arch_dma_map_page_direct/etc to allow the generic DMA code
to perform additional checks on whether direct DMA is still possible.

This checks if the system has persistent memory. If it does not,
the DMA bypass mode is selected, i.e.
* dev->bus_dma_limit = 0
* dev->dma_ops_bypass = true <- this avoids calling dma_ops for mapping.

If there is such memory, this creates an identity mapping only for RAM and
sets dev->bus_dma_limit to let the generic code decide whether to
call into the direct DMA or the indirect DMA ops.

This should not change the existing behaviour when there is no persistent
memory, as dev->dma_ops_bypass is expected to be set.
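
A made-up runtime example (editor's illustration; addresses and sizes are
invented): with 64GB of RAM mapped 1:1 and ibm,pmemory living above it,
dev->dma_ops_bypass stays false and dev->bus_dma_limit points at the top
of RAM, so the generic code routes each mapping individually:

#include <linux/dma-mapping.h>

static void example_mappings(struct device *dev, struct page *ram_page,
			     struct page *pmem_page, size_t sz)
{
	/* Assume dev->bus_dma_limit == 0x1000000000ULL (top of 64GB RAM). */

	/* Ends below the limit: the generic code uses dma_direct_map_page(). */
	dma_addr_t a = dma_map_page(dev, ram_page, 0, sz, DMA_TO_DEVICE);

	/* pmem sits above the limit: this falls back to ops->map_page(),
	 * i.e. the default TCE-backed DMA window. */
	dma_addr_t b = dma_map_page(dev, pmem_page, 0, sz, DMA_TO_DEVICE);

	dma_unmap_page(dev, b, sz, DMA_TO_DEVICE);
	dma_unmap_page(dev, a, sz, DMA_TO_DEVICE);
}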

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/dma-iommu.c| 70 +-
 arch/powerpc/platforms/pseries/iommu.c | 44 
 arch/powerpc/Kconfig   |  1 +
 3 files changed, 103 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index a1c744194018..21e2d9f059a9 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -10,6 +10,63 @@
 #include 
 #include 
 
+#define can_map_direct(dev, addr) \
+   ((dev)->bus_dma_limit >= phys_to_dma((dev), (addr)))
+
+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr)
+{
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   return can_map_direct(dev, addr);
+}
+EXPORT_SYMBOL_GPL(arch_dma_map_page_direct);
+
+#define is_direct_handle(dev, h) ((h) >= (dev)->archdata.dma_offset)
+
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle)
+{
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   return is_direct_handle(dev, dma_handle);
+}
+EXPORT_SYMBOL_GPL(arch_dma_unmap_page_direct);
+
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg, int 
nents)
+{
+   struct scatterlist *s;
+   int i;
+
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   for_each_sg(sg, s, nents, i) {
+   if (!can_map_direct(dev, sg_phys(s) + s->offset + s->length))
+   return false;
+   }
+
+   return true;
+}
+EXPORT_SYMBOL(arch_dma_map_sg_direct);
+
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg, int 
nents)
+{
+   struct scatterlist *s;
+   int i;
+
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   for_each_sg(sg, s, nents, i) {
+   if (!is_direct_handle(dev, s->dma_address + s->length))
+   return false;
+   }
+
+   return true;
+}
+EXPORT_SYMBOL(arch_dma_unmap_sg_direct);
+
 /*
  * Generic iommu implementation
  */
@@ -90,8 +147,17 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
struct iommu_table *tbl = get_iommu_table_base(dev);
 
if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
-   dev->dma_ops_bypass = true;
-   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
+   /*
+* dma_iommu_bypass_supported() sets dma_max when there is
+* 1:1 mapping but it is somehow limited.
+* ibm,pmemory is one example.
+*/
+   dev->dma_ops_bypass = dev->bus_dma_limit == 0;
+   if (!dev->dma_ops_bypass)
+   dev_warn(dev, "iommu: 64-bit OK but direct DMA is 
limited by %llx\n",
+dev->bus_dma_limit);
+   else
+   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
return 1;
}
 
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..91112e748491 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -839,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn)

[PATCH kernel v2] irq: Add reference counting to IRQ mappings

2020-10-29 Thread Alexey Kardashevskiy
PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem is that these interrupts are
shared, and when performing hot unplug we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

A quick grep shows no sign of irq reference counting in drivers. Drivers
typically request a mapping when probing and dispose of it when removing;
platforms tend to dispose only if setup failed, and the rest seem to call
one dispose per mapping. The exception is (at least) PPC/pseries, which
needs https://lkml.org/lkml/2020/10/27/259

Signed-off-by: Alexey Kardashevskiy 
---

What is the easiest way to get irq-hierarchical hardware?
I have a bunch of powerpc boxes (no good) but also a raspberry pi,
a bunch of 32/64bit orange pi's, an "armada" arm box,
thinkpads - is any of this good for the task?


---
Changes:
v2:
* added more get/put, including irq_domain_associate/irq_domain_disassociate
---
 kernel/irq/irqdesc.c   | 36 ---
 kernel/irq/irqdomain.c | 49 +-
 2 files changed, 58 insertions(+), 27 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..bc8f62157ffa 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -419,20 +419,40 @@ static struct irq_desc *alloc_desc(int irq, int node, 
unsigned int flags,
return NULL;
 }
 
+static void delayed_free_desc(struct rcu_head *rhp);
 static void irq_kobj_release(struct kobject *kobj)
 {
struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+#ifdef CONFIG_IRQ_DOMAIN
+   struct irq_domain *domain;
+   unsigned int virq = desc->irq_data.irq;
 
-   free_masks(desc);
-   free_percpu(desc->kstat_irqs);
-   kfree(desc);
+   domain = desc->irq_data.domain;
+   if (domain) {
+   if (irq_domain_is_hierarchy(domain)) {
+   irq_domain_free_irqs(virq, 1);
+   } else {
+   irq_domain_disassociate(domain, virq);
+   irq_free_desc(virq);
+   }
+   }
+#endif
+   /*
+* We free the descriptor, masks and stat fields via RCU. That
+* allows demultiplex interrupts to do rcu based management of
+* the child interrupts.
+* This also allows us to use rcu in kstat_irqs_usr().
+*/
+   call_rcu(&desc->rcu, delayed_free_desc);
 }
 
 static void delayed_free_desc(struct rcu_head *rhp)
 {
struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
 
-   kobject_put(&desc->kobj);
+   free_masks(desc);
+   free_percpu(desc->kstat_irqs);
+   kfree(desc);
 }
 
 static void free_desc(unsigned int irq)
@@ -453,14 +473,6 @@ static void free_desc(unsigned int irq)
 */
irq_sysfs_del(desc);
delete_irq_desc(irq);
-
-   /*
-* We free the descriptor, masks and stat fields via RCU. That
-* allows demultiplex interrupts to do rcu based management of
-* the child interrupts.
-* This also allows us to use rcu in kstat_irqs_usr().
-*/
-   call_rcu(>rcu, delayed_free_desc);
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..5fb060e077e3 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -487,6 +487,7 @@ static void irq_domain_set_mapping(struct irq_domain 
*domain,
 
 void irq_domain_disassociate(struct irq_domain *domain, unsigned int irq)
 {
+   struct irq_desc *desc = irq_to_desc(irq);
struct irq_data *irq_data = irq_get_irq_data(irq);
irq_hw_number_t hwirq;
 
@@ -514,11 +515,14 @@ void irq_domain_disassociate(struct irq_domain *domain, 
unsigned int irq)
 
/* Clear reverse map for this hwirq */
irq_domain_clear_mapping(domain, hwirq);
+
+   kobject_put(&desc->kobj);
 }
 
 int irq_domain_associate(struct irq_domain *domain, unsigned int virq,
 irq_hw_number_t hwirq)
 {
+   struct irq_desc *desc = irq_to_desc(virq);
struct irq_data *irq_data = irq_get_irq_data(virq);
int ret;
 
@@ -530,6 +534,8 @@ int irq_domain_associate(struct irq_domain *domain, 
unsigned int virq,
if (WARN(irq_data->domain, "error: virq%i is already associated", virq))
return -EINVAL;
 
+   kobject_get(&desc->kobj);
+
	mutex_lock(&irq_domain_mutex);
irq_data->hwirq = hwirq;
irq_data->domain = domain;
@@ -548,6 +554,7 @@ int i

Re: [RFC PATCH kernel 1/2] irq: Add reference counting to IRQ mappings

2020-10-29 Thread Alexey Kardashevskiy




On 28/10/2020 03:09, Marc Zyngier wrote:

Hi Alexey,

On 2020-10-27 09:06, Alexey Kardashevskiy wrote:

PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem with that these interrupts are
shared and when performing hot unplug, we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.


That's quite interesting, as I was about to revive a patch series that
rework the irqdomain subsystem to directly cache irq_desc instead of
raw interrupt numbers. And for that, I needed some form of refcounting...



This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

If some driver or platform does its own reference counting, this expects
those parties to call irq_find_mapping() and call irq_dispose_mapping()
for every irq_create_fwspec_mapping()/irq_create_mapping().

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/irq/irqdesc.c   | 35 +++
 kernel/irq/irqdomain.c | 27 +--
 2 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..dae096238500 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -419,20 +419,39 @@ static struct irq_desc *alloc_desc(int irq, int
node, unsigned int flags,
 return NULL;
 }

+static void delayed_free_desc(struct rcu_head *rhp);
 static void irq_kobj_release(struct kobject *kobj)
 {
 struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+    struct irq_domain *domain;
+    unsigned int virq = desc->irq_data.irq;

-    free_masks(desc);
-    free_percpu(desc->kstat_irqs);
-    kfree(desc);
+    domain = desc->irq_data.domain;
+    if (domain) {
+    if (irq_domain_is_hierarchy(domain)) {
+    irq_domain_free_irqs(virq, 1);


How does this work with hierarchical domains? Each domain should
contribute as a reference on the irq_desc. But if you got here,
it means the refcount has already dropped to 0.

So either there is nothing to free here, or you don't track the
references implied by the hierarchy. I suspect the latter.


This is correct, I did not look at hierarchy yet, looking now...




+    } else {
+    irq_domain_disassociate(domain, virq);
+    irq_free_desc(virq);
+    }
+    }
+
+    /*
+ * We free the descriptor, masks and stat fields via RCU. That
+ * allows demultiplex interrupts to do rcu based management of
+ * the child interrupts.
+ * This also allows us to use rcu in kstat_irqs_usr().
+ */
+    call_rcu(&desc->rcu, delayed_free_desc);
 }

 static void delayed_free_desc(struct rcu_head *rhp)
 {
 struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);

-    kobject_put(&desc->kobj);
+    free_masks(desc);
+    free_percpu(desc->kstat_irqs);
+    kfree(desc);
 }

 static void free_desc(unsigned int irq)
@@ -453,14 +472,6 @@ static void free_desc(unsigned int irq)
  */
 irq_sysfs_del(desc);
 delete_irq_desc(irq);
-
-    /*
- * We free the descriptor, masks and stat fields via RCU. That
- * allows demultiplex interrupts to do rcu based management of
- * the child interrupts.
- * This also allows us to use rcu in kstat_irqs_usr().
- */
-    call_rcu(&desc->rcu, delayed_free_desc);
 }

 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..02733ddc321f 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -638,6 +638,7 @@ unsigned int irq_create_mapping(struct irq_domain 
*domain,

 {
 struct device_node *of_node;
 int virq;
+    struct irq_desc *desc;

 pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq);

@@ -655,7 +656,9 @@ unsigned int irq_create_mapping(struct irq_domain 
*domain,

 /* Check if mapping already exists */
 virq = irq_find_mapping(domain, hwirq);
 if (virq) {
+    desc = irq_to_desc(virq);
 pr_debug("-> existing mapping on virq %d\n", virq);
+    kobject_get(&desc->kobj);


My worry with this is that there is probably a significant amount of
code out there that relies on multiple calls to irq_create_mapping()
with the same parameters not to have any side effects. They would
expect a subsequent irq_dispose_mapping() to drop the translation
altogether, and that's obviously not the case here.

Have you audited the various call sites to see what could break?



The vast majority call one of the irq_*_create_mapping() helpers in
init/probe and then call irq_dispose_mapping() right there if probing
failed or when the driver is unloaded. I 

[PATCH kernel v4 0/2] DMA, powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-28 Thread Alexey Kardashevskiy


This allows mixing direct DMA (to/from RAM) and
IOMMU (to/from persistent memory) on the PPC64/pseries
platform.

This replaces https://lkml.org/lkml/2020/10/28/929
which replaces https://lkml.org/lkml/2020/10/27/418
which replaces https://lkml.org/lkml/2020/10/20/1085


This is based on sha1
4525c8781ec0 Linus Torvalds "scsi: qla2xxx: remove incorrect sparse #ifdef".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  dma: Allow mixing bypass and mapped DMA operation
  powerpc/dma: Fallback to dma_ops when persistent memory present

 arch/powerpc/kernel/dma-iommu.c| 73 +-
 arch/powerpc/platforms/pseries/iommu.c | 51 ++
 kernel/dma/mapping.c   | 26 +++--
 arch/powerpc/Kconfig   |  1 +
 kernel/dma/Kconfig |  4 ++
 5 files changed, 139 insertions(+), 16 deletions(-)

-- 
2.17.1



[PATCH kernel v4 1/2] dma: Allow mixing bypass and mapped DMA operation

2020-10-28 Thread Alexey Kardashevskiy
At the moment we allow bypassing DMA ops only when we can do this for
the entire RAM. However, there are configs with mixed memory types where
we could still allow bypassing the IOMMU in most cases;
POWERPC with persistent memory is one example.

This adds an arch hook to determine where bypass can still work, and
we invoke the direct DMA API there. The following patch checks the bus
limit on POWERPC to allow or disallow direct mapping.

This adds a CONFIG_ARCH_HAS_DMA_MAP_DIRECT config option to make the
arch_* hooks no-ops by default.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* wrapped long lines
---
 kernel/dma/mapping.c | 26 ++
 kernel/dma/Kconfig   |  4 
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..ad1f052e046d 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -137,6 +137,20 @@ static inline bool dma_map_direct(struct device *dev,
return dma_go_direct(dev, *dev->dma_mask, ops);
 }
 
+#ifdef CONFIG_ARCH_HAS_DMA_MAP_DIRECT
+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr);
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle);
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg,
+   int nents);
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
+ int nents);
+#else
+#define arch_dma_map_page_direct(d, a) (0)
+#define arch_dma_unmap_page_direct(d, a) (0)
+#define arch_dma_map_sg_direct(d, s, n) (0)
+#define arch_dma_unmap_sg_direct(d, s, n) (0)
+#endif
+
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs)
@@ -149,7 +163,8 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct 
page *page,
if (WARN_ON_ONCE(!dev->dma_mask))
return DMA_MAPPING_ERROR;
 
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_map_page_direct(dev, page_to_phys(page) + offset + size))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
else
addr = ops->map_page(dev, page, offset, size, dir, attrs);
@@ -165,7 +180,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
const struct dma_map_ops *ops = get_dma_ops(dev);
 
BUG_ON(!valid_dma_direction(dir));
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_unmap_page_direct(dev, addr + size))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
@@ -188,7 +204,8 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist 
*sg, int nents,
if (WARN_ON_ONCE(!dev->dma_mask))
return 0;
 
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_map_sg_direct(dev, sg, nents))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
@@ -207,7 +224,8 @@ void dma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 
BUG_ON(!valid_dma_direction(dir));
debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops) ||
+   arch_dma_unmap_sg_direct(dev, sg, nents))
dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
else if (ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..43d106598e82 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -20,6 +20,10 @@ config DMA_OPS
 config DMA_OPS_BYPASS
bool
 
+# Lets platform IOMMU driver choose between bypass and IOMMU
+config ARCH_HAS_DMA_MAP_DIRECT
+   bool
+
 config NEED_SG_DMA_LENGTH
bool
 
-- 
2.17.1



[PATCH kernel v4 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-28 Thread Alexey Kardashevskiy
So far we have been using huge DMA windows to map all the RAM available.
The RAM is normally mapped to the VM address space contiguously, and
there is always a reasonable upper limit for possible future hot plugged
RAM which makes it easy to map all RAM via IOMMU.

Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
normal RAM) can be mapped anywhere in the VM space beyond the maximum RAM
size, and since it can be used for DMA, it requires extending the huge
window up to MAX_PHYSMEM_BITS, which requires hypervisor support for:
1. huge TCE tables;
2. multilevel TCE tables;
3. huge IOMMU pages.

Certain hypervisors cannot do all of these, so the only option left is
restricting the huge DMA window to include only RAM and falling back to
the default DMA window for persistent memory.

This defines arch_dma_map_page_direct/etc to allow the generic DMA code
to perform additional checks on whether direct DMA is still possible.

This checks if the system has persistent memory. If it does not,
the DMA bypass mode is selected, i.e.
* dev->bus_dma_limit = 0
* dev->dma_ops_bypass = true <- this avoids calling dma_ops for mapping.

If there is such memory, this creates an identity mapping only for RAM and
sets dev->bus_dma_limit to let the generic code decide whether to
call into the direct DMA or the indirect DMA ops.

This should not change the existing behaviour when there is no persistent
memory, as dev->dma_ops_bypass is expected to be set.

Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v4:
* fixed leaked device_node
* wrapped long lines
---
 arch/powerpc/kernel/dma-iommu.c| 73 +-
 arch/powerpc/platforms/pseries/iommu.c | 51 ++
 arch/powerpc/Kconfig   |  1 +
 3 files changed, 113 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index a1c744194018..e5e9e5e3e3ca 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -10,6 +10,65 @@
 #include 
 #include 
 
+#define can_map_direct(dev, addr) \
+   ((dev)->bus_dma_limit >= phys_to_dma((dev), (addr)))
+
+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr)
+{
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   return can_map_direct(dev, addr);
+}
+EXPORT_SYMBOL_GPL(arch_dma_map_page_direct);
+
+#define is_direct_handle(dev, h) ((h) >= (dev)->archdata.dma_offset)
+
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle)
+{
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   return is_direct_handle(dev, dma_handle);
+}
+EXPORT_SYMBOL_GPL(arch_dma_unmap_page_direct);
+
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg,
+   int nents)
+{
+   struct scatterlist *s;
+   int i;
+
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   for_each_sg(sg, s, nents, i) {
+   if (!can_map_direct(dev, sg_phys(s) + s->offset + s->length))
+   return false;
+   }
+
+   return true;
+}
+EXPORT_SYMBOL(arch_dma_map_sg_direct);
+
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg,
+ int nents)
+{
+   struct scatterlist *s;
+   int i;
+
+   if (likely(!dev->bus_dma_limit))
+   return false;
+
+   for_each_sg(sg, s, nents, i) {
+   if (!is_direct_handle(dev, s->dma_address + s->length))
+   return false;
+   }
+
+   return true;
+}
+EXPORT_SYMBOL(arch_dma_unmap_sg_direct);
+
 /*
  * Generic iommu implementation
  */
@@ -90,8 +149,18 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
struct iommu_table *tbl = get_iommu_table_base(dev);
 
if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
-   dev->dma_ops_bypass = true;
-   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
+   /*
+* dma_iommu_bypass_supported() sets dma_max when there is
+* 1:1 mapping but it is somehow limited.
+* ibm,pmemory is one example.
+*/
+   dev->dma_ops_bypass = dev->bus_dma_limit == 0;
+   if (!dev->dma_ops_bypass)
+   dev_warn(dev,
+"iommu: 64-bit OK but direct DMA is limited by 
%llx\n",
+dev->bus_dma_limit);
+   else
+   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
return 1;
}
 
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..9fc5217f0c8e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -839,7 +839,7 @@ st

Re: [PATCH kernel v2 1/2] dma: Allow mixing bypass and normal IOMMU operation

2020-10-28 Thread Alexey Kardashevskiy




On 28/10/2020 03:48, Christoph Hellwig wrote:

+static inline bool dma_handle_direct(struct device *dev, dma_addr_t dma_handle)
+{
+   return dma_handle >= dev->archdata.dma_offset;
+}


This won't compile except for powerpc, and directly accesing arch members
in common code is a bad idea.  Maybe both your helpers need to be
supplied by arch code to better abstract this out.



rats, overlooked it :( bus_dma_limit is generic but dma_offset is in 
archdata :-/






if (dma_map_direct(dev, ops))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   else if (dev->bus_dma_limit &&
+can_map_direct(dev, (phys_addr_t) page_to_phys(page) + offset 
+ size))
+   addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
+#endif


I don't think page_to_phys needs a phys_addr_t on the return value.
I'd also much prefer if we make this a little more beautiful, here
are a few suggestions:

  - hide the bus_dma_limit check inside can_map_direct, and provide a
stub so that we can avoid the ifdef
  - use a better name for can_map_direct, and maybe also a better calling
convention by passing the page (the sg code also has the page), 


It is passing the address of the end of the mapped area, so passing a
page struct would mean passing a page and an offset, which is an extra
parameter, and we do not want to do anything with the page in those
hooks anyway, so I'd keep it as is.




and
maybe even hide the dma_map_direct inside it.


Call dma_map_direct() from arch_dma_map_page_direct() if 
arch_dma_map_page_direct() is defined? Seems suboptimal as it is going 
to be bypass=true in most cases and we save one call by avoiding calling 
arch_dma_map_page_direct(). Unless I missed something?





if (dma_map_direct(dev, ops) ||
arch_dma_map_page_direct(dev, page, offset, size))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);


BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   else if (dev->bus_dma_limit && dma_handle_direct(dev, addr + size))
+   dma_direct_unmap_page(dev, addr, size, dir, attrs);
+#endif


Same here.


if (dma_map_direct(dev, ops))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   else if (dev->bus_dma_limit) {
+   struct scatterlist *s;
+   bool direct = true;
+   int i;
+
+   for_each_sg(sg, s, nents, i) {
+   direct = can_map_direct(dev, sg_phys(s) + s->offset + 
s->length);
+   if (!direct)
+   break;
+   }
+   if (direct)
+   ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
+   else
+   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+   }
+#endif


This needs to go into a helper as well.  I think the same style as
above would work pretty nicely as well:


Yup. I'll repost v3 soon with this change. Thanks for the review.




if (dma_map_direct(dev, ops) ||
arch_dma_map_sg_direct(dev, sg, nents))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);


+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   if (dev->bus_dma_limit) {
+   struct scatterlist *s;
+   bool direct = true;
+   int i;
+
+   for_each_sg(sg, s, nents, i) {
+   direct = dma_handle_direct(dev, s->dma_address + 
s->length);
+   if (!direct)
+   break;
+   }
+   if (direct) {
+   dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
+   return;
+   }
+   }
+#endif


One more time here..



--
Alexey


Re: [PATCH kernel v3 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-28 Thread Alexey Kardashevskiy




On 29/10/2020 11:40, Michael Ellerman wrote:

Alexey Kardashevskiy  writes:

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..91112e748491 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -,11 +1112,13 @@ static void reset_dma_window(struct pci_dev *dev, 
struct device_node *par_dn)
   */
  static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
  {
-   int len, ret;
+   int len = 0, ret;
+   bool pmem_present = of_find_node_by_type(NULL, "ibm,pmemory") != NULL;


That leaks a reference on the returned node.

dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
of_node_put(dn);



ah, true. v4 then.






@@ -1126,7 +1129,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
  
	mutex_lock(&direct_window_init_mutex);
  
-	dma_addr = find_existing_ddw(pdn);

+   dma_addr = find_existing_ddw(pdn, &len);


I don't see len used anywhere?


if (dma_addr != 0)
goto out_unlock;
  
@@ -1212,14 +1215,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)

}
/* verify the window * number of ptes will map the partition */
/* check largest block * page size > max memory hotplug addr */
-   max_addr = ddw_memory_hotplug_max();
-   if (query.largest_available_block < (max_addr >> page_shift)) {
-   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
- "%llu-sized pages\n", max_addr, query.largest_available_block,
- 1ULL << page_shift);
+   /*
+* The "ibm,pmemory" can appear anywhere in the address space.
+* Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
+* for the upper limit and fallback to max RAM otherwise but this
+* disables device::dma_ops_bypass.
+*/
+   len = max_ram_len;


Here you override whatever find_existing_ddw() wrote to len?



Not always: there are a bunch of gotos between this line and the end of
the function, and one of them (which returns the existing window) is
legit. Thanks,







+   if (pmem_present) {
+   if (query.largest_available_block >=
+   (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
+   len = MAX_PHYSMEM_BITS - page_shift;
+   else
+   dev_info(&dev->dev, "Skipping ibm,pmemory");
+   }
+
+   if (query.largest_available_block < (1ULL << (len - page_shift))) {
+   dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu %llu-sized pages\n",
+   1ULL << len, query.largest_available_block, 1ULL << page_shift);
goto out_failed;
}
-   len = order_base_2(max_addr);
win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
if (!win64) {
dev_info(&dev->dev,



cheers



--
Alexey


Re: [PATCH kernel v2 1/2] dma: Allow mixing bypass and normal IOMMU operation

2020-10-28 Thread Alexey Kardashevskiy




On 29/10/2020 04:21, Christoph Hellwig wrote:

On Wed, Oct 28, 2020 at 05:55:23PM +1100, Alexey Kardashevskiy wrote:


It is passing an address of the end of the mapped area so passing a page
struct means passing page and offset which is an extra parameter and we do
not want to do anything with the page in those hooks anyway so I'd keep it
as is.



and
 maybe even hide the dma_map_direct inside it.


Call dma_map_direct() from arch_dma_map_page_direct() if
arch_dma_map_page_direct() is defined? Seems suboptimal as it is going to
be bypass=true in most cases and we save one call by avoiding calling
arch_dma_map_page_direct(). Unless I missed something?


C does not even evaluate the right hand side of a || expression if the
left hand evaluates to true.


Right, this is what I meant. dma_map_direct() is inline and fast so I 
did not want it inside the arch hook which is not inline.
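
A trivial standalone illustration of that point (hypothetical helpers):
the out-of-line hook on the right of || is only evaluated when the
inline fast-path check on the left returns false.

#include <stdbool.h>
#include <stdio.h>

static bool out_of_line_check(void)	/* stands in for the arch hook */
{
	puts("arch hook called");
	return true;
}

int main(void)
{
	bool inline_fast_path = true;	/* stands in for dma_map_direct() */

	/* The right-hand side is never evaluated when the left is true. */
	if (inline_fast_path || out_of_line_check())
		puts("mapped directly");

	return 0;
}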



--
Alexey


Re: [PATCH kernel v3 1/2] dma: Allow mixing bypass and mapped DMA operation

2020-10-28 Thread Alexey Kardashevskiy




On 29/10/2020 04:22, Christoph Hellwig wrote:

On Wed, Oct 28, 2020 at 06:00:29PM +1100, Alexey Kardashevskiy wrote:

At the moment we allow bypassing DMA ops only when we can do this for
the entire RAM. However there are configs with mixed type memory
where we could still allow bypassing IOMMU in most cases;
POWERPC with persistent memory is one example.

This adds an arch hook to determine where bypass can still work and
we invoke direct DMA API. The following patch checks the bus limit
on POWERPC to allow or disallow direct mapping.

This adds a CONFIG_ARCH_HAS_DMA_SET_MASK config option to make arch_
hooks no-op by default.

Signed-off-by: Alexey Kardashevskiy 
---
  kernel/dma/mapping.c | 24 
  kernel/dma/Kconfig   |  4 
  2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..a0bc9eb876ed 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -137,6 +137,18 @@ static inline bool dma_map_direct(struct device *dev,
return dma_go_direct(dev, *dev->dma_mask, ops);
  }
  
+#ifdef CONFIG_ARCH_HAS_DMA_MAP_DIRECT

+bool arch_dma_map_page_direct(struct device *dev, phys_addr_t addr);
+bool arch_dma_unmap_page_direct(struct device *dev, dma_addr_t dma_handle);
+bool arch_dma_map_sg_direct(struct device *dev, struct scatterlist *sg, int 
nents);
+bool arch_dma_unmap_sg_direct(struct device *dev, struct scatterlist *sg, int 
nents);
+#else
+#define arch_dma_map_page_direct(d, a) (0)
+#define arch_dma_unmap_page_direct(d, a) (0)
+#define arch_dma_map_sg_direct(d, s, n) (0)
+#define arch_dma_unmap_sg_direct(d, s, n) (0)
+#endif


A bunch of overly long lines here.  Except for that this looks ok to me.
If you want me to queue up the series I can just fix it up.


I thought 100 is the new limit since
https://lkml.org/lkml/2020/5/29/1038 (yeah, that mentioned some Christoph
:) ) and having these multi-line does not make a huge difference, but
feel free to fix them up.


Are you going to take both patches? Do you need mpe's ack? Thanks,


--
Alexey


[PATCH kernel v2 0/2] DMA, powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-27 Thread Alexey Kardashevskiy
This allows mixing direct DMA (to/from RAM) and
IOMMU (to/from persistent memory) on the PPC64/pseries
platform.

This replaces this: https://lkml.org/lkml/2020/10/20/1085
A lesser evil this is :)

This is based on sha1
4525c8781ec0 Linus Torvalds "scsi: qla2xxx: remove incorrect sparse #ifdef".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  dma: Allow mixing bypass and normal IOMMU operation
  powerpc/dma: Fallback to dma_ops when persistent memory present

 arch/powerpc/kernel/dma-iommu.c| 12 -
 arch/powerpc/platforms/pseries/iommu.c | 44 ++-
 kernel/dma/mapping.c   | 61 +-
 arch/powerpc/Kconfig   |  1 +
 kernel/dma/Kconfig |  4 ++
 5 files changed, 108 insertions(+), 14 deletions(-)

-- 
2.17.1



[PATCH kernel v2 1/2] dma: Allow mixing bypass and normal IOMMU operation

2020-10-27 Thread Alexey Kardashevskiy
At the moment we allow bypassing DMA ops only when we can do this for
the entire RAM. However, there are configs with mixed memory types where
we could still allow bypassing the IOMMU in most cases;
POWERPC with persistent memory is one example.

This adds another check for the bus limit to determine where bypass
can still work, and we invoke the direct DMA API there; when the DMA
handle is outside that limit, we fall back to the DMA ops.

This adds a CONFIG_DMA_OPS_BYPASS_BUS_LIMIT config option which is off
by default and will be enabled for PPC_PSERIES in the following patch.

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/dma/mapping.c | 61 ++--
 kernel/dma/Kconfig   |  4 +++
 2 files changed, 63 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 51bb8fa8eb89..0f4f998e6c72 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -137,6 +137,18 @@ static inline bool dma_map_direct(struct device *dev,
return dma_go_direct(dev, *dev->dma_mask, ops);
 }
 
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+static inline bool can_map_direct(struct device *dev, phys_addr_t addr)
+{
+   return dev->bus_dma_limit >= phys_to_dma(dev, addr);
+}
+
+static inline bool dma_handle_direct(struct device *dev, dma_addr_t dma_handle)
+{
+   return dma_handle >= dev->archdata.dma_offset;
+}
+#endif
+
 dma_addr_t dma_map_page_attrs(struct device *dev, struct page *page,
size_t offset, size_t size, enum dma_data_direction dir,
unsigned long attrs)
@@ -151,6 +163,11 @@ dma_addr_t dma_map_page_attrs(struct device *dev, struct 
page *page,
 
if (dma_map_direct(dev, ops))
addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   else if (dev->bus_dma_limit &&
+can_map_direct(dev, (phys_addr_t) page_to_phys(page) + offset 
+ size))
+   addr = dma_direct_map_page(dev, page, offset, size, dir, attrs);
+#endif
else
addr = ops->map_page(dev, page, offset, size, dir, attrs);
debug_dma_map_page(dev, page, offset, size, dir, addr);
@@ -167,6 +184,10 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
BUG_ON(!valid_dma_direction(dir));
if (dma_map_direct(dev, ops))
dma_direct_unmap_page(dev, addr, size, dir, attrs);
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   else if (dev->bus_dma_limit && dma_handle_direct(dev, addr + size))
+   dma_direct_unmap_page(dev, addr, size, dir, attrs);
+#endif
else if (ops->unmap_page)
ops->unmap_page(dev, addr, size, dir, attrs);
debug_dma_unmap_page(dev, addr, size, dir);
@@ -190,6 +211,23 @@ int dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg, int nents,
 
if (dma_map_direct(dev, ops))
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   else if (dev->bus_dma_limit) {
+   struct scatterlist *s;
+   bool direct = true;
+   int i;
+
+   for_each_sg(sg, s, nents, i) {
+   direct = can_map_direct(dev, sg_phys(s) + s->offset + 
s->length);
+   if (!direct)
+   break;
+   }
+   if (direct)
+   ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
+   else
+   ents = ops->map_sg(dev, sg, nents, dir, attrs);
+   }
+#endif
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
BUG_ON(ents < 0);
@@ -207,9 +245,28 @@ void dma_unmap_sg_attrs(struct device *dev, struct 
scatterlist *sg,
 
BUG_ON(!valid_dma_direction(dir));
debug_dma_unmap_sg(dev, sg, nents, dir);
-   if (dma_map_direct(dev, ops))
+   if (dma_map_direct(dev, ops)) {
dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
-   else if (ops->unmap_sg)
+   return;
+   }
+#ifdef CONFIG_DMA_OPS_BYPASS_BUS_LIMIT
+   if (dev->bus_dma_limit) {
+   struct scatterlist *s;
+   bool direct = true;
+   int i;
+
+   for_each_sg(sg, s, nents, i) {
+   direct = dma_handle_direct(dev, s->dma_address + 
s->length);
+   if (!direct)
+   break;
+   }
+   if (direct) {
+   dma_direct_unmap_sg(dev, sg, nents, dir, attrs);
+   return;
+   }
+   }
+#endif
+   if (ops->unmap_sg)
ops->unmap_sg(dev, sg, nents, dir, attrs);
 }
 EXPORT_SYMBOL(dma_unmap_sg_attrs);
diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index c99de4a21458..02fa174fbdec 100644
--- a/kernel/dma/Kconfig

[PATCH kernel v2 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-27 Thread Alexey Kardashevskiy
So far we have been using huge DMA windows to map all the RAM available.
The RAM is normally mapped to the VM address space contiguously, and
there is always a reasonable upper limit for possible future hot plugged
RAM which makes it easy to map all RAM via IOMMU.

Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
normal RAM) can be mapped anywhere in the VM space beyond the maximum RAM
size, and since it can be used for DMA, it requires extending the huge
window up to MAX_PHYSMEM_BITS, which requires hypervisor support for:
1. huge TCE tables;
2. multilevel TCE tables;
3. huge IOMMU pages.

Certain hypervisors cannot do all of these, so the only option left is
restricting the huge DMA window to include only RAM and falling back to
the default DMA window for persistent memory.

This checks if the system has persistent memory. If it does not,
the DMA bypass mode is selected, i.e.
* dev->bus_dma_limit = 0
* dev->dma_ops_bypass = true <- this avoids calling dma_ops for mapping.

If there is such memory, this creates an identity mapping only for RAM and
sets dev->bus_dma_limit to let the generic code decide whether to
call into the direct DMA or the indirect DMA ops.

This should not change the existing behaviour when there is no persistent
memory, as dev->dma_ops_bypass is expected to be set.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/dma-iommu.c| 12 +--
 arch/powerpc/platforms/pseries/iommu.c | 44 --
 arch/powerpc/Kconfig   |  1 +
 3 files changed, 45 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index a1c744194018..d123b7205f76 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -90,8 +90,16 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
struct iommu_table *tbl = get_iommu_table_base(dev);
 
if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
-   dev->dma_ops_bypass = true;
-   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
+   /*
+* dma_iommu_bypass_supported() sets dma_max when there is
+* 1:1 mapping but it is somehow limited.
+* ibm,pmemory is one example.
+*/
+   dev->dma_ops_bypass = dev->bus_dma_limit == 0;
+   if (!dev->dma_ops_bypass)
+   dev_warn(dev, "iommu: 64-bit OK but using default 
ops\n");
+   else
+   dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
return 1;
}
 
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index e4198700ed1a..91112e748491 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -839,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn)
+static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
 {
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
@@ -851,6 +851,7 @@ static u64 find_existing_ddw(struct device_node *pdn)
if (window->device == pdn) {
direct64 = window->prop;
dma_addr = be64_to_cpu(direct64->dma_base);
+   *window_shift = be32_to_cpu(direct64->window_shift);
break;
}
}
@@ -,11 +1112,13 @@ static void reset_dma_window(struct pci_dev *dev, 
struct device_node *par_dn)
  */
 static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
-   int len, ret;
+   int len = 0, ret;
+   bool pmem_present = of_find_node_by_type(NULL, "ibm,pmemory") != NULL;
+   int max_ram_len = order_base_2(ddw_memory_hotplug_max());
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 dma_addr, max_addr;
+   u64 dma_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
@@ -1126,7 +1129,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 
	mutex_lock(&direct_window_init_mutex);
 
+   dma_addr = find_existing_ddw(pdn, &len);
+   dma_addr = find_existing_ddw(pdn, );
if (dma_addr != 0)
goto out_unlock;
 
@@ -1212,14 +1215,26 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
}
/* verify the window * number of ptes will map the partition */
/* check largest block * page size > max memory hotplug addr */
-   max_addr = ddw_memory_hotplug_max();
-   if (query.largest_available_block < (max_addr >> page_shift)) {
-   

[RFC PATCH kernel 1/2] irq: Add reference counting to IRQ mappings

2020-10-27 Thread Alexey Kardashevskiy
PCI devices share 4 legacy INTx interrupts from the same PCI host bridge.
Device drivers map/unmap hardware interrupts via irq_create_mapping()/
irq_dispose_mapping(). The problem is that these interrupts are
shared, and when performing hot unplug we need to unmap the interrupt
only when the last device is released.

This reuses already existing irq_desc::kobj for this purpose.
The refcounter is naturally 1 when the descriptor is allocated already;
this adds kobject_get() in places where already existing mapped virq
is returned.

This reorganizes irq_dispose_mapping() to release the kobj and let
the release callback do the cleanup.

If some driver or platform does its own reference counting, this expects
those parties to call irq_find_mapping() and call irq_dispose_mapping()
for every irq_create_fwspec_mapping()/irq_create_mapping().

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/irq/irqdesc.c   | 35 +++
 kernel/irq/irqdomain.c | 27 +--
 2 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..dae096238500 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -419,20 +419,39 @@ static struct irq_desc *alloc_desc(int irq, int node, 
unsigned int flags,
return NULL;
 }
 
+static void delayed_free_desc(struct rcu_head *rhp);
 static void irq_kobj_release(struct kobject *kobj)
 {
struct irq_desc *desc = container_of(kobj, struct irq_desc, kobj);
+   struct irq_domain *domain;
+   unsigned int virq = desc->irq_data.irq;
 
-   free_masks(desc);
-   free_percpu(desc->kstat_irqs);
-   kfree(desc);
+   domain = desc->irq_data.domain;
+   if (domain) {
+   if (irq_domain_is_hierarchy(domain)) {
+   irq_domain_free_irqs(virq, 1);
+   } else {
+   irq_domain_disassociate(domain, virq);
+   irq_free_desc(virq);
+   }
+   }
+
+   /*
+* We free the descriptor, masks and stat fields via RCU. That
+* allows demultiplex interrupts to do rcu based management of
+* the child interrupts.
+* This also allows us to use rcu in kstat_irqs_usr().
+*/
+   call_rcu(&desc->rcu, delayed_free_desc);
 }
 
 static void delayed_free_desc(struct rcu_head *rhp)
 {
struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);
 
-   kobject_put(&desc->kobj);
+   free_masks(desc);
+   free_percpu(desc->kstat_irqs);
+   kfree(desc);
 }
 
 static void free_desc(unsigned int irq)
@@ -453,14 +472,6 @@ static void free_desc(unsigned int irq)
 */
irq_sysfs_del(desc);
delete_irq_desc(irq);
-
-   /*
-* We free the descriptor, masks and stat fields via RCU. That
-* allows demultiplex interrupts to do rcu based management of
-* the child interrupts.
-* This also allows us to use rcu in kstat_irqs_usr().
-*/
-   call_rcu(&desc->rcu, delayed_free_desc);
 }
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index cf8b374b892d..02733ddc321f 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -638,6 +638,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 {
struct device_node *of_node;
int virq;
+   struct irq_desc *desc;
 
pr_debug("irq_create_mapping(0x%p, 0x%lx)\n", domain, hwirq);
 
@@ -655,7 +656,9 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
/* Check if mapping already exists */
virq = irq_find_mapping(domain, hwirq);
if (virq) {
+   desc = irq_to_desc(virq);
pr_debug("-> existing mapping on virq %d\n", virq);
+   kobject_get(&desc->kobj);
return virq;
}
 
@@ -751,6 +754,7 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec 
*fwspec)
irq_hw_number_t hwirq;
unsigned int type = IRQ_TYPE_NONE;
int virq;
+   struct irq_desc *desc;
 
if (fwspec->fwnode) {
domain = irq_find_matching_fwspec(fwspec, DOMAIN_BUS_WIRED);
@@ -787,8 +791,11 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwspec 
*fwspec)
 * current trigger type then we are done so return the
 * interrupt number.
 */
-   if (type == IRQ_TYPE_NONE || type == irq_get_trigger_type(virq))
+   if (type == IRQ_TYPE_NONE || type == 
irq_get_trigger_type(virq)) {
+   desc = irq_to_desc(virq);
+   kobject_get(&desc->kobj);
return virq;
+   }
 
/*
 * If the trigger type has not been set yet, then set
@@ -800,6 +807,8 @@ unsigned int irq_create_fwspec_mapping(struct irq_fwsp

[RFC PATCH kernel 2/2] powerpc/pci: Remove LSI mappings on device teardown

2020-10-27 Thread Alexey Kardashevskiy
From: Oliver O'Halloran 

When a passthrough IO adapter is removed from a pseries machine using hash
MMU and the XIVE interrupt mode, the POWER hypervisor expects the guest OS
to clear all page table entries related to the adapter. If some are still
present, the RTAS call which isolates the PCI slot returns error 9001
"valid outstanding translations" and the removal of the IO adapter fails.
This is because when the PHBs are scanned, Linux automatically maps the
INTx interrupts into the Linux interrupt number space, but these mappings
are never removed.

This problem can be fixed by adding the corresponding unmap operation when
the device is removed. There's no pcibios_* hook for the remove case, but
the same effect can be achieved using a bus notifier.

Cc: Cédric Le Goater 
Cc: Michael Ellerman 
Signed-off-by: Oliver O'Halloran 
Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/kernel/pci-common.c | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index be108616a721..95f4e173368a 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -404,6 +404,27 @@ static int pci_read_irq_line(struct pci_dev *pci_dev)
return 0;
 }
 
+static int ppc_pci_unmap_irq_line(struct notifier_block *nb,
+  unsigned long action, void *data)
+{
+   struct pci_dev *pdev = to_pci_dev(data);
+
+   if (action == BUS_NOTIFY_DEL_DEVICE)
+   irq_dispose_mapping(pdev->irq);
+
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block ppc_pci_unmap_irq_notifier = {
+   .notifier_call = ppc_pci_unmap_irq_line,
+};
+
+static int ppc_pci_register_irq_notifier(void)
+{
+   return bus_register_notifier(&pci_bus_type, &ppc_pci_unmap_irq_notifier);
+}
+arch_initcall(ppc_pci_register_irq_notifier);
+
 /*
  * Platform support for /proc/bus/pci/X/Y mmap()s.
  *  -- paulus.
-- 
2.17.1



[RFC PATCH kernel 0/2] irq: Add reference counting to IRQ mappings

2020-10-27 Thread Alexey Kardashevskiy


This is an attempt to fix a bug in PCI hot unplug when
a bunch of PCIe bridges and devices share INTx.

This did not hit us before: even if we did not
call irq_domain_ops::unmap, the platform (PowerVM) would not
produce an error. With POWER9's XIVE interrupt controller, however,
there is an error if unmap is not called at all (2/2 fixes that)
and an error if we unmap an interrupt which is still in use
by another device (1/2 fixes that).

One way of fixing that is doing reference counting in
the POWERPC code but since there is a kobj in irq_desc
already, I thought I'll give it a try first.
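
Roughly, the life cycle 1/2 aims for is the following (just a sketch to
illustrate the idea, not code from the patches; I am assuming here that
irq_dispose_mapping() ends up doing the final kobject_put() on the
descriptor):

===
/* two devices behind bridges share one INTx line (same hwirq) */
static void shared_intx_example(struct irq_domain *domain, irq_hw_number_t hwirq)
{
	unsigned int virq_a, virq_b;

	virq_a = irq_create_mapping(domain, hwirq); /* creates the mapping, ref == 1 */
	virq_b = irq_create_mapping(domain, hwirq); /* same virq returned, ref == 2 */

	irq_dispose_mapping(virq_a); /* device A removed: ref drops to 1, mapping stays */
	irq_dispose_mapping(virq_b); /* device B removed: last ref, unmap + free the desc */
}
===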


This is based on sha1
4525c8781ec0 Linus Torvalds "scsi: qla2xxx: remove incorrect sparse #ifdef".

Please comment. Thanks.



Alexey Kardashevskiy (1):
  irq: Add reference counting to IRQ mappings

Oliver O'Halloran (1):
  powerpc/pci: Remove LSI mappings on device teardown

 arch/powerpc/kernel/pci-common.c | 21 +++
 kernel/irq/irqdesc.c | 35 +---
 kernel/irq/irqdomain.c   | 27 
 3 files changed, 57 insertions(+), 26 deletions(-)

-- 
2.17.1



[PATCH kernel 0/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-20 Thread Alexey Kardashevskiy
This allows mixing direct DMA (to/from RAM) and
IOMMU DMA (to/from persistent memory) on the PPC64/pseries
platform. This was supposed to be a single patch but an
unexpected move of the direct DMA functions happened, hence the revert in 1/2.

This is based on sha1
7cf726a59435 Linus Torvalds "Merge tag 'linux-kselftest-kunit-5.10-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest".

Please comment. Thanks.



Alexey Kardashevskiy (2):
  Revert "dma-mapping: move large parts of  to
kernel/dma"
  powerpc/dma: Fallback to dma_ops when persistent memory present

 include/linux/dma-direct.h | 106 ++
 kernel/dma/direct.h| 119 -
 arch/powerpc/kernel/dma-iommu.c|  68 +-
 arch/powerpc/platforms/pseries/iommu.c |  41 +++--
 kernel/dma/direct.c|   2 +-
 kernel/dma/mapping.c   |   2 +-
 6 files changed, 207 insertions(+), 131 deletions(-)
 delete mode 100644 kernel/dma/direct.h

-- 
2.17.1



[PATCH kernel 2/2] powerpc/dma: Fallback to dma_ops when persistent memory present

2020-10-20 Thread Alexey Kardashevskiy
So far we have been using huge DMA windows to map all the RAM available.
The RAM is normally mapped to the VM address space contiguously, and
there is always a reasonable upper limit for possible future hot plugged
RAM which makes it easy to map all RAM via IOMMU.

Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
normal RAM) can be mapped anywhere in the VM space beyond the maximum RAM
size, and since it can be used for DMA, it requires extending the huge
window up to MAX_PHYSMEM_BITS, which requires hypervisor support for:
1. huge TCE tables;
2. multilevel TCE tables;
3. huge IOMMU pages.

Certain hypervisors cannot do any of these, so the only option left is
restricting the huge DMA window to include only RAM and falling back to
the default DMA window for persistent memory.

This checks if the system has persistent memory. If it does not,
the DMA bypass mode is selected, i.e.
* dev->bus_dma_limit = 0
* dev->dma_ops_bypass = true <- this avoids calling dma_ops for mapping.

If there is such memory, this creates an identity mapping only for RAM and
disables the DMA bypass mode, which makes the generic DMA code use indirect
dma_ops and may have a performance impact:
* dev->bus_dma_limit = bus_offset + max_ram_size
  for example 0x0800..8000. for a 2GB VM
* dev->dma_ops_bypass = false <- this forces indirect calls to dma_ops for
  every mapping, which then directs these to the small or the huge window.

This should not change the existing behaviour when there is no persistent memory.

Signed-off-by: Alexey Kardashevskiy 
---

Without reverting 19c65c3d30bb5a97170, I could have added

I can repost if this is preferable. Thanks.
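
To spell out the decision described in the commit log (an illustration
only, not the actual iommu.c hunk; "pmem_present" stands for a device tree
scan for "ibm,pmemory", and bus_offset/max_ram_size for the window start
and the top of RAM):

===
	if (!pmem_present) {
		/* all of RAM is identity-mapped: let generic code skip dma_ops */
		dev->bus_dma_limit = 0;
		dev->dma_ops_bypass = true;
	} else {
		/* RAM is identity-mapped, pmem is not: every map goes via dma_ops
		 * which picks the direct path or the small window per buffer
		 */
		dev->bus_dma_limit = bus_offset + max_ram_size;
		dev->dma_ops_bypass = false;
	}
===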

---
Changelog:
v2:
* rebased on current upstream with the device::bypass added and DMA
direct code movement reverted
---
 arch/powerpc/kernel/dma-iommu.c| 68 +-
 arch/powerpc/platforms/pseries/iommu.c | 41 +---
 2 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index a1c744194018..9a2a3b95f72d 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -10,6 +10,16 @@
 #include 
 #include 
 
+static inline bool can_map_direct(struct device *dev, phys_addr_t addr)
+{
+   return dev->bus_dma_limit >= phys_to_dma(dev, addr);
+}
+
+static inline bool dma_handle_direct(struct device *dev, dma_addr_t dma_handle)
+{
+   return dma_handle >= dev->archdata.dma_offset;
+}
+
 /*
  * Generic iommu implementation
  */
@@ -44,6 +54,12 @@ static dma_addr_t dma_iommu_map_page(struct device *dev, 
struct page *page,
 enum dma_data_direction direction,
 unsigned long attrs)
 {
+   if (dev->bus_dma_limit &&
+   can_map_direct(dev, (phys_addr_t) page_to_phys(page) +
+  offset + size))
+   return dma_direct_map_page(dev, page, offset, size, direction,
+  attrs);
+
return iommu_map_page(dev, get_iommu_table_base(dev), page, offset,
  size, dma_get_mask(dev), direction, attrs);
 }
@@ -53,6 +69,12 @@ static void dma_iommu_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
 size_t size, enum dma_data_direction direction,
 unsigned long attrs)
 {
+   if (dev->bus_dma_limit &&
+   dma_handle_direct(dev, dma_handle + size)) {
+   dma_direct_unmap_page(dev, dma_handle, size, direction, attrs);
+   return;
+   }
+
iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size, direction,
 attrs);
 }
@@ -62,6 +84,22 @@ static int dma_iommu_map_sg(struct device *dev, struct 
scatterlist *sglist,
int nelems, enum dma_data_direction direction,
unsigned long attrs)
 {
+   if (dev->bus_dma_limit) {
+   struct scatterlist *s;
+   bool direct = true;
+   int i;
+
+   for_each_sg(sglist, s, nelems, i) {
+   direct = can_map_direct(dev,
+   sg_phys(s) + s->offset + s->length);
+   if (!direct)
+   break;
+   }
+   if (direct)
+   return dma_direct_map_sg(dev, sglist, nelems, direction,
+attrs);
+   }
+
return ppc_iommu_map_sg(dev, get_iommu_table_base(dev), sglist, nelems,
dma_get_mask(dev), direction, attrs);
 }
@@ -70,6 +108,24 @@ static void dma_iommu_unmap_sg(struct device *dev, struct 
scatterlist *sglist,
int nelems, enum dma_data_direction direction,
unsigned long attrs)
 {
+   if (dev->bus_dma_l

[PATCH kernel 1/2] Revert "dma-mapping: move large parts of to kernel/dma"

2020-10-20 Thread Alexey Kardashevskiy
This reverts commit 19c65c3d30bb5a97170e425979d2e44ab2096c7d. That was
the right move in general, but sadly there is a POWERPC/pseries hardware
config which uses a mixture of direct and IOMMU DMA, and bringing this
logic to the generic code won't benefit anybody else. The user of
this revert comes in the next patch.

Signed-off-by: Alexey Kardashevskiy 
---
 include/linux/dma-direct.h | 106 +
 kernel/dma/direct.h| 119 -
 kernel/dma/direct.c|   2 +-
 kernel/dma/mapping.c   |   2 +-
 4 files changed, 108 insertions(+), 121 deletions(-)
 delete mode 100644 kernel/dma/direct.h

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 18aade195884..e388b77e0048 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -120,8 +120,114 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
 void dma_direct_free_pages(struct device *dev, size_t size,
struct page *page, dma_addr_t dma_addr,
enum dma_data_direction dir);
+int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
+   void *cpu_addr, dma_addr_t dma_addr, size_t size,
+   unsigned long attrs);
+bool dma_direct_can_mmap(struct device *dev);
+int dma_direct_mmap(struct device *dev, struct vm_area_struct *vma,
+   void *cpu_addr, dma_addr_t dma_addr, size_t size,
+   unsigned long attrs);
 int dma_direct_supported(struct device *dev, u64 mask);
+bool dma_direct_need_sync(struct device *dev, dma_addr_t dma_addr);
+int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
+   enum dma_data_direction dir, unsigned long attrs);
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs);
+size_t dma_direct_max_mapping_size(struct device *dev);
 
+#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
+defined(CONFIG_SWIOTLB)
+void dma_direct_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
+   int nents, enum dma_data_direction dir);
+#else
+static inline void dma_direct_sync_sg_for_device(struct device *dev,
+   struct scatterlist *sgl, int nents, enum dma_data_direction dir)
+{
+}
+#endif
+
+#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
+defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \
+defined(CONFIG_SWIOTLB)
+void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
+   int nents, enum dma_data_direction dir, unsigned long attrs);
+void dma_direct_sync_sg_for_cpu(struct device *dev,
+   struct scatterlist *sgl, int nents, enum dma_data_direction 
dir);
+#else
+static inline void dma_direct_unmap_sg(struct device *dev,
+   struct scatterlist *sgl, int nents, enum dma_data_direction dir,
+   unsigned long attrs)
+{
+}
+static inline void dma_direct_sync_sg_for_cpu(struct device *dev,
+   struct scatterlist *sgl, int nents, enum dma_data_direction dir)
+{
+}
+#endif
+
+static inline void dma_direct_sync_single_for_device(struct device *dev,
+   dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+   phys_addr_t paddr = dma_to_phys(dev, addr);
+
+   if (unlikely(is_swiotlb_buffer(paddr)))
+   swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_DEVICE);
+
+   if (!dev_is_dma_coherent(dev))
+   arch_sync_dma_for_device(paddr, size, dir);
+}
+
+static inline void dma_direct_sync_single_for_cpu(struct device *dev,
+   dma_addr_t addr, size_t size, enum dma_data_direction dir)
+{
+   phys_addr_t paddr = dma_to_phys(dev, addr);
+
+   if (!dev_is_dma_coherent(dev)) {
+   arch_sync_dma_for_cpu(paddr, size, dir);
+   arch_sync_dma_for_cpu_all();
+   }
+
+   if (unlikely(is_swiotlb_buffer(paddr)))
+   swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_CPU);
+
+   if (dir == DMA_FROM_DEVICE)
+   arch_dma_mark_clean(paddr, size);
+}
+
+static inline dma_addr_t dma_direct_map_page(struct device *dev,
+   struct page *page, unsigned long offset, size_t size,
+   enum dma_data_direction dir, unsigned long attrs)
+{
+   phys_addr_t phys = page_to_phys(page) + offset;
+   dma_addr_t dma_addr = phys_to_dma(dev, phys);
+
+   if (unlikely(swiotlb_force == SWIOTLB_FORCE))
+   return swiotlb_map(dev, phys, size, dir, attrs);
+
+   if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
+   if (swiotlb_force != SWIOTLB_NO_FORCE)
+   return swiotlb_map(dev, phys, size, dir, attrs);
+
+   dev_WARN_ONCE(dev, 1,
+"DMA addr %pad+%zu overflow (mask %llx, bus limit %llx).\n",
+&dma_addr, size, *dev->dma_mask, dev-&

Re: [PATCH 8/9] dma-mapping: move large parts of to kernel/dma

2020-10-18 Thread Alexey Kardashevskiy




On 30/09/2020 18:55, Christoph Hellwig wrote:

Most of the dma_direct symbols should only be used by direct.c and
mapping.c, so move them to kernel/dma.  In fact more of dma-direct.h
should eventually move, but that will require more coordination with
other subsystems.


Because of this change, 
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200713062348.100552-1-...@ozlabs.ru/ 
does not work anymore.


Should I send a patch moving 
dma_direct_map_sg/dma_direct_map_page/+unmap back to include/ or is 
there a better idea? Thanks,





Signed-off-by: Christoph Hellwig 
---
  include/linux/dma-direct.h | 106 -
  kernel/dma/direct.c|   2 +-
  kernel/dma/direct.h| 119 +
  kernel/dma/mapping.c   |   2 +-
  4 files changed, 121 insertions(+), 108 deletions(-)
  create mode 100644 kernel/dma/direct.h

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 38ed3b55034d50..a2d6640c42c04e 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -120,114 +120,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
  void dma_direct_free_pages(struct device *dev, size_t size,
struct page *page, dma_addr_t dma_addr,
enum dma_data_direction dir);
-int dma_direct_get_sgtable(struct device *dev, struct sg_table *sgt,
-   void *cpu_addr, dma_addr_t dma_addr, size_t size,
-   unsigned long attrs);
-bool dma_direct_can_mmap(struct device *dev);
-int dma_direct_mmap(struct device *dev, struct vm_area_struct *vma,
-   void *cpu_addr, dma_addr_t dma_addr, size_t size,
-   unsigned long attrs);
  int dma_direct_supported(struct device *dev, u64 mask);
-bool dma_direct_need_sync(struct device *dev, dma_addr_t dma_addr);
-int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
-   enum dma_data_direction dir, unsigned long attrs);
  dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
size_t size, enum dma_data_direction dir, unsigned long attrs);
-size_t dma_direct_max_mapping_size(struct device *dev);
  
-#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \

-defined(CONFIG_SWIOTLB)
-void dma_direct_sync_sg_for_device(struct device *dev, struct scatterlist *sgl,
-   int nents, enum dma_data_direction dir);
-#else
-static inline void dma_direct_sync_sg_for_device(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir)
-{
-}
-#endif
-
-#if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) || \
-defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU_ALL) || \
-defined(CONFIG_SWIOTLB)
-void dma_direct_unmap_sg(struct device *dev, struct scatterlist *sgl,
-   int nents, enum dma_data_direction dir, unsigned long attrs);
-void dma_direct_sync_sg_for_cpu(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction 
dir);
-#else
-static inline void dma_direct_unmap_sg(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir,
-   unsigned long attrs)
-{
-}
-static inline void dma_direct_sync_sg_for_cpu(struct device *dev,
-   struct scatterlist *sgl, int nents, enum dma_data_direction dir)
-{
-}
-#endif
-
-static inline void dma_direct_sync_single_for_device(struct device *dev,
-   dma_addr_t addr, size_t size, enum dma_data_direction dir)
-{
-   phys_addr_t paddr = dma_to_phys(dev, addr);
-
-   if (unlikely(is_swiotlb_buffer(paddr)))
-   swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_DEVICE);
-
-   if (!dev_is_dma_coherent(dev))
-   arch_sync_dma_for_device(paddr, size, dir);
-}
-
-static inline void dma_direct_sync_single_for_cpu(struct device *dev,
-   dma_addr_t addr, size_t size, enum dma_data_direction dir)
-{
-   phys_addr_t paddr = dma_to_phys(dev, addr);
-
-   if (!dev_is_dma_coherent(dev)) {
-   arch_sync_dma_for_cpu(paddr, size, dir);
-   arch_sync_dma_for_cpu_all();
-   }
-
-   if (unlikely(is_swiotlb_buffer(paddr)))
-   swiotlb_tbl_sync_single(dev, paddr, size, dir, SYNC_FOR_CPU);
-
-   if (dir == DMA_FROM_DEVICE)
-   arch_dma_mark_clean(paddr, size);
-}
-
-static inline dma_addr_t dma_direct_map_page(struct device *dev,
-   struct page *page, unsigned long offset, size_t size,
-   enum dma_data_direction dir, unsigned long attrs)
-{
-   phys_addr_t phys = page_to_phys(page) + offset;
-   dma_addr_t dma_addr = phys_to_dma(dev, phys);
-
-   if (unlikely(swiotlb_force == SWIOTLB_FORCE))
-   return swiotlb_map(dev, phys, size, dir, attrs);
-
-   if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
-   if (swiotlb_force != SWIOTLB_NO_FORCE)
-   

Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-10-14 Thread Alexey Kardashevskiy




On 23/09/2020 17:06, Cédric Le Goater wrote:

On 9/23/20 2:33 AM, Qian Cai wrote:

On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:

When a passthrough IO adapter is removed from a pseries machine using
hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
guest OS to clear all page table entries related to the adapter. If
some are still present, the RTAS call which isolates the PCI slot
returns error 9001 "valid outstanding translations" and the removal of
the IO adapter fails. This is because when the PHBs are scanned, Linux
maps automatically the INTx interrupts in the Linux interrupt number
space but these are never removed.

To solve this problem, we introduce a PPC platform specific
pcibios_remove_bus() routine which clears all interrupt mappings when
the bus is removed. This also clears the associated page table entries
of the ESB pages when using XIVE.

For this purpose, we record the logical interrupt numbers of the
mapped interrupt under the PHB structure and let pcibios_remove_bus()
do the clean up.

Since some PCI adapters, like GPUs, use the "interrupt-map" property
to describe interrupt mappings other than the legacy INTx interrupts,
we can not restrict the size of the mapping array to PCI_NUM_INTX. The
number of interrupt mappings is computed from the "interrupt-map"
property and the mapping array is allocated accordingly.

Cc: "Oliver O'Halloran" 
Cc: Alexey Kardashevskiy 
Signed-off-by: Cédric Le Goater 


Some syscall fuzzing will trigger this on POWER9 NV where the traces pointed to
this patch.

.config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config


OK. The patch is missing a NULL assignment after kfree() and that
might be the issue.

I did try PHB removal under PowerNV, so I would like to understand
how we managed to remove twice the PCI bus and possibly reproduce.
Any chance we could grab what the syscall fuzzer (syzkaller) did ?



How do you remove PHBs exactly? There is no such thing in the powernv 
platform - I thought someone had added it and you were fixing that, but no. 
PHBs on powernv are created at boot time and there is no way to 
remove them; you can only try removing all the bridges.


So what exactly are you doing?


--
Alexey


Re: [PATCH v2] powerpc/pci: unmap legacy INTx interrupts when a PHB is removed

2020-09-23 Thread Alexey Kardashevskiy



On 23/09/2020 17:06, Cédric Le Goater wrote:
> On 9/23/20 2:33 AM, Qian Cai wrote:
>> On Fri, 2020-08-07 at 12:18 +0200, Cédric Le Goater wrote:
>>> When a passthrough IO adapter is removed from a pseries machine using
>>> hash MMU and the XIVE interrupt mode, the POWER hypervisor expects the
>>> guest OS to clear all page table entries related to the adapter. If
>>> some are still present, the RTAS call which isolates the PCI slot
>>> returns error 9001 "valid outstanding translations" and the removal of
>>> the IO adapter fails. This is because when the PHBs are scanned, Linux
>>> maps automatically the INTx interrupts in the Linux interrupt number
>>> space but these are never removed.
>>>
>>> To solve this problem, we introduce a PPC platform specific
>>> pcibios_remove_bus() routine which clears all interrupt mappings when
>>> the bus is removed. This also clears the associated page table entries
>>> of the ESB pages when using XIVE.
>>>
>>> For this purpose, we record the logical interrupt numbers of the
>>> mapped interrupt under the PHB structure and let pcibios_remove_bus()
>>> do the clean up.
>>>
>>> Since some PCI adapters, like GPUs, use the "interrupt-map" property
>>> to describe interrupt mappings other than the legacy INTx interrupts,
>>> we can not restrict the size of the mapping array to PCI_NUM_INTX. The
>>> number of interrupt mappings is computed from the "interrupt-map"
>>> property and the mapping array is allocated accordingly.
>>>
>>> Cc: "Oliver O'Halloran" 
>>> Cc: Alexey Kardashevskiy 
>>> Signed-off-by: Cédric Le Goater 
>>
>> Some syscall fuzzing will trigger this on POWER9 NV where the traces pointed 
>> to
>> this patch.
>>
>> .config: https://gitlab.com/cailca/linux-mm/-/blob/master/powerpc.config
> 
> OK. The patch is missing a NULL assignement after kfree() and that
> might be the issue. 
> 
> I did try PHB removal under PowerNV, so I would like to understand 
> how we managed to remove twice the PCI bus and possibly reproduce. 
> Any chance we could grab what the syscall fuzzer (syzkaller) did ? 



My guess would be it is doing this in parallel to provoke races.



-- 
Alexey


Re: [PATCH kernel] srcu: Fix static initialization

2020-09-21 Thread Alexey Kardashevskiy



On 17/09/2020 02:12, Paul E. McKenney wrote:
> On Fri, Sep 11, 2020 at 06:52:08AM -0700, Paul E. McKenney wrote:
>> On Fri, Sep 11, 2020 at 03:09:41PM +1000, Alexey Kardashevskiy wrote:
>>> On 11/09/2020 04:53, Paul E. McKenney wrote:
>>>> On Wed, Sep 09, 2020 at 10:31:03PM +1000, Alexey Kardashevskiy wrote:
> 
> [ . . . ]
> 
>>>>> init_srcu_struct_nodes() assumes ssp->sda!=NULL but alloc_percpu() fails
>>>>> here:
>>>>>
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/percpu.c#n1734
>>>>> ===
>>>>>   } else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
>>>>>   pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
>>>>>   return NULL;
>>>>> ===
>>>>>
>>>>> I am still up to reading that osr-rcuusage.pdf to provide better
>>>>> analysis :) Thanks,
>>>>
>>>> Ah, got it!  Does the following patch help?
>>>>
>>>> There will likely also need to be changes to cleanup_srcu_struct(),
>>>> but first let's see if I understand the problem.  ;-)
>>>>
>>>>Thanx, Paul
>>>>
>>>> 
>>>>
>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>> index c13348e..6f7880a 100644
>>>> --- a/kernel/rcu/srcutree.c
>>>> +++ b/kernel/rcu/srcutree.c
>>>> @@ -177,11 +177,13 @@ static int init_srcu_struct_fields(struct 
>>>> srcu_struct *ssp, bool is_static)
>>>>INIT_DELAYED_WORK(&ssp->work, process_srcu);
>>>>if (!is_static)
>>>>ssp->sda = alloc_percpu(struct srcu_data);
>>>> +  if (!ssp->sda)
>>>> +  return -ENOMEM;
>>>>init_srcu_struct_nodes(ssp, is_static);
>>>>ssp->srcu_gp_seq_needed_exp = 0;
>>>>ssp->srcu_last_gp_end = ktime_get_mono_fast_ns();
>>>>smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */
>>>
>>> The line above confuses me a bit. What you propose returns without
>>> smp_store_release() called which should not matter I suppose.
>>
>> The idea is that if init_srcu_struct() returns -ENOMEM, the structure
>> has not been initialized and had better not be used.  If the calling code
>> cannot handle that outcome, then the calling code needs to do something
>> to insulate init_srcu_struct() from signals.  One thing that it could
>> do would be to invoke init_srcu_struct() from a workqueue handler and
>> wait for this handler to complete.
>>
>> Please keep in mind that there is nothing init_srcu_struct() can do
>> about this:  The srcu_struct is useless unless alloc_percpu() succeeds.
>>
>> And yes, I do need to update the header comments to make this clear.
>>
>>> Otherwise it should work, although I cannot verify right now as my box
>>> went down and since it is across Pacific - it may take time to power
>>> cycle it :) Thanks,
>>
>> I know that feeling!  And here is hoping that the box is out of reach
>> of the local hot spots.  ;-)
> 
> Just following up...  Did that patch help?


Yes it did.

Tested-by: Alexey Kardashevskiy 
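
(For anyone else hitting this: the point above is that a failed
init_srcu_struct() means the srcu_struct must not be used at all, so the
caller has to check the return value, roughly:

===
	ret = init_srcu_struct(&foo->srcu);
	if (ret)
		return ret;	/* ssp->sda was never allocated, do not touch the srcu_struct */
===

This is just an illustration of the calling convention, "foo" being
whatever structure embeds the srcu_struct.)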



> 
>   Thanx, Paul
> 

-- 
Alexey


Re: [PATCH kernel] srcu: Fix static initialization

2020-09-10 Thread Alexey Kardashevskiy



On 11/09/2020 04:53, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 10:31:03PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 09/09/2020 21:50, Paul E. McKenney wrote:
>>> On Wed, Sep 09, 2020 at 07:24:11PM +1000, Alexey Kardashevskiy wrote:
>>>>
>>>>
>>>> On 09/09/2020 00:43, Alexey Kardashevskiy wrote:
>>>>> init_srcu_struct_nodes() is called with is_static==true only internally
>>>>> and when this happens, the srcu->sda is not initialized in
>>>>> init_srcu_struct_fields() and we crash on dereferencing @sdp.
>>>>>
>>>>> This fixes the crash by moving "if (is_static)" out of the loop which
>>>>> only does useful work for is_static=false case anyway.
>>>>>
>>>>> Found by syzkaller.
>>>>>
>>>>> Signed-off-by: Alexey Kardashevskiy 
>>>>> ---
>>>>>  kernel/rcu/srcutree.c | 5 +++--
>>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>>>> index c100acf332ed..49b54a50bde8 100644
>>>>> --- a/kernel/rcu/srcutree.c
>>>>> +++ b/kernel/rcu/srcutree.c
>>>>> @@ -135,6 +135,9 @@ static void init_srcu_struct_nodes(struct srcu_struct 
>>>>> *ssp, bool is_static)
>>>>>  levelspread[level - 1];
>>>>>   }
>>>>>  
>>>>> + if (is_static)
>>>>> + return;
>>>>
>>>> Actually, this is needed here too:
>>>>
>>>>  if (!ssp->sda)
>>>>  return;
>>>>
>>>> as
>>>> ssp->sda = alloc_percpu(struct srcu_data)
>>>>
>>>> can fail if the process is killed too soon - it is quite easy to get
>>>> this situation with syzkaller (syscalls fuzzer)
>>>>
>>>> Makes sense?
>>>
>>> Just to make sure that I understand, these failures occur when the task
>>> running init_srcu_struct_nodes() is killed, correct?
>>
>> There are multiple user tasks (8) which open /dev/kvm, do 1 ioctl -
>> KVM_CREATE_VM - and terminate, running on 8 vcpus, 8 VMs, crashes every
>> 20min or so, less tasks or vcpus - and the problem does not appear.
>>
>>
>>>
>>> Or has someone managed to invoke (say) synchronize_srcu() on a
>>> dynamically allocated srcu_struct before invoking init_srcu_struct() on
>>> that srcu_struct?  
>>
>> Nah, none of that :)
>>
>> init_srcu_struct_nodes() assumes ssp->sda!=NULL but alloc_percpu() fails
>> here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/percpu.c#n1734
>> ===
>>  } else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
>>  pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
>>  return NULL;
>> ===
>>
>> I am still up to reading that osr-rcuusage.pdf to provide better
>> analysis :) Thanks,
> 
> Ah, got it!  Does the following patch help?
> 
> There will likely also need to be changes to cleanup_srcu_struct(),
> but first let's see if I understand the problem.  ;-)
> 
>   Thanx, Paul
> 
> 
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index c13348e..6f7880a 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -177,11 +177,13 @@ static int init_srcu_struct_fields(struct srcu_struct 
> *ssp, bool is_static)
>   INIT_DELAYED_WORK(&ssp->work, process_srcu);
>   if (!is_static)
>   ssp->sda = alloc_percpu(struct srcu_data);
> + if (!ssp->sda)
> + return -ENOMEM;
>   init_srcu_struct_nodes(ssp, is_static);
>   ssp->srcu_gp_seq_needed_exp = 0;
>   ssp->srcu_last_gp_end = ktime_get_mono_fast_ns();
>   smp_store_release(&ssp->srcu_gp_seq_needed, 0); /* Init done. */


The line above confuses me a bit. What you propose returns without
smp_store_release() called which should not matter I suppose.

Otherwise it should work, although I cannot verify right now as my box
went down and since it is across Pacific - it may take time to power
cycle it :) Thanks,


> - return ssp->sda ? 0 : -ENOMEM;
> + return 0;
>  }
>  
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
> 

-- 
Alexey


Re: [PATCH kernel] srcu: Fix static initialization

2020-09-09 Thread Alexey Kardashevskiy



On 09/09/2020 21:50, Paul E. McKenney wrote:
> On Wed, Sep 09, 2020 at 07:24:11PM +1000, Alexey Kardashevskiy wrote:
>>
>>
>> On 09/09/2020 00:43, Alexey Kardashevskiy wrote:
>>> init_srcu_struct_nodes() is called with is_static==true only internally
>>> and when this happens, the srcu->sda is not initialized in
>>> init_srcu_struct_fields() and we crash on dereferencing @sdp.
>>>
>>> This fixes the crash by moving "if (is_static)" out of the loop which
>>> only does useful work for is_static=false case anyway.
>>>
>>> Found by syzkaller.
>>>
>>> Signed-off-by: Alexey Kardashevskiy 
>>> ---
>>>  kernel/rcu/srcutree.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
>>> index c100acf332ed..49b54a50bde8 100644
>>> --- a/kernel/rcu/srcutree.c
>>> +++ b/kernel/rcu/srcutree.c
>>> @@ -135,6 +135,9 @@ static void init_srcu_struct_nodes(struct srcu_struct 
>>> *ssp, bool is_static)
>>>levelspread[level - 1];
>>> }
>>>  
>>> +   if (is_static)
>>> +   return;
>>
>> Actually, this is needed here too:
>>
>>  if (!ssp->sda)
>>  return;
>>
>> as
>> ssp->sda = alloc_percpu(struct srcu_data)
>>
>> can fail if the process is killed too soon - it is quite easy to get
>> this situation with syzkaller (syscalls fuzzer)
>>
>> Makes sense?
> 
> Just to make sure that I understand, these failures occur when the task
> running init_srcu_struct_nodes() is killed, correct?

There are multiple user tasks (8) which open /dev/kvm, do one ioctl -
KVM_CREATE_VM - and terminate; with 8 such tasks running on 8 vcpus and
8 VMs it crashes every 20 min or so, and with fewer tasks or vcpus the
problem does not appear.
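
Roughly, each of those tasks does no more than this (a sketch, not the
actual syzkaller program):

===
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int main(void)
{
	int kvm = open("/dev/kvm", O_RDWR);

	if (kvm >= 0)
		ioctl(kvm, KVM_CREATE_VM, 0);
	return 0; /* exits (or gets killed) right away - the early exit is what matters */
}
===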


> 
> Or has someone managed to invoke (say) synchronize_srcu() on a
> dynamically allocated srcu_struct before invoking init_srcu_struct() on
> that srcu_struct?  

Nah, none of that :)

init_srcu_struct_nodes() assumes ssp->sda!=NULL but alloc_percpu() fails
here:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/percpu.c#n1734
===
} else if (mutex_lock_killable(&pcpu_alloc_mutex)) {
pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size);
return NULL;
===

I am still up to reading that osr-rcuusage.pdf to provide better
analysis :) Thanks,


> This would be an SRCU usage bug.  If you dynamically
> allocate your srcu_struct, you are absolutely required to invoke
> init_srcu_struct() on it before doing anything else with it.
> 
> Or am I missing something here?
> 
> (The rcutorture test suite does test both static and dynamic allocation
> of the srcu_struct, so I am expecting something a bit subtle here.)
> 
>   Thanx, Paul
> 
>>> +
>>> /*
>>>  * Initialize the per-CPU srcu_data array, which feeds into the
>>>  * leaves of the srcu_node tree.
>>> @@ -161,8 +164,6 @@ static void init_srcu_struct_nodes(struct srcu_struct 
>>> *ssp, bool is_static)
>>> timer_setup(&sdp->delay_work, srcu_delay_timer, 0);
>>> sdp->ssp = ssp;
>>> sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
>>> -   if (is_static)
>>> -   continue;
>>>  
>>> /* Dynamically allocated, better be no srcu_read_locks()! */
>>> for (i = 0; i < ARRAY_SIZE(sdp->srcu_lock_count); i++) {
>>>
>>
>> -- 
>> Alexey

-- 
Alexey


Re: [PATCH kernel] srcu: Fix static initialization

2020-09-09 Thread Alexey Kardashevskiy



On 09/09/2020 00:43, Alexey Kardashevskiy wrote:
> init_srcu_struct_nodes() is called with is_static==true only internally
> and when this happens, the srcu->sda is not initialized in
> init_srcu_struct_fields() and we crash on dereferencing @sdp.
> 
> This fixes the crash by moving "if (is_static)" out of the loop which
> only does useful work for is_static=false case anyway.
> 
> Found by syzkaller.
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  kernel/rcu/srcutree.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index c100acf332ed..49b54a50bde8 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -135,6 +135,9 @@ static void init_srcu_struct_nodes(struct srcu_struct 
> *ssp, bool is_static)
>  levelspread[level - 1];
>   }
>  
> + if (is_static)
> + return;

Actually, this is needed here too:

 if (!ssp->sda)
 return;

as
ssp->sda = alloc_percpu(struct srcu_data)

can fail if the process is killed too soon - it is quite easy to get
this situation with syzkaller (syscalls fuzzer)

Makes sense?


> +
>   /*
>* Initialize the per-CPU srcu_data array, which feeds into the
>* leaves of the srcu_node tree.
> @@ -161,8 +164,6 @@ static void init_srcu_struct_nodes(struct srcu_struct 
> *ssp, bool is_static)
>   timer_setup(&sdp->delay_work, srcu_delay_timer, 0);
>   sdp->ssp = ssp;
>   sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
> - if (is_static)
> - continue;
>  
>   /* Dynamically allocated, better be no srcu_read_locks()! */
>   for (i = 0; i < ARRAY_SIZE(sdp->srcu_lock_count); i++) {
> 

-- 
Alexey


[PATCH kernel] srcu: Fix static initialization

2020-09-08 Thread Alexey Kardashevskiy
init_srcu_struct_nodes() is called with is_static==true only internally
and when this happens, the srcu->sda is not initialized in
init_srcu_struct_fields() and we crash on dereferencing @sdp.

This fixes the crash by moving "if (is_static)" out of the loop which
only does useful work for is_static=false case anyway.

Found by syzkaller.

Signed-off-by: Alexey Kardashevskiy 
---
 kernel/rcu/srcutree.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index c100acf332ed..49b54a50bde8 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -135,6 +135,9 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp, 
bool is_static)
   levelspread[level - 1];
}
 
+   if (is_static)
+   return;
+
/*
 * Initialize the per-CPU srcu_data array, which feeds into the
 * leaves of the srcu_node tree.
@@ -161,8 +164,6 @@ static void init_srcu_struct_nodes(struct srcu_struct *ssp, 
bool is_static)
timer_setup(&sdp->delay_work, srcu_delay_timer, 0);
sdp->ssp = ssp;
sdp->grpmask = 1 << (cpu - sdp->mynode->grplo);
-   if (is_static)
-   continue;
 
/* Dynamically allocated, better be no srcu_read_locks()! */
for (i = 0; i < ARRAY_SIZE(sdp->srcu_lock_count); i++) {
-- 
2.17.1



Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-09-07 Thread Alexey Kardashevskiy




On 04/09/2020 16:04, Leonardo Bras wrote:

On Thu, 2020-09-03 at 14:41 +1000, Alexey Kardashevskiy wrote:

I am new to this, so I am trying to understand how a memory page mapped

as DMA, and used for something else could be a problem.


  From the device perspective, there is PCI space and everything from 0
till 1<<64 is accessible, and what that is mapped to - the device does
not know. The PHB's IOMMU is the thing to notice an invalid access and raise
EEH, but the PHB only knows about the PCI->physical memory mapping (with IOMMU
pages) and nothing about the host kernel pages. Does this help? Thanks,


According to our conversation on Slack:
1- There is a problem if a hypervisor gives its VMs contiguous
memory blocks that are not aligned to IOMMU pages, because then an
iommu_map_page() could map some memory in this VM and some memory in
another VM / process.
2- To guarantee this, we should have system pagesize >= iommu_pagesize

One way to get (2) is by doing this in enable_ddw():
if ((query.page_size & 4) && PAGE_SHIFT >= 24) {


You won't ever (well, not any time soon) see PAGE_SHIFT==24, it is either 
4K or 64K. However 16MB IOMMU pages are fine - if the hypervisor uses huge 
pages for the VM's RAM, it also then advertises huge IOMMU pages in 
ddw-query. So for the 1:1 case there must be no "PAGE_SHIFT >= 24".




page_shift = 24; /* 16MB */
} else if ((query.page_size & 2) &&  PAGE_SHIFT >= 16 ) {
page_shift = 16; /* 64kB */
} else if (query.page_size & 1 &&  PAGE_SHIFT >= 12) {
page_shift = 12; /* 4kB */
[...]

Another way of solving this, would be adding in LoPAR documentation
that the blocksize of contiguous memory the hypervisor gives a VM
should always be aligned to IOMMU pagesize offered.


I think this is assumed already by the design of the DDW API.



I think the best approach would be first sending the above patch, which
is faster, and then get working into adding that to documentation, so
hypervisors guarantee this.

If this gets into the docs, we can revert the patch.

What do you think?
I think we diverted from the original patch :) I am not quite sure what 
you were fixing there. Thanks,



--
Alexey


Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode

2020-09-05 Thread Alexey Kardashevskiy




On 31/08/2020 16:40, Christoph Hellwig wrote:

On Sun, Aug 30, 2020 at 11:04:21AM +0200, Cédric Le Goater wrote:

Hello,

On 7/8/20 5:24 PM, Christoph Hellwig wrote:

Use the DMA API bypass mechanism for direct window mappings.  This uses
common code and speed up the direct mapping case by avoiding indirect
calls just when not using dma ops at all.  It also fixes a problem where
the sync_* methods were using the bypass check for DMA allocations, but
those are part of the streaming ops.

Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
has never been well defined, as is only used by a few drivers, which
IIRC never showed up in the typical Cell blade setups that are affected
by the ordering workaround.

Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
Signed-off-by: Christoph Hellwig 
---
  arch/powerpc/Kconfig  |  1 +
  arch/powerpc/include/asm/device.h |  5 --
  arch/powerpc/kernel/dma-iommu.c   | 90 ---
  3 files changed, 10 insertions(+), 86 deletions(-)


I am seeing corruptions on a couple of POWER9 systems (boston) when
stressed with IO. stress-ng gives some results but I have first seen
it when compiling the kernel in a guest and this is still the best way
to raise the issue.

These systems have of a SAS Adaptec controller :

   0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 
3 (rev 01)

When the failure occurs, the POWERPC EEH interrupt fires and dumps
lowlevel PHB4 registers among which :

   [ 2179.251069490,3] PHB#0003[0:3]:   phbErrorStatus = 
0280
   [ 2179.251117476,3] PHB#0003[0:3]:  phbFirstErrorStatus = 
0200

The bits raised identify a PPC 'TCE' error, which means it is related
to DMAs. See below for more details.


Reverting this patch "fixes" the issue but it is probably else where,
in some other layers or in the aacraid driver. How should I proceed
to get more information ?


The aacraid DMA masks look like a mess.



It kind of does, and it is. The thing is that after f1565c24b596 the driver 
sets a 32 bit DMA mask which in turn enables the small DMA window (not 
bypass), and since the aacraid driver has at least one bug with double 
unmap of the same DMA handle, this somehow leads to EEH (PCI DMA error).



The driver sets the 32bit mask because it calls dma_get_required_mask() 
_before_ setting the mask, so dma_get_required_mask() does not go down the 
dma_alloc_direct() path and instead calls the powerpc's 
dma_iommu_get_required_mask() which:


1. does the math like this (spot 2 bugs):

mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1)

2. but even after fixing that, the driver crashes as f1565c24b596 
removed the call to dma_iommu_bypass_supported() so it enforces IOMMU.



The patch below (the first hunk to be precise) brings things back to 
where they were (64bit mask). The double unmap bug in the driver is 
still to be investigated.




diff --git a/arch/powerpc/kernel/dma-iommu.c 
b/arch/powerpc/kernel/dma-iommu.c

index 569fecd7b5b2..785abccb90fc 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -117,10 +117,18 @@ u64 dma_iommu_get_required_mask(struct device *dev)
struct iommu_table *tbl = get_iommu_table_base(dev);
u64 mask;

+   if (dev_is_pci(dev)) {
+   u64 bypass_mask = dma_direct_get_required_mask(dev);
+
+   if (dma_iommu_bypass_supported(dev, bypass_mask))
+   return bypass_mask;
+   }
+
if (!tbl)
return 0;

-   mask = 1ULL < (fls_long(tbl->it_offset + tbl->it_size) - 1);
+   mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) +
+   tbl->it_page_shift - 1);
mask += mask - 1;

return mask;
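
To make the two bugs concrete, take a typical default window: it_offset = 0,
it_size = 0x80000 IOMMU pages of 4K (i.e. 2GB), it_page_shift = 12:
- with "<" the expression is a comparison, so mask ends up being 0 or 1;
- with "<<" but without the page shift, fls_long(0x80000) == 20 gives
  1ULL << 19 == 0x80000, and after "mask += mask - 1" the result is 0xfffff,
  i.e. a ~1MB mask;
- with the page shift added, 1ULL << (20 + 12 - 1) == 0x80000000, and the
  final mask is 0xffffffff, which actually covers the 2GB window.
(The numbers here are just for illustration.)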



--
Alexey


Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-09-03 Thread Alexey Kardashevskiy




On 02/09/2020 16:11, Leonardo Bras wrote:

On Mon, 2020-08-31 at 14:35 +1000, Alexey Kardashevskiy wrote:


On 29/08/2020 04:36, Leonardo Bras wrote:

On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:

On 18/08/2020 09:40, Leonardo Bras wrote:

As of today, if the biggest DDW that can be created can't map the whole
partition, it's creation is skipped and the default DMA window
"ibm,dma-window" is used instead.

DDW is 16x bigger than the default DMA window,


16x only under very specific circumstances which are
1. phyp
2. sriov
3. device class in hmc (or what that priority number is in the lpar config).


Yeah, missing details.


having the same amount of
pages, but increasing the page size to 64k.
Besides larger DMA window,


"Besides being larger"?


You are right there.


it performs better for allocations over 4k,


Better how?


I was thinking for allocations larger than (512 * 4k), since >2
hypercalls are needed here, and for 64k pages would still be just 1
hypercall up to (512 * 64k).
But yeah, not the usual case anyway.


Yup.
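
(To spell out the arithmetic: H_PUT_TCE_INDIRECT takes up to 512 TCEs per
call, so one hcall covers 512 * 4K = 2MB of DMA space with the default
window but 512 * 64K = 32MB with a 64K-page DDW, hence fewer hcalls for
big maps.)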



so it would be nice to use it instead.


I'd rather say something like:
===
So far we assumed we can map the guest RAM 1:1 to the bus which worked
with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.
===


I mixed this in my commit message, it looks like this:

===
powerpc/pseries/iommu: Make use of DDW for indirect mapping

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW
creation is skipped and the default DMA window "ibm,dma-window" is used
instead.

The default DMA window uses 4k pages instead of 64k pages, and since
the amount of pages is the same,


Is the amount really the same? I thought you can prioritize some VFs
over others (== allocate different number of TCEs). Does it really
matter if it is the same?


On a conversation with Travis Pizel, he explained how it's supposed to
work, and I understood this:

When a VF is created, it will be assigned a capacity, like 4%, 20%, and
so on. The number of 'TCE entries' that are available to that partition
are proportional to that capacity.

If we use the default DMA window, the IOMMU pagesize/entry will be 4k,
and if we use DDW, we will get 64k pagesize. As the number of entries
will be the same (for the same capacity), the total space that can be
addressed by the IOMMU will be 16 times bigger. This sometimes enable
direct mapping, but sometimes it's still not enough.



Good to know. This is still an implementation detail, QEMU does not 
allocate TCEs like this.





On Travis words :
"A low capacity VF, with less resources available, will certainly have
less DMA window capability than a high capacity VF. But, an 8GB DMA
window (with 64k pages) is still 16x larger than an 512MB window (with
4K pages).
A high capacity VF - for example, one that Leonardo has in his scenario
- will go from 8GB (using 4K pages) to 128GB (using 64K pages) - again,
16x larger - but it's obviously still possible to create a partition
that exceeds 128GB of memory in size."



Right except the default dma window is not 8GB, it is <=2GB.








making use of DDW instead of the
default DMA window for indirect mapping will expand in 16x the amount
of memory that can be mapped on DMA.


Stop saying "16x", it is not guaranteed by anything :)



The DDW created will be used for direct mapping by default. [...]
===

What do you think?


The DDW created will be used for direct mapping by default.
If it's not available, indirect mapping will be used instead.

For indirect mapping, it's necessary to update the iommu_table so
iommu_alloc() can use the DDW created. For this,
iommu_table_update_window() is called when everything else succeeds
at enable_ddw().

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

As there will never have both direct and indirect mappings at the same
time, the same property name can be used for the created DDW.

So renaming
define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
to
define DMA64_PROPNAME "linux,dma64-ddr-window-info"
looks the right thing to do.


I know I suggested this but this does not look so good anymore as I
suspect it breaks kexec (from older kernel to this one) so you either
need to check for both DT names or just keep the old one. Changing the
macro

Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-09-02 Thread Alexey Kardashevskiy




On 02/09/2020 08:34, Leonardo Bras wrote:

On Mon, 2020-08-31 at 10:47 +1000, Alexey Kardashevskiy wrote:


Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
should be enough, is there any chance to get indirect mapping in qemu
like this? (DDW but with smaller DMA window available)


You will have to hack the guest kernel to always do indirect mapping or
hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
available TCEs. But you will be testing QEMU/KVM which behave quite
differently to pHyp in this particular case.



As you suggested before, building for 4k cpu pagesize should be the
best approach. It would allow testing for both pHyp and qemu scenarios.


Because if we want the former (==support), then we'll have to align the
size up to the bigger page size when allocating/zeroing system pages,
etc.


This part I don't understand. Why do we need to align everything to the
bigger pagesize?

I mean, is not that enough that the range [ret, ret + size[ is both
allocated by mm and mapped on a iommu range?

Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
IOMMU_PAGE_SIZE() == 64k.
Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough?
All the space the user asked for is allocated and mapped for DMA.


The user asked to map 16K, the rest - 48K - is used for something else
(maybe even mapped to another device) but you are making all 64K
accessible to the device which should only be able to access 16K.

In practice, if this happens, H_PUT_TCE will simply fail.
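
To put numbers on it: with 4K system pages and a 64K IOMMU page, mapping a
16K buffer still programs a TCE covering the whole 64K, so the remaining
48K - twelve unrelated 4K system pages - become accessible to the device
as well (unless, as above, H_PUT_TCE refuses the request in the first
place).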


I have noticed mlx5 driver getting a few bytes in a buffer, and using
iommu_map_page(). It does map a whole page for as few bytes as the user


Whole 4K system page or whole 64K iommu page?


I tested it in 64k system page + 64k iommu page.

The 64K system page may be used for anything, and a small portion of it
(say 128 bytes) needs to be used for DMA.
The whole page is mapped by IOMMU, and the driver gets info of the
memory range it should access / modify.



This works because the whole system page belongs to the same memory 
context and IOMMU allows a device to access that page. You can still 
have problems if there is a bug within the page but it will go mostly 
unnoticed as it will be memory corruption.


If your system page is smaller (4K) than the IOMMU page (64K), then the 
device gets wider access than it should, but it is still going to be 
silent memory corruption.








wants mapped, and the other bytes get used for something else, or just
mapped on another DMA page.
It seems to work fine.



With 4K system page and 64K IOMMU page? In practice it would take an
effort or/and bad luck to see it crashing. Thanks,


I haven't tested it yet. On a 64k system page and 4k/64k iommu page, it
works as described above.

I am new to this, so I am trying to understand how a memory page mapped
as DMA, and used for something else could be a problem.


From the device perspective, there is PCI space and everything from 0 
till 1<<64 is accessible, and what that is mapped to - the device does 
not know. The PHB's IOMMU is the thing to notice an invalid access and raise 
EEH, but the PHB only knows about the PCI->physical memory mapping (with 
IOMMU pages) and nothing about the host kernel pages. Does this help? Thanks,





Thanks!






Bigger pages are not the case here as I understand it.


I did not get this part, what do you mean?


Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
supported set of sizes is different for P8/P9 and type of IO (PHB,
NVLink/CAPI).



Update those functions to guarantee alignment with requested size
using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().

Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/kernel/iommu.c | 17 +
  1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 9704f3f76e63..d7086087830f 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device *dev,
}
  
  	if (dev)

-   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
- 1 << tbl->it_page_shift);
+   boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);


Run checkpatch.pl, should complain about a long line.


It's 86 columns long, which is less than the new limit of 100 columns
Linus announced a few weeks ago. checkpatch.pl was updated too:
https://www.phoronix.com/scan.php?page=news_item=Linux-Kernel-Deprecates-80-Col


Yay finally :) Thanks,


:)




else
-   boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
+   boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
/* 4GB boundary fo

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-09-02 Thread Alexey Kardashevskiy




On 02/09/2020 07:38, Leonardo Bras wrote:

On Mon, 2020-08-31 at 13:48 +1000, Alexey Kardashevskiy wrote:

Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
hardcoded 40-bit mask (0xfffffffffful), for a hard-coded 12-bit (4k)
pagesize, and PAPR+/LoPAR also defines the TCE as having bits 0-51
described as RPN, as described before.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
shows system memory mapping into a TCE, and the TCE also has bits 0-51
for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
In fact, by the looks of those figures, the RPN_MASK should always be a
52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.


I suspect the mask is there in the first place for extra protection
against too big addresses going to the TCE table (and/or for virtual vs
physical addresses). Using a 52bit mask makes no sense for anything, you
could just drop the mask and let the C compiler deal with the 64bit "uint"
as it is basically a 4K page address anywhere in the 64bit space. Thanks,


Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
physical address space. The IODA3 spec does explicitly say the upper
bits are optional and the implementation only needs to support enough
to cover up to the physical address limit, which is 56bits of P9 /
PHB4. If you want to validate that the address will fit inside of
MAX_PHYSMEM_BITS then fine, but I think that should be done as a
WARN_ON or similar rather than just silently masking off the bits.


We can do this and probably should anyway but I am also pretty sure we
can just ditch the mask and have the hypervisor return an error which
will show up in dmesg.


Ok then, ditching the mask.



Well, you could run a little experiment and set some bits above that old 
mask and see how phyp reacts :)
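
For completeness, the WARN_ON-style check mentioned above could look
roughly like this (an illustration only, not the actual patch; "physaddr",
"tceshift" and "proto_tce" as used around tce_build_pSeriesLP()):

===
	rpn = physaddr >> tceshift;
	/* catch addresses the platform cannot back instead of silently
	 * truncating them with a mask
	 */
	if (WARN_ON_ONCE(physaddr >= (1UL << MAX_PHYSMEM_BITS)))
		return -EINVAL;	/* or just let the hypervisor reject H_PUT_TCE */
	tce = proto_tce | (rpn << tceshift);
===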




Thanks!



--
Alexey


Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 04:36, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> As of today, if the biggest DDW that can be created can't map the whole
>>> partition, it's creation is skipped and the default DMA window
>>> "ibm,dma-window" is used instead.
>>>
>>> DDW is 16x bigger than the default DMA window,
>>
>> 16x only under very specific circumstances which are
>> 1. phyp
>> 2. sriov
>> 3. device class in hmc (or what that priority number is in the lpar config).
> 
> Yeah, missing details.
> 
>>> having the same amount of
>>> pages, but increasing the page size to 64k.
>>> Besides larger DMA window,
>>
>> "Besides being larger"?
> 
> You are right there.
> 
>>
>>> it performs better for allocations over 4k,
>>
>> Better how?
> 
> I was thinking for allocations larger than (512 * 4k), since >2
> hypercalls are needed here, and for 64k pages would still be just 1
> hypercall up to (512 * 64k). 
> But yeah, not the usual case anyway.

Yup.


> 
>>
>>> so it would be nice to use it instead.
>>
>> I'd rather say something like:
>> ===
>> So far we assumed we can map the guest RAM 1:1 to the bus which worked
>> with a small number of devices. SRIOV changes it as the user can
>> configure hundreds VFs and since phyp preallocates TCEs and does not
>> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
>> per a PE to limit waste of physical pages.
>> ===
> 
> I mixed this in my commit message, it looks like this:
> 
> ===
> powerpc/pseries/iommu: Make use of DDW for indirect mapping
> 
> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> works with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
> 
> As of today, if the assumed direct mapping is not possible, DDW
> creation is skipped and the default DMA window "ibm,dma-window" is used
> instead.
> 
> The default DMA window uses 4k pages instead of 64k pages, and since
> the amount of pages is the same,


Is the amount really the same? I thought you can prioritize some VFs
over others (== allocate different number of TCEs). Does it really
matter if it is the same?


> making use of DDW instead of the
> default DMA window for indirect mapping will expand in 16x the amount
> of memory that can be mapped on DMA.

Stop saying "16x", it is not guaranteed by anything :)


> 
> The DDW created will be used for direct mapping by default. [...]
> ===
> 
> What do you think?
> 
>>> The DDW created will be used for direct mapping by default.
>>> If it's not available, indirect mapping will be used instead.
>>>
>>> For indirect mapping, it's necessary to update the iommu_table so
>>> iommu_alloc() can use the DDW created. For this,
>>> iommu_table_update_window() is called when everything else succeeds
>>> at enable_ddw().
>>>
>>> Removing the default DMA window for using DDW with indirect mapping
>>> is only allowed if there is no current IOMMU memory allocated in
>>> the iommu_table. enable_ddw() is aborted otherwise.
>>>
>>> As there will never have both direct and indirect mappings at the same
>>> time, the same property name can be used for the created DDW.
>>>
>>> So renaming
>>> define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>>> to
>>> define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>> looks the right thing to do.
>>
>> I know I suggested this but this does not look so good anymore as I
>> suspect it breaks kexec (from older kernel to this one) so you either
>> need to check for both DT names or just keep the old one. Changing the
>> macro name is fine.
>>
> 
> Yeah, having 'direct' in the name don't really makes sense if it's used
> for indirect mapping. I will just add this new define instead of
> replacing the old one, and check for both. 
> Is that ok?


No, having two of these does not seem right or useful. It is pseries,
which does not use petitboot (it relies on grub instead, so until the
target kernel is started there will be no DDW); realistically we need this
property for kexec/kdump, which uses the same kernel but a different
initramdisk, so for that purpose we need the same property name.
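
As a purely illustrative sketch (the helper name is made up, and whether
both names should be accepted was left open here), compatibility with a
window created by an older kernel could look roughly like this:

/* Hypothetical: accept either property name so a kexec/kdump kernel
 * still finds the window created by the previous kernel.
 */
static struct property *ddw_find_property(struct device_node *pdn)
{
	struct property *win;

	win = of_find_property(pdn, DMA64_PROPNAME, NULL);
	if (!win)
		win = of_find_property(pdn, DIRECT64_PROPNAME, NULL);

	return win;	/* NULL: no pre-existing DDW */
}
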

But I 

Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 01:25, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> Code used to create a ddw property that was previously scattered in
>>> enable_ddw() is now gathered in ddw_property_create(), which deals with
>>> allocation and filling the property, letting it ready for
>>> of_property_add(), which now occurs in sequence.
>>>
>>> This created an opportunity to reorganize the second part of enable_ddw():
>>>
>>> Without this patch enable_ddw() does, in order:
>>> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
>>> ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
>>> of_add_property().
>>>
>>> With this patch enable_ddw() does, in order:
>>> create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
>>> do tce_setrange_multi_pSeriesLP_walk in all memory.
>>>
>>> This change requires of_remove_property() in case anything fails after
>>> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
>>> in all memory, which looks the most expensive operation, only if
>>> everything else succeeds.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 97 +++---
>>>  1 file changed, 57 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index 4031127c9537..3a1ef02ad9d5 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, 
>>> struct device_node *par_dn)
>>>  ret);
>>>  }
>>>  
>>> +static int ddw_property_create(struct property **ddw_win, const char 
>>> *propname,
>>
>> @propname is always the same, do you really want to pass it every time?
> 
> I think it reads better, like "create a ddw property with this name".

This reads as "there are at least two ddw properties".

> Also, it makes possible to create ddw properties with other names, in
> case we decide to create properties with different names depending on
> the window created.

It is one window at any given moment, why call it different names... I
get the part that it is not always "direct" anymore but still...


> Also, it's probably optimized / inlined at this point.
> Is it ok doing it like this?
> 
>>
>>> +  u32 liobn, u64 dma_addr, u32 page_shift, u32 
>>> window_shift)
>>> +{
>>> +   struct dynamic_dma_window_prop *ddwprop;
>>> +   struct property *win64;
>>> +
>>> +   *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
>>> +   if (!win64)
>>> +   return -ENOMEM;
>>> +
>>> +   win64->name = kstrdup(propname, GFP_KERNEL);
>>
>> Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
>> generic OF code does not try kfree() it but it is probably out of scope
>> here.
> 
> Yeah, I had that question too. 
> Previous code was like that, and I was trying not to mess too much on
> how it's done.
> 
>>> +   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
>>> +   win64->value = ddwprop;
>>> +   win64->length = sizeof(*ddwprop);
>>> +   if (!win64->name || !win64->value)
>>> +   return -ENOMEM;
>>
>> Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
>> still looks fragile. Instead you could simply return win64 as the only
>> error possible here is -ENOMEM and returning NULL is equally good.
> 
> I agree. It's better if this function have it's own cleaning routine.
> It will be fixed for next version.
> 
>>
>>
>>> +
>>> +   ddwprop->liobn = cpu_to_be32(liobn);
>>> +   ddwprop->dma_base = cpu_to_be64(dma_addr);
>>> +   ddwprop->tce_shift = cpu_to_be32(page_shift);
>>> +   ddwprop->window_shift = cpu_to_be32(window_shift);
>>> +
>>> +   return 0;
>>> +}
>>> +
>>>  /*
>>>   * If the PE supports dynamic dma windows, and there is space for a table
>>>   * that can map all pages in a linear offset, then setup such a table,
>>> @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
>>> device_n

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-30 Thread Alexey Kardashevskiy



On 31/08/2020 11:41, Oliver O'Halloran wrote:
> On Mon, Aug 31, 2020 at 10:08 AM Alexey Kardashevskiy  wrote:
>>
>> On 29/08/2020 05:55, Leonardo Bras wrote:
>>> On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>>>>
>>>> On 28/08/2020 01:32, Leonardo Bras wrote:
>>>>> Hello Alexey, thank you for this feedback!
>>>>>
>>>>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>>>>>> +#define TCE_RPN_BITS 52  /* Bits 0-51 
>>>>>>> represent RPN on TCE */
>>>>>>
>>>>>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>>>>>> is the actual limit.
>>>>>
>>>>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
>>>>> memory addressable in the machine. IIUC, it means we can access physical 
>>>>> address up to (1ul << MAX_PHYSMEM_BITS).
>>>>>
>>>>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
>>>>> 0-51 as the RPN. By looking at code, I understand that it means we may 
>>>>> input any address < (1ul << 52) to TCE.
>>>>>
>>>>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
>>>>> suppose we can't ever pass a physical page address over
>>>>> (1ul << 51), and TCE accepts up to (1ul << 52).
>>>>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means 
>>>>> that TCE_RPN_BITS will also be increased, so I think they are independent 
>>>>> values.
>>>>>
>>>>> Does it make sense? Please let me know if I am missing something.
>>>>
>>>> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
>>>> 6Apr2012.pdf spec says:
>>>>
>>>> "The number of most significant RPN bits implemented in the TCE is
>>>> dependent on the max size of System Memory to be supported by the 
>>>> platform".
>>>>
>>>> IODA3 is the same on this matter.
>>>>
>>>> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
>>>> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
>>>> where TCE_RPN_BITS comes from exactly - I have no idea.
>>>
>>> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
>>> hardcoded 40-bit mask (0xfful), for hard-coded 12-bit (4k)
>>> pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
>>> described as RPN, as described before.
>>>
>>> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
>>> shows system memory mapping into a TCE, and the TCE also has bits 0-51
>>> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>>>> In fact, by the looks of those figures, the RPN_MASK should always be a
>>> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>>
>> I suspect the mask is there in the first place for extra protection
>> against too big addresses going to the TCE table (or/and for virtial vs
>> physical addresses). Using 52bit mask makes no sense for anything, you
>> could just drop the mask and let c compiler deal with 64bit "uint" as it
>> is basically a 4K page address anywhere in the 64bit space. Thanks,
> 
> Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
> physical address space. The IODA3 spec does explicitly say the upper
> bits are optional and the implementation only needs to support enough
> to cover up to the physical address limit, which is 56bits of P9 /
> PHB4. If you want to validate that the address will fit inside of
> MAX_PHYSMEM_BITS then fine, but I think that should be done as a
> WARN_ON or similar rather than just silently masking off the bits.

We can do this and probably should anyway but I am also pretty sure we
can just ditch the mask and have the hypervisor return an error which
will show up in dmesg.
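
A rough sketch of that idea, for illustration only (the helper is
hypothetical; the real code could equally just let H_PUT_TCE report the
error as suggested above):

/* Warn instead of silently masking: the RPN must fit below the
 * platform's physical-memory limit, MAX_PHYSMEM_BITS.
 */
static inline bool tce_rpn_valid(unsigned long uaddr, unsigned long tceshift)
{
	u64 rpn = __pa(uaddr) >> tceshift;

	return !WARN_ON(rpn >= (1ULL << (MAX_PHYSMEM_BITS - tceshift)));
}
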


-- 
Alexey


Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 00:04, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 13:44 +1000, Alexey Kardashevskiy wrote:
>>
>>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> enable_ddw() currently returns the address of the DMA window, which is
>>> considered invalid if has the value 0x00.
>>>
>>> Also, it only considers valid an address returned from find_existing_ddw
>>> if it's not 0x00.
>>>
>>> Changing this behavior makes sense, given the users of enable_ddw() only
>>> need to know if direct mapping is possible. It can also allow a DMA window
>>> starting at 0x00 to be used.
>>>
>>> This will be helpful for using a DDW with indirect mapping, as the window
>>> address will be different than 0x00, but it will not map the whole
>>> partition.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 30 --
>>>  1 file changed, 14 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index fcdefcc0f365..4031127c9537 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool 
>>> remove_prop)
>>> np, ret);
>>>  }
>>>>  
>>> -static u64 find_existing_ddw(struct device_node *pdn)
>>> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>>  {
>>> struct direct_window *window;
>>> const struct dynamic_dma_window_prop *direct64;
>>> -   u64 dma_addr = 0;
>>> +   bool found = false;
>>>  
>>> spin_lock(&direct_window_list_lock);
>>> /* check if we already created a window and dupe that config if so */
>>> list_for_each_entry(window, &direct_window_list, list) {
>>> if (window->device == pdn) {
>>> direct64 = window->prop;
>>> -   dma_addr = be64_to_cpu(direct64->dma_base);
>>> +   *dma_addr = be64_to_cpu(direct64->dma_base);
>>> +   found = true;
>>> break;
>>> }
>>> }
>>> spin_unlock(&direct_window_list_lock);
>>>  
>>> -   return dma_addr;
>>> +   return found;
>>>  }
>>>  
>>>  static struct direct_window *ddw_list_add(struct device_node *pdn,
>>> @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, 
>>> struct device_node *par_dn)
>>>   * pdn: the parent pe node with the ibm,dma_window property
>>>   * Future: also check if we can remap the base window for our base page 
>>> size
>>>   *
>>> - * returns the dma offset for use by the direct mapped DMA code.
>>> + * returns true if can map all pages (direct mapping), false otherwise..
>>>   */
>>> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>  {
>>> int len, ret;
>>> struct ddw_query_response query;
>>> struct ddw_create_response create;
>>> int page_shift;
>>> -   u64 dma_addr, max_addr;
>>> +   u64 max_addr;
>>> struct device_node *dn;
>>> u32 ddw_avail[DDW_APPLICABLE_SIZE];
>>> struct direct_window *window;
>>> @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>>  
>>> mutex_lock(&direct_window_init_mutex);
>>>  
>>> -   dma_addr = find_existing_ddw(pdn);
>>> -   if (dma_addr != 0)
>>> +   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
>>> goto out_unlock;
>>>  
>>> /*
>>> @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>> goto out_free_window;
>>> }
>>>  
>>> -   dma_addr = be64_to_cpu(ddwprop->dma_base);
>>> +   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
>>
>> Don't you need the same chunk in the find_existing_ddw() case above as
>> well? Thanks,
> 
> The new signature of find_existing_ddw() is 
> static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> 
> And on enable_ddw(), we call 
> find_existing_ddw(pdn, &dev->dev.archdata.dma_offset)
> 
> And inside the function we do:
> *dma_addr = be64_to_cpu(direct64->dma_base);
> 
> I think it's the same as the chunk before.
> Am I missing something?

ah no, sorry, you are not missing anything.


Reviewed-by: Alexey Kardashevskiy 




-- 
Alexey


Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 06:41, Leonardo Bras wrote:
> On Fri, 2020-08-28 at 11:40 +1000, Alexey Kardashevskiy wrote:
>>> I think it would be better to keep the code as much generic as possible
>>> regarding page sizes. 
>>
>> Then you need to test it. Does 4K guest even boot (it should but I would
>> not bet much on it)?
> 
> Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
> should be enough, is there any chance to get indirect mapping in qemu
> like this? (DDW but with smaller DMA window available) 


You will have to hack the guest kernel to always do indirect mapping or
hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
available TCEs. But you will be testing QEMU/KVM which behave quite
differently to pHyp in this particular case.



>>>> Because if we want the former (==support), then we'll have to align the
>>>> size up to the bigger page size when allocating/zeroing system pages,
>>>> etc. 
>>>
>>> This part I don't understand. Why do we need to align everything to the
>>> bigger pagesize? 
>>>
>>> I mean, is not that enough that the range [ret, ret + size[ is both
>>> allocated by mm and mapped on a iommu range?
>>>
>>> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
>>> IOMMU_PAGE_SIZE() == 64k.
>>> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough? 
>>> All the space the user asked for is allocated and mapped for DMA.
>>
>> The user asked to map 16K, the rest - 48K - is used for something else
>> (may be even mapped to another device) but you are making all 64K
>> accessible by the device which only should be able to access 16K.
>>
>> In practice, if this happens, H_PUT_TCE will simply fail.
> 
> I have noticed mlx5 driver getting a few bytes in a buffer, and using
> iommu_map_page(). It does map a whole page for as few bytes as the user


Whole 4K system page or whole 64K iommu page?

> wants mapped, and the other bytes get used for something else, or just
> mapped on another DMA page.
> It seems to work fine.  



With 4K system page and 64K IOMMU page? In practice it would take an
effort or/and bad luck to see it crashing. Thanks,
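
To make the 4K-system-page / 64K-IOMMU-page case concrete, a small
standalone arithmetic sketch (plain C, no kernel APIs; the helper is a
local stand-in for IOMMU_PAGE_ALIGN()):

#include <stdio.h>

/* Round size up to a multiple of the IOMMU page size, as the patch does
 * before computing the number of TCEs.
 */
static unsigned long iommu_align(unsigned long size, unsigned int tceshift)
{
	return (size + (1UL << tceshift) - 1) & ~((1UL << tceshift) - 1);
}

int main(void)
{
	unsigned long size = 16 * 1024;		/* 16K requested by a driver */
	unsigned int tceshift = 16;		/* 64K IOMMU page            */
	unsigned long size_io = iommu_align(size, tceshift);

	/* One 64K TCE covers the whole 16K request; the remaining 48K of
	 * that IOMMU page is also reachable by the device, which is the
	 * concern raised above.
	 */
	printf("size_io=%lu nio_pages=%lu\n", size_io, size_io >> tceshift);

	return 0;
}
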



> 
>>
>>
>>>> Bigger pages are not the case here as I understand it.
>>>
>>> I did not get this part, what do you mean?
>>
>> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
>> supported set of sizes is different for P8/P9 and type of IO (PHB,
>> NVLink/CAPI).
>>
>>
>>>>> Update those functions to guarantee alignment with requested size
>>>>> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>>>>>
>>>>> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
>>>>> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>>>>>
>>>>> Signed-off-by: Leonardo Bras 
>>>>> ---
>>>>>  arch/powerpc/kernel/iommu.c | 17 +
>>>>>  1 file changed, 9 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>>>> index 9704f3f76e63..d7086087830f 100644
>>>>> --- a/arch/powerpc/kernel/iommu.c
>>>>> +++ b/arch/powerpc/kernel/iommu.c
>>>>> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device 
>>>>> *dev,
>>>>>   }
>>>>>  
>>>>>   if (dev)
>>>>> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
>>>>> -   1 << tbl->it_page_shift);
>>>>> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, 
>>>>> tbl);
>>>>
>>>> Run checkpatch.pl, should complain about a long line.
>>>
>>> It's 86 columns long, which is less than the new limit of 100 columns
>>> Linus announced a few weeks ago. checkpatch.pl was updated too:
>>> https://www.phoronix.com/scan.php?page=news_item=Linux-Kernel-Deprecates-80-Col
>>
>> Yay finally :) Thanks,
> 
> :)
> 
>>
>>
>>>>
>>>>>   else
>>>>> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
>>>>> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>>>>>   /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>>>>>  
>>>>>   n = iommu_area_alloc(tbl->it_map, limit, start, n

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 05:55, Leonardo Bras wrote:
> On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>>
>> On 28/08/2020 01:32, Leonardo Bras wrote:
>>> Hello Alexey, thank you for this feedback!
>>>
>>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>>>> +#define TCE_RPN_BITS 52  /* Bits 0-51 represent 
>>>>> RPN on TCE */
>>>>
>>>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>>>> is the actual limit.
>>>
>>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
>>> memory addressable in the machine. IIUC, it means we can access physical 
>>> address up to (1ul << MAX_PHYSMEM_BITS). 
>>>
>>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
>>> 0-51 as the RPN. By looking at code, I understand that it means we may 
>>> input any address < (1ul << 52) to TCE.
>>>
>>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
>>> suppose we can't ever pass a physical page address over 
>>> (1ul << 51), and TCE accepts up to (1ul << 52).
>>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that 
>>> TCE_RPN_BITS will also be increased, so I think they are independent 
>>> values. 
>>>
>>> Does it make sense? Please let me know if I am missing something.
>>
>> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
>> 6Apr2012.pdf spec says:
>>
>> "The number of most significant RPN bits implemented in the TCE is
>> dependent on the max size of System Memory to be supported by the platform".
>>
>> IODA3 is the same on this matter.
>>
>> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
>> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
>> where TCE_RPN_BITS comes from exactly - I have no idea.
> 
> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> hardcoded 40-bit mask (0xfful), for hard-coded 12-bit (4k)
> pagesize, and on PAPR+/LoPAR also defines TCE as having bits 0-51
> described as RPN, as described before.
> 
> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> shows system memory mapping into a TCE, and the TCE also has bits 0-51
> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>> In fact, by the looks of those figures, the RPN_MASK should always be a
> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.


I suspect the mask is there in the first place for extra protection
against too big addresses going to the TCE table (or/and for virtial vs
physical addresses). Using 52bit mask makes no sense for anything, you
could just drop the mask and let c compiler deal with 64bit "uint" as it
is basically a 4K page address anywhere in the 64bit space. Thanks,
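
A tiny standalone check that the two formulations discussed further down
in this message produce the same TCE (plain C; the physical address is a
made-up constant standing in for __pa(uaddr)):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TCE_RPN_BITS		52
#define TCE_RPN_MASK(ps)	((1ULL << (TCE_RPN_BITS - (ps))) - 1)

int main(void)
{
	const unsigned int tceshift = 12;		/* 4K IOMMU page      */
	const uint64_t proto_tce = 0x3;			/* read/write bits    */
	const uint64_t pa = 0x123456789abcULL;		/* fake __pa(uaddr)   */

	/* Patch: mask the page number, then shift it back into place. */
	uint64_t rpn = pa >> tceshift;
	uint64_t tce_a = proto_tce | ((rpn & TCE_RPN_MASK(tceshift)) << tceshift);

	/* Reply: pre-shift the mask and apply it to the address itself. */
	uint64_t addr_mask = TCE_RPN_MASK(tceshift) << tceshift;
	uint64_t tce_b = proto_tce | (pa & addr_mask);

	assert(tce_a == tce_b);
	printf("tce = 0x%llx\n", (unsigned long long)tce_a);

	return 0;
}
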


> Maybe that's it?




> 
>>
>>
>>>>
>>>>> +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
>>>>>  #define TCE_VALID0x800   /* TCE valid */
>>>>>  #define TCE_ALLIO0x400   /* TCE valid for all 
>>>>> lpars */
>>>>>  #define TCE_PCI_WRITE0x2 /* write from PCI 
>>>>> allowed */
>>>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>>>> b/arch/powerpc/platforms/pseries/iommu.c
>>>>> index e4198700ed1a..8fe23b7dff3a 100644
>>>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>>>> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
>>>>> long index,
>>>>>   u64 proto_tce;
>>>>>   __be64 *tcep;
>>>>>   u64 rpn;
>>>>> + const unsigned long tceshift = tbl->it_page_shift;
>>>>> + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
>>>>> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>>>>
>>>> Using IOMMU_PAGE_SIZE macro for the page size and not using
>>>> IOMMU_PAGE_MASK for the mask - this incosistency makes my small brain
>>>> explode :) I understand the history but man... Oh well, ok.
>>>>
>>>
>>> Yeah, it feels kind of weird after two IOMMU related consts. :)
>>> But sure IOMMU_PAGE_MASK() would not be useful here :)
>>>
>>> And this kind of let me thinking:
>>>>> + rpn = __pa(uaddr) >> tceshift;
>>>>> + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
>>> Why not:
>>> rpn_mask =  TCE_RPN_MASK(tceshift) << tceshift;
>>
>> A mask for a page number (but not the address!) hurts my brain, masks
>> are good against addresses but numbers should already have all bits
>> adjusted imho, may be it is just me :-/
>>
>>
>>> 
>>> rpn = __pa(uaddr) & rpn_mask;
>>> *tcep = cpu_to_be64(proto_tce | rpn)
>>>
>>> I am usually afraid of changing stuff like this, but I think it's safe.
>>>
>>>> Good, otherwise. Thanks,
>>>
>>> Thank you for reviewing!
>>>  
>>>
>>>
> 

-- 
Alexey


Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-27 Thread Alexey Kardashevskiy



On 28/08/2020 01:32, Leonardo Bras wrote:
> Hello Alexey, thank you for this feedback!
> 
> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>> +#define TCE_RPN_BITS   52  /* Bits 0-51 represent 
>>> RPN on TCE */
>>
>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>> is the actual limit.
> 
> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical memory 
> addressable in the machine. IIUC, it means we can access physical address up 
> to (1ul << MAX_PHYSMEM_BITS). 
> 
> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> 0-51 as the RPN. By looking at code, I understand that it means we may input 
> any address < (1ul << 52) to TCE.
> 
> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I suppose 
> we can't ever pass a physical page address over 
> (1ul << 51), and TCE accepts up to (1ul << 52).
> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily means that 
> TCE_RPN_BITS will also be increased, so I think they are independent values. 
> 
> Does it make sense? Please let me know if I am missing something.

The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
6Apr2012.pdf spec says:

"The number of most significant RPN bits implemented in the TCE is
dependent on the max size of System Memory to be supported by the platform".

IODA3 is the same on this matter.

This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
where TCE_RPN_BITS comes from exactly - I have no idea.


> 
>>
>>
>>> +#define TCE_RPN_MASK(ps)   ((1ul << (TCE_RPN_BITS - (ps))) - 1)
>>>  #define TCE_VALID  0x800   /* TCE valid */
>>>  #define TCE_ALLIO  0x400   /* TCE valid for all lpars */
>>>  #define TCE_PCI_WRITE  0x2 /* write from PCI 
>>> allowed */
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index e4198700ed1a..8fe23b7dff3a 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
>>> long index,
>>> u64 proto_tce;
>>> __be64 *tcep;
>>> u64 rpn;
>>> +   const unsigned long tceshift = tbl->it_page_shift;
>>> +   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
>>> +   const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>>
>> Using IOMMU_PAGE_SIZE macro for the page size and not using
>> IOMMU_PAGE_MASK for the mask - this incosistency makes my small brain
>> explode :) I understand the history but man... Oh well, ok.
>>
> 
> Yeah, it feels kind of weird after two IOMMU related consts. :)
> But sure IOMMU_PAGE_MASK() would not be useful here :)
> 
> And this kind of let me thinking:
>>> +   rpn = __pa(uaddr) >> tceshift;
>>> +   *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
> Why not:
>   rpn_mask =  TCE_RPN_MASK(tceshift) << tceshift;


A mask for a page number (but not the address!) hurts my brain, masks
are good against addresses but numbers should already have all bits
adjusted imho, may be it is just me :-/


>   
>   rpn = __pa(uaddr) & rpn_mask;
>   *tcep = cpu_to_be64(proto_tce | rpn)
> 
> I am usually afraid of changing stuff like this, but I think it's safe.
> 
>> Good, otherwise. Thanks,
> 
> Thank you for reviewing!
>  
> 
> 

-- 
Alexey


Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper

2020-08-27 Thread Alexey Kardashevskiy



On 28/08/2020 08:11, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 13:46 +1000, Alexey Kardashevskiy wrote:
>>>  static int find_existing_ddw_windows(void)
>>>  {
>>> int len;
>>> @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
>>> if (!direct64)
>>> continue;
>>>  
>>> -   window = kzalloc(sizeof(*window), GFP_KERNEL);
>>> -   if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
>>> +   window = ddw_list_add(pdn, direct64);
>>> +   if (!window || len < sizeof(*direct64)) {
>>
>> Since you are touching this code, it looks like the "len <
>> sizeof(*direct64)" part should go above to "if (!direct64)".
> 
> Sure, makes sense.
> It will be fixed for v2.
> 
>>
>>
>>
>>> kfree(window);
>>> remove_ddw(pdn, true);
>>> -   continue;
>>> }
>>> -
>>> -   window->device = pdn;
>>> -   window->prop = direct64;
>>> -   spin_lock(&direct_window_list_lock);
>>> -   list_add(&window->list, &direct_window_list);
>>> -   spin_unlock(&direct_window_list_lock);
>>> }
>>>  
>>> return 0;
>>> @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>> dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
>>>   create.liobn, dn);
>>>  
>>> -   window = kzalloc(sizeof(*window), GFP_KERNEL);
>>> +   /* Add new window to existing DDW list */
>>
>> The comment seems to duplicate what the ddw_list_add name already suggests.
> 
> Ok, I will remove it then.
> 
>>> +   window = ddw_list_add(pdn, ddwprop);
>>> if (!window)
>>> goto out_clear_window;
>>>  
>>> @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>> goto out_free_window;
>>> }
>>>  
>>> -   window->device = pdn;
>>> -   window->prop = ddwprop;
>>> -   spin_lock(&direct_window_list_lock);
>>> -   list_add(&window->list, &direct_window_list);
>>> -   spin_unlock(&direct_window_list_lock);
>>
>> I'd leave these 3 lines here and in find_existing_ddw_windows() (which
>> would make  ddw_list_add -> ddw_prop_alloc). In general you want to have
>> less stuff to do on the failure path. kmalloc may fail and needs kfree
>> but you can safely delay list_add (which cannot fail) and avoid having
>> the lock held twice in the same function (one of them is hidden inside
>> ddw_list_add).
>> Not sure if this change is really needed after all. Thanks,
> 
> I understand this leads to better performance in case anything fails.
> Also, I think list_add happening in the end is less error-prone (in
> case the list is checked between list_add and a fail).

Performance was not in my mind at all.

I noticed you remove from a list with a lock held, which was not there
before, and there is a bunch of labels on the exit path, so I started
looking at list_add() and whether you ever remove from the list twice.


> But what if we put it at the end?
> What is the chance of a kzalloc of 4 pointers (struct direct_window)
> failing after walk_system_ram_range?

This is not about chances really, it is about readability. If, let's say,
kmalloc fails, you just go to the error exit label and call kfree() on
that pointer; kfree() does nothing if it is NULL already, simple.
list_del() does not have this simplicity.
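
A minimal sketch of the cleanup pattern being argued for (structure and
names are illustrative only; do_fallible_setup() is a stand-in for the
steps that can fail, e.g. walking system RAM):

/* Register on the shared list only after every step that can fail has
 * succeeded, so the error path is a bare kfree() (a no-op on NULL) and
 * never needs list_del() under the lock.
 */
static bool setup_window(struct device_node *pdn)
{
	struct direct_window *window;

	window = kzalloc(sizeof(*window), GFP_KERNEL);
	if (!window)
		goto out_failed;

	if (do_fallible_setup(pdn))
		goto out_failed;

	window->device = pdn;
	spin_lock(&direct_window_list_lock);
	list_add(&window->list, &direct_window_list);	/* cannot fail */
	spin_unlock(&direct_window_list_lock);

	return true;

out_failed:
	kfree(window);		/* safe even if the kzalloc() failed */
	return false;
}
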


> Is it not worthy doing that for making enable_ddw() easier to
> understand?

This is my goal here :)


> 
> Best regards,
> Leonardo
> 

-- 
Alexey


Re: [PATCH v1 04/10] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2020-08-27 Thread Alexey Kardashevskiy



On 28/08/2020 04:34, Leonardo Bras wrote:
> On Sat, 2020-08-22 at 20:34 +1000, Alexey Kardashevskiy wrote:
>>> +
>>> +   /*ignore reserved bit0*/
>>
>> s/ignore reserved bit0/ ignore reserved bit0 /  (add spaces)
> 
> Fixed
> 
>>> +   if (tbl->it_offset == 0)
>>> +   p1_start = 1;
>>> +
>>> +   /* Check if reserved memory is valid*/
>>
>> A missing space here.
> 
> Fixed
> 
>>
>>> +   if (tbl->it_reserved_start >= tbl->it_offset &&
>>> +   tbl->it_reserved_start <= (tbl->it_offset + tbl->it_size) &&
>>> +   tbl->it_reserved_end   >= tbl->it_offset &&
>>> +   tbl->it_reserved_end   <= (tbl->it_offset + tbl->it_size)) {
>>
>> Uff. What if tbl->it_reserved_end is bigger than tbl->it_offset +
>> tbl->it_size?
>>
>> The reserved area is to preserve MMIO32 so it is for it_offset==0 only
>> and the boundaries are checked in the only callsite, and it is unlikely
>> to change soon or ever.
>>
>> Rather that bothering with fixing that, may be just add (did not test):
>>
>> if (WARN_ON((
(tbl->it_reserved_start || tbl->it_reserved_end) && (it_offset != 0))
||
>> (tbl->it_reserved_start > it_offset && tbl->it_reserved_end < it_offset
>> + it_size) && (it_offset == 0)) )
>>  return true;
>>
>> Or simply always look for it_offset..it_reserved_start and
>> it_reserved_end..it_offset+it_size and if there is no reserved area,
>> initialize it_reserved_start=it_reserved_end=it_offset so the first
>> it_offset..it_reserved_start becomes a no-op.
> 
> The problem here is that the values of it_reserved_{start,end} are not
> necessarily valid. I mean, on iommu_table_reserve_pages() the values
> are stored however they are given (bit reserving is done only if they
> are valid). 
> 
> Having a it_reserved_{start,end} value outside the valid ranges would
> cause find_next_bit() to run over memory outside the bitmap.
> Even if those values are < tbl->it_offset, the resulting
> subtraction on unsigned would cause it to become a big value and run
> over memory outside the bitmap.
> 
> But I think you are right. That is not the place to check if the
> reserved values are valid. It should just trust them here.
> I intend to change iommu_table_reserve_pages() to only store the
> parameters in it_reserved_{start,end} if they are in the range, or store
> it_offset in both of them if they are not.
> 
> What do you think?

This should work, yes.
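
As a rough illustration of that change (a sketch under the assumptions
discussed here, not copied from any later revision of the patch; the
reserved end is treated as inclusive and the existing bit-0 reservation
for it_offset == 0 is left out):

/* Normalise it_reserved_{start,end} so later scans such as
 * iommu_table_in_use() can trust them blindly: an out-of-range request
 * collapses to an empty range at it_offset.
 */
static void iommu_table_reserve_pages(struct iommu_table *tbl,
				      unsigned long res_start,
				      unsigned long res_end)
{
	unsigned long i;

	if (res_start < tbl->it_offset ||
	    res_end >= tbl->it_offset + tbl->it_size) {
		tbl->it_reserved_start = tbl->it_offset;
		tbl->it_reserved_end = tbl->it_offset;
		return;
	}

	tbl->it_reserved_start = res_start;
	tbl->it_reserved_end = res_end;

	for (i = res_start; i <= res_end; i++)
		set_bit(i - tbl->it_offset, tbl->it_map);
}
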


> 
> Thanks for the feedback!
> Leonardo Bras
> 
> 
> 

-- 
Alexey


Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-08-27 Thread Alexey Kardashevskiy



On 28/08/2020 02:51, Leonardo Bras wrote:
> On Sat, 2020-08-22 at 20:07 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> Both iommu_alloc_coherent() and iommu_free_coherent() assume that once
>>> size is aligned to PAGE_SIZE it will be aligned to IOMMU_PAGE_SIZE.
>>
>> The only case when it is not aligned is when IOMMU_PAGE_SIZE > PAGE_SIZE
>> which is unlikely but not impossible, we could configure the kernel for
>> 4K system pages and 64K IOMMU pages I suppose. Do we really want to do
>> this here, or simply put WARN_ON(tbl->it_page_shift > PAGE_SHIFT)?
> 
> I think it would be better to keep the code as much generic as possible
> regarding page sizes. 

Then you need to test it. Does 4K guest even boot (it should but I would
not bet much on it)?

> 
>> Because if we want the former (==support), then we'll have to align the
>> size up to the bigger page size when allocating/zeroing system pages,
>> etc. 
> 
> This part I don't understand. Why do we need to align everything to the
> bigger pagesize? 
> 
> I mean, is not that enough that the range [ret, ret + size[ is both
> allocated by mm and mapped on a iommu range?
> 
> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
> IOMMU_PAGE_SIZE() == 64k.
> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough? 
> All the space the user asked for is allocated and mapped for DMA.


The user asked to map 16K, the rest - 48K - is used for something else
(may be even mapped to another device) but you are making all 64K
accessible by the device which only should be able to access 16K.

In practice, if this happens, H_PUT_TCE will simply fail.


> 
>> Bigger pages are not the case here as I understand it.
> 
> I did not get this part, what do you mean?


Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
supported set of sizes is different for P8/P9 and type of IO (PHB,
NVLink/CAPI).


> 
>>> Update those functions to guarantee alignment with requested size
>>> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>>>
>>> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
>>> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/kernel/iommu.c | 17 +
>>>  1 file changed, 9 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>>> index 9704f3f76e63..d7086087830f 100644
>>> --- a/arch/powerpc/kernel/iommu.c
>>> +++ b/arch/powerpc/kernel/iommu.c
>>> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device 
>>> *dev,
>>> }
>>>  
>>> if (dev)
>>> -   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
>>> - 1 << tbl->it_page_shift);
>>> +   boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, 
>>> tbl);
>>
>> Run checkpatch.pl, should complain about a long line.
> 
> It's 86 columns long, which is less than the new limit of 100 columns
> Linus announced a few weeks ago. checkpatch.pl was updated too:
> https://www.phoronix.com/scan.php?page=news_item=Linux-Kernel-Deprecates-80-Col

Yay finally :) Thanks,


> 
>>
>>
>>> else
>>> -   boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
>>> +   boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>>> /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>>>  
>>> n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
>>> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct 
>>> iommu_table *tbl,
>>> unsigned int order;
>>> unsigned int nio_pages, io_order;
>>> struct page *page;
>>> +   size_t size_io = size;
>>>  
>>> size = PAGE_ALIGN(size);
>>> order = get_order(size);
>>> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct 
>>> iommu_table *tbl,
>>> memset(ret, 0, size);
>>>  
>>> /* Set up tces to cover the allocated range */
>>> -   nio_pages = size >> tbl->it_page_shift;
>>> -   io_order = get_iommu_order(size, tbl);
>>> +   size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
>>> +   nio_pages = size_io >> tbl->it_page_shift;
>>> +   io_order = get_iommu_order(size_io, tbl);
>

Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-08-23 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> As of today, if the biggest DDW that can be created can't map the whole
> partition, its creation is skipped and the default DMA window
> "ibm,dma-window" is used instead.
> 
> DDW is 16x bigger than the default DMA window,

16x only under very specific circumstances which are
1. phyp
2. sriov
3. device class in hmc (or what that priority number is in the lpar config).

> having the same amount of
> pages, but increasing the page size to 64k.
> Besides larger DMA window,

"Besides being larger"?

> it performs better for allocations over 4k,

Better how?

> so it would be nice to use it instead.


I'd rather say something like:
===
So far we assumed we can map the guest RAM 1:1 to the bus which worked
with a small number of devices. SRIOV changes it as the user can
configure hundreds VFs and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per a PE to limit waste of physical pages.
===


> 
> The DDW created will be used for direct mapping by default.
> If it's not available, indirect mapping will be used instead.
> 
> For indirect mapping, it's necessary to update the iommu_table so
> iommu_alloc() can use the DDW created. For this,
> iommu_table_update_window() is called when everything else succeeds
> at enable_ddw().
> 
> Removing the default DMA window for using DDW with indirect mapping
> is only allowed if there is no current IOMMU memory allocated in
> the iommu_table. enable_ddw() is aborted otherwise.
> 
> As there will never have both direct and indirect mappings at the same
> time, the same property name can be used for the created DDW.
> 
> So renaming
> define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> to
> define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> looks the right thing to do.

I know I suggested this but this does not look so good anymore as I
suspect it breaks kexec (from older kernel to this one) so you either
need to check for both DT names or just keep the old one. Changing the
macro name is fine.


> 
> To make sure the property differentiates both cases, a new u32 for flags
> was added at the end of the property, where BIT(0) set means direct
> mapping.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 108 +++--
>  1 file changed, 84 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 3a1ef02ad9d5..9544e3c91ced 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
>   __be64  dma_base;   /* address hi,lo */
>   __be32  tce_shift;  /* ilog2(tce_page_size) */
>   __be32  window_shift;   /* ilog2(tce_window_size) */
> + __be32  flags;  /* DDW properties, see below */
>  };
>  
> +#define DDW_FLAGS_DIRECT 0x01

This is set if ((1<<len) >= ddw_memory_hotplug_max()), you
could simply check window_shift and drop the flags.
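
In other words, a hedged sketch of that suggestion (ddw_memory_hotplug_max()
is the existing helper referred to above; the function below is
illustrative, not part of the patch):

/* Derive "direct mapping" from the window size already stored in the
 * property instead of carrying a separate DDW_FLAGS_DIRECT bit.
 */
static bool ddw_is_direct(const struct dynamic_dma_window_prop *prop)
{
	return (1ULL << be32_to_cpu(prop->window_shift)) >=
	       ddw_memory_hotplug_max();
}
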


> +
>  struct direct_window {
>   struct device_node *device;
>   const struct dynamic_dma_window_prop *prop;
> @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
>  static DEFINE_SPINLOCK(direct_window_list_lock);
>  /* protects initializing window twice for same device */
>  static DEFINE_MUTEX(direct_window_init_mutex);
> -#define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> +#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>  
>  static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
>   unsigned long num_pfn, const void *arg)
> @@ -836,7 +839,7 @@ static void remove_ddw(struct device_node *np, bool 
> remove_prop)
>   if (ret)
>   return;
>  
> - win = of_find_property(np, DIRECT64_PROPNAME, NULL);
> + win = of_find_property(np, DMA64_PROPNAME, NULL);
>   if (!win)
>   return;
>  
> @@ -852,7 +855,7 @@ static void remove_ddw(struct device_node *np, bool 
> remove_prop)
>   np, ret);
>  }
>  
> -static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, bool 
> *direct_mapping)
>  {
>   struct direct_window *window;
>   const struct dynamic_dma_window_prop *direct64;
> @@ -864,6 +867,7 @@ static bool find_existing_ddw(struct device_node *pdn, 
> u64 *dma_addr)
>   if (window->device == pdn) {
>   direct64 = window->prop;
>   *dma_addr = be64_to_cpu(direct64->dma_base);
> + *direct_mapping = be32_to_cpu(direct64->flags) & 
> DDW_FLAGS_DIRECT;
>   found = true;
>   break;
>   }
> @@ -901,8 +905,8 @@ static int find_existing_ddw_windows(void)
>   if (!firmware_has_feature(FW_FEATURE_LPAR))
>   return 0;
>  
> 

Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2020-08-23 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> Code used to create a ddw property that was previously scattered in
> enable_ddw() is now gathered in ddw_property_create(), which deals with
> allocation and filling the property, letting it ready for
> of_property_add(), which now occurs in sequence.
> 
> This created an opportunity to reorganize the second part of enable_ddw():
> 
> Without this patch enable_ddw() does, in order:
> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> of_add_property().
> 
> With this patch enable_ddw() does, in order:
> create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
> do tce_setrange_multi_pSeriesLP_walk in all memory.
> 
> This change requires of_remove_property() in case anything fails after
> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> in all memory, which looks the most expensive operation, only if
> everything else succeeds.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 97 +++---
>  1 file changed, 57 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 4031127c9537..3a1ef02ad9d5 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, 
> struct device_node *par_dn)
>ret);
>  }
>  
> +static int ddw_property_create(struct property **ddw_win, const char 
> *propname,

@propname is always the same, do you really want to pass it every time?

> +u32 liobn, u64 dma_addr, u32 page_shift, u32 
> window_shift)
> +{
> + struct dynamic_dma_window_prop *ddwprop;
> + struct property *win64;
> +
> + *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> + if (!win64)
> + return -ENOMEM;
> +
> + win64->name = kstrdup(propname, GFP_KERNEL);

Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
generic OF code does not try kfree() it but it is probably out of scope
here.


> + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> + win64->value = ddwprop;
> + win64->length = sizeof(*ddwprop);
> + if (!win64->name || !win64->value)
> + return -ENOMEM;


Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
still looks fragile. Instead you could simply return win64 as the only
error possible here is -ENOMEM and returning NULL is equally good.
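
For illustration, the shape being suggested (a sketch only, not the actual
follow-up patch; all cleanup is folded into the function so callers only
see a property pointer or NULL):

static struct property *ddw_property_create(const char *propname, u32 liobn,
					    u64 dma_addr, u32 page_shift,
					    u32 window_shift)
{
	struct dynamic_dma_window_prop *ddwprop;
	struct property *win64;

	win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
	if (!win64)
		return NULL;

	win64->name = kstrdup(propname, GFP_KERNEL);
	ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
	win64->value = ddwprop;
	win64->length = sizeof(*ddwprop);
	if (!win64->name || !win64->value) {
		kfree(win64->name);
		kfree(win64->value);
		kfree(win64);
		return NULL;
	}

	ddwprop->liobn = cpu_to_be32(liobn);
	ddwprop->dma_base = cpu_to_be64(dma_addr);
	ddwprop->tce_shift = cpu_to_be32(page_shift);
	ddwprop->window_shift = cpu_to_be32(window_shift);

	return win64;
}
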


> +
> + ddwprop->liobn = cpu_to_be32(liobn);
> + ddwprop->dma_base = cpu_to_be64(dma_addr);
> + ddwprop->tce_shift = cpu_to_be32(page_shift);
> + ddwprop->window_shift = cpu_to_be32(window_shift);
> +
> + return 0;
> +}
> +
>  /*
>   * If the PE supports dynamic dma windows, and there is space for a table
>   * that can map all pages in a linear offset, then setup such a table,
> @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   struct ddw_query_response query;
>   struct ddw_create_response create;
>   int page_shift;
> - u64 max_addr;
> + u64 max_addr, win_addr;
>   struct device_node *dn;
>   u32 ddw_avail[DDW_APPLICABLE_SIZE];
>   struct direct_window *window;
> - struct property *win64;
> - struct dynamic_dma_window_prop *ddwprop;
> + struct property *win64 = NULL;
>   struct failed_ddw_pdn *fpdn;
>   bool default_win_removed = false;
>  
> @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   goto out_failed;
>   }
>   len = order_base_2(max_addr);
> - win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> - if (!win64) {
> - dev_info(&dev->dev,
> - "couldn't allocate property for 64bit dma window\n");
> +
> + ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> + if (ret != 0)

It is usually just "if (ret)"


>   goto out_failed;
> - }
> - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
> - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> - win64->length = sizeof(*ddwprop);
> - if (!win64->name || !win64->value) {
> +
> + dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> + create.liobn, dn);
> +
> + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
> + ret = ddw_property_create(&win64, DIRECT64_PROPNAME, create.liobn, 
> win_addr,
> +   page_shift, len);
> + if (ret) {
>   dev_info(&dev->dev,
> - "couldn't allocate property name and value\n");
> +  "couldn't allocate property, property name, or 
> value\n");
>   goto out_free_prop;
>   }
>  
> - ret = create_ddw(dev, ddw_avail, 

Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper

2020-08-23 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> There are two functions adding DDW to the direct_window_list in a
> similar way, so create a ddw_list_add() to avoid duplicity and
> simplify those functions.
> 
> Also, on enable_ddw(), add list_del() on out_free_window to allow
> removing the window from list if any error occurs.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 42 --
>  1 file changed, 26 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 39617ce0ec83..fcdefcc0f365 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -872,6 +872,24 @@ static u64 find_existing_ddw(struct device_node *pdn)
>   return dma_addr;
>  }
>  
> +static struct direct_window *ddw_list_add(struct device_node *pdn,
> +   const struct dynamic_dma_window_prop 
> *dma64)
> +{
> + struct direct_window *window;
> +
> + window = kzalloc(sizeof(*window), GFP_KERNEL);
> + if (!window)
> + return NULL;
> +
> + window->device = pdn;
> + window->prop = dma64;
> + spin_lock(&direct_window_list_lock);
> + list_add(&window->list, &direct_window_list);
> + spin_unlock(&direct_window_list_lock);
> +
> + return window;
> +}
> +
>  static int find_existing_ddw_windows(void)
>  {
>   int len;
> @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
>   if (!direct64)
>   continue;
>  
> - window = kzalloc(sizeof(*window), GFP_KERNEL);
> - if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
> + window = ddw_list_add(pdn, direct64);
> + if (!window || len < sizeof(*direct64)) {


Since you are touching this code, it looks like the "len <
sizeof(*direct64)" part should go above to "if (!direct64)".



>   kfree(window);
>   remove_ddw(pdn, true);
> - continue;
>   }
> -
> - window->device = pdn;
> - window->prop = direct64;
> - spin_lock(&direct_window_list_lock);
> - list_add(&window->list, &direct_window_list);
> - spin_unlock(&direct_window_list_lock);
>   }
>  
>   return 0;
> @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> create.liobn, dn);
>  
> - window = kzalloc(sizeof(*window), GFP_KERNEL);
> + /* Add new window to existing DDW list */

The comment seems to duplicate what the ddw_list_add name already suggests.


> + window = ddw_list_add(pdn, ddwprop);
>   if (!window)
>   goto out_clear_window;
>  
> @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   goto out_free_window;
>   }
>  
> - window->device = pdn;
> - window->prop = ddwprop;
> - spin_lock(&direct_window_list_lock);
> - list_add(&window->list, &direct_window_list);
> - spin_unlock(&direct_window_list_lock);

I'd leave these 3 lines here and in find_existing_ddw_windows() (which
would make  ddw_list_add -> ddw_prop_alloc). In general you want to have
less stuff to do on the failure path. kmalloc may fail and needs kfree
but you can safely delay list_add (which cannot fail) and avoid having
the lock held twice in the same function (one of them is hidden inside
ddw_list_add).

Not sure if this change is really needed after all. Thanks,

> -
>   dma_addr = be64_to_cpu(ddwprop->dma_base);
>   goto out_unlock;
>  
>  out_free_window:
> + spin_lock(&direct_window_list_lock);
> + list_del(&window->list);
> + spin_unlock(&direct_window_list_lock);
> +
>   kfree(window);
>  
>  out_clear_window:
> 

-- 
Alexey


Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2020-08-23 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> enable_ddw() currently returns the address of the DMA window, which is
> considered invalid if has the value 0x00.
> 
> Also, it only considers valid an address returned from find_existing_ddw
> if it's not 0x00.
> 
> Changing this behavior makes sense, given the users of enable_ddw() only
> need to know if direct mapping is possible. It can also allow a DMA window
> starting at 0x00 to be used.
> 
> This will be helpful for using a DDW with indirect mapping, as the window
> address will be different than 0x00, but it will not map the whole
> partition.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 30 --
>  1 file changed, 14 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index fcdefcc0f365..4031127c9537 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool 
> remove_prop)
>   np, ret);
>  }
>  
> -static u64 find_existing_ddw(struct device_node *pdn)
> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>  {
>   struct direct_window *window;
>   const struct dynamic_dma_window_prop *direct64;
> - u64 dma_addr = 0;
> + bool found = false;
>  
>   spin_lock(&direct_window_list_lock);
>   /* check if we already created a window and dupe that config if so */
>   list_for_each_entry(window, &direct_window_list, list) {
>   if (window->device == pdn) {
>   direct64 = window->prop;
> - dma_addr = be64_to_cpu(direct64->dma_base);
> + *dma_addr = be64_to_cpu(direct64->dma_base);
> + found = true;
>   break;
>   }
>   }
>   spin_unlock(&direct_window_list_lock);
>  
> - return dma_addr;
> + return found;
>  }
>  
>  static struct direct_window *ddw_list_add(struct device_node *pdn,
> @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, 
> struct device_node *par_dn)
>   * pdn: the parent pe node with the ibm,dma_window property
>   * Future: also check if we can remap the base window for our base page size
>   *
> - * returns the dma offset for use by the direct mapped DMA code.
> + * returns true if can map all pages (direct mapping), false otherwise..
>   */
> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  {
>   int len, ret;
>   struct ddw_query_response query;
>   struct ddw_create_response create;
>   int page_shift;
> - u64 dma_addr, max_addr;
> + u64 max_addr;
>   struct device_node *dn;
>   u32 ddw_avail[DDW_APPLICABLE_SIZE];
>   struct direct_window *window;
> @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>  
>   mutex_lock(&direct_window_init_mutex);
>  
> - dma_addr = find_existing_ddw(pdn);
> - if (dma_addr != 0)
> + if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
>   goto out_unlock;
>  
>   /*
> @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   goto out_free_window;
>   }
>  
> - dma_addr = be64_to_cpu(ddwprop->dma_base);
> + dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);


Don't you need the same chunk in the find_existing_ddw() case above as
well? Thanks,


>   goto out_unlock;
>  
>  out_free_window:
> @@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   kfree(win64->name);
>   kfree(win64->value);
>   kfree(win64);
> + win64 = NULL;
>  
>  out_failed:
>   if (default_win_removed)
> @@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>  
>  out_unlock:
>   mutex_unlock(&direct_window_init_mutex);
> - return dma_addr;
> + return win64;
>  }
>  
>  static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> @@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct 
> pci_dev *pdev, u64 dma_mask)
>   break;
>   }
>  
> - if (pdn && PCI_DN(pdn)) {
> - pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
> - if (pdev->dev.archdata.dma_offset)
> - return true;
> - }
> + if (pdn && PCI_DN(pdn))
> + return enable_ddw(pdev, pdn);
>  
>   return false;
>  }
> 

-- 
Alexey


Re: [PATCH v1 05/10] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

2020-08-23 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> Creates a helper to allow allocating a new iommu_table without the need
> to reallocate the iommu_group.
> 
> This will be helpful for replacing the iommu_table for the new DMA window,
> after we remove the old one with iommu_tce_table_put().
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 25 ++---
>  1 file changed, 14 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 8fe23b7dff3a..39617ce0ec83 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -53,28 +53,31 @@ enum {
>   DDW_EXT_QUERY_OUT_SIZE = 2
>  };
>  
> -static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> +static struct iommu_table *iommu_pseries_alloc_table(int node)
>  {
> - struct iommu_table_group *table_group;
>   struct iommu_table *tbl;
>  
> - table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
> -node);
> - if (!table_group)
> - return NULL;
> -
>   tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
>   if (!tbl)
> - goto free_group;
> + return NULL;
>  
>   INIT_LIST_HEAD_RCU(&tbl->it_group_list);
>   kref_init(&tbl->it_kref);
> + return tbl;
> +}
>  
> - table_group->tables[0] = tbl;
> +static struct iommu_table_group *iommu_pseries_alloc_group(int node)
> +{
> + struct iommu_table_group *table_group;
> +
> + table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);


I'd prefer you did not make unrelated changes (sizeof(struct
iommu_table_group) -> sizeof(*table_group)) so the diff stays shorter
and easier to follow. You changed  sizeof(struct iommu_table_group) but
not sizeof(struct iommu_table) and this confused me enough to spend more
time than this straight forward change deserves.

Not important in this case though so

Reviewed-by: Alexey Kardashevskiy 




> + if (!table_group)
> + return NULL;
>  
> - return table_group;
> + table_group->tables[0] = iommu_pseries_alloc_table(node);
> + if (table_group->tables[0])
> + return table_group;
>  
> -free_group:
>   kfree(table_group);
>   return NULL;
>  }
> 

-- 
Alexey


Re: [PATCH v1 04/10] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2020-08-22 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> Having a function to check if the iommu table has any allocation helps
> deciding if a tbl can be reset for using a new DMA window.
> 
> It should be enough to replace all instances of !bitmap_empty(tbl...).
> 
> iommu_table_in_use() skips reserved memory, so we don't need to worry about
> releasing it before testing. This causes iommu_table_release_pages() to
> become unnecessary, given it is only used to remove reserved memory for
> testing.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/include/asm/iommu.h |  1 +
>  arch/powerpc/kernel/iommu.c  | 62 ++--
>  2 files changed, 37 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index 5032f1593299..2913e5c8b1f8 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
>   */
>  extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
>   int nid, unsigned long res_start, unsigned long res_end);
> +bool iommu_table_in_use(struct iommu_table *tbl);
>  
>  #define IOMMU_TABLE_GROUP_MAX_TABLES 2
>  
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 7f603d4e62d4..c5d5d36ab65e 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -668,21 +668,6 @@ static void iommu_table_reserve_pages(struct iommu_table 
> *tbl,
>   set_bit(i - tbl->it_offset, tbl->it_map);
>  }
>  
> -static void iommu_table_release_pages(struct iommu_table *tbl)
> -{
> - int i;
> -
> - /*
> -  * In case we have reserved the first bit, we should not emit
> -  * the warning below.
> -  */
> - if (tbl->it_offset == 0)
> - clear_bit(0, tbl->it_map);
> -
> - for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
> - clear_bit(i - tbl->it_offset, tbl->it_map);
> -}
> -
>  /*
>   * Build a iommu_table structure.  This contains a bit map which
>   * is used to manage allocation of the tce space.
> @@ -743,6 +728,38 @@ struct iommu_table *iommu_init_table(struct iommu_table 
> *tbl, int nid,
>   return tbl;
>  }
>  
> +bool iommu_table_in_use(struct iommu_table *tbl)
> +{
> + bool in_use;
> + unsigned long p1_start = 0, p1_end, p2_start, p2_end;
> +
> + /*ignore reserved bit0*/

s/ignore reserved bit0/ ignore reserved bit0 /  (add spaces)

> + if (tbl->it_offset == 0)
> + p1_start = 1;
> +
> + /* Check if reserved memory is valid*/

A missing space here.

> + if (tbl->it_reserved_start >= tbl->it_offset &&
> + tbl->it_reserved_start <= (tbl->it_offset + tbl->it_size) &&
> + tbl->it_reserved_end   >= tbl->it_offset &&
> + tbl->it_reserved_end   <= (tbl->it_offset + tbl->it_size)) {


Uff. What if tbl->it_reserved_end is bigger than tbl->it_offset +
tbl->it_size?

The reserved area is there to preserve MMIO32, so it applies to
it_offset==0 only; the boundaries are checked in the only call site,
and that is unlikely to change soon or ever.

Rather than bothering with fixing that, maybe just add (did not test):

if (WARN_ON(((tbl->it_reserved_start || tbl->it_reserved_end) &&
             (tbl->it_offset != 0)) ||
            ((tbl->it_reserved_start > tbl->it_offset &&
              tbl->it_reserved_end < tbl->it_offset + tbl->it_size) &&
             (tbl->it_offset == 0))))
	return true;

Or simply always scan both ranges, it_offset..it_reserved_start and
it_reserved_end..it_offset+it_size, and if there is no reserved area,
initialize it_reserved_start = it_reserved_end = it_offset so that the
first range becomes a no-op.
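
Roughly (an untested sketch only, treating it_reserved_start..it_reserved_end
as a half-open range here and assuming iommu_init_table() sets both to
it_offset when nothing is reserved):

bool iommu_table_in_use(struct iommu_table *tbl)
{
	unsigned long skip = 0, start, end;

	/* bit 0 stays reserved for failed mappings when it_offset == 0 */
	if (tbl->it_offset == 0)
		skip = 1;

	/* scan it_offset .. it_reserved_start */
	start = skip;
	end = tbl->it_reserved_start - tbl->it_offset;
	if (find_next_bit(tbl->it_map, end, start) != end)
		return true;

	/* scan it_reserved_end .. it_offset + it_size */
	start = max(skip, tbl->it_reserved_end - tbl->it_offset);
	end = tbl->it_size;
	return find_next_bit(tbl->it_map, end, start) != end;
}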


> + p1_end = tbl->it_reserved_start - tbl->it_offset;
> + p2_start = tbl->it_reserved_end - tbl->it_offset + 1;
> + p2_end = tbl->it_size;
> + } else {
> + p1_end = tbl->it_size;
> + p2_start = 0;
> + p2_end = 0;
> + }
> +
> + in_use = (find_next_bit(tbl->it_map, p1_end, p1_start) != p1_end);
> + if (in_use || p2_start == 0)
> + return in_use;
> +
> + in_use = (find_next_bit(tbl->it_map, p2_end, p2_start) != p2_end);
> +
> + return in_use;
> +}
> +
>  static void iommu_table_free(struct kref *kref)
>  {
>   unsigned long bitmap_sz;
> @@ -759,10 +776,8 @@ static void iommu_table_free(struct kref *kref)
>   return;
>   }
>  
> - iommu_table_release_pages(tbl);
> -
>   /* verify that table contains no entries */
> - if (!bitmap_empty(tbl->it_map, tbl->it_size))
> + if (iommu_table_in_use(tbl))
>   pr_warn("%s: Unexpected TCEs\n", __func__);
>  
>   /* calculate bitmap size in bytes */
> @@ -1069,18 +1084,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
>   for (i = 0; i < tbl->nr_pools; i++)
>   spin_lock(&tbl->pools[i].lock);
>  
> - iommu_table_release_pages(tbl);
> -
> - if 

Re: [PATCH v1 03/10] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc

2020-08-22 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> As of today, doing iommu_range_alloc() only for !largealloc (npages <= 15)
> will only be able to use 3/4 of the available pages, given pages on
> largepool  not being available for !largealloc.
> 
> This could mean some drivers not being able to fully use all the available
> pages for the DMA window.
> 
> Add pages on largepool as a last resort for !largealloc, making all pages
> of the DMA window available.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/kernel/iommu.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index d7086087830f..7f603d4e62d4 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -261,6 +261,15 @@ static unsigned long iommu_range_alloc(struct device 
> *dev,
>   pass++;
>   goto again;
>  
> + } else if (pass == tbl->nr_pools + 1) {
> + /* Last resort: try largepool */
> + spin_unlock(&pool->lock);
> + pool = &tbl->large_pool;
> + spin_lock(&pool->lock);
> + pool->hint = pool->start;
> +     pass++;
> + goto again;
> +


A nit: unnecessary new line.


Reviewed-by: Alexey Kardashevskiy 



>   } else {
>   /* Give up */
>   spin_unlock_irqrestore(&(pool->lock), flags);
> 

-- 
Alexey


Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-08-22 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> Both iommu_alloc_coherent() and iommu_free_coherent() assume that once
> size is aligned to PAGE_SIZE it will be aligned to IOMMU_PAGE_SIZE.

The only case when it is not aligned is when IOMMU_PAGE_SIZE > PAGE_SIZE,
which is unlikely but not impossible: we could configure the kernel for
4K system pages and 64K IOMMU pages, I suppose. Do we really want to
handle that here, or simply put WARN_ON(tbl->it_page_shift > PAGE_SHIFT)?
Because if we want the former (== support), then we'll have to align the
size up to the bigger page size when allocating/zeroing system pages,
etc. Bigger pages are not the case here as I understand it.
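
Supporting it would mean something like this (a sketch only, the helper
name is made up):

/*
 * Align a coherent allocation to whichever page size is larger, so both
 * the system-page allocation/memset and the TCE mapping cover whole
 * pages.  IOMMU_PAGE_ALIGN() is a no-op when IOMMU pages are not bigger
 * than system pages.
 */
static size_t iommu_coherent_size(struct iommu_table *tbl, size_t size)
{
	return IOMMU_PAGE_ALIGN(PAGE_ALIGN(size), tbl);
}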


> 
> Update those functions to guarantee alignment with requested size
> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
> 
> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/kernel/iommu.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9704f3f76e63..d7086087830f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device 
> *dev,
>   }
>  
>   if (dev)
> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> -   1 << tbl->it_page_shift);
> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, 
> tbl);


Run checkpatch.pl; it should complain about a long line.


>   else
> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>   /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>  
>   n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct 
> iommu_table *tbl,
>   unsigned int order;
>   unsigned int nio_pages, io_order;
>   struct page *page;
> + size_t size_io = size;
>  
>   size = PAGE_ALIGN(size);
>   order = get_order(size);
> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct 
> iommu_table *tbl,
>   memset(ret, 0, size);
>  
>   /* Set up tces to cover the allocated range */
> - nio_pages = size >> tbl->it_page_shift;
> - io_order = get_iommu_order(size, tbl);
> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> + nio_pages = size_io >> tbl->it_page_shift;
> + io_order = get_iommu_order(size_io, tbl);
>   mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> mask >> tbl->it_page_shift, io_order, 0);
>   if (mapping == DMA_MAPPING_ERROR) {
> @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, 
> size_t size,
>void *vaddr, dma_addr_t dma_handle)
>  {
>   if (tbl) {
> - unsigned int nio_pages;
> + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> + unsigned int nio_pages = size_io >> tbl->it_page_shift;
>  
> - size = PAGE_ALIGN(size);
> - nio_pages = size >> tbl->it_page_shift;
>   iommu_free(tbl, dma_handle, nio_pages);
> +

Unrelated new line.


>   size = PAGE_ALIGN(size);
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
> 

-- 
Alexey


Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-22 Thread Alexey Kardashevskiy



On 18/08/2020 09:40, Leonardo Bras wrote:
> Some functions assume IOMMU page size can only be 4K (pageshift == 12).
> Update them to accept any page size passed, so we can use 64K pages.
> 
> In the process, some defines like TCE_SHIFT were made obsolete, and then
> removed. TCE_RPN_MASK was updated to generate a mask according to
> the pageshift used.
> 
> Most places had a tbl struct, so using tbl->it_page_shift was simple.
> tce_free_pSeriesLP() was a special case, since callers not always have a
> tbl struct, so adding a tceshift parameter seems the right thing to do.
> 
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/include/asm/tce.h | 10 ++
>  arch/powerpc/platforms/pseries/iommu.c | 42 --
>  2 files changed, 28 insertions(+), 24 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index db5fc2f2262d..971cba2d87cc 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -19,15 +19,9 @@
>  #define TCE_VB   0
>  #define TCE_PCI  1
>  
> -/* TCE page size is 4096 bytes (1 << 12) */
> -
> -#define TCE_SHIFT12
> -#define TCE_PAGE_SIZE(1 << TCE_SHIFT)
> -
>  #define TCE_ENTRY_SIZE   8   /* each TCE is 64 bits 
> */
> -
> -#define TCE_RPN_MASK 0xfful  /* 40-bit RPN (4K pages) */
> -#define TCE_RPN_SHIFT12
> +#define TCE_RPN_BITS 52  /* Bits 0-51 represent RPN on 
> TCE */


Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
is the actual limit.
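
i.e. (untested, just spelling the suggestion out):

#define TCE_RPN_MASK(ps)	((1ul << (MAX_PHYSMEM_BITS - (ps))) - 1)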


> +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
>  #define TCE_VALID0x800   /* TCE valid */
>  #define TCE_ALLIO0x400   /* TCE valid for all lpars */
>  #define TCE_PCI_WRITE0x2 /* write from PCI 
> allowed */
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index e4198700ed1a..8fe23b7dff3a 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
> long index,
>   u64 proto_tce;
>   __be64 *tcep;
>   u64 rpn;
> + const unsigned long tceshift = tbl->it_page_shift;
> + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);

Using the IOMMU_PAGE_SIZE macro for the page size but not IOMMU_PAGE_MASK
for the mask - this inconsistency makes my small brain explode :) I
understand the history but man... Oh well, ok.

Good, otherwise. Thanks,

>  
>   proto_tce = TCE_PCI_READ; // Read allowed
>  
> @@ -117,10 +120,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
> long index,
>  
>   while (npages--) {
>   /* can't move this out since we might cross MEMBLOCK boundary */
> - rpn = __pa(uaddr) >> TCE_SHIFT;
> - *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << 
> TCE_RPN_SHIFT);
> + rpn = __pa(uaddr) >> tceshift;
> + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
>  
> - uaddr += TCE_PAGE_SIZE;
> + uaddr += pagesize;
>   tcep++;
>   }
>   return 0;
> @@ -146,7 +149,7 @@ static unsigned long tce_get_pseries(struct iommu_table 
> *tbl, long index)
>   return be64_to_cpu(*tcep);
>  }
>  
> -static void tce_free_pSeriesLP(unsigned long liobn, long, long);
> +static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
>  static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
>  
>  static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long 
> tceshift,
> @@ -159,6 +162,7 @@ static int tce_build_pSeriesLP(unsigned long liobn, long 
> tcenum, long tceshift,
>   u64 rpn;
>   int ret = 0;
>   long tcenum_start = tcenum, npages_start = npages;
> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);
>  
>   rpn = __pa(uaddr) >> tceshift;
>   proto_tce = TCE_PCI_READ;
> @@ -166,12 +170,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, 
> long tcenum, long tceshift,
>   proto_tce |= TCE_PCI_WRITE;
>  
>   while (npages--) {
> - tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
> + tce = proto_tce | (rpn & rpn_mask) << tceshift;
>   rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
>  
>   if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
>   ret = (int)rc;
> - tce_free_pSeriesLP(liobn, tcenum_start,
> + tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
>  (npages_start - (npages + 1)));
>   break;
>   }
> @@ -205,10 +209,12 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 

Re: [PATCH 1/2] lockdep: improve current->(hard|soft)irqs_enabled synchronisation with actual irq state

2020-08-19 Thread Alexey Kardashevskiy



On 19/08/2020 09:54, Nicholas Piggin wrote:
> Excerpts from pet...@infradead.org's message of August 19, 2020 1:41 am:
>> On Tue, Aug 18, 2020 at 05:22:33PM +1000, Nicholas Piggin wrote:
>>> Excerpts from pet...@infradead.org's message of August 12, 2020 8:35 pm:
 On Wed, Aug 12, 2020 at 06:18:28PM +1000, Nicholas Piggin wrote:
> Excerpts from pet...@infradead.org's message of August 7, 2020 9:11 pm:
>>
>> What's wrong with something like this?
>>
>> AFAICT there's no reason to actually try and add IRQ tracing here, it's
>> just a hand full of instructions at the most.
>
> Because we may want to use that in other places as well, so it would
> be nice to have tracing.
>
> Hmm... also, I thought NMI context was free to call local_irq_save/restore
> anyway so the bug would still be there in those cases?

 NMI code has in_nmi() true, in which case the IRQ tracing is disabled
 (except for x86 which has CONFIG_TRACE_IRQFLAGS_NMI).

>>>
>>> That doesn't help. It doesn't fix the lockdep irq state going out of
>>> synch with the actual irq state. The code which triggered this with the
>>> special powerpc irq disable has in_nmi() true as well.
>>
>> Urgh, you're talking about using lockdep_assert_irqs*() from NMI
>> context?
>>
>> If not, I'm afraid I might've lost the plot a little on what exact
>> failure case we're talking about.
>>
> 
> Hm, I may have been a bit confused actually. Since your Fix 
> TRACE_IRQFLAGS vs NMIs patch it might now work.
> 
> I'm worried powerpc disables trace irqs trace_hardirqs_off()
> before nmi_enter() might still be a problem, but not sure
> actually. Alexey did you end up re-testing with Peter's patch

The one above in the thread which replaces powerpc_local_irq_pmu_save()
with raw_powerpc_local_irq_pmu_save()? It did not compile, as there is no
raw_powerpc_local_irq_pmu_save(), so I may be missing something here.

I applied the patch on top of the current upstream and replaced
raw_powerpc_local_irq_pmu_save() with raw_local_irq_pmu_save() (which I
think was the intention), but I still see the issue.

> or current upstream?

The upstream 18445bf405cb (13 hours old) also shows the problem. Yours
1/2 still fixes it.


> 
> Thanks,
> Nick
> 

-- 
Alexey


Re: [PATCH v5 4/4] powerpc/pseries/iommu: Allow bigger 64bit window by removing default DMA window

2020-08-04 Thread Alexey Kardashevskiy



On 05/08/2020 13:04, Leonardo Bras wrote:
> On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
> default DMA window for the device, before attempting to configure a DDW,
> in order to make the maximum resources available for the next DDW to be
> created.
> 
> This is a requirement for using DDW on devices in which hypervisor
> allows only one DMA window.
> 
> If setting up a new DDW fails anywhere after the removal of this
> default DMA window, it's needed to restore the default DMA window.
> For this, an implementation of ibm,reset-pe-dma-windows rtas call is
> needed:
> 
> Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> ibm,ddw-extensions. The first extension available (index 2) carries the
> token for ibm,reset-pe-dma-windows rtas call, which is used to restore
> the default DMA window for a device, if it has been deleted.
> 
> It does so by resetting the TCE table allocation for the PE to it's
> boot time value, available in "ibm,dma-window" device tree node.
> 
> Signed-off-by: Leonardo Bras 
> Tested-by: David Dai 


Reviewed-by: Alexey Kardashevskiy 



> ---
>  arch/powerpc/platforms/pseries/iommu.c | 73 +++---
>  1 file changed, 66 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index 4e33147825cc..e4198700ed1a 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1066,6 +1066,38 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>   return max_addr;
>  }
>  
> +/*
> + * Platforms supporting the DDW option starting with LoPAR level 2.7 
> implement
> + * ibm,ddw-extensions, which carries the rtas token for
> + * ibm,reset-pe-dma-windows.
> + * That rtas-call can be used to restore the default DMA window for the 
> device.
> + */
> +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> +{
> + int ret;
> + u32 cfg_addr, reset_dma_win;
> + u64 buid;
> + struct device_node *dn;
> + struct pci_dn *pdn;
> +
> + ret = ddw_read_ext(par_dn, DDW_EXT_RESET_DMA_WIN, &reset_dma_win);
> + if (ret)
> + return;
> +
> + dn = pci_device_to_OF_node(dev);
> + pdn = PCI_DN(dn);
> + buid = pdn->phb->buid;
> + cfg_addr = (pdn->busno << 16) | (pdn->devfn << 8);
> +
> + ret = rtas_call(reset_dma_win, 3, 1, NULL, cfg_addr, BUID_HI(buid),
> + BUID_LO(buid));
> + if (ret)
> + dev_info(&dev->dev,
> +  "ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
> +  reset_dma_win, cfg_addr, BUID_HI(buid), BUID_LO(buid),
> +  ret);
> +}
> +
>  /*
>   * If the PE supports dynamic dma windows, and there is space for a table
>   * that can map all pages in a linear offset, then setup such a table,
> @@ -1090,6 +1122,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   struct property *win64;
>   struct dynamic_dma_window_prop *ddwprop;
>   struct failed_ddw_pdn *fpdn;
> + bool default_win_removed = false;
>  
>   mutex_lock(&direct_window_init_mutex);
>  
> @@ -1133,14 +1166,38 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> device_node *pdn)
>   if (ret != 0)
>   goto out_failed;
>  
> + /*
> +  * If there is no window available, remove the default DMA window,
> +  * if it's present. This will make all the resources available to the
> +  * new DDW window.
> +  * If anything fails after this, we need to restore it, so also check
> +  * for extensions presence.
> +  */
>   if (query.windows_available == 0) {
> - /*
> -  * no additional windows are available for this device.
> -  * We might be able to reallocate the existing window,
> -  * trading in for a larger page size.
> -  */
> - dev_dbg(&dev->dev, "no free dynamic windows");
> - goto out_failed;
> + struct property *default_win;
> + int reset_win_ext;
> +
> + default_win = of_find_property(pdn, "ibm,dma-window", NULL);
> + if (!default_win)
> + goto out_failed;
> +
> + reset_win_ext = ddw_read_ext(pdn, DDW_EXT_RESET_DMA_WIN, NULL);
> + if (reset_win_ext)
> + goto out_failed;
> +
> + remove_dma_window(pdn, ddw_avail, default_win);
> + default_win_removed = true;
> +
>
