Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-22 Thread Christophe Leroy




Le 23/04/2021 à 00:58, Daniel Axtens a écrit :

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index c9d2c7825cd6..3a2f2001c62b 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
  KBUILD_AFLAGS += $(AFLAGS-y)
  KBUILD_CFLAGS += $(call cc-option,-msoft-float)
  KBUILD_CFLAGS += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)


I was trying to check the history to see why powerpc has its own
definition. It seems to date back at least as far as merging the two
powerpc platforms into one, maybe it was helpful then. I agree it
doesn't seem to be needed now.



I dug a bit deeper. It has been there since the introduction of arch/ppc/ in
Linux 1.3.45.
At the time, there was the exact same CPP definition in arch/mips/Makefile

The CPP definition in mips disappeared in Linux 2.1.44pre3.
Since that version, ppc has been the only one with such a CPP re-definition.


Snowpatch claims that this breaks the build, but I haven't been able to
reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it
off to a fork of mpe's linux-ci repo to see if any of those builds hit
any issues: https://github.com/daxtens/linux-ci/actions



By the way, I find snowpatch quite useless these days. It seems to delete the reports a few minutes 
after the test. We are less than one day after the patch was submitted and the reports of the build 
failures are already gone.


Christophe


Re: [PATCH 7/7] swiotlb: don't override the command line in swiotlb_adjust_size

2021-04-22 Thread Tom Lendacky
On 4/22/21 2:19 AM, Christoph Hellwig wrote:
> When the user specified an explicit swiotlb size on the command line,
> the architecture code should not override it.
> 
> Fixes: 2cbc2776efe4 ("swiotlb: remove swiotlb_nr_tbl")
> Reported-by: Tom Lendacky 
> Signed-off-by: Christoph Hellwig 

I tested this patchset and I'm not able to get the override via the
command line or via the SEV adjustment function. Looking at the code,
swiotlb_default_size is initialized, so the call to swiotlb_adjust_size()
always returns without setting the new size.

Thanks,
Tom

> ---
>  kernel/dma/swiotlb.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index 87d06ddf4753f3..aef02a3825b494 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -106,7 +106,9 @@ void swiotlb_set_max_segment(unsigned int val)
>  
>  unsigned long swiotlb_size_or_default(void)
>  {
> - return swiotlb_default_size;
> + if (swiotlb_default_size)
> + return swiotlb_default_size;
> + return IO_TLB_DEFAULT_SIZE;
>  }
>  
>  void __init swiotlb_adjust_size(unsigned long size)
> @@ -116,6 +118,8 @@ void __init swiotlb_adjust_size(unsigned long size)
>* architectures such as those supporting memory encryption to
>* adjust/expand SWIOTLB size for their use.
>*/
> + if (swiotlb_default_size)
> + return;
>   swiotlb_default_size = ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT);
>   pr_info("SWIOTLB bounce buffer size adjusted to %luMB",
>   swiotlb_default_size >> 20);
> 
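To make the ordering problem Tom describes concrete, here is a minimal userspace model of it. The names mirror the patch; the static initializer and the IO_TLB_DEFAULT_SIZE value are assumptions used for illustration, not a claim about the exact state of the tree.

#include <stdio.h>

#define IO_TLB_DEFAULT_SIZE (64UL << 20)	/* assumed 64MB built-in default */

/* Assumption: the variable already holds the default rather than staying 0
 * until the user passes swiotlb= on the command line. */
static unsigned long swiotlb_default_size = IO_TLB_DEFAULT_SIZE;

static void swiotlb_adjust_size(unsigned long size)
{
	if (swiotlb_default_size)	/* the guard added by this patch */
		return;			/* always taken with the initializer above */
	swiotlb_default_size = size;
}

int main(void)
{
	swiotlb_adjust_size(256UL << 20);		/* e.g. SEV asking for more */
	printf("%luMB\n", swiotlb_default_size >> 20);	/* prints 64, not 256 */
	return 0;
}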


[powerpc:next] BUILD SUCCESS ee6b25fa7c037e42cc5f3b5c024b2a779edab6dd

2021-04-22 Thread kernel test robot
x86_64   randconfig-a002-20210421
x86_64   randconfig-a001-20210421
x86_64   randconfig-a005-20210421
x86_64   randconfig-a006-20210421
x86_64   randconfig-a003-20210421
i386 randconfig-a005-20210421
i386 randconfig-a002-20210421
i386 randconfig-a001-20210421
i386 randconfig-a006-20210421
i386 randconfig-a004-20210421
i386 randconfig-a003-20210421
i386 randconfig-a012-20210421
i386 randconfig-a014-20210421
i386 randconfig-a011-20210421
i386 randconfig-a013-20210421
i386 randconfig-a015-20210421
i386 randconfig-a016-20210421
i386 randconfig-a014-20210422
i386 randconfig-a012-20210422
i386 randconfig-a011-20210422
i386 randconfig-a013-20210422
i386 randconfig-a015-20210422
i386 randconfig-a016-20210422
riscv nommu_k210_defconfig
riscv nommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
um   allmodconfig
um allnoconfig
um   allyesconfig
um  defconfig
x86_64 rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-a015-20210421
x86_64   randconfig-a016-20210421
x86_64   randconfig-a011-20210421
x86_64   randconfig-a014-20210421
x86_64   randconfig-a013-20210421
x86_64   randconfig-a012-20210421
x86_64   randconfig-a002-20210422
x86_64   randconfig-a004-20210422
x86_64   randconfig-a001-20210422
x86_64   randconfig-a005-20210422
x86_64   randconfig-a006-20210422
x86_64   randconfig-a003-20210422

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[PATCH 4/4] powerpc/pseries: warn if recursing into the hcall tracing code

2021-04-22 Thread Nicholas Piggin
---
 arch/powerpc/platforms/pseries/lpar.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 835e7f661a05..a961a7ebeab3 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1828,8 +1828,11 @@ void hcall_tracepoint_unregfunc(void)
 
 /*
  * Since the tracing code might execute hcalls we need to guard against
- * recursion. H_CONFER from spin locks must be treated separately though
- * and use _notrace plpar_hcall variants, see yield_to_preempted().
+ * recursion, but this always seems risky -- __trace_hcall_entry might be
+ * ftraced, for example. So warn in this case.
+ *
+ * H_CONFER from spin locks must be treated separately though and use _notrace
+ * plpar_hcall variants, see yield_to_preempted().
  */
 static DEFINE_PER_CPU(unsigned int, hcall_trace_depth);
 
@@ -1843,7 +1846,7 @@ notrace void __trace_hcall_entry(unsigned long opcode, 
unsigned long *args)
 
	depth = this_cpu_ptr(&hcall_trace_depth);

-	if (*depth)
+	if (WARN_ON_ONCE(*depth))
		goto out;

	(*depth)++;
@@ -1864,7 +1867,7 @@ notrace void __trace_hcall_exit(long opcode, long retval, 
unsigned long *retbuf)
 
	depth = this_cpu_ptr(&hcall_trace_depth);

-	if (*depth)
+	if (*depth) /* Don't warn again on the way out */
		goto out;

	(*depth)++;
-- 
2.23.0
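For readers following along, a condensed, userspace-runnable sketch of the guard pattern these hunks touch: a depth counter stops the tracing path from re-entering itself. The real code uses a per-CPU counter and disables interrupts; both are omitted here, and the tracer call is a stand-in.

#include <stdio.h>

static unsigned int hcall_trace_depth;

static void trace_hcall_event(unsigned long opcode);

static void __trace_hcall_entry(unsigned long opcode)
{
	if (hcall_trace_depth)		/* patch 4/4 wraps this test in WARN_ON_ONCE() */
		return;			/* already inside the tracer: bail out */

	hcall_trace_depth++;
	trace_hcall_event(opcode);	/* may itself issue (traced) hcalls */
	hcall_trace_depth--;
}

static void trace_hcall_event(unsigned long opcode)
{
	/* pretend the tracing code makes another hcall, re-entering the wrapper */
	__trace_hcall_entry(opcode + 1);
	printf("traced hcall %lu\n", opcode);
}

int main(void)
{
	__trace_hcall_entry(4);		/* prints once; the recursion is cut off */
	return 0;
}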



[PATCH 3/4] powerpc/pseries: use notrace hcall variant for H_CEDE idle

2021-04-22 Thread Nicholas Piggin
Rather than special-case H_CEDE in the hcall trace wrappers, make the
idle H_CEDE call use plpar_hcall_norets_notrace().

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/plpar_wrappers.h |  6 +-
 arch/powerpc/platforms/pseries/lpar.c | 10 --
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/plpar_wrappers.h 
b/arch/powerpc/include/asm/plpar_wrappers.h
index ece84a430701..83e0f701ebc6 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -28,7 +28,11 @@ static inline void set_cede_latency_hint(u8 latency_hint)
 
 static inline long cede_processor(void)
 {
-   return plpar_hcall_norets(H_CEDE);
+   /*
+* We cannot call tracepoints inside RCU idle regions which
+* means we must not trace H_CEDE.
+*/
+   return plpar_hcall_norets_notrace(H_CEDE);
 }
 
 static inline long extended_cede_processor(unsigned long latency_hint)
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 0641779e5cde..835e7f661a05 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1839,13 +1839,6 @@ notrace void __trace_hcall_entry(unsigned long opcode, 
unsigned long *args)
unsigned long flags;
unsigned int *depth;
 
-   /*
-* We cannot call tracepoints inside RCU idle regions which
-* means we must not trace H_CEDE.
-*/
-   if (opcode == H_CEDE)
-   return;
-
local_irq_save(flags);
 
	depth = this_cpu_ptr(&hcall_trace_depth);
@@ -1867,9 +1860,6 @@ notrace void __trace_hcall_exit(long opcode, long retval, 
unsigned long *retbuf)
unsigned long flags;
unsigned int *depth;
 
-   if (opcode == H_CEDE)
-   return;
-
local_irq_save(flags);
 
	depth = this_cpu_ptr(&hcall_trace_depth);
-- 
2.23.0



[PATCH 2/4] powerpc/pseries: Don't trace hcall tracing wrapper

2021-04-22 Thread Nicholas Piggin
This doesn't seem very useful to trace before the recursion check, even
if the ftrace code has any recursion checks of its own. Be on the safe
side and don't trace the hcall trace wrappers.

Reported-by: Naveen N. Rao 
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/platforms/pseries/lpar.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 6011b7b80946..0641779e5cde 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1834,7 +1834,7 @@ void hcall_tracepoint_unregfunc(void)
 static DEFINE_PER_CPU(unsigned int, hcall_trace_depth);
 
 
-void __trace_hcall_entry(unsigned long opcode, unsigned long *args)
+notrace void __trace_hcall_entry(unsigned long opcode, unsigned long *args)
 {
unsigned long flags;
unsigned int *depth;
@@ -1862,7 +1862,7 @@ void __trace_hcall_entry(unsigned long opcode, unsigned 
long *args)
local_irq_restore(flags);
 }
 
-void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf)
+notrace void __trace_hcall_exit(long opcode, long retval, unsigned long 
*retbuf)
 {
unsigned long flags;
unsigned int *depth;
-- 
2.23.0
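As background for the notrace annotations added above, a small standalone sketch of what the marker amounts to. The macro below is the kernel's usual fallback definition; the exact attribute varies by architecture and compiler configuration, so treat it as illustrative only.

/* Build with e.g.: gcc -Wall -pg -c notrace_sketch.c */
#define notrace __attribute__((no_instrument_function))

notrace void trace_wrapper_sketch(void)
{
	/* Excluded from function-entry instrumentation, so with the compilers
	 * the kernel supports, ftrace gets no hook here through which it
	 * could recurse back into the tracing code. */
}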



[PATCH 1/4] powerpc/pseries: Fix hcall tracing recursion in pv queued spinlocks

2021-04-22 Thread Nicholas Piggin
The paravirt queued spinlock slow path adds itself to the queue then
calls pv_wait to wait for the lock to become free. This is implemented
by calling H_CONFER to donate cycles.

When hcall tracing is enabled, this H_CONFER call can lead to a spin
lock being taken in the tracing code, which results in the lock being
taken again; that second attempt also goes to the slow path because it
queues behind itself and so never makes progress.

An example trace of a deadlock:

  __pv_queued_spin_lock_slowpath
  trace_clock_global
  ring_buffer_lock_reserve
  trace_event_buffer_lock_reserve
  trace_event_buffer_reserve
  trace_event_raw_event_hcall_exit
  __trace_hcall_exit
  plpar_hcall_norets_trace
  __pv_queued_spin_lock_slowpath
  trace_clock_global
  ring_buffer_lock_reserve
  trace_event_buffer_lock_reserve
  trace_event_buffer_reserve
  trace_event_raw_event_rcu_dyntick
  rcu_irq_exit
  irq_exit
  __do_irq
  call_do_irq
  do_IRQ
  hardware_interrupt_common_virt

Fix this by introducing plpar_hcall_norets_notrace(), and using it for
the SPLPAR virtual processor dispatching hcalls made by the paravirt
spinlock code.

Fixes: 20c0e8269e9d ("powerpc/pseries: Implement paravirt qspinlocks for 
SPLPAR")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/hvcall.h   |  3 +++
 arch/powerpc/include/asm/paravirt.h | 22 +++---
 arch/powerpc/platforms/pseries/hvCall.S | 10 ++
 arch/powerpc/platforms/pseries/lpar.c   |  4 ++--
 4 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/hvcall.h 
b/arch/powerpc/include/asm/hvcall.h
index ed6086d57b22..0c92b01a3c3c 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -446,6 +446,9 @@
  */
 long plpar_hcall_norets(unsigned long opcode, ...);
 
+/* Variant which does not do hcall tracing */
+long plpar_hcall_norets_notrace(unsigned long opcode, ...);
+
 /**
  * plpar_hcall: - Make a pseries hypervisor call
  * @opcode: The hypervisor call to make.
diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 5d1726bb28e7..3c13c2ec70a9 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -30,17 +30,33 @@ static inline u32 yield_count_of(int cpu)
 
 static inline void yield_to_preempted(int cpu, u32 yield_count)
 {
-   plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), 
yield_count);
+   /*
+* Spinlock code yields and prods, so don't trace the hcalls because
+* tracing code takes spinlocks which could recurse.
+*
+* These calls are made while the lock is not held, the lock slowpath
+* yields if it can not acquire the lock, and unlock slow path might
+* prods if a waiter has yielded. So this did not seem to be a problem
+* for simple spin locks because technically it didn't recurse on the
+* lock. However the queued spin lock contended path is more strictly
+* ordered: the H_CONFER hcall is made after the task has queued itself
+* on the lock, so then recursing on the lock will queue up behind that
+* (or worse: queued spinlocks use tricks that assume a context never
+* waits on more than one spinlock, so that may cause random
+* corruption).
+*/
+   plpar_hcall_norets_notrace(H_CONFER,
+  get_hard_smp_processor_id(cpu), yield_count);
 }
 
 static inline void prod_cpu(int cpu)
 {
-   plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
+   plpar_hcall_norets_notrace(H_PROD, get_hard_smp_processor_id(cpu));
 }
 
 static inline void yield_to_any(void)
 {
-   plpar_hcall_norets(H_CONFER, -1, 0);
+   plpar_hcall_norets_notrace(H_CONFER, -1, 0);
 }
 #else
 static inline bool is_shared_processor(void)
diff --git a/arch/powerpc/platforms/pseries/hvCall.S 
b/arch/powerpc/platforms/pseries/hvCall.S
index 2136e42833af..8a2b8d64265b 100644
--- a/arch/powerpc/platforms/pseries/hvCall.S
+++ b/arch/powerpc/platforms/pseries/hvCall.S
@@ -102,6 +102,16 @@ END_FTR_SECTION(0, 1); 
\
 #define HCALL_BRANCH(LABEL)
 #endif
 
+_GLOBAL_TOC(plpar_hcall_norets_notrace)
+   HMT_MEDIUM
+
+	mfcr	r0
+	stw	r0,8(r1)
+	HVSC			/* invoke the hypervisor */
+	lwz	r0,8(r1)
+	mtcrf	0xff,r0
+	blr			/* return r3 = status */
+
 _GLOBAL_TOC(plpar_hcall_norets)
HMT_MEDIUM
 
diff --git a/arch/powerpc/platforms/pseries/lpar.c 
b/arch/powerpc/platforms/pseries/lpar.c
index 3805519a6469..6011b7b80946 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -1828,8 +1828,8 @@ void hcall_tracepoint_unregfunc(void)
 
 /*
  * Since the tracing code might execute hcalls we need to guard against
- * recursion. One example of this are spinlocks calling H_YIELD on
- * shared 

[PATCH 0/4] Fix queued spinlocks and a bit more

2021-04-22 Thread Nicholas Piggin
Patch 1 is the important fix. 2 might fix something, although I haven't
provoked a crash yet.

Patch 3 is a small cleanup, and patch 4 I think is the right thing to do
but these could wait for later.

Thanks,
Nick

Nicholas Piggin (4):
  powerpc/pseries: Fix hcall tracing recursion in pv queued spinlocks
  powerpc/pseries: Don't trace hcall tracing wrapper
  powerpc/pseries: use notrace hcall variant for H_CEDE idle
  powerpc/pseries: warn if recursing into the hcall tracing code

 arch/powerpc/include/asm/hvcall.h |  3 +++
 arch/powerpc/include/asm/paravirt.h   | 22 +---
 arch/powerpc/include/asm/plpar_wrappers.h |  6 +-
 arch/powerpc/platforms/pseries/hvCall.S   | 10 +
 arch/powerpc/platforms/pseries/lpar.c | 25 ---
 5 files changed, 46 insertions(+), 20 deletions(-)

-- 
2.23.0



Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-22 Thread Alexey Kardashevskiy




On 23/04/2021 08:58, Daniel Axtens wrote:

Hi Alexey,


The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
in flags duplication. Which is not a big deal by itself except for
the flags which depend on other flags and the compiler checks them
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall


Checkpatch doesn't like this line:
WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a 
maximum 75 chars per line)
#14:
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall
However, it doesn't make sense to wrap the line so we should just ignore
checkpatch here.


in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition.

Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/Makefile | 1 -
  1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index c9d2c7825cd6..3a2f2001c62b 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
  KBUILD_AFLAGS += $(AFLAGS-y)
  KBUILD_CFLAGS += $(call cc-option,-msoft-float)
  KBUILD_CFLAGS += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)


I was trying to check the history to see why powerpc has its own
definition. It seems to date back at least as far as merging the two
powerpc platforms into one, maybe it was helpful then. I agree it
doesn't seem to be needed now.

Snowpatch claims that this breaks the build, but I haven't been able to
reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it
off to a fork of mpe's linux-ci repo to see if any of those builds hit
any issues: https://github.com/daxtens/linux-ci/actions


To be precise, you need LLVM + LTO + bitcode (-emit-llvm-bc); I am not 
even sure what the point is of having -flto without -emit-llvm-bc.


No flags duplication:

[fstn1-p1 1]$ /mnt/sdb/pbuild/llvm-no-lto/bin/clang-13  -emit-llvm-bc 
-fno-lto -flto -fvisibility=hidden -fsanitize=cfi-mfcall 
/mnt/sdb/pbuild/llvm-bugs/1/a.c
/usr/bin/ld: warning: cannot find entry symbol mit-llvm-bc; defaulting 
to 13e0
/usr/bin/ld: 
/usr/lib/powerpc64le-linux-gnu/crt1.o:(.data.rel.ro.local+0x8): 
undefined reference to `main'
clang-13: error: linker command failed with exit code 1 (use -v to see 
invocation)


=> command line is fine, the file is not (but it is for debugging this 
problem).



Now I am adding the second -fno-lto:

[fstn1-p1 1]$ /mnt/sdb/pbuild/llvm-no-lto/bin/clang-13  -emit-llvm-bc 
-fno-lto -flto -fvisibility=hidden -fsanitize=cfi-mfcall -fno-lto 
/mnt/sdb/pbuild/llvm-bugs/1/a.c
clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed 
with '-flto'



=> failed.


Assuming that doesn't break, this patch looks good to me:
Reviewed-by: Daniel Axtens 

Kind regards,
Daniel



--
Alexey


Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool

2021-04-22 Thread Claire Chang
On Thu, Apr 22, 2021 at 4:17 PM Claire Chang  wrote:
>
> If a device is not behind an IOMMU, we look up the device node and set
> up the restricted DMA when the restricted-dma-pool is presented.
>
> Signed-off-by: Claire Chang 
> ---
>  drivers/of/address.c| 25 +
>  drivers/of/device.c |  3 +++
>  drivers/of/of_private.h |  5 +
>  3 files changed, 33 insertions(+)
>
> diff --git a/drivers/of/address.c b/drivers/of/address.c
> index 54f221dde267..fff3adfe4986 100644
> --- a/drivers/of/address.c
> +++ b/drivers/of/address.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np)
>  }
>  EXPORT_SYMBOL_GPL(of_dma_is_coherent);
>
> +int of_dma_set_restricted_buffer(struct device *dev)
> +{
> +   struct device_node *node;
> +   int count, i;
> +
> +   if (!dev->of_node)
> +   return 0;
> +
> +   count = of_property_count_elems_of_size(dev->of_node, "memory-region",
> +   sizeof(phandle));
> +   for (i = 0; i < count; i++) {
> +   node = of_parse_phandle(dev->of_node, "memory-region", i);
> +   /* There might be multiple memory regions, but only one
> +* restricted-dma-pool region is allowed.
> +*/
> +   if (of_device_is_compatible(node, "restricted-dma-pool") &&
> +   of_device_is_available(node))
> +   return of_reserved_mem_device_init_by_idx(
> +   dev, dev->of_node, i);
> +   }
> +
> +   return 0;
> +}
> +
>  /**
>   * of_mmio_is_nonposted - Check if device uses non-posted MMIO
>   * @np:device node
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index c5a9473a5fb1..d8d865223e51 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct 
> device_node *np,
>
> arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
>
> +   if (!iommu)
> +   return of_dma_set_restricted_buffer(dev);
> +
> return 0;
>  }
>  EXPORT_SYMBOL_GPL(of_dma_configure_id);
> diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
> index d717efbd637d..e9237f5eff48 100644
> --- a/drivers/of/of_private.h
> +++ b/drivers/of/of_private.h
> @@ -163,12 +163,17 @@ struct bus_dma_region;
>  #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
>  int of_dma_get_range(struct device_node *np,
> const struct bus_dma_region **map);
> +int of_dma_set_restricted_buffer(struct device *dev);
>  #else
>  static inline int of_dma_get_range(struct device_node *np,
> const struct bus_dma_region **map)
>  {
> return -ENODEV;
>  }
> +static inline int of_dma_get_restricted_buffer(struct device *dev)

This one should be of_dma_set_restricted_buffer. Sorry for the typo.

> +{
> +   return -ENODEV;
> +}
>  #endif
>
>  #endif /* _LINUX_OF_PRIVATE_H */
> --
> 2.31.1.368.gbe11c130af-goog
>


Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker

2021-04-22 Thread Nick Desaulniers
On Wed, Sep 2, 2020 at 11:02 AM Christophe Leroy
 wrote:
>
>
>
> Le 02/09/2020 à 19:41, Nick Desaulniers a écrit :
> > On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman  wrote:
> >>
> >> Nick Desaulniers  writes:
> >>> Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for 
> >>> orphan sections")
> >>
> >> I think I'll just revert that for v5.9 ?
> >
> > SGTM; you'll probably still want these changes with some modifications
> > at some point; vdso32 did have at least one orphaned section, and will
> > be important for hermetic builds.  Seeing crashes in supported
> > versions of the tools ties our hands at the moment.
> >
>
> Keeping the tool problem aside with binutils 2.26, do you have a way to
> really link an elf32ppc object when  building vdso32 for PPC64 ?

Sorry, I'm doing a bug scrub and found
https://github.com/ClangBuiltLinux/linux/issues/774 still open (and my
reply to this thread still in Drafts; never sent). With my patches
rebased:
$ file arch/powerpc/kernel/vdso32/vdso32.so
arch/powerpc/kernel/vdso32/vdso32.so: ELF 32-bit MSB shared object,
PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, stripped

Are you still using 2.26?

I'm not able to repro Nathan's reported issue from
https://lore.kernel.org/lkml/20200902052123.GA2687902@ubuntu-n2-xlarge-x86/,
so I'm curious if I should resend the rebased patches as v2?

--
Thanks,
~Nick Desaulniers


Re: [PATCH 1/2] audit: add support for the openat2 syscall

2021-04-22 Thread Richard Guy Briggs
On 2021-03-18 08:08, Richard Guy Briggs wrote:
> On 2021-03-18 11:48, Christian Brauner wrote:
> > [+Cc Aleksa, the author of openat2()]
> 
> Ah!  Thanks for pulling in Aleksa.  I thought I caught everyone...
> 
> > and a comment below. :)
> 
> Same...
> 
> > On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote:
> > > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
> > > ("open: introduce openat2(2) syscall")
> > > 
> > > Add the openat2(2) syscall to the audit syscall classifier.
> > > 
> > > See the github issue
> > > https://github.com/linux-audit/audit-kernel/issues/67
> > > 
> > > Signed-off-by: Richard Guy Briggs 
> > > ---
> > >  arch/alpha/kernel/audit.c  | 2 ++
> > >  arch/ia64/kernel/audit.c   | 2 ++
> > >  arch/parisc/kernel/audit.c | 2 ++
> > >  arch/parisc/kernel/compat_audit.c  | 2 ++
> > >  arch/powerpc/kernel/audit.c| 2 ++
> > >  arch/powerpc/kernel/compat_audit.c | 2 ++
> > >  arch/s390/kernel/audit.c   | 2 ++
> > >  arch/s390/kernel/compat_audit.c| 2 ++
> > >  arch/sparc/kernel/audit.c  | 2 ++
> > >  arch/sparc/kernel/compat_audit.c   | 2 ++
> > >  arch/x86/ia32/audit.c  | 2 ++
> > >  arch/x86/kernel/audit_64.c | 2 ++
> > >  kernel/auditsc.c   | 3 +++
> > >  lib/audit.c| 4 
> > >  lib/compat_audit.c | 4 
> > >  15 files changed, 35 insertions(+)
> > > 
> > > diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
> > > index 96a9d18ff4c4..06a911b685d1 100644
> > > --- a/arch/alpha/kernel/audit.c
> > > +++ b/arch/alpha/kernel/audit.c
> > > @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > >   return 3;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 0;
> > >   }
> > > diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
> > > index 5192ca899fe6..5eaa888c8fd3 100644
> > > --- a/arch/ia64/kernel/audit.c
> > > +++ b/arch/ia64/kernel/audit.c
> > > @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > >   return 3;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 0;
> > >   }
> > > diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
> > > index 9eb47b2225d2..fc721a7727ba 100644
> > > --- a/arch/parisc/kernel/audit.c
> > > +++ b/arch/parisc/kernel/audit.c
> > > @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > >   return 3;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 0;
> > >   }
> > > diff --git a/arch/parisc/kernel/compat_audit.c 
> > > b/arch/parisc/kernel/compat_audit.c
> > > index 20c39c9d86a9..fc6d35918c44 100644
> > > --- a/arch/parisc/kernel/compat_audit.c
> > > +++ b/arch/parisc/kernel/compat_audit.c
> > > @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall)
> > >   return 3;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 1;
> > >   }
> > > diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
> > > index a27f3d09..8f32700b0baa 100644
> > > --- a/arch/powerpc/kernel/audit.c
> > > +++ b/arch/powerpc/kernel/audit.c
> > > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > >   return 4;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 0;
> > >   }
> > > diff --git a/arch/powerpc/kernel/compat_audit.c 
> > > b/arch/powerpc/kernel/compat_audit.c
> > > index 55c6ccda0a85..ebe45534b1c9 100644
> > > --- a/arch/powerpc/kernel/compat_audit.c
> > > +++ b/arch/powerpc/kernel/compat_audit.c
> > > @@ -38,6 +38,8 @@ int ppc32_classify_syscall(unsigned syscall)
> > >   return 4;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 1;
> > >   }
> > > diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c
> > > index d395c6c9944c..d964cb94cfaf 100644
> > > --- a/arch/s390/kernel/audit.c
> > > +++ b/arch/s390/kernel/audit.c
> > > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > >   return 4;
> > >   case __NR_execve:
> > >   return 5;
> > > + case __NR_openat2:
> > > + return 6;
> > >   default:
> > >   return 0;
> > >   }
> > > diff --git a/arch/s390/kernel/compat_audit.c 
> > > b/arch/s390/kernel/compat_audit.c
> > > index 444fb1f66944..f7b32933ce0e 100644
> > > --- a/arch/s390/kernel/compat_audit.c
> > > +++ b/arch/s390/kernel/compat_audit.c
> > > @@ -39,6 +39,8 @@ int 

Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Rick Lindsley

On 4/22/21 10:21 AM, Michal Suchánek wrote:


Merging do_reset and do_hard_reset might be a good code cleanup which is
out of the scope of this fix.


I agree, on both points. Accepting the patch, and improving the reset
path are not in conflict with each other.

My position is that improving the reset path is certainly on the table,
but not with this bug or this fix.  Within the context of this discovered
problem, the issue is well understood, the fix is small and addresses the
immediate problem, and it has not generated new bugs under subsequent
testing.  For those reasons, the submitted patch has my support.

Rick



Re: [PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH

2021-04-22 Thread Oliver O'Halloran
On Fri, Apr 23, 2021 at 9:09 AM Daniel Axtens  wrote:
>
> Hi Nick,
>
> > While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
> > possible candidate to convert to #ifdef CONFIG_EEH, but it seems that
> > based on Kconfig dependencies it's not possible to build this file
> > without CONFIG_EEH enabled.
>
> This seemed odd to me, but I think you're right:
>
> arch/powerpc/platforms/Kconfig contains:
>
> config EEH
> bool
> depends on (PPC_POWERNV || PPC_PSERIES) && PCI
> default y
>
> It's not configurable from e.g. make menuconfig because there's no prompt.
> You can attempt to explicitly disable it with e.g. `scripts/config -d EEH`
> but then something like `make oldconfig` will silently re-enable it for
> you.
>
> It's been forced on since commit e49f7a9997c6 ("powerpc/pseries: Rivet
> CONFIG_EEH for pSeries platform") in 2012 which fixed it for
> pseries. That moved out from pseries to pseries + powernv later on.
>
> There are other cleanups in the same vein that could be made, from the
> Makefile (which has files only built with CONFIG_EEH) through to other
> source files. It looks like there's one `#ifdef CONFIG_EEH` in
> arch/powerpc/platforms/powernv/pci-ioda.c that could be pulled out, for
> example.
>
> I think it's probably worth trying to rip out all of those in one patch?

The change in commit e49f7a9997c6 ("powerpc/pseries: Rivet CONFIG_EEH
for pSeries platform") never should have been made.

There's no inherent reason why EEH needs to be enabled and forcing it
on is (IMO) a large part of why EEH support is the byzantine
clusterfuck that it is. One of the things I was working towards was
allowing pseries and powernv to be built with !CONFIG_EEH since that
would help define a clearer boundary between what is "eeh support" and
what is required to support PCI on the platform. Pseries is
particularly bad for this since PAPR says the RTAS calls needed to do
a PCI bus reset are part of the EEH extension, but there's non-EEH
reasons why you might want to use those RTAS calls. The PHB reset that
we do when entering a kdump kernel is a good example since that uses
the same RTAS calls, but it has nothing to do with the EEH recovery
machinery enabled by CONFIG_EEH.

I was looking into that largely because people were considering using
OPAL for microwatt platforms. Breaking the assumption that
powernv==EEH support is one of the few bits of work required to enable
that, but even if you don't go down that road I think everyone would
be better off if you kept a degree of separation between the two.


[PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH

2021-04-22 Thread Nick Desaulniers
While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
possible candidate to convert to #ifdef CONFIG_EEH, but it seems that
based on Kconfig dependencies it's not possible to build this file
without CONFIG_EEH enabled.

Suggested-by: Nathan Chancellor 
Suggested-by: Joe Perches 
Link: https://github.com/ClangBuiltLinux/linux/issues/570
Link: 
https://lore.kernel.org/lkml/67f6cd269684c9aa8463ff4812c3b4605e6739c3.ca...@perches.com/
Signed-off-by: Nick Desaulniers 
---
 arch/powerpc/platforms/powernv/pci.c | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci.c 
b/arch/powerpc/platforms/powernv/pci.c
index 9b9bca169275..591480a37b05 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -711,7 +711,6 @@ int pnv_pci_cfg_write(struct pci_dn *pdn,
return PCIBIOS_SUCCESSFUL;
 }
 
-#if CONFIG_EEH
 static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 {
struct eeh_dev *edev = NULL;
@@ -734,12 +733,6 @@ static bool pnv_pci_cfg_check(struct pci_dn *pdn)
 
return true;
 }
-#else
-static inline pnv_pci_cfg_check(struct pci_dn *pdn)
-{
-   return true;
-}
-#endif /* CONFIG_EEH */
 
 static int pnv_pci_read_config(struct pci_bus *bus,
   unsigned int devfn,

base-commit: 16fc44d6387e260f4932e9248b985837324705d8
prerequisite-patch-id: 950233069fb22099a8ff8990f620f5c3586a3cbd
-- 
2.31.1.498.g6c1eba8ee3d-goog
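For context, the -Wundef distinction the commit message refers to can be shown with a two-case test file (hypothetical, not part of the patch):

/* cc -Wundef -c wundef_sketch.c   -- CONFIG_EEH deliberately left undefined */

#if CONFIG_EEH		/* -Wundef warns: "CONFIG_EEH" is not defined, evaluates to 0 */
int with_if;
#endif

#ifdef CONFIG_EEH	/* no warning: this only tests whether the macro is defined */
int with_ifdef;
#endif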



Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-22 Thread Segher Boessenkool
Hi!

On Fri, Apr 23, 2021 at 12:16:18AM +0200, Gabriel Paubert wrote:
> On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote:
> > On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote:
> > > Sathvika Vasireddy  writes:
> > > > +   if ((regs->ccr) & (1 << (31 - ra)))
> > > > +   op->val = -1;
> > > > +   else if ((regs->ccr) & (1 << (30 - ra)))
> > > > +   op->val = 1;
> > > > +   else
> > > > +   op->val = 0;
> > 
> > It imo is clearer if written
> > 
> > 	if ((regs->ccr << ra) & 0x80000000)
> > 		op->val = -1;
> > 	else if ((regs->ccr << ra) & 0x40000000)
> > 		op->val = 1;
> > 	else
> > 		op->val = 0;
> > 
> > but I guess not everyone agrees :-)
> 
> But this can be made jump free :-):
> 
>   int tmp = regs->ccr << ra;
>   op->val = (tmp >> 31) | ((tmp >> 30) & 1);

The compiler will do so automatically (or think of some better way to
get the same result); in source code, what matters most is readability,
or clarity in general (also clarity to the compiler).

(Right shifts of negative numbers are implementation-defined in C,
fwiw -- but work like you expect in GCC).

> (IIRC the srawi instruction sign-extends its result to 64 bits).

If you consider it to work on 32-bit inputs, yeah, that is a simple way
to express it.

> > > CR field:      7    6    5    4    3    2    1    0
> > > bit:           0123 0123 0123 0123 0123 0123 0123 0123
> > > normal bit #:  0 ................................. 31
> > > ibm bit #:     31 ................................. 0
> > 
> > The bit numbers in CR fields are *always* numbered left-to-right.  I
> > have never seen anyone use LE for it, anyway.
> > 
> > Also, even people who write LE have the bigger end on the left normally
> > (they just write some things right-to-left, and other things
> > left-to-right).
> 
> Around 1985, I had documentation for National's 32032
> (little-endian) processor family, and all the instruction encodings were
> presented with the LSB on the left and MSB on the right.

Ouch!  Did they write "regular" numbers with the least significant digit
on the left as well?

> BTW on these processors, the immediate operands and the offsets (1, 2 or
> 4 bytes) for the addressing modes were encoded in big-endian byte order,
> but I digress. Consistency is overrated ;-)

Inconsistency is the spice of life, yeah :-)


Segher
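Pulling the pieces of this thread together, a standalone sketch of the arithmetic being discussed: the BFA field masked into a bit offset, the explicit two-test form, and Gabriel's branch-free variant. It is illustration only, not the sstep.c code.

#include <stdio.h>
#include <stdint.h>

static long setb_val(uint32_t ccr, unsigned int ra)
{
	ra &= ~0x3u;			/* BFA selects a CR field: clear the low 2 bits */

	if ((ccr << ra) & 0x80000000u)	/* the field's LT bit */
		return -1;
	if ((ccr << ra) & 0x40000000u)	/* the field's GT bit */
		return 1;
	return 0;
}

static long setb_val_branchless(uint32_t ccr, unsigned int ra)
{
	/* Right-shifting a negative value is implementation-defined in C,
	 * but behaves as an arithmetic shift with GCC, as noted above. */
	int32_t tmp = (int32_t)(ccr << (ra & ~0x3u));

	return (tmp >> 31) | ((tmp >> 30) & 1);
}

int main(void)
{
	uint32_t ccr = 0x40000000;	/* GT set in CR field 0 */

	printf("%ld %ld\n", setb_val(ccr, 0), setb_val_branchless(ccr, 0));
	return 0;
}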


Re: [PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH

2021-04-22 Thread Daniel Axtens
Hi Nick,

> While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a
> possible candidate to convert to #ifdef CONFIG_EEH, but it seems that
> based on Kconfig dependencies it's not possible to build this file
> without CONFIG_EEH enabled.

This seemed odd to me, but I think you're right:

arch/powerpc/platforms/Kconfig contains:

config EEH
bool
depends on (PPC_POWERNV || PPC_PSERIES) && PCI
default y

It's not configurable from e.g. make menuconfig because there's no prompt.
You can attempt to explicitly disable it with e.g. `scripts/config -d EEH`
but then something like `make oldconfig` will silently re-enable it for
you.

It's been forced on since commit e49f7a9997c6 ("powerpc/pseries: Rivet
CONFIG_EEH for pSeries platform") in 2012 which fixed it for
pseries. That moved out from pseries to pseries + powernv later on.

There are other cleanups in the same vein that could be made, from the
Makefile (which has files only built with CONFIG_EEH) through to other
source files. It looks like there's one `#ifdef CONFIG_EEH` in
arch/powerpc/platforms/powernv/pci-ioda.c that could be pulled out, for
example.

I think it's probably worth trying to rip out all of those in one patch?

Kind regards,
Daniel


Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-22 Thread Daniel Axtens
Hi Alexey,

> The $(CPP) (do only preprocessing) macro is already defined in Makefile.
> However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
> in flags duplication. Which is not a big deal by itself except for
> the flags which depend on other flags and the compiler checks them
> as it parses the command line.
>
> Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
> If clang+llvm+sanitizer are enabled, this results in
> -fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto 
> -fsanitize=cfi-mfcall

Checkpatch doesn't like this line:
WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a 
maximum 75 chars per line)
#14: 
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall
However, it doesn't make sense to wrap the line so we should just ignore
checkpatch here.

> in the clang command line and triggers error:
>
> clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
> '-flto'
>
> This removes the unnecessary CPP redefinition.
>
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/Makefile | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
> index c9d2c7825cd6..3a2f2001c62b 100644
> --- a/arch/powerpc/Makefile
> +++ b/arch/powerpc/Makefile
> @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS   += -I $(srctree)/arch/$(ARCH) $(asinstr)
>  KBUILD_AFLAGS+= $(AFLAGS-y)
>  KBUILD_CFLAGS+= $(call cc-option,-msoft-float)
>  KBUILD_CFLAGS+= -pipe $(CFLAGS-y)
> -CPP  = $(CC) -E $(KBUILD_CFLAGS)

I was trying to check the history to see why powerpc has its own
definition. It seems to date back at least as far as merging the two
powerpc platforms into one, maybe it was helpful then. I agree it
doesn't seem to be needed now.

Snowpatch claims that this breaks the build, but I haven't been able to
reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it
off to a fork of mpe's linux-ci repo to see if any of those builds hit
any issues: https://github.com/daxtens/linux-ci/actions

Assuming that doesn't break, this patch looks good to me:
Reviewed-by: Daniel Axtens 

Kind regards,
Daniel


Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-22 Thread Gabriel Paubert


Hi,

On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote:
> > Sathvika Vasireddy  writes:
> > Ok, if I've understood correctly...
> > 
> > > + ra = ra & ~0x3;
> > 
> > This masks off the bits of RA that are not part of BTF:
> > 
> > ra is in [0, 31] which is [0b00000, 0b11111]
> > Then ~0x3 = ~0b00011
> > ra = ra & 0b11100
> > 
> > This gives us then,
> > ra = btf << 2; or
> > btf = ra >> 2;
> 
> Yes.  In effect, you want the offset in bits of the CR field, which is
> just fine like this.  But a comment would not hurt.
> 
> > Let's then check to see if your calculations read the right fields.
> > 
> > > + if ((regs->ccr) & (1 << (31 - ra)))
> > > + op->val = -1;
> > > + else if ((regs->ccr) & (1 << (30 - ra)))
> > > + op->val = 1;
> > > + else
> > > + op->val = 0;
> 
> It imo is clearer if written
> 
> 	if ((regs->ccr << ra) & 0x80000000)
> 		op->val = -1;
> 	else if ((regs->ccr << ra) & 0x40000000)
> 		op->val = 1;
> 	else
> 		op->val = 0;
> 
> but I guess not everyone agrees :-)
> 

But this can be made jump free :-):

int tmp = regs->ccr << ra;
op->val = (tmp >> 31) | ((tmp >> 30) & 1);

(IIRC the srawi instruction sign-extends its result to 64 bits).



> > CR field:      7    6    5    4    3    2    1    0
> > bit:           0123 0123 0123 0123 0123 0123 0123 0123
> > normal bit #:  0 ................................. 31
> > ibm bit #:     31 ................................. 0
> 
> The bit numbers in CR fields are *always* numbered left-to-right.  I
> have never seen anyone use LE for it, anyway.
> 
> Also, even people who write LE have the bigger end on the left normally
> (they just write some things right-to-left, and other things
> left-to-right).

Around 1985, I had documentation for National's 32032
(little-endian) processor family, and all the instruction encodings were
presented with the LSB on the left and MSB on the right.

BTW on these processors, the immediate operands and the offsets (1, 2 or
4 bytes) for the addressing modes were encoded in big-endian byte order,
but I digress. Consistency is overrated ;-)

Gabriel


> 
> > Checkpatch does have one complaint:
> > 
> > CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'regs->ccr'
> > #30: FILE: arch/powerpc/lib/sstep.c:1971:
> > +   if ((regs->ccr) & (1 << (31 - ra)))
> > 
> > I don't really mind the parenteses: I think you are safe to ignore
> > checkpatch here unless someone else complains :)
> 
> I find them annoying.  If there are too many parentheses, it is hard to
> see at a glance what groups where.  Also, a suspicious reader might
> think there is something special going on (with macros for example).
> 
> This is simple code of course, but :-)
> 
> > If you do end up respinning the patch, I think it would be good to make
> > the maths a bit clearer. I think it works because a left shift of 2 is
> > the same as multiplying by 4, but it would be easier to follow if you
> > used a temporary variable for btf.
> 
> It is very simple.  The BFA instruction field is closely related to the
> BI instruction field, which is 5 bits, and selects one of the 32 bits in
> the CR.  If you have "BFA00 BFA01 BFA10 BFA11", that gives the bit
> numbers of all four bits in the selected CR field.  So the "& ~3" does
> all you need.  It is quite pretty :-)
> 
> 
> Segher
 



Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-22 Thread Segher Boessenkool
Hi!

On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote:
> Sathvika Vasireddy  writes:
> Ok, if I've understood correctly...
> 
> > +   ra = ra & ~0x3;
> 
> This masks off the bits of RA that are not part of BTF:
> 
> ra is in [0, 31] which is [0b00000, 0b11111]
> Then ~0x3 = ~0b00011
> ra = ra & 0b11100
> 
> This gives us then,
> ra = btf << 2; or
> btf = ra >> 2;

Yes.  In effect, you want the offset in bits of the CR field, which is
just fine like this.  But a comment would not hurt.

> Let's then check to see if your calculations read the right fields.
> 
> > +   if ((regs->ccr) & (1 << (31 - ra)))
> > +   op->val = -1;
> > +   else if ((regs->ccr) & (1 << (30 - ra)))
> > +   op->val = 1;
> > +   else
> > +   op->val = 0;

It imo is clearer if written

	if ((regs->ccr << ra) & 0x80000000)
		op->val = -1;
	else if ((regs->ccr << ra) & 0x40000000)
		op->val = 1;
	else
		op->val = 0;

but I guess not everyone agrees :-)

> CR field:      7    6    5    4    3    2    1    0
> bit:           0123 0123 0123 0123 0123 0123 0123 0123
> normal bit #:  0 ................................. 31
> ibm bit #:     31 ................................. 0

The bit numbers in CR fields are *always* numbered left-to-right.  I
have never seen anyone use LE for it, anyway.

Also, even people who write LE have the bigger end on the left normally
(they just write some things right-to-left, and other things
left-to-right).

> Checkpatch does have one complaint:
> 
> CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'regs->ccr'
> #30: FILE: arch/powerpc/lib/sstep.c:1971:
> + if ((regs->ccr) & (1 << (31 - ra)))
> 
> I don't really mind the parenteses: I think you are safe to ignore
> checkpatch here unless someone else complains :)

I find them annoying.  If there are too many parentheses, it is hard to
see at a glance what groups where.  Also, a suspicious reader might
think there is something special going on (with macros for example).

This is simple code of course, but :-)

> If you do end up respinning the patch, I think it would be good to make
> the maths a bit clearer. I think it works because a left shift of 2 is
> the same as multiplying by 4, but it would be easier to follow if you
> used a temporary variable for btf.

It is very simple.  The BFA instruction field is closely related to the
BI instruction field, which is 5 bits, and selects one of the 32 bits in
the CR.  If you have "BFA00 BFA01 BFA10 BFA11", that gives the bit
numbers of all four bits in the selected CR field.  So the "& ~3" does
all you need.  It is quite pretty :-)


Segher


Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-22 Thread Vaidyanathan Srinivasan
* Gautham R Shenoy  [2021-04-22 20:37:29]:

> From: "Gautham R. Shenoy" 
> 
> Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform
> 
> On some of the POWER9 LPARs, the older firmwares advertise a very low
> value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> measured value is 5us on an average. Due to the low advertised exit
> latency, we are entering CEDE(0) more aggressively on such
> platforms. While this helps achieve SMT folding faster, we pay the
> penalty of having to send an IPI to wakeup the CPU when the target
> residency is very short. This is showing up as a performance
> regression on the newer kernels running on the LPARs with older
> firmware.
> 
> Hence, set the exit latency of CEDE(0) based on the latency values
> advertised by the platform only from POWER10 onwards. The values
> advertised on POWER10 platforms are more realistic and informed by
> latency measurements.
> 
> For platforms older than POWER10, retain the hardcoded value of exit
> latency, which is 10us. Though this is higher than the measured
> values, we would be erring on the side of caution.
> 
> Reported-by: Enrico Joedecke 
> Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)")
> Signed-off-by: Gautham R. Shenoy 

Reviewed-by: Vaidyanathan Srinivasan 

> ---
>  drivers/cpuidle/cpuidle-pseries.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpuidle/cpuidle-pseries.c 
> b/drivers/cpuidle/cpuidle-pseries.c
> index a2b5c6f..7207467 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -419,7 +419,8 @@ static int pseries_idle_probe(void)
>   cpuidle_state_table = shared_states;
>   max_idle_state = ARRAY_SIZE(shared_states);
>   } else {
> - fixup_cede0_latency();
> + if (pvr_version_is(PVR_POWER10))
> + fixup_cede0_latency();
>   cpuidle_state_table = dedicated_states;
>   max_idle_state = NR_DEDICATED_STATES;
>   }

Thanks for the fix.  We should have such safeguards or fallbacks while
running on older platforms.  This fix is needed because of the
unfortunate regression on some older platforms that we failed to
notice while designing and testing the original feature.

--Vaidy



Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Lijun Pan
On Thu, Apr 22, 2021 at 12:21 PM Michal Suchánek  wrote:
>
> Hello,
>
> On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
> >  wrote:
> > >
> > > Lijun Pan [l...@linux.vnet.ibm.com] wrote:
> > > >
> > > >
> > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden  wrote:
> > > > >
> > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If 
> > > > > this
> > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > > ibmvnic abandons the reset because of this failed set link down and 
> > > > > this
> > > > > is the last reset in the workqueue, then this adapter will be left in 
> > > > > an
> > > > > inoperable state.
> > > > >
> > > > > Instead, make the driver ignore this link down failure and continue to
> > > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > > recover.
> > > >
> > > > This v2 does not address the concerns mentioned in v1.
> > > > And I think it is better to exit with error from do_reset, and schedule 
> > > > a thorough
> > > > do_hard_reset if the adapter is already in an unstable state.
> > >
> > > We had a FATAL error and when handling it, we failed to send a
> > > link-down message to the VIOS. So what we need to try next is to
> > > reset the connection with the VIOS. For this we must talk to the
> > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > > does just that in ibmvnic_reset_crq().
> > >
> > > Now, sure we can attempt a "thorough hard reset" which also does
> > > the same hcalls to reestablish the connection. Is there any
> > > other magic in do_hard_reset()? But in addition, it also frees lot
> > > more Linux kernel buffers and reallocates them for instance.
> >
> > Working around everything in do_reset will make the code very difficult
> > to manage. Ultimately do_reset can do anything I am afraid, and 
> > do_hard_reset
> > can be removed completely or merged into do_reset.

Michal,

I should have given more details about the above statement. Thanks for
your detailed info in the below.

>
> This debate is not very constructive.
>
> In the context of driver that has separate do_reset and do_hard_reset
> this fix picks the correct one unless you can refute the arguments
> provided.
>
> Merging do_reset and do_hard_reset might be a good code cleanup which is
> out of the scope of this fix.

Right.

>
>
>
> Given that vast majority of fixes to the vnic driver are related to the
> reset handling it would improve stability and testability if every
> reset took the same code path.

I agree.

>
> In the context of merging do_hard_reset and do_reset the question is
> what is the intended distinction and performance gain by having
> 'lightweight' reset.

Right.

>
> I don't have a vnic protocol manual at hand and I suspect I would not
> get one even if I searched for one.
>
> From reading through the fixes in the past my understanding is that the
> full reset is required when the backend changes which then potentially
> requires different size/number of buffers.

Yes, full reset is better when the backend changes.

>
> What is the expected situation when reset is required without changing
> the backend?
>
> Is this so common that it warrants a separate 'lightweight' optimized
> function?
>


Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Michal Suchánek
Hello,

On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote:
> On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu
>  wrote:
> >
> > Lijun Pan [l...@linux.vnet.ibm.com] wrote:
> > >
> > >
> > > > On Apr 20, 2021, at 4:35 PM, Dany Madden  wrote:
> > > >
> > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks
> > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this
> > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is
> > > > still inactive, ibmvnic's attempt to set link down will also fail. If
> > > > ibmvnic abandons the reset because of this failed set link down and this
> > > > is the last reset in the workqueue, then this adapter will be left in an
> > > > inoperable state.
> > > >
> > > > Instead, make the driver ignore this link down failure and continue to
> > > > free and re-register CRQ so that the adapter has an opportunity to
> > > > recover.
> > >
> > > This v2 does not address the concerns mentioned in v1.
> > > And I think it is better to exit with error from do_reset, and schedule a 
> > > thorough
> > > do_hard_reset if the adapter is already in an unstable state.
> >
> > We had a FATAL error and when handling it, we failed to send a
> > link-down message to the VIOS. So what we need to try next is to
> > reset the connection with the VIOS. For this we must talk to the
> > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset()
> > does just that in ibmvnic_reset_crq().
> >
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult
> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.

This debate is not very constructive.

In the context of driver that has separate do_reset and do_hard_reset
this fix picks the correct one unless you can refute the arguments
provided.

Merging do_reset and do_hard_reset might be a good code cleanup which is
out of the scope of this fix.



Given that vast majority of fixes to the vnic driver are related to the
reset handling it would improve stability and testability if every
reset took the same code path.

In the context of merging do_hard_reset and do_reset the question is
what is the intended distinction and performance gain by having
'lightweight' reset.

I don't have a vnic protocol manual at hand and I suspect I would not
get one even if I searched for one.

From reading through the fixes in the past my understanding is that the
full reset is required when the backend changes which then potentially
requires different size/number of buffers.

What is the expected situation when reset is required without changing
the backend?

Is this so common that it warrants a separate 'lightweight' optimized
function?

Thanks

Michal


Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Lijun Pan
On Thu, Apr 22, 2021 at 2:07 AM Rick Lindsley
 wrote:
>
> On 4/21/21 10:30 PM, Lijun Pan wrote:
> >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> >> Signed-off-by: Dany Madden 
> >> Reviewed-by: Rick Lindsley 
> >> Reviewed-by: Sukadev Bhattiprolu 
> >
> > One thing I would like to point out, as already pointed out by Nathan Lynch,
> > is that those reviewed-by tags given by the same group of people from the
> > same company lose credibility over time if you never critique or ask
> > questions on the list.
> >
>
> Well, so far you aren't addressing either my critiques or questions.
>
> I have been asking questions but all I have from you are the above
> attempts to discredit the reputation of myself and other people, and
> non-technical statements like
>
>  will make the code very difficult to manage
>  I think there should be a trade off between optimization and stability.
>  So I don't think you could even compare the two results
>
> On the other hand, from the original submission I see some very specific
> details:
>
>  If ibmvnic abandons the reset because of this failed set link
>  down and this is the last reset in the workqueue, then this
>  adapter will be left in an inoperable state.
>
> and from a followup discussion:
>
>  We had a FATAL error and when handling it, we failed to
>  send a link-down message to the VIOS. So what we need
>  to try next is to reset the connection with the VIOS. For
>  this we must ...
>
> These are great technical points that could be argued or discussed.
> Problem is, I agree with them.
>
> I will ask again:  can you please supply some technical reasons for
> your objections.  Otherwise, your objections are meritless and at worst
> simply an ad hominem attack.

Well, from the beginning of v1, I started to provide technical inputs.
Then I was not
allowed to post anything in the community about this patch and VNIC
via l...@linux.ibm.com except giving an ack-by/reviewed-by.


Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-22 Thread Segher Boessenkool
On Thu, Apr 22, 2021 at 08:05:27AM +, David Laight wrote:
> > (Does anyone - and by anyone I mean any large distro - compile with
> > local variables inited by the compiler?)
> 
> There are compilers that initialise locals to zero for 'debug' builds
> and leave them 'random' for optimised 'release' builds.
> Let's not test what we are releasing!

Yeah, that's the worst of all possible worlds.

> I also think there is a new option to gcc (or clang?) to initialise
> on-stack structures and arrays to ensure garbage isn't passed.
> That seems to be a horrid performance hit!
> Especially in userspace where large stack allocations are almost free.
> 
> Any auto-initialise ought to be with a semi-random value
> (especially not zero) so that it is never right and doesn't
> lead to lazy coding.

Many compilers did something like this, decades ago -- for debug builds.


Segher


Re: [PATCH 2/2] powerpc: Print esr register when hitting Program Interrupt

2021-04-22 Thread Christophe Leroy




Le 22/04/2021 à 17:29, Christophe Leroy a écrit :



On 22/04/2021 at 17:10, Xiongwei Song wrote:

From: Xiongwei Song 

The esr register has the details of Program Interrupt on BookE/4xx cpus,
printing its value is helpful.

Signed-off-by: Xiongwei Song 
---
  arch/powerpc/kernel/process.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 5c3830837f3a..664aecf8ee2e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1459,6 +1459,7 @@ static bool interrupt_detail_printable(int trap)
  case INTERRUPT_MACHINE_CHECK:
  case INTERRUPT_DATA_STORAGE:
  case INTERRUPT_ALIGNMENT:
+    case INTERRUPT_PROGRAM:


With this, it will also print the DSISR on 8xx/6xx so it will print garbage.

8xx/6xx provide the information in SRR1. If you want to proceed, you have to do the same as in ISI: 
Copy the content of SRR1 into regs->dsisr in the assembly handler in head_book3s_32.S and in the 
instruction TLB error handler in head_8xx.S


In fact, we already have get_reason(), called by do_program_check(), to retrieve the reason for the 
program check exception. Maybe it could be used to print useful information instead of starting 
to do almost the same thing in another way.


Or we do as I suggested above, and we remove that get_reason() stuff. But get_reason() is also used 
by the alignment exception. So that doesn't look easy.


I don't know what the best approach is.





  return true;
  default:
  return false;



Re: [PATCH 2/2] powerpc: Print esr register when hitting Program Interrupt

2021-04-22 Thread Christophe Leroy




On 22/04/2021 at 17:10, Xiongwei Song wrote:

From: Xiongwei Song 

The esr register has the details of Program Interrupt on BookE/4xx cpus,
printing its value is helpful.

Signed-off-by: Xiongwei Song 
---
  arch/powerpc/kernel/process.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 5c3830837f3a..664aecf8ee2e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1459,6 +1459,7 @@ static bool interrupt_detail_printable(int trap)
case INTERRUPT_MACHINE_CHECK:
case INTERRUPT_DATA_STORAGE:
case INTERRUPT_ALIGNMENT:
+   case INTERRUPT_PROGRAM:


With this, it will also print the DSISR on 8xx/6xx so it will print garbage.

8xx/6xx provide the information in SRR1. If you want to proceed, you have to do the same as in ISI: 
Copy the content of SRR1 into regs->dsisr in the assembly handler in head_book3s_32.S and in the 
instruction TLB error handler in head_8xx.S



return true;
default:
return false;



Re: [PATCH 1/2] powerpc: Make the code in __show_regs nice-looking

2021-04-22 Thread Christophe Leroy




On 22/04/2021 at 17:10, Xiongwei Song wrote:

From: Xiongwei Song 

Create a new function named interrupt_detail_printable to judge which
interrupts can print esr/dsisr register.


What is the benefit of that function? It might be interesting if the test were done in several places, 
but as it is done in only one place, I don't think it is an improvement.


Until now, you knew immediately which traps would print it. Now you have to go and look 
into a sub-function.


And the name is not nice either. All other functions testing anything on the trap type are called 
trap_is_something()


Here your function would be better called something like trap_sets_dsisr()



Signed-off-by: Xiongwei Song 
---
  arch/powerpc/kernel/process.c | 21 -
  1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 89e34aa273e2..5c3830837f3a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1453,9 +1453,23 @@ static void print_msr_bits(unsigned long val)
  #define REGS_PER_LINE 8
  #endif
  
+static bool interrupt_detail_printable(int trap)

+{
+   switch (trap) {
+   case INTERRUPT_MACHINE_CHECK:
+   case INTERRUPT_DATA_STORAGE:
+   case INTERRUPT_ALIGNMENT:
+   return true;
+   default:
+   return false;
+   }
+
+   return false;


That's redundant with the default case inside the switch.


+}
+
  static void __show_regs(struct pt_regs *regs)
  {
-   int i, trap;
+   int i;
  
  	printk("NIP:  "REG" LR: "REG" CTR: "REG"\n",

   regs->nip, regs->link, regs->ctr);
@@ -1464,12 +1478,9 @@ static void __show_regs(struct pt_regs *regs)
printk("MSR:  "REG" ", regs->msr);
print_msr_bits(regs->msr);
pr_cont("  CR: %08lx  XER: %08lx\n", regs->ccr, regs->xer);
-   trap = TRAP(regs);
if (!trap_is_syscall(regs) && cpu_has_feature(CPU_FTR_CFAR))
pr_cont("CFAR: "REG" ", regs->orig_gpr3);
-   if (trap == INTERRUPT_MACHINE_CHECK ||
-   trap == INTERRUPT_DATA_STORAGE ||
-   trap == INTERRUPT_ALIGNMENT) {
+   if (interrupt_detail_printable(TRAP(regs))) {
if (IS_ENABLED(CONFIG_4xx) || IS_ENABLED(CONFIG_BOOKE))
pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, 
regs->dsisr);
else
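
For reference, a minimal sketch (not part of the submitted patch) of the helper along the lines 
suggested above, renamed to trap_sets_dsisr() and with the unreachable trailing return dropped, 
could look like this:

static bool trap_sets_dsisr(int trap)
{
	switch (trap) {
	case INTERRUPT_MACHINE_CHECK:
	case INTERRUPT_DATA_STORAGE:
	case INTERRUPT_ALIGNMENT:
		return true;	/* these traps set DAR/DSISR (DEAR/ESR on 4xx/BookE) */
	default:
		return false;
	}
}

__show_regs() would then simply test if (trap_sets_dsisr(TRAP(regs))) before printing DEAR/ESR
or DAR/DSISR.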



[PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-22 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
of the Extended CEDE states advertised by the platform

On some of the POWER9 LPARs, the older firmwares advertise a very low
value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
measured value is 5us on an average. Due to the low advertised exit
latency, we are entering CEDE(0) more aggressively on such
platforms. While this helps achieve SMT folding faster, we pay the
penalty of having to send an IPI to wakeup the CPU when the target
residency is very short. This is showing up as a performance
regression on the newer kernels running on the LPARs with older
firmware.

Hence, set the exit latency of CEDE(0) based on the latency values
advertised by the platform only from POWER10 onwards. The values
advertised on POWER10 platforms are more realistic and informed by the
latency measurements.

For platforms older than POWER10, retain the hardcoded value of exit
latency, which is 10us. Though this is higher than the measured
values, we would be erring on the side of caution.

Reported-by: Enrico Joedecke 
Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
CEDE(0)")
Signed-off-by: Gautham R. Shenoy 
---
 drivers/cpuidle/cpuidle-pseries.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/cpuidle/cpuidle-pseries.c 
b/drivers/cpuidle/cpuidle-pseries.c
index a2b5c6f..7207467 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -419,7 +419,8 @@ static int pseries_idle_probe(void)
cpuidle_state_table = shared_states;
max_idle_state = ARRAY_SIZE(shared_states);
} else {
-   fixup_cede0_latency();
+   if (pvr_version_is(PVR_POWER10))
+   fixup_cede0_latency();
cpuidle_state_table = dedicated_states;
max_idle_state = NR_DEDICATED_STATES;
}
-- 
1.9.4



[PATCH 2/2] powerpc: Print esr register when hitting Program Interrupt

2021-04-22 Thread Xiongwei Song
From: Xiongwei Song 

The esr register has the details of Program Interrupt on BookE/4xx cpus,
printing its value is helpful.

Signed-off-by: Xiongwei Song 
---
 arch/powerpc/kernel/process.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 5c3830837f3a..664aecf8ee2e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1459,6 +1459,7 @@ static bool interrupt_detail_printable(int trap)
case INTERRUPT_MACHINE_CHECK:
case INTERRUPT_DATA_STORAGE:
case INTERRUPT_ALIGNMENT:
+   case INTERRUPT_PROGRAM:
return true;
default:
return false;
-- 
2.17.1



[PATCH 1/2] powerpc: Make the code in __show_regs nice-looking

2021-04-22 Thread Xiongwei Song
From: Xiongwei Song 

Create a new function named interrupt_detail_printable to judge which
interrupts can print esr/dsisr register.

Signed-off-by: Xiongwei Song 
---
 arch/powerpc/kernel/process.c | 21 -
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 89e34aa273e2..5c3830837f3a 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1453,9 +1453,23 @@ static void print_msr_bits(unsigned long val)
 #define REGS_PER_LINE  8
 #endif
 
+static bool interrupt_detail_printable(int trap)
+{
+   switch (trap) {
+   case INTERRUPT_MACHINE_CHECK:
+   case INTERRUPT_DATA_STORAGE:
+   case INTERRUPT_ALIGNMENT:
+   return true;
+   default:
+   return false;
+   }
+
+   return false;
+}
+
 static void __show_regs(struct pt_regs *regs)
 {
-   int i, trap;
+   int i;
 
printk("NIP:  "REG" LR: "REG" CTR: "REG"\n",
   regs->nip, regs->link, regs->ctr);
@@ -1464,12 +1478,9 @@ static void __show_regs(struct pt_regs *regs)
printk("MSR:  "REG" ", regs->msr);
print_msr_bits(regs->msr);
pr_cont("  CR: %08lx  XER: %08lx\n", regs->ccr, regs->xer);
-   trap = TRAP(regs);
if (!trap_is_syscall(regs) && cpu_has_feature(CPU_FTR_CFAR))
pr_cont("CFAR: "REG" ", regs->orig_gpr3);
-   if (trap == INTERRUPT_MACHINE_CHECK ||
-   trap == INTERRUPT_DATA_STORAGE ||
-   trap == INTERRUPT_ALIGNMENT) {
+   if (interrupt_detail_printable(TRAP(regs))) {
if (IS_ENABLED(CONFIG_4xx) || IS_ENABLED(CONFIG_BOOKE))
pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, 
regs->dsisr);
else
-- 
2.17.1



Re: [PATCH v2] powerpc/kconfig: Restore alphabetic order of the selects under CONFIG_PPC

2021-04-22 Thread Michael Ellerman
Christophe Leroy  writes:
> Commit a7d2475af7ae ("powerpc: Sort the selects under CONFIG_PPC")
> sorted all selects under CONFIG_PPC.
>
> 4 years later, several items have been introduced at wrong place,
> a few other have been renamed without moving them to their correct
> place.
>
> Reorder them now.
>
> While we are at it, simplify the test for a couple of them:
> - PPC_64 && PPC_PSERIES is simplified in PPC_PSERIES
> - PPC_64 && PPC_BOOK3S is simplified in PPC_BOOK3S_64
>
> Signed-off-by: Christophe Leroy 
> ---
> v2: Rebased on d20f726744a0 ("Automatic merge of 'next' into merge 
> (2021-04-21 22:57)")

This will conflict badly with other things in linux-next if I merge it
now.

The best time to do this is just before rc1. I'll try to remember :)

cheers


Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-04-22 Thread Jason Gunthorpe
On Thu, Apr 22, 2021 at 11:49:31PM +1000, Michael Ellerman wrote:
> Alex Williamson  writes:
> > On Mon, 12 Apr 2021 19:41:41 +1000
> > Michael Ellerman  wrote:
> >
> >> Alex Williamson  writes:
> >> > On Fri, 26 Mar 2021 07:13:10 +0100
> >> > Christoph Hellwig  wrote:
> >> >  
> >> >> This driver never had any open userspace (which for VFIO would include
> >> >> VM kernel drivers) that use it, and thus should never have been added
> >> >> by our normal userspace ABI rules.
> >> >> 
> >> >> Signed-off-by: Christoph Hellwig 
> >> >> Acked-by: Greg Kroah-Hartman 
> >> >>  drivers/vfio/pci/Kconfig|   6 -
> >> >>  drivers/vfio/pci/Makefile   |   1 -
> >> >>  drivers/vfio/pci/vfio_pci.c |  18 -
> >> >>  drivers/vfio/pci/vfio_pci_nvlink2.c | 490 
> >> >>  drivers/vfio/pci/vfio_pci_private.h |  14 -
> >> >>  include/uapi/linux/vfio.h   |  38 +--
> >> >>  6 files changed, 4 insertions(+), 563 deletions(-)
> >> >>  delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c  
> >> >
> >> > Hearing no objections, applied to vfio next branch for v5.13.  Thanks,  
> >> 
> >> Looks like you only took patch 1?
> >> 
> >> I can't take patch 2 on its own, that would break the build.
> >> 
> >> Do you want to take both patches? There's currently no conflicts against
> >> my tree. It's possible one could appear before the v5.13 merge window,
> >> though it would probably just be something minor.
> >> 
> >> Or I could apply both patches to my tree, which means patch 1 would
> >> appear as two commits in the git history, but that's not a big deal.
> >
> > I've already got a conflict in my next branch with patch 1, so it's
> > best to go through my tree.  Seems like a shared branch would be
> > easiest to allow you to merge and manage potential conflicts against
> > patch 2, I've pushed a branch here:
> >
> > https://github.com/awilliam/linux-vfio.git v5.13/vfio/nvlink
> 
> Thanks.
> 
> My next is based on rc2, so I won't pull that in directly, because I
> don't want to pull all of rc6 in with it.

Linus is fine if you merge in rc's for development reasons. He doesn't
like it when people just merge rc's without a purpose.

Merge rc7 to your tree then pull the nvlink topic is acceptable.

Or just do nothing because Alex will send it through his tree - this
extra co-ordination is really only necessary if there are conflicts.

Jason


Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2

2021-04-22 Thread Michael Ellerman
Alex Williamson  writes:
> On Mon, 12 Apr 2021 19:41:41 +1000
> Michael Ellerman  wrote:
>
>> Alex Williamson  writes:
>> > On Fri, 26 Mar 2021 07:13:10 +0100
>> > Christoph Hellwig  wrote:
>> >  
>> >> This driver never had any open userspace (which for VFIO would include
>> >> VM kernel drivers) that use it, and thus should never have been added
>> >> by our normal userspace ABI rules.
>> >> 
>> >> Signed-off-by: Christoph Hellwig 
>> >> Acked-by: Greg Kroah-Hartman 
>> >> ---
>> >>  drivers/vfio/pci/Kconfig|   6 -
>> >>  drivers/vfio/pci/Makefile   |   1 -
>> >>  drivers/vfio/pci/vfio_pci.c |  18 -
>> >>  drivers/vfio/pci/vfio_pci_nvlink2.c | 490 
>> >>  drivers/vfio/pci/vfio_pci_private.h |  14 -
>> >>  include/uapi/linux/vfio.h   |  38 +--
>> >>  6 files changed, 4 insertions(+), 563 deletions(-)
>> >>  delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c  
>> >
>> > Hearing no objections, applied to vfio next branch for v5.13.  Thanks,  
>> 
>> Looks like you only took patch 1?
>> 
>> I can't take patch 2 on its own, that would break the build.
>> 
>> Do you want to take both patches? There's currently no conflicts against
>> my tree. It's possible one could appear before the v5.13 merge window,
>> though it would probably just be something minor.
>> 
>> Or I could apply both patches to my tree, which means patch 1 would
>> appear as two commits in the git history, but that's not a big deal.
>
> I've already got a conflict in my next branch with patch 1, so it's
> best to go through my tree.  Seems like a shared branch would be
> easiest to allow you to merge and manage potential conflicts against
> patch 2, I've pushed a branch here:
>
> https://github.com/awilliam/linux-vfio.git v5.13/vfio/nvlink

Thanks.

My next is based on rc2, so I won't pull that in directly, because I
don't want to pull all of rc6 in with it.

I'll put it in a topic branch and merge it into my next after my first
pull has gone to Linus.

cheers


Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-22 Thread Naveen N. Rao

Michael Ellerman wrote:

"Naveen N. Rao"  writes:

Daniel Axtens wrote:

Sathvika Vasireddy  writes:


This adds emulation support for the following instruction:
   * Set Boolean (setb)

Signed-off-by: Sathvika Vasireddy 

...


If you do end up respinning the patch, I think it would be good to make
the maths a bit clearer. I think it works because a left shift of 2 is
the same as multiplying by 4, but it would be easier to follow if you
used a temporary variable for btf.


Indeed. I wonder if it is better to follow the ISA itself. Per the ISA, 
the bit we are interested in is:

4 x BFA + 32

So, if we use that along with the PPC_BIT() macro, we get:
if (regs->ccr & PPC_BIT(ra + 32))


Use of PPC_BIT risks annoying your maintainer :)


Uh oh... that isn't good :)

I looked up previous discussions and I think I now understand why you 
don't prefer it.


But, I feel it helps make it easy to follow the code when referring to 
the ISA. I'm wondering if it is just the name you dislike and if so, 
does it make sense to rename PPC_BIT() to something else? We have 
BIT_ULL(), so perhaps BIT_MSB_ULL() or MSB_BIT_ULL()?



- Naveen
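
As a worked example of the "4 x BFA + 32" arithmetic discussed above (a sketch only, assuming 
PPC_BIT(n) expands to 1UL << (63 - n) and that the CR value sits in the low 32 bits of regs->ccr): 
for BFA = 2, the LT bit of CR field 2 is ISA bit 4 * 2 + 32 = 40, and PPC_BIT(40) == 1UL << 23, 
which is the same bit that the 32-bit mask 0x80000000 >> 8 selects. So the two expressions in the 
sketch below test the same bit:

/* Sketch only: both forms test the LT bit of CR field 'bfa'. */
static bool cr_field_lt_is_set(unsigned long ccr, unsigned int bfa)
{
	unsigned int btf = bfa << 2;		/* same thing as 4 * BFA */

	return (ccr & PPC_BIT(btf + 32)) != 0;	/* equivalently: ccr & (0x80000000UL >> btf) */
}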



[PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-22 Thread Alexey Kardashevskiy
The $(CPP) (do only preprocessing) macro is already defined in Makefile.
However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results
in flags duplication. Which is not a big deal by itself except for
the flags which depend on other flags and the compiler checks them
as it parses the command line.

Specifically, scripts/Makefile.build:304 generates ksyms for .S files.
If clang+llvm+sanitizer are enabled, this results in
-fno-lto -flto -fsanitize=cfi-mfcall   -fno-lto -flto -fsanitize=cfi-mfcall
in the clang command line and triggers error:

clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with 
'-flto'

This removes the unnecessary CPP redefinition.

Signed-off-by: Alexey Kardashevskiy 
---
 arch/powerpc/Makefile | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index c9d2c7825cd6..3a2f2001c62b 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
 KBUILD_AFLAGS  += $(AFLAGS-y)
 KBUILD_CFLAGS  += $(call cc-option,-msoft-float)
 KBUILD_CFLAGS  += -pipe $(CFLAGS-y)
-CPP= $(CC) -E $(KBUILD_CFLAGS)
 
 CHECKFLAGS += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
 ifdef CONFIG_CPU_BIG_ENDIAN
-- 
2.25.1



Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-22 Thread Dan Carpenter
On Thu, Apr 22, 2021 at 08:05:27AM +, David Laight wrote:
> From: Daniel Axtens
> > Sent: 22 April 2021 03:21
> > 
> > > Hi Lakshmi,
> > >
> > >> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote:
> > >>
> > >> Sorry - missed copying device-tree and powerpc mailing lists.
> > >>
> > >>> There are a few "goto out;" statements before the local variable "fdt"
> > >>> is initialized through the call to of_kexec_alloc_and_setup_fdt() in
> > >>> elf64_load(). This will result in an uninitialized "fdt" being passed
> > >>> to kvfree() in this function if there is an error before the call to
> > >>> of_kexec_alloc_and_setup_fdt().
> > >>>
> > >>> Initialize the local variable "fdt" to NULL.
> > >>>
> > > I'm a huge fan of initialising local variables! But I'm struggling to
> > > find the code path that will lead to an uninit fdt being returned...
> > 
> > OK, so perhaps this was putting it too strongly. I have been bitten
> > by uninitialised things enough in C that I may have taken a slightly
> > overly-agressive view of fixing them in the source rather than the
> > compiler. I do think compiler-level mitigations are better, and I take
> > the point that we don't want to defeat compiler checking.
> > 
> > (Does anyone - and by anyone I mean any large distro - compile with
> > local variables inited by the compiler?)
> 
> There are compilers that initialise locals to zero for 'debug' builds
> and leave the 'random' for optimised 'release' builds.
> Lets not test what we are releasing!

We're eventually going to move to a world where initializing to zero
is the default for the kernel.  I think people will still want to
initialize to a poison value for debug builds.

Initializing to zero is better for debugging because it's more
predictable.  And it avoids information leaks.  And dereferencing random
uninitialized pointers is a privilege escalation, but dereferencing a
NULL is just an Oops.

The speed impact is not very significant because (conceptually) it only
needs to be done where there is a compiler warning about uninitialized
variables.  It's slightly more complicated in real life.  In this case,
the compiler doesn't know what happens inside the kexec_build_elf_info()
function so it silences the warning.  And GCC silences warnings if the
variable is initialized inside a loop even when it doesn't know that we
enter the loop.

regards,
dan carpenter
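
To make the pattern being discussed concrete, here is a minimal sketch (hypothetical helpers, not
the actual elf64_load() code; whether such an early-exit path exists in the current elf64_load() is
exactly what was being questioned above) of why the explicit NULL initialization matters when a
goto can reach the cleanup label before fdt is assigned:

#include <stdlib.h>

/* Hypothetical stand-ins for the kernel helpers, for illustration only. */
static int build_elf_info(void) { return -1; }		/* pretend this step fails */
static void *alloc_and_setup_fdt(void) { return malloc(64); }

static int example_load(void)
{
	void *fdt = NULL;	/* the proposed fix: kvfree(NULL), like free(NULL), is a no-op */
	int ret;

	ret = build_elf_info();
	if (ret)
		goto out;	/* without "= NULL", fdt would reach "out:" uninitialized */

	fdt = alloc_and_setup_fdt();
	if (!fdt) {
		ret = -1;
		goto out;
	}
	ret = 0;
out:
	free(fdt);		/* kvfree(fdt) in the kernel; safe once fdt starts out as NULL */
	return ret;
}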



Re: [PATCH v4 00/14] Restricted DMA

2021-04-22 Thread Claire Chang
v5 here: https://lore.kernel.org/patchwork/cover/1416899/ to rebase
onto Christoph's swiotlb cleanups.


[PATCH v5 16/16] of: Add plumbing for restricted DMA pool

2021-04-22 Thread Claire Chang
If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is presented.

Signed-off-by: Claire Chang 
---
 drivers/of/address.c| 25 +
 drivers/of/device.c |  3 +++
 drivers/of/of_private.h |  5 +
 3 files changed, 33 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 54f221dde267..fff3adfe4986 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np)
 }
 EXPORT_SYMBOL_GPL(of_dma_is_coherent);
 
+int of_dma_set_restricted_buffer(struct device *dev)
+{
+   struct device_node *node;
+   int count, i;
+
+   if (!dev->of_node)
+   return 0;
+
+   count = of_property_count_elems_of_size(dev->of_node, "memory-region",
+   sizeof(phandle));
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(dev->of_node, "memory-region", i);
+   /* There might be multiple memory regions, but only one
+* restriced-dma-pool region is allowed.
+*/
+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+   return of_reserved_mem_device_init_by_idx(
+   dev, dev->of_node, i);
+   }
+
+   return 0;
+}
+
 /**
  * of_mmio_is_nonposted - Check if device uses non-posted MMIO
  * @np:device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..d8d865223e51 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
 
arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
 
+   if (!iommu)
+   return of_dma_set_restricted_buffer(dev);
+
return 0;
 }
 EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..e9237f5eff48 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,17 @@ struct bus_dma_region;
 #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
 int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev);
 #else
 static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
 {
return -ENODEV;
 }
+static inline int of_dma_get_restricted_buffer(struct device *dev)
+{
+   return -ENODEV;
+}
 #endif
 
 #endif /* _LINUX_OF_PRIVATE_H */
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 15/16] dt-bindings: of: Add restricted DMA pool

2021-04-22 Thread Claire Chang
Introduce the new compatible string, restricted-dma-pool, for restricted
DMA. One can specify the address and length of the restricted DMA memory
region by restricted-dma-pool in the reserved-memory node.

Signed-off-by: Claire Chang 
---
 .../reserved-memory/reserved-memory.txt   | 24 +++
 1 file changed, 24 insertions(+)

diff --git 
a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt 
b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
index e8d3096d922c..fc9a12c2f679 100644
--- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
+++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
@@ -51,6 +51,20 @@ compatible (optional) - standard definition
   used as a shared pool of DMA buffers for a set of devices. It can
   be used by an operating system to instantiate the necessary pool
   management subsystem if necessary.
+- restricted-dma-pool: This indicates a region of memory meant to be
+  used as a pool of restricted DMA buffers for a set of devices. The
+  memory region would be the only region accessible to those devices.
+  When using this, the no-map and reusable properties must not be set,
+  so the operating system can create a virtual mapping that will be 
used
+  for synchronization. The main purpose for restricted DMA is to
+  mitigate the lack of DMA access control on systems without an IOMMU,
+  which could result in the DMA accessing the system memory at
+  unexpected times and/or unexpected addresses, possibly leading to 
data
+  leakage or corruption. The feature on its own provides a basic level
+  of protection against the DMA overwriting buffer contents at
+  unexpected times. However, to protect against general data leakage 
and
+  system memory corruption, the system needs to provide way to lock 
down
+  the memory access, e.g., MPU.
- vendor specific string in the form <vendor>,[<device>-]<usage>
 no-map (optional) - empty property
 - Indicates the operating system must not create a virtual mapping
@@ -120,6 +134,11 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
compatible = "acme,multimedia-memory";
reg = <0x7700 0x400>;
};
+
+   restricted_dma_mem_reserved: restricted_dma_mem_reserved {
+   compatible = "restricted-dma-pool";
+   reg = <0x5000 0x40>;
+   };
};
 
/* ... */
@@ -138,4 +157,9 @@ one for multimedia processing (named 
multimedia-memory@7700, 64MiB).
	memory-region = <&multimedia_reserved>;
/* ... */
};
+
+   pcie_device: pcie_device@0,0 {
+   memory-region = <&restricted_dma_mem_reserved>;
+   /* ... */
+   };
 };
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available

2021-04-22 Thread Claire Chang
The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Signed-off-by: Claire Chang 
---
 kernel/dma/direct.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a27f0510fcc..29523d2a9845 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
 static void __dma_direct_free_pages(struct device *dev, struct page *page,
size_t size)
 {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (swiotlb_free(dev, page, size))
+   return;
+#endif
dma_free_contiguous(dev, page, size);
 }
 
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device 
*dev, size_t size,
 
gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
   &phys_limit);
-   page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   page = NULL;
+   }
+#endif
+
+   if (!page)
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
 
/*
@@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
!gfpflags_allow_blocking(gfp) &&
(force_dma_unencrypted(dev) ||
-(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && 
!dev_is_dma_coherent(dev))))
+(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+ !dev_is_dma_coherent(dev))) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
 
/* we always manually zero the memory once we are done */
@@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size,
unsigned int page_order = get_order(size);
 
if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
/* cpu_addr is a struct page cookie, not a kernel address */
dma_free_contiguous(dev, cpu_addr, size);
return;
}
 
if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev)) {
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev)) {
arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
return;
}
@@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
void *ret;
 
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
-   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
 
page = __dma_direct_alloc_pages(dev, size, gfp);
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 13/16] swiotlb: Add restricted DMA alloc/free support.

2021-04-22 Thread Claire Chang
Add the functions, swiotlb_{alloc,free} to support the memory allocation
from restricted DMA pool.

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h |  4 
 kernel/dma/swiotlb.c| 35 +--
 2 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 0c5a18d9cf89..e8cf49bd90c5 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -134,6 +134,10 @@ unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
 bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size);
+bool swiotlb_free(struct device *dev, struct page *page, size_t size);
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
 static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index af0feb8eaead..274272c79080 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -454,8 +454,9 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
 
index = wrap = wrap_index(mem, ALIGN(mem->index, stride));
do {
-   if ((slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
-   (orig_addr & iotlb_align_mask)) {
+   if (orig_addr &&
+   (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) !=
+   (orig_addr & iotlb_align_mask)) {
index = wrap_index(mem, index + 1);
continue;
}
@@ -695,6 +696,36 @@ late_initcall(swiotlb_create_default_debugfs);
 #endif
 
 #ifdef CONFIG_DMA_RESTRICTED_POOL
+struct page *swiotlb_alloc(struct device *dev, size_t size)
+{
+   struct io_tlb_mem *mem = dev->dma_io_tlb_mem;
+   phys_addr_t tlb_addr;
+   int index;
+
+   if (!mem)
+   return NULL;
+
+   index = find_slots(dev, 0, size);
+   if (index == -1)
+   return NULL;
+
+   tlb_addr = slot_addr(mem->start, index);
+
+   return pfn_to_page(PFN_DOWN(tlb_addr));
+}
+
+bool swiotlb_free(struct device *dev, struct page *page, size_t size)
+{
+   phys_addr_t tlb_addr = page_to_phys(page);
+
+   if (!is_swiotlb_buffer(dev, tlb_addr))
+   return false;
+
+   release_slots(dev, tlb_addr);
+
+   return true;
+}
+
 static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
struct device *dev)
 {
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 12/16] dma-direct: Add a new wrapper __dma_direct_free_pages()

2021-04-22 Thread Claire Chang
Add a new wrapper __dma_direct_free_pages() that will be useful later
for swiotlb_free().

Signed-off-by: Claire Chang 
---
 kernel/dma/direct.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a88c34d0867..7a27f0510fcc 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -75,6 +75,12 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t 
phys, size_t size)
min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit);
 }
 
+static void __dma_direct_free_pages(struct device *dev, struct page *page,
+   size_t size)
+{
+   dma_free_contiguous(dev, page, size);
+}
+
 static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
gfp_t gfp)
 {
@@ -237,7 +243,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
return NULL;
}
 out_free_pages:
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
return NULL;
 }
 
@@ -273,7 +279,7 @@ void dma_direct_free(struct device *dev, size_t size,
else if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED))
arch_dma_clear_uncached(cpu_addr, size);
 
-   dma_free_contiguous(dev, dma_direct_to_page(dev, dma_addr), size);
+   __dma_direct_free_pages(dev, dma_direct_to_page(dev, dma_addr), size);
 }
 
 struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
@@ -310,7 +316,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, 
size_t size,
*dma_handle = phys_to_dma_direct(dev, page_to_phys(page));
return page;
 out_free_pages:
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
return NULL;
 }
 
@@ -329,7 +335,7 @@ void dma_direct_free_pages(struct device *dev, size_t size,
if (force_dma_unencrypted(dev))
set_memory_encrypted((unsigned long)vaddr, 1 << page_order);
 
-   dma_free_contiguous(dev, page, size);
+   __dma_direct_free_pages(dev, page, size);
 }
 
 #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 11/16] swiotlb: Refactor swiotlb_tbl_unmap_single

2021-04-22 Thread Claire Chang
Add a new function, release_slots, to make the code reusable for supporting
different bounce buffer pools, e.g. restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 35 ---
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index b7d634d7a7eb..af0feb8eaead 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -547,27 +547,15 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
return tlb_addr;
 }
 
-/*
- * tlb_addr is the physical address of the bounce buffer to unmap.
- */
-void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr,
- size_t mapping_size, enum dma_data_direction dir,
- unsigned long attrs)
+static void release_slots(struct device *dev, phys_addr_t tlb_addr)
 {
-   struct io_tlb_mem *mem = get_io_tlb_mem(hwdev);
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned long flags;
-   unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
+   unsigned int offset = swiotlb_align_offset(dev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
int nslots = nr_slots(mem->slots[index].alloc_size + offset);
int count, i;
 
-   /*
-* First, sync the memory before unmapping the entry
-*/
-   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
-   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
-   swiotlb_bounce(hwdev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
-
/*
 * Return the buffer to the free list by setting the corresponding
 * entries to indicate the number of contiguous entries available.
@@ -602,6 +590,23 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
	spin_unlock_irqrestore(&mem->lock, flags);
 }
 
+/*
+ * tlb_addr is the physical address of the bounce buffer to unmap.
+ */
+void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr,
+ size_t mapping_size, enum dma_data_direction dir,
+ unsigned long attrs)
+{
+   /*
+* First, sync the memory before unmapping the entry
+*/
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
+   (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL))
+   swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE);
+
+   release_slots(dev, tlb_addr);
+}
+
 void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr,
size_t size, enum dma_data_direction dir)
 {
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 10/16] swiotlb: Move alloc_size to find_slots

2021-04-22 Thread Claire Chang
Move the maintenance of alloc_size to find_slots for better code
reusability later.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 96ff36d8ec53..b7d634d7a7eb 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -479,8 +479,11 @@ static int find_slots(struct device *dev, phys_addr_t 
orig_addr,
return -1;
 
 found:
-   for (i = index; i < index + nslots; i++)
+   for (i = index; i < index + nslots; i++) {
mem->slots[i].list = 0;
+   mem->slots[i].alloc_size =
+   alloc_size - ((i - index) << IO_TLB_SHIFT);
+   }
for (i = index - 1;
 io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 &&
 mem->slots[i].list; i--)
@@ -535,11 +538,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
 * This is needed when we sync the memory.  Then we sync the buffer if
 * needed.
 */
-   for (i = 0; i < nr_slots(alloc_size + offset); i++) {
+   for (i = 0; i < nr_slots(alloc_size + offset); i++)
mem->slots[index + i].orig_addr = slot_addr(orig_addr, i);
-   mem->slots[index + i].alloc_size =
-   alloc_size - (i << IO_TLB_SHIFT);
-   }
tlb_addr = slot_addr(mem->start, index) + offset;
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) &&
(dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL))
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 09/16] swiotlb: Bounce data from/to restricted DMA pool if available

2021-04-22 Thread Claire Chang
Regardless of swiotlb setting, the restricted DMA pool is preferred if
available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Note that is_dev_swiotlb_force doesn't check if
swiotlb_force == SWIOTLB_FORCE. Otherwise the memory allocation behavior
with default swiotlb will be changed by the following patch
("dma-direct: Allocate memory from restricted DMA pool if available").

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h | 13 +
 kernel/dma/direct.h |  3 ++-
 kernel/dma/swiotlb.c|  8 
 3 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index c530c976d18b..0c5a18d9cf89 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -120,6 +120,15 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
return mem && paddr >= mem->start && paddr < mem->end;
 }
 
+static inline bool is_dev_swiotlb_force(struct device *dev)
+{
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (dev->dma_io_tlb_mem)
+   return true;
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+   return false;
+}
+
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
@@ -131,6 +140,10 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 {
return false;
 }
+static inline bool is_dev_swiotlb_force(struct device *dev)
+{
+   return false;
+}
 static inline void swiotlb_exit(void)
 {
 }
diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h
index 13e9e7158d94..f94813674e23 100644
--- a/kernel/dma/direct.h
+++ b/kernel/dma/direct.h
@@ -87,7 +87,8 @@ static inline dma_addr_t dma_direct_map_page(struct device 
*dev,
phys_addr_t phys = page_to_phys(page) + offset;
dma_addr_t dma_addr = phys_to_dma(dev, phys);
 
-   if (unlikely(swiotlb_force == SWIOTLB_FORCE))
+   if (unlikely(swiotlb_force == SWIOTLB_FORCE) ||
+   is_dev_swiotlb_force(dev))
return swiotlb_map(dev, phys, size, dir, attrs);
 
if (unlikely(!dma_capable(dev, dma_addr, size, true))) {
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 1d221343f1c8..96ff36d8ec53 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -344,7 +344,7 @@ void __init swiotlb_exit(void)
 static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t 
size,
   enum dma_data_direction dir)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT;
phys_addr_t orig_addr = mem->slots[index].orig_addr;
size_t alloc_size = mem->slots[index].alloc_size;
@@ -426,7 +426,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, 
unsigned int index)
 static int find_slots(struct device *dev, phys_addr_t orig_addr,
size_t alloc_size)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned long boundary_mask = dma_get_seg_boundary(dev);
dma_addr_t tbl_dma_addr =
phys_to_dma_unencrypted(dev, mem->start) & boundary_mask;
@@ -503,7 +503,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
size_t mapping_size, size_t alloc_size,
enum dma_data_direction dir, unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
unsigned int i;
int index;
@@ -554,7 +554,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, 
phys_addr_t tlb_addr,
  size_t mapping_size, enum dma_data_direction dir,
  unsigned long attrs)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(hwdev);
unsigned long flags;
unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr);
int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT;
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-04-22 Thread Claire Chang
Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
---
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
 drivers/pci/xen-pcifront.c   | 2 +-
 include/linux/swiotlb.h  | 4 ++--
 kernel/dma/direct.c  | 2 +-
 kernel/dma/swiotlb.c | 4 ++--
 6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
 
max_order = MAX_ORDER;
 #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(NULL)) {
unsigned int max_segment;
 
max_segment = swiotlb_max_segment();
diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c 
b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e8b506a6685b..2a2ae6d6cf6d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
 
 #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)
-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(NULL);
 #endif
 
ret = ttm_device_init(>ttm.bdev, _bo_driver, drm->dev->dev,
diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct 
pcifront_device *pdev)
 
spin_unlock(_dev_lock);
 
-   if (!err && !is_swiotlb_active()) {
+   if (!err && !is_swiotlb_active(NULL)) {
err = pci_xen_swiotlb_init_late();
if (err)
   dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, 
phys_addr_t paddr)
 void __init swiotlb_exit(void);
 unsigned int swiotlb_max_segment(void);
 size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device 
*dev)
return SIZE_MAX;
 }
 
-static inline bool is_swiotlb_active(void)
+static inline bool is_swiotlb_active(struct device *dev)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
 size_t dma_direct_max_mapping_size(struct device *dev)
 {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))
return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ffbb8724e06c..1d221343f1c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
 }
 
-bool is_swiotlb_active(void)
+bool is_swiotlb_active(struct device *dev)
 {
-   return io_tlb_default_mem != NULL;
+   return get_io_tlb_mem(dev) != NULL;
 }
 EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 07/16] swiotlb: Update is_swiotlb_buffer to add a struct device argument

2021-04-22 Thread Claire Chang
Update is_swiotlb_buffer to add a struct device argument. This will be
useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
---
 drivers/iommu/dma-iommu.c | 12 ++--
 drivers/xen/swiotlb-xen.c |  2 +-
 include/linux/swiotlb.h   |  6 +++---
 kernel/dma/direct.c   |  6 +++---
 kernel/dma/direct.h   |  6 +++---
 5 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 7bcdd1205535..a5df35bfd150 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, 
dma_addr_t dma_addr,
 
__iommu_dma_unmap(dev, dma_addr, size);
 
-   if (unlikely(is_swiotlb_buffer(phys)))
+   if (unlikely(is_swiotlb_buffer(dev, phys)))
swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs);
 }
 
@@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device 
*dev, phys_addr_t phys,
}
 
iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask);
-   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys))
+   if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys))
swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs);
return iova;
 }
@@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device 
*dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(phys, size, dir);
 
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_cpu(dev, phys, size, dir);
 }
 
@@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device 
*dev,
return;
 
phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle);
-   if (is_swiotlb_buffer(phys))
+   if (is_swiotlb_buffer(dev, phys))
swiotlb_sync_single_for_device(dev, phys, size, dir);
 
if (!dev_is_dma_coherent(dev))
@@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev,
if (!dev_is_dma_coherent(dev))
arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir);
 
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_cpu(dev, sg_phys(sg),
sg->length, dir);
}
@@ -832,7 +832,7 @@ static void iommu_dma_sync_sg_for_device(struct device *dev,
return;
 
for_each_sg(sgl, sg, nelems, i) {
-   if (is_swiotlb_buffer(sg_phys(sg)))
+   if (is_swiotlb_buffer(dev, sg_phys(sg)))
swiotlb_sync_single_for_device(dev, sg_phys(sg),
   sg->length, dir);
 
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4c89afc0df62..0c6ed09f8513 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -100,7 +100,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, 
dma_addr_t dma_addr)
 * in our domain. Therefore _only_ check address within our domain.
 */
if (pfn_valid(PFN_DOWN(paddr)))
-   return is_swiotlb_buffer(paddr);
+   return is_swiotlb_buffer(dev, paddr);
return 0;
 }
 
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index b469f04cca26..2a6cca07540b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -113,9 +113,9 @@ static inline struct io_tlb_mem *get_io_tlb_mem(struct 
device *dev)
return io_tlb_default_mem;
 }
 
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
+   struct io_tlb_mem *mem = get_io_tlb_mem(dev);
 
return mem && paddr >= mem->start && paddr < mem->end;
 }
@@ -127,7 +127,7 @@ bool is_swiotlb_active(void);
 void __init swiotlb_adjust_size(unsigned long size);
 #else
 #define swiotlb_force SWIOTLB_NO_FORCE
-static inline bool is_swiotlb_buffer(phys_addr_t paddr)
+static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
 {
return false;
 }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..84c9feb5474a 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -343,7 +343,7 @@ void dma_direct_sync_sg_for_device(struct device *dev,
for_each_sg(sgl, sg, nents, i) {
phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg));
 
-   if (unlikely(is_swiotlb_buffer(paddr)))
+   if (unlikely(is_swiotlb_buffer(dev, paddr)))
swiotlb_sync_single_for_device(dev, paddr, sg->length,
   dir);
 
@@ -369,7 +369,7 @@ void dma_direct_sync_sg_for_cpu(struct device 

[PATCH v5 06/16] swiotlb: Add a new get_io_tlb_mem getter

2021-04-22 Thread Claire Chang
Add a new getter, get_io_tlb_mem, to help select the io_tlb_mem struct.
The restricted DMA pool is preferred if available.

Signed-off-by: Claire Chang 
---
 include/linux/swiotlb.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 03ad6e3b4056..b469f04cca26 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -2,6 +2,7 @@
 #ifndef __LINUX_SWIOTLB_H
 #define __LINUX_SWIOTLB_H
 
+#include 
 #include 
 #include 
 #include 
@@ -102,6 +103,16 @@ struct io_tlb_mem {
 };
 extern struct io_tlb_mem *io_tlb_default_mem;
 
+static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev)
+{
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (dev && dev->dma_io_tlb_mem)
+   return dev->dma_io_tlb_mem;
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
+
+   return io_tlb_default_mem;
+}
+
 static inline bool is_swiotlb_buffer(phys_addr_t paddr)
 {
struct io_tlb_mem *mem = io_tlb_default_mem;
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 05/16] swiotlb: Add restricted DMA pool initialization

2021-04-22 Thread Claire Chang
Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes.

Signed-off-by: Claire Chang 
---
 include/linux/device.h  |  4 +++
 include/linux/swiotlb.h |  3 +-
 kernel/dma/swiotlb.c| 80 +
 3 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 38a2071cf776..4987608ea4ff 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -416,6 +416,7 @@ struct dev_links_info {
  * @dma_pools: Dma pools (if dma'ble device).
  * @dma_mem:   Internal for coherent mem override.
  * @cma_area:  Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
  * @archdata:  For arch-specific additions.
  * @of_node:   Associated device tree node.
  * @fwnode:Associated device node supplied by platform firmware.
@@ -521,6 +522,9 @@ struct device {
 #ifdef CONFIG_DMA_CMA
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   struct io_tlb_mem *dma_io_tlb_mem;
 #endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e513..03ad6e3b4056 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,7 +72,8 @@ extern enum swiotlb_force swiotlb_force;
  * range check to see if the memory was in fact allocated by this
  * API.
  * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and
- * @end. This is command line adjustable via setup_io_tlb_npages.
+ * @end. For default swiotlb, this is command line adjustable via
+ * setup_io_tlb_npages.
  * @used:  The number of used IO TLB block.
  * @list:  The free list describing the number of free entries available
  * from each index.
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 57a9adb920bf..ffbb8724e06c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -39,6 +39,13 @@
 #ifdef CONFIG_DEBUG_FS
 #include 
 #endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+#include 
+#include 
+#include 
+#include 
+#include 
+#endif
 
 #include 
 #include 
@@ -681,3 +688,76 @@ static int __init swiotlb_create_default_debugfs(void)
 late_initcall(swiotlb_create_default_debugfs);
 
 #endif
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
+
+   if (dev->dma_io_tlb_mem)
+   return 0;
+
+   /* Since multiple devices can share the same pool, the private data,
+* io_tlb_mem struct, will be initialized by the first device attached
+* to it.
+*/
+   if (!mem) {
+   mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+#ifdef CONFIG_ARM
+   if (!PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
+   kfree(mem);
+   return -EINVAL;
+   }
+#endif /* CONFIG_ARM */
+   swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
+
+   rmem->priv = mem;
+   }
+
+#ifdef CONFIG_DEBUG_FS
+   if (!io_tlb_default_mem->debugfs)
+   io_tlb_default_mem->debugfs =
+   debugfs_create_dir("swiotlb", NULL);
+
+   swiotlb_create_debugfs(mem, rmem->name, io_tlb_default_mem->debugfs);
+#endif /* CONFIG_DEBUG_FS */
+
+   dev->dma_io_tlb_mem = mem;
+
+   return 0;
+}
+
+static void rmem_swiotlb_device_release(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   if (dev)
+   dev->dma_io_tlb_mem = NULL;
+}
+
+static const struct reserved_mem_ops rmem_swiotlb_ops = {
+   .device_init = rmem_swiotlb_device_init,
+   .device_release = rmem_swiotlb_device_release,
+};
+
+static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
+{
+   unsigned long node = rmem->fdt_node;
+
+   if (of_get_flat_dt_prop(node, "reusable", NULL) ||
+   of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
+   of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
+   of_get_flat_dt_prop(node, "no-map", NULL))
+   return -EINVAL;
+
+   rmem->ops = &rmem_swiotlb_ops;
+   pr_info("Reserved memory: created device swiotlb memory pool at %pa, 
size %ld MiB\n",
+   &rmem->base, (unsigned long)rmem->size / SZ_1M);
+   return 0;
+}
+
+RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);
+#endif /* CONFIG_DMA_RESTRICTED_POOL */
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 04/16] swiotlb: Add DMA_RESTRICTED_POOL

2021-04-22 Thread Claire Chang
Add a new kconfig symbol, DMA_RESTRICTED_POOL, for restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/Kconfig | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
index 77b405508743..3e961dc39634 100644
--- a/kernel/dma/Kconfig
+++ b/kernel/dma/Kconfig
@@ -80,6 +80,20 @@ config SWIOTLB
bool
select NEED_DMA_MAP_STATE
 
+config DMA_RESTRICTED_POOL
+   bool "DMA Restricted Pool"
+   depends on OF && OF_RESERVED_MEM
+   select SWIOTLB
+   help
+ This enables support for restricted DMA pools which provide a level of
+ DMA memory protection on systems with limited hardware protection
+ capabilities, such as those lacking an IOMMU.
+
+ For more information see
+ 

+ and .
+ If unsure, say "n".
+
 #
 # Should be selected if we can mmap non-coherent mappings to userspace.
 # The only thing that is really required is a way to set an uncached bit
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 03/16] swiotlb: Refactor swiotlb_create_debugfs

2021-04-22 Thread Claire Chang
Split the debugfs creation to make the code reusable for supporting
different bounce buffer pools, e.g. restricted DMA pool.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 3f1adee35097..57a9adb920bf 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -660,18 +660,24 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active);
 
 #ifdef CONFIG_DEBUG_FS
 
-static int __init swiotlb_create_debugfs(void)
+static void swiotlb_create_debugfs(struct io_tlb_mem *mem, const char *name,
+  struct dentry *node)
 {
-   struct io_tlb_mem *mem = io_tlb_default_mem;
-
if (!mem)
-   return 0;
-   mem->debugfs = debugfs_create_dir("swiotlb", NULL);
+   return;
+
+   mem->debugfs = debugfs_create_dir(name, node);
debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, >nslabs);
debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, >used);
+}
+
+static int __init swiotlb_create_default_debugfs(void)
+{
+   swiotlb_create_debugfs(io_tlb_default_mem, "swiotlb", NULL);
+
return 0;
 }
 
-late_initcall(swiotlb_create_debugfs);
+late_initcall(swiotlb_create_default_debugfs);
 
 #endif
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 02/16] swiotlb: Refactor swiotlb init functions

2021-04-22 Thread Claire Chang
Add a new function, swiotlb_init_io_tlb_mem, for the io_tlb_mem struct
initialization to make the code reusable.

Note that we now also call set_memory_decrypted in swiotlb_init_with_tbl.

Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 51 ++--
 1 file changed, 25 insertions(+), 26 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 8635a57f88e9..3f1adee35097 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -166,9 +166,30 @@ void __init swiotlb_update_mem_attributes(void)
memset(vaddr, 0, bytes);
 }
 
-int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start,
+   unsigned long nslabs, bool late_alloc)
 {
+   void *vaddr = phys_to_virt(start);
unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+
+   mem->nslabs = nslabs;
+   mem->start = start;
+   mem->end = mem->start + bytes;
+   mem->index = 0;
+   mem->late_alloc = late_alloc;
+   spin_lock_init(&mem->lock);
+   for (i = 0; i < mem->nslabs; i++) {
+   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
+   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
+   mem->slots[i].alloc_size = 0;
+   }
+
+   set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT);
+   memset(vaddr, 0, bytes);
+}
+
+int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+{
struct io_tlb_mem *mem;
size_t alloc_size;
 
@@ -184,16 +205,8 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
if (!mem)
panic("%s: Failed to allocate %zu bytes align=0x%lx\n",
  __func__, alloc_size, PAGE_SIZE);
-   mem->nslabs = nslabs;
-   mem->start = __pa(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
+
+   swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false);
 
io_tlb_default_mem = mem;
if (verbose)
@@ -280,7 +293,6 @@ swiotlb_late_init_with_default_size(size_t default_size)
 int
 swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
 {
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
struct io_tlb_mem *mem;
 
if (swiotlb_force == SWIOTLB_NO_FORCE)
@@ -295,20 +307,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
if (!mem)
return -ENOMEM;
 
-   mem->nslabs = nslabs;
-   mem->start = virt_to_phys(tlb);
-   mem->end = mem->start + bytes;
-   mem->index = 0;
-   mem->late_alloc = 1;
-   spin_lock_init(&mem->lock);
-   for (i = 0; i < mem->nslabs; i++) {
-   mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i);
-   mem->slots[i].orig_addr = INVALID_PHYS_ADDR;
-   mem->slots[i].alloc_size = 0;
-   }
-
-   set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT);
-   memset(tlb, 0, bytes);
+   swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true);
 
io_tlb_default_mem = mem;
swiotlb_print_info();
-- 

I'm not 100% sure if set_memory_decrypted is safe to call in
swiotlb_init_with_tbl, but I didn't hit any issue booting with this.

2.31.1.368.gbe11c130af-goog



[PATCH v5 01/16] swiotlb: Fix the type of index

2021-04-22 Thread Claire Chang
Fix the type of index from unsigned int to int since find_slots() might
return -1.
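
For illustration, a tiny standalone sketch (not taken from the kernel tree) of
why keeping the -1 in an unsigned variable is fragile:

#include <stdio.h>

static int find_slots_stub(void)	/* stand-in for find_slots() failing */
{
	return -1;
}

int main(void)
{
	unsigned int uindex = find_slots_stub();
	int sindex = find_slots_stub();

	if (uindex < 0)		/* -1 wrapped to UINT_MAX, so this can never be true */
		printf("never reached\n");
	if (sindex < 0)		/* with a signed index the error path works */
		printf("allocation failed\n");
	return 0;
}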

Fixes: 0774983bc923 ("swiotlb: refactor swiotlb_tbl_map_single")
Signed-off-by: Claire Chang 
---
 kernel/dma/swiotlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 0a5b6f7e75bc..8635a57f88e9 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -499,7 +499,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, 
phys_addr_t orig_addr,
 {
struct io_tlb_mem *mem = io_tlb_default_mem;
unsigned int offset = swiotlb_align_offset(dev, orig_addr);
-   unsigned int index, i;
+   unsigned int i;
+   int index;
phys_addr_t tlb_addr;
 
if (!mem)
-- 
2.31.1.368.gbe11c130af-goog



[PATCH v5 00/16] Restricted DMA

2021-04-22 Thread Claire Chang
This series implements mitigations for lack of DMA access control on
systems without an IOMMU, which could result in the DMA accessing the
system memory at unexpected times and/or unexpected addresses, possibly
leading to data leakage or corruption.

For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
not behind an IOMMU. As PCI-e, by design, gives the device full access to
system memory, a vulnerability in the Wi-Fi firmware could easily escalate
to a full system exploit (remote wifi exploits: [1a], [1b], which show a
full chain of exploits; [2], [3]).

To mitigate the security concerns, we introduce restricted DMA. Restricted
DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
specially allocated region and does memory allocation from the same region.
The feature on its own provides a basic level of protection against the DMA
overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system needs
to provide a way to restrict the DMA to a predefined memory region (this is
usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
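
As a rough illustration of the bounce-buffer idea (a standalone sketch under
simplifying assumptions, not the kernel implementation): the device is only
ever handed addresses inside a dedicated region, and data is copied in and
out of that region around each transfer.

#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Pretend this is the only memory the device is allowed to touch. */
static uint8_t restricted_pool[64 * 1024];

/* "Map" for the device: bounce the CPU buffer into the pool and return
 * the pool offset the device may DMA to/from. */
static size_t bounce_map(const void *cpu_buf, size_t len)
{
	memcpy(restricted_pool, cpu_buf, len);
	return 0;
}

/* "Unmap": bounce the device's result back into the CPU buffer. */
static void bounce_unmap(void *cpu_buf, size_t offset, size_t len)
{
	memcpy(cpu_buf, restricted_pool + offset, len);
}

int main(void)
{
	char out[4] = "abc", in[4] = { 0 };
	size_t off = bounce_map(out, sizeof(out));	/* CPU -> pool */
	/* ... the device would now DMA within restricted_pool only ... */
	bounce_unmap(in, off, sizeof(in));		/* pool -> CPU */
	return (in[0] == 'a') ? 0 : 1;
}

On its own this only protects CPU buffers from being overwritten at
unexpected times; actually confining the device to the pool additionally
requires the firmware-level restriction mentioned above.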

[1a] 
https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
[1b] 
https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
[2] https://blade.tencent.com/en/advisories/qualpwn/
[3] 
https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
[4] 
https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132

v5:
  Rebase on latest linux-next

v4:
  - Fix spinlock bad magic
  - Use rmem->name for debugfs entry
  - Address the comments in v3

v3:
  Using only one reserved memory region for both streaming DMA and memory
  allocation.
  https://lore.kernel.org/patchwork/cover/1360992/

v2:
  Building on top of swiotlb.
  https://lore.kernel.org/patchwork/cover/1280705/

v1:
  Using dma_map_ops.
  https://lore.kernel.org/patchwork/cover/1271660/

Claire Chang (16):
  swiotlb: Fix the type of index
  swiotlb: Refactor swiotlb init functions
  swiotlb: Refactor swiotlb_create_debugfs
  swiotlb: Add DMA_RESTRICTED_POOL
  swiotlb: Add restricted DMA pool initialization
  swiotlb: Add a new get_io_tlb_mem getter
  swiotlb: Update is_swiotlb_buffer to add a struct device argument
  swiotlb: Update is_swiotlb_active to add a struct device argument
  swiotlb: Bounce data from/to restricted DMA pool if available
  swiotlb: Move alloc_size to find_slots
  swiotlb: Refactor swiotlb_tbl_unmap_single
  dma-direct: Add a new wrapper __dma_direct_free_pages()
  swiotlb: Add restricted DMA alloc/free support.
  dma-direct: Allocate memory from restricted DMA pool if available
  dt-bindings: of: Add restricted DMA pool
  of: Add plumbing for restricted DMA pool

 .../reserved-memory/reserved-memory.txt   |  24 ++
 drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   2 +-
 drivers/gpu/drm/nouveau/nouveau_ttm.c |   2 +-
 drivers/iommu/dma-iommu.c |  12 +-
 drivers/of/address.c  |  25 ++
 drivers/of/device.c   |   3 +
 drivers/of/of_private.h   |   5 +
 drivers/pci/xen-pcifront.c|   2 +-
 drivers/xen/swiotlb-xen.c |   2 +-
 include/linux/device.h|   4 +
 include/linux/swiotlb.h   |  41 ++-
 kernel/dma/Kconfig|  14 +
 kernel/dma/direct.c   |  57 +++--
 kernel/dma/direct.h   |   9 +-
 kernel/dma/swiotlb.c  | 242 +-
 15 files changed, 347 insertions(+), 97 deletions(-)

-- 
2.31.1.368.gbe11c130af-goog



RE: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-22 Thread David Laight
From: Daniel Axtens
> Sent: 22 April 2021 03:21
> 
> > Hi Lakshmi,
> >
> >> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote:
> >>
> >> Sorry - missed copying device-tree and powerpc mailing lists.
> >>
> >>> There are a few "goto out;" statements before the local variable "fdt"
> >>> is initialized through the call to of_kexec_alloc_and_setup_fdt() in
> >>> elf64_load(). This will result in an uninitialized "fdt" being passed
> >>> to kvfree() in this function if there is an error before the call to
> >>> of_kexec_alloc_and_setup_fdt().
> >>>
> >>> Initialize the local variable "fdt" to NULL.
> >>>
> > I'm a huge fan of initialising local variables! But I'm struggling to
> > find the code path that will lead to an uninit fdt being returned...
> 
> OK, so perhaps this was putting it too strongly. I have been bitten
> by uninitialised things enough in C that I may have taken a slightly
> overly-aggressive view of fixing them in the source rather than the
> compiler. I do think compiler-level mitigations are better, and I take
> the point that we don't want to defeat compiler checking.
> 
> (Does anyone - and by anyone I mean any large distro - compile with
> local variables inited by the compiler?)

There are compilers that initialise locals to zero for 'debug' builds
and leave them 'random' for optimised 'release' builds.
Let's not end up testing something other than what we release!

I also think there is a new option in gcc (or clang?) to initialise
on-stack structures and arrays to ensure garbage isn't passed around.
That seems like it would be a horrid performance hit,
especially in userspace where large stack allocations are almost free.

Any auto-initialise ought to use a semi-random value
(especially not zero) so that it is never right and doesn't
lead to lazy coding.
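
For what it's worth, a minimal userspace sketch of the hazard and the fix
being discussed (illustrative only; this is not the elf64_load() code):

#include <stdlib.h>

/* Hazard: on the early-exit path fdt is never assigned, so free() is
 * handed stack garbage (undefined behaviour). */
int setup_buggy(int fail_early)
{
	void *fdt;

	if (fail_early)
		goto out;
	fdt = malloc(4096);
out:
	free(fdt);
	return fail_early ? -1 : 0;
}

/* The proposed fix: initialise the local; free(NULL) is a no-op. */
int setup_fixed(int fail_early)
{
	void *fdt = NULL;

	if (fail_early)
		goto out;
	fdt = malloc(4096);
out:
	free(fdt);
	return fail_early ? -1 : 0;
}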

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



[PATCH 6/7] swiotlb: replace default_nslabs with a byte value

2021-04-22 Thread Christoph Hellwig
Replace the default_nslabs variable with one that stores the size in
bytes as that is what all the users actually expect.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/swiotlb.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 93737d0932fbf2..87d06ddf4753f3 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -71,7 +71,7 @@ struct io_tlb_mem *io_tlb_default_mem;
  */
 static unsigned int max_segment;
 
-static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
+static unsigned long swiotlb_default_size = IO_TLB_DEFAULT_SIZE;
 
 static int __init
 setup_io_tlb_npages(char *str)
@@ -106,7 +106,7 @@ void swiotlb_set_max_segment(unsigned int val)
 
 unsigned long swiotlb_size_or_default(void)
 {
-   return default_nslabs << IO_TLB_SHIFT;
+   return swiotlb_default_size;
 }
 
 void __init swiotlb_adjust_size(unsigned long size)
@@ -116,9 +116,9 @@ void __init swiotlb_adjust_size(unsigned long size)
 * architectures such as those supporting memory encryption to
 * adjust/expand SWIOTLB size for their use.
 */
-   size = ALIGN(size, IO_TLB_SIZE);
-   default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE);
-   pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20);
+   swiotlb_default_size = ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT);
+   pr_info("SWIOTLB bounce buffer size adjusted to %luMB",
+   swiotlb_default_size >> 20);
 }
 
 void swiotlb_print_info(void)
-- 
2.30.1



[PATCH 1/7] swiotlb: pass bytes instead of nslabs to swiotlb_init_with_tbl

2021-04-22 Thread Christoph Hellwig
Pass the actual allocation size to swiotlb_init_with_tbl, which
simplifies things quite a bit.

Signed-off-by: Christoph Hellwig 
---
 arch/mips/cavium-octeon/dma-octeon.c |  2 +-
 arch/powerpc/platforms/pseries/svm.c |  3 +--
 drivers/xen/swiotlb-xen.c|  2 +-
 include/linux/swiotlb.h  |  2 +-
 kernel/dma/swiotlb.c | 10 +-
 5 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c 
b/arch/mips/cavium-octeon/dma-octeon.c
index df70308db0e697..020b8ce5b8ff7c 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -245,6 +245,6 @@ void __init plat_swiotlb_setup(void)
panic("%s: Failed to allocate %zu bytes align=%lx\n",
  __func__, swiotlbsize, PAGE_SIZE);
 
-   if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs, 1) == -ENOMEM)
+   if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlbsize, 1) == -ENOMEM)
panic("Cannot allocate SWIOTLB buffer");
 }
diff --git a/arch/powerpc/platforms/pseries/svm.c 
b/arch/powerpc/platforms/pseries/svm.c
index 1d829e257996fb..4d281ff56ce96f 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -52,10 +52,9 @@ void __init svm_swiotlb_init(void)
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
-   if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false))
+   if (vstart && !swiotlb_init_with_tbl(vstart, bytes, false))
return;
 
-
memblock_free_early(__pa(vstart),
PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
panic("SVM: Cannot allocate SWIOTLB buffer");
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 4c89afc0df6289..18d79f07b507ce 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -249,7 +249,7 @@ void __init xen_swiotlb_init_early(void)
panic("%s (rc:%d)", xen_swiotlb_error(XEN_SWIOTLB_EFIXUP), rc);
}
 
-   if (swiotlb_init_with_tbl(start, nslabs, false))
+   if (swiotlb_init_with_tbl(start, bytes, false))
panic("Cannot allocate SWIOTLB buffer");
swiotlb_set_max_segment(PAGE_SIZE);
 }
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e5134b..d1d40ca5014b54 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -36,7 +36,7 @@ enum swiotlb_force {
 #define IO_TLB_DEFAULT_SIZE (64UL<<20)
 
 extern void swiotlb_init(int verbose);
-int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
+int swiotlb_init_with_tbl(char *tlb, size_t bytes, int verbose);
 unsigned long swiotlb_size_or_default(void);
 extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
 extern int swiotlb_late_init_with_default_size(size_t default_size);
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 0a5b6f7e75bce6..c7b3dd86db7f56 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -166,9 +166,9 @@ void __init swiotlb_update_mem_attributes(void)
memset(vaddr, 0, bytes);
 }
 
-int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose)
+int __init swiotlb_init_with_tbl(char *tlb, size_t bytes, int verbose)
 {
-   unsigned long bytes = nslabs << IO_TLB_SHIFT, i;
+   unsigned long nslabs = bytes >> IO_TLB_SHIFT, i;
struct io_tlb_mem *mem;
size_t alloc_size;
 
@@ -209,17 +209,17 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
 void  __init
 swiotlb_init(int verbose)
 {
-   size_t bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT);
+   size_t bytes = default_nslabs << IO_TLB_SHIFT;
void *tlb;
 
if (swiotlb_force == SWIOTLB_NO_FORCE)
return;
 
/* Get IO TLB memory from the low pages */
-   tlb = memblock_alloc_low(bytes, PAGE_SIZE);
+   tlb = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE);
if (!tlb)
goto fail;
-   if (swiotlb_init_with_tbl(tlb, default_nslabs, verbose))
+   if (swiotlb_init_with_tbl(tlb, bytes, verbose))
goto fail_free_mem;
return;
 
-- 
2.30.1



[PATCH 7/7] swiotlb: don't override the command line in swiotlb_adjust_size

2021-04-22 Thread Christoph Hellwig
When the user specified an explicit swiotlb size on the command line,
the architecture code should not override it.

Fixes: 2cbc2776efe4 ("swiotlb: remove swiotlb_nr_tbl")
Reported-by: Tom Lendacky 
Signed-off-by: Christoph Hellwig 
---
 kernel/dma/swiotlb.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 87d06ddf4753f3..aef02a3825b494 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -106,7 +106,9 @@ void swiotlb_set_max_segment(unsigned int val)
 
 unsigned long swiotlb_size_or_default(void)
 {
-   return swiotlb_default_size;
+   if (swiotlb_default_size)
+   return swiotlb_default_size;
+   return IO_TLB_DEFAULT_SIZE;
 }
 
 void __init swiotlb_adjust_size(unsigned long size)
@@ -116,6 +118,8 @@ void __init swiotlb_adjust_size(unsigned long size)
 * architectures such as those supporting memory encryption to
 * adjust/expand SWIOTLB size for their use.
 */
+   if (swiotlb_default_size)
+   return;
swiotlb_default_size = ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT);
pr_info("SWIOTLB bounce buffer size adjusted to %luMB",
swiotlb_default_size >> 20);
-- 
2.30.1



[PATCH 2/7] swiotlb: use swiotlb_size_or_default in swiotlb_init

2021-04-22 Thread Christoph Hellwig
Use swiotlb_size_or_default to calculate the buffer size instead of
open coding it.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/swiotlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c7b3dd86db7f56..27461fd63e0330 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -209,7 +209,7 @@ int __init swiotlb_init_with_tbl(char *tlb, size_t bytes, 
int verbose)
 void  __init
 swiotlb_init(int verbose)
 {
-   size_t bytes = default_nslabs << IO_TLB_SHIFT;
+   size_t bytes = swiotlb_size_or_default();
void *tlb;
 
if (swiotlb_force == SWIOTLB_NO_FORCE)
-- 
2.30.1



[PATCH 4/7] powerpc/pseries: simplify svm_swiotlb_init

2021-04-22 Thread Christoph Hellwig
The value returned by swiotlb_size_or_default is always properly
aligned now, so don't duplicate the work.

Signed-off-by: Christoph Hellwig 
---
 arch/powerpc/platforms/pseries/svm.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/svm.c 
b/arch/powerpc/platforms/pseries/svm.c
index 4d281ff56ce96f..9187d2a1ed568d 100644
--- a/arch/powerpc/platforms/pseries/svm.c
+++ b/arch/powerpc/platforms/pseries/svm.c
@@ -43,20 +43,14 @@ machine_early_initcall(pseries, init_svm);
  */
 void __init svm_swiotlb_init(void)
 {
+   unsigned long bytes = swiotlb_size_or_default();
unsigned char *vstart;
-   unsigned long bytes, io_tlb_nslabs;
-
-   io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT);
-   io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
-
-   bytes = io_tlb_nslabs << IO_TLB_SHIFT;
 
vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE);
if (vstart && !swiotlb_init_with_tbl(vstart, bytes, false))
return;
 
-   memblock_free_early(__pa(vstart),
-   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+   memblock_free_early(__pa(vstart), PAGE_ALIGN(bytes));
panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
-- 
2.30.1



cleanup and fix swiotlb sizing

2021-04-22 Thread Christoph Hellwig
Hi all,

based on a report from Tom that overriding the default sizing provided
by the x86 SEV code on the command line doesn't work anymore, this series
cleans up how we handle default and command line sizes for the swiotlb
buffer and then fixes the recently introduced bug in a straight-forward
way.
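
The intended precedence after this series, as a hedged sketch (an
illustrative pseudo-helper, not code from the patches):

/* A command-line size wins over an architecture override (e.g. the SEV
 * adjustment), which wins over the built-in default. */
unsigned long pick_swiotlb_size(unsigned long cmdline_bytes,
				unsigned long arch_bytes)
{
	if (cmdline_bytes)		/* explicit swiotlb= on the command line */
		return cmdline_bytes;
	if (arch_bytes)			/* architecture override, e.g. SEV */
		return arch_bytes;
	return 64UL << 20;		/* IO_TLB_DEFAULT_SIZE */
}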

Diffstat:
 arch/mips/cavium-octeon/dma-octeon.c  |   16 +--
 arch/mips/include/asm/octeon/pci-octeon.h |1 
 arch/mips/pci/pci-octeon.c|2 -
 arch/powerpc/platforms/pseries/svm.c  |   13 ++--
 drivers/xen/swiotlb-xen.c |2 -
 include/linux/swiotlb.h   |2 -
 kernel/dma/swiotlb.c  |   32 +++---
 7 files changed, 25 insertions(+), 43 deletions(-)


[PATCH 3/7] swiotlb: use swiotlb_adjust_size in setup_io_tlb_npages

2021-04-22 Thread Christoph Hellwig
Use the proper helper to align the buffer size to the requirements of
the swiotlb allocator instead of open coding the logic.

Signed-off-by: Christoph Hellwig 
---
 kernel/dma/swiotlb.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 27461fd63e0330..93737d0932fbf2 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -76,11 +76,9 @@ static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> 
IO_TLB_SHIFT;
 static int __init
 setup_io_tlb_npages(char *str)
 {
-   if (isdigit(*str)) {
-   /* avoid tail segment of size < IO_TLB_SEGSIZE */
-   default_nslabs =
-   ALIGN(simple_strtoul(str, &str, 0), IO_TLB_SEGSIZE);
-   }
+   if (isdigit(*str))
+   swiotlb_adjust_size(
+   simple_strtoul(str, &str, 0) << IO_TLB_SHIFT);
if (*str == ',')
++str;
if (!strcmp(str, "force"))
-- 
2.30.1



[PATCH 5/7] MIPS/octeon: simplify swiotlb initialization

2021-04-22 Thread Christoph Hellwig
Just use swiotlb_adjust_size and swiotlb_init to initialize
swiotlb instead of doing a lot of manual work.

Signed-off-by: Christoph Hellwig 
---
 arch/mips/cavium-octeon/dma-octeon.c  | 16 ++--
 arch/mips/include/asm/octeon/pci-octeon.h |  1 -
 arch/mips/pci/pci-octeon.c|  2 +-
 3 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c 
b/arch/mips/cavium-octeon/dma-octeon.c
index 020b8ce5b8ff7c..6bc9ef5e3790ec 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -186,15 +186,12 @@ phys_addr_t dma_to_phys(struct device *dev, dma_addr_t 
daddr)
return daddr;
 }
 
-char *octeon_swiotlb;
-
 void __init plat_swiotlb_setup(void)
 {
phys_addr_t start, end;
phys_addr_t max_addr;
phys_addr_t addr_size;
size_t swiotlbsize;
-   unsigned long swiotlb_nslabs;
u64 i;
 
max_addr = 0;
@@ -236,15 +233,6 @@ void __init plat_swiotlb_setup(void)
if (OCTEON_IS_OCTEON2() && max_addr >= 0x1ul)
swiotlbsize = 64 * (1<<20);
 #endif
-   swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
-   swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
-   swiotlbsize = swiotlb_nslabs << IO_TLB_SHIFT;
-
-   octeon_swiotlb = memblock_alloc_low(swiotlbsize, PAGE_SIZE);
-   if (!octeon_swiotlb)
-   panic("%s: Failed to allocate %zu bytes align=%lx\n",
- __func__, swiotlbsize, PAGE_SIZE);
-
-   if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlbsize, 1) == -ENOMEM)
-   panic("Cannot allocate SWIOTLB buffer");
+   swiotlb_adjust_size(swiotlbsize);
+   swiotlb_init(false);
 }
diff --git a/arch/mips/include/asm/octeon/pci-octeon.h 
b/arch/mips/include/asm/octeon/pci-octeon.h
index b12d9a3fbfb6c0..a2f20a44fb6143 100644
--- a/arch/mips/include/asm/octeon/pci-octeon.h
+++ b/arch/mips/include/asm/octeon/pci-octeon.h
@@ -64,6 +64,5 @@ enum octeon_dma_bar_type {
 extern enum octeon_dma_bar_type octeon_dma_bar_type;
 
 void octeon_pci_dma_init(void);
-extern char *octeon_swiotlb;
 
 #endif
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index fc29b85cfa926d..ff26cd9dc083f6 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -664,7 +664,7 @@ static int __init octeon_pci_setup(void)
 
/* BAR1 movable regions contiguous to cover the swiotlb */
octeon_bar1_pci_phys =
-   virt_to_phys(octeon_swiotlb) & ~((1ull << 22) - 1);
+   io_tlb_default_mem->start & ~((1ull << 22) - 1);
 
for (index = 0; index < 32; index++) {
union cvmx_pci_bar1_indexx bar1_index;
-- 
2.30.1



Re: [PATCH v3 00/11] DDW + Indirect Mapping

2021-04-22 Thread Leonardo Bras
Changes since v2:
- Some patches got removed from the series and sent by themselves,
- New tbl created for DDW + indirect mapping reserves MMIO32 space,
- Improved reserved area algorithm,
- Improved commit messages,
- Removed define for default DMA window prop name,
- Avoided some unnecessary renaming,
- Removed some unnecessary empty lines,
- Changed some code moving to forward declarations.
v2
Link: 
http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=201210=%2A=both
 

On Thu, 2021-04-22 at 04:07 -0300, Leonardo Bras wrote:
> So far it's assumed possible to map the guest RAM 1:1 to the bus, which
> works with a small number of devices. SRIOV changes this, as the user can
> configure hundreds of VFs, and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per PE to limit the waste of physical pages.
> 
> As of today, if the assumed direct mapping is not possible, DDW creation
> is skipped and the default DMA window "ibm,dma-window" is used instead.
> 
> Using the DDW instead of the default DMA window may allow expanding the
> amount of memory that can be DMA-mapped, given the number of pages (TCEs)
> may stay the same (or increase) and the default DMA window offers only
> 4k-pages while DDW may offer larger pages (4k, 64k, 16M ...).
> 
> Patch #1 replaces hard-coded 4K page size with a variable containing the
> correct page size for the window.
> 
> Patch #2 introduces iommu_table_in_use(), and replace manual bit-field
> checking where it's used. It will be used for aborting enable_ddw() if
> there is any current iommu allocation and we are trying single window
> indirect mapping.
> 
> Patch #3 introduces iommu_pseries_alloc_table() that will be helpful
> when indirect mapping needs to replace the iommu_table.
> 
> Patch #4 adds helpers for adding DDWs in the list.
> 
> Patch #5 refactors enable_ddw() so it returns if direct mapping is
> possible, instead of DMA offset. It helps for next patches on
> indirect DMA mapping and also allows DMA windows starting at 0x00.
> 
> Patch #6 bring new helper to simplify enable_ddw(), allowing
> some reorganization for introducing indirect mapping DDW.
> 
> Patch #7 adds new helper _iommu_table_setparms() and use it in other
> *setparams*() to fill iommu_table. It will also be used for creating a
> new iommu_table for indirect mapping.
> 
> Patch #8 updates remove_dma_window() to accept different property names,
> so we can introduce a new property for indirect mapping.
> 
> Patch #9 extracts find_existing_ddw_windows() into
> find_existing_ddw_windows_named(), and calls it by it's property name.
> This will be useful when the property for indirect mapping is created,
> so we can search the device-tree for both properties.
> 
> Patch #10:
> Instead of destroying the created DDW if it doesn't map the whole
> partition, make use of it instead of the default DMA window as it improves
> performance. Also, update the iommu_table and re-generate the pools.
> It introduces a new property name for DDW with indirect DMA mapping.
> 
> Patch #11:
> Does some renaming of 'direct window' to 'dma window', given the DDW
> created can now be also used in indirect mapping if direct mapping is not
> available.
> 
> All patches were tested in an LPAR with a virtio-net interface that
> allows the default DMA window and DDW to coexist.
> 
> Leonardo Bras (11):
>   powerpc/pseries/iommu: Replace hard-coded page shift
>   powerpc/kernel/iommu: Add new iommu_table_in_use() helper
>   powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
>   powerpc/pseries/iommu: Add ddw_list_new_entry() helper
>   powerpc/pseries/iommu: Allow DDW windows starting at 0x00
>   powerpc/pseries/iommu: Add ddw_property_create() and refactor
> enable_ddw()
>   powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new
> helper
>   powerpc/pseries/iommu: Update remove_dma_window() to accept property
> name
>   powerpc/pseries/iommu: Find existing DDW with given property name
>   powerpc/pseries/iommu: Make use of DDW for indirect mapping
>   powerpc/pseries/iommu: Rename "direct window" to "dma window"
> 
>  arch/powerpc/include/asm/iommu.h   |   1 +
>  arch/powerpc/include/asm/tce.h |   8 -
>  arch/powerpc/kernel/iommu.c|  65 ++--
>  arch/powerpc/platforms/pseries/iommu.c | 504 +++--
>  4 files changed, 338 insertions(+), 240 deletions(-)
> 




[PATCH v3 11/11] powerpc/pseries/iommu: Rename "direct window" to "dma window"

2021-04-22 Thread Leonardo Bras
A previous change introduced the usage of DDW as a bigger, indirect DMA
mapping when the available DDW size does not map the whole partition.

As most of the code that manipulates direct mappings was reused for
indirect mappings, it's necessary to rename identifiers and debug/info
messages to reflect that they can be used for both kinds of mapping.

This should cause no behavioural change, just adjust naming.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 93 +-
 1 file changed, 48 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 50909cbc73f6..f5d0a6f012da 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -355,7 +355,7 @@ struct dynamic_dma_window_prop {
__be32  window_shift;   /* ilog2(tce_window_size) */
 };
 
-struct direct_window {
+struct dma_win {
struct device_node *device;
const struct dynamic_dma_window_prop *prop;
struct list_head list;
@@ -375,11 +375,11 @@ struct ddw_create_response {
u32 addr_lo;
 };
 
-static LIST_HEAD(direct_window_list);
+static LIST_HEAD(dma_win_list);
 /* prevents races between memory on/offline and window creation */
-static DEFINE_SPINLOCK(direct_window_list_lock);
+static DEFINE_SPINLOCK(dma_win_list_lock);
 /* protects initializing window twice for same device */
-static DEFINE_MUTEX(direct_window_init_mutex);
+static DEFINE_MUTEX(dma_win_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
 #define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
@@ -712,7 +712,10 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus 
*bus)
pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n",
 dn);
 
-   /* Find nearest ibm,dma-window, walking up the device tree */
+   /*
+* Find nearest ibm,dma-window (default DMA window), walking up the
+* device tree
+*/
for (pdn = dn; pdn != NULL; pdn = pdn->parent) {
dma_window = of_get_property(pdn, "ibm,dma-window", NULL);
if (dma_window != NULL)
@@ -816,11 +819,11 @@ static void remove_dma_window(struct device_node *np, u32 
*ddw_avail,
 
ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn);
if (ret)
-   pr_warn("%pOF: failed to remove direct window: rtas returned "
+   pr_warn("%pOF: failed to remove DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
else
-   pr_debug("%pOF: successfully removed direct window: rtas 
returned "
+   pr_debug("%pOF: successfully removed DMA window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
@@ -848,37 +851,37 @@ static int remove_ddw(struct device_node *np, bool 
remove_prop, const char *win_
 
ret = of_remove_property(np, win);
if (ret)
-   pr_warn("%pOF: failed to remove direct window property: %d\n",
+   pr_warn("%pOF: failed to remove DMA window property: %d\n",
np, ret);
return 0;
 }
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int 
*window_shift)
 {
-   struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
+   struct dma_win *window;
+   const struct dynamic_dma_window_prop *dma64;
bool found = false;
 
-   spin_lock(&direct_window_list_lock);
+   spin_lock(&dma_win_list_lock);
/* check if we already created a window and dupe that config if so */
-   list_for_each_entry(window, &direct_window_list, list) {
+   list_for_each_entry(window, &dma_win_list, list) {
if (window->device == pdn) {
-   direct64 = window->prop;
-   *dma_addr = be64_to_cpu(direct64->dma_base);
-   *window_shift = be32_to_cpu(direct64->window_shift);
+   dma64 = window->prop;
+   *dma_addr = be64_to_cpu(dma64->dma_base);
+   *window_shift = be32_to_cpu(dma64->window_shift);
found = true;
break;
}
}
-   spin_unlock(&direct_window_list_lock);
+   spin_unlock(&dma_win_list_lock);
 
return found;
 }
 
-static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
-   const struct 
dynamic_dma_window_prop *dma64)
+static struct dma_win *ddw_list_new_entry(struct device_node *pdn,
+ const struct dynamic_dma_window_prop 
*dma64)
 {
-   struct direct_window *window;
+   struct dma_win *window;
 
window = 

[PATCH v3 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping

2021-04-22 Thread Leonardo Bras
So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes this, as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for the
default DMA window, and also gains access to much larger page sizes
(16MB as implemented in qemu vs 4k from the default DMA window), causing a
significant increase in the maximum amount of memory that can be
IOMMU-mapped at the same time.
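
As a rough back-of-the-envelope sketch (the TCE count below is purely
illustrative, not taken from any platform), the reachable DMA space scales
directly with the IOMMU page size for a fixed number of TCEs:

#include <stdio.h>

int main(void)
{
	unsigned long long ntces = 1ULL << 20;	/* assume 1M TCEs */

	printf("4K  pages: %llu GiB\n", (ntces << 12) >> 30);	/* 4 GiB */
	printf("64K pages: %llu GiB\n", (ntces << 16) >> 30);	/* 64 GiB */
	printf("16M pages: %llu GiB\n", (ntces << 24) >> 30);	/* 16 TiB */
	return 0;
}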

Indirect mapping will only be used if direct mapping is not a
possibility.

For indirect mapping, it's necessary to re-create the iommu_table with
the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping
is only allowed if there is no current IOMMU memory allocated in
the iommu_table. enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the
same time, we can't reuse the DIRECT64_PROPNAME property name, or else
an older kexec()ed kernel can assume direct mapping, and skip
iommu_alloc(), causing undesirable behavior.
So a new property name DMA64_PROPNAME "linux,dma64-ddr-window-info"
was created to represent a DDW that does not allow direct mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 87 +-
 1 file changed, 72 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 3367233a5535..50909cbc73f6 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,7 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+static phys_addr_t ddw_memory_hotplug_max(void);
 #ifdef CONFIG_IOMMU_API
 static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned 
long *tce,
enum dma_data_direction *direction, bool 
realmode);
@@ -380,6 +381,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock);
 /* protects initializing window twice for same device */
 static DEFINE_MUTEX(direct_window_init_mutex);
 #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
+#define DMA64_PROPNAME "linux,dma64-ddr-window-info"
 
 static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn,
unsigned long num_pfn, const void *arg)
@@ -918,6 +920,7 @@ static int find_existing_ddw_windows(void)
return 0;
 
find_existing_ddw_windows_named(DIRECT64_PROPNAME);
+   find_existing_ddw_windows_named(DMA64_PROPNAME);
 
return 0;
 }
@@ -1207,10 +1210,13 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
+   const char *win_name;
struct property *win64 = NULL;
struct failed_ddw_pdn *fpdn;
-   bool default_win_removed = false;
+   bool default_win_removed = false, direct_mapping = false;
bool pmem_present;
+   struct pci_dn *pci = PCI_DN(pdn);
+   struct iommu_table *tbl = pci->table_group->tables[0];
 
dn = of_find_node_by_type(NULL, "ibm,pmemory");
pmem_present = dn != NULL;
@@ -1218,8 +1224,12 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 
mutex_lock(&direct_window_init_mutex);
 
-   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len))
-   goto out_unlock;
+   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len)) {
+   direct_mapping = (len >= max_ram_len);
+
+   mutex_unlock(&direct_window_init_mutex);
+   return direct_mapping;
+   }
 
/*
 * If we already went through this for a previous function of
@@ -1298,7 +1308,6 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
goto out_failed;
}
/* verify the window * number of ptes will map the partition */
-   /* check largest block * page size > max memory hotplug addr */
/*
 * The "ibm,pmemory" can appear anywhere in the address space.
 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
@@ -1320,6 +1329,17 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << len,
query.largest_available_block,
1ULL << page_shift);
+
+   len = order_base_2(query.largest_available_block << page_shift);
+   win_name = DMA64_PROPNAME;
+   } else {
+   direct_mapping = true;
+   win_name = 

[PATCH v3 09/11] powerpc/pseries/iommu: Find existing DDW with given property name

2021-04-22 Thread Leonardo Bras
At the moment, pseries stores information about the created directly-mapped
DDW window in DIRECT64_PROPNAME.

With the objective of implementing indirect DMA mapping with DDW, it's
necessary to have another property name to make sure kexec'ing into older
kernels does not break, as it would if we reused DIRECT64_PROPNAME.

In order to have this, find_existing_ddw_windows() needs to be able to
look for different property names.

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named()
and call it with the current property name.
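
Once extracted, the boot-time scan in a later patch of this series ends up
probing both property names, roughly like this:

find_existing_ddw_windows_named(DIRECT64_PROPNAME);	/* direct-mapped DDWs */
find_existing_ddw_windows_named(DMA64_PROPNAME);	/* indirect-mapped DDWs */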

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 075c6e08f012..3367233a5535 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -888,24 +888,21 @@ static struct direct_window *ddw_list_new_entry(struct 
device_node *pdn,
return window;
 }
 
-static int find_existing_ddw_windows(void)
+static void find_existing_ddw_windows_named(const char *name)
 {
int len;
struct device_node *pdn;
struct direct_window *window;
-   const struct dynamic_dma_window_prop *direct64;
-
-   if (!firmware_has_feature(FW_FEATURE_LPAR))
-   return 0;
+   const struct dynamic_dma_window_prop *dma64;
 
-   for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-   direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true, DIRECT64_PROPNAME);
+   for_each_node_with_property(pdn, name) {
+   dma64 = of_get_property(pdn, name, &len);
+   if (!dma64 || len < sizeof(*dma64)) {
+   remove_ddw(pdn, true, name);
continue;
}
 
-   window = ddw_list_new_entry(pdn, direct64);
+   window = ddw_list_new_entry(pdn, dma64);
if (!window)
break;
 
@@ -913,6 +910,14 @@ static int find_existing_ddw_windows(void)
list_add(&window->list, &direct_window_list);
spin_unlock(&direct_window_list_lock);
}
+}
+
+static int find_existing_ddw_windows(void)
+{
+   if (!firmware_has_feature(FW_FEATURE_LPAR))
+   return 0;
+
+   find_existing_ddw_windows_named(DIRECT64_PROPNAME);
 
return 0;
 }
-- 
2.30.2



[PATCH v3 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name

2021-04-22 Thread Leonardo Bras
Update remove_dma_window() so it can be used to remove DDW with a given
property name.

This enables the creation of new property names for DDW, so we can
put it to different uses, such as indirect mapping.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 0147ccaf0be4..075c6e08f012 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -823,31 +823,32 @@ static void remove_dma_window(struct device_node *np, u32 
*ddw_avail,
np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn);
 }
 
-static void remove_ddw(struct device_node *np, bool remove_prop)
+static int remove_ddw(struct device_node *np, bool remove_prop, const char 
*win_name)
 {
struct property *win;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
int ret = 0;
 
+   win = of_find_property(np, win_name, NULL);
+   if (!win)
+   return -EINVAL;
+
ret = of_property_read_u32_array(np, "ibm,ddw-applicable",
 &ddw_avail[0], DDW_APPLICABLE_SIZE);
if (ret)
-   return;
-
-   win = of_find_property(np, DIRECT64_PROPNAME, NULL);
-   if (!win)
-   return;
+   return 0;
 
if (win->length >= sizeof(struct dynamic_dma_window_prop))
remove_dma_window(np, ddw_avail, win);
 
if (!remove_prop)
-   return;
+   return 0;
 
ret = of_remove_property(np, win);
if (ret)
pr_warn("%pOF: failed to remove direct window property: %d\n",
np, ret);
+   return 0;
 }
 
 static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int 
*window_shift)
@@ -900,7 +901,7 @@ static int find_existing_ddw_windows(void)
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
direct64 = of_get_property(pdn, DIRECT64_PROPNAME, );
if (!direct64 || len < sizeof(*direct64)) {
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
continue;
}
 
@@ -1372,7 +1373,7 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
win64 = NULL;
 
 out_del_win:
-   remove_ddw(pdn, true);
+   remove_ddw(pdn, true, DIRECT64_PROPNAME);
 
 out_failed:
if (default_win_removed)
@@ -1536,7 +1537,7 @@ static int iommu_reconfig_notifier(struct notifier_block 
*nb, unsigned long acti
 * we have to remove the property when releasing
 * the device node.
 */
-   remove_ddw(np, false);
+   remove_ddw(np, false, DIRECT64_PROPNAME);
if (pci && pci->table_group)
iommu_pseries_free_group(pci->table_group,
np->full_name);
-- 
2.30.2



[PATCH v3 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper

2021-04-22 Thread Leonardo Bras
Add a new helper _iommu_table_setparms(), and use it in
iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated
code.

Also, setting tbl->it_ops was happening outside iommu_table_setparms*(),
so move it to the new helper. Since we need the iommu_table_ops to be
declared before being used, move iommu_table_lpar_multi_ops and
iommu_table_pseries_ops to before their respective iommu_table_setparms*().

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 100 -
 1 file changed, 50 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 48c029386d94..0147ccaf0be4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,11 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
+#ifdef CONFIG_IOMMU_API
+static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned 
long *tce,
+   enum dma_data_direction *direction, bool 
realmode);
+#endif
+
 static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
struct iommu_table *tbl;
@@ -501,6 +506,28 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long 
start_pfn,
return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg);
 }
 
+static inline void _iommu_table_setparms(struct iommu_table *tbl, unsigned 
long busno,
+unsigned long liobn, unsigned long 
win_addr,
+unsigned long window_size, unsigned 
long page_shift,
+unsigned long base, struct 
iommu_table_ops *table_ops)
+{
+   tbl->it_busno = busno;
+   tbl->it_index = liobn;
+   tbl->it_offset = win_addr >> page_shift;
+   tbl->it_size = window_size >> page_shift;
+   tbl->it_page_shift = page_shift;
+   tbl->it_base = base;
+   tbl->it_blocksize = 16;
+   tbl->it_type = TCE_PCI;
+   tbl->it_ops = table_ops;
+}
+
+struct iommu_table_ops iommu_table_pseries_ops = {
+   .set = tce_build_pSeries,
+   .clear = tce_free_pSeries,
+   .get = tce_get_pseries
+};
+
 static void iommu_table_setparms(struct pci_controller *phb,
 struct device_node *dn,
 struct iommu_table *tbl)
@@ -509,8 +536,13 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
const unsigned long *basep;
const u32 *sizep;
 
-   node = phb->dn;
+   /* Test if we are going over 2GB of DMA space */
+   if (phb->dma_window_base_cur + phb->dma_window_size > SZ_2G) {
+   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
+   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
+   }
 
+   node = phb->dn;
basep = of_get_property(node, "linux,tce-base", NULL);
sizep = of_get_property(node, "linux,tce-size", NULL);
if (basep == NULL || sizep == NULL) {
@@ -519,33 +551,25 @@ static void iommu_table_setparms(struct pci_controller 
*phb,
return;
}
 
-   tbl->it_base = (unsigned long)__va(*basep);
+   _iommu_table_setparms(tbl, phb->bus->number, 0, 
phb->dma_window_base_cur,
+ phb->dma_window_size, IOMMU_PAGE_SHIFT_4K,
+ (unsigned long)__va(*basep), 
&iommu_table_pseries_ops);
 
if (!is_kdump_kernel())
memset((void *)tbl->it_base, 0, *sizep);
 
-   tbl->it_busno = phb->bus->number;
-   tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K;
-
-   /* Units of tce entries */
-   tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift;
-
-   /* Test if we are going over 2GB of DMA space */
-   if (phb->dma_window_base_cur + phb->dma_window_size > 0x8000ul) {
-   udbg_printf("PCI_DMA: Unexpected number of IOAs under this 
PHB.\n");
-   panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n");
-   }
-
phb->dma_window_base_cur += phb->dma_window_size;
-
-   /* Set the tce table size - measured in entries */
-   tbl->it_size = phb->dma_window_size >> tbl->it_page_shift;
-
-   tbl->it_index = 0;
-   tbl->it_blocksize = 16;
-   tbl->it_type = TCE_PCI;
 }
 
+struct iommu_table_ops iommu_table_lpar_multi_ops = {
+   .set = tce_buildmulti_pSeriesLP,
+#ifdef CONFIG_IOMMU_API
+   .xchg_no_kill = tce_exchange_pseries,
+#endif
+   .clear = tce_freemulti_pSeriesLP,
+   .get = tce_get_pSeriesLP
+};
+
 /*
  * iommu_table_setparms_lpar
  *
@@ -557,28 +581,17 @@ static void iommu_table_setparms_lpar(struct 
pci_controller *phb,
  struct iommu_table_group *table_group,
  const __be32 *dma_window)
 {
-   unsigned long offset, size;
+   unsigned long offset, size, liobn;
 
-   of_parse_dma_window(dn, 

[PATCH v3 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-04-22 Thread Leonardo Bras
The code used to create a DDW property, previously scattered through
enable_ddw(), is now gathered in ddw_property_create(), which deals with
allocating and filling the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we only get to do tce_setrange_multi_pSeriesLP_walk
over all memory, which looks like the most expensive operation, if
everything else succeeds.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/platforms/pseries/iommu.c | 93 --
 1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 955cf095416c..48c029386d94 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
 }
 
+static struct property *ddw_property_create(const char *propname, u32 liobn, 
u64 dma_addr,
+   u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64->name);
+   kfree(win64->value);
+   kfree(win64);
+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
 /* Return largest page shift based on "IO Page Sizes" output of 
ibm,query-pe-dma-window. */
 static int iommu_get_page_shift(u32 query_page_size)
 {
@@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+   u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64 = NULL;
-   struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-   dev_info(>dev,
-   "couldn't allocate property for 64bit dma window\n");
-   goto out_failed;
-   }
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
-   win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
-   win64->length = sizeof(*ddwprop);
-   if (!win64->name || !win64->value) {
-   dev_info(>dev,
-   "couldn't allocate property name and value\n");
-   goto out_free_prop;
-   }
 
ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
if (ret != 0)
-   goto out_free_prop;
-
-   ddwprop->liobn = cpu_to_be32(create.liobn);
-   ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
-   create.addr_lo);
-   ddwprop->tce_shift = cpu_to_be32(page_shift);
-   ddwprop->window_shift = cpu_to_be32(len);
+   goto out_failed;
 
dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = ddw_list_new_entry(pdn, ddwprop);
+   win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+   win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+   page_shift, len);
+   if (!win64) {
+   dev_info(>dev,
+"couldn't allocate property, property name, or 
value\n");
+   goto out_del_win;
+   }
+
+   ret = of_add_property(pdn, win64);
+   if (ret) {
+   dev_err(>dev, 

[PATCH v3 05/11] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2021-04-22 Thread Leonardo Bras
enable_ddw() currently returns the address of the DMA window, which is
considered invalid if it has the value 0x00.

Also, an address returned from find_existing_ddw() is only considered
valid if it's not 0x00.

Changing this behavior makes sense, given the users of enable_ddw() only
need to know if direct mapping is possible. It can also allow a DMA window
starting at 0x00 to be used.

This will be helpful for using a DDW with indirect mapping, as the window
address will be different than 0x00, but it will not map the whole
partition.
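
A minimal standalone sketch of the limitation of the old calling convention
(illustrative only, not the pseries code):

#include <stdbool.h>
#include <stdio.h>

/* Old convention: the returned address doubles as the success flag, so a
 * perfectly valid window starting at 0x00 is indistinguishable from failure. */
static unsigned long long enable_window_old(void)
{
	return 0;		/* pretend the window legitimately starts at 0 */
}

/* New convention: success is reported separately; the address (even 0x00)
 * comes back through a pointer. */
static bool enable_window_new(unsigned long long *addr)
{
	*addr = 0;
	return true;
}

int main(void)
{
	unsigned long long addr;

	if (!enable_window_old())
		printf("old style: a window at 0x00 is misread as failure\n");
	if (enable_window_new(&addr))
		printf("new style: window at %#llx accepted\n", addr);
	return 0;
}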

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 35 --
 1 file changed, 16 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 6f14894d2d04..955cf095416c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -849,25 +849,26 @@ static void remove_ddw(struct device_node *np, bool 
remove_prop)
np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
+static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int 
*window_shift)
 {
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
-   u64 dma_addr = 0;
+   bool found = false;
 
spin_lock(&direct_window_list_lock);
/* check if we already created a window and dupe that config if so */
list_for_each_entry(window, &direct_window_list, list) {
if (window->device == pdn) {
direct64 = window->prop;
-   dma_addr = be64_to_cpu(direct64->dma_base);
+   *dma_addr = be64_to_cpu(direct64->dma_base);
*window_shift = be32_to_cpu(direct64->window_shift);
+   found = true;
break;
}
}
spin_unlock(&direct_window_list_lock);
 
-   return dma_addr;
+   return found;
 }
 
 static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
@@ -1157,20 +1158,19 @@ static int iommu_get_page_shift(u32 query_page_size)
  * pdn: the parent pe node with the ibm,dma_window property
  * Future: also check if we can remap the base window for our base page size
  *
- * returns the dma offset for use by the direct mapped DMA code.
+ * returns true if it can map all pages (direct mapping), false otherwise.
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
int len = 0, ret;
int max_ram_len = order_base_2(ddw_memory_hotplug_max());
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
-   u64 dma_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
-   struct property *win64;
+   struct property *win64 = NULL;
struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
@@ -1182,8 +1182,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 
mutex_lock(&direct_window_init_mutex);
 
-   dma_addr = find_existing_ddw(pdn, &len);
-   if (dma_addr != 0)
+   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset, &len))
goto out_unlock;
 
/*
@@ -1338,7 +1337,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
-   list_add(&window->list, &direct_window_list);
-   spin_unlock(&direct_window_list_lock);
 
-   dma_addr = be64_to_cpu(ddwprop->dma_base);
+   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
goto out_unlock;
 
 out_free_window:
@@ -1351,6 +1350,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
kfree(win64->name);
kfree(win64->value);
kfree(win64);
+   win64 = NULL;
 
 out_failed:
if (default_win_removed)
@@ -1370,10 +1370,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
 * as RAM, then we failed to create a window to cover persistent
 * memory and need to set the DMA limit.
 */
-   if (pmem_present && dma_addr && (len == max_ram_len))
-   dev->dev.bus_dma_limit = dma_addr + (1ULL << len);
+   if (pmem_present && win64 && (len == max_ram_len))
+   dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL 
<< len);
 
-   return dma_addr;
+   return win64;
 }
 
 static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
@@ -1452,11 +1452,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct 
pci_dev *pdev, u64 dma_mask)
break;
}
 
-   if (pdn && PCI_DN(pdn)) {
-   pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn);
-   if 

[PATCH v3 04/11] powerpc/pseries/iommu: Add ddw_list_new_entry() helper

2021-04-22 Thread Leonardo Bras
There are two functions creating direct_window_list entries in a
similar way, so create a ddw_list_new_entry() to avoid duplication and
simplify those functions.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 32 +-
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index d02359ca1f9f..6f14894d2d04 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -870,6 +870,21 @@ static u64 find_existing_ddw(struct device_node *pdn, int 
*window_shift)
return dma_addr;
 }
 
+static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
+   const struct 
dynamic_dma_window_prop *dma64)
+{
+   struct direct_window *window;
+
+   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   if (!window)
+   return NULL;
+
+   window->device = pdn;
+   window->prop = dma64;
+
+   return window;
+}
+
 static int find_existing_ddw_windows(void)
 {
int len;
@@ -882,18 +897,15 @@ static int find_existing_ddw_windows(void)
 
for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
 		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-   if (!direct64)
-   continue;
-
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
-   if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
-   kfree(window);
+   if (!direct64 || len < sizeof(*direct64)) {
remove_ddw(pdn, true);
continue;
}
 
-   window->device = pdn;
-   window->prop = direct64;
+   window = ddw_list_new_entry(pdn, direct64);
+   if (!window)
+   break;
+
 		spin_lock(&direct_window_list_lock);
 		list_add(&window->list, &direct_window_list);
 		spin_unlock(&direct_window_list_lock);
@@ -1303,7 +1315,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
  create.liobn, dn);
 
-   window = kzalloc(sizeof(*window), GFP_KERNEL);
+   window = ddw_list_new_entry(pdn, ddwprop);
if (!window)
goto out_clear_window;
 
@@ -1322,8 +1334,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_window;
}
 
-   window->device = pdn;
-   window->prop = ddwprop;
 	spin_lock(&direct_window_list_lock);
 	list_add(&window->list, &direct_window_list);
 	spin_unlock(&direct_window_list_lock);
-- 
2.30.2



[PATCH v3 03/11] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper

2021-04-22 Thread Leonardo Bras
Creates a helper to allow allocating a new iommu_table without the need
to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window,
after we remove the old one with iommu_tce_table_put().

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/platforms/pseries/iommu.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 796ab356341c..d02359ca1f9f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
-   struct iommu_table_group *table_group;
struct iommu_table *tbl;
 
-   table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
-  node);
-   if (!table_group)
-   return NULL;
-
tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
if (!tbl)
-   goto free_group;
+   return NULL;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
 	kref_init(&tbl->it_kref);
+   return tbl;
+}
 
-   table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+   struct iommu_table_group *table_group;
+
+   table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+   if (!table_group)
+   return NULL;
 
-   return table_group;
+   table_group->tables[0] = iommu_pseries_alloc_table(node);
+   if (table_group->tables[0])
+   return table_group;
 
-free_group:
kfree(table_group);
return NULL;
 }
-- 
2.30.2



[PATCH v3 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper

2021-04-22 Thread Leonardo Bras
Having a function to check if the iommu table has any allocation helps
decide whether a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...).

iommu_table_in_use() skips reserved memory, so we don't need to worry about
releasing it before testing. This causes iommu_table_release_pages() to
become unnecessary, given it is only used to remove reserved memory for
testing.

Also, only allow storing reserved memory values in tbl if they are valid
in the table, so there is no need to check it in the new helper.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/iommu.h |  1 +
 arch/powerpc/kernel/iommu.c  | 65 
 2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl,
int nid, unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES   2
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ad82dda81640..5e168bd91401 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -691,32 +691,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
if (tbl->it_offset == 0)
set_bit(0, tbl->it_map);
 
-   tbl->it_reserved_start = res_start;
-   tbl->it_reserved_end = res_end;
-
-   /* Check if res_start..res_end isn't empty and overlaps the table */
-   if (res_start && res_end &&
-   (tbl->it_offset + tbl->it_size < res_start ||
-res_end < tbl->it_offset))
-   return;
+   if (res_start < tbl->it_offset)
+   res_start = tbl->it_offset;
 
-   for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   set_bit(i - tbl->it_offset, tbl->it_map);
-}
+   if (res_end > (tbl->it_offset + tbl->it_size))
+   res_end = tbl->it_offset + tbl->it_size;
 
-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
-   int i;
+   /* Check if res_start..res_end is a valid range in the table */
+   if (res_start >= res_end) {
+   tbl->it_reserved_start = tbl->it_offset;
+   tbl->it_reserved_end = tbl->it_offset;
+   return;
+   }
 
-   /*
-* In case we have reserved the first bit, we should not emit
-* the warning below.
-*/
-   if (tbl->it_offset == 0)
-   clear_bit(0, tbl->it_map);
+   tbl->it_reserved_start = res_start;
+   tbl->it_reserved_end = res_end;
 
for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-   clear_bit(i - tbl->it_offset, tbl->it_map);
+   set_bit(i - tbl->it_offset, tbl->it_map);
 }
 
 /*
@@ -781,6 +773,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
return tbl;
 }
 
+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+   unsigned long start = 0, end;
+
+   /* ignore reserved bit0 */
+   if (tbl->it_offset == 0)
+   start = 1;
+   end = tbl->it_reserved_start - tbl->it_offset;
+   if (find_next_bit(tbl->it_map, end, start) != end)
+   return true;
+
+   start = tbl->it_reserved_end - tbl->it_offset;
+   end = tbl->it_size;
+   return find_next_bit(tbl->it_map, end, start) != end;
+}
+
 static void iommu_table_free(struct kref *kref)
 {
unsigned long bitmap_sz;
@@ -799,10 +807,8 @@ static void iommu_table_free(struct kref *kref)
 
iommu_debugfs_del(tbl);
 
-   iommu_table_release_pages(tbl);
-
/* verify that table contains no entries */
-   if (!bitmap_empty(tbl->it_map, tbl->it_size))
+   if (iommu_table_in_use(tbl))
pr_warn("%s: Unexpected TCEs\n", __func__);
 
/* calculate bitmap size in bytes */
@@ -1108,18 +1114,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
 
-   iommu_table_release_pages(tbl);
-
-   if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+   if (iommu_table_in_use(tbl)) {
pr_err("iommu_tce: it_map is not empty");
ret = -EBUSY;
-   /* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-   iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-   tbl->it_reserved_end);
-   } else {
-   memset(tbl->it_map, 0xff, sz);
}
 
+   memset(tbl->it_map, 0xff, sz);
+
for (i = 0; i < tbl->nr_pools; i++)

[PATCH v3 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

2021-04-22 Thread Leonardo Bras
Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5, shows
a 52-bit RPN and considers a 12-bit pageshift, so there should be no need
to use TCE_RPN_MASK, which masks out any bit after 40 in the rpn. Its
usage is removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().

Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers do not always have
a tbl struct, so adding a tceshift parameter seems the right thing to do.
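As an illustrative aside (example values only, not code from this patch):
with a 12-bit page shift a 40-bit RPN only covers addresses below 2^52, so
keeping the old mask would silently truncate the RPN for memory above that
boundary, while the full 52-bit RPN does not.

/* Illustrative only; uaddr is an arbitrary example address above 2^52. */
#include <stdio.h>

int main(void)
{
	unsigned long long uaddr  = 1ULL << 53;
	unsigned long long rpn    = uaddr >> 12;			/* 52-bit RPN, 12-bit pageshift */
	unsigned long long masked = (rpn & 0xfffffffffful) << 12;	/* old 40-bit TCE_RPN_MASK */
	unsigned long long kept   = rpn << 12;				/* full RPN, as done now */

	printf("masked=%#llx kept=%#llx\n", masked, kept);	/* masked is 0, kept == uaddr */
	return 0;
}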

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
 arch/powerpc/include/asm/tce.h |  8 --
 arch/powerpc/platforms/pseries/iommu.c | 39 +++---
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
 #define TCE_VB 0
 #define TCE_PCI1
 
-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT  12
-#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
-
 #define TCE_ENTRY_SIZE 8   /* each TCE is 64 bits */
-
-#define TCE_RPN_MASK   0xfffffffffful  /* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT  12
 #define TCE_VALID  0x800   /* TCE valid */
 #define TCE_ALLIO  0x400   /* TCE valid for all lpars */
 #define TCE_PCI_WRITE  0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 67c9953a6503..796ab356341c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+   const unsigned long tceshift = tbl->it_page_shift;
+   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
 
proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 
while (npages--) {
/* can't move this out since we might cross MEMBLOCK boundary */
-   rpn = __pa(uaddr) >> TCE_SHIFT;
-		*tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+   rpn = __pa(uaddr) >> tceshift;
+   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
 
-   uaddr += TCE_PAGE_SIZE;
+   uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
return be64_to_cpu(*tcep);
 }
 
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;
 
while (npages--) {
-   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+   tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
 
if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
ret = (int)rc;
-   tce_free_pSeriesLP(liobn, tcenum_start,
+   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
   (npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+   const unsigned long tceshift = tbl->it_page_shift;
 
if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  tbl->it_page_shift, npages, uaddr,
+  tceshift, npages, uaddr,
   direction, attrs);
}
 
@@ -225,13 +228,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
if (!tcep) {
local_irq_restore(flags);
return tce_build_pSeriesLP(tbl->it_index, tcenum,
-  

Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Rick Lindsley

On 4/21/21 10:30 PM, Lijun Pan wrote:

Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
Signed-off-by: Dany Madden 
Reviewed-by: Rick Lindsley 
Reviewed-by: Sukadev Bhattiprolu 


One thing I would like to point out as already pointed out by Nathan Lynch is
that those review-by tags given by the same groups of people from the same
company loses credibility over time if you never critique or ask
questions on the list.



Well, so far you aren't addressing either my critiques or questions.

I have been asking questions but all I have from you are the above
attempts to discredit the reputation of myself and other people, and
non-technical statements like

will make the code very difficult to manage
I think there should be a trade off between optimization and stability.
So I don't think you could even compare the two results

On the other hand, from the original submission I see some very specific
details:

If ibmvnic abandons the reset because of this failed set link
down and this is the last reset in the workqueue, then this
adapter will be left in an inoperable state.

and from a followup discussion:

We had a FATAL error and when handling it, we failed to
send a link-down message to the VIOS. So what we need
to try next is to reset the connection with the VIOS. For
this we must ...

These are great technical points that could be argued or discussed.
Problem is, I agree with them.

I will ask again:  can you please supply some technical reasons for
your objections.  Otherwise, your objections are meritless and at worst
simply an ad hominem attack.

Rick


[PATCH v3 00/11] DDW + Indirect Mapping

2021-04-22 Thread Leonardo Bras
So far it has been assumed possible to map the guest RAM 1:1 to the bus,
which works with a small number of devices. SRIOV changes this, as the
user can configure hundreds of VFs; and since phyp preallocates TCEs and
does not allow IOMMU pages bigger than 64K, it has to limit the number of
TCEs per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation
is skipped and the default DMA window "ibm,dma-window" is used instead.

Using the DDW instead of the default DMA window may allow expanding the
amount of memory that can be DMA-mapped, given the number of pages (TCEs)
may stay the same (or increase) and the default DMA window offers only
4k pages while DDW may offer larger pages (4k, 64k, 16M ...).
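As a rough illustration of that trade-off (the window size and shifts below
are made-up examples, not values taken from the patches), the number of TCEs
needed to cover a window shrinks directly with the IOMMU page shift:

/* Illustrative arithmetic only; window_size and the shifts are example
 * values, not kernel symbols. */
#include <stdio.h>

int main(void)
{
	unsigned long long window_size = 64ULL << 30;	/* e.g. a 64GB DMA window */
	unsigned int shifts[] = { 12, 16, 24 };		/* 4k, 64k and 16M IOMMU pages */

	for (int i = 0; i < 3; i++)
		printf("page shift %2u -> %llu TCEs\n",
		       shifts[i], window_size >> shifts[i]);
	return 0;
}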

Patch #1 replaces hard-coded 4K page size with a variable containing the
correct page size for the window.

Patch #2 introduces iommu_table_in_use(), and replaces manual bit-field
checking where it's used. It will be used for aborting enable_ddw() if
there is any current iommu allocation and we are trying single-window
indirect mapping.

Patch #3 introduces iommu_pseries_alloc_table() that will be helpful
when indirect mapping needs to replace the iommu_table.

Patch #4 adds helpers for adding DDWs in the list.

Patch #5 refactors enable_ddw() so it returns whether direct mapping is
possible, instead of the DMA offset. It helps the next patches on
indirect DMA mapping and also allows DMA windows starting at 0x00.

Patch #6 brings a new helper to simplify enable_ddw(), allowing
some reorganization for introducing indirect-mapping DDW.

Patch #7 adds a new helper, _iommu_table_setparms(), and uses it in the
other *setparms*() functions to fill the iommu_table. It will also be used
for creating a new iommu_table for indirect mapping.

Patch #8 updates remove_dma_window() to accept different property names,
so we can introduce a new property for indirect mapping.

Patch #9 extracts find_existing_ddw_windows() into
find_existing_ddw_windows_named(), and calls it by its property name.
This will be useful when the property for indirect mapping is created,
so we can search the device-tree for both properties.

Patch #10:
Instead of destroying the created DDW if it doesn't map the whole
partition, make use of it instead of the default DMA window as it improves
performance. Also, update the iommu_table and re-generate the pools.
It introduces a new property name for DDW with indirect DMA mapping.

Patch #11:
Does some renaming of 'direct window' to 'dma window', given the DDW
created can now be also used in indirect mapping if direct mapping is not
available.

All patches were tested in an LPAR with a virtio-net interface that
allows the default DMA window and DDW to coexist.

Leonardo Bras (11):
  powerpc/pseries/iommu: Replace hard-coded page shift
  powerpc/kernel/iommu: Add new iommu_table_in_use() helper
  powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
  powerpc/pseries/iommu: Add ddw_list_new_entry() helper
  powerpc/pseries/iommu: Allow DDW windows starting at 0x00
  powerpc/pseries/iommu: Add ddw_property_create() and refactor
enable_ddw()
  powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new
helper
  powerpc/pseries/iommu: Update remove_dma_window() to accept property
name
  powerpc/pseries/iommu: Find existing DDW with given property name
  powerpc/pseries/iommu: Make use of DDW for indirect mapping
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/include/asm/iommu.h   |   1 +
 arch/powerpc/include/asm/tce.h |   8 -
 arch/powerpc/kernel/iommu.c|  65 ++--
 arch/powerpc/platforms/pseries/iommu.c | 504 +++--
 4 files changed, 338 insertions(+), 240 deletions(-)

-- 
2.30.2



Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Rick Lindsley

On 4/21/21 10:06 PM, Lijun Pan wrote:

No real customer runs the system under that heavy load created by
HTX stress test, which can tear down any working system.


So, are you saying the bugs that HTX uncovers are not worth fixing?

There's a fair number of individuals and teams out there that
utilize HTX in the absence of such a workload, and not just for vnic.
What workload would you suggest to better serve "real customers"?


I think such optimizations are catered for passing HTX tests.


Well, yes.  If the bugs are found with HTX, the fixes are verified
against HTX.  If they were found with a "real customer workload" they'd
be verified against a "real customer workload."


Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Rick Lindsley

On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley
 wrote:


Please describe the advantage in deferring it further by routing it through
do_hard_reset().  I don't see one.


On 4/21/21 10:12 PM, Lijun Pan replied:


It is not deferred. It exits with error and calls do_hard_reset.
See my reply to Suka's.


I saw your reply, but it does not answer the question I asked.  The patch
would have us reinitialize and restart the communication queues.  Your
suggestion would do more work than that.  Please describe the advantage
in deferring the reinitialization - and yes, defer is the right word -
by routing it through do_hard_reset() and executing that extra code.  I
see that route as doing more work than necessary and so introducing
additional risk, for no clear advantage.  So I would find it helpful
if you would describe the advantage.


The testing was done on this patch. It was not performed on a full hard reset.
So I don't think you could even compare the two results.


A problem has been noted, a patch has been proposed, and the reasoning that
the patch is correct has been given.  Testing with this patch has
demonstrated the problem has not returned.  So far, that sounds like a
pretty reasonable solution.

Your comment is curious - why would testing for this patch be done on a full
hard reset when this patch does not invoke a full hard reset?  If you have
data to consider then let's have it. I'm willing to be convinced, but so far
this just sounds like "I wouldn't do it that way myself, and I have a bad
feeling about doing it any other way."

Rick


Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed

2021-04-22 Thread Sukadev Bhattiprolu
Lijun Pan [lijunp...@gmail.com] wrote:
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
> 
> Working around everything in do_reset will make the code very difficult

We are not working around everything. We are doing in do_reset()
exactly what we would do in hard reset for this error (ignore the
set link down error and try to reestablish the connection with the
VIOS).

What we are avoiding is unnecessary work on the Linux side for a
communication problem on the VIOS side.

> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.
> 
> >
> > If we are having a communication problem with the VIOS, what is
> > the point of freeing and reallocating Linux kernel buffers? Beside
> > being inefficient, it would expose us to even more errors during
> > reset under heavy workloads?
> 
> No real customer runs the system under that heavy load created by
> HTX stress test, which can tear down any working system.

We need to talk to capacity planning and test architects about that,
but all I want to know is what hard reset would do differently to
fix this communication error with VIOS.

Sukadev


Re: [V3 PATCH 16/16] crypto/nx: Add sysfs interface to export NX capabilities

2021-04-22 Thread Herbert Xu
On Sat, Apr 17, 2021 at 02:13:40PM -0700, Haren Myneni wrote:
> 
> Changes to export the following NXGZIP capabilities through sysfs:
> 
> /sys/devices/vio/ibm,compression-v1/NxGzCaps:
> min_compress_len  /*Recommended minimum compress length in bytes*/
> min_decompress_len /*Recommended minimum decompress length in bytes*/
> req_max_processed_len /* Maximum number of bytes processed in one
>   request */
> 
> Signed-off-by: Haren Myneni 
> ---
>  drivers/crypto/nx/nx-common-pseries.c | 43 +++
>  1 file changed, 43 insertions(+)

Acked-by: Herbert Xu 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [V3 PATCH 15/16] crypto/nx: Get NX capabilities for GZIP coprocessor type

2021-04-22 Thread Herbert Xu
On Sat, Apr 17, 2021 at 02:12:51PM -0700, Haren Myneni wrote:
> 
> phyp provides NX capabilities which gives recommended minimum
> compression / decompression length and maximum request buffer size
> in bytes.
> 
> Changes to get NX overall capabilities which points to the specific
> features phyp supports. Then retrieve NXGZIP specific capabilities.
> 
> Signed-off-by: Haren Myneni 
> ---
>  drivers/crypto/nx/nx-common-pseries.c | 83 +++
>  1 file changed, 83 insertions(+)

Acked-by: Herbert Xu 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [V3 PATCH 14/16] crypto/nx: Register and unregister VAS interface

2021-04-22 Thread Herbert Xu
On Sat, Apr 17, 2021 at 02:12:12PM -0700, Haren Myneni wrote:
> 
> Changes to create /dev/crypto/nx-gzip interface with VAS register
> and to remove this interface with VAS unregister.
> 
> Signed-off-by: Haren Myneni 
> ---
>  drivers/crypto/nx/Kconfig | 1 +
>  drivers/crypto/nx/nx-common-pseries.c | 9 +
>  2 files changed, 10 insertions(+)

Acked-by: Herbert Xu 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [V3 PATCH 13/16] crypto/nx: Rename nx-842-pseries file name to nx-common-pseries

2021-04-22 Thread Herbert Xu
On Sat, Apr 17, 2021 at 02:11:15PM -0700, Haren Myneni wrote:
> 
> Rename nx-842-pseries.c to nx-common-pseries.c to add code for new
> GZIP compression type. The actual functionality is not changed in
> this patch.
> 
> Signed-off-by: Haren Myneni 
> ---
>  drivers/crypto/nx/Makefile  | 2 +-
>  drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} | 0
>  2 files changed, 1 insertion(+), 1 deletion(-)
>  rename drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} (100%)
> 
> diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
> index bc89a20e5d9d..d00181a26dd6 100644
> --- a/drivers/crypto/nx/Makefile
> +++ b/drivers/crypto/nx/Makefile
> @@ -14,5 +14,5 @@ nx-crypto-objs := nx.o \
>  obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o nx-compress.o
>  obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o nx-compress.o
>  nx-compress-objs := nx-842.o
> -nx-compress-pseries-objs := nx-842-pseries.o
> +nx-compress-pseries-objs := nx-common-pseries.o
>  nx-compress-powernv-objs := nx-common-powernv.o
> diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> similarity index 100%
> rename from drivers/crypto/nx/nx-842-pseries.c
> rename to drivers/crypto/nx/nx-common-pseries.c

Acked-by: Herbert Xu 
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 1/1] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs

2021-04-22 Thread Leonardo Bras
Hello,

This patch was also reviewed when it was part of another patchset:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200911170738.82818-4-leobra...@gmail.com/

On Thu, 2021-03-18 at 14:44 -0300, Leonardo Bras wrote:
> Currently both iommu_alloc_coherent() and iommu_free_coherent() align the
> desired allocation size to PAGE_SIZE, and gets system pages and IOMMU
> mappings (TCEs) for that value.
> 
> When IOMMU_PAGE_SIZE < PAGE_SIZE, this behavior may cause unnecessary
> TCEs to be created for mapping the whole system page.
> 
> Example:
> - PAGE_SIZE = 64k, IOMMU_PAGE_SIZE() = 4k
> - iommu_alloc_coherent() is called for 128 bytes
> - 1 system page (64k) is allocated
> - 16 IOMMU pages (16 x 4k) are allocated (16 TCEs used)
> 
> It would be enough to use a single TCE for this, so 15 TCEs are
> wasted in the process.
> 
> Update iommu_*_coherent() to make sure the size alignment happens only
> for IOMMU_PAGE_SIZE() before calling iommu_alloc() and iommu_free().
> 
> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> with IOMMU_PAGE_ALIGN(n, tbl), which is easier to read and does the
> same.
> 
> Signed-off-by: Leonardo Bras 
> Reviewed-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/kernel/iommu.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5b69a6a72a0e..3329ef045805 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -851,6 +851,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>   unsigned int order;
>   unsigned int nio_pages, io_order;
>   struct page *page;
> + size_t size_io = size;
>  
> 
>   size = PAGE_ALIGN(size);
>   order = get_order(size);
> @@ -877,8 +878,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>   memset(ret, 0, size);
>  
> 
>   /* Set up tces to cover the allocated range */
> - nio_pages = size >> tbl->it_page_shift;
> - io_order = get_iommu_order(size, tbl);
> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> + nio_pages = size_io >> tbl->it_page_shift;
> + io_order = get_iommu_order(size_io, tbl);
>   mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
>     mask >> tbl->it_page_shift, io_order, 0);
>   if (mapping == DMA_MAPPING_ERROR) {
> @@ -893,10 +895,9 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>    void *vaddr, dma_addr_t dma_handle)
>  {
>   if (tbl) {
> - unsigned int nio_pages;
> + size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> + unsigned int nio_pages = size_io >> tbl->it_page_shift;
>  
> 
> - size = PAGE_ALIGN(size);
> - nio_pages = size >> tbl->it_page_shift;
>   iommu_free(tbl, dma_handle, nio_pages);
>   size = PAGE_ALIGN(size);
>   free_pages((unsigned long)vaddr, get_order(size));
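Spelling out the arithmetic from the example in the commit message above
(same numbers, shown as a standalone sketch rather than kernel code):

/* Same example as above: PAGE_SIZE = 64k, IOMMU page = 4k, request = 128 bytes. */
#include <stdio.h>

#define ALIGN_UP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
	unsigned long size = 128, sys_page = 64 * 1024, io_page = 4 * 1024;

	printf("old: %lu TCEs\n", ALIGN_UP(size, sys_page) / io_page);	/* 16 */
	printf("new: %lu TCE\n",  ALIGN_UP(size, io_page) / io_page);	/* 1 */
	return 0;
}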




Re: [PATCH 1/1] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc

2021-04-22 Thread Leonardo Bras
Hello,

FYI: This patch was reviewed when it was part of another patchset:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200817234033.442511-4-leobra...@gmail.com/


On Thu, 2021-03-18 at 14:44 -0300, Leonardo Bras wrote:
> As of today, doing iommu_range_alloc() only for !largealloc (npages <= 15)
> will only be able to use 3/4 of the available pages, given pages on
> largepool  not being available for !largealloc.
> 
> This could mean some drivers not being able to fully use all the available
> pages for the DMA window.
> 
> Add pages on largepool as a last resort for !largealloc, making all pages
> of the DMA window available.
> 
> Signed-off-by: Leonardo Bras 
> Reviewed-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/kernel/iommu.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 3329ef045805..ae6ad8dca605 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -255,6 +255,15 @@ static unsigned long iommu_range_alloc(struct device *dev,
>   pass++;
>   goto again;
>  
> 
> + } else if (pass == tbl->nr_pools + 1) {
> + /* Last resort: try largepool */
> +			spin_unlock(&pool->lock);
> +			pool = &tbl->large_pool;
> +			spin_lock(&pool->lock);
> + pool->hint = pool->start;
> + pass++;
> + goto again;
> +
>   } else {
>   /* Give up */
>   spin_unlock_irqrestore(&(pool->lock), flags);
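As a side note on the "3/4 of the available pages" figure in the commit
message: the window's TCE entries are split between several small pools and
one large pool. The sketch below uses example numbers and mirrors a reading
of iommu_init_table(); treat it as illustrative only, not kernel code.

/* Example numbers only; the 3/4 split mirrors a reading of iommu_init_table(). */
#include <stdio.h>

int main(void)
{
	unsigned long it_size  = 1UL << 20;	/* e.g. 1M TCE entries in the window */
	unsigned int  nr_pools = 4;		/* e.g. four small pools */

	unsigned long poolsize  = (it_size * 3 / 4) / nr_pools;
	unsigned long largepool = it_size - nr_pools * poolsize;

	printf("small pools: %lu entries, large pool: %lu entries\n",
	       nr_pools * poolsize, largepool);
	/* Without the fallback, !largealloc requests only ever search the small
	 * pools (~3/4 of the entries); with it they can also reach the large pool. */
	return 0;
}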




Re: [PATCH] powerpc/mce: save ignore_event flag unconditionally for UE

2021-04-22 Thread Ganesh



On 4/22/21 11:31 AM, Ganesh wrote:

On 4/7/21 10:28 AM, Ganesh Goudar wrote:


When we hit a UE while using the machine check safe copy routines, the
ignore_event flag is set and the event is ignored by the MCE handler.
The flag is also saved for deferred handling and printing of the MCE
event information. But as of now, this flag is only saved after checking
whether the effective address is provided or the physical address is
calculated, which is not right.

Save the ignore_event flag regardless of whether the effective address is
provided or the physical address is calculated.

Without this change, the following log is seen when the event is to be
ignored.

[  512.971365] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]

[  512.971509] MCE: CPU1: NIP: [c00b67c0] memcpy+0x40/0x90
[  512.971655] MCE: CPU1: Initiator CPU
[  512.971739] MCE: CPU1: Unknown
[  512.972209] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]

[  512.972334] MCE: CPU1: NIP: [c00b6808] memcpy+0x88/0x90
[  512.972456] MCE: CPU1: Initiator CPU
[  512.972534] MCE: CPU1: Unknown

Signed-off-by: Ganesh Goudar 
---
  arch/powerpc/kernel/mce.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)


Hi mpe, Any comments on this patch?

Please ignore, I see it's applied.


Re: [PATCH] powerpc/mce: save ignore_event flag unconditionally for UE

2021-04-22 Thread Ganesh

On 4/7/21 10:28 AM, Ganesh Goudar wrote:


When we hit a UE while using the machine check safe copy routines, the
ignore_event flag is set and the event is ignored by the MCE handler.
The flag is also saved for deferred handling and printing of the MCE
event information. But as of now, this flag is only saved after checking
whether the effective address is provided or the physical address is
calculated, which is not right.

Save the ignore_event flag regardless of whether the effective address is
provided or the physical address is calculated.

Without this change, the following log is seen when the event is to be
ignored.

[  512.971365] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]
[  512.971509] MCE: CPU1: NIP: [c00b67c0] memcpy+0x40/0x90
[  512.971655] MCE: CPU1: Initiator CPU
[  512.971739] MCE: CPU1: Unknown
[  512.972209] MCE: CPU1: machine check (Severe)  UE Load/Store [Recovered]
[  512.972334] MCE: CPU1: NIP: [c00b6808] memcpy+0x88/0x90
[  512.972456] MCE: CPU1: Initiator CPU
[  512.972534] MCE: CPU1: Unknown

Signed-off-by: Ganesh Goudar 
---
  arch/powerpc/kernel/mce.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)


Hi mpe, Any comments on this patch?
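
For readers without the patch body at hand, the described change boils down
to the toy model below. This is a reconstruction from the commit message
only, not the applied mce.c diff; the struct and function names here are
invented for illustration.

/* Toy model of the described fix; not the kernel's mce.c. */
#include <stdbool.h>
#include <stdio.h>
#include <limits.h>

struct ue_info {
	bool ignore_event;
	bool physical_address_provided;
	unsigned long physical_address;
};

static void save_ue_event(struct ue_info *ue, bool ignore_event,
			  unsigned long phys_addr)
{
	/* Save the flag for every UE, as the patch does ... */
	ue->ignore_event = ignore_event;

	/* ... while only the address-dependent fields stay conditional. */
	if (phys_addr != ULONG_MAX) {
		ue->physical_address_provided = true;
		ue->physical_address = phys_addr;
	}
}

int main(void)
{
	struct ue_info ue = { 0 };

	save_ue_event(&ue, true, ULONG_MAX);		/* no physical address available */
	printf("ignore_event=%d\n", ue.ignore_event);	/* still 1 with the fix */
	return 0;
}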