Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor
On 23/04/2021 at 00:58, Daniel Axtens wrote: diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index c9d2c7825cd6..3a2f2001c62b 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr) KBUILD_AFLAGS += $(AFLAGS-y) KBUILD_CFLAGS += $(call cc-option,-msoft-float) KBUILD_CFLAGS += -pipe $(CFLAGS-y) -CPP= $(CC) -E $(KBUILD_CFLAGS) I was trying to check the history to see why powerpc has its own definition. It seems to date back at least as far as merging the two powerpc platforms into one, maybe it was helpful then. I agree it doesn't seem to be needed now. I dug a bit deeper. It has been there since the introduction of arch/ppc/ in Linux 1.3.45. At the time, there was the exact same CPP definition in arch/mips/Makefile. The CPP definition in mips disappeared in Linux 2.1.44pre3. Since that version, ppc has been the only one with such a CPP re-definition. Snowpatch claims that this breaks the build, but I haven't been able to reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it off to a fork of mpe's linux-ci repo to see if any of those builds hit any issues: https://github.com/daxtens/linux-ci/actions By the way, I find snowpatch quite useless these days. It seems to delete the reports a few minutes after the test. We are less than one day after the patch was submitted and the reports of the build failures are already gone. Christophe
Re: [PATCH 7/7] swiotlb: don't override the command line in swiotlb_adjust_size
On 4/22/21 2:19 AM, Christoph Hellwig wrote: > When the user specified an explicit swiotlb size on the command line, > the architecture code should not override it. > > Fixes: 2cbc2776efe4 ("swiotlb: remove swiotlb_nr_tbl") > Reported-by: Tom Lendacky > Signed-off-by: Christoph Hellwig I tested this patchset and I'm not able to get the override via the command line or via the SEV adjustment function. Looking at the code, swiotlb_default_size is initialized, so the call to swiotlb_adjust_size() always returns without setting the new size. Thanks, Tom > --- > kernel/dma/swiotlb.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index 87d06ddf4753f3..aef02a3825b494 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -106,7 +106,9 @@ void swiotlb_set_max_segment(unsigned int val) > > unsigned long swiotlb_size_or_default(void) > { > - return swiotlb_default_size; > + if (swiotlb_default_size) > + return swiotlb_default_size; > + return IO_TLB_DEFAULT_SIZE; > } > > void __init swiotlb_adjust_size(unsigned long size) > @@ -116,6 +118,8 @@ void __init swiotlb_adjust_size(unsigned long size) > * architectures such as those supporting memory encryption to > * adjust/expand SWIOTLB size for their use. > */ > + if (swiotlb_default_size) > + return; > swiotlb_default_size = ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT); > pr_info("SWIOTLB bounce buffer size adjusted to %luMB", > swiotlb_default_size >> 20); >
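The ordering issue Tom describes can be reduced to a small userspace model (hypothetical names; a sketch of the intended guard logic, not the kernel code): the `if (swiotlb_default_size) return;` guard only lets an adjustment through while the variable is still zero, so any earlier initialization of `swiotlb_default_size` makes every later call a no-op.

```c
#include <assert.h>

#define MODEL_IO_TLB_DEFAULT_SIZE (64UL << 20)	/* 64 MB fallback */

static unsigned long model_default_size;	/* 0 means "not set yet" */

/* Model of the guarded adjust: only the first caller may set the size. */
static void model_adjust_size(unsigned long size)
{
	if (model_default_size)	/* already set: respect the earlier value */
		return;
	model_default_size = size;
}

static unsigned long model_size_or_default(void)
{
	if (model_default_size)
		return model_default_size;
	return MODEL_IO_TLB_DEFAULT_SIZE;
}
```

Left zero-initialized, the first caller — command-line parsing or the SEV adjustment — wins and later callers are ignored; if the variable is already initialized before either runs, as in the reported failure, no adjustment ever takes effect.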
[powerpc:next] BUILD SUCCESS ee6b25fa7c037e42cc5f3b5c024b2a779edab6dd
ig-a002-20210421 x86_64 randconfig-a001-20210421 x86_64 randconfig-a005-20210421 x86_64 randconfig-a006-20210421 x86_64 randconfig-a003-20210421 i386 randconfig-a005-20210421 i386 randconfig-a002-20210421 i386 randconfig-a001-20210421 i386 randconfig-a006-20210421 i386 randconfig-a004-20210421 i386 randconfig-a003-20210421 i386 randconfig-a012-20210421 i386 randconfig-a014-20210421 i386 randconfig-a011-20210421 i386 randconfig-a013-20210421 i386 randconfig-a015-20210421 i386 randconfig-a016-20210421 i386 randconfig-a014-20210422 i386 randconfig-a012-20210422 i386 randconfig-a011-20210422 i386 randconfig-a013-20210422 i386 randconfig-a015-20210422 i386 randconfig-a016-20210422 riscv nommu_k210_defconfig riscv nommu_virt_defconfig riscv allnoconfig riscv defconfig riscv rv32_defconfig um allmodconfig um allnoconfig um allyesconfig um defconfig x86_64 rhel-8.3-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 rhel-8.3-kbuiltin x86_64 kexec clang tested configs: x86_64 randconfig-a015-20210421 x86_64 randconfig-a016-20210421 x86_64 randconfig-a011-20210421 x86_64 randconfig-a014-20210421 x86_64 randconfig-a013-20210421 x86_64 randconfig-a012-20210421 x86_64 randconfig-a002-20210422 x86_64 randconfig-a004-20210422 x86_64 randconfig-a001-20210422 x86_64 randconfig-a005-20210422 x86_64 randconfig-a006-20210422 x86_64 randconfig-a003-20210422 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
[PATCH 4/4] powerpc/pseries: warn if recursing into the hcall tracing code
--- arch/powerpc/platforms/pseries/lpar.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index 835e7f661a05..a961a7ebeab3 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -1828,8 +1828,11 @@ void hcall_tracepoint_unregfunc(void) /* * Since the tracing code might execute hcalls we need to guard against - * recursion. H_CONFER from spin locks must be treated separately though - * and use _notrace plpar_hcall variants, see yield_to_preempted(). + * recursion, but this always seems risky -- __trace_hcall_entry might be + * ftraced, for example. So warn in this case. + * + * H_CONFER from spin locks must be treated separately though and use _notrace + * plpar_hcall variants, see yield_to_preempted(). */ static DEFINE_PER_CPU(unsigned int, hcall_trace_depth); @@ -1843,7 +1846,7 @@ notrace void __trace_hcall_entry(unsigned long opcode, unsigned long *args) depth = this_cpu_ptr(&hcall_trace_depth); - if (*depth) + if (WARN_ON_ONCE(*depth)) goto out; (*depth)++; @@ -1864,7 +1867,7 @@ notrace void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf) depth = this_cpu_ptr(&hcall_trace_depth); - if (*depth) + if (*depth) /* Don't warn again on the way out */ goto out; (*depth)++; -- 2.23.0
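The depth guard this series hardens can be modelled in a few lines (a single-CPU toy with hypothetical names; the real code uses a per-CPU counter and WARN_ON_ONCE on the entry side):

```c
#include <assert.h>

static unsigned int trace_depth;	/* per-CPU in the real code */
static unsigned int recursion_hits;	/* stands in for WARN_ON_ONCE() */

/* Returns 1 if the tracepoint may run, 0 if recursion was detected.
 * Callers only pair this with trace_exit() when it returned 1. */
static int trace_entry(void)
{
	if (trace_depth) {
		recursion_hits++;	/* the patch warns here, once */
		return 0;		/* skip the tracepoint */
	}
	trace_depth++;
	return 1;
}

static void trace_exit(void)
{
	trace_depth--;
}
```

A recursive call sees the non-zero depth and bails out instead of deadlocking; the warning makes such recursion visible rather than silently swallowed.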
[PATCH 3/4] powerpc/pseries: use notrace hcall variant for H_CEDE idle
Rather than special-case H_CEDE in the hcall trace wrappers, make the idle H_CEDE call use plpar_hcall_norets_notrace(). Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/plpar_wrappers.h | 6 +- arch/powerpc/platforms/pseries/lpar.c | 10 -- 2 files changed, 5 insertions(+), 11 deletions(-) diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h index ece84a430701..83e0f701ebc6 100644 --- a/arch/powerpc/include/asm/plpar_wrappers.h +++ b/arch/powerpc/include/asm/plpar_wrappers.h @@ -28,7 +28,11 @@ static inline void set_cede_latency_hint(u8 latency_hint) static inline long cede_processor(void) { - return plpar_hcall_norets(H_CEDE); + /* + * We cannot call tracepoints inside RCU idle regions which + * means we must not trace H_CEDE. + */ + return plpar_hcall_norets_notrace(H_CEDE); } static inline long extended_cede_processor(unsigned long latency_hint) diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index 0641779e5cde..835e7f661a05 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -1839,13 +1839,6 @@ notrace void __trace_hcall_entry(unsigned long opcode, unsigned long *args) unsigned long flags; unsigned int *depth; - /* - * We cannot call tracepoints inside RCU idle regions which - * means we must not trace H_CEDE. - */ - if (opcode == H_CEDE) - return; - local_irq_save(flags); depth = this_cpu_ptr(&hcall_trace_depth); @@ -1867,9 +1860,6 @@ notrace void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf) unsigned long flags; unsigned int *depth; - if (opcode == H_CEDE) - return; - local_irq_save(flags); depth = this_cpu_ptr(&hcall_trace_depth); -- 2.23.0
[PATCH 2/4] powerpc/pseries: Don't trace hcall tracing wrapper
This doesn't seem very useful to trace before the recursion check, even if the ftrace code has recursion checks of its own. Be on the safe side and don't trace the hcall trace wrappers. Reported-by: Naveen N. Rao Signed-off-by: Nicholas Piggin --- arch/powerpc/platforms/pseries/lpar.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index 6011b7b80946..0641779e5cde 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -1834,7 +1834,7 @@ void hcall_tracepoint_unregfunc(void) static DEFINE_PER_CPU(unsigned int, hcall_trace_depth); -void __trace_hcall_entry(unsigned long opcode, unsigned long *args) +notrace void __trace_hcall_entry(unsigned long opcode, unsigned long *args) { unsigned long flags; unsigned int *depth; @@ -1862,7 +1862,7 @@ void __trace_hcall_entry(unsigned long opcode, unsigned long *args) local_irq_restore(flags); } -void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf) +notrace void __trace_hcall_exit(long opcode, long retval, unsigned long *retbuf) { unsigned long flags; unsigned int *depth; -- 2.23.0
[PATCH 1/4] powerpc/pseries: Fix hcall tracing recursion in pv queued spinlocks
The paravirt queued spinlock slow path adds itself to the queue then calls pv_wait to wait for the lock to become free. This is implemented by calling H_CONFER to donate cycles. When hcall tracing is enabled, this H_CONFER call can lead to a spin lock being taken in the tracing code, which results in the lock being taken again; that acquisition also goes to the slow path because it queues behind itself and so won't ever make progress. An example trace of a deadlock: __pv_queued_spin_lock_slowpath trace_clock_global ring_buffer_lock_reserve trace_event_buffer_lock_reserve trace_event_buffer_reserve trace_event_raw_event_hcall_exit __trace_hcall_exit plpar_hcall_norets_trace __pv_queued_spin_lock_slowpath trace_clock_global ring_buffer_lock_reserve trace_event_buffer_lock_reserve trace_event_buffer_reserve trace_event_raw_event_rcu_dyntick rcu_irq_exit irq_exit __do_irq call_do_irq do_IRQ hardware_interrupt_common_virt Fix this by introducing plpar_hcall_norets_notrace(), and using it for the SPLPAR virtual processor dispatching hcalls made by the paravirt spinlock code. Fixes: 20c0e8269e9d ("powerpc/pseries: Implement paravirt qspinlocks for SPLPAR") Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/hvcall.h | 3 +++ arch/powerpc/include/asm/paravirt.h | 22 +++--- arch/powerpc/platforms/pseries/hvCall.S | 10 ++ arch/powerpc/platforms/pseries/lpar.c | 4 ++-- 4 files changed, 34 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h index ed6086d57b22..0c92b01a3c3c 100644 --- a/arch/powerpc/include/asm/hvcall.h +++ b/arch/powerpc/include/asm/hvcall.h @@ -446,6 +446,9 @@ */ long plpar_hcall_norets(unsigned long opcode, ...); +/* Variant which does not do hcall tracing */ +long plpar_hcall_norets_notrace(unsigned long opcode, ...); + /** * plpar_hcall: - Make a pseries hypervisor call * @opcode: The hypervisor call to make. 
diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 5d1726bb28e7..3c13c2ec70a9 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -30,17 +30,33 @@ static inline u32 yield_count_of(int cpu) static inline void yield_to_preempted(int cpu, u32 yield_count) { - plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); + /* + * Spinlock code yields and prods, so don't trace the hcalls because + * tracing code takes spinlocks which could recurse. + * + * These calls are made while the lock is not held, the lock slowpath + * yields if it cannot acquire the lock, and the unlock slow path might + * prod if a waiter has yielded. So this did not seem to be a problem + * for simple spin locks because technically it didn't recurse on the + * lock. However the queued spin lock contended path is more strictly + * ordered: the H_CONFER hcall is made after the task has queued itself + * on the lock, so then recursing on the lock will queue up behind that + * (or worse: queued spinlocks use tricks that assume a context never + * waits on more than one spinlock, so that may cause random + * corruption). 
+ */ + plpar_hcall_norets_notrace(H_CONFER, + get_hard_smp_processor_id(cpu), yield_count); } static inline void prod_cpu(int cpu) { - plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); + plpar_hcall_norets_notrace(H_PROD, get_hard_smp_processor_id(cpu)); } static inline void yield_to_any(void) { - plpar_hcall_norets(H_CONFER, -1, 0); + plpar_hcall_norets_notrace(H_CONFER, -1, 0); } #else static inline bool is_shared_processor(void) diff --git a/arch/powerpc/platforms/pseries/hvCall.S b/arch/powerpc/platforms/pseries/hvCall.S index 2136e42833af..8a2b8d64265b 100644 --- a/arch/powerpc/platforms/pseries/hvCall.S +++ b/arch/powerpc/platforms/pseries/hvCall.S @@ -102,6 +102,16 @@ END_FTR_SECTION(0, 1); \ #define HCALL_BRANCH(LABEL) #endif +_GLOBAL_TOC(plpar_hcall_norets_notrace) + HMT_MEDIUM + + mfcr r0 + stw r0,8(r1) + HVSC /* invoke the hypervisor */ + lwz r0,8(r1) + mtcrf 0xff,r0 + blr /* return r3 = status */ + _GLOBAL_TOC(plpar_hcall_norets) HMT_MEDIUM diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c index 3805519a6469..6011b7b80946 100644 --- a/arch/powerpc/platforms/pseries/lpar.c +++ b/arch/powerpc/platforms/pseries/lpar.c @@ -1828,8 +1828,8 @@ void hcall_tracepoint_unregfunc(void) /* * Since the tracing code might execute hcalls we need to guard against - * recursion. One example of this are spinlocks calling H_YIELD on - * shared
[PATCH 0/4] Fix queued spinlocks and a bit more
Patch 1 is the important fix. 2 might fix something, although I haven't provoked a crash yet. Patch 3 is a small cleanup, and patch 4 I think is the right thing to do but these could wait for later. Thanks, Nick Nicholas Piggin (4): powerpc/pseries: Fix hcall tracing recursion in pv queued spinlocks powerpc/pseries: Don't trace hcall tracing wrapper powerpc/pseries: use notrace hcall variant for H_CEDE idle powerpc/pseries: warn if recursing into the hcall tracing code arch/powerpc/include/asm/hvcall.h | 3 +++ arch/powerpc/include/asm/paravirt.h | 22 +--- arch/powerpc/include/asm/plpar_wrappers.h | 6 +- arch/powerpc/platforms/pseries/hvCall.S | 10 + arch/powerpc/platforms/pseries/lpar.c | 25 --- 5 files changed, 46 insertions(+), 20 deletions(-) -- 2.23.0
Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor
On 23/04/2021 08:58, Daniel Axtens wrote: Hi Alexey, The $(CPP) (do only preprocessing) macro is already defined in Makefile. However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results in flags duplication. Which is not a big deal by itself except for the flags which depend on other flags and the compiler checks them as it parses the command line. Specifically, scripts/Makefile.build:304 generates ksyms for .S files. If clang+llvm+sanitizer are enabled, this results in -fno-lto -flto -fsanitize=cfi-mfcall -fno-lto -flto -fsanitize=cfi-mfcall Checkpatch doesn't like this line: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line) #14: -fno-lto -flto -fsanitize=cfi-mfcall -fno-lto -flto -fsanitize=cfi-mfcall However, it doesn't make sense to wrap the line so we should just ignore checkpatch here. in the clang command line and triggers error: clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with '-flto' This removes unnecessary CPP redefinition. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/Makefile | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index c9d2c7825cd6..3a2f2001c62b 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr) KBUILD_AFLAGS += $(AFLAGS-y) KBUILD_CFLAGS += $(call cc-option,-msoft-float) KBUILD_CFLAGS += -pipe $(CFLAGS-y) -CPP= $(CC) -E $(KBUILD_CFLAGS) I was trying to check the history to see why powerpc has its own definition. It seems to date back at least as far as merging the two powerpc platforms into one, maybe it was helpful then. I agree it doesn't seem to be needed now. Snowpatch claims that this breaks the build, but I haven't been able to reproduce the issue in either pmac32 or ppc64 defconfig. 
I have sent it off to a fork of mpe's linux-ci repo to see if any of those builds hit any issues: https://github.com/daxtens/linux-ci/actions To be precise, you need LLVM + LTO + byte code (-emit-llvm-bc), I am not even sure what is the point of having -flto without -emit-llvm-bc. No flags duplication: [fstn1-p1 1]$ /mnt/sdb/pbuild/llvm-no-lto/bin/clang-13 -emit-llvm-bc -fno-lto -flto -fvisibility=hidden -fsanitize=cfi-mfcall /mnt/sdb/pbuild/llvm-bugs/1/a.c /usr/bin/ld: warning: cannot find entry symbol mit-llvm-bc; defaulting to 13e0 /usr/bin/ld: /usr/lib/powerpc64le-linux-gnu/crt1.o:(.data.rel.ro.local+0x8): undefined reference to `main' clang-13: error: linker command failed with exit code 1 (use -v to see invocation) => command line is fine, the file is not (but it is for debugging this problem). Now I am adding the second -fno-lto: [fstn1-p1 1]$ /mnt/sdb/pbuild/llvm-no-lto/bin/clang-13 -emit-llvm-bc -fno-lto -flto -fvisibility=hidden -fsanitize=cfi-mfcall -fno-lto /mnt/sdb/pbuild/llvm-bugs/1/a.c clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with '-flto' => failed. Assuming that doesn't break, this patch looks good to me: Reviewed-by: Daniel Axtens Kind regards, Daniel -- Alexey
Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool
On Thu, Apr 22, 2021 at 4:17 PM Claire Chang wrote: > > If a device is not behind an IOMMU, we look up the device node and set > up the restricted DMA when the restricted-dma-pool is presented. > > Signed-off-by: Claire Chang > --- > drivers/of/address.c| 25 + > drivers/of/device.c | 3 +++ > drivers/of/of_private.h | 5 + > 3 files changed, 33 insertions(+) > > diff --git a/drivers/of/address.c b/drivers/of/address.c > index 54f221dde267..fff3adfe4986 100644 > --- a/drivers/of/address.c > +++ b/drivers/of/address.c > @@ -8,6 +8,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np) > } > EXPORT_SYMBOL_GPL(of_dma_is_coherent); > > +int of_dma_set_restricted_buffer(struct device *dev) > +{ > + struct device_node *node; > + int count, i; > + > + if (!dev->of_node) > + return 0; > + > + count = of_property_count_elems_of_size(dev->of_node, "memory-region", > + sizeof(phandle)); > + for (i = 0; i < count; i++) { > + node = of_parse_phandle(dev->of_node, "memory-region", i); > + /* There might be multiple memory regions, but only one > + * restricted-dma-pool region is allowed. 
> +*/ > + if (of_device_is_compatible(node, "restricted-dma-pool") && > + of_device_is_available(node)) > + return of_reserved_mem_device_init_by_idx( > + dev, dev->of_node, i); > + } > + > + return 0; > +} > + > /** > * of_mmio_is_nonposted - Check if device uses non-posted MMIO > * @np:device node > diff --git a/drivers/of/device.c b/drivers/of/device.c > index c5a9473a5fb1..d8d865223e51 100644 > --- a/drivers/of/device.c > +++ b/drivers/of/device.c > @@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct > device_node *np, > > arch_setup_dma_ops(dev, dma_start, size, iommu, coherent); > > + if (!iommu) > + return of_dma_set_restricted_buffer(dev); > + > return 0; > } > EXPORT_SYMBOL_GPL(of_dma_configure_id); > diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h > index d717efbd637d..e9237f5eff48 100644 > --- a/drivers/of/of_private.h > +++ b/drivers/of/of_private.h > @@ -163,12 +163,17 @@ struct bus_dma_region; > #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA) > int of_dma_get_range(struct device_node *np, > const struct bus_dma_region **map); > +int of_dma_set_restricted_buffer(struct device *dev); > #else > static inline int of_dma_get_range(struct device_node *np, > const struct bus_dma_region **map) > { > return -ENODEV; > } > +static inline int of_dma_get_restricted_buffer(struct device *dev) This one should be of_dma_set_restricted_buffer. Sorry for the typo. > +{ > + return -ENODEV; > +} > #endif > > #endif /* _LINUX_OF_PRIVATE_H */ > -- > 2.31.1.368.gbe11c130af-goog >
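The selection logic in the patch reduces to "use the first memory-region entry that is both compatible with restricted-dma-pool and available". A userspace sketch with made-up structures (the real code walks OF phandles, not arrays):

```c
#include <assert.h>
#include <string.h>

/* fake_region stands in for a device-tree memory-region node. */
struct fake_region {
	const char *compatible;
	int available;	/* status = "okay" in device-tree terms */
};

/* Return the index of the first usable restricted pool, or -1 if none
 * (which, as in the patch, is not treated as an error for the device). */
static int pick_restricted_region(const struct fake_region *r, int count)
{
	for (int i = 0; i < count; i++) {
		if (strcmp(r[i].compatible, "restricted-dma-pool") == 0 &&
		    r[i].available)
			return i;
	}
	return -1;
}

/* Sample phandle list: a shared pool, a disabled restricted pool, and
 * finally a usable restricted pool. */
static const struct fake_region example_regions[] = {
	{ "shared-dma-pool", 1 },
	{ "restricted-dma-pool", 0 },
	{ "restricted-dma-pool", 1 },
};
```

Only the first match initializes the device's pool, mirroring the "only one restricted-dma-pool region is allowed" comment in the patch.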
Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker
On Wed, Sep 2, 2020 at 11:02 AM Christophe Leroy wrote: > > > > Le 02/09/2020 à 19:41, Nick Desaulniers a écrit : > > On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman wrote: > >> > >> Nick Desaulniers writes: > >>> Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for > >>> orphan sections") > >> > >> I think I'll just revert that for v5.9 ? > > > > SGTM; you'll probably still want these changes with some modifications > > at some point; vdso32 did have at least one orphaned section, and will > > be important for hermetic builds. Seeing crashes in supported > > versions of the tools ties our hands at the moment. > > > > Keeping the tool problem aside with binutils 2.26, do you have a way to > really link an elf32ppc object when building vdso32 for PPC64 ? Sorry, I'm doing a bug scrub and found https://github.com/ClangBuiltLinux/linux/issues/774 still open (and my reply to this thread still in Drafts; never sent). With my patches rebased: $ file arch/powerpc/kernel/vdso32/vdso32.so arch/powerpc/kernel/vdso32/vdso32.so: ELF 32-bit MSB shared object, PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, stripped Are you still using 2.26? I'm not able to repro Nathan's reported issue from https://lore.kernel.org/lkml/20200902052123.GA2687902@ubuntu-n2-xlarge-x86/, so I'm curious if I should resend the rebased patches as v2? -- Thanks, ~Nick Desaulniers
Re: [PATCH 1/2] audit: add support for the openat2 syscall
On 2021-03-18 08:08, Richard Guy Briggs wrote: > On 2021-03-18 11:48, Christian Brauner wrote: > > [+Cc Aleksa, the author of openat2()] > > Ah! Thanks for pulling in Aleksa. I thought I caught everyone... > > > and a comment below. :) > > Same... > > > On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote: > > > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9 > > > ("open: introduce openat2(2) syscall") > > > > > > Add the openat2(2) syscall to the audit syscall classifier. > > > > > > See the github issue > > > https://github.com/linux-audit/audit-kernel/issues/67 > > > > > > Signed-off-by: Richard Guy Briggs > > > --- > > > arch/alpha/kernel/audit.c | 2 ++ > > > arch/ia64/kernel/audit.c | 2 ++ > > > arch/parisc/kernel/audit.c | 2 ++ > > > arch/parisc/kernel/compat_audit.c | 2 ++ > > > arch/powerpc/kernel/audit.c| 2 ++ > > > arch/powerpc/kernel/compat_audit.c | 2 ++ > > > arch/s390/kernel/audit.c | 2 ++ > > > arch/s390/kernel/compat_audit.c| 2 ++ > > > arch/sparc/kernel/audit.c | 2 ++ > > > arch/sparc/kernel/compat_audit.c | 2 ++ > > > arch/x86/ia32/audit.c | 2 ++ > > > arch/x86/kernel/audit_64.c | 2 ++ > > > kernel/auditsc.c | 3 +++ > > > lib/audit.c| 4 > > > lib/compat_audit.c | 4 > > > 15 files changed, 35 insertions(+) > > > > > > diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c > > > index 96a9d18ff4c4..06a911b685d1 100644 > > > --- a/arch/alpha/kernel/audit.c > > > +++ b/arch/alpha/kernel/audit.c > > > @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > return 3; > > > case __NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 0; > > > } > > > diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c > > > index 5192ca899fe6..5eaa888c8fd3 100644 > > > --- a/arch/ia64/kernel/audit.c > > > +++ b/arch/ia64/kernel/audit.c > > > @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > return 3; > > > case 
__NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 0; > > > } > > > diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c > > > index 9eb47b2225d2..fc721a7727ba 100644 > > > --- a/arch/parisc/kernel/audit.c > > > +++ b/arch/parisc/kernel/audit.c > > > @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > return 3; > > > case __NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 0; > > > } > > > diff --git a/arch/parisc/kernel/compat_audit.c > > > b/arch/parisc/kernel/compat_audit.c > > > index 20c39c9d86a9..fc6d35918c44 100644 > > > --- a/arch/parisc/kernel/compat_audit.c > > > +++ b/arch/parisc/kernel/compat_audit.c > > > @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall) > > > return 3; > > > case __NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 1; > > > } > > > diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c > > > index a27f3d09..8f32700b0baa 100644 > > > --- a/arch/powerpc/kernel/audit.c > > > +++ b/arch/powerpc/kernel/audit.c > > > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > return 4; > > > case __NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 0; > > > } > > > diff --git a/arch/powerpc/kernel/compat_audit.c > > > b/arch/powerpc/kernel/compat_audit.c > > > index 55c6ccda0a85..ebe45534b1c9 100644 > > > --- a/arch/powerpc/kernel/compat_audit.c > > > +++ b/arch/powerpc/kernel/compat_audit.c > > > @@ -38,6 +38,8 @@ int ppc32_classify_syscall(unsigned syscall) > > > return 4; > > > case __NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 1; > > > } > > > diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c > > > index d395c6c9944c..d964cb94cfaf 100644 > > > --- a/arch/s390/kernel/audit.c > > > 
+++ b/arch/s390/kernel/audit.c > > > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > return 4; > > > case __NR_execve: > > > return 5; > > > + case __NR_openat2: > > > + return 6; > > > default: > > > return 0; > > > } > > > diff --git a/arch/s390/kernel/compat_audit.c > > > b/arch/s390/kernel/compat_audit.c > > > index 444fb1f66944..f7b32933ce0e 100644 > > > --- a/arch/s390/kernel/compat_audit.c > > > +++ b/arch/s390/kernel/compat_audit.c > > > @@ -39,6 +39,8 @@ int
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
On 4/22/21 10:21 AM, Michal Suchánek wrote: Merging do_reset and do_hard_reset might be a good code cleanup which is out of the scope of this fix. I agree, on both points. Accepting the patch, and improving the reset path are not in conflict with each other. My position is that improving the reset path is certainly on the table, but not with this bug or this fix. Within the context of this discovered problem, the issue is well understood, the fix is small and addresses the immediate problem, and it has not generated new bugs under subsequent testing. For those reasons, the submitted patch has my support. Rick
Re: [PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH
On Fri, Apr 23, 2021 at 9:09 AM Daniel Axtens wrote: > > Hi Nick, > > > While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a > > possible candidate to convert to #ifdef CONFIG_EEH, but it seems that > > based on Kconfig dependencies it's not possible to build this file > > without CONFIG_EEH enabled. > > This seemed odd to me, but I think you're right: > > arch/powerpc/platforms/Kconfig contains: > > config EEH > bool > depends on (PPC_POWERNV || PPC_PSERIES) && PCI > default y > > It's not configurable from e.g. make menuconfig because there's no prompt. > You can attempt to explicitly disable it with e.g. `scripts/config -d EEH` > but then something like `make oldconfig` will silently re-enable it for > you. > > It's been forced on since commit e49f7a9997c6 ("powerpc/pseries: Rivet > CONFIG_EEH for pSeries platform") in 2012 which fixed it for > pseries. That moved out from pseries to pseries + powernv later on. > > There are other cleanups in the same vein that could be made, from the > Makefile (which has files only built with CONFIG_EEH) through to other > source files. It looks like there's one `#ifdef CONFIG_EEH` in > arch/powerpc/platforms/powernv/pci-ioda.c that could be pulled out, for > example. > > I think it's probably worth trying to rip out all of those in one patch? The change in commit e49f7a9997c6 ("powerpc/pseries: Rivet CONFIG_EEH for pSeries platform") never should have been made. There's no inherent reason why EEH needs to be enabled and forcing it on is (IMO) a large part of why EEH support is the byzantine clusterfuck that it is. One of the things I was working towards was allowing pseries and powernv to be built with !CONFIG_EEH since that would help define a clearer boundary between what is "eeh support" and what is required to support PCI on the platform. 
Pseries is particularly bad for this since PAPR says the RTAS calls needed to do a PCI bus reset are part of the EEH extension, but there's non-EEH reasons why you might want to use those RTAS calls. The PHB reset that we do when entering a kdump kernel is a good example since that uses the same RTAS calls, but it has nothing to do with the EEH recovery machinery enabled by CONFIG_EEH. I was looking into that largely because people were considering using OPAL for microwatt platforms. Breaking the assumption that powernv==EEH support is one of the few bits of work required to enable that, but even if you don't go down that road I think everyone would be better off if you kept a degree of separation between the two.
[PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH
While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a possible candidate to convert to #ifdef CONFIG_EEH, but it seems that based on Kconfig dependencies it's not possible to build this file without CONFIG_EEH enabled. Suggested-by: Nathan Chancellor Suggested-by: Joe Perches Link: https://github.com/ClangBuiltLinux/linux/issues/570 Link: https://lore.kernel.org/lkml/67f6cd269684c9aa8463ff4812c3b4605e6739c3.ca...@perches.com/ Signed-off-by: Nick Desaulniers --- arch/powerpc/platforms/powernv/pci.c | 7 --- 1 file changed, 7 deletions(-) diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index 9b9bca169275..591480a37b05 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -711,7 +711,6 @@ int pnv_pci_cfg_write(struct pci_dn *pdn, return PCIBIOS_SUCCESSFUL; } -#if CONFIG_EEH static bool pnv_pci_cfg_check(struct pci_dn *pdn) { struct eeh_dev *edev = NULL; @@ -734,12 +733,6 @@ static bool pnv_pci_cfg_check(struct pci_dn *pdn) return true; } -#else -static inline pnv_pci_cfg_check(struct pci_dn *pdn) -{ - return true; -} -#endif /* CONFIG_EEH */ static int pnv_pci_read_config(struct pci_bus *bus, unsigned int devfn, base-commit: 16fc44d6387e260f4932e9248b985837324705d8 prerequisite-patch-id: 950233069fb22099a8ff8990f620f5c3586a3cbd -- 2.31.1.498.g6c1eba8ee3d-goog
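As background to the `#if CONFIG_EEH` / `#ifdef CONFIG_EEH` distinction in the patch above, a minimal illustration (CONFIG_FOO is a stand-in macro, not a real kernel option): under -Wundef, `#if` on an undefined macro warns and evaluates it as 0, while `#ifdef` only tests whether the macro is defined.

```c
#include <assert.h>

/* CONFIG_FOO is deliberately left undefined to mimic a disabled option. */

static int via_ifdef(void)
{
#ifdef CONFIG_FOO	/* tests definedness only */
	return 1;
#else
	return 0;
#endif
}

static int via_if(void)
{
	/* A plain "#if CONFIG_FOO" would warn under -Wundef and treat the
	 * undefined macro as 0 -- the pattern the patch removes. Guarding
	 * with defined() keeps the same truth value without the warning. */
#if defined(CONFIG_FOO) && (CONFIG_FOO != 0)
	return 1;
#else
	return 0;
#endif
}
```

Both forms agree here only because the macro is undefined; `#if CONFIG_FOO 0` (an option defined but set to 0) is where they diverge, which is why -Wundef flags the `#if` spelling.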
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
Hi! On Fri, Apr 23, 2021 at 12:16:18AM +0200, Gabriel Paubert wrote: > On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote: > > On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote: > > > Sathvika Vasireddy writes: > > > > + if ((regs->ccr) & (1 << (31 - ra))) > > > > + op->val = -1; > > > > + else if ((regs->ccr) & (1 << (30 - ra))) > > > > + op->val = 1; > > > > + else > > > > + op->val = 0; > > > > It imo is clearer if written > > > > if ((regs->ccr << ra) & 0x80000000) > > op->val = -1; > > else if ((regs->ccr << ra) & 0x40000000) > > op->val = 1; > > else > > op->val = 0; > > > > but I guess not everyone agrees :-) > > But this can be made jump free :-): > > int tmp = regs->ccr << ra; > op->val = (tmp >> 31) | ((tmp >> 30) & 1); The compiler will do so automatically (or think of some better way to get the same result); in source code, what matters most is readability, or clarity in general (also clarity to the compiler). (Right shifts of negative numbers are implementation-defined in C, fwiw -- but work like you expect in GCC). > (IIRC the srawi instruction sign-extends its result to 64 bits). If you consider it to work on 32-bit inputs, yeah, that is a simple way to express it. > > > CR field: 76543210 > > > bit: 0123 0123 0123 0123 0123 0123 0123 0123 > > > normal bit #: 0.31 > > > ibm bit #: 31.0 > > > > The bit numbers in CR fields are *always* numbered left-to-right. I > > have never seen anyone use LE for it, anyway. > > > > Also, even people who write LE have the bigger end on the left normally > > (they just write some things right-to-left, and other things > > left-to-right). > > Around 1985, I had documentation for the National's 32032 > (little-endian) processor family, and all the instruction encodings were > presented with the LSB on the left and MSB on the right. Ouch! Did they write "regular" numbers with the least significant digit on the left as well? 
> BTW on these processors, the immediate operands and the offsets (1, 2 or > 4 bytes) for the addressing modes were encoded in big-endian byte order, > but I digress. Consistency is overrated ;-) Inconsistency is the spice of life, yeah :-) Segher
Re: [PATCH] powerpc/powernv/pci: remove dead code from !CONFIG_EEH
Hi Nick, > While looking at -Wundef warnings, the #if CONFIG_EEH stood out as a > possible candidate to convert to #ifdef CONFIG_EEH, but it seems that > based on Kconfig dependencies it's not possible to build this file > without CONFIG_EEH enabled. This seemed odd to me, but I think you're right: arch/powerpc/platforms/Kconfig contains: config EEH bool depends on (PPC_POWERNV || PPC_PSERIES) && PCI default y It's not configurable from e.g. make menuconfig because there's no prompt. You can attempt to explicitly disable it with e.g. `scripts/config -d EEH` but then something like `make oldconfig` will silently re-enable it for you. It's been forced on since commit e49f7a9997c6 ("powerpc/pseries: Rivet CONFIG_EEH for pSeries platform") in 2012, which forced it on for pseries. That was later broadened from pseries alone to pseries + powernv. There are other cleanups in the same vein that could be made, from the Makefile (which has files only built with CONFIG_EEH) through to other source files. It looks like there's one `#ifdef CONFIG_EEH` in arch/powerpc/platforms/powernv/pci-ioda.c that could be pulled out, for example. I think it's probably worth trying to rip out all of those in one patch? Kind regards, Daniel
Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor
Hi Alexey, > The $(CPP) (do only preprocessing) macro is already defined in Makefile. > However POWERPC redefines it and adds $(KBUILD_CFLAGS) which results > in flags duplication. Which is not a big deal by itself except for > the flags which depend on other flags and the compiler checks them > as it parses the command line. > > Specifically, scripts/Makefile.build:304 generates ksyms for .S files. > If clang+llvm+sanitizer are enabled, this results in > -fno-lto -flto -fsanitize=cfi-mfcall -fno-lto -flto > -fsanitize=cfi-mfcall Checkpatch doesn't like this line: WARNING:COMMIT_LOG_LONG_LINE: Possible unwrapped commit description (prefer a maximum 75 chars per line) #14: -fno-lto -flto -fsanitize=cfi-mfcall -fno-lto -flto -fsanitize=cfi-mfcall However, it doesn't make sense to wrap the line so we should just ignore checkpatch here. > in the clang command line and triggers error: > > clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with > '-flto' > > This removes unnecessary CPP redifinition. > > Signed-off-by: Alexey Kardashevskiy > --- > arch/powerpc/Makefile | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile > index c9d2c7825cd6..3a2f2001c62b 100644 > --- a/arch/powerpc/Makefile > +++ b/arch/powerpc/Makefile > @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr) > KBUILD_AFLAGS+= $(AFLAGS-y) > KBUILD_CFLAGS+= $(call cc-option,-msoft-float) > KBUILD_CFLAGS+= -pipe $(CFLAGS-y) > -CPP = $(CC) -E $(KBUILD_CFLAGS) I was trying to check the history to see why powerpc has its own definition. It seems to date back at least as far as merging the two powerpc platforms into one, maybe it was helpful then. I agree it doesn't seem to be needed now. Snowpatch claims that this breaks the build, but I haven't been able to reproduce the issue in either pmac32 or ppc64 defconfig. 
I have sent it off to a fork of mpe's linux-ci repo to see if any of those builds hit any issues: https://github.com/daxtens/linux-ci/actions Assuming that doesn't break, this patch looks good to me: Reviewed-by: Daniel Axtens Kind regards, Daniel
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
Hi, On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote: > Hi! > > On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote: > > Sathvika Vasireddy writes: > > Ok, if I've understood correctly... > > > > > + ra = ra & ~0x3; > > > > This masks off the bits of RA that are not part of BTF: > > > > ra is in [0, 31] which is [0b00000, 0b11111] > > Then ~0x3 = ~0b00011 > > ra = ra & 0b11100 > > > > This gives us then, > > ra = btf << 2; or > > btf = ra >> 2; > > Yes. In effect, you want the offset in bits of the CR field, which is > just fine like this. But a comment would not hurt. > > > Let's then check to see if your calculations read the right fields. > > > > > + if ((regs->ccr) & (1 << (31 - ra))) > > > + op->val = -1; > > > + else if ((regs->ccr) & (1 << (30 - ra))) > > > + op->val = 1; > > > + else > > > + op->val = 0; > > It imo is clearer if written > > if ((regs->ccr << ra) & 0x80000000) > op->val = -1; > else if ((regs->ccr << ra) & 0x40000000) > op->val = 1; > else > op->val = 0; > > but I guess not everyone agrees :-) > But this can be made jump free :-): int tmp = regs->ccr << ra; op->val = (tmp >> 31) | ((tmp >> 30) & 1); (IIRC the srawi instruction sign-extends its result to 64 bits). > > CR field: 76543210 > > bit: 0123 0123 0123 0123 0123 0123 0123 0123 > > normal bit #: 0 ... 31 > > ibm bit #: 31 ... 0 > > The bit numbers in CR fields are *always* numbered left-to-right. I > have never seen anyone use LE for it, anyway. > > Also, even people who write LE have the bigger end on the left normally > (they just write some things right-to-left, and other things > left-to-right). Around 1985, I had documentation for the National's 32032 (little-endian) processor family, and all the instruction encodings were presented with the LSB on the left and MSB on the right. BTW on these processors, the immediate operands and the offsets (1, 2 or 4 bytes) for the addressing modes were encoded in big-endian byte order, but I digress. 
Consistency is overrated ;-) Gabriel > > > Checkpatch does have one complaint: > > > > CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'regs->ccr' > > #30: FILE: arch/powerpc/lib/sstep.c:1971: > > + if ((regs->ccr) & (1 << (31 - ra))) > > > > I don't really mind the parentheses: I think you are safe to ignore > > checkpatch here unless someone else complains :) > > I find them annoying. If there are too many parentheses, it is hard to > see at a glance what groups where. Also, a suspicious reader might > think there is something special going on (with macros for example). > > This is simple code of course, but :-) > > > If you do end up respinning the patch, I think it would be good to make > > the maths a bit clearer. > > It is very simple. The BFA instruction field is closely related to the > BI instruction field, which is 5 bits, and selects one of the 32 bits in > the CR. If you have "BFA00 BFA01 BFA10 BFA11", that gives the bit > numbers of all four bits in the selected CR field. So the "& ~3" does > all you need. It is quite pretty :-) > > > Segher
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
Hi! On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote: > Sathvika Vasireddy writes: > Ok, if I've understood correctly... > > > + ra = ra & ~0x3; > > This masks off the bits of RA that are not part of BTF: > > ra is in [0, 31] which is [0b00000, 0b11111] > Then ~0x3 = ~0b00011 > ra = ra & 0b11100 > > This gives us then, > ra = btf << 2; or > btf = ra >> 2; Yes. In effect, you want the offset in bits of the CR field, which is just fine like this. But a comment would not hurt. > Let's then check to see if your calculations read the right fields. > > > + if ((regs->ccr) & (1 << (31 - ra))) > > + op->val = -1; > > + else if ((regs->ccr) & (1 << (30 - ra))) > > + op->val = 1; > > + else > > + op->val = 0; It imo is clearer if written if ((regs->ccr << ra) & 0x80000000) op->val = -1; else if ((regs->ccr << ra) & 0x40000000) op->val = 1; else op->val = 0; but I guess not everyone agrees :-) > CR field: 76543210 > bit: 0123 0123 0123 0123 0123 0123 0123 0123 > normal bit #: 0 ... 31 > ibm bit #: 31 ... 0 The bit numbers in CR fields are *always* numbered left-to-right. I have never seen anyone use LE for it, anyway. Also, even people who write LE have the bigger end on the left normally (they just write some things right-to-left, and other things left-to-right). > Checkpatch does have one complaint: > > CHECK:UNNECESSARY_PARENTHESES: Unnecessary parentheses around 'regs->ccr' > #30: FILE: arch/powerpc/lib/sstep.c:1971: > + if ((regs->ccr) & (1 << (31 - ra))) > > I don't really mind the parentheses: I think you are safe to ignore > checkpatch here unless someone else complains :) I find them annoying. If there are too many parentheses, it is hard to see at a glance what groups where. Also, a suspicious reader might think there is something special going on (with macros for example). This is simple code of course, but :-) > If you do end up respinning the patch, I think it would be good to make > the maths a bit clearer. 
> I think it works because a left shift of 2 is > the same as multiplying by 4, but it would be easier to follow if you > used a temporary variable for btf. It is very simple. The BFA instruction field is closely related to the BI instruction field, which is 5 bits, and selects one of the 32 bits in the CR. If you have "BFA00 BFA01 BFA10 BFA11", that gives the bit numbers of all four bits in the selected CR field. So the "& ~3" does all you need. It is quite pretty :-) Segher
Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
* Gautham R Shenoy [2021-04-22 20:37:29]: > From: "Gautham R. Shenoy" > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values > of the Extended CEDE states advertised by the platform > > On some of the POWER9 LPARs, the older firmwares advertise a very low > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the > measured value is 5us on an average. Due to the low advertised exit > latency, we are entering CEDE(0) more aggressively on such > platforms. While this helps achieve SMT folding faster, we pay the > penalty of having to send an IPI to wakeup the CPU when the target > residency is very short. This is showing up as a performance > regression on the newer kernels running on the LPARs with older > firmware. > > Hence, set the exit latency of CEDE(0) based on the latency values > advertized by platform only from POWER10 onwards. The values > advertized on POWER10 platforms is more realistic and informed by the > latency measurements. > > For platforms older than POWER10, retain the hardcoded value of exit > latency, which is 10us. Though this is higher than the measured > values, we would be erring on the side of caution. > > Reported-by: Enrico Joedecke > Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for > CEDE(0)") > Signed-off-by: Gautham R. 
Shenoy Reviewed-by: Vaidyanathan Srinivasan > --- > drivers/cpuidle/cpuidle-pseries.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/cpuidle/cpuidle-pseries.c > b/drivers/cpuidle/cpuidle-pseries.c > index a2b5c6f..7207467 100644 > --- a/drivers/cpuidle/cpuidle-pseries.c > +++ b/drivers/cpuidle/cpuidle-pseries.c > @@ -419,7 +419,8 @@ static int pseries_idle_probe(void) > cpuidle_state_table = shared_states; > max_idle_state = ARRAY_SIZE(shared_states); > } else { > - fixup_cede0_latency(); > + if (pvr_version_is(PVR_POWER10)) > + fixup_cede0_latency(); > cpuidle_state_table = dedicated_states; > max_idle_state = NR_DEDICATED_STATES; > } Thanks for the fix. We should have such safeguards or fallbacks while running on older platforms. This fix is needed because of the unfortunate regression on some older platforms that we failed to notice while designing and testing the original feature. --Vaidy
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
On Thu, Apr 22, 2021 at 12:21 PM Michal Suchánek wrote: > > Hello, > > On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote: > > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu > > wrote: > > > > > > Lijun Pan [l...@linux.vnet.ibm.com] wrote: > > > > > > > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden wrote: > > > > > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If > > > > > this > > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > > > ibmvnic abandons the reset because of this failed set link down and > > > > > this > > > > > is the last reset in the workqueue, then this adapter will be left in > > > > > an > > > > > inoperable state. > > > > > > > > > > Instead, make the driver ignore this link down failure and continue to > > > > > free and re-register CRQ so that the adapter has an opportunity to > > > > > recover. > > > > > > > > This v2 does not adddress the concerns mentioned in v1. > > > > And I think it is better to exit with error from do_reset, and schedule > > > > a thorough > > > > do_hard_reset if the the adapter is already in unstable state. > > > > > > We had a FATAL error and when handling it, we failed to send a > > > link-down message to the VIOS. So what we need to try next is to > > > reset the connection with the VIOS. For this we must talk to the > > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > > > does just that in ibmvnic_reset_crq(). > > > > > > Now, sure we can attempt a "thorough hard reset" which also does > > > the same hcalls to reestablish the connection. Is there any > > > other magic in do_hard_reset()? But in addition, it also frees lot > > > more Linux kernel buffers and reallocates them for instance. 
> > > > Working around everything in do_reset will make the code very difficult > > to manage. Ultimately do_reset can do anything I am afraid, and > > do_hard_reset > > can be removed completely or merged into do_reset. Michal, I should have given more details about the above statement. Thanks for your detailed info in the below. > > This debate is not very constructive. > > In the context of driver that has separate do_reset and do_hard_reset > this fix picks the correct one unless you can refute the arguments > provided. > > Merging do_reset and do_hard_reset might be a good code cleanup which is > out of the scope of this fix. Right. > > > > Given that vast majority of fixes to the vnic driver are related to the > reset handling it would improve stability and testability if every > reset took the same code path. I agree. > > In the context of merging do_hard_reset and do_reset the question is > what is the intended distinction and performance gain by having > 'lightweight' reset. Right. > > I don't have a vnic protocol manual at hand and I suspect I would not > get one even if I searched for one. > > From reading through the fixes in the past my understanding is that the > full reset is required when the backend changes which then potentially > requires different size/number of buffers. Yes, full reset is better when the backend changes. > > What is the expected situation when reset is required without changing > the backend? > > Is this so common that it warrants a separate 'lightweight' optimized > function? >
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
Hello, On Thu, Apr 22, 2021 at 12:06:45AM -0500, Lijun Pan wrote: > On Wed, Apr 21, 2021 at 2:25 AM Sukadev Bhattiprolu > wrote: > > > > Lijun Pan [l...@linux.vnet.ibm.com] wrote: > > > > > > > > > > On Apr 20, 2021, at 4:35 PM, Dany Madden wrote: > > > > > > > > When ibmvnic gets a FATAL error message from the vnicserver, it marks > > > > the Command Respond Queue (CRQ) inactive and resets the adapter. If this > > > > FATAL reset fails and a transmission timeout reset follows, the CRQ is > > > > still inactive, ibmvnic's attempt to set link down will also fail. If > > > > ibmvnic abandons the reset because of this failed set link down and this > > > > is the last reset in the workqueue, then this adapter will be left in an > > > > inoperable state. > > > > > > > > Instead, make the driver ignore this link down failure and continue to > > > > free and re-register CRQ so that the adapter has an opportunity to > > > > recover. > > > > > > This v2 does not adddress the concerns mentioned in v1. > > > And I think it is better to exit with error from do_reset, and schedule a > > > thorough > > > do_hard_reset if the the adapter is already in unstable state. > > > > We had a FATAL error and when handling it, we failed to send a > > link-down message to the VIOS. So what we need to try next is to > > reset the connection with the VIOS. For this we must talk to the > > firmware using the H_FREE_CRQ and H_REG_CRQ hcalls. do_reset() > > does just that in ibmvnic_reset_crq(). > > > > Now, sure we can attempt a "thorough hard reset" which also does > > the same hcalls to reestablish the connection. Is there any > > other magic in do_hard_reset()? But in addition, it also frees lot > > more Linux kernel buffers and reallocates them for instance. > > Working around everything in do_reset will make the code very difficult > to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset > can be removed completely or merged into do_reset. 
This debate is not very constructive. In the context of a driver that has separate do_reset and do_hard_reset this fix picks the correct one unless you can refute the arguments provided. Merging do_reset and do_hard_reset might be a good code cleanup which is out of the scope of this fix. Given that the vast majority of fixes to the vnic driver are related to the reset handling, it would improve stability and testability if every reset took the same code path. In the context of merging do_hard_reset and do_reset, the question is what is the intended distinction and performance gain by having a 'lightweight' reset. I don't have a vnic protocol manual at hand and I suspect I would not get one even if I searched for one. From reading through the fixes in the past my understanding is that the full reset is required when the backend changes which then potentially requires different size/number of buffers. What is the expected situation when reset is required without changing the backend? Is this so common that it warrants a separate 'lightweight' optimized function? Thanks Michal
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
On Thu, Apr 22, 2021 at 2:07 AM Rick Lindsley wrote: > > On 4/21/21 10:30 PM, Lijun Pan wrote: > >> Fixes: ed651a10875f ("ibmvnic: Updated reset handling") > >> Signed-off-by: Dany Madden > >> Reviewed-by: Rick Lindsley > >> Reviewed-by: Sukadev Bhattiprolu > > > > One thing I would like to point out as already pointed out by Nathan Lynch > > is > > that those review-by tags given by the same groups of people from the same > > company loses credibility over time if you never critique or ask > > questions on the list. > > > > Well, so far you aren't addressing either my critiques or questions. > > I have been asking questions but all I have from you are the above > attempts to discredit the reputation of myself and other people, and > non-technical statements like > > will make the code very difficult to manage > I think there should be a trade off between optimization and stability. > So I don't think you could even compare the two results > > On the other hand, from the original submission I see some very specific > details: > > If ibmvnic abandons the reset because of this failed set link > down and this is the last reset in the workqueue, then this > adapter will be left in an inoperable state. > > and from a followup discussion: > > We had a FATAL error and when handling it, we failed to > send a link-down message to the VIOS. So what we need > to try next is to reset the connection with the VIOS. For > this we must ... > > These are great technical points that could be argued or discussed. > Problem is, I agree with them. > > I will ask again: can you please supply some technical reasons for > your objections. Otherwise, your objections are meritless and at worst > simply an ad hominem attack. Well, from the beginning of v1, I started to provide technical inputs. Then I was not allowed to post anything in the community about this patch and VNIC via l...@linux.ibm.com except giving an ack-by/reviewed-by.
Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()
On Thu, Apr 22, 2021 at 08:05:27AM +, David Laight wrote: > > (Does anyone - and by anyone I mean any large distro - compile with > > local variables inited by the compiler?) > > There are compilers that initialise locals to zero for 'debug' builds > and leave the 'random' for optimised 'release' builds. > Lets not test what we are releasing! Yeah, that's the worst of all possible worlds. > I also think there is a new option to gcc (or clang?) to initialise > on-stack structures and arrays to ensure garbage isn't passed. > That seems to be a horrid performance hit! > Especially in userspace where large stack allocations are almost free. > > Any auto-initialise ought to be with a semi-random value > (especially not zero) so that it is never right and doesn't > lead to lazy coding. Many compilers did something like this, decades ago -- for debug builds. Segher
Re: [PATCH 2/2] powerpc: Print esr register when hitting Program Interrupt
On 22/04/2021 at 17:29, Christophe Leroy wrote: On 22/04/2021 at 17:10, Xiongwei Song wrote: From: Xiongwei Song The esr register has the details of Program Interrupt on BookE/4xx cpus, printing its value is helpful. Signed-off-by: Xiongwei Song --- arch/powerpc/kernel/process.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 5c3830837f3a..664aecf8ee2e 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1459,6 +1459,7 @@ static bool interrupt_detail_printable(int trap) case INTERRUPT_MACHINE_CHECK: case INTERRUPT_DATA_STORAGE: case INTERRUPT_ALIGNMENT: + case INTERRUPT_PROGRAM: With this, it will also print the DSISR on 8xx/6xx so it will print garbage. 8xx/6xx provide the information in SRR1. If you want to proceed, you have to do the same as in ISI: Copy the content of SRR1 into regs->dsisr in the assembly handler in head_book3s_32.S and in the instruction TLB error handler in head_8xx.S. In fact, we already have get_reason() called by do_program_check() to retrieve the reason of the program check exception. Maybe it could be used to print useful information instead of starting to do almost the same in another way. Or we do as I suggested above, and we remove that get_reason() stuff. But get_reason() is also used by the alignment exception. So that doesn't look easy. I don't know what the best approach is. return true; default: return false;
Re: [PATCH 2/2] powerpc: Print esr register when hitting Program Interrupt
On 22/04/2021 at 17:10, Xiongwei Song wrote: From: Xiongwei Song The esr register has the details of Program Interrupt on BookE/4xx cpus, printing its value is helpful. Signed-off-by: Xiongwei Song --- arch/powerpc/kernel/process.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 5c3830837f3a..664aecf8ee2e 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1459,6 +1459,7 @@ static bool interrupt_detail_printable(int trap) case INTERRUPT_MACHINE_CHECK: case INTERRUPT_DATA_STORAGE: case INTERRUPT_ALIGNMENT: + case INTERRUPT_PROGRAM: With this, it will also print the DSISR on 8xx/6xx so it will print garbage. 8xx/6xx provide the information in SRR1. If you want to proceed, you have to do the same as in ISI: Copy the content of SRR1 into regs->dsisr in the assembly handler in head_book3s_32.S and in the instruction TLB error handler in head_8xx.S return true; default: return false;
Re: [PATCH 1/2] powerpc: Make the code in __show_regs nice-looking
On 22/04/2021 at 17:10, Xiongwei Song wrote: From: Xiongwei Song Create a new function named interrupt_detail_printable to judge which interrupts can print esr/dsisr register. What is the benefit of that function? It may be interesting if the test was done at several places, but as it is done at only one place, I don't think it is an improvement. Until now, you knew immediately which traps would print it. Now you have to go and look into a sub-function. And the name is not nice either. All other functions testing anything on the trap type are called trap_is_something(). Here your function would be better called something like trap_sets_dsisr() Signed-off-by: Xiongwei Song --- arch/powerpc/kernel/process.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 89e34aa273e2..5c3830837f3a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1453,9 +1453,23 @@ static void print_msr_bits(unsigned long val) #define REGS_PER_LINE 8 #endif +static bool interrupt_detail_printable(int trap) +{ + switch (trap) { + case INTERRUPT_MACHINE_CHECK: + case INTERRUPT_DATA_STORAGE: + case INTERRUPT_ALIGNMENT: + return true; + default: + return false; + } + + return false; That's redundant with the default case inside the switch. 
+} + static void __show_regs(struct pt_regs *regs) { - int i, trap; + int i; printk("NIP: "REG" LR: "REG" CTR: "REG"\n", regs->nip, regs->link, regs->ctr); @@ -1464,12 +1478,9 @@ static void __show_regs(struct pt_regs *regs) printk("MSR: "REG" ", regs->msr); print_msr_bits(regs->msr); pr_cont(" CR: %08lx XER: %08lx\n", regs->ccr, regs->xer); - trap = TRAP(regs); if (!trap_is_syscall(regs) && cpu_has_feature(CPU_FTR_CFAR)) pr_cont("CFAR: "REG" ", regs->orig_gpr3); - if (trap == INTERRUPT_MACHINE_CHECK || - trap == INTERRUPT_DATA_STORAGE || - trap == INTERRUPT_ALIGNMENT) { + if (interrupt_detail_printable(TRAP(regs))) { if (IS_ENABLED(CONFIG_4xx) || IS_ENABLED(CONFIG_BOOKE)) pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, regs->dsisr); else
[PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
From: "Gautham R. Shenoy" Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)") sets the exit latency of CEDE(0) based on the latency values of the Extended CEDE states advertised by the platform. On some of the POWER9 LPARs, the older firmwares advertise a very low value of 2us for CEDE1 exit latency on a Dedicated LPAR. However, the measured value is 5us on average. Due to the low advertised exit latency, we are entering CEDE(0) more aggressively on such platforms. While this helps achieve SMT folding faster, we pay the penalty of having to send an IPI to wake up the CPU when the target residency is very short. This is showing up as a performance regression on the newer kernels running on the LPARs with older firmware. Hence, set the exit latency of CEDE(0) based on the latency values advertised by the platform only from POWER10 onwards. The values advertised on POWER10 platforms are more realistic and informed by the latency measurements. For platforms older than POWER10, retain the hardcoded value of exit latency, which is 10us. Though this is higher than the measured values, we would be erring on the side of caution. Reported-by: Enrico Joedecke Fixes: commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for CEDE(0)") Signed-off-by: Gautham R. Shenoy --- drivers/cpuidle/cpuidle-pseries.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c index a2b5c6f..7207467 100644 --- a/drivers/cpuidle/cpuidle-pseries.c +++ b/drivers/cpuidle/cpuidle-pseries.c @@ -419,7 +419,8 @@ static int pseries_idle_probe(void) cpuidle_state_table = shared_states; max_idle_state = ARRAY_SIZE(shared_states); } else { - fixup_cede0_latency(); + if (pvr_version_is(PVR_POWER10)) + fixup_cede0_latency(); cpuidle_state_table = dedicated_states; max_idle_state = NR_DEDICATED_STATES; } -- 1.9.4
[PATCH 2/2] powerpc: Print esr register when hitting Program Interrupt
From: Xiongwei Song The esr register has the details of Program Interrupt on BookE/4xx cpus, printing its value is helpful. Signed-off-by: Xiongwei Song --- arch/powerpc/kernel/process.c | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 5c3830837f3a..664aecf8ee2e 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1459,6 +1459,7 @@ static bool interrupt_detail_printable(int trap) case INTERRUPT_MACHINE_CHECK: case INTERRUPT_DATA_STORAGE: case INTERRUPT_ALIGNMENT: + case INTERRUPT_PROGRAM: return true; default: return false; -- 2.17.1
[PATCH 1/2] powerpc: Make the code in __show_regs nice-looking
From: Xiongwei Song Create a new function named interrupt_detail_printable to judge which interrupts can print esr/dsisr register. Signed-off-by: Xiongwei Song --- arch/powerpc/kernel/process.c | 21 - 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 89e34aa273e2..5c3830837f3a 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -1453,9 +1453,23 @@ static void print_msr_bits(unsigned long val) #define REGS_PER_LINE 8 #endif +static bool interrupt_detail_printable(int trap) +{ + switch (trap) { + case INTERRUPT_MACHINE_CHECK: + case INTERRUPT_DATA_STORAGE: + case INTERRUPT_ALIGNMENT: + return true; + default: + return false; + } + + return false; +} + static void __show_regs(struct pt_regs *regs) { - int i, trap; + int i; printk("NIP: "REG" LR: "REG" CTR: "REG"\n", regs->nip, regs->link, regs->ctr); @@ -1464,12 +1478,9 @@ static void __show_regs(struct pt_regs *regs) printk("MSR: "REG" ", regs->msr); print_msr_bits(regs->msr); pr_cont(" CR: %08lx XER: %08lx\n", regs->ccr, regs->xer); - trap = TRAP(regs); if (!trap_is_syscall(regs) && cpu_has_feature(CPU_FTR_CFAR)) pr_cont("CFAR: "REG" ", regs->orig_gpr3); - if (trap == INTERRUPT_MACHINE_CHECK || - trap == INTERRUPT_DATA_STORAGE || - trap == INTERRUPT_ALIGNMENT) { + if (interrupt_detail_printable(TRAP(regs))) { if (IS_ENABLED(CONFIG_4xx) || IS_ENABLED(CONFIG_BOOKE)) pr_cont("DEAR: "REG" ESR: "REG" ", regs->dar, regs->dsisr); else -- 2.17.1
Re: [PATCH v2] powerpc/kconfig: Restore alphabetic order of the selects under CONFIG_PPC
Christophe Leroy writes: > Commit a7d2475af7ae ("powerpc: Sort the selects under CONFIG_PPC") > sorted all selects under CONFIG_PPC. > > 4 years later, several items have been introduced at wrong place, > a few other have been renamed without moving them to their correct > place. > > Reorder them now. > > While we are at it, simplify the test for a couple of them: > - PPC_64 && PPC_PSERIES is simplified in PPC_PSERIES > - PPC_64 && PPC_BOOK3S is simplified in PPC_BOOK3S_64 > > Signed-off-by: Christophe Leroy > --- > v2: Rebased on d20f726744a0 ("Automatic merge of 'next' into merge > (2021-04-21 22:57)") This will conflict badly with other things in linux-next if I merge it now. The best time to do this is just before rc1. I'll try to remember :) cheers
Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2
On Thu, Apr 22, 2021 at 11:49:31PM +1000, Michael Ellerman wrote: > Alex Williamson writes: > > On Mon, 12 Apr 2021 19:41:41 +1000 > > Michael Ellerman wrote: > > > >> Alex Williamson writes: > >> > On Fri, 26 Mar 2021 07:13:10 +0100 > >> > Christoph Hellwig wrote: > >> > > >> >> This driver never had any open userspace (which for VFIO would include > >> >> VM kernel drivers) that use it, and thus should never have been added > >> >> by our normal userspace ABI rules. > >> >> > >> >> Signed-off-by: Christoph Hellwig > >> >> Acked-by: Greg Kroah-Hartman > >> >> drivers/vfio/pci/Kconfig| 6 - > >> >> drivers/vfio/pci/Makefile | 1 - > >> >> drivers/vfio/pci/vfio_pci.c | 18 - > >> >> drivers/vfio/pci/vfio_pci_nvlink2.c | 490 > >> >> drivers/vfio/pci/vfio_pci_private.h | 14 - > >> >> include/uapi/linux/vfio.h | 38 +-- > >> >> 6 files changed, 4 insertions(+), 563 deletions(-) > >> >> delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c > >> > > >> > Hearing no objections, applied to vfio next branch for v5.13. Thanks, > >> > >> Looks like you only took patch 1? > >> > >> I can't take patch 2 on its own, that would break the build. > >> > >> Do you want to take both patches? There's currently no conflicts against > >> my tree. It's possible one could appear before the v5.13 merge window, > >> though it would probably just be something minor. > >> > >> Or I could apply both patches to my tree, which means patch 1 would > >> appear as two commits in the git history, but that's not a big deal. > > > > I've already got a conflict in my next branch with patch 1, so it's > > best to go through my tree. Seems like a shared branch would be > > easiest to allow you to merge and manage potential conflicts against > > patch 2, I've pushed a branch here: > > > > https://github.com/awilliam/linux-vfio.git v5.13/vfio/nvlink > > Thanks. > > My next is based on rc2, so I won't pull that in directly, because I > don't want to pull all of rc6 in with it. 
Linus is fine with merging in rc's for development reasons; he just doesn't like it when people merge rc's without a purpose. Merging rc7 into your tree and then pulling the nvlink topic would be acceptable. Or just do nothing, because Alex will send it through his tree - this extra co-ordination is really only necessary if there are conflicts. Jason
Re: [PATCH 1/2] vfio/pci: remove vfio_pci_nvlink2
Alex Williamson writes: > On Mon, 12 Apr 2021 19:41:41 +1000 > Michael Ellerman wrote: > >> Alex Williamson writes: >> > On Fri, 26 Mar 2021 07:13:10 +0100 >> > Christoph Hellwig wrote: >> > >> >> This driver never had any open userspace (which for VFIO would include >> >> VM kernel drivers) that use it, and thus should never have been added >> >> by our normal userspace ABI rules. >> >> >> >> Signed-off-by: Christoph Hellwig >> >> Acked-by: Greg Kroah-Hartman >> >> --- >> >> drivers/vfio/pci/Kconfig| 6 - >> >> drivers/vfio/pci/Makefile | 1 - >> >> drivers/vfio/pci/vfio_pci.c | 18 - >> >> drivers/vfio/pci/vfio_pci_nvlink2.c | 490 >> >> drivers/vfio/pci/vfio_pci_private.h | 14 - >> >> include/uapi/linux/vfio.h | 38 +-- >> >> 6 files changed, 4 insertions(+), 563 deletions(-) >> >> delete mode 100644 drivers/vfio/pci/vfio_pci_nvlink2.c >> > >> > Hearing no objections, applied to vfio next branch for v5.13. Thanks, >> >> Looks like you only took patch 1? >> >> I can't take patch 2 on its own, that would break the build. >> >> Do you want to take both patches? There's currently no conflicts against >> my tree. It's possible one could appear before the v5.13 merge window, >> though it would probably just be something minor. >> >> Or I could apply both patches to my tree, which means patch 1 would >> appear as two commits in the git history, but that's not a big deal. > > I've already got a conflict in my next branch with patch 1, so it's > best to go through my tree. Seems like a shared branch would be > easiest to allow you to merge and manage potential conflicts against > patch 2, I've pushed a branch here: > > https://github.com/awilliam/linux-vfio.git v5.13/vfio/nvlink Thanks. My next is based on rc2, so I won't pull that in directly, because I don't want to pull all of rc6 in with it. I'll put it in a topic branch and merge it into my next after my first pull has gone to Linus. cheers
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
Michael Ellerman wrote: "Naveen N. Rao" writes: Daniel Axtens wrote: Sathvika Vasireddy writes: This adds emulation support for the following instruction: * Set Boolean (setb) Signed-off-by: Sathvika Vasireddy ... If you do end up respinning the patch, I think it would be good to make the maths a bit clearer. I think it works because a left shift of 2 is the same as multiplying by 4, but it would be easier to follow if you used a temporary variable for btf. Indeed. I wonder if it is better to follow the ISA itself. Per the ISA, the bit we are interested in is: 4 x BFA + 32 So, if we use that along with the PPC_BIT() macro, we get: if (regs->ccr & PPC_BIT(ra + 32)) Use of PPC_BIT risks annoying your maintainer :) Uh oh... that isn't good :) I looked up previous discussions and I think I now understand why you don't prefer it. But, I feel it helps make it easy to follow the code when referring to the ISA. I'm wondering if it is just the name you dislike and if so, does it make sense to rename PPC_BIT() to something else? We have BIT_ULL(), so perhaps BIT_MSB_ULL() or MSB_BIT_ULL()? - Naveen
[PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor
The $(CPP) (do only preprocessing) macro is already defined in the top-level Makefile. However powerpc redefines it and adds $(KBUILD_CFLAGS), which results in flag duplication. This is not a big deal by itself, except for flags which depend on other flags and which the compiler checks as it parses the command line. Specifically, scripts/Makefile.build:304 generates ksyms for .S files. If clang+llvm+sanitizer are enabled, this results in -fno-lto -flto -fsanitize=cfi-mfcall -fno-lto -flto -fsanitize=cfi-mfcall in the clang command line and triggers the error: clang-13: error: invalid argument '-fsanitize=cfi-mfcall' only allowed with '-flto' This removes the unnecessary CPP redefinition. Signed-off-by: Alexey Kardashevskiy --- arch/powerpc/Makefile | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile index c9d2c7825cd6..3a2f2001c62b 100644 --- a/arch/powerpc/Makefile +++ b/arch/powerpc/Makefile @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr) KBUILD_AFLAGS += $(AFLAGS-y) KBUILD_CFLAGS += $(call cc-option,-msoft-float) KBUILD_CFLAGS += -pipe $(CFLAGS-y) -CPP= $(CC) -E $(KBUILD_CFLAGS) CHECKFLAGS += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__ ifdef CONFIG_CPU_BIG_ENDIAN -- 2.25.1
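The duplication mechanism can be reproduced outside the kernel build. The mini-Makefile below is hypothetical (file names and flags are made up for illustration): because the generic rule already appends $(KBUILD_CFLAGS), a CPP definition that bakes them in puts every flag on the command line twice.

```shell
# Hypothetical mini-Makefile (not the kernel build) sketching the duplication:
# a Makefile.build-style rule appends $(KBUILD_CFLAGS) itself, so a CPP that
# already contains them emits each flag twice.
printf '%s\n' \
  'KBUILD_CFLAGS := -fno-lto -flto' \
  'CPP = $(CC) -E $(KBUILD_CFLAGS)' \
  'cmd = $(CPP) $(KBUILD_CFLAGS) -P file.S' \
  'show:' \
  '	@echo "$(cmd)"' > /tmp/cpp-dup.mk
make -s -f /tmp/cpp-dup.mk CC=cc show
# echoes: cc -E -fno-lto -flto -fno-lto -flto -P file.S
```

With the powerpc override removed, CPP falls back to the top-level definition (plain `$(CC) -E`) and each flag would appear only once.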
Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()
On Thu, Apr 22, 2021 at 08:05:27AM +, David Laight wrote: > From: Daniel Axtens > > Sent: 22 April 2021 03:21 > > > > > Hi Lakshmi, > > > > > >> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote: > > >> > > >> Sorry - missed copying device-tree and powerpc mailing lists. > > >> > > >>> There are a few "goto out;" statements before the local variable "fdt" > > >>> is initialized through the call to of_kexec_alloc_and_setup_fdt() in > > >>> elf64_load(). This will result in an uninitialized "fdt" being passed > > >>> to kvfree() in this function if there is an error before the call to > > >>> of_kexec_alloc_and_setup_fdt(). > > >>> > > >>> Initialize the local variable "fdt" to NULL. > > >>> > > > I'm a huge fan of initialising local variables! But I'm struggling to > > > find the code path that will lead to an uninit fdt being returned... > > > > OK, so perhaps this was putting it too strongly. I have been bitten > > by uninitialised things enough in C that I may have taken a slightly > > overly-aggressive view of fixing them in the source rather than the > > compiler. I do think compiler-level mitigations are better, and I take > > the point that we don't want to defeat compiler checking. > > > > (Does anyone - and by anyone I mean any large distro - compile with > > local variables inited by the compiler?) > > There are compilers that initialise locals to zero for 'debug' builds > and leave them 'random' for optimised 'release' builds. > Let's not test what we are releasing! We're eventually going to move to a world where initializing to zero is the default for the kernel. I think people will still want to initialize to a poison value for debug builds. Initializing to zero is better for debugging because it's more predictable. And it avoids information leaks. And dereferencing random uninitialized pointers is a privilege escalation but dereferencing a NULL is just an Oops. 
The speed impact is not very significant because (conceptually) it only needs to be done where there is a compiler warning about uninitialized variables. It's slightly more complicated in real life. In this case, the compiler doesn't know what happens inside the kexec_build_elf_info() function so it silences the warning. And GCC silences warnings if the variable is initialized inside a loop even when it doesn't know that we enter the loop. regards, dan carpenter
Re: [PATCH v4 00/14] Restricted DMA
v5 is here: https://lore.kernel.org/patchwork/cover/1416899/ (rebased onto Christoph's swiotlb cleanups).
[PATCH v5 16/16] of: Add plumbing for restricted DMA pool
If a device is not behind an IOMMU, we look up the device node and set up the restricted DMA when a restricted-dma-pool reserved-memory region is present. Signed-off-by: Claire Chang --- drivers/of/address.c| 25 + drivers/of/device.c | 3 +++ drivers/of/of_private.h | 5 + 3 files changed, 33 insertions(+) diff --git a/drivers/of/address.c b/drivers/of/address.c index 54f221dde267..fff3adfe4986 100644 --- a/drivers/of/address.c +++ b/drivers/of/address.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np) } EXPORT_SYMBOL_GPL(of_dma_is_coherent); +int of_dma_set_restricted_buffer(struct device *dev) +{ + struct device_node *node; + int count, i; + + if (!dev->of_node) + return 0; + + count = of_property_count_elems_of_size(dev->of_node, "memory-region", + sizeof(phandle)); + for (i = 0; i < count; i++) { + node = of_parse_phandle(dev->of_node, "memory-region", i); + /* There might be multiple memory regions, but only one +* restricted-dma-pool region is allowed. 
+*/ + if (of_device_is_compatible(node, "restricted-dma-pool") && + of_device_is_available(node)) + return of_reserved_mem_device_init_by_idx( + dev, dev->of_node, i); + } + + return 0; +} + /** * of_mmio_is_nonposted - Check if device uses non-posted MMIO * @np:device node diff --git a/drivers/of/device.c b/drivers/of/device.c index c5a9473a5fb1..d8d865223e51 100644 --- a/drivers/of/device.c +++ b/drivers/of/device.c @@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, arch_setup_dma_ops(dev, dma_start, size, iommu, coherent); + if (!iommu) + return of_dma_set_restricted_buffer(dev); + return 0; } EXPORT_SYMBOL_GPL(of_dma_configure_id); diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h index d717efbd637d..e9237f5eff48 100644 --- a/drivers/of/of_private.h +++ b/drivers/of/of_private.h @@ -163,12 +163,17 @@ struct bus_dma_region; #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA) int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map); +int of_dma_set_restricted_buffer(struct device *dev); #else static inline int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map) { return -ENODEV; } +static inline int of_dma_get_restricted_buffer(struct device *dev) +{ + return -ENODEV; +} #endif #endif /* _LINUX_OF_PRIVATE_H */ -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 15/16] dt-bindings: of: Add restricted DMA pool
Introduce the new compatible string, restricted-dma-pool, for restricted DMA. One can specify the address and length of the restricted DMA memory region by restricted-dma-pool in the reserved-memory node. Signed-off-by: Claire Chang --- .../reserved-memory/reserved-memory.txt | 24 +++ 1 file changed, 24 insertions(+) diff --git a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt index e8d3096d922c..fc9a12c2f679 100644 --- a/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt +++ b/Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt @@ -51,6 +51,20 @@ compatible (optional) - standard definition used as a shared pool of DMA buffers for a set of devices. It can be used by an operating system to instantiate the necessary pool management subsystem if necessary. +- restricted-dma-pool: This indicates a region of memory meant to be + used as a pool of restricted DMA buffers for a set of devices. The + memory region would be the only region accessible to those devices. + When using this, the no-map and reusable properties must not be set, + so the operating system can create a virtual mapping that will be used + for synchronization. The main purpose for restricted DMA is to + mitigate the lack of DMA access control on systems without an IOMMU, + which could result in the DMA accessing the system memory at + unexpected times and/or unexpected addresses, possibly leading to data + leakage or corruption. The feature on its own provides a basic level + of protection against the DMA overwriting buffer contents at + unexpected times. However, to protect against general data leakage and + system memory corruption, the system needs to provide a way to lock down + the memory access, e.g., MPU. 
- vendor specific string in the form ,[-] no-map (optional) - empty property - Indicates the operating system must not create a virtual mapping @@ -120,6 +134,11 @@ one for multimedia processing (named multimedia-memory@7700, 64MiB). compatible = "acme,multimedia-memory"; reg = <0x7700 0x400>; }; + + restricted_dma_mem_reserved: restricted_dma_mem_reserved { + compatible = "restricted-dma-pool"; + reg = <0x5000 0x40>; + }; }; /* ... */ @@ -138,4 +157,9 @@ one for multimedia processing (named multimedia-memory@7700, 64MiB). memory-region = <_reserved>; /* ... */ }; + + pcie_device: pcie_device@0,0 { + memory-region = <_dma_mem_reserved>; + /* ... */ + }; }; -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available
The restricted DMA pool is preferred if available. The restricted DMA pools provide a basic level of protection against the DMA overwriting buffer contents at unexpected times. However, to protect against general data leakage and system memory corruption, the system needs to provide a way to lock down the memory access, e.g., MPU. Signed-off-by: Claire Chang --- kernel/dma/direct.c | 35 ++- 1 file changed, 26 insertions(+), 9 deletions(-) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 7a27f0510fcc..29523d2a9845 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size) static void __dma_direct_free_pages(struct device *dev, struct page *page, size_t size) { +#ifdef CONFIG_DMA_RESTRICTED_POOL + if (swiotlb_free(dev, page, size)) + return; +#endif dma_free_contiguous(dev, page, size); } @@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask, _limit); - page = dma_alloc_contiguous(dev, size, gfp); + +#ifdef CONFIG_DMA_RESTRICTED_POOL + page = swiotlb_alloc(dev, size); + if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) { + __dma_direct_free_pages(dev, page, size); + page = NULL; + } +#endif + + if (!page) + page = dma_alloc_contiguous(dev, size, gfp); if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) { dma_free_contiguous(dev, page, size); page = NULL; @@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size, gfp |= __GFP_NOWARN; if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) && - !force_dma_unencrypted(dev)) { + !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) { page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO); if (!page) return NULL; @@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size, } if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) && - 
!IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && - !dev_is_dma_coherent(dev)) + !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) && + !is_dev_swiotlb_force(dev)) return arch_dma_alloc(dev, size, dma_handle, gfp, attrs); /* @@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size, if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) && !gfpflags_allow_blocking(gfp) && (force_dma_unencrypted(dev) || -(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev +(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && + !dev_is_dma_coherent(dev))) && + !is_dev_swiotlb_force(dev)) return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp); /* we always manually zero the memory once we are done */ @@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size, unsigned int page_order = get_order(size); if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) && - !force_dma_unencrypted(dev)) { + !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) { /* cpu_addr is a struct page cookie, not a kernel address */ dma_free_contiguous(dev, cpu_addr, size); return; } if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) && - !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && - !dev_is_dma_coherent(dev)) { + !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) && + !is_dev_swiotlb_force(dev)) { arch_dma_free(dev, size, cpu_addr, dma_addr, attrs); return; } @@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size, void *ret; if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) && - force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp)) + force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) && + !is_dev_swiotlb_force(dev)) return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp); page = __dma_direct_alloc_pages(dev, size, gfp); -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 13/16] swiotlb: Add restricted DMA alloc/free support.
Add the functions, swiotlb_{alloc,free} to support the memory allocation from restricted DMA pool. Signed-off-by: Claire Chang --- include/linux/swiotlb.h | 4 kernel/dma/swiotlb.c| 35 +-- 2 files changed, 37 insertions(+), 2 deletions(-) diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 0c5a18d9cf89..e8cf49bd90c5 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -134,6 +134,10 @@ unsigned int swiotlb_max_segment(void); size_t swiotlb_max_mapping_size(struct device *dev); bool is_swiotlb_active(struct device *dev); void __init swiotlb_adjust_size(unsigned long size); +#ifdef CONFIG_DMA_RESTRICTED_POOL +struct page *swiotlb_alloc(struct device *dev, size_t size); +bool swiotlb_free(struct device *dev, struct page *page, size_t size); +#endif /* CONFIG_DMA_RESTRICTED_POOL */ #else #define swiotlb_force SWIOTLB_NO_FORCE static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index af0feb8eaead..274272c79080 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -454,8 +454,9 @@ static int find_slots(struct device *dev, phys_addr_t orig_addr, index = wrap = wrap_index(mem, ALIGN(mem->index, stride)); do { - if ((slot_addr(tbl_dma_addr, index) & iotlb_align_mask) != - (orig_addr & iotlb_align_mask)) { + if (orig_addr && + (slot_addr(tbl_dma_addr, index) & iotlb_align_mask) != + (orig_addr & iotlb_align_mask)) { index = wrap_index(mem, index + 1); continue; } @@ -695,6 +696,36 @@ late_initcall(swiotlb_create_default_debugfs); #endif #ifdef CONFIG_DMA_RESTRICTED_POOL +struct page *swiotlb_alloc(struct device *dev, size_t size) +{ + struct io_tlb_mem *mem = dev->dma_io_tlb_mem; + phys_addr_t tlb_addr; + int index; + + if (!mem) + return NULL; + + index = find_slots(dev, 0, size); + if (index == -1) + return NULL; + + tlb_addr = slot_addr(mem->start, index); + + return pfn_to_page(PFN_DOWN(tlb_addr)); +} + +bool swiotlb_free(struct device *dev, struct 
page *page, size_t size) +{ + phys_addr_t tlb_addr = page_to_phys(page); + + if (!is_swiotlb_buffer(dev, tlb_addr)) + return false; + + release_slots(dev, tlb_addr); + + return true; +} + static int rmem_swiotlb_device_init(struct reserved_mem *rmem, struct device *dev) { -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 12/16] dma-direct: Add a new wrapper __dma_direct_free_pages()
Add a new wrapper __dma_direct_free_pages() that will be useful later for swiotlb_free(). Signed-off-by: Claire Chang --- kernel/dma/direct.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 7a88c34d0867..7a27f0510fcc 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -75,6 +75,12 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size) min_not_zero(dev->coherent_dma_mask, dev->bus_dma_limit); } +static void __dma_direct_free_pages(struct device *dev, struct page *page, + size_t size) +{ + dma_free_contiguous(dev, page, size); +} + static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size, gfp_t gfp) { @@ -237,7 +243,7 @@ void *dma_direct_alloc(struct device *dev, size_t size, return NULL; } out_free_pages: - dma_free_contiguous(dev, page, size); + __dma_direct_free_pages(dev, page, size); return NULL; } @@ -273,7 +279,7 @@ void dma_direct_free(struct device *dev, size_t size, else if (IS_ENABLED(CONFIG_ARCH_HAS_DMA_CLEAR_UNCACHED)) arch_dma_clear_uncached(cpu_addr, size); - dma_free_contiguous(dev, dma_direct_to_page(dev, dma_addr), size); + __dma_direct_free_pages(dev, dma_direct_to_page(dev, dma_addr), size); } struct page *dma_direct_alloc_pages(struct device *dev, size_t size, @@ -310,7 +316,7 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size, *dma_handle = phys_to_dma_direct(dev, page_to_phys(page)); return page; out_free_pages: - dma_free_contiguous(dev, page, size); + __dma_direct_free_pages(dev, page, size); return NULL; } @@ -329,7 +335,7 @@ void dma_direct_free_pages(struct device *dev, size_t size, if (force_dma_unencrypted(dev)) set_memory_encrypted((unsigned long)vaddr, 1 << page_order); - dma_free_contiguous(dev, page, size); + __dma_direct_free_pages(dev, page, size); } #if defined(CONFIG_ARCH_HAS_SYNC_DMA_FOR_DEVICE) || \ -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 11/16] swiotlb: Refactor swiotlb_tbl_unmap_single
Add a new function, release_slots, to make the code reusable for supporting different bounce buffer pools, e.g. restricted DMA pool. Signed-off-by: Claire Chang --- kernel/dma/swiotlb.c | 35 --- 1 file changed, 20 insertions(+), 15 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index b7d634d7a7eb..af0feb8eaead 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -547,27 +547,15 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, return tlb_addr; } -/* - * tlb_addr is the physical address of the bounce buffer to unmap. - */ -void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr, - size_t mapping_size, enum dma_data_direction dir, - unsigned long attrs) +static void release_slots(struct device *dev, phys_addr_t tlb_addr) { - struct io_tlb_mem *mem = get_io_tlb_mem(hwdev); + struct io_tlb_mem *mem = get_io_tlb_mem(dev); unsigned long flags; - unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr); + unsigned int offset = swiotlb_align_offset(dev, tlb_addr); int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT; int nslots = nr_slots(mem->slots[index].alloc_size + offset); int count, i; - /* -* First, sync the memory before unmapping the entry -*/ - if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && - (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)) - swiotlb_bounce(hwdev, tlb_addr, mapping_size, DMA_FROM_DEVICE); - /* * Return the buffer to the free list by setting the corresponding * entries to indicate the number of contiguous entries available. @@ -602,6 +590,23 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr, spin_unlock_irqrestore(>lock, flags); } +/* + * tlb_addr is the physical address of the bounce buffer to unmap. 
+ */ +void swiotlb_tbl_unmap_single(struct device *dev, phys_addr_t tlb_addr, + size_t mapping_size, enum dma_data_direction dir, + unsigned long attrs) +{ + /* +* First, sync the memory before unmapping the entry +*/ + if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && + (dir == DMA_FROM_DEVICE || dir == DMA_BIDIRECTIONAL)) + swiotlb_bounce(dev, tlb_addr, mapping_size, DMA_FROM_DEVICE); + + release_slots(dev, tlb_addr); +} + void swiotlb_sync_single_for_device(struct device *dev, phys_addr_t tlb_addr, size_t size, enum dma_data_direction dir) { -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 10/16] swiotlb: Move alloc_size to find_slots
Move the maintenance of alloc_size to find_slots for better code reusability later. Signed-off-by: Claire Chang --- kernel/dma/swiotlb.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 96ff36d8ec53..b7d634d7a7eb 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -479,8 +479,11 @@ static int find_slots(struct device *dev, phys_addr_t orig_addr, return -1; found: - for (i = index; i < index + nslots; i++) + for (i = index; i < index + nslots; i++) { mem->slots[i].list = 0; + mem->slots[i].alloc_size = + alloc_size - ((i - index) << IO_TLB_SHIFT); + } for (i = index - 1; io_tlb_offset(i) != IO_TLB_SEGSIZE - 1 && mem->slots[i].list; i--) @@ -535,11 +538,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, * This is needed when we sync the memory. Then we sync the buffer if * needed. */ - for (i = 0; i < nr_slots(alloc_size + offset); i++) { + for (i = 0; i < nr_slots(alloc_size + offset); i++) mem->slots[index + i].orig_addr = slot_addr(orig_addr, i); - mem->slots[index + i].alloc_size = - alloc_size - (i << IO_TLB_SHIFT); - } tlb_addr = slot_addr(mem->start, index) + offset; if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC) && (dir == DMA_TO_DEVICE || dir == DMA_BIDIRECTIONAL)) -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 09/16] swiotlb: Bounce data from/to restricted DMA pool if available
Regardless of swiotlb setting, the restricted DMA pool is preferred if available. The restricted DMA pools provide a basic level of protection against the DMA overwriting buffer contents at unexpected times. However, to protect against general data leakage and system memory corruption, the system needs to provide a way to lock down the memory access, e.g., MPU. Note that is_dev_swiotlb_force doesn't check if swiotlb_force == SWIOTLB_FORCE. Otherwise the memory allocation behavior with default swiotlb will be changed by the following patche ("dma-direct: Allocate memory from restricted DMA pool if available"). Signed-off-by: Claire Chang --- include/linux/swiotlb.h | 13 + kernel/dma/direct.h | 3 ++- kernel/dma/swiotlb.c| 8 3 files changed, 19 insertions(+), 5 deletions(-) diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index c530c976d18b..0c5a18d9cf89 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -120,6 +120,15 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) return mem && paddr >= mem->start && paddr < mem->end; } +static inline bool is_dev_swiotlb_force(struct device *dev) +{ +#ifdef CONFIG_DMA_RESTRICTED_POOL + if (dev->dma_io_tlb_mem) + return true; +#endif /* CONFIG_DMA_RESTRICTED_POOL */ + return false; +} + void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); size_t swiotlb_max_mapping_size(struct device *dev); @@ -131,6 +140,10 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) { return false; } +static inline bool is_dev_swiotlb_force(struct device *dev) +{ + return false; +} static inline void swiotlb_exit(void) { } diff --git a/kernel/dma/direct.h b/kernel/dma/direct.h index 13e9e7158d94..f94813674e23 100644 --- a/kernel/dma/direct.h +++ b/kernel/dma/direct.h @@ -87,7 +87,8 @@ static inline dma_addr_t dma_direct_map_page(struct device *dev, phys_addr_t phys = page_to_phys(page) + offset; dma_addr_t dma_addr = phys_to_dma(dev, phys); - if 
(unlikely(swiotlb_force == SWIOTLB_FORCE)) + if (unlikely(swiotlb_force == SWIOTLB_FORCE) || + is_dev_swiotlb_force(dev)) return swiotlb_map(dev, phys, size, dir, attrs); if (unlikely(!dma_capable(dev, dma_addr, size, true))) { diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 1d221343f1c8..96ff36d8ec53 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -344,7 +344,7 @@ void __init swiotlb_exit(void) static void swiotlb_bounce(struct device *dev, phys_addr_t tlb_addr, size_t size, enum dma_data_direction dir) { - struct io_tlb_mem *mem = io_tlb_default_mem; + struct io_tlb_mem *mem = get_io_tlb_mem(dev); int index = (tlb_addr - mem->start) >> IO_TLB_SHIFT; phys_addr_t orig_addr = mem->slots[index].orig_addr; size_t alloc_size = mem->slots[index].alloc_size; @@ -426,7 +426,7 @@ static unsigned int wrap_index(struct io_tlb_mem *mem, unsigned int index) static int find_slots(struct device *dev, phys_addr_t orig_addr, size_t alloc_size) { - struct io_tlb_mem *mem = io_tlb_default_mem; + struct io_tlb_mem *mem = get_io_tlb_mem(dev); unsigned long boundary_mask = dma_get_seg_boundary(dev); dma_addr_t tbl_dma_addr = phys_to_dma_unencrypted(dev, mem->start) & boundary_mask; @@ -503,7 +503,7 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, size_t mapping_size, size_t alloc_size, enum dma_data_direction dir, unsigned long attrs) { - struct io_tlb_mem *mem = io_tlb_default_mem; + struct io_tlb_mem *mem = get_io_tlb_mem(dev); unsigned int offset = swiotlb_align_offset(dev, orig_addr); unsigned int i; int index; @@ -554,7 +554,7 @@ void swiotlb_tbl_unmap_single(struct device *hwdev, phys_addr_t tlb_addr, size_t mapping_size, enum dma_data_direction dir, unsigned long attrs) { - struct io_tlb_mem *mem = io_tlb_default_mem; + struct io_tlb_mem *mem = get_io_tlb_mem(hwdev); unsigned long flags; unsigned int offset = swiotlb_align_offset(hwdev, tlb_addr); int index = (tlb_addr - offset - mem->start) >> IO_TLB_SHIFT; -- 
2.31.1.368.gbe11c130af-goog
[PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument
Update is_swiotlb_active to add a struct device argument. This will be useful later to allow for restricted DMA pool. Signed-off-by: Claire Chang --- drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +- drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +- drivers/pci/xen-pcifront.c | 2 +- include/linux/swiotlb.h | 4 ++-- kernel/dma/direct.c | 2 +- kernel/dma/swiotlb.c | 4 ++-- 6 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c index ce6b664b10aa..7d48c433446b 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c @@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj) max_order = MAX_ORDER; #ifdef CONFIG_SWIOTLB - if (is_swiotlb_active()) { + if (is_swiotlb_active(NULL)) { unsigned int max_segment; max_segment = swiotlb_max_segment(); diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c index e8b506a6685b..2a2ae6d6cf6d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm) } #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86) - need_swiotlb = is_swiotlb_active(); + need_swiotlb = is_swiotlb_active(NULL); #endif ret = ttm_device_init(>ttm.bdev, _bo_driver, drm->dev->dev, diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c index b7a8f3a1921f..6d548ce53ce7 100644 --- a/drivers/pci/xen-pcifront.c +++ b/drivers/pci/xen-pcifront.c @@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev) spin_unlock(_dev_lock); - if (!err && !is_swiotlb_active()) { + if (!err && !is_swiotlb_active(NULL)) { err = pci_xen_swiotlb_init_late(); if (err) dev_err(>xdev->dev, "Could not setup SWIOTLB!\n"); diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 2a6cca07540b..c530c976d18b 100644 --- 
a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); size_t swiotlb_max_mapping_size(struct device *dev); -bool is_swiotlb_active(void); +bool is_swiotlb_active(struct device *dev); void __init swiotlb_adjust_size(unsigned long size); #else #define swiotlb_force SWIOTLB_NO_FORCE @@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device *dev) return SIZE_MAX; } -static inline bool is_swiotlb_active(void) +static inline bool is_swiotlb_active(struct device *dev) { return false; } diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 84c9feb5474a..7a88c34d0867 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask) size_t dma_direct_max_mapping_size(struct device *dev) { /* If SWIOTLB is active, use its maximum mapping size */ - if (is_swiotlb_active() && + if (is_swiotlb_active(dev) && (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE)) return swiotlb_max_mapping_size(dev); return SIZE_MAX; diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index ffbb8724e06c..1d221343f1c8 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev) return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE; } -bool is_swiotlb_active(void) +bool is_swiotlb_active(struct device *dev) { - return io_tlb_default_mem != NULL; + return get_io_tlb_mem(dev) != NULL; } EXPORT_SYMBOL_GPL(is_swiotlb_active); -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 07/16] swiotlb: Update is_swiotlb_buffer to add a struct device argument
Update is_swiotlb_buffer to add a struct device argument. This will be useful later to allow for restricted DMA pool. Signed-off-by: Claire Chang --- drivers/iommu/dma-iommu.c | 12 ++-- drivers/xen/swiotlb-xen.c | 2 +- include/linux/swiotlb.h | 6 +++--- kernel/dma/direct.c | 6 +++--- kernel/dma/direct.h | 6 +++--- 5 files changed, 16 insertions(+), 16 deletions(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 7bcdd1205535..a5df35bfd150 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -504,7 +504,7 @@ static void __iommu_dma_unmap_swiotlb(struct device *dev, dma_addr_t dma_addr, __iommu_dma_unmap(dev, dma_addr, size); - if (unlikely(is_swiotlb_buffer(phys))) + if (unlikely(is_swiotlb_buffer(dev, phys))) swiotlb_tbl_unmap_single(dev, phys, size, dir, attrs); } @@ -575,7 +575,7 @@ static dma_addr_t __iommu_dma_map_swiotlb(struct device *dev, phys_addr_t phys, } iova = __iommu_dma_map(dev, phys, aligned_size, prot, dma_mask); - if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(phys)) + if (iova == DMA_MAPPING_ERROR && is_swiotlb_buffer(dev, phys)) swiotlb_tbl_unmap_single(dev, phys, org_size, dir, attrs); return iova; } @@ -781,7 +781,7 @@ static void iommu_dma_sync_single_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(phys, size, dir); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(dev, phys)) swiotlb_sync_single_for_cpu(dev, phys, size, dir); } @@ -794,7 +794,7 @@ static void iommu_dma_sync_single_for_device(struct device *dev, return; phys = iommu_iova_to_phys(iommu_get_dma_domain(dev), dma_handle); - if (is_swiotlb_buffer(phys)) + if (is_swiotlb_buffer(dev, phys)) swiotlb_sync_single_for_device(dev, phys, size, dir); if (!dev_is_dma_coherent(dev)) @@ -815,7 +815,7 @@ static void iommu_dma_sync_sg_for_cpu(struct device *dev, if (!dev_is_dma_coherent(dev)) arch_sync_dma_for_cpu(sg_phys(sg), sg->length, dir); - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(dev, 
sg_phys(sg))) swiotlb_sync_single_for_cpu(dev, sg_phys(sg), sg->length, dir); } @@ -832,7 +832,7 @@ static void iommu_dma_sync_sg_for_device(struct device *dev, return; for_each_sg(sgl, sg, nelems, i) { - if (is_swiotlb_buffer(sg_phys(sg))) + if (is_swiotlb_buffer(dev, sg_phys(sg))) swiotlb_sync_single_for_device(dev, sg_phys(sg), sg->length, dir); diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 4c89afc0df62..0c6ed09f8513 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -100,7 +100,7 @@ static int is_xen_swiotlb_buffer(struct device *dev, dma_addr_t dma_addr) * in our domain. Therefore _only_ check address within our domain. */ if (pfn_valid(PFN_DOWN(paddr))) - return is_swiotlb_buffer(paddr); + return is_swiotlb_buffer(dev, paddr); return 0; } diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index b469f04cca26..2a6cca07540b 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -113,9 +113,9 @@ static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev) return io_tlb_default_mem; } -static inline bool is_swiotlb_buffer(phys_addr_t paddr) +static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) { - struct io_tlb_mem *mem = io_tlb_default_mem; + struct io_tlb_mem *mem = get_io_tlb_mem(dev); return mem && paddr >= mem->start && paddr < mem->end; } @@ -127,7 +127,7 @@ bool is_swiotlb_active(void); void __init swiotlb_adjust_size(unsigned long size); #else #define swiotlb_force SWIOTLB_NO_FORCE -static inline bool is_swiotlb_buffer(phys_addr_t paddr) +static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) { return false; } diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index f737e3347059..84c9feb5474a 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -343,7 +343,7 @@ void dma_direct_sync_sg_for_device(struct device *dev, for_each_sg(sgl, sg, nents, i) { phys_addr_t paddr = dma_to_phys(dev, sg_dma_address(sg)); - if 
(unlikely(is_swiotlb_buffer(paddr))) + if (unlikely(is_swiotlb_buffer(dev, paddr))) swiotlb_sync_single_for_device(dev, paddr, sg->length, dir); @@ -369,7 +369,7 @@ void dma_direct_sync_sg_for_cpu(struct device
[PATCH v5 06/16] swiotlb: Add a new get_io_tlb_mem getter
Add a new getter, get_io_tlb_mem, to help select the io_tlb_mem struct. The restricted DMA pool is preferred if available. Signed-off-by: Claire Chang --- include/linux/swiotlb.h | 11 +++ 1 file changed, 11 insertions(+) diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 03ad6e3b4056..b469f04cca26 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -2,6 +2,7 @@ #ifndef __LINUX_SWIOTLB_H #define __LINUX_SWIOTLB_H +#include #include #include #include @@ -102,6 +103,16 @@ struct io_tlb_mem { }; extern struct io_tlb_mem *io_tlb_default_mem; +static inline struct io_tlb_mem *get_io_tlb_mem(struct device *dev) +{ +#ifdef CONFIG_DMA_RESTRICTED_POOL + if (dev && dev->dma_io_tlb_mem) + return dev->dma_io_tlb_mem; +#endif /* CONFIG_DMA_RESTRICTED_POOL */ + + return io_tlb_default_mem; +} + static inline bool is_swiotlb_buffer(phys_addr_t paddr) { struct io_tlb_mem *mem = io_tlb_default_mem; -- 2.31.1.368.gbe11c130af-goog
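A standalone sketch (not kernel code, illustrative type names) of the selection policy this getter introduces: prefer the device's restricted pool when one is attached, otherwise fall back to the global default, and tolerate a NULL device:

```c
#include <assert.h>
#include <stddef.h>

struct io_tlb_mem_model { int id; };

/* Stand-in for the global io_tlb_default_mem pool. */
static struct io_tlb_mem_model default_mem = { 0 };

struct device_model {
    struct io_tlb_mem_model *dma_io_tlb_mem; /* restricted pool, or NULL */
};

/* Mirrors get_io_tlb_mem(): a device-specific restricted pool wins,
 * otherwise (including dev == NULL) the default pool is used. */
static struct io_tlb_mem_model *get_mem(struct device_model *dev)
{
    if (dev && dev->dma_io_tlb_mem)
        return dev->dma_io_tlb_mem;
    return &default_mem;
}
```

The NULL-device case matters because later patches in the series pass NULL from callers (i915, nouveau, xen-pcifront) that only care about the default pool.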
[PATCH v5 05/16] swiotlb: Add restricted DMA pool initialization
Add the initialization function to create restricted DMA pools from matching reserved-memory nodes. Signed-off-by: Claire Chang --- include/linux/device.h | 4 +++ include/linux/swiotlb.h | 3 +- kernel/dma/swiotlb.c| 80 + 3 files changed, 86 insertions(+), 1 deletion(-) diff --git a/include/linux/device.h b/include/linux/device.h index 38a2071cf776..4987608ea4ff 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -416,6 +416,7 @@ struct dev_links_info { * @dma_pools: Dma pools (if dma'ble device). * @dma_mem: Internal for coherent mem override. * @cma_area: Contiguous memory area for dma allocations + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override. * @archdata: For arch-specific additions. * @of_node: Associated device tree node. * @fwnode:Associated device node supplied by platform firmware. @@ -521,6 +522,9 @@ struct device { #ifdef CONFIG_DMA_CMA struct cma *cma_area; /* contiguous memory area for dma allocations */ +#endif +#ifdef CONFIG_DMA_RESTRICTED_POOL + struct io_tlb_mem *dma_io_tlb_mem; #endif /* arch specific additions */ struct dev_archdata archdata; diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 216854a5e513..03ad6e3b4056 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -72,7 +72,8 @@ extern enum swiotlb_force swiotlb_force; * range check to see if the memory was in fact allocated by this * API. * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and - * @end. This is command line adjustable via setup_io_tlb_npages. + * @end. For default swiotlb, this is command line adjustable via + * setup_io_tlb_npages. * @used: The number of used IO TLB block. * @list: The free list describing the number of free entries available * from each index. 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 57a9adb920bf..ffbb8724e06c 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -39,6 +39,13 @@ #ifdef CONFIG_DEBUG_FS #include #endif +#ifdef CONFIG_DMA_RESTRICTED_POOL +#include +#include +#include +#include +#include +#endif #include #include @@ -681,3 +688,76 @@ static int __init swiotlb_create_default_debugfs(void) late_initcall(swiotlb_create_default_debugfs); #endif + +#ifdef CONFIG_DMA_RESTRICTED_POOL +static int rmem_swiotlb_device_init(struct reserved_mem *rmem, + struct device *dev) +{ + struct io_tlb_mem *mem = rmem->priv; + unsigned long nslabs = rmem->size >> IO_TLB_SHIFT; + + if (dev->dma_io_tlb_mem) + return 0; + + /* Since multiple devices can share the same pool, the private data, +* io_tlb_mem struct, will be initialized by the first device attached +* to it. +*/ + if (!mem) { + mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL); + if (!mem) + return -ENOMEM; +#ifdef CONFIG_ARM + if (!PageHighMem(pfn_to_page(PHYS_PFN(rmem->base { + kfree(mem); + return -EINVAL; + } +#endif /* CONFIG_ARM */ + swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false); + + rmem->priv = mem; + } + +#ifdef CONFIG_DEBUG_FS + if (!io_tlb_default_mem->debugfs) + io_tlb_default_mem->debugfs = + debugfs_create_dir("swiotlb", NULL); + + swiotlb_create_debugfs(mem, rmem->name, io_tlb_default_mem->debugfs); +#endif /* CONFIG_DEBUG_FS */ + + dev->dma_io_tlb_mem = mem; + + return 0; +} + +static void rmem_swiotlb_device_release(struct reserved_mem *rmem, + struct device *dev) +{ + if (dev) + dev->dma_io_tlb_mem = NULL; +} + +static const struct reserved_mem_ops rmem_swiotlb_ops = { + .device_init = rmem_swiotlb_device_init, + .device_release = rmem_swiotlb_device_release, +}; + +static int __init rmem_swiotlb_setup(struct reserved_mem *rmem) +{ + unsigned long node = rmem->fdt_node; + + if (of_get_flat_dt_prop(node, "reusable", NULL) || + of_get_flat_dt_prop(node, "linux,cma-default", NULL) || + 
of_get_flat_dt_prop(node, "linux,dma-default", NULL) || + of_get_flat_dt_prop(node, "no-map", NULL)) + return -EINVAL; + + rmem->ops = _swiotlb_ops; + pr_info("Reserved memory: created device swiotlb memory pool at %pa, size %ld MiB\n", + >base, (unsigned long)rmem->size / SZ_1M); + return 0; +} + +RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup); +#endif /* CONFIG_DMA_RESTRICTED_POOL */ -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 04/16] swiotlb: Add DMA_RESTRICTED_POOL
Add a new kconfig symbol, DMA_RESTRICTED_POOL, for restricted DMA pool. Signed-off-by: Claire Chang --- kernel/dma/Kconfig | 14 ++ 1 file changed, 14 insertions(+) diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig index 77b405508743..3e961dc39634 100644 --- a/kernel/dma/Kconfig +++ b/kernel/dma/Kconfig @@ -80,6 +80,20 @@ config SWIOTLB bool select NEED_DMA_MAP_STATE +config DMA_RESTRICTED_POOL + bool "DMA Restricted Pool" + depends on OF && OF_RESERVED_MEM + select SWIOTLB + help + This enables support for restricted DMA pools which provide a level of + DMA memory protection on systems with limited hardware protection + capabilities, such as those lacking an IOMMU. + + For more information see + + and . + If unsure, say "n". + # # Should be selected if we can mmap non-coherent mappings to userspace. # The only thing that is really required is a way to set an uncached bit -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 03/16] swiotlb: Refactor swiotlb_create_debugfs
Split the debugfs creation to make the code reusable for supporting different bounce buffer pools, e.g. restricted DMA pool. Signed-off-by: Claire Chang --- kernel/dma/swiotlb.c | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 3f1adee35097..57a9adb920bf 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -660,18 +660,24 @@ EXPORT_SYMBOL_GPL(is_swiotlb_active); #ifdef CONFIG_DEBUG_FS -static int __init swiotlb_create_debugfs(void) +static void swiotlb_create_debugfs(struct io_tlb_mem *mem, const char *name, + struct dentry *node) { - struct io_tlb_mem *mem = io_tlb_default_mem; - if (!mem) - return 0; - mem->debugfs = debugfs_create_dir("swiotlb", NULL); + return; + + mem->debugfs = debugfs_create_dir(name, node); debugfs_create_ulong("io_tlb_nslabs", 0400, mem->debugfs, >nslabs); debugfs_create_ulong("io_tlb_used", 0400, mem->debugfs, >used); +} + +static int __init swiotlb_create_default_debugfs(void) +{ + swiotlb_create_debugfs(io_tlb_default_mem, "swiotlb", NULL); + return 0; } -late_initcall(swiotlb_create_debugfs); +late_initcall(swiotlb_create_default_debugfs); #endif -- 2.31.1.368.gbe11c130af-goog
[PATCH v5 02/16] swiotlb: Refactor swiotlb init functions
Add a new function, swiotlb_init_io_tlb_mem, for the io_tlb_mem struct initialization to make the code reusable. Note that we now also call set_memory_decrypted in swiotlb_init_with_tbl. Signed-off-by: Claire Chang --- kernel/dma/swiotlb.c | 51 ++-- 1 file changed, 25 insertions(+), 26 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 8635a57f88e9..3f1adee35097 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -166,9 +166,30 @@ void __init swiotlb_update_mem_attributes(void) memset(vaddr, 0, bytes); } -int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) +static void swiotlb_init_io_tlb_mem(struct io_tlb_mem *mem, phys_addr_t start, + unsigned long nslabs, bool late_alloc) { + void *vaddr = phys_to_virt(start); unsigned long bytes = nslabs << IO_TLB_SHIFT, i; + + mem->nslabs = nslabs; + mem->start = start; + mem->end = mem->start + bytes; + mem->index = 0; + mem->late_alloc = late_alloc; + spin_lock_init(>lock); + for (i = 0; i < mem->nslabs; i++) { + mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i); + mem->slots[i].orig_addr = INVALID_PHYS_ADDR; + mem->slots[i].alloc_size = 0; + } + + set_memory_decrypted((unsigned long)vaddr, bytes >> PAGE_SHIFT); + memset(vaddr, 0, bytes); +} + +int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) +{ struct io_tlb_mem *mem; size_t alloc_size; @@ -184,16 +205,8 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) if (!mem) panic("%s: Failed to allocate %zu bytes align=0x%lx\n", __func__, alloc_size, PAGE_SIZE); - mem->nslabs = nslabs; - mem->start = __pa(tlb); - mem->end = mem->start + bytes; - mem->index = 0; - spin_lock_init(>lock); - for (i = 0; i < mem->nslabs; i++) { - mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i); - mem->slots[i].orig_addr = INVALID_PHYS_ADDR; - mem->slots[i].alloc_size = 0; - } + + swiotlb_init_io_tlb_mem(mem, __pa(tlb), nslabs, false); io_tlb_default_mem = mem; if 
(verbose) @@ -280,7 +293,6 @@ swiotlb_late_init_with_default_size(size_t default_size) int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs) { - unsigned long bytes = nslabs << IO_TLB_SHIFT, i; struct io_tlb_mem *mem; if (swiotlb_force == SWIOTLB_NO_FORCE) @@ -295,20 +307,7 @@ swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs) if (!mem) return -ENOMEM; - mem->nslabs = nslabs; - mem->start = virt_to_phys(tlb); - mem->end = mem->start + bytes; - mem->index = 0; - mem->late_alloc = 1; - spin_lock_init(>lock); - for (i = 0; i < mem->nslabs; i++) { - mem->slots[i].list = IO_TLB_SEGSIZE - io_tlb_offset(i); - mem->slots[i].orig_addr = INVALID_PHYS_ADDR; - mem->slots[i].alloc_size = 0; - } - - set_memory_decrypted((unsigned long)tlb, bytes >> PAGE_SHIFT); - memset(tlb, 0, bytes); + swiotlb_init_io_tlb_mem(mem, virt_to_phys(tlb), nslabs, true); io_tlb_default_mem = mem; swiotlb_print_info(); -- I'm not 100% sure if set_memory_decrypted is safe to call in swiotlb_init_with_tbl, but I didn't hit any issue booting with this. 2.31.1.368.gbe11c130af-goog
[PATCH v5 01/16] swiotlb: Fix the type of index
Fix the type of index from unsigned int to int since find_slots() might return -1. Fixes: 0774983bc923 ("swiotlb: refactor swiotlb_tbl_map_single") Signed-off-by: Claire Chang --- kernel/dma/swiotlb.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 0a5b6f7e75bc..8635a57f88e9 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -499,7 +499,8 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr, { struct io_tlb_mem *mem = io_tlb_default_mem; unsigned int offset = swiotlb_align_offset(dev, orig_addr); - unsigned int index, i; + unsigned int i; + int index; phys_addr_t tlb_addr; if (!mem) -- 2.31.1.368.gbe11c130af-goog
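The one-character nature of this fix hides why it matters: find_slots() signals failure with -1, and a comparison of an unsigned variable against a negative constant is always false, so the error path can never be taken. A minimal standalone sketch (not kernel code) of the before/after check:

```c
#include <assert.h>

/* Bug model: find_slots() reports "no free slot" as -1. Stored in an
 * unsigned int, -1 wraps to UINT_MAX, and in "index < 0" the 0 is
 * converted to unsigned too, so the failure check can never fire. */
static int failure_detected_unsigned(void)
{
    unsigned int index = (unsigned int)-1;  /* find_slots() failed */
    return index < 0;                       /* always 0 (false) */
}

/* With a signed int, as in the patch, the failure is caught. */
static int failure_detected_signed(void)
{
    int index = -1;                         /* find_slots() failed */
    return index < 0;                       /* 1 (true) */
}
```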
[PATCH v5 00/16] Restricted DMA
This series implements mitigations for lack of DMA access control on systems without an IOMMU, which could result in the DMA accessing the system memory at unexpected times and/or unexpected addresses, possibly leading to data leakage or corruption. For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is not behind an IOMMU. As PCI-e, by design, gives the device full access to system memory, a vulnerability in the Wi-Fi firmware could easily escalate to a full system exploit (remote wifi exploits: [1a], [1b] that shows a full chain of exploits; [2], [3]). To mitigate the security concerns, we introduce restricted DMA. Restricted DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a specially allocated region and does memory allocation from the same region. The feature on its own provides a basic level of protection against the DMA overwriting buffer contents at unexpected times. However, to protect against general data leakage and system memory corruption, the system needs to provide a way to restrict the DMA to a predefined memory region (this is usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]). [1a] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html [1b] https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html [2] https://blade.tencent.com/en/advisories/qualpwn/ [3] https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/ [4] https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132 v5: Rebase on latest linux-next v4: - Fix spinlock bad magic - Use rmem->name for debugfs entry - Address the comments in v3 v3: Using only one reserved memory region for both streaming DMA and memory allocation. https://lore.kernel.org/patchwork/cover/1360992/ v2: Building on top of swiotlb. 
https://lore.kernel.org/patchwork/cover/1280705/ v1: Using dma_map_ops. https://lore.kernel.org/patchwork/cover/1271660/ Claire Chang (16): swiotlb: Fix the type of index swiotlb: Refactor swiotlb init functions swiotlb: Refactor swiotlb_create_debugfs swiotlb: Add DMA_RESTRICTED_POOL swiotlb: Add restricted DMA pool initialization swiotlb: Add a new get_io_tlb_mem getter swiotlb: Update is_swiotlb_buffer to add a struct device argument swiotlb: Update is_swiotlb_active to add a struct device argument swiotlb: Bounce data from/to restricted DMA pool if available swiotlb: Move alloc_size to find_slots swiotlb: Refactor swiotlb_tbl_unmap_single dma-direct: Add a new wrapper __dma_direct_free_pages() swiotlb: Add restricted DMA alloc/free support. dma-direct: Allocate memory from restricted DMA pool if available dt-bindings: of: Add restricted DMA pool of: Add plumbing for restricted DMA pool .../reserved-memory/reserved-memory.txt | 24 ++ drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +- drivers/gpu/drm/nouveau/nouveau_ttm.c | 2 +- drivers/iommu/dma-iommu.c | 12 +- drivers/of/address.c | 25 ++ drivers/of/device.c | 3 + drivers/of/of_private.h | 5 + drivers/pci/xen-pcifront.c| 2 +- drivers/xen/swiotlb-xen.c | 2 +- include/linux/device.h| 4 + include/linux/swiotlb.h | 41 ++- kernel/dma/Kconfig| 14 + kernel/dma/direct.c | 57 +++-- kernel/dma/direct.h | 9 +- kernel/dma/swiotlb.c | 242 +- 15 files changed, 347 insertions(+), 97 deletions(-) -- 2.31.1.368.gbe11c130af-goog
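For readers wondering what the last two patches (dt-bindings and OF plumbing) wire up: a restricted pool is declared as a reserved-memory node and referenced from the consuming device. The sketch below is illustrative only — node names and addresses are made up; the `restricted-dma-pool` compatible string is the one this series registers via RESERVEDMEM_OF_DECLARE:

```dts
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	/* Pool the Wi-Fi device is restricted to; base/size are examples. */
	wifi_restricted_dma: restricted-dma@50000000 {
		compatible = "restricted-dma-pool";
		reg = <0x0 0x50000000 0x0 0x4000000>;
	};
};

pcie_wifi {
	memory-region = <&wifi_restricted_dma>;
};
```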
RE: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()
From: Daniel Axtens > Sent: 22 April 2021 03:21 > > > Hi Lakshmi, > > > >> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote: > >> > >> Sorry - missed copying device-tree and powerpc mailing lists. > >> > >>> There are a few "goto out;" statements before the local variable "fdt" > >>> is initialized through the call to of_kexec_alloc_and_setup_fdt() in > >>> elf64_load(). This will result in an uninitialized "fdt" being passed > >>> to kvfree() in this function if there is an error before the call to > >>> of_kexec_alloc_and_setup_fdt(). > >>> > >>> Initialize the local variable "fdt" to NULL. > >>> > > I'm a huge fan of initialising local variables! But I'm struggling to > > find the code path that will lead to an uninit fdt being returned... > > OK, so perhaps this was putting it too strongly. I have been bitten > by uninitialised things enough in C that I may have taken a slightly > overly-aggressive view of fixing them in the source rather than the > compiler. I do think compiler-level mitigations are better, and I take > the point that we don't want to defeat compiler checking. > > (Does anyone - and by anyone I mean any large distro - compile with > local variables inited by the compiler?) There are compilers that initialise locals to zero for 'debug' builds and leave them 'random' for optimised 'release' builds. Let's not test what we are releasing! I also think there is a new option to gcc (or clang?) to initialise on-stack structures and arrays to ensure garbage isn't passed. That seems to be a horrid performance hit! Especially in userspace where large stack allocations are almost free. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
[PATCH 6/7] swiotlb: replace default_nslabs with a byte value
Replace the default_nslabs variable with one that stores the size in bytes as that is what all the users actually expect. Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 93737d0932fbf2..87d06ddf4753f3 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -71,7 +71,7 @@ struct io_tlb_mem *io_tlb_default_mem; */ static unsigned int max_segment; -static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT; +static unsigned long swiotlb_default_size = IO_TLB_DEFAULT_SIZE; static int __init setup_io_tlb_npages(char *str) @@ -106,7 +106,7 @@ void swiotlb_set_max_segment(unsigned int val) unsigned long swiotlb_size_or_default(void) { - return default_nslabs << IO_TLB_SHIFT; + return swiotlb_default_size; } void __init swiotlb_adjust_size(unsigned long size) @@ -116,9 +116,9 @@ void __init swiotlb_adjust_size(unsigned long size) * architectures such as those supporting memory encryption to * adjust/expand SWIOTLB size for their use. */ - size = ALIGN(size, IO_TLB_SIZE); - default_nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE); - pr_info("SWIOTLB bounce buffer size adjusted to %luMB", size >> 20); + swiotlb_default_size = ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT); + pr_info("SWIOTLB bounce buffer size adjusted to %luMB", + swiotlb_default_size >> 20); } void swiotlb_print_info(void) -- 2.30.1
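The new alignment expression rounds the byte size up to a whole swiotlb segment rather than aligning a slab count. A standalone sketch of the arithmetic, assuming IO_TLB_SHIFT = 11 (2 KiB slabs) and IO_TLB_SEGSIZE = 128 slabs per segment — the values the kernel used at the time — so a segment is 256 KiB:

```c
#include <assert.h>

#define IO_TLB_SHIFT   11   /* 2 KiB slabs (assumed kernel value) */
#define IO_TLB_SEGSIZE 128  /* slabs per segment (assumed kernel value) */

/* Simplified version of the kernel's ALIGN(): round x up to a power-of-two a. */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((unsigned long)(a) - 1))

/* What the patch computes: size rounded up to a whole 256 KiB segment. */
static unsigned long adjusted_size(unsigned long size)
{
    return ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT);
}
```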
[PATCH 1/7] swiotlb: pass bytes instead of nslabs to swiotlb_init_with_tbl
Pass the actual allocation size to swiotlb_init_with_tbl, which simplifies things quite a bit. Signed-off-by: Christoph Hellwig --- arch/mips/cavium-octeon/dma-octeon.c | 2 +- arch/powerpc/platforms/pseries/svm.c | 3 +-- drivers/xen/swiotlb-xen.c| 2 +- include/linux/swiotlb.h | 2 +- kernel/dma/swiotlb.c | 10 +- 5 files changed, 9 insertions(+), 10 deletions(-) diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c index df70308db0e697..020b8ce5b8ff7c 100644 --- a/arch/mips/cavium-octeon/dma-octeon.c +++ b/arch/mips/cavium-octeon/dma-octeon.c @@ -245,6 +245,6 @@ void __init plat_swiotlb_setup(void) panic("%s: Failed to allocate %zu bytes align=%lx\n", __func__, swiotlbsize, PAGE_SIZE); - if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlb_nslabs, 1) == -ENOMEM) + if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlbsize, 1) == -ENOMEM) panic("Cannot allocate SWIOTLB buffer"); } diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c index 1d829e257996fb..4d281ff56ce96f 100644 --- a/arch/powerpc/platforms/pseries/svm.c +++ b/arch/powerpc/platforms/pseries/svm.c @@ -52,10 +52,9 @@ void __init svm_swiotlb_init(void) bytes = io_tlb_nslabs << IO_TLB_SHIFT; vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE); - if (vstart && !swiotlb_init_with_tbl(vstart, io_tlb_nslabs, false)) + if (vstart && !swiotlb_init_with_tbl(vstart, bytes, false)) return; - memblock_free_early(__pa(vstart), PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT)); panic("SVM: Cannot allocate SWIOTLB buffer"); diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 4c89afc0df6289..18d79f07b507ce 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -249,7 +249,7 @@ void __init xen_swiotlb_init_early(void) panic("%s (rc:%d)", xen_swiotlb_error(XEN_SWIOTLB_EFIXUP), rc); } - if (swiotlb_init_with_tbl(start, nslabs, false)) + if (swiotlb_init_with_tbl(start, bytes, false)) panic("Cannot allocate SWIOTLB buffer"); 
swiotlb_set_max_segment(PAGE_SIZE); } diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 216854a5e5134b..d1d40ca5014b54 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -36,7 +36,7 @@ enum swiotlb_force { #define IO_TLB_DEFAULT_SIZE (64UL<<20) extern void swiotlb_init(int verbose); -int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose); +int swiotlb_init_with_tbl(char *tlb, size_t bytes, int verbose); unsigned long swiotlb_size_or_default(void); extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs); extern int swiotlb_late_init_with_default_size(size_t default_size); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 0a5b6f7e75bce6..c7b3dd86db7f56 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -166,9 +166,9 @@ void __init swiotlb_update_mem_attributes(void) memset(vaddr, 0, bytes); } -int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) +int __init swiotlb_init_with_tbl(char *tlb, size_t bytes, int verbose) { - unsigned long bytes = nslabs << IO_TLB_SHIFT, i; + unsigned long nslabs = bytes >> IO_TLB_SHIFT, i; struct io_tlb_mem *mem; size_t alloc_size; @@ -209,17 +209,17 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) void __init swiotlb_init(int verbose) { - size_t bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); + size_t bytes = default_nslabs << IO_TLB_SHIFT; void *tlb; if (swiotlb_force == SWIOTLB_NO_FORCE) return; /* Get IO TLB memory from the low pages */ - tlb = memblock_alloc_low(bytes, PAGE_SIZE); + tlb = memblock_alloc_low(PAGE_ALIGN(bytes), PAGE_SIZE); if (!tlb) goto fail; - if (swiotlb_init_with_tbl(tlb, default_nslabs, verbose)) + if (swiotlb_init_with_tbl(tlb, bytes, verbose)) goto fail_free_mem; return; -- 2.30.1
[PATCH 7/7] swiotlb: don't override the command line in swiotlb_adjust_size
When the user specified an explicit swiotlb size on the command line, the architecture code should not override it. Fixes: 2cbc2776efe4 ("swiotlb: remove swiotlb_nr_tbl") Reported-by: Tom Lendacky Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 87d06ddf4753f3..aef02a3825b494 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -106,7 +106,9 @@ void swiotlb_set_max_segment(unsigned int val) unsigned long swiotlb_size_or_default(void) { - return swiotlb_default_size; + if (swiotlb_default_size) + return swiotlb_default_size; + return IO_TLB_DEFAULT_SIZE; } void __init swiotlb_adjust_size(unsigned long size) @@ -116,6 +118,8 @@ void __init swiotlb_adjust_size(unsigned long size) * architectures such as those supporting memory encryption to * adjust/expand SWIOTLB size for their use. */ + if (swiotlb_default_size) + return; swiotlb_default_size = ALIGN(size, IO_TLB_SEGSIZE << IO_TLB_SHIFT); pr_info("SWIOTLB bounce buffer size adjusted to %luMB", swiotlb_default_size >> 20); -- 2.30.1
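The intended semantics after this patch are "first writer wins": the command-line parser (which now also stores into swiotlb_default_size via swiotlb_adjust_size after patch 3/7) runs before the architecture code, and the default is only used if neither set a size. A toy model of that policy, standalone and not kernel code:

```c
#include <assert.h>

/* 0 means "no size chosen yet", as in the patched kernel code. */
static unsigned long swiotlb_default_size_model; /* = 0 */

#define IO_TLB_DEFAULT_SIZE_MODEL (64UL << 20)   /* 64 MiB */

static unsigned long size_or_default(void)
{
    if (swiotlb_default_size_model)
        return swiotlb_default_size_model;
    return IO_TLB_DEFAULT_SIZE_MODEL;
}

/* After the patch, an adjustment is a no-op once the command line
 * (or any earlier caller) has already chosen a size. */
static void adjust_size(unsigned long size)
{
    if (swiotlb_default_size_model)
        return;
    swiotlb_default_size_model = size;
}
```

Note this simplified model skips the segment alignment done by the real swiotlb_adjust_size(); it only demonstrates the override ordering Tom's report was about.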
[PATCH 2/7] swiotlb: use swiotlb_size_or_default in swiotlb_init
Use swiotlb_size_or_default to calculate the buffer size instead of open coding it. Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index c7b3dd86db7f56..27461fd63e0330 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -209,7 +209,7 @@ int __init swiotlb_init_with_tbl(char *tlb, size_t bytes, int verbose) void __init swiotlb_init(int verbose) { - size_t bytes = default_nslabs << IO_TLB_SHIFT; + size_t bytes = swiotlb_size_or_default(); void *tlb; if (swiotlb_force == SWIOTLB_NO_FORCE) -- 2.30.1
[PATCH 4/7] powerpc/pseries: simplify svm_swiotlb_init
The value returned by swiotlb_size_or_default is always properly aligned now, so don't duplicate the work. Signed-off-by: Christoph Hellwig --- arch/powerpc/platforms/pseries/svm.c | 10 ++ 1 file changed, 2 insertions(+), 8 deletions(-) diff --git a/arch/powerpc/platforms/pseries/svm.c b/arch/powerpc/platforms/pseries/svm.c index 4d281ff56ce96f..9187d2a1ed568d 100644 --- a/arch/powerpc/platforms/pseries/svm.c +++ b/arch/powerpc/platforms/pseries/svm.c @@ -43,20 +43,14 @@ machine_early_initcall(pseries, init_svm); */ void __init svm_swiotlb_init(void) { + unsigned long bytes = swiotlb_size_or_default(); unsigned char *vstart; - unsigned long bytes, io_tlb_nslabs; - - io_tlb_nslabs = (swiotlb_size_or_default() >> IO_TLB_SHIFT); - io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE); - - bytes = io_tlb_nslabs << IO_TLB_SHIFT; vstart = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE); if (vstart && !swiotlb_init_with_tbl(vstart, bytes, false)) return; - memblock_free_early(__pa(vstart), - PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT)); + memblock_free_early(__pa(vstart), PAGE_ALIGN(bytes)); panic("SVM: Cannot allocate SWIOTLB buffer"); } -- 2.30.1
cleanup and fix swiotlb sizing
Hi all, based on a report from Tom that overriding the default sizing provided by the x86 SEV code on the command line doesn't work anymore, this series cleans up how we handle default and command line sizes for the swiotlb buffer and then fixes the recently introduced bug in a straight-forward way. Diffstat: arch/mips/cavium-octeon/dma-octeon.c | 16 +-- arch/mips/include/asm/octeon/pci-octeon.h |1 arch/mips/pci/pci-octeon.c|2 - arch/powerpc/platforms/pseries/svm.c | 13 ++-- drivers/xen/swiotlb-xen.c |2 - include/linux/swiotlb.h |2 - kernel/dma/swiotlb.c | 32 +++--- 7 files changed, 25 insertions(+), 43 deletions(-)
[PATCH 3/7] swiotlb: use swiotlb_adjust_size in setup_io_tlb_npages
Use the proper helper to do the proper alignment of the buffer size to the requirements of the swiotlb allocator instead of open coding the logic. Signed-off-by: Christoph Hellwig --- kernel/dma/swiotlb.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 27461fd63e0330..93737d0932fbf2 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -76,11 +76,9 @@ static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT; static int __init setup_io_tlb_npages(char *str) { - if (isdigit(*str)) { - /* avoid tail segment of size < IO_TLB_SEGSIZE */ - default_nslabs = - ALIGN(simple_strtoul(str, , 0), IO_TLB_SEGSIZE); - } + if (isdigit(*str)) + swiotlb_adjust_size( + simple_strtoul(str, , 0) << IO_TLB_SHIFT); if (*str == ',') ++str; if (!strcmp(str, "force")) -- 2.30.1
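The `swiotlb=` parameter still takes a slab count, so the patch converts it to bytes (`<< IO_TLB_SHIFT`) before calling the byte-based helper. A standalone toy parser sketching that conversion — hedged: this uses userspace strtoul in place of the kernel's simple_strtoul, and assumes IO_TLB_SHIFT = 11:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

#define IO_TLB_SHIFT 11   /* 2 KiB slabs (assumed kernel value) */

/* Toy model of setup_io_tlb_npages() parsing "swiotlb=<nslabs>[,force]":
 * the slab count from the command line becomes a byte size, which is
 * what would be handed to swiotlb_adjust_size() after this patch. */
static unsigned long parse_io_tlb_npages(const char *str, int *force)
{
    unsigned long bytes = 0;
    char *end = (char *)str;

    if (isdigit((unsigned char)*str))
        bytes = strtoul(str, &end, 0) << IO_TLB_SHIFT;
    if (*end == ',')
        ++end;
    *force = (strcmp(end, "force") == 0);
    return bytes;
}
```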
[PATCH 5/7] MIPS/octeon: simplify swiotlb initialization
Just use swiotlb_adjust_size and swiotlb_init to initialize swiotlb instead of doing a lot of manual work.

Signed-off-by: Christoph Hellwig
---
 arch/mips/cavium-octeon/dma-octeon.c      | 16 ++--
 arch/mips/include/asm/octeon/pci-octeon.h |  1 -
 arch/mips/pci/pci-octeon.c                |  2 +-
 3 files changed, 3 insertions(+), 16 deletions(-)

diff --git a/arch/mips/cavium-octeon/dma-octeon.c b/arch/mips/cavium-octeon/dma-octeon.c
index 020b8ce5b8ff7c..6bc9ef5e3790ec 100644
--- a/arch/mips/cavium-octeon/dma-octeon.c
+++ b/arch/mips/cavium-octeon/dma-octeon.c
@@ -186,15 +186,12 @@ phys_addr_t dma_to_phys(struct device *dev, dma_addr_t daddr)
 	return daddr;
 }
 
-char *octeon_swiotlb;
-
 void __init plat_swiotlb_setup(void)
 {
 	phys_addr_t start, end;
 	phys_addr_t max_addr;
 	phys_addr_t addr_size;
 	size_t swiotlbsize;
-	unsigned long swiotlb_nslabs;
 	u64 i;
 
 	max_addr = 0;
@@ -236,15 +233,6 @@ void __init plat_swiotlb_setup(void)
 	if (OCTEON_IS_OCTEON2() && max_addr >= 0x1ul)
 		swiotlbsize = 64 * (1<<20);
 #endif
-	swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
-	swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
-	swiotlbsize = swiotlb_nslabs << IO_TLB_SHIFT;
-
-	octeon_swiotlb = memblock_alloc_low(swiotlbsize, PAGE_SIZE);
-	if (!octeon_swiotlb)
-		panic("%s: Failed to allocate %zu bytes align=%lx\n",
-		      __func__, swiotlbsize, PAGE_SIZE);
-
-	if (swiotlb_init_with_tbl(octeon_swiotlb, swiotlbsize, 1) == -ENOMEM)
-		panic("Cannot allocate SWIOTLB buffer");
+	swiotlb_adjust_size(swiotlbsize);
+	swiotlb_init(false);
 }
diff --git a/arch/mips/include/asm/octeon/pci-octeon.h b/arch/mips/include/asm/octeon/pci-octeon.h
index b12d9a3fbfb6c0..a2f20a44fb6143 100644
--- a/arch/mips/include/asm/octeon/pci-octeon.h
+++ b/arch/mips/include/asm/octeon/pci-octeon.h
@@ -64,6 +64,5 @@ enum octeon_dma_bar_type {
 extern enum octeon_dma_bar_type octeon_dma_bar_type;
 
 void octeon_pci_dma_init(void);
-extern char *octeon_swiotlb;
 
 #endif
diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
index fc29b85cfa926d..ff26cd9dc083f6 100644
--- a/arch/mips/pci/pci-octeon.c
+++ b/arch/mips/pci/pci-octeon.c
@@ -664,7 +664,7 @@ static int __init octeon_pci_setup(void)
 
 	/* BAR1 movable regions contiguous to cover the swiotlb */
 	octeon_bar1_pci_phys =
-		virt_to_phys(octeon_swiotlb) & ~((1ull << 22) - 1);
+		io_tlb_default_mem->start & ~((1ull << 22) - 1);
 
 	for (index = 0; index < 32; index++) {
 		union cvmx_pci_bar1_indexx bar1_index;
-- 
2.30.1
Re: [PATCH v3 00/11] DDW + Indirect Mapping
Changes since v2: - Some patches got removed from the series and sent by themselves, - New tbl created for DDW + indirect mapping reserves MMIO32 space, - Improved reserved area algorithm, - Improved commit messages, - Removed define for default DMA window prop name, - Avoided some unnecessary renaming, - Removed some unnecessary empty lines, - Changed some code moving to forward declarations. v2 Link: http://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=201210=%2A=both On Thu, 2021-04-22 at 04:07 -0300, Leonardo Bras wrote: > So far it's assumed possible to map the guest RAM 1:1 to the bus, which > works with a small number of devices. SRIOV changes it as the user can > configure hundreds VFs and since phyp preallocates TCEs and does not > allow IOMMU pages bigger than 64K, it has to limit the number of TCEs > per a PE to limit waste of physical pages. > > As of today, if the assumed direct mapping is not possible, DDW creation > is skipped and the default DMA window "ibm,dma-window" is used instead. > > Using the DDW instead of the default DMA window may allow to expand the > amount of memory that can be DMA-mapped, given the number of pages (TCEs) > may stay the same (or increase) and the default DMA window offers only > 4k-pages while DDW may offer larger pages (4k, 64k, 16M ...). > > Patch #1 replaces hard-coded 4K page size with a variable containing the > correct page size for the window. > > Patch #2 introduces iommu_table_in_use(), and replace manual bit-field > checking where it's used. It will be used for aborting enable_ddw() if > there is any current iommu allocation and we are trying single window > indirect mapping. > > Patch #3 introduces iommu_pseries_alloc_table() that will be helpful > when indirect mapping needs to replace the iommu_table. > > Patch #4 adds helpers for adding DDWs in the list. > > Patch #5 refactors enable_ddw() so it returns if direct mapping is > possible, instead of DMA offset. 
It helps for next patches on > indirect DMA mapping and also allows DMA windows starting at 0x00. > > Patch #6 bring new helper to simplify enable_ddw(), allowing > some reorganization for introducing indirect mapping DDW. > > Patch #7 adds new helper _iommu_table_setparms() and use it in other > *setparams*() to fill iommu_table. It will also be used for creating a > new iommu_table for indirect mapping. > > Patch #8 updates remove_dma_window() to accept different property names, > so we can introduce a new property for indirect mapping. > > Patch #9 extracts find_existing_ddw_windows() into > find_existing_ddw_windows_named(), and calls it by it's property name. > This will be useful when the property for indirect mapping is created, > so we can search the device-tree for both properties. > > Patch #10: > Instead of destroying the created DDW if it doesn't map the whole > partition, make use of it instead of the default DMA window as it improves > performance. Also, update the iommu_table and re-generate the pools. > It introduces a new property name for DDW with indirect DMA mapping. > > Patch #11: > Does some renaming of 'direct window' to 'dma window', given the DDW > created can now be also used in indirect mapping if direct mapping is not > available. > > All patches were tested into an LPAR with an virtio-net interface that > allows default DMA window and DDW to coexist. 
> > Leonardo Bras (11): > powerpc/pseries/iommu: Replace hard-coded page shift > powerpc/kernel/iommu: Add new iommu_table_in_use() helper > powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper > powerpc/pseries/iommu: Add ddw_list_new_entry() helper > powerpc/pseries/iommu: Allow DDW windows starting at 0x00 > powerpc/pseries/iommu: Add ddw_property_create() and refactor > enable_ddw() > powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new > helper > powerpc/pseries/iommu: Update remove_dma_window() to accept property > name > powerpc/pseries/iommu: Find existing DDW with given property name > powerpc/pseries/iommu: Make use of DDW for indirect mapping > powerpc/pseries/iommu: Rename "direct window" to "dma window" > > arch/powerpc/include/asm/iommu.h | 1 + > arch/powerpc/include/asm/tce.h | 8 - > arch/powerpc/kernel/iommu.c| 65 ++-- > arch/powerpc/platforms/pseries/iommu.c | 504 +++-- > 4 files changed, 338 insertions(+), 240 deletions(-) >
[PATCH v3 11/11] powerpc/pseries/iommu: Rename "direct window" to "dma window"
A previous change introduced the usage of DDW as a bigger indirect DMA mapping when the DDW available size does not map the whole partition. As most of the code that manipulates direct mappings was reused for indirect mappings, it's necessary to rename all names and debug/info messages to reflect that it can be used for both kinds of mapping. This should cause no behavioural change, just adjust naming. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 93 +- 1 file changed, 48 insertions(+), 45 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 50909cbc73f6..f5d0a6f012da 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -355,7 +355,7 @@ struct dynamic_dma_window_prop { __be32 window_shift; /* ilog2(tce_window_size) */ }; -struct direct_window { +struct dma_win { struct device_node *device; const struct dynamic_dma_window_prop *prop; struct list_head list; @@ -375,11 +375,11 @@ struct ddw_create_response { u32 addr_lo; }; -static LIST_HEAD(direct_window_list); +static LIST_HEAD(dma_win_list); /* prevents races between memory on/offline and window creation */ -static DEFINE_SPINLOCK(direct_window_list_lock); +static DEFINE_SPINLOCK(dma_win_list_lock); /* protects initializing window twice for same device */ -static DEFINE_MUTEX(direct_window_init_mutex); +static DEFINE_MUTEX(dma_win_init_mutex); #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" #define DMA64_PROPNAME "linux,dma64-ddr-window-info" @@ -712,7 +712,10 @@ static void pci_dma_bus_setup_pSeriesLP(struct pci_bus *bus) pr_debug("pci_dma_bus_setup_pSeriesLP: setting up bus %pOF\n", dn); - /* Find nearest ibm,dma-window, walking up the device tree */ + /* +* Find nearest ibm,dma-window (default DMA window), walking up the +* device tree +*/ for (pdn = dn; pdn != NULL; pdn = pdn->parent) { dma_window = of_get_property(pdn, "ibm,dma-window", NULL); if (dma_window != NULL) 
@@ -816,11 +819,11 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail, ret = rtas_call(ddw_avail[DDW_REMOVE_PE_DMA_WIN], 1, 1, NULL, liobn); if (ret) - pr_warn("%pOF: failed to remove direct window: rtas returned " + pr_warn("%pOF: failed to remove DMA window: rtas returned " "%d to ibm,remove-pe-dma-window(%x) %llx\n", np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); else - pr_debug("%pOF: successfully removed direct window: rtas returned " + pr_debug("%pOF: successfully removed DMA window: rtas returned " "%d to ibm,remove-pe-dma-window(%x) %llx\n", np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); } @@ -848,37 +851,37 @@ static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_ ret = of_remove_property(np, win); if (ret) - pr_warn("%pOF: failed to remove direct window property: %d\n", + pr_warn("%pOF: failed to remove DMA window property: %d\n", np, ret); return 0; } static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift) { - struct direct_window *window; - const struct dynamic_dma_window_prop *direct64; + struct dma_win *window; + const struct dynamic_dma_window_prop *dma64; bool found = false; - spin_lock(_window_list_lock); + spin_lock(_win_list_lock); /* check if we already created a window and dupe that config if so */ - list_for_each_entry(window, _window_list, list) { + list_for_each_entry(window, _win_list, list) { if (window->device == pdn) { - direct64 = window->prop; - *dma_addr = be64_to_cpu(direct64->dma_base); - *window_shift = be32_to_cpu(direct64->window_shift); + dma64 = window->prop; + *dma_addr = be64_to_cpu(dma64->dma_base); + *window_shift = be32_to_cpu(dma64->window_shift); found = true; break; } } - spin_unlock(_window_list_lock); + spin_unlock(_win_list_lock); return found; } -static struct direct_window *ddw_list_new_entry(struct device_node *pdn, - const struct dynamic_dma_window_prop *dma64) +static struct dma_win *ddw_list_new_entry(struct device_node 
*pdn, + const struct dynamic_dma_window_prop *dma64) { - struct direct_window *window; + struct dma_win *window; window =
[PATCH v3 10/11] powerpc/pseries/iommu: Make use of DDW for indirect mapping
So far it's assumed possible to map the guest RAM 1:1 to the bus, which works with a small number of devices. SRIOV changes it as the user can configure hundreds of VFs, and since phyp preallocates TCEs and does not allow IOMMU pages bigger than 64K, it has to limit the number of TCEs per PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation is skipped and the default DMA window "ibm,dma-window" is used instead.

By using DDW, indirect mapping can get more TCEs than are available for the default DMA window, and also gain access to much larger page sizes (16MB as implemented in qemu vs 4K from the default DMA window), causing a significant increase in the maximum amount of memory that can be IOMMU-mapped at the same time.

Indirect mapping will only be used if direct mapping is not a possibility.

For indirect mapping, it's necessary to re-create the iommu_table with the new DMA window parameters, so iommu_alloc() can use it.

Removing the default DMA window for using DDW with indirect mapping is only allowed if there is no current IOMMU memory allocated in the iommu_table; enable_ddw() is aborted otherwise.

Even though there won't be both direct and indirect mappings at the same time, we can't reuse the DIRECT64_PROPNAME property name, or else an older kexec()ed kernel can assume direct mapping and skip iommu_alloc(), causing undesirable behavior. So a new property name, DMA64_PROPNAME "linux,dma64-ddr-window-info", was created to represent a DDW that does not allow direct mapping.
Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 87 +- 1 file changed, 72 insertions(+), 15 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 3367233a5535..50909cbc73f6 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -53,6 +53,7 @@ enum { DDW_EXT_QUERY_OUT_SIZE = 2 }; +static phys_addr_t ddw_memory_hotplug_max(void); #ifdef CONFIG_IOMMU_API static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned long *tce, enum dma_data_direction *direction, bool realmode); @@ -380,6 +381,7 @@ static DEFINE_SPINLOCK(direct_window_list_lock); /* protects initializing window twice for same device */ static DEFINE_MUTEX(direct_window_init_mutex); #define DIRECT64_PROPNAME "linux,direct64-ddr-window-info" +#define DMA64_PROPNAME "linux,dma64-ddr-window-info" static int tce_clearrange_multi_pSeriesLP(unsigned long start_pfn, unsigned long num_pfn, const void *arg) @@ -918,6 +920,7 @@ static int find_existing_ddw_windows(void) return 0; find_existing_ddw_windows_named(DIRECT64_PROPNAME); + find_existing_ddw_windows_named(DMA64_PROPNAME); return 0; } @@ -1207,10 +1210,13 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) struct device_node *dn; u32 ddw_avail[DDW_APPLICABLE_SIZE]; struct direct_window *window; + const char *win_name; struct property *win64 = NULL; struct failed_ddw_pdn *fpdn; - bool default_win_removed = false; + bool default_win_removed = false, direct_mapping = false; bool pmem_present; + struct pci_dn *pci = PCI_DN(pdn); + struct iommu_table *tbl = pci->table_group->tables[0]; dn = of_find_node_by_type(NULL, "ibm,pmemory"); pmem_present = dn != NULL; @@ -1218,8 +1224,12 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) mutex_lock(_window_init_mutex); - if (find_existing_ddw(pdn, >dev.archdata.dma_offset, )) - goto out_unlock; + if (find_existing_ddw(pdn, 
>dev.archdata.dma_offset, )) { + direct_mapping = (len >= max_ram_len); + + mutex_unlock(_window_init_mutex); + return direct_mapping; + } /* * If we already went through this for a previous function of @@ -1298,7 +1308,6 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) goto out_failed; } /* verify the window * number of ptes will map the partition */ - /* check largest block * page size > max memory hotplug addr */ /* * The "ibm,pmemory" can appear anywhere in the address space. * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS @@ -1320,6 +1329,17 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) 1ULL << len, query.largest_available_block, 1ULL << page_shift); + + len = order_base_2(query.largest_available_block << page_shift); + win_name = DMA64_PROPNAME; + } else { + direct_mapping = true; + win_name =
[PATCH v3 09/11] powerpc/pseries/iommu: Find existing DDW with given property name
At the moment pseries stores information about the created directly-mapped DDW window in DIRECT64_PROPNAME.

With the objective of implementing indirect DMA mapping with DDW, it's necessary to have another property name to make sure kexec'ing into older kernels does not break, as it would if we reuse DIRECT64_PROPNAME.

In order to have this, find_existing_ddw_windows() needs to be able to look for different property names.

Extract find_existing_ddw_windows() into find_existing_ddw_windows_named() and call it with the current property name.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/platforms/pseries/iommu.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 075c6e08f012..3367233a5535 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -888,24 +888,21 @@ static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
 	return window;
 }
 
-static int find_existing_ddw_windows(void)
+static void find_existing_ddw_windows_named(const char *name)
 {
 	int len;
 	struct device_node *pdn;
 	struct direct_window *window;
-	const struct dynamic_dma_window_prop *direct64;
-
-	if (!firmware_has_feature(FW_FEATURE_LPAR))
-		return 0;
+	const struct dynamic_dma_window_prop *dma64;
 
-	for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
-		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-		if (!direct64 || len < sizeof(*direct64)) {
-			remove_ddw(pdn, true, DIRECT64_PROPNAME);
+	for_each_node_with_property(pdn, name) {
+		dma64 = of_get_property(pdn, name, &len);
+		if (!dma64 || len < sizeof(*dma64)) {
+			remove_ddw(pdn, true, name);
 			continue;
 		}
 
-		window = ddw_list_new_entry(pdn, direct64);
+		window = ddw_list_new_entry(pdn, dma64);
 		if (!window)
 			break;
 
@@ -913,6 +910,14 @@ static int find_existing_ddw_windows(void)
 		spin_lock(&direct_window_list_lock);
 		list_add(&window->list, &direct_window_list);
 		spin_unlock(&direct_window_list_lock);
 	}
+}
+
+static int find_existing_ddw_windows(void)
+{
+	if (!firmware_has_feature(FW_FEATURE_LPAR))
+		return 0;
+
+	find_existing_ddw_windows_named(DIRECT64_PROPNAME);
 	return 0;
 }
-- 
2.30.2
[PATCH v3 08/11] powerpc/pseries/iommu: Update remove_dma_window() to accept property name
Update remove_dma_window() so it can be used to remove DDW with a given property name. This enables the creation of new property names for DDW, so we can have different usage for it, like indirect mapping. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 0147ccaf0be4..075c6e08f012 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -823,31 +823,32 @@ static void remove_dma_window(struct device_node *np, u32 *ddw_avail, np, ret, ddw_avail[DDW_REMOVE_PE_DMA_WIN], liobn); } -static void remove_ddw(struct device_node *np, bool remove_prop) +static int remove_ddw(struct device_node *np, bool remove_prop, const char *win_name) { struct property *win; u32 ddw_avail[DDW_APPLICABLE_SIZE]; int ret = 0; + win = of_find_property(np, win_name, NULL); + if (!win) + return -EINVAL; + ret = of_property_read_u32_array(np, "ibm,ddw-applicable", _avail[0], DDW_APPLICABLE_SIZE); if (ret) - return; - - win = of_find_property(np, DIRECT64_PROPNAME, NULL); - if (!win) - return; + return 0; if (win->length >= sizeof(struct dynamic_dma_window_prop)) remove_dma_window(np, ddw_avail, win); if (!remove_prop) - return; + return 0; ret = of_remove_property(np, win); if (ret) pr_warn("%pOF: failed to remove direct window property: %d\n", np, ret); + return 0; } static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift) @@ -900,7 +901,7 @@ static int find_existing_ddw_windows(void) for_each_node_with_property(pdn, DIRECT64_PROPNAME) { direct64 = of_get_property(pdn, DIRECT64_PROPNAME, ); if (!direct64 || len < sizeof(*direct64)) { - remove_ddw(pdn, true); + remove_ddw(pdn, true, DIRECT64_PROPNAME); continue; } @@ -1372,7 +1373,7 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) win64 = NULL; out_del_win: - 
remove_ddw(pdn, true); + remove_ddw(pdn, true, DIRECT64_PROPNAME); out_failed: if (default_win_removed) @@ -1536,7 +1537,7 @@ static int iommu_reconfig_notifier(struct notifier_block *nb, unsigned long acti * we have to remove the property when releasing * the device node. */ - remove_ddw(np, false); + remove_ddw(np, false, DIRECT64_PROPNAME); if (pci && pci->table_group) iommu_pseries_free_group(pci->table_group, np->full_name); -- 2.30.2
[PATCH v3 07/11] powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
Add a new helper _iommu_table_setparms(), and use it in iommu_table_setparms() and iommu_table_setparms_lpar() to avoid duplicated code. Also, setting tbl->it_ops was happening outside iommu_table_setparms*(), so move it to the new helper. Since we need the iommu_table_ops to be declared before used, move iommu_table_lpar_multi_ops and iommu_table_pseries_ops to before their respective iommu_table_setparms*(). Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 100 - 1 file changed, 50 insertions(+), 50 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 48c029386d94..0147ccaf0be4 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -53,6 +53,11 @@ enum { DDW_EXT_QUERY_OUT_SIZE = 2 }; +#ifdef CONFIG_IOMMU_API +static int tce_exchange_pseries(struct iommu_table *tbl, long index, unsigned long *tce, + enum dma_data_direction *direction, bool realmode); +#endif + static struct iommu_table *iommu_pseries_alloc_table(int node) { struct iommu_table *tbl; @@ -501,6 +506,28 @@ static int tce_setrange_multi_pSeriesLP_walk(unsigned long start_pfn, return tce_setrange_multi_pSeriesLP(start_pfn, num_pfn, arg); } +static inline void _iommu_table_setparms(struct iommu_table *tbl, unsigned long busno, +unsigned long liobn, unsigned long win_addr, +unsigned long window_size, unsigned long page_shift, +unsigned long base, struct iommu_table_ops *table_ops) +{ + tbl->it_busno = busno; + tbl->it_index = liobn; + tbl->it_offset = win_addr >> page_shift; + tbl->it_size = window_size >> page_shift; + tbl->it_page_shift = page_shift; + tbl->it_base = base; + tbl->it_blocksize = 16; + tbl->it_type = TCE_PCI; + tbl->it_ops = table_ops; +} + +struct iommu_table_ops iommu_table_pseries_ops = { + .set = tce_build_pSeries, + .clear = tce_free_pSeries, + .get = tce_get_pseries +}; + static void iommu_table_setparms(struct pci_controller *phb, struct device_node *dn, 
struct iommu_table *tbl) @@ -509,8 +536,13 @@ static void iommu_table_setparms(struct pci_controller *phb, const unsigned long *basep; const u32 *sizep; - node = phb->dn; + /* Test if we are going over 2GB of DMA space */ + if (phb->dma_window_base_cur + phb->dma_window_size > SZ_2G) { + udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n"); + panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n"); + } + node = phb->dn; basep = of_get_property(node, "linux,tce-base", NULL); sizep = of_get_property(node, "linux,tce-size", NULL); if (basep == NULL || sizep == NULL) { @@ -519,33 +551,25 @@ static void iommu_table_setparms(struct pci_controller *phb, return; } - tbl->it_base = (unsigned long)__va(*basep); + _iommu_table_setparms(tbl, phb->bus->number, 0, phb->dma_window_base_cur, + phb->dma_window_size, IOMMU_PAGE_SHIFT_4K, + (unsigned long)__va(*basep), _table_pseries_ops); if (!is_kdump_kernel()) memset((void *)tbl->it_base, 0, *sizep); - tbl->it_busno = phb->bus->number; - tbl->it_page_shift = IOMMU_PAGE_SHIFT_4K; - - /* Units of tce entries */ - tbl->it_offset = phb->dma_window_base_cur >> tbl->it_page_shift; - - /* Test if we are going over 2GB of DMA space */ - if (phb->dma_window_base_cur + phb->dma_window_size > 0x8000ul) { - udbg_printf("PCI_DMA: Unexpected number of IOAs under this PHB.\n"); - panic("PCI_DMA: Unexpected number of IOAs under this PHB.\n"); - } - phb->dma_window_base_cur += phb->dma_window_size; - - /* Set the tce table size - measured in entries */ - tbl->it_size = phb->dma_window_size >> tbl->it_page_shift; - - tbl->it_index = 0; - tbl->it_blocksize = 16; - tbl->it_type = TCE_PCI; } +struct iommu_table_ops iommu_table_lpar_multi_ops = { + .set = tce_buildmulti_pSeriesLP, +#ifdef CONFIG_IOMMU_API + .xchg_no_kill = tce_exchange_pseries, +#endif + .clear = tce_freemulti_pSeriesLP, + .get = tce_get_pSeriesLP +}; + /* * iommu_table_setparms_lpar * @@ -557,28 +581,17 @@ static void iommu_table_setparms_lpar(struct 
pci_controller *phb, struct iommu_table_group *table_group, const __be32 *dma_window) { - unsigned long offset, size; + unsigned long offset, size, liobn; - of_parse_dma_window(dn,
[PATCH v3 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
Code used to create a ddw property that was previously scattered in enable_ddw() is now gathered in ddw_property_create(), which deals with allocation and filling the property, leaving it ready for of_add_property(), which now occurs in sequence. This created an opportunity to reorganize the second part of enable_ddw(): Without this patch enable_ddw() does, in order: kzalloc() property & members, create_ddw(), fill ddwprop inside property, ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory, of_add_property(), and list_add(). With this patch enable_ddw() does, in order: create_ddw(), ddw_property_create(), of_add_property(), ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory, and list_add(). This change requires of_remove_property() in case anything fails after of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk in all memory, which looks like the most expensive operation, only if everything else succeeds. Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 93 -- 1 file changed, 57 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 955cf095416c..48c029386d94 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn) ret); } +static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr, + u32 page_shift, u32 window_shift) +{ + struct dynamic_dma_window_prop *ddwprop; + struct property *win64; + + win64 = kzalloc(sizeof(*win64), GFP_KERNEL); + if (!win64) + return NULL; + + win64->name = kstrdup(propname, GFP_KERNEL); + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL); + win64->value = ddwprop; + win64->length = sizeof(*ddwprop); + if (!win64->name || !win64->value) { + kfree(win64->name); + kfree(win64->value); + kfree(win64); + return NULL; + } + 
+ ddwprop->liobn = cpu_to_be32(liobn); + ddwprop->dma_base = cpu_to_be64(dma_addr); + ddwprop->tce_shift = cpu_to_be32(page_shift); + ddwprop->window_shift = cpu_to_be32(window_shift); + + return win64; +} + /* Return largest page shift based on "IO Page Sizes" output of ibm,query-pe-dma-window. */ static int iommu_get_page_shift(u32 query_page_size) { @@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) struct ddw_query_response query; struct ddw_create_response create; int page_shift; + u64 win_addr; struct device_node *dn; u32 ddw_avail[DDW_APPLICABLE_SIZE]; struct direct_window *window; struct property *win64 = NULL; - struct dynamic_dma_window_prop *ddwprop; struct failed_ddw_pdn *fpdn; bool default_win_removed = false; bool pmem_present; @@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) 1ULL << page_shift); goto out_failed; } - win64 = kzalloc(sizeof(struct property), GFP_KERNEL); - if (!win64) { - dev_info(>dev, - "couldn't allocate property for 64bit dma window\n"); - goto out_failed; - } - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL); - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL); - win64->length = sizeof(*ddwprop); - if (!win64->name || !win64->value) { - dev_info(>dev, - "couldn't allocate property name and value\n"); - goto out_free_prop; - } ret = create_ddw(dev, ddw_avail, , page_shift, len); if (ret != 0) - goto out_free_prop; - - ddwprop->liobn = cpu_to_be32(create.liobn); - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) | - create.addr_lo); - ddwprop->tce_shift = cpu_to_be32(page_shift); - ddwprop->window_shift = cpu_to_be32(len); + goto out_failed; dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n", create.liobn, dn); - window = ddw_list_new_entry(pdn, ddwprop); + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo; + win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr, + page_shift, len); + if 
(!win64) { + dev_info(>dev, +"couldn't allocate property, property name, or value\n"); + goto out_del_win; + } + + ret = of_add_property(pdn, win64); + if (ret) { + dev_err(>dev,
[PATCH v3 05/11] powerpc/pseries/iommu: Allow DDW windows starting at 0x00
enable_ddw() currently returns the address of the DMA window, which is considered invalid if has the value 0x00. Also, it only considers valid an address returned from find_existing_ddw if it's not 0x00. Changing this behavior makes sense, given the users of enable_ddw() only need to know if direct mapping is possible. It can also allow a DMA window starting at 0x00 to be used. This will be helpful for using a DDW with indirect mapping, as the window address will be different than 0x00, but it will not map the whole partition. Signed-off-by: Leonardo Bras Reviewed-by: Alexey Kardashevskiy --- arch/powerpc/platforms/pseries/iommu.c | 35 -- 1 file changed, 16 insertions(+), 19 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 6f14894d2d04..955cf095416c 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -849,25 +849,26 @@ static void remove_ddw(struct device_node *np, bool remove_prop) np, ret); } -static u64 find_existing_ddw(struct device_node *pdn, int *window_shift) +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr, int *window_shift) { struct direct_window *window; const struct dynamic_dma_window_prop *direct64; - u64 dma_addr = 0; + bool found = false; spin_lock(_window_list_lock); /* check if we already created a window and dupe that config if so */ list_for_each_entry(window, _window_list, list) { if (window->device == pdn) { direct64 = window->prop; - dma_addr = be64_to_cpu(direct64->dma_base); + *dma_addr = be64_to_cpu(direct64->dma_base); *window_shift = be32_to_cpu(direct64->window_shift); + found = true; break; } } spin_unlock(_window_list_lock); - return dma_addr; + return found; } static struct direct_window *ddw_list_new_entry(struct device_node *pdn, @@ -1157,20 +1158,19 @@ static int iommu_get_page_shift(u32 query_page_size) * pdn: the parent pe node with the ibm,dma_window property * Future: also check if we can remap 
the base window for our base page size * - * returns the dma offset for use by the direct mapped DMA code. + * returns true if can map all pages (direct mapping), false otherwise.. */ -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) { int len = 0, ret; int max_ram_len = order_base_2(ddw_memory_hotplug_max()); struct ddw_query_response query; struct ddw_create_response create; int page_shift; - u64 dma_addr; struct device_node *dn; u32 ddw_avail[DDW_APPLICABLE_SIZE]; struct direct_window *window; - struct property *win64; + struct property *win64 = NULL; struct dynamic_dma_window_prop *ddwprop; struct failed_ddw_pdn *fpdn; bool default_win_removed = false; @@ -1182,8 +1182,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) mutex_lock(_window_init_mutex); - dma_addr = find_existing_ddw(pdn, ); - if (dma_addr != 0) + if (find_existing_ddw(pdn, >dev.archdata.dma_offset, )) goto out_unlock; /* @@ -1338,7 +1337,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) list_add(>list, _window_list); spin_unlock(_window_list_lock); - dma_addr = be64_to_cpu(ddwprop->dma_base); + dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base); goto out_unlock; out_free_window: @@ -1351,6 +1350,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) kfree(win64->name); kfree(win64->value); kfree(win64); + win64 = NULL; out_failed: if (default_win_removed) @@ -1370,10 +1370,10 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn) * as RAM, then we failed to create a window to cover persistent * memory and need to set the DMA limit. 
*/ - if (pmem_present && dma_addr && (len == max_ram_len)) - dev->dev.bus_dma_limit = dma_addr + (1ULL << len); + if (pmem_present && win64 && (len == max_ram_len)) + dev->dev.bus_dma_limit = dev->dev.archdata.dma_offset + (1ULL << len); - return dma_addr; + return win64; } static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev) @@ -1452,11 +1452,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct pci_dev *pdev, u64 dma_mask) break; } - if (pdn && PCI_DN(pdn)) { - pdev->dev.archdata.dma_offset = enable_ddw(pdev, pdn); - if
[PATCH v3 04/11] powerpc/pseries/iommu: Add ddw_list_new_entry() helper
There are two functions creating direct_window_list entries in a similar way, so create a ddw_list_new_entry() to avoid duplication and simplify those functions.

Signed-off-by: Leonardo Bras
Reviewed-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 32 +-
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index d02359ca1f9f..6f14894d2d04 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -870,6 +870,21 @@ static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
 	return dma_addr;
 }
 
+static struct direct_window *ddw_list_new_entry(struct device_node *pdn,
+						const struct dynamic_dma_window_prop *dma64)
+{
+	struct direct_window *window;
+
+	window = kzalloc(sizeof(*window), GFP_KERNEL);
+	if (!window)
+		return NULL;
+
+	window->device = pdn;
+	window->prop = dma64;
+
+	return window;
+}
+
 static int find_existing_ddw_windows(void)
 {
 	int len;
@@ -882,18 +897,15 @@ static int find_existing_ddw_windows(void)
 	for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
 		direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
-		if (!direct64)
-			continue;
-
-		window = kzalloc(sizeof(*window), GFP_KERNEL);
-		if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
-			kfree(window);
+		if (!direct64 || len < sizeof(*direct64)) {
 			remove_ddw(pdn, true);
 			continue;
 		}
 
-		window->device = pdn;
-		window->prop = direct64;
+		window = ddw_list_new_entry(pdn, direct64);
+		if (!window)
+			break;
+
 		spin_lock(&direct_window_list_lock);
 		list_add(&window->list, &direct_window_list);
 		spin_unlock(&direct_window_list_lock);
@@ -1303,7 +1315,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
 		  create.liobn, dn);
 
-	window = kzalloc(sizeof(*window), GFP_KERNEL);
+	window = ddw_list_new_entry(pdn, ddwprop);
 	if (!window)
 		goto out_clear_window;
 
@@ -1322,8 +1334,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 		goto out_free_window;
 	}
 
-	window->device = pdn;
-	window->prop = ddwprop;
 	spin_lock(&direct_window_list_lock);
 	list_add(&window->list, &direct_window_list);
 	spin_unlock(&direct_window_list_lock);
-- 
2.30.2
[PATCH v3 03/11] powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
Creates a helper to allow allocating a new iommu_table without the need to reallocate the iommu_group.

This will be helpful for replacing the iommu_table for the new DMA window, after we remove the old one with iommu_tce_table_put().

Signed-off-by: Leonardo Bras
Reviewed-by: Alexey Kardashevskiy
---
 arch/powerpc/platforms/pseries/iommu.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 796ab356341c..d02359ca1f9f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,28 +53,31 @@ enum {
 	DDW_EXT_QUERY_OUT_SIZE = 2
 };
 
-static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+static struct iommu_table *iommu_pseries_alloc_table(int node)
 {
-	struct iommu_table_group *table_group;
 	struct iommu_table *tbl;
 
-	table_group = kzalloc_node(sizeof(struct iommu_table_group), GFP_KERNEL,
-			   node);
-	if (!table_group)
-		return NULL;
-
 	tbl = kzalloc_node(sizeof(struct iommu_table), GFP_KERNEL, node);
 	if (!tbl)
-		goto free_group;
+		return NULL;
 
 	INIT_LIST_HEAD_RCU(&tbl->it_group_list);
 	kref_init(&tbl->it_kref);
+
+	return tbl;
+}
 
-	table_group->tables[0] = tbl;
+static struct iommu_table_group *iommu_pseries_alloc_group(int node)
+{
+	struct iommu_table_group *table_group;
+
+	table_group = kzalloc_node(sizeof(*table_group), GFP_KERNEL, node);
+	if (!table_group)
+		return NULL;
 
-	return table_group;
+	table_group->tables[0] = iommu_pseries_alloc_table(node);
+	if (table_group->tables[0])
+		return table_group;
 
-free_group:
 	kfree(table_group);
 	return NULL;
 }
-- 
2.30.2
[PATCH v3 02/11] powerpc/kernel/iommu: Add new iommu_table_in_use() helper
Having a function to check whether the iommu table has any allocations helps in deciding whether a tbl can be reset for using a new DMA window.

It should be enough to replace all instances of !bitmap_empty(tbl...). iommu_table_in_use() skips reserved memory, so we don't need to worry about releasing it before testing. This makes iommu_table_release_pages() unnecessary, given it was only used to remove reserved memory for testing.

Also, only allow storing reserved memory values in tbl if they are valid in the table, so there is no need to check them in the new helper.

Signed-off-by: Leonardo Bras
---
 arch/powerpc/include/asm/iommu.h | 1 +
 arch/powerpc/kernel/iommu.c | 65
 2 files changed, 34 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index deef7c94d7b6..bf3b84128525 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -154,6 +154,7 @@ extern int iommu_tce_table_put(struct iommu_table *tbl);
  */
 extern struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
 		unsigned long res_start, unsigned long res_end);
+bool iommu_table_in_use(struct iommu_table *tbl);
 
 #define IOMMU_TABLE_GROUP_MAX_TABLES	2
 
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ad82dda81640..5e168bd91401 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -691,32 +691,24 @@ static void iommu_table_reserve_pages(struct iommu_table *tbl,
 	if (tbl->it_offset == 0)
 		set_bit(0, tbl->it_map);
 
-	tbl->it_reserved_start = res_start;
-	tbl->it_reserved_end = res_end;
-
-	/* Check if res_start..res_end isn't empty and overlaps the table */
-	if (res_start && res_end &&
-			(tbl->it_offset + tbl->it_size < res_start ||
-			 res_end < tbl->it_offset))
-		return;
+	if (res_start < tbl->it_offset)
+		res_start = tbl->it_offset;
 
-	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-		set_bit(i - tbl->it_offset, tbl->it_map);
-}
+	if (res_end > (tbl->it_offset + tbl->it_size))
+		res_end = tbl->it_offset + tbl->it_size;
 
-static void iommu_table_release_pages(struct iommu_table *tbl)
-{
-	int i;
+	/* Check if res_start..res_end is a valid range in the table */
+	if (res_start >= res_end) {
+		tbl->it_reserved_start = tbl->it_offset;
+		tbl->it_reserved_end = tbl->it_offset;
+		return;
+	}
 
-	/*
-	 * In case we have reserved the first bit, we should not emit
-	 * the warning below.
-	 */
-	if (tbl->it_offset == 0)
-		clear_bit(0, tbl->it_map);
+	tbl->it_reserved_start = res_start;
+	tbl->it_reserved_end = res_end;
 
 	for (i = tbl->it_reserved_start; i < tbl->it_reserved_end; ++i)
-		clear_bit(i - tbl->it_offset, tbl->it_map);
+		set_bit(i - tbl->it_offset, tbl->it_map);
 }
 
 /*
@@ -781,6 +773,22 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid,
 	return tbl;
 }
 
+bool iommu_table_in_use(struct iommu_table *tbl)
+{
+	unsigned long start = 0, end;
+
+	/* ignore reserved bit0 */
+	if (tbl->it_offset == 0)
+		start = 1;
+	end = tbl->it_reserved_start - tbl->it_offset;
+	if (find_next_bit(tbl->it_map, end, start) != end)
+		return true;
+
+	start = tbl->it_reserved_end - tbl->it_offset;
+	end = tbl->it_size;
+	return find_next_bit(tbl->it_map, end, start) != end;
+}
+
 static void iommu_table_free(struct kref *kref)
 {
 	unsigned long bitmap_sz;
@@ -799,10 +807,8 @@ static void iommu_table_free(struct kref *kref)
 
 	iommu_debugfs_del(tbl);
 
-	iommu_table_release_pages(tbl);
-
 	/* verify that table contains no entries */
-	if (!bitmap_empty(tbl->it_map, tbl->it_size))
+	if (iommu_table_in_use(tbl))
 		pr_warn("%s: Unexpected TCEs\n", __func__);
 
 	/* calculate bitmap size in bytes */
@@ -1108,18 +1114,13 @@ int iommu_take_ownership(struct iommu_table *tbl)
 	for (i = 0; i < tbl->nr_pools; i++)
 		spin_lock(&tbl->pools[i].lock);
 
-	iommu_table_release_pages(tbl);
-
-	if (!bitmap_empty(tbl->it_map, tbl->it_size)) {
+	if (iommu_table_in_use(tbl)) {
 		pr_err("iommu_tce: it_map is not empty");
 		ret = -EBUSY;
-		/* Undo iommu_table_release_pages, i.e. restore bit#0, etc */
-		iommu_table_reserve_pages(tbl, tbl->it_reserved_start,
-				tbl->it_reserved_end);
-	} else {
-		memset(tbl->it_map, 0xff, sz);
 	}
 
+	memset(tbl->it_map, 0xff, sz);
+
 	for (i = 0; i < tbl->nr_pools; i++)
[PATCH v3 01/11] powerpc/pseries/iommu: Replace hard-coded page shift
Some functions assume the IOMMU page size can only be 4K (pageshift == 12). Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5, show an RPN of 52 bits and consider a 12-bit pageshift, so there should be no need of using TCE_RPN_MASK, which masks out any bit after 40 in rpn. Its usage was removed from tce_build_pSeries(), tce_build_pSeriesLP(), and tce_buildmulti_pSeriesLP().

Most places had a tbl struct, so using tbl->it_page_shift was simple. tce_free_pSeriesLP() was a special case, since callers do not always have a tbl struct, so adding a tceshift parameter seems the right thing to do.

Signed-off-by: Leonardo Bras
Reviewed-by: Alexey Kardashevskiy
---
 arch/powerpc/include/asm/tce.h | 8 --
 arch/powerpc/platforms/pseries/iommu.c | 39 +++---
 2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
 #define TCE_VB			0
 #define TCE_PCI			1
 
-/* TCE page size is 4096 bytes (1 << 12) */
-
-#define TCE_SHIFT	12
-#define TCE_PAGE_SIZE	(1 << TCE_SHIFT)
-
 #define TCE_ENTRY_SIZE		8		/* each TCE is 64 bits */
-
-#define TCE_RPN_MASK		0xfful	/* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT	12
 #define TCE_VALID		0x800		/* TCE valid */
 #define TCE_ALLIO		0x400		/* TCE valid for all lpars */
 #define TCE_PCI_WRITE		0x2		/* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 67c9953a6503..796ab356341c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 	u64 proto_tce;
 	__be64 *tcep;
 	u64 rpn;
+	const unsigned long tceshift = tbl->it_page_shift;
+	const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
 
 	proto_tce = TCE_PCI_READ; // Read allowed
 
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
 	while (npages--) {
 		/* can't move this out since we might cross MEMBLOCK boundary */
-		rpn = __pa(uaddr) >> TCE_SHIFT;
-		*tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+		rpn = __pa(uaddr) >> tceshift;
+		*tcep = cpu_to_be64(proto_tce | rpn << tceshift);
 
-		uaddr += TCE_PAGE_SIZE;
+		uaddr += pagesize;
 		tcep++;
 	}
 	return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
 	return be64_to_cpu(*tcep);
 }
 
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);
+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
 static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
 
 static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
 		proto_tce |= TCE_PCI_WRITE;
 
 	while (npages--) {
-		tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+		tce = proto_tce | rpn << tceshift;
 		rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
 
 		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
 			ret = (int)rc;
-			tce_free_pSeriesLP(liobn, tcenum_start,
+			tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
 					   (npages_start - (npages + 1)));
 			break;
 		}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 	long tcenum_start = tcenum, npages_start = npages;
 	int ret = 0;
 	unsigned long flags;
+	const unsigned long tceshift = tbl->it_page_shift;
 
 	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
 		return tce_build_pSeriesLP(tbl->it_index, tcenum,
-					   tbl->it_page_shift, npages, uaddr,
+					   tceshift, npages, uaddr,
 					   direction, attrs);
 	}
 
@@ -225,13 +228,13 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 		if (!tcep) {
 			local_irq_restore(flags);
 			return tce_build_pSeriesLP(tbl->it_index, tcenum,
-
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
On 4/21/21 10:30 PM, Lijun Pan wrote:

> Fixes: ed651a10875f ("ibmvnic: Updated reset handling")
> Signed-off-by: Dany Madden
> Reviewed-by: Rick Lindsley
> Reviewed-by: Sukadev Bhattiprolu
>
> One thing I would like to point out as already pointed out by Nathan Lynch is that those review-by tags given by the same groups of people from the same company lose credibility over time if you never critique or ask questions on the list.

Well, so far you aren't addressing either my critiques or questions. I have been asking questions, but all I have from you are the above attempts to discredit the reputation of myself and other people, and non-technical statements like "will make the code very difficult to manage", "I think there should be a trade off between optimization and stability", and "So I don't think you could even compare the two results".

On the other hand, from the original submission I see some very specific details:

> If ibmvnic abandons the reset because of this failed set link down and this is the last reset in the workqueue, then this adapter will be left in an inoperable state.

and from a followup discussion:

> We had a FATAL error and when handling it, we failed to send a link-down message to the VIOS. So what we need to try next is to reset the connection with the VIOS. For this we must ...

These are great technical points that could be argued or discussed. Problem is, I agree with them.

I will ask again: can you please supply some technical reasons for your objections? Otherwise, your objections are meritless and at worst simply an ad hominem attack.

Rick
[PATCH v3 00/11] DDW + Indirect Mapping
So far it's assumed possible to map the guest RAM 1:1 to the bus, which works with a small number of devices. SRIOV changes this, as the user can configure hundreds of VFs, and since phyp preallocates TCEs and does not allow IOMMU pages bigger than 64K, it has to limit the number of TCEs per PE to limit the waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW creation is skipped and the default DMA window "ibm,dma-window" is used instead.

Using the DDW instead of the default DMA window may allow expanding the amount of memory that can be DMA-mapped, given the number of pages (TCEs) may stay the same (or increase) and the default DMA window offers only 4k pages while DDW may offer larger pages (4k, 64k, 16M ...).

Patch #1 replaces the hard-coded 4K page size with a variable containing the correct page size for the window.

Patch #2 introduces iommu_table_in_use() and replaces manual bit-field checking where it's used. It will be used for aborting enable_ddw() if there is any current iommu allocation and we are trying single-window indirect mapping.

Patch #3 introduces iommu_pseries_alloc_table(), which will be helpful when indirect mapping needs to replace the iommu_table.

Patch #4 adds helpers for adding DDWs to the list.

Patch #5 refactors enable_ddw() so it returns whether direct mapping is possible, instead of the DMA offset. It helps the next patches on indirect DMA mapping and also allows DMA windows starting at 0x00.

Patch #6 brings a new helper to simplify enable_ddw(), allowing some reorganization for introducing indirect mapping DDW.

Patch #7 adds a new helper _iommu_table_setparms() and uses it in the other *setparms*() functions to fill the iommu_table. It will also be used for creating a new iommu_table for indirect mapping.

Patch #8 updates remove_dma_window() to accept different property names, so we can introduce a new property for indirect mapping.
Patch #9 extracts find_existing_ddw_windows() into find_existing_ddw_windows_named() and calls it by its property name. This will be useful when the property for indirect mapping is created, so we can search the device tree for both properties.

Patch #10: instead of destroying the created DDW if it doesn't map the whole partition, make use of it instead of the default DMA window, as it improves performance. Also, update the iommu_table and re-generate the pools. It introduces a new property name for DDW with indirect DMA mapping.

Patch #11: does some renaming of 'direct window' to 'dma window', given the DDW created can now also be used in indirect mapping if direct mapping is not available.

All patches were tested in an LPAR with a virtio-net interface that allows the default DMA window and DDW to coexist.

Leonardo Bras (11):
  powerpc/pseries/iommu: Replace hard-coded page shift
  powerpc/kernel/iommu: Add new iommu_table_in_use() helper
  powerpc/pseries/iommu: Add iommu_pseries_alloc_table() helper
  powerpc/pseries/iommu: Add ddw_list_new_entry() helper
  powerpc/pseries/iommu: Allow DDW windows starting at 0x00
  powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
  powerpc/pseries/iommu: Reorganize iommu_table_setparms*() with new helper
  powerpc/pseries/iommu: Update remove_dma_window() to accept property name
  powerpc/pseries/iommu: Find existing DDW with given property name
  powerpc/pseries/iommu: Make use of DDW for indirect mapping
  powerpc/pseries/iommu: Rename "direct window" to "dma window"

 arch/powerpc/include/asm/iommu.h       |   1 +
 arch/powerpc/include/asm/tce.h         |   8 -
 arch/powerpc/kernel/iommu.c            |  65 ++--
 arch/powerpc/platforms/pseries/iommu.c | 504 +++--
 4 files changed, 338 insertions(+), 240 deletions(-)

-- 
2.30.2
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
On 4/21/21 10:06 PM, Lijun Pan wrote:

> No real customer runs the system under that heavy load created by HTX stress test, which can tear down any working system.

So, are you saying the bugs that HTX uncovers are not worth fixing? There's a fair number of individuals and teams out there that utilize HTX in the absence of such a workload, and not just for vnic. What workload would you suggest to better serve "real customers"?

> I think such optimizations are catered for passing HTX tests.

Well, yes. If the bugs are found with HTX, the fixes are verified against HTX. If they were found with a "real customer workload" they'd be verified against a "real customer workload."
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
On Wed, Apr 21, 2021 at 3:06 AM Rick Lindsley wrote:

> Please describe the advantage in deferring it further by routing it through do_hard_reset(). I don't see one.

On 4/21/21 10:12 PM, Lijun Pan replied:

> It is not deferred. It exits with error and calls do_hard_reset. See my reply to Suka's.

I saw your reply, but it does not answer the question I asked. The patch would have us reinitialize and restart the communication queues. Your suggestion would do more work than that. Please describe the advantage in deferring the reinitialization - and yes, defer is the right word - by routing it through do_hard_reset() and executing that extra code. I see that route as doing more work than necessary and so introducing additional risk, for no clear advantage. So I would find it helpful if you would describe the advantage.

> The testing was done on this patch. It was not performed on a full hard reset. So I don't think you could even compare the two results.

A problem has been noted, a patch has been proposed, and the reasoning that the patch is correct has been given. Testing with this patch has demonstrated the problem has not returned. So far, that sounds like a pretty reasonable solution. Your comment is curious - why would testing for this patch be done on a full hard reset when this patch does not invoke a full hard reset?

If you have data to consider then let's have it. I'm willing to be convinced, but so far this just sounds like "I wouldn't do it that way myself, and I have a bad feeling about doing it any other way."

Rick
Re: [PATCH V2 net] ibmvnic: Continue with reset if set link down failed
Lijun Pan [lijunp...@gmail.com] wrote:
> > Now, sure we can attempt a "thorough hard reset" which also does
> > the same hcalls to reestablish the connection. Is there any
> > other magic in do_hard_reset()? But in addition, it also frees lot
> > more Linux kernel buffers and reallocates them for instance.
>
> Working around everything in do_reset will make the code very difficult

We are not working around everything. We are doing in do_reset() exactly what we would do in hard reset for this error (ignore the set link down error and try to reestablish the connection with the VIOS). What we are avoiding is unnecessary work on the Linux side for a communication problem on the VIOS side.

> to manage. Ultimately do_reset can do anything I am afraid, and do_hard_reset
> can be removed completely or merged into do_reset.
>
> >
> > If we are having a communication problem with the VIOS, what is
> > the point of freeing and reallocating Linux kernel buffers? Beside
> > being inefficient, it would expose us to even more errors during
> > reset under heavy workloads?
>
> No real customer runs the system under that heavy load created by
> HTX stress test, which can tear down any working system.

We need to talk to capacity planning and test architects about that, but all I want to know is what hard reset would do differently to fix this communication error with VIOS.

Sukadev
Re: [V3 PATCH 16/16] crypto/nx: Add sysfs interface to export NX capabilities
On Sat, Apr 17, 2021 at 02:13:40PM -0700, Haren Myneni wrote:
>
> Changes to export the following NXGZIP capabilities through sysfs:
>
> /sys/devices/vio/ibm,compression-v1/NxGzCaps:
> min_compress_len  /* Recommended minimum compress length in bytes */
> min_decompress_len  /* Recommended minimum decompress length in bytes */
> req_max_processed_len  /* Maximum number of bytes processed in one request */
>
> Signed-off-by: Haren Myneni
> ---
>  drivers/crypto/nx/nx-common-pseries.c | 43 +++
>  1 file changed, 43 insertions(+)

Acked-by: Herbert Xu

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [V3 PATCH 15/16] crypto/nx: Get NX capabilities for GZIP coprocessor type
On Sat, Apr 17, 2021 at 02:12:51PM -0700, Haren Myneni wrote:
>
> phyp provides NX capabilities which give the recommended minimum
> compression / decompression length and maximum request buffer size
> in bytes.
>
> Changes to get NX overall capabilities which point to the specific
> features phyp supports. Then retrieve NXGZIP specific capabilities.
>
> Signed-off-by: Haren Myneni
> ---
>  drivers/crypto/nx/nx-common-pseries.c | 83 +++
>  1 file changed, 83 insertions(+)

Acked-by: Herbert Xu

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [V3 PATCH 14/16] crypto/nx: Register and unregister VAS interface
On Sat, Apr 17, 2021 at 02:12:12PM -0700, Haren Myneni wrote:
>
> Changes to create the /dev/crypto/nx-gzip interface with VAS register
> and to remove this interface with VAS unregister.
>
> Signed-off-by: Haren Myneni
> ---
>  drivers/crypto/nx/Kconfig             | 1 +
>  drivers/crypto/nx/nx-common-pseries.c | 9 +
>  2 files changed, 10 insertions(+)

Acked-by: Herbert Xu

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [V3 PATCH 13/16] crypto/nx: Rename nx-842-pseries file name to nx-common-pseries
On Sat, Apr 17, 2021 at 02:11:15PM -0700, Haren Myneni wrote:
>
> Rename nx-842-pseries.c to nx-common-pseries.c to add code for the new
> GZIP compression type. The actual functionality is not changed in
> this patch.
>
> Signed-off-by: Haren Myneni
> ---
>  drivers/crypto/nx/Makefile                                  | 2 +-
>  drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} | 0
>  2 files changed, 1 insertion(+), 1 deletion(-)
>  rename drivers/crypto/nx/{nx-842-pseries.c => nx-common-pseries.c} (100%)
>
> diff --git a/drivers/crypto/nx/Makefile b/drivers/crypto/nx/Makefile
> index bc89a20e5d9d..d00181a26dd6 100644
> --- a/drivers/crypto/nx/Makefile
> +++ b/drivers/crypto/nx/Makefile
> @@ -14,5 +14,5 @@ nx-crypto-objs := nx.o \
>  obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_PSERIES) += nx-compress-pseries.o nx-compress.o
>  obj-$(CONFIG_CRYPTO_DEV_NX_COMPRESS_POWERNV) += nx-compress-powernv.o nx-compress.o
>  nx-compress-objs := nx-842.o
> -nx-compress-pseries-objs := nx-842-pseries.o
> +nx-compress-pseries-objs := nx-common-pseries.o
>  nx-compress-powernv-objs := nx-common-powernv.o
> diff --git a/drivers/crypto/nx/nx-842-pseries.c b/drivers/crypto/nx/nx-common-pseries.c
> similarity index 100%
> rename from drivers/crypto/nx/nx-842-pseries.c
> rename to drivers/crypto/nx/nx-common-pseries.c

Acked-by: Herbert Xu

-- 
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 1/1] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE() to save TCEs
Hello,

This patch was also reviewed when it was part of another patchset:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200911170738.82818-4-leobra...@gmail.com/

On Thu, 2021-03-18 at 14:44 -0300, Leonardo Bras wrote:
> Currently both iommu_alloc_coherent() and iommu_free_coherent() align the
> desired allocation size to PAGE_SIZE, and gets system pages and IOMMU
> mappings (TCEs) for that value.
>
> When IOMMU_PAGE_SIZE < PAGE_SIZE, this behavior may cause unnecessary
> TCEs to be created for mapping the whole system page.
>
> Example:
> - PAGE_SIZE = 64k, IOMMU_PAGE_SIZE() = 4k
> - iommu_alloc_coherent() is called for 128 bytes
> - 1 system page (64k) is allocated
> - 16 IOMMU pages (16 x 4k) are allocated (16 TCEs used)
>
> It would be enough to use a single TCE for this, so 15 TCEs are
> wasted in the process.
>
> Update iommu_*_coherent() to make sure the size alignment happens only
> for IOMMU_PAGE_SIZE() before calling iommu_alloc() and iommu_free().
>
> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> with IOMMU_PAGE_ALIGN(n, tbl), which is easier to read and does the
> same.
>
> Signed-off-by: Leonardo Bras
> Reviewed-by: Alexey Kardashevskiy
> ---
>  arch/powerpc/kernel/iommu.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5b69a6a72a0e..3329ef045805 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -851,6 +851,7 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>  	unsigned int order;
>  	unsigned int nio_pages, io_order;
>  	struct page *page;
> +	size_t size_io = size;
> 
>  	size = PAGE_ALIGN(size);
>  	order = get_order(size);
> @@ -877,8 +878,9 @@ void *iommu_alloc_coherent(struct device *dev, struct iommu_table *tbl,
>  	memset(ret, 0, size);
> 
>  	/* Set up tces to cover the allocated range */
> -	nio_pages = size >> tbl->it_page_shift;
> -	io_order = get_iommu_order(size, tbl);
> +	size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> +	nio_pages = size_io >> tbl->it_page_shift;
> +	io_order = get_iommu_order(size_io, tbl);
>  	mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
>  			      mask >> tbl->it_page_shift, io_order, 0);
>  	if (mapping == DMA_MAPPING_ERROR) {
> @@ -893,10 +895,9 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>  			 void *vaddr, dma_addr_t dma_handle)
>  {
>  	if (tbl) {
> -		unsigned int nio_pages;
> +		size_t size_io = IOMMU_PAGE_ALIGN(size, tbl);
> +		unsigned int nio_pages = size_io >> tbl->it_page_shift;
> 
> -		size = PAGE_ALIGN(size);
> -		nio_pages = size >> tbl->it_page_shift;
>  		iommu_free(tbl, dma_handle, nio_pages);
>  		size = PAGE_ALIGN(size);
>  		free_pages((unsigned long)vaddr, get_order(size));
Re: [PATCH 1/1] powerpc/kernel/iommu: Use largepool as a last resort when !largealloc
Hello,

FYI: This patch was reviewed when it was part of another patchset:
http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200817234033.442511-4-leobra...@gmail.com/

On Thu, 2021-03-18 at 14:44 -0300, Leonardo Bras wrote:
> As of today, doing iommu_range_alloc() only for !largealloc (npages <= 15)
> will only be able to use 3/4 of the available pages, given pages on
> largepool are not available for !largealloc.
>
> This could mean some drivers not being able to fully use all the available
> pages for the DMA window.
>
> Add pages on largepool as a last resort for !largealloc, making all pages
> of the DMA window available.
>
> Signed-off-by: Leonardo Bras
> Reviewed-by: Alexey Kardashevskiy
> ---
>  arch/powerpc/kernel/iommu.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 3329ef045805..ae6ad8dca605 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -255,6 +255,15 @@ static unsigned long iommu_range_alloc(struct device *dev,
>  			pass++;
>  			goto again;
> 
> +		} else if (pass == tbl->nr_pools + 1) {
> +			/* Last resort: try largepool */
> +			spin_unlock(&pool->lock);
> +			pool = &tbl->large_pool;
> +			spin_lock(&pool->lock);
> +			pool->hint = pool->start;
> +			pass++;
> +			goto again;
> +
>  		} else {
>  			/* Give up */
>  			spin_unlock_irqrestore(&(pool->lock), flags);
Re: [PATCH] powerpc/mce: save ignore_event flag unconditionally for UE
On 4/22/21 11:31 AM, Ganesh wrote:

> On 4/7/21 10:28 AM, Ganesh Goudar wrote:
>
> > When we hit a UE while using machine check safe copy routines, the ignore_event flag is set and the event is ignored by the mce handler. The flag is also saved for deferred handling and printing of mce event information. But as of now, saving of this flag is done on checking if the effective address is provided or the physical address is calculated, which is not right.
> >
> > Save the ignore_event flag regardless of whether the effective address is provided or the physical address is calculated.
> >
> > Without this change the following log is seen, when the event is to be ignored.
> >
> > [  512.971365] MCE: CPU1: machine check (Severe) UE Load/Store [Recovered]
> > [  512.971509] MCE: CPU1: NIP: [c00b67c0] memcpy+0x40/0x90
> > [  512.971655] MCE: CPU1: Initiator CPU
> > [  512.971739] MCE: CPU1: Unknown
> > [  512.972209] MCE: CPU1: machine check (Severe) UE Load/Store [Recovered]
> > [  512.972334] MCE: CPU1: NIP: [c00b6808] memcpy+0x88/0x90
> > [  512.972456] MCE: CPU1: Initiator CPU
> > [  512.972534] MCE: CPU1: Unknown
> >
> > Signed-off-by: Ganesh Goudar
> > ---
> >  arch/powerpc/kernel/mce.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
>
> Hi mpe, any comments on this patch?

Please ignore, I see it's applied.
Re: [PATCH] powerpc/mce: save ignore_event flag unconditionally for UE
On 4/7/21 10:28 AM, Ganesh Goudar wrote:

> When we hit a UE while using machine check safe copy routines, the ignore_event flag is set and the event is ignored by the mce handler. The flag is also saved for deferred handling and printing of mce event information. But as of now, saving of this flag is done on checking if the effective address is provided or the physical address is calculated, which is not right.
>
> Save the ignore_event flag regardless of whether the effective address is provided or the physical address is calculated.
>
> Without this change the following log is seen, when the event is to be ignored.
>
> [  512.971365] MCE: CPU1: machine check (Severe) UE Load/Store [Recovered]
> [  512.971509] MCE: CPU1: NIP: [c00b67c0] memcpy+0x40/0x90
> [  512.971655] MCE: CPU1: Initiator CPU
> [  512.971739] MCE: CPU1: Unknown
> [  512.972209] MCE: CPU1: machine check (Severe) UE Load/Store [Recovered]
> [  512.972334] MCE: CPU1: NIP: [c00b6808] memcpy+0x88/0x90
> [  512.972456] MCE: CPU1: Initiator CPU
> [  512.972534] MCE: CPU1: Unknown
>
> Signed-off-by: Ganesh Goudar
> ---
>  arch/powerpc/kernel/mce.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Hi mpe, any comments on this patch?