Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek [2021-04-23 19:45:05]:
> > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek [2021-04-23 09:35:51]:
> > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > From: "Gautham R. Shenoy"
> > > > >
> > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > of the Extended CEDE states advertised by the platform.
> > > > >
> > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > Can you be more specific about 'older firmwares'?
> > >
> > > Hi Michal,
> > >
> > > This is a POWER9 vs POWER10 difference, not really an obsolete FW. The
> > > key idea behind the original patch was to make the H_CEDE latency, and
> > > hence the target residency, come from firmware instead of being decided
> > > by the kernel. The advantage is that different types of systems in the
> > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > entry criterion which balances good single-thread performance and
> > > wakeup latency. Further, we can have additional H_CEDE states to feed
> > > into cpuidle.
> >
> > So all POWER9 machines are affected by the firmware bug where firmware
> > reports a CEDE1 exit latency of 2us while the real latency is 5us, which
> > causes the kernel to prefer CEDE1 too much when relying on the values
> > supplied by the firmware. It is not about 'older firmware'.
>
> Correct. All POWER9 systems running Linux as guest LPARs will see
> extra usage of the CEDE idle state, but not bare metal (PowerNV).
>
> The correct definition of the bug, or mismatch in expectation, is that
> firmware reports wakeup latency from a core/thread wakeup timing, but
> not the end-to-end time from sending a wakeup event (such as an IPI
> using H_calls) to receiving the event on the target. Practically, a few
> extra microseconds are needed between deciding to wake up a target
> core/thread and getting the target to start executing instructions
> within the LPAR instance.

Thanks for the detailed explanation. Maybe just adding a few microseconds
to the reported time would be a more reasonable workaround than using a
blanket fixed value, then.

> > I still think it would be preferable to adjust the latency value
> > reported by the firmware to match reality over a kernel workaround.
>
> Right, practically we can fix this for future releases, and as such we
> targeted this scheme from POWER10, but expected no harm on POWER9, which
> proved to be wrong.
>
> We can possibly change this FW value for POWER9, but it is too
> expensive and not practical because many release streams exist for
> different platforms, and customers are at different streams as well.
> We cannot force all of them to update because that blows up the
> co-dependency matrix.

From the user point of view only a few firmware release streams exist,
but what is packaged in such binaries might be another story.

Thanks

Michal
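Michal's suggestion here, adding a few microseconds to the firmware-reported value rather than overriding it with a blanket constant, could be sketched as follows. This is a standalone userspace model, not the pseries cpuidle code; the 3us fudge value and the function name are illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical slack covering the IPI-send-to-first-instruction gap
 * that the firmware number does not include (value picked only to
 * match the 2us-reported vs ~5us-observed numbers in the thread). */
#define CEDE_WAKEUP_FUDGE_US 3

/* Firmware reports core/thread wakeup timing; add the end-to-end
 * slack instead of replacing the value with a fixed constant. */
static unsigned int cede_exit_latency_us(unsigned int fw_latency_us)
{
	return fw_latency_us + CEDE_WAKEUP_FUDGE_US;
}
```

With the 2us value the POWER9 firmware advertises, this yields the roughly 5us end-to-end latency observed in practice, while still letting future platforms tune the base value.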
Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
* Michal Suchánek [2021-04-23 19:45:05]:
> On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > * Michal Suchánek [2021-04-23 09:35:51]:
> > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > From: "Gautham R. Shenoy"
> > > >
> > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > of the Extended CEDE states advertised by the platform.
> > > >
> > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > Can you be more specific about 'older firmwares'?
> >
> > Hi Michal,
> >
> > This is a POWER9 vs POWER10 difference, not really an obsolete FW. The
> > key idea behind the original patch was to make the H_CEDE latency, and
> > hence the target residency, come from firmware instead of being decided
> > by the kernel. The advantage is that different types of systems in the
> > POWER10 generation can adjust this value and have an optimal H_CEDE
> > entry criterion which balances good single-thread performance and
> > wakeup latency. Further, we can have additional H_CEDE states to feed
> > into cpuidle.
>
> So all POWER9 machines are affected by the firmware bug where firmware
> reports a CEDE1 exit latency of 2us while the real latency is 5us, which
> causes the kernel to prefer CEDE1 too much when relying on the values
> supplied by the firmware. It is not about 'older firmware'.

Correct. All POWER9 systems running Linux as guest LPARs will see
extra usage of the CEDE idle state, but not bare metal (PowerNV).

The correct definition of the bug, or mismatch in expectation, is that
firmware reports wakeup latency from a core/thread wakeup timing, but
not the end-to-end time from sending a wakeup event (such as an IPI
using H_calls) to receiving the event on the target. Practically, a few
extra microseconds are needed between deciding to wake up a target
core/thread and getting the target to start executing instructions
within the LPAR instance.

> I still think it would be preferable to adjust the latency value
> reported by the firmware to match reality over a kernel workaround.

Right, practically we can fix this for future releases, and as such we
targeted this scheme from POWER10, but expected no harm on POWER9, which
proved to be wrong.

We can possibly change this FW value for POWER9, but it is too expensive
and not practical because many release streams exist for different
platforms, and customers are at different streams as well. We cannot
force all of them to update because that blows up the co-dependency
matrix.

From a user/customer's view, the current kernel worked fine, so why is a
kernel update changing my performance? :(

Looking back, we should have boxed any kernel-firmware dependent feature
like this one to future releases only. We have all options open for a
future release like POWER10, but on a released product stream, we have
to manage with kernel changes. In this specific case we should have been
very conservative and not allowed the kernel to change behaviour on
released products.

Thanks,
Vaidy
Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek [2021-04-23 09:35:51]:
> > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > From: "Gautham R. Shenoy"
> > >
> > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > of the Extended CEDE states advertised by the platform.
> > >
> > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > Can you be more specific about 'older firmwares'?
>
> Hi Michal,
>
> This is a POWER9 vs POWER10 difference, not really an obsolete FW. The
> key idea behind the original patch was to make the H_CEDE latency, and
> hence the target residency, come from firmware instead of being decided
> by the kernel. The advantage is that different types of systems in the
> POWER10 generation can adjust this value and have an optimal H_CEDE
> entry criterion which balances good single-thread performance and
> wakeup latency. Further, we can have additional H_CEDE states to feed
> into cpuidle.

So all POWER9 machines are affected by the firmware bug where firmware
reports a CEDE1 exit latency of 2us while the real latency is 5us, which
causes the kernel to prefer CEDE1 too much when relying on the values
supplied by the firmware. It is not about 'older firmware'.

I still think it would be preferable to adjust the latency value
reported by the firmware to match reality over a kernel workaround.

Thanks

Michal
Re: [PATCH v2 1/2] powerpc: Free fdt on error in elf64_load()
On 4/21/21 9:36 AM, Lakshmi Ramasubramanian wrote:

Hi Dan,

There are a few "goto out;" statements before the local variable "fdt"
is initialized through the call to of_kexec_alloc_and_setup_fdt() in
elf64_load(). This will result in an uninitialized "fdt" being passed
to kvfree() in this function if there is an error before the call to
of_kexec_alloc_and_setup_fdt(). If there is any error after fdt is
allocated, but before it is saved in the arch specific kimage struct,
free the fdt.

Reported-by: kernel test robot
Reported-by: Dan Carpenter
Signed-off-by: Michael Ellerman
Signed-off-by: Lakshmi Ramasubramanian
---
 arch/powerpc/kexec/elf_64.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

Please review this patch and Patch 2/2.

thanks,
 -lakshmi

diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 5a569bb51349..02662e72c53d 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -114,7 +114,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
 				  initrd_len, cmdline);
 	if (ret)
-		goto out;
+		goto out_free_fdt;
 
 	fdt_pack(fdt);
 
@@ -125,7 +125,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
 	ret = kexec_add_buffer(&kbuf);
 	if (ret)
-		goto out;
+		goto out_free_fdt;
 
 	/* FDT will be freed in arch_kimage_file_post_load_cleanup */
 	image->arch.fdt = fdt;
@@ -140,18 +140,14 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
 	if (ret)
 		pr_err("Error setting up the purgatory.\n");
 
+	goto out;
+
+out_free_fdt:
+	kvfree(fdt);
 out:
 	kfree(modified_cmdline);
 	kexec_free_elf_info(&elf_info);
 
-	/*
-	 * Once FDT buffer has been successfully passed to kexec_add_buffer(),
-	 * the FDT buffer address is saved in image->arch.fdt. In that case,
-	 * the memory cannot be freed here in case of any other error.
-	 */
-	if (ret && !image->arch.fdt)
-		kvfree(fdt);
-
 	return ret ? ERR_PTR(ret) : NULL;
 }
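The restructuring in this patch is the usual kernel idiom of ordered cleanup labels: a buffer is freed exactly when it was allocated but not yet handed off. A self-contained userspace model of the same control flow (all names here are illustrative, not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal model of the elf64_load() fix: a dedicated out_free_fdt
 * label frees the buffer only on paths where it was allocated but
 * ownership was not transferred. */
static int load(int fail_before_alloc, int fail_after_alloc, char **handoff)
{
	int ret = 0;
	char *fdt;			/* intentionally uninitialized, as in elf64_load() */

	if (fail_before_alloc) {
		ret = -1;
		goto out;		/* fdt never touched on this path */
	}

	fdt = malloc(16);
	if (!fdt) {
		ret = -1;
		goto out;
	}

	if (fail_after_alloc) {
		ret = -1;
		goto out_free_fdt;	/* allocated but not handed off: free it */
	}

	*handoff = fdt;			/* ownership transferred; don't free here */
	goto out;

out_free_fdt:
	free(fdt);
out:
	return ret;
}
```

With the old single `out:` label, freeing had to be guarded by `if (ret && !image->arch.fdt)`, and any early `goto out` would hand an uninitialized pointer to the free routine; splitting the labels makes both hazards structurally impossible.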
Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker
On 23/04/2021 at 00:44, Nick Desaulniers wrote:
> On Wed, Sep 2, 2020 at 11:02 AM Christophe Leroy wrote:
>> On 02/09/2020 at 19:41, Nick Desaulniers wrote:
>>> On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman wrote:
>>>> Nick Desaulniers writes:
>>>>> Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for
>>>>> orphan sections")
>>>> I think I'll just revert that for v5.9?
>>> SGTM; you'll probably still want these changes with some modifications
>>> at some point; vdso32 did have at least one orphaned section, and will
>>> be important for hermetic builds. Seeing crashes in supported versions
>>> of the tools ties our hands at the moment.
>> Keeping the tool problem aside with binutils 2.26, do you have a way to
>> really link an elf32ppc object when building vdso32 for PPC64?
>
> Sorry, I'm doing a bug scrub and found
> https://github.com/ClangBuiltLinux/linux/issues/774 still open (and my
> reply to this thread still in Drafts; never sent). With my patches
> rebased:
>
> $ file arch/powerpc/kernel/vdso32/vdso32.so
> arch/powerpc/kernel/vdso32/vdso32.so: ELF 32-bit MSB shared object,
> PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, stripped
>
> Are you still using 2.26? I'm not able to repro Nathan's reported issue
> from
> https://lore.kernel.org/lkml/20200902052123.GA2687902@ubuntu-n2-xlarge-x86/,
> so I'm curious if I should resend the rebased patches as v2?
One comment on your rebased patch:

> diff --git a/arch/powerpc/include/asm/vdso.h b/arch/powerpc/include/asm/vdso.h
> index 8542e9bbeead..0bd06ec06aaa 100644
> --- a/arch/powerpc/include/asm/vdso.h
> +++ b/arch/powerpc/include/asm/vdso.h
> @@ -25,19 +25,7 @@ int vdso_getcpu_init(void);
>  
>  #else /* __ASSEMBLY__ */
>  
> -#ifdef __VDSO64__
> -#define V_FUNCTION_BEGIN(name)	\
> -	.globl name;		\
> -	name:			\
> -
> -#define V_FUNCTION_END(name)	\
> -	.size name,.-name;
> -
> -#define V_LOCAL_FUNC(name) (name)
> -#endif /* __VDSO64__ */
> -
> -#ifdef __VDSO32__
> -
> +#if defined(__VDSO32__) || defined (__VDSO64__)

You always have either __VDSO32__ or __VDSO64__, so this #if is
pointless.

>  #define V_FUNCTION_BEGIN(name)	\
>  	.globl name;		\
>  	.type name,@function;	\
> @@ -47,8 +35,7 @@ int vdso_getcpu_init(void);
>  	.size name,.-name;
>  
>  #define V_LOCAL_FUNC(name) (name)
> -
> -#endif /* __VDSO32__ */
> +#endif /* __VDSO{32|64}__ */
>  
>  #endif /* __ASSEMBLY__ */

Christophe
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
On Fri, Apr 23, 2021 at 12:26:57PM +0200, Gabriel Paubert wrote:
> On Thu, Apr 22, 2021 at 06:26:16PM -0500, Segher Boessenkool wrote:
> > > But this can be made jump free :-):
> > >
> > > 	int tmp = regs->ccr << ra;
> > > 	op->val = (tmp >> 31) | ((tmp >> 30) & 1);
> >
> > The compiler will do so automatically (or think of some better way to
> > get the same result); in source code, what matters most is readability,
> > or clarity in general (also clarity to the compiler).
>
> I just did a test (trivial code attached) and the original code always
> produces one conditional branch at -O2, at least with the cross-compiler
> I have on Debian (gcc 8.3). I have tested both -m32 and -m64. The 64-bit
> version produces an unnecessary "extsw", so I wrote the second version
> splitting the setting of the return value, which gets rid of it.

That is an older compiler, and it will be out of service any day now.
It depends on what compiler flags you use, and what version of the ISA
you are targeting.

> The second "if" is fairly simple to optimize and the compiler does it
> properly.

Yeah.

> Of course with my suggestion the compiler does not produce any branch.
> But it needs a really good comment.

Or you could try and help improve the compiler ;-)  You can do this
without writing compiler code yourself, by writing up some good
enhancement request in bugzilla.

The wider and more OoO the processors become, the more important it
becomes to have branch-free code, in situations where the branches would
not be well-predictable.

> > (Right shifts of negative numbers are implementation-defined in C,
> > fwiw -- but work like you expect in GCC).
>
> Well, I'm not worried about it, since I'd expect a compiler that does
> logical right shifts on signed values to break so much code that it
> would be easily noticed (also in the kernel).

Yup.  And it *is* defined for signed values, as long as they are
non-negative (the common case).

> > > > Also, even people who write LE have the bigger end on the left
> > > > normally (they just write some things right-to-left, and other
> > > > things left-to-right).
> > >
> > > Around 1985, I had documentation for the National's 32032
> > > (little-endian) processor family, and all the instruction encodings
> > > were presented with the LSB on the left and MSB on the right.
> >
> > Ouch!  Did they write "regular" numbers with the least significant
> > digit on the left as well?
>
> No, they were not that sadistic! But more inconsistent :-)

Segher
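Gabriel's branch-free trick quoted above can be checked against a straightforward branchy version in plain userspace C. This is a sketch: `ccr` and `sh` stand in for `regs->ccr` and the shift amount `ra`, and it is not the kernel's sstep code.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy reference: the LT bit of the selected CR field gives -1,
 * otherwise the GT bit gives 1, otherwise 0 (the setb semantics). */
static long setb_ref(uint32_t ccr, int sh)
{
	if ((ccr >> (31 - sh)) & 1)
		return -1;
	if ((ccr >> (30 - sh)) & 1)
		return 1;
	return 0;
}

/* Gabriel's branch-free variant. Relies on arithmetic right shift of
 * a negative int, which is implementation-defined in C but behaves as
 * expected in GCC, as discussed in the thread. */
static long setb_branchfree(uint32_t ccr, int sh)
{
	int tmp = (int)(ccr << sh);
	return (tmp >> 31) | ((tmp >> 30) & 1);
}
```

The shift moves the chosen CR field's LT bit into the sign bit, so the arithmetic `tmp >> 31` smears it into all-ones (-1) when set; otherwise `(tmp >> 30) & 1` supplies the GT bit.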
Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
* Michal Suchánek [2021-04-23 09:35:51]:
> On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy"
> >
> > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform.
> >
> > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> Can you be more specific about 'older firmwares'?

Hi Michal,

This is a POWER9 vs POWER10 difference, not really an obsolete FW. The
key idea behind the original patch was to make the H_CEDE latency, and
hence the target residency, come from firmware instead of being decided
by the kernel. The advantage is that different types of systems in the
POWER10 generation can adjust this value and have an optimal H_CEDE
entry criterion which balances good single-thread performance and
wakeup latency. Further, we can have additional H_CEDE states to feed
into cpuidle.

> Also while this is a performance regression on such firmwares it
> should be fixed by updating the firmware to current version.
>
> Having sub-optimal performance on obsolete firmware should not require
> a kernel workaround, should it?

When we designed and tested this change on POWER9 and POWER10 systems,
the values that were set in F/W worked out fine, with positive results
in all our micro-benchmarks and no regression in context-switch tests.
These repeatable results gave us the confidence to go ahead, take the
values from F/W, and remove the kernel's value for all future Linux
versions. But where we slipped is that real-world workloads show
variations in performance, and regressions in specific cases, because we
favour the H_CEDE state more often than the snooze loop. The root cause
is that we have to send more IPIs for wakeup now, because more CPUs will
be in the H_CEDE state than before.

This is a performance problem on POWER9 systems where we actually
expected a good benefit, and also proved it with micro-benchmarks, but
it later turned out to have an impact on some workloads. Further, the
challenge is not that the regressions are severe; it is that on the
exact same hardware and firmware, end users expect similar or better
performance for everything when updating to a newer kernel, and no
regressions.

We have these settings adjusted for POWER10 in F/W, and hence behaviour
will be similar when we come from an old kernel on P9 to a new kernel on
P10. We did test the reverse also: a new kernel on P9 should show a
benefit. But as explained, the benefit came at the cost of regressing in
a few cases which were discovered later. Hence this fix keeps exactly
the same behaviour for POWER9 and uses the F/W-driven heuristic only
from POWER10 onwards.

> It's not like the kernel would crash on the affected firmware.

Correct. We do not have a functional issue, only a performance
regression observable on certain real workloads. This is a minor change
in cpuidle's H_CEDE usage which will show up only in certain workload
patterns where we need idle CPU threads to wake up faster to get the job
done, as compared to keeping busy CPU threads in single-thread mode to
get more execution slices.

This fix is primarily to ensure a kernel update does not change H_CEDE
behaviour on the same hardware generation, thereby causing performance
variation and also regression in some cases.

Thanks for the questions and comments. I hope this gives additional
context for this fix.

--Vaidy
Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()
On Fri, Apr 23, 2021 at 9:42 AM David Laight wrote:
>
> From: Michael Ellerman
> > Sent: 23 April 2021 14:51
> ...
> > > (Does anyone - and by anyone I mean any large distro - compile with
> > > local variables inited by the compiler?)
> >
> > This is where I say, "yes, Android" and you say "ugh no I meant a real
> > distro", and I say "well ...".
> >
> > But yeah doesn't help us much.
>
> And it doesn't seem to stop my phone crashing either :-)
>
> Of course, I've absolutely no way of finding out where it is crashing.
> Nor where the massive memory leaks are that mean it needs rebooting
> every few days.

You need a new phone. :)

My Pixel 3 uptime is sitting at 194 hours currently. It would be more,
but those annoying security updates require reboots. I have had phones
that I set up to reboot every night though.

Rob
Re: [PATCH 1/2] powerpc: Fix a memory leak in error handling paths
On Fri, Apr 23, 2021 at 9:40 AM Christophe JAILLET wrote:
>
> If we exit the for_each_of_cpu_node loop early, the reference on the
> current node must be decremented, otherwise there is a leak.
>
> Fixes: a94fe366340a ("powerpc: use for_each_of_cpu_node iterator")
> Signed-off-by: Christophe JAILLET
> ---
> Strangely, the commit above added the needed of_node_put in one place
> but missed 2 other places!

Well, it maintained it in one place and forgot to add it in the other
two.

> This is strange, so maybe I misunderstand something. Review carefully.
> ---
>  arch/powerpc/platforms/powermac/feature.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Rob Herring

I don't really think patch 2 is worthwhile, but that's up to the powerpc
maintainers.

Rob
RE: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()
From: Michael Ellerman
> Sent: 23 April 2021 14:51
...
> > (Does anyone - and by anyone I mean any large distro - compile with
> > local variables inited by the compiler?)
>
> This is where I say, "yes, Android" and you say "ugh no I meant a real
> distro", and I say "well ...".
>
> But yeah doesn't help us much.

And it doesn't seem to stop my phone crashing either :-)

Of course, I've absolutely no way of finding out where it is crashing.
Nor where the massive memory leaks are that mean it needs rebooting
every few days.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
[PATCH 2/2] powerpc: Save a few lines of code
'arch/powerpc/platforms/powermac/feature.c' triggers many checkpatch.pl
warnings. The code looks old and not very active, so fixing them all
does not make much sense and would hide some 'git blame' information.

So only apply a few fixes that save a few lines of code.

Signed-off-by: Christophe JAILLET
---
 arch/powerpc/platforms/powermac/feature.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powermac/feature.c b/arch/powerpc/platforms/powermac/feature.c
index e61f7d2e..155e8586692e 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -83,8 +83,7 @@ struct macio_chip *macio_find(struct device_node *child, int type)
 }
 EXPORT_SYMBOL_GPL(macio_find);
 
-static const char *macio_names[] =
-{
+static const char *macio_names[] = {
 	"Unknown",
 	"Grand Central",
 	"OHare",
@@ -119,8 +118,7 @@ struct feature_table_entry {
 	feature_call	function;
 };
 
-struct pmac_mb_def
-{
+struct pmac_mb_def {
 	const char*	model_string;
 	const char*	model_name;
 	int		model_id;
@@ -301,11 +299,10 @@ static long ohare_sleep_state(struct device_node *node, long param, long value)
 	if ((pmac_mb.board_flags & PMAC_MB_CAN_SLEEP) == 0)
 		return -EPERM;
 
-	if (value == 1) {
+	if (value == 1)
 		MACIO_BIC(OHARE_FCR, OH_IOBUS_ENABLE);
-	} else if (value == 0) {
+	else if (value == 0)
 		MACIO_BIS(OHARE_FCR, OH_IOBUS_ENABLE);
-	}
 
 	return 0;
 }
@@ -2992,7 +2989,6 @@ void pmac_register_agp_pm(struct pci_dev *bridge,
 	if (bridge != pmac_agp_bridge)
 		return;
 	pmac_agp_suspend = pmac_agp_resume = NULL;
-	return;
 }
 EXPORT_SYMBOL(pmac_register_agp_pm);
-- 
2.27.0
[PATCH 1/2] powerpc: Fix a memory leak in error handling paths
If we exit the for_each_of_cpu_node loop early, the reference on the
current node must be decremented, otherwise there is a leak.

Fixes: a94fe366340a ("powerpc: use for_each_of_cpu_node iterator")
Signed-off-by: Christophe JAILLET
---
Strangely, the commit above added the needed of_node_put in one place
but missed 2 other places!
This is strange, so maybe I misunderstand something. Review carefully.
---
 arch/powerpc/platforms/powermac/feature.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powermac/feature.c b/arch/powerpc/platforms/powermac/feature.c
index 5c77b9a24c0e..e61f7d2e 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -1060,6 +1060,7 @@ core99_reset_cpu(struct device_node *node, long param, long value)
 			continue;
 		if (param == *num) {
 			reset_io = *rst;
+			of_node_put(np);
 			break;
 		}
 	}
@@ -1506,6 +1507,7 @@ static long g5_reset_cpu(struct device_node *node, long param, long value)
 			continue;
 		if (param == *num) {
 			reset_io = *rst;
+			of_node_put(np);
 			break;
 		}
 	}
-- 
2.27.0
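The leak class this patch fixes comes from the iterator contract: for_each_of_cpu_node() holds a reference on the node it currently yields and drops it when advancing, so breaking out early leaves exactly one reference held unless the caller puts it. The refcounted toy below models that contract in userspace; it is an illustration, not the real OF API.

```c
#include <assert.h>

struct node { int refcount; };

/* Toy get/put standing in for of_node_get()/of_node_put(). */
static void node_get(struct node *n) { n->refcount++; }
static void node_put(struct node *n) { n->refcount--; }

/* Walk the array; on a match, stop early. Mirrors core99_reset_cpu():
 * without the node_put() before break, the matched node keeps an
 * extra reference forever. */
static void walk(struct node *nodes, int count, int match, int fixed)
{
	for (int i = 0; i < count; i++) {
		node_get(&nodes[i]);		/* iterator takes a reference */
		if (i == match) {
			if (fixed)
				node_put(&nodes[i]);	/* the added of_node_put(np) */
			break;
		}
		node_put(&nodes[i]);		/* iterator drops it when advancing */
	}
}
```

Running the walk with and without the fix shows the dangling reference the patch removes.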
[PATCH] powerpc/signal32: Fix erroneous SIGSEGV on RT signal return
The return of user_read_access_begin() is tested the wrong way, leading
to a SIGSEGV when the user address is valid, and likely an Oops when the
user address is bad.

Fix the test.

Fixes: 887f3ceb51cd ("powerpc/signal32: Convert do_setcontext[_tm]() to user access block")
Signed-off-by: Christophe Leroy
---
 arch/powerpc/kernel/signal_32.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index fc41c58f0cbb..8f05ed0da292 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -967,7 +967,7 @@ static int do_setcontext(struct ucontext __user *ucp, struct pt_regs *regs, int
 	sigset_t set;
 	struct mcontext __user *mcp;
 
-	if (user_read_access_begin(ucp, sizeof(*ucp)))
+	if (!user_read_access_begin(ucp, sizeof(*ucp)))
 		return -EFAULT;
 
 	unsafe_get_sigset_t(&set, &ucp->uc_sigmask, failed);
@@ -1005,7 +1005,7 @@ static int do_setcontext_tm(struct ucontext __user *ucp,
 	u32 cmcp;
 	u32 tm_cmcp;
 
-	if (user_read_access_begin(ucp, sizeof(*ucp)))
+	if (!user_read_access_begin(ucp, sizeof(*ucp)))
 		return -EFAULT;
 
 	unsafe_get_sigset_t(&set, &ucp->uc_sigmask, failed);
-- 
2.25.0
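The bug class here is an inverted success test: user_read_access_begin() returns true on success, so the guard must be `if (!...)`. A userspace sketch of the same inversion and its effect (the names are stand-ins, not the kernel API):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for user_read_access_begin(): true means the user-memory
 * access window was successfully opened. */
static bool access_begin_ok(bool addr_valid) { return addr_valid; }

#define EFAULT 14	/* matches the usual errno value */

/* Buggy guard from the original commit: bails out exactly when the
 * access was successfully opened. */
static int setcontext_buggy(bool addr_valid)
{
	if (access_begin_ok(addr_valid))
		return -EFAULT;	/* valid address wrongly rejected -> SIGSEGV */
	return 0;		/* bad address wrongly accepted -> would Oops */
}

/* Fixed guard: only bail out when the access could not be opened. */
static int setcontext_fixed(bool addr_valid)
{
	if (!access_begin_ok(addr_valid))
		return -EFAULT;
	return 0;
}
```

The inversion flips both outcomes at once, which is why the symptom was a spurious SIGSEGV on valid signal frames rather than a subtle corruption.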
Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()
Daniel Axtens writes:
> Daniel Axtens writes:
>
>> Hi Lakshmi,
>>
>>> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote:
>>>
>>> Sorry - missed copying the device-tree and powerpc mailing lists.
>>>
>>>> There are a few "goto out;" statements before the local variable
>>>> "fdt" is initialized through the call to
>>>> of_kexec_alloc_and_setup_fdt() in elf64_load(). This will result in
>>>> an uninitialized "fdt" being passed to kvfree() in this function if
>>>> there is an error before the call to
>>>> of_kexec_alloc_and_setup_fdt(). Initialize the local variable "fdt"
>>>> to NULL.
>>
>> I'm a huge fan of initialising local variables! But I'm struggling to
>> find the code path that will lead to an uninit fdt being returned...
>
> OK, so perhaps this was putting it too strongly. I have been bitten
> by uninitialised things enough in C that I may have taken a slightly
> overly-aggressive view of fixing them in the source rather than the
> compiler. I do think compiler-level mitigations are better, and I take
> the point that we don't want to defeat compiler checking.
>
> (Does anyone - and by anyone I mean any large distro - compile with
> local variables inited by the compiler?)

This is where I say, "yes, Android" and you say "ugh no I meant a real
distro", and I say "well ...".

But yeah, doesn't help us much.

cheers
Re: [PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available
On 2021-04-22 09:15, Claire Chang wrote:
> The restricted DMA pool is preferred if available.
>
> The restricted DMA pools provide a basic level of protection against the
> DMA overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system
> needs to provide a way to lock down the memory access, e.g., MPU.
>
> Signed-off-by: Claire Chang
> ---
>  kernel/dma/direct.c | 35 ++++++++++++++++++++++++-----------
>  1 file changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 7a27f0510fcc..29523d2a9845 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
>  static void __dma_direct_free_pages(struct device *dev, struct page *page,
>  				    size_t size)
>  {
> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> +	if (swiotlb_free(dev, page, size))
> +		return;
> +#endif
>  	dma_free_contiguous(dev, page, size);
>  }
>
> @@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
>  	gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
>  					   &phys_limit);
> -	page = dma_alloc_contiguous(dev, size, gfp);
> +
> +#ifdef CONFIG_DMA_RESTRICTED_POOL
> +	page = swiotlb_alloc(dev, size);
> +	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
> +		__dma_direct_free_pages(dev, page, size);
> +		page = NULL;
> +	}
> +#endif
> +
> +	if (!page)
> +		page = dma_alloc_contiguous(dev, size, gfp);
>  	if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
>  		dma_free_contiguous(dev, page, size);
>  		page = NULL;
> @@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  		gfp |= __GFP_NOWARN;
>
>  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> -	    !force_dma_unencrypted(dev)) {
> +	    !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
>  		page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
>  		if (!page)
>  			return NULL;
> @@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	}
>
>  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
> -	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> -	    !dev_is_dma_coherent(dev))
> +	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
> +	    !is_dev_swiotlb_force(dev))
>  		return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
>
>  	/*
> @@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
>  	    !gfpflags_allow_blocking(gfp) &&
>  	    (force_dma_unencrypted(dev) ||
> -	     (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev))))
> +	     (IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> +	      !dev_is_dma_coherent(dev))) &&
> +	    !is_dev_swiotlb_force(dev))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
>
>  	/* we always manually zero the memory once we are done */
> @@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size,
>  	unsigned int page_order = get_order(size);
>
>  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&
> -	    !force_dma_unencrypted(dev)) {
> +	    !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
>  		/* cpu_addr is a struct page cookie, not a kernel address */
>  		dma_free_contiguous(dev, cpu_addr, size);
>  		return;
>  	}
>
>  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&
> -	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
> -	    !dev_is_dma_coherent(dev)) {
> +	    !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
> +	    !is_dev_swiotlb_force(dev)) {
>  		arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
>  		return;
>  	}
> @@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
>  	void *ret;
>
>  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
> -	    force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
> +	    force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
> +	    !is_dev_swiotlb_force(dev))
>  		return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);

Wait, this seems broken for non-coherent devices - in that case we need
to return a non-cacheable address, but we can't simply fall through into
the remapping path below in GFP_ATOMIC context. That's why we need the
atomic pool concept in the first place :/

Unless I've overlooked something, we're still using the regular
cacheable linear map address of the dma_io_tlb_mem buffer, no?

Robin.

>  	page = __dma_direct_alloc_pages(dev, size,
Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode
2021-04-23 12:46 UTC+0200 ~ Christophe Leroy
> On 23/04/2021 at 12:26, Quentin Monnet wrote:
>> 2021-04-23 09:19 UTC+0200 ~ Christophe Leroy
>>
>> [...]
>>
>>> I finally managed to cross-compile bpftool with libbpf, libopcodes,
>>> readline, ncurses, libcap, libz and all needed stuff. Was not easy
>>> but I made it.
>>
>> Libcap is optional and bpftool does not use readline or ncurses. May I
>> ask how you tried to build it?
>
> cd tools/bpf/
>
> make ARCH=powerpc CROSS_COMPILE=ppc-linux-

OK, you could try running directly from tools/bpf/bpftool/ next time
instead. Readline at least is for a different tool under tools/bpf/,
bpf_dbg (but I'm still not sure where that ncurses requirement was
pulled from). The requirements for specific kernel options probably came
from yet another tool (runqslower, I think).

Quentin
Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode
On 23/04/2021 at 12:59, Quentin Monnet wrote:
> 2021-04-23 12:46 UTC+0200 ~ Christophe Leroy
>> On 23/04/2021 at 12:26, Quentin Monnet wrote:
>>> 2021-04-23 09:19 UTC+0200 ~ Christophe Leroy
>>>
>>> [...]
>>>
>>>> I finally managed to cross-compile bpftool with libbpf, libopcodes,
>>>> readline, ncurses, libcap, libz and all needed stuff. Was not easy
>>>> but I made it.
>>>
>>> Libcap is optional and bpftool does not use readline or ncurses. May
>>> I ask how you tried to build it?
>>
>> cd tools/bpf/
>>
>> make ARCH=powerpc CROSS_COMPILE=ppc-linux-
>
> OK, you could try running directly from tools/bpf/bpftool/ next time
> instead. Readline at least is for a different tool under tools/bpf/,
> bpf_dbg (but I'm still not sure where that ncurses requirement was
> pulled from). The requirements for specific kernel options probably
> came from yet another tool (runqslower, I think).

ncurses (or termcap) is required by readline.

Christophe
Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode
Le 23/04/2021 à 12:26, Quentin Monnet a écrit : 2021-04-23 09:19 UTC+0200 ~ Christophe Leroy [...] I finally managed to cross compile bpftool with libbpf, libopcodes, readline, ncurses, libcap, libz and all needed stuff. Was not easy but I made it. Libcap is optional and bpftool does not use readline or ncurses. May I ask how you tried to build it? cd tools/bpf/ make ARCH=powerpc CROSS_COMPILE=ppc-linux- Christophe
Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode
2021-04-23 09:19 UTC+0200 ~ Christophe Leroy [...] > I finally managed to cross compile bpftool with libbpf, libopcodes, > readline, ncurses, libcap, libz and all needed stuff. Was not easy but I > made it. Libcap is optional and bpftool does not use readline or ncurses. May I ask how you tried to build it? > > Now, how do I use it ? > > Let's say I want to dump the jitted code generated from a call to > 'tcpdump'. How do I do that with 'bpftool prog dump jited' ? > > I thought by calling this line I would then get programs dumped in a way > or another just like when setting 'bpf_jit_enable=2', but calling that > line just provides me some bpftool help text. Well the purpose of this text is to help you find the way to call bpftool to do what you want :). For dumping your programs' instructions, you need to tell bpftool what program to dump: Bpftool isn't waiting until you load a program to dump it; instead you need to load your program first and then tell bpftool to retrieve the instructions from the kernel. To reference your program you could use a pinned path, or first list the programs on your system with "bpftool prog show": # bpftool prog show 138: tracing name foo tag e54c922dfa54f65f gpl loaded_at 2021-02-25T01:32:30+ uid 0 xlated 256B jited 154B memlock 4096B map_ids 64 btf_id 235 Then you can use for example the program id displayed on the first line to reference and dump your program: # bpftool prog dump jited id 138 You should find additional documentation under tools/bpf/bpftool/Documentation. > > By the way, it would be nice to have a kernel OPTION that selects all > OPTIONS required for building bpftool. Because you discover them one by > one at every build failure. I had to add CONFIG_IPV6, CONFIG_DEBUG_BTF, > CONFIG_CGROUPS, ... If there could be an option like "Build a 'bpftool' > ready kernel" that selected all those, it would be great. > > Christophe I do not believe any of these are required to build bpftool. Quentin
Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor
Christophe Leroy writes: > Le 23/04/2021 à 00:58, Daniel Axtens a écrit : >>> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile >>> index c9d2c7825cd6..3a2f2001c62b 100644 >>> --- a/arch/powerpc/Makefile >>> +++ b/arch/powerpc/Makefile >>> @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr) >>> KBUILD_AFLAGS += $(AFLAGS-y) >>> KBUILD_CFLAGS += $(call cc-option,-msoft-float) >>> KBUILD_CFLAGS += -pipe $(CFLAGS-y) >>> -CPP= $(CC) -E $(KBUILD_CFLAGS) >> >> I was trying to check the history to see why powerpc has its own >> definition. It seems to date back at least as far as merging the two >> powerpc platforms into one, maybe it was helpful then. I agree it >> doesn't seem to be needed now. >> > > I dug a bit deeper. It has been there since the introduction of arch/ppc/ > in Linux 1.3.45 > At the time, there was the exact same CPP definition in arch/mips/Makefile > > The CPP definition in mips disappeared in Linux 2.1.44pre3. > Since that version, ppc has been the only one with such a CPP re-definition. > >> Snowpatch claims that this breaks the build, but I haven't been able to >> reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it >> off to a fork of mpe's linux-ci repo to see if any of those builds hit >> any issues: https://github.com/daxtens/linux-ci/actions >> > > By the way, I find snowpatch quite useless these days. It seems to delete the > reports a few minutes > after the test. We are less than one day after the patch was submitted and > the reports of the build > failures are already gone. Yeah, it's pretty annoying. It needs to send emails to the list with its results, with a link to a location that retains the logs for some reasonable period. cheers
Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool
On 2021-04-22 09:15, Claire Chang wrote: If a device is not behind an IOMMU, we look up the device node and set up the restricted DMA when the restricted-dma-pool is presented. Signed-off-by: Claire Chang --- drivers/of/address.c| 25 + drivers/of/device.c | 3 +++ drivers/of/of_private.h | 5 + 3 files changed, 33 insertions(+) diff --git a/drivers/of/address.c b/drivers/of/address.c index 54f221dde267..fff3adfe4986 100644 --- a/drivers/of/address.c +++ b/drivers/of/address.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np) } EXPORT_SYMBOL_GPL(of_dma_is_coherent); +int of_dma_set_restricted_buffer(struct device *dev) +{ + struct device_node *node; + int count, i; + + if (!dev->of_node) + return 0; + + count = of_property_count_elems_of_size(dev->of_node, "memory-region", + sizeof(phandle)); + for (i = 0; i < count; i++) { + node = of_parse_phandle(dev->of_node, "memory-region", i); + /* There might be multiple memory regions, but only one +* restriced-dma-pool region is allowed. +*/ What's the use-case for having multiple regions if the restricted pool is by definition the only one accessible? Robin. 
+ if (of_device_is_compatible(node, "restricted-dma-pool") && + of_device_is_available(node)) + return of_reserved_mem_device_init_by_idx( + dev, dev->of_node, i); + } + + return 0; +} + /** * of_mmio_is_nonposted - Check if device uses non-posted MMIO * @np: device node diff --git a/drivers/of/device.c b/drivers/of/device.c index c5a9473a5fb1..d8d865223e51 100644 --- a/drivers/of/device.c +++ b/drivers/of/device.c @@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct device_node *np, arch_setup_dma_ops(dev, dma_start, size, iommu, coherent); + if (!iommu) + return of_dma_set_restricted_buffer(dev); + return 0; } EXPORT_SYMBOL_GPL(of_dma_configure_id); diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h index d717efbd637d..e9237f5eff48 100644 --- a/drivers/of/of_private.h +++ b/drivers/of/of_private.h @@ -163,12 +163,17 @@ struct bus_dma_region; #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA) int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map); +int of_dma_set_restricted_buffer(struct device *dev); #else static inline int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map) { return -ENODEV; } +static inline int of_dma_get_restricted_buffer(struct device *dev) +{ + return -ENODEV; +} #endif #endif /* _LINUX_OF_PRIVATE_H */
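For readers following along, the binding this code hooks into is a reserved-memory node carrying the "restricted-dma-pool" compatible, referenced from a consumer device via memory-region. A minimal sketch of what such a devicetree fragment could look like (node names, addresses, and sizes here are made up purely for illustration):

```dts
reserved-memory {
	#address-cells = <2>;
	#size-cells = <2>;
	ranges;

	/* Illustrative: base address and pool size are placeholders */
	restricted_dma: restricted-dma@50000000 {
		compatible = "restricted-dma-pool";
		reg = <0x0 0x50000000 0x0 0x400000>; /* 4 MiB pool */
	};
};

/* A hypothetical device whose streaming DMA must bounce through the pool */
ethernet@10000000 {
	memory-region = <&restricted_dma>;
};
```

With a fragment like this, of_dma_set_restricted_buffer() would find the "restricted-dma-pool" entry among the device's memory-region phandles and initialize the per-device pool from it.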
Re: [PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument
On 2021-04-22 09:15, Claire Chang wrote: Update is_swiotlb_active to add a struct device argument. This will be useful later to allow for restricted DMA pool. Signed-off-by: Claire Chang --- drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +- drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +- drivers/pci/xen-pcifront.c | 2 +- include/linux/swiotlb.h | 4 ++-- kernel/dma/direct.c | 2 +- kernel/dma/swiotlb.c | 4 ++-- 6 files changed, 8 insertions(+), 8 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c index ce6b664b10aa..7d48c433446b 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c @@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj) max_order = MAX_ORDER; #ifdef CONFIG_SWIOTLB - if (is_swiotlb_active()) { + if (is_swiotlb_active(NULL)) { unsigned int max_segment; max_segment = swiotlb_max_segment(); diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c index e8b506a6685b..2a2ae6d6cf6d 100644 --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm) } #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86) - need_swiotlb = is_swiotlb_active(); + need_swiotlb = is_swiotlb_active(NULL); #endif ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev, diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c index b7a8f3a1921f..6d548ce53ce7 100644 --- a/drivers/pci/xen-pcifront.c +++ b/drivers/pci/xen-pcifront.c @@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev) spin_unlock(&pcifront_dev_lock); - if (!err && !is_swiotlb_active()) { + if (!err && !is_swiotlb_active(NULL)) { err = pci_xen_swiotlb_init_late(); if (err) dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n"); diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 
2a6cca07540b..c530c976d18b 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr) void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); size_t swiotlb_max_mapping_size(struct device *dev); -bool is_swiotlb_active(void); +bool is_swiotlb_active(struct device *dev); void __init swiotlb_adjust_size(unsigned long size); #else #define swiotlb_force SWIOTLB_NO_FORCE @@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device *dev) return SIZE_MAX; } -static inline bool is_swiotlb_active(void) +static inline bool is_swiotlb_active(struct device *dev) { return false; } diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c index 84c9feb5474a..7a88c34d0867 100644 --- a/kernel/dma/direct.c +++ b/kernel/dma/direct.c @@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask) size_t dma_direct_max_mapping_size(struct device *dev) { /* If SWIOTLB is active, use its maximum mapping size */ - if (is_swiotlb_active() && + if (is_swiotlb_active(dev) && (dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE)) I wonder if it's worth trying to fold these other conditions into is_swiotlb_active() itself? I'm not entirely sure what matters for Xen, but for the other cases it seems like they probably only care about whether bouncing may occur for their particular device or not (possibly they want to be using dma_max_mapping_size() now anyway - TBH I'm struggling to make sense of what the swiotlb_max_segment business is supposed to mean). Otherwise, patch #9 will need to touch here as well to make sure that per-device forced bouncing is reflected correctly. Robin. 
return swiotlb_max_mapping_size(dev); return SIZE_MAX; diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index ffbb8724e06c..1d221343f1c8 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev) return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE; } -bool is_swiotlb_active(void) +bool is_swiotlb_active(struct device *dev) { - return io_tlb_default_mem != NULL; + return get_io_tlb_mem(dev) != NULL; } EXPORT_SYMBOL_GPL(is_swiotlb_active);
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
"Naveen N. Rao" writes: > Michael Ellerman wrote: >> "Naveen N. Rao" writes: >>> Daniel Axtens wrote: Sathvika Vasireddy writes: > This adds emulation support for the following instruction: >* Set Boolean (setb) > > Signed-off-by: Sathvika Vasireddy >> ... If you do end up respinning the patch, I think it would be good to make the maths a bit clearer. I think it works because a left shift of 2 is the same as multiplying by 4, but it would be easier to follow if you used a temporary variable for btf. >>> >>> Indeed. I wonder if it is better to follow the ISA itself. Per the ISA, >>> the bit we are interested in is: >>> 4 x BFA + 32 >>> >>> So, if we use that along with the PPC_BIT() macro, we get: >>> if (regs->ccr & PPC_BIT(ra + 32)) >> >> Use of PPC_BIT risks annoying your maintainer :) > > Uh oh... that isn't good :) > > I looked up previous discussions and I think I now understand why you > don't prefer it. Hah, I'd forgotten I'd written (ranted :D) about this in the past. > But, I feel it helps make it easy to follow the code when referring to > the ISA. That's true. But I think that's much much less common than people reading the code in isolation. And ultimately it doesn't matter if the code (appears to) match the ISA, it matters that the code works. My worry is that too much use of those type of macros obscures what's actually happening. > I'm wondering if it is just the name you dislike and if so, > does it make sense to rename PPC_BIT() to something else? We have > BIT_ULL(), so perhaps BIT_MSB_ULL() or MSB_BIT_ULL()? The name is part of it. But I don't really like BIT_ULL() either, it hides in a macro something that could just be there in front of you ie. (1ull << x). For this case of setb, I think I'd go with something like below. It doesn't exactly match the ISA, but I think there's minimal obfuscation of what's actually going on. 
// ra is now bfa
ra = (ra >> 2);

// Extract 4-bit CR field
val = regs->ccr >> (CR0_SHIFT - 4 * ra);

if (val & 8)
	op->val = -1;
else if (val & 4)
	op->val = 1;
else
	op->val = 0;

If anything could use a macro it would be the 8 and 4, eg. CR_LT, CR_GT. Of course that's probably got a bug in it, because I just wrote it by eye and it's 11:28 pm :) cheers
Re: [PATCH v5 05/16] swiotlb: Add restricted DMA pool initialization
On 22/04/2021 09:14, Claire Chang wrote: Add the initialization function to create restricted DMA pools from matching reserved-memory nodes. Signed-off-by: Claire Chang --- include/linux/device.h | 4 +++ include/linux/swiotlb.h | 3 +- kernel/dma/swiotlb.c| 80 + 3 files changed, 86 insertions(+), 1 deletion(-) diff --git a/include/linux/device.h b/include/linux/device.h index 38a2071cf776..4987608ea4ff 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -416,6 +416,7 @@ struct dev_links_info { * @dma_pools:Dma pools (if dma'ble device). * @dma_mem: Internal for coherent mem override. * @cma_area: Contiguous memory area for dma allocations + * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override. * @archdata: For arch-specific additions. * @of_node: Associated device tree node. * @fwnode: Associated device node supplied by platform firmware. @@ -521,6 +522,9 @@ struct device { #ifdef CONFIG_DMA_CMA struct cma *cma_area; /* contiguous memory area for dma allocations */ +#endif +#ifdef CONFIG_DMA_RESTRICTED_POOL + struct io_tlb_mem *dma_io_tlb_mem; #endif /* arch specific additions */ struct dev_archdata archdata; diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 216854a5e513..03ad6e3b4056 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -72,7 +72,8 @@ extern enum swiotlb_force swiotlb_force; *range check to see if the memory was in fact allocated by this *API. * @nslabs: The number of IO TLB blocks (in groups of 64) between @start and - * @end. This is command line adjustable via setup_io_tlb_npages. + * @end. For default swiotlb, this is command line adjustable via + * setup_io_tlb_npages. * @used: The number of used IO TLB block. * @list: The free list describing the number of free entries available *from each index. 
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 57a9adb920bf..ffbb8724e06c 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -39,6 +39,13 @@ #ifdef CONFIG_DEBUG_FS #include #endif +#ifdef CONFIG_DMA_RESTRICTED_POOL +#include +#include +#include +#include +#include +#endif #include #include @@ -681,3 +688,76 @@ static int __init swiotlb_create_default_debugfs(void) late_initcall(swiotlb_create_default_debugfs); #endif + +#ifdef CONFIG_DMA_RESTRICTED_POOL +static int rmem_swiotlb_device_init(struct reserved_mem *rmem, + struct device *dev) +{ + struct io_tlb_mem *mem = rmem->priv; + unsigned long nslabs = rmem->size >> IO_TLB_SHIFT; + + if (dev->dma_io_tlb_mem) + return 0; + + /* Since multiple devices can share the same pool, the private data, +* io_tlb_mem struct, will be initialized by the first device attached +* to it. +*/ + if (!mem) { + mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL); + if (!mem) + return -ENOMEM; +#ifdef CONFIG_ARM + if (!PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) { + kfree(mem); + return -EINVAL; + } +#endif /* CONFIG_ARM */ + swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false); + + rmem->priv = mem; + } + +#ifdef CONFIG_DEBUG_FS + if (!io_tlb_default_mem->debugfs) + io_tlb_default_mem->debugfs = + debugfs_create_dir("swiotlb", NULL); At this point it's possible for io_tlb_default_mem to be NULL, leading to a splat. But even then if it's not and we have the situation where debugfs==NULL then the debugfs_create_dir() here will cause a subsequent attempt in swiotlb_create_debugfs() to fail (directory already exists) leading to mem->debugfs being assigned an error value. I suspect the creation of the debugfs directory needs to be separated from io_tlb_default_mem being set. 
Other than that I gave this series a go with our prototype of Arm's Confidential Computer Architecture[1] - since the majority of the guest's memory is protected from the host the restricted DMA pool allows (only) a small area to be shared with the host. After fixing (well hacking round) the above it all seems to be working fine with virtio drivers. Thanks, Steve [1] https://www.arm.com/why-arm/architecture/security-features/arm-confidential-compute-architecture
Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction
On Thu, Apr 22, 2021 at 06:26:16PM -0500, Segher Boessenkool wrote: > Hi! > > On Fri, Apr 23, 2021 at 12:16:18AM +0200, Gabriel Paubert wrote: > > On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote: > > > On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote: > > > > Sathvika Vasireddy writes: > > > > > + if ((regs->ccr) & (1 << (31 - ra))) > > > > > + op->val = -1; > > > > > + else if ((regs->ccr) & (1 << (30 - ra))) > > > > > + op->val = 1; > > > > > + else > > > > > + op->val = 0; > > > > > > It imo is clearer if written > > > > > > if ((regs->ccr << ra) & 0x80000000) > > > op->val = -1; > > > else if ((regs->ccr << ra) & 0x40000000) > > > op->val = 1; > > > else > > > op->val = 0; > > > > > > but I guess not everyone agrees :-) > > > > But this can be made jump free :-): > > > > int tmp = regs->ccr << ra; > > op->val = (tmp >> 31) | ((tmp >> 30) & 1); > > The compiler will do so automatically (or think of some better way to > get the same result); in source code, what matters most is readability, > or clarity in general (also clarity to the compiler). I just did a test (trivial code attached) and the original code always produces one conditional branch at -O2, at least with the cross-compiler I have on Debian (gcc 8.3). I have tested both -m32 and -m64. The 64 bit version produces an unnecessary "extsw", so I wrote the second version splitting the setting of the return value which gets rid of it. The second "if" is fairly simple to optimize and the compiler does it properly. Of course with my suggestion the compiler does not produce any branch. But it needs a really good comment. > > (Right shifts of negative numbers are implementation-defined in C, > fwiw -- but work like you expect in GCC). Well, I'm not worried about it, since I'd expect a compiler that does logical right shifts on signed values to break so much code that it would be easily noticed (also in the kernel). > > > (IIRC the srawi instruction sign-extends its result to 64 bits). 
> > If you consider it to work on 32-bit inputs, yeah, that is a simple way > to express it. > > > > > CR field: 76543210 > > > > bit: 0123 0123 0123 0123 0123 0123 0123 0123 > > > > normal bit #: 0.31 > > > > ibm bit #: 31.0 > > > > > > The bit numbers in CR fields are *always* numbered left-to-right. I > > > have never seen anyone use LE for it, anyway. > > > > > > Also, even people who write LE have the bigger end on the left normally > > > (they just write some things right-to-left, and other things > > > left-to-right). > > > > Around 1985, I had the documentation for National's 32032 > > (little-endian) processor family, and all the instruction encodings were > > presented with the LSB on the left and MSB on the right. > > Ouch! Did they write "regular" numbers with the least significant digit > on the left as well? No, they were not that sadistic! At least instructions were a whole number of bytes, unlike the iAPX432 where jumps needed to encode target addresses down to the bit level. > > > BTW on these processors, the immediate operands and the offsets (1, 2 or > > 4 bytes) for the addressing modes were encoded in big-endian byte order, > > but I digress. Consistency is overrated ;-) > > Inconsistency is the spice of life, yeah :-) ;-) Gabriel

long setb_cond(int cr, int shift)
{
	long ret;

	if ((cr << shift) & 0x80000000)
		ret = -1;
	else if ((cr << shift) & 0x40000000)
		ret = 1;
	else
		ret = 0;
	return ret;
}

long setb_uncond(int cr, int shift)
{
	int tmp = cr << shift;
	long ret;

	ret = (tmp >> 31) | ((tmp >> 30) & 1);
	return ret;
}

long setb_uncond2(int cr, int shift)
{
	int tmp = cr << shift;
	long ret;

	ret = (tmp >> 31);
	ret |= ((tmp >> 30) & 1);
	return ret;
}
Re: [PATCH v3 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()
On 22/04/2021 17:07, Leonardo Bras wrote: Code used to create a ddw property that was previously scattered in enable_ddw() is now gathered in ddw_property_create(), which deals with allocation and filling the property, leaving it ready for of_add_property(), which now occurs in sequence. This created an opportunity to reorganize the second part of enable_ddw(): Without this patch enable_ddw() does, in order: kzalloc() property & members, create_ddw(), fill ddwprop inside property, ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory, of_add_property(), and list_add(). With this patch enable_ddw() does, in order: create_ddw(), ddw_property_create(), of_add_property(), ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory, and list_add(). This change requires of_remove_property() in case anything fails after of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk in all memory, which looks like the most expensive operation, only if everything else succeeds. 
Signed-off-by: Leonardo Bras --- arch/powerpc/platforms/pseries/iommu.c | 93 -- 1 file changed, 57 insertions(+), 36 deletions(-) diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c index 955cf095416c..48c029386d94 100644 --- a/arch/powerpc/platforms/pseries/iommu.c +++ b/arch/powerpc/platforms/pseries/iommu.c @@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn) ret); } +static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr, + u32 page_shift, u32 window_shift) +{ + struct dynamic_dma_window_prop *ddwprop; + struct property *win64; + + win64 = kzalloc(sizeof(*win64), GFP_KERNEL); + if (!win64) + return NULL; + + win64->name = kstrdup(propname, GFP_KERNEL); + ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL); + win64->value = ddwprop; + win64->length = sizeof(*ddwprop); + if (!win64->name || !win64->value) { + kfree(win64); + kfree(win64->name); + kfree(win64->value); Wrong order. + return NULL; + } + + ddwprop->liobn = cpu_to_be32(liobn); + ddwprop->dma_base = cpu_to_be64(dma_addr); + ddwprop->tce_shift = cpu_to_be32(page_shift); + ddwprop->window_shift = cpu_to_be32(window_shift); + + return win64; +} + /* Return largest page shift based on "IO Page Sizes" output of ibm,query-pe-dma-window. 
*/ static int iommu_get_page_shift(u32 query_page_size) { @@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) struct ddw_query_response query; struct ddw_create_response create; int page_shift; + u64 win_addr; struct device_node *dn; u32 ddw_avail[DDW_APPLICABLE_SIZE]; struct direct_window *window; struct property *win64 = NULL; - struct dynamic_dma_window_prop *ddwprop; struct failed_ddw_pdn *fpdn; bool default_win_removed = false; bool pmem_present; @@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn) 1ULL << page_shift); goto out_failed; } - win64 = kzalloc(sizeof(struct property), GFP_KERNEL); - if (!win64) { - dev_info(>dev, - "couldn't allocate property for 64bit dma window\n"); - goto out_failed; - } - win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL); - win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL); - win64->length = sizeof(*ddwprop); - if (!win64->name || !win64->value) { - dev_info(>dev, - "couldn't allocate property name and value\n"); - goto out_free_prop; - } ret = create_ddw(dev, ddw_avail, , page_shift, len); if (ret != 0) - goto out_free_prop; - - ddwprop->liobn = cpu_to_be32(create.liobn); - ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) | - create.addr_lo); - ddwprop->tce_shift = cpu_to_be32(page_shift); - ddwprop->window_shift = cpu_to_be32(len); + goto out_failed; dev_dbg(>dev, "created tce table LIOBN 0x%x for %pOF\n", create.liobn, dn); - window = ddw_list_new_entry(pdn, ddwprop); + win_addr = ((u64)create.addr_hi << 32) | create.addr_lo; + win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr, + page_shift, len); + if (!win64) { + dev_info(>dev, +"couldn't allocate property, property name, or value\n"); + goto out_del_win; + } + + ret =
Re: [PATCH 1/2] audit: add support for the openat2 syscall
On Thu, Apr 22, 2021 at 10:34:08PM -0400, Richard Guy Briggs wrote: > On 2021-03-18 08:08, Richard Guy Briggs wrote: > > On 2021-03-18 11:48, Christian Brauner wrote: > > > [+Cc Aleksa, the author of openat2()] > > > > Ah! Thanks for pulling in Aleksa. I thought I caught everyone... > > > > > and a comment below. :) > > > > Same... > > > > > On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote: > > > > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9 > > > > ("open: introduce openat2(2) syscall") > > > > > > > > Add the openat2(2) syscall to the audit syscall classifier. > > > > > > > > See the github issue > > > > https://github.com/linux-audit/audit-kernel/issues/67 > > > > > > > > Signed-off-by: Richard Guy Briggs > > > > --- > > > > arch/alpha/kernel/audit.c | 2 ++ > > > > arch/ia64/kernel/audit.c | 2 ++ > > > > arch/parisc/kernel/audit.c | 2 ++ > > > > arch/parisc/kernel/compat_audit.c | 2 ++ > > > > arch/powerpc/kernel/audit.c| 2 ++ > > > > arch/powerpc/kernel/compat_audit.c | 2 ++ > > > > arch/s390/kernel/audit.c | 2 ++ > > > > arch/s390/kernel/compat_audit.c| 2 ++ > > > > arch/sparc/kernel/audit.c | 2 ++ > > > > arch/sparc/kernel/compat_audit.c | 2 ++ > > > > arch/x86/ia32/audit.c | 2 ++ > > > > arch/x86/kernel/audit_64.c | 2 ++ > > > > kernel/auditsc.c | 3 +++ > > > > lib/audit.c| 4 > > > > lib/compat_audit.c | 4 > > > > 15 files changed, 35 insertions(+) > > > > > > > > diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c > > > > index 96a9d18ff4c4..06a911b685d1 100644 > > > > --- a/arch/alpha/kernel/audit.c > > > > +++ b/arch/alpha/kernel/audit.c > > > > @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > > return 3; > > > > case __NR_execve: > > > > return 5; > > > > + case __NR_openat2: > > > > + return 6; > > > > default: > > > > return 0; > > > > } > > > > diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c > > > > index 5192ca899fe6..5eaa888c8fd3 100644 > > > 
> --- a/arch/ia64/kernel/audit.c > > > > +++ b/arch/ia64/kernel/audit.c > > > > @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > > return 3; > > > > case __NR_execve: > > > > return 5; > > > > + case __NR_openat2: > > > > + return 6; > > > > default: > > > > return 0; > > > > } > > > > diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c > > > > index 9eb47b2225d2..fc721a7727ba 100644 > > > > --- a/arch/parisc/kernel/audit.c > > > > +++ b/arch/parisc/kernel/audit.c > > > > @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > > return 3; > > > > case __NR_execve: > > > > return 5; > > > > + case __NR_openat2: > > > > + return 6; > > > > default: > > > > return 0; > > > > } > > > > diff --git a/arch/parisc/kernel/compat_audit.c > > > > b/arch/parisc/kernel/compat_audit.c > > > > index 20c39c9d86a9..fc6d35918c44 100644 > > > > --- a/arch/parisc/kernel/compat_audit.c > > > > +++ b/arch/parisc/kernel/compat_audit.c > > > > @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall) > > > > return 3; > > > > case __NR_execve: > > > > return 5; > > > > + case __NR_openat2: > > > > + return 6; > > > > default: > > > > return 1; > > > > } > > > > diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c > > > > index a27f3d09..8f32700b0baa 100644 > > > > --- a/arch/powerpc/kernel/audit.c > > > > +++ b/arch/powerpc/kernel/audit.c > > > > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall) > > > > return 4; > > > > case __NR_execve: > > > > return 5; > > > > + case __NR_openat2: > > > > + return 6; > > > > default: > > > > return 0; > > > > } > > > > diff --git a/arch/powerpc/kernel/compat_audit.c > > > > b/arch/powerpc/kernel/compat_audit.c > > > > index 55c6ccda0a85..ebe45534b1c9 100644 > > > > --- a/arch/powerpc/kernel/compat_audit.c > > > > +++ b/arch/powerpc/kernel/compat_audit.c > > > > @@ -38,6 +38,8 @@ int ppc32_classify_syscall(unsigned syscall) > > > > return 
4; > > > > case __NR_execve: > > > > return 5; > > > > + case __NR_openat2: > > > > + return 6; > > > > default: > > > > return 1; > > > > } > > > > diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c > > > > index d395c6c9944c..d964cb94cfaf 100644 >
Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker
Le 23/04/2021 à 00:44, Nick Desaulniers a écrit :
> On Wed, Sep 2, 2020 at 11:02 AM Christophe Leroy wrote:
>> Le 02/09/2020 à 19:41, Nick Desaulniers a écrit :
>>> On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman wrote:
>>>> Nick Desaulniers writes:
>>>>> Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan sections")
>>>> I think I'll just revert that for v5.9?
>>> SGTM; you'll probably still want these changes with some modifications at some point; vdso32 did have at least one orphaned section, and will be important for hermetic builds. Seeing crashes in supported versions of the tools ties our hands at the moment.
>> Keeping the tool problem aside with binutils 2.26, do you have a way to really link an elf32ppc object when building vdso32 for PPC64?
>
> Sorry, I'm doing a bug scrub and found https://github.com/ClangBuiltLinux/linux/issues/774 still open (and my reply to this thread still in Drafts; never sent). With my patches rebased:
>
> $ file arch/powerpc/kernel/vdso32/vdso32.so
> arch/powerpc/kernel/vdso32/vdso32.so: ELF 32-bit MSB shared object, PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, stripped
>
> Are you still using 2.26?

Yes, our production kernels and applications are built with gcc 5.5 and binutils 2.26.

> I'm not able to repro Nathan's reported issue from https://lore.kernel.org/lkml/20200902052123.GA2687902@ubuntu-n2-xlarge-x86/, so I'm curious if I should resend the rebased patches as v2?

I can't remember what all this discussion was about. I gave your rebased patches a try. Still an issue with binutils 2.26:

  VDSO32L arch/powerpc/kernel/vdso32/vdso32.so.dbg
ppc-linux-ld: warning: orphan section `.rela.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'.
ppc-linux-ld: warning: orphan section `.rela.plt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'.
ppc-linux-ld: warning: orphan section `.glink' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.glink'.
ppc-linux-ld: warning: orphan section `.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.iplt'.
ppc-linux-ld: warning: orphan section `.rela.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'.
ppc-linux-ld: warning: orphan section `.rela.text' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'.
/bin/sh: line 1:  7850 Segmentation fault (core dumped) ppc-linux-ld -EB -m elf32ppc -shared -soname linux-vdso32.so.1 --eh-frame-hdr --orphan-handling=warn -T arch/powerpc/kernel/vdso32/vdso32.lds arch/powerpc/kernel/vdso32/sigtramp.o arch/powerpc/kernel/vdso32/gettimeofday.o arch/powerpc/kernel/vdso32/datapage.o arch/powerpc/kernel/vdso32/cacheflush.o arch/powerpc/kernel/vdso32/note.o arch/powerpc/kernel/vdso32/getcpu.o arch/powerpc/kernel/vdso32/vgettimeofday.o -o arch/powerpc/kernel/vdso32/vdso32.so.dbg
make[2]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139
make[2]: *** Deleting file `arch/powerpc/kernel/vdso32/vdso32.so.dbg'

With gcc 10.1 and binutils 2.34 I get:

PPC32 build:
  VDSO32L arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: warning: orphan section `.rela.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.plt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.glink' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.glink'
powerpc64-linux-ld: warning: orphan section `.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.iplt'
powerpc64-linux-ld: warning: orphan section `.rela.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.branch_lt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.text' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'

PPC64 build:
  VDSO32L arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: warning: orphan section `.rela.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.plt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.glink' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.glink'
powerpc64-linux-ld: warning: orphan section `.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.iplt'
powerpc64-linux-ld: warning: orphan section `.rela.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.branch_lt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards
On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy"
>
> Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform.
>
> On some of the POWER9 LPARs, the older firmwares advertise a very low
> value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the

Can you be more specific about 'older firmwares'? Also, while this is a performance regression on such firmwares, it should be fixed by updating the firmware to the current version. Having sub-optimal performance on obsolete firmware should not require a kernel workaround, should it? It's not like the kernel would crash on the affected firmware.

Thanks

Michal
Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode
Le 20/04/2021 à 05:28, Alexei Starovoitov a écrit :
> On Sat, Apr 17, 2021 at 1:16 AM Christophe Leroy wrote:
>> Le 16/04/2021 à 01:49, Alexei Starovoitov a écrit :
>>> On Thu, Apr 15, 2021 at 8:41 AM Quentin Monnet wrote:
>>>> 2021-04-15 16:37 UTC+0200 ~ Daniel Borkmann
>>>>> On 4/15/21 11:32 AM, Jianlin Lv wrote:
>>>>>> For debugging JITs, dumping the JITed image to the kernel log is discouraged; "bpftool prog dump jited" is a much better way to examine JITed dumps. This patch gets rid of the code related to the bpf_jit_enable=2 mode, updates the proc handler of bpf_jit_enable, and adds auxiliary information to explain how to use the bpf_jit_disasm tool after this change.
>>>>>> Signed-off-by: Jianlin Lv
>>>> Hello, for what it's worth, I have already seen people dump the JIT image in kernel logs in Qemu VMs running with just a busybox, not for kernel development, but in a context where building/using bpftool was not possible.
>>> If building/using bpftool is not possible then the majority of selftests won't be exercised. I don't think such an environment is suitable for any kind of BPF development, even more so for JIT debugging, while bpf_jit_enable=2 is nothing but a debugging tool for JIT developers. I'd rather nuke that code instead of carrying it from kernel to kernel.
>> When I implemented the JIT for PPC32, it was extremely helpful. As far as I understand, for the time being bpftool is not usable in my environment because it doesn't support cross compilation when the target's endianness differs from the build host's endianness; see the discussion at https://lore.kernel.org/bpf/21e66a09-514f-f426-b9e2-13baab0b9...@csgroup.eu/. That's right that the selftests can't be exercised, because they don't build. The question might be candid, as I didn't investigate much about replacing the "bpf_jit_enable=2 debugging mode" with bpftool: how do we use bpftool exactly for that, especially when using the BPF test module?
> Kernel developers can add any amount of printk and dumps to debug their code, but such a debugging aid should not be part of the production kernel. That sysctl was two things at once: a debugging tool for kernel devs and introspection for users. The bpftool JIT dump solves the second part: it provides JIT introspection to users. Debugging of the kernel can be done with any amount of auxiliary code, including calling print_hex_dump() during JITing.

I finally managed to cross compile bpftool with libbpf, libopcodes, readline, ncurses, libcap, libz and all the needed stuff. It was not easy, but I made it.

Now, how do I use it? Let's say I want to dump the JITed code generated from a call to 'tcpdump'. How do I do that with 'bpftool prog dump jited'? I thought by calling this line I would get programs dumped one way or another, just like when setting 'bpf_jit_enable=2', but calling that line just gives me some bpftool help text.

By the way, it would be nice to have a kernel OPTION that selects all OPTIONS required for building bpftool, because you discover them one by one at every build failure. I had to add CONFIG_IPV6, CONFIG_DEBUG_BTF, CONFIG_CGROUPS, ... If there were an option like "Build a 'bpftool'-ready kernel" that selected all those, it would be great.

Christophe
Re: [PATCH v3 01/11] powerpc/pseries/iommu: Replace hard-coded page shift
On 22/04/2021 17:07, Leonardo Bras wrote:
> Some functions assume the IOMMU page size can only be 4K (pageshift == 12).
> Update them to accept any page size passed, so we can use 64K pages. In the
> process, some defines like TCE_SHIFT were made obsolete, and then removed.
>
> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5 show an
> RPN of 52 bits and consider a 12-bit pageshift, so there should be no need
> of using TCE_RPN_MASK, which masks out any bit after 40 in rpn. Its usage
> was removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
> tce_buildmulti_pSeriesLP().

After rereading the patch, I wonder why we had this TCE_RPN_MASK at all, but what is certain is that this has nothing to do with IODA3, as these TCEs are guest physical addresses in pseries and IODA3 is bare metal. Except...

> Most places had a tbl struct, so using tbl->it_page_shift was simple.
> tce_free_pSeriesLP() was a special case, since callers do not always have a
> tbl struct, so adding a tceshift parameter seems the right thing to do.
> Signed-off-by: Leonardo Bras

Reviewed-by: Alexey Kardashevskiy

> ---
>  arch/powerpc/include/asm/tce.h         |  8 --
>  arch/powerpc/platforms/pseries/iommu.c | 39 +++---
>  2 files changed, 23 insertions(+), 24 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index db5fc2f2262d..0c34d2756d92 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -19,15 +19,7 @@
>  #define TCE_VB		0
>  #define TCE_PCI		1
>
> -/* TCE page size is 4096 bytes (1 << 12) */
> -
> -#define TCE_SHIFT	12
> -#define TCE_PAGE_SIZE	(1 << TCE_SHIFT)
> -
>  #define TCE_ENTRY_SIZE	8	/* each TCE is 64 bits */
> -
> -#define TCE_RPN_MASK	0xfffffffffful	/* 40-bit RPN (4K pages) */
> -#define TCE_RPN_SHIFT	12
>  #define TCE_VALID	0x800	/* TCE valid */
>  #define TCE_ALLIO	0x400	/* TCE valid for all lpars */
>  #define TCE_PCI_WRITE	0x2	/* write from PCI allowed */
>
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 67c9953a6503..796ab356341c 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,
>  	u64 proto_tce;
>  	__be64 *tcep;
>  	u64 rpn;
> +	const unsigned long tceshift = tbl->it_page_shift;
> +	const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);

(nit: only used once)

>  	proto_tce = TCE_PCI_READ; // Read allowed
>
> @@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,

... this is pseries which is not pseriesLP, i.e. no LPAR == bare metal pseries such as ancient power5 or cellbe (I guess), and for those TCE_RPN_MASK may actually make sense; keep it. The rest of the patch looks good.

Thanks,

>  	while (npages--) {
>  		/* can't move this out since we might cross MEMBLOCK boundary */
> -		rpn = __pa(uaddr) >> TCE_SHIFT;
> -		*tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
> +		rpn = __pa(uaddr) >> tceshift;
> +		*tcep = cpu_to_be64(proto_tce | rpn << tceshift);
>
> -		uaddr += TCE_PAGE_SIZE;
> +		uaddr += pagesize;
>  		tcep++;
>  	}
>  	return 0;
>
> @@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table *tbl, long index)
>  	return be64_to_cpu(*tcep);
>  }
>
> -static void tce_free_pSeriesLP(unsigned long liobn, long, long);
> +static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
>  static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
>
>  static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
> @@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,
>  		proto_tce |= TCE_PCI_WRITE;
>
>  	while (npages--) {
> -		tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
> +		tce = proto_tce | rpn << tceshift;
>  		rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
>
>  		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
>  			ret = (int)rc;
> -			tce_free_pSeriesLP(liobn, tcenum_start,
> +			tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
>  				(npages_start - (npages + 1)));
>  			break;
>  		}
>
> @@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  	long tcenum_start = tcenum, npages_start = npages;
>  	int ret = 0;
>  	unsigned long flags;
> +	const unsigned long tceshift = tbl->it_page_shift;
>
>  	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {
>  		return
Re: [PATCH v5 01/16] swiotlb: Fix the type of index
On Thu, Apr 22, 2021 at 04:14:53PM +0800, Claire Chang wrote:
> Fix the type of index from unsigned int to int since find_slots() might
> return -1.
>
> Fixes: 0774983bc923 ("swiotlb: refactor swiotlb_tbl_map_single")
> Signed-off-by: Claire Chang

Looks good:

Reviewed-by: Christoph Hellwig

It really should go into 5.12. I'm not sure if Konrad is going to be able to queue this up due to his vacation, so I'm tempted to just queue it up in the dma-mapping tree.