Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Michal Suchánek
On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek  [2021-04-23 19:45:05]:
> 
> > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek  [2021-04-23 09:35:51]:
> > > 
> > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > From: "Gautham R. Shenoy" 
> > > > > 
> > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > of the Extended CEDE states advertised by the platform
> > > > > 
> > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > Can you be more specific about 'older firmwares'?
> > > 
> > > Hi Michal,
> > > 
> > > This is a POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > > key idea behind the original patch was to make the H_CEDE latency and
> > > hence the target residency come from firmware instead of being decided
> > > by the kernel.  The advantage is that different types of systems in the
> > > POWER10 generation can adjust this value and have optimal H_CEDE entry
> > > criteria which balance good single-thread performance and wakeup
> > > latency.  Further, we can have additional H_CEDE states to feed into
> > > cpuidle.
> > 
> > So all POWER9 machines are affected by the firmware bug where firmware
> > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > causes the kernel to prefer CEDE1 too much when relying on the values
> > supplied by the firmware. It is not about 'older firmware'.
> 
> Correct.  All POWER9 systems running Linux as guest LPARs will see
> extra usage of CEDE idle state, but not baremetal (PowerNV).
> 
> The correct definition of the bug, or mismatch in expectation, is that
> firmware reports the wakeup latency from core/thread wakeup timing, not
> the end-to-end time from sending a wakeup event (such as an IPI via
> H_calls) to receiving the event on the target.  In practice, a few
> extra microseconds are needed between deciding to wake up a target
> core/thread and the target starting to execute instructions within the
> LPAR instance.

Thanks for the detailed explanation.

Maybe just adding a few microseconds to the reported time would be a
more reasonable workaround than using a blanket fixed value then.
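
Something along these lines, perhaps - purely an illustrative sketch, the
constant and helper name below are made up rather than existing
cpuidle-pseries code:

	/* Hypothetical: keep the firmware-reported CEDE exit latency but add
	 * a small fixed allowance for the IPI-to-execution gap described
	 * above, instead of overriding it with a blanket kernel value.
	 */
	#define CEDE_WAKEUP_OVERHEAD_US	3

	static u64 adjusted_cede_latency_us(u64 fw_latency_us)
	{
		return fw_latency_us + CEDE_WAKEUP_OVERHEAD_US;
	}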

> 
> > I still think it would be preferable to adjust the latency value
> > reported by the firmware to match reality over a kernel workaround.
> 
> Right, practically we can fix this for future releases, and as such we
> targeted this scheme from POWER10 onwards, but expected no harm on
> POWER9, which proved to be wrong.
> 
> We can possibly change this FW value for POWER9, but it is too
> expensive and not practical because many release streams exist for
> different platforms, and furthermore customers are on different streams
> as well.  We cannot force all of them to update because that blows up
> the co-dependency matrix.

From the user point of view only a few firmware release streams exist,
but what is packaged in such binaries might be another story.

Thanks

Michal


Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Vaidyanathan Srinivasan
* Michal Suchánek  [2021-04-23 19:45:05]:

> On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > * Michal Suchánek  [2021-04-23 09:35:51]:
> > 
> > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > From: "Gautham R. Shenoy" 
> > > > 
> > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > of the Extended CEDE states advertised by the platform
> > > > 
> > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > Can you be more specific about 'older firmwares'?
> > 
> > Hi Michal,
> > 
> > This is a POWER9 vs POWER10 difference, not really an obsolete FW.  The
> > key idea behind the original patch was to make the H_CEDE latency and
> > hence the target residency come from firmware instead of being decided
> > by the kernel.  The advantage is that different types of systems in the
> > POWER10 generation can adjust this value and have optimal H_CEDE entry
> > criteria which balance good single-thread performance and wakeup
> > latency.  Further, we can have additional H_CEDE states to feed into
> > cpuidle.
> 
> So all POWER9 machines are affected by the firmware bug where firmware
> reports CEDE1 exit latency of 2us and the real latency is 5us which
> causes the kernel to prefer CEDE1 too much when relying on the values
> supplied by the firmware. It is not about 'older firmware'.

Correct.  All POWER9 systems running Linux as guest LPARs will see
extra usage of CEDE idle state, but not baremetal (PowerNV).

The correct definition of the bug, or mismatch in expectation, is that
firmware reports the wakeup latency from core/thread wakeup timing, not
the end-to-end time from sending a wakeup event (such as an IPI via
H_calls) to receiving the event on the target.  In practice, a few
extra microseconds are needed between deciding to wake up a target
core/thread and the target starting to execute instructions within the
LPAR instance.

> I still think it would be preferable to adjust the latency value
> reported by the firmware to match reality over a kernel workaround.

Right, practically we can fix this for future releases, and as such we
targeted this scheme from POWER10 onwards, but expected no harm on
POWER9, which proved to be wrong.

We can possibly change this FW value for POWER9, but it is too
expensive and not practical because many release streams exist for
different platforms, and furthermore customers are on different streams
as well.  We cannot force all of them to update because that blows up
the co-dependency matrix.

From a user/customer's view the current kernel worked fine, so why is
a kernel update changing my performance :(

Looking back, we should have boxed any kernel-firmware dependent
feature like this one to future releases only.  We have all options
open for a future release like POWER10, but on a released product
stream, we have to manage with kernel changes.  In this specific case
we should have been very conservative and not allowed the kernel to
change behaviour on released products.

Thanks,
Vaidy



Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Michal Suchánek
On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek  [2021-04-23 09:35:51]:
> 
> > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > From: "Gautham R. Shenoy" 
> > > 
> > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > of the Extended CEDE states advertised by the platform
> > > 
> > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > Can you be more specific about 'older firmwares'?
> 
> Hi Michal,
> 
> This is a POWER9 vs POWER10 difference, not really an obsolete FW.  The
> key idea behind the original patch was to make the H_CEDE latency and
> hence the target residency come from firmware instead of being decided
> by the kernel.  The advantage is that different types of systems in the
> POWER10 generation can adjust this value and have optimal H_CEDE entry
> criteria which balance good single-thread performance and wakeup
> latency.  Further, we can have additional H_CEDE states to feed into
> cpuidle.

So all POWER9 machines are affected by the firmware bug where firmware
reports CEDE1 exit latency of 2us and the real latency is 5us which
causes the kernel to prefer CEDE1 too much when relying on the values
supplied by the firmware. It is not about 'older firmware'.

I still think it would be preferable to adjust the latency value
reported by the firmware to match reality over a kernel workaround.

Thanks

Michal


Re: [PATCH v2 1/2] powerpc: Free fdt on error in elf64_load()

2021-04-23 Thread Lakshmi Ramasubramanian

On 4/21/21 9:36 AM, Lakshmi Ramasubramanian wrote:

Hi Dan,


There are a few "goto out;" statements before the local variable "fdt"
is initialized through the call to of_kexec_alloc_and_setup_fdt() in
elf64_load().  This will result in an uninitialized "fdt" being passed
to kvfree() in this function if there is an error before the call to
of_kexec_alloc_and_setup_fdt().

If there is any error after fdt is allocated, but before it is
saved in the arch specific kimage struct, free the fdt.

Reported-by: kernel test robot 
Reported-by: Dan Carpenter 
Signed-off-by: Michael Ellerman 
Signed-off-by: Lakshmi Ramasubramanian 
---
  arch/powerpc/kexec/elf_64.c | 16 ++--
  1 file changed, 6 insertions(+), 10 deletions(-)



Please review this patch and Patch 2/2.

thanks,
 -lakshmi


diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index 5a569bb51349..02662e72c53d 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -114,7 +114,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
  initrd_len, cmdline);
if (ret)
-   goto out;
+   goto out_free_fdt;
  
  	fdt_pack(fdt);
  
@@ -125,7 +125,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,

kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
ret = kexec_add_buffer(&kbuf);
if (ret)
-   goto out;
+   goto out_free_fdt;
  
  	/* FDT will be freed in arch_kimage_file_post_load_cleanup */

image->arch.fdt = fdt;
@@ -140,18 +140,14 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
if (ret)
pr_err("Error setting up the purgatory.\n");
  
+	goto out;

+
+out_free_fdt:
+   kvfree(fdt);
  out:
kfree(modified_cmdline);
kexec_free_elf_info(&elf_info);
  
-	/*

-* Once FDT buffer has been successfully passed to kexec_add_buffer(),
-* the FDT buffer address is saved in image->arch.fdt. In that case,
-* the memory cannot be freed here in case of any other error.
-*/
-   if (ret && !image->arch.fdt)
-   kvfree(fdt);
-
return ret ? ERR_PTR(ret) : NULL;
  }
  





Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker

2021-04-23 Thread Christophe Leroy




Le 23/04/2021 à 00:44, Nick Desaulniers a écrit :

On Wed, Sep 2, 2020 at 11:02 AM Christophe Leroy
 wrote:




Le 02/09/2020 à 19:41, Nick Desaulniers a écrit :

On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman  wrote:


Nick Desaulniers  writes:

Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan sections")


I think I'll just revert that for v5.9 ?


SGTM; you'll probably still want these changes with some modifications
at some point; vdso32 did have at least one orphaned section, and will
be important for hermetic builds.  Seeing crashes in supported
versions of the tools ties our hands at the moment.



Keeping the tool problem with binutils 2.26 aside, do you have a way to
really link an elf32ppc object when building vdso32 for PPC64?


Sorry, I'm doing a bug scrub and found
https://github.com/ClangBuiltLinux/linux/issues/774 still open (and my
reply to this thread still in Drafts; never sent). With my patches
rebased:
$ file arch/powerpc/kernel/vdso32/vdso32.so
arch/powerpc/kernel/vdso32/vdso32.so: ELF 32-bit MSB shared object,
PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, stripped

Are you still using 2.26?

I'm not able to repro Nathan's reported issue from
https://lore.kernel.org/lkml/20200902052123.GA2687902@ubuntu-n2-xlarge-x86/,
so I'm curious if I should resend the rebased patches as v2?


One comment on your rebased patch:

> diff --git a/arch/powerpc/include/asm/vdso.h b/arch/powerpc/include/asm/vdso.h
> index 8542e9bbeead..0bd06ec06aaa 100644
> --- a/arch/powerpc/include/asm/vdso.h
> +++ b/arch/powerpc/include/asm/vdso.h
> @@ -25,19 +25,7 @@ int vdso_getcpu_init(void);
>
>   #else /* __ASSEMBLY__ */
>
> -#ifdef __VDSO64__
> -#define V_FUNCTION_BEGIN(name)\
> -  .globl name;\
> -  name:   \
> -
> -#define V_FUNCTION_END(name)  \
> -  .size name,.-name;
> -
> -#define V_LOCAL_FUNC(name) (name)
> -#endif /* __VDSO64__ */
> -
> -#ifdef __VDSO32__
> -
> +#if defined(__VDSO32__) || defined (__VDSO64__)

You always have either __VDSO32__ or __VDSO64__ so this #if is pointless

>   #define V_FUNCTION_BEGIN(name)   \
>.globl name;\
>.type name,@function;   \
> @@ -47,8 +35,7 @@ int vdso_getcpu_init(void);
>.size name,.-name;
>
>   #define V_LOCAL_FUNC(name) (name)
> -
> -#endif /* __VDSO32__ */
> +#endif /* __VDSO{32|64}__ */
>
>   #endif /* __ASSEMBLY__ */
>


Christophe


Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-23 Thread Segher Boessenkool
On Fri, Apr 23, 2021 at 12:26:57PM +0200, Gabriel Paubert wrote:
> On Thu, Apr 22, 2021 at 06:26:16PM -0500, Segher Boessenkool wrote:
> > > But this can be made jump free :-):
> > > 
> > >   int tmp = regs->ccr << ra;
> > >   op->val = (tmp >> 31) | ((tmp >> 30) & 1);
> > 
> > The compiler will do so automatically (or think of some better way to
> > get the same result); in source code, what matters most is readability,
> > or clarity in general (also clarity to the compiler).
> 
> I just did a test (trivial code attached) and the original code always
> produces one conditional branch at -O2, at least with the cross-compiler
> I have on Debian (gcc 8.3). I have tested both -m32 and -m64. The 64 bit
> version produces an unnecessary "extsw", so I wrote the second version
> splitting the setting of the return value which gets rid of it.

That is an older compiler, and it will be out-of-service any day now.

It depends on what compiler flags you use, and what version of the ISA
you are targeting.

> The second "if" is fairly simple to optimize and the compiler does it
> properly.

Yeah.

> Of course with my suggestion the compiler does not produce any branch. 
> But it needs a really good comment.

Or you could try and help improve the compiler ;-)  You can do this
without writing compiler code yourself, by writing up some good
enhancement request in bugzilla.

The wider and more OoO the processors become, the more important it
becomes to have branch-free code, in situations where the branches would
not be well-predictable.

> > (Right shifts of negative numbers are implementation-defined in C,
> > fwiw -- but work like you expect in GCC).
> 
> Well, I'm not worried about it, since I'd expect a compiler that does
> logical right shifts on signed values to break so much code that it
> would be easily noticed (also in the kernel).

Yup.  And it *is* defined for signed values, as long as they are
non-negative (the common case).

> > > > Also, even people who write LE have the bigger end on the left normally
> > > > (they just write some things right-to-left, and other things
> > > > left-to-right).
> > > 
> > > Around 1985, I had documentation for National's 32032
> > > (little-endian) processor family, and all the instruction encodings were
> > > presented with the LSB on the left and MSB on the right.
> > 
> > Ouch!  Did they write "regular" numbers with the least significant digit
> > on the left as well?
> 
> No, they were not that sadistic!

But more inconsistent :-)


Segher


Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Vaidyanathan Srinivasan
* Michal Suchánek  [2021-04-23 09:35:51]:

> On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > From: "Gautham R. Shenoy" 
> > 
> > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > of the Extended CEDE states advertised by the platform
> > 
> > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> Can you be more specific about 'older firmwares'?

Hi Michal,

This is a POWER9 vs POWER10 difference, not really an obsolete FW.  The
key idea behind the original patch was to make the H_CEDE latency and
hence the target residency come from firmware instead of being decided
by the kernel.  The advantage is that different types of systems in the
POWER10 generation can adjust this value and have optimal H_CEDE entry
criteria which balance good single-thread performance and wakeup
latency.  Further, we can have additional H_CEDE states to feed into
cpuidle.

> Also while this is a performance regression on such firmwares it
> should be fixed by updating the firmware to current version.
> 
> Having sub-optimal performance on obsolete firmware should not require a
> kernel workaround, should it?

When we designed and tested this change on POWER9 and POWER10 systems,
the values that were set in F/w were working out fine, with positive
results in all our micro benchmarks and no regression in context
switch tests.  These repeatable results gave us the confidence that we
could go ahead and take the values from F/w and remove the kernel's
value for all future Linux versions.

But where we slipped is the fact that real-world workloads show
variations in performance and regressions in specific cases, because we
are favouring the H_CEDE state more often than the snooze loop.  The
root cause is that we have to send more IPIs for wakeups now, because
more CPUs will be in the H_CEDE state than before.

This is a performance problem on POWER9 systems where we actually
expected good benefit, and also proved it with micro benchmarks, but it
later turned out to have an impact on some workloads.  Further, the
challenge is not that the regressions are severe; it is the fact that on
the exact same hardware and firmware, end users expect similar or better
performance for everything when updating to a newer kernel, and no
regressions.

We have these settings adjusted for POWER10 in F/w, and hence behaviour
will be similar when we come from an old kernel on P9 to a new kernel on
P10.  We did test the reverse as well, i.e. a new kernel on P9 should
show benefit.  But as explained, the benefit came at the cost of
regressing in a few cases which were discovered later.

Hence this fix is to keep the exact same behaviour for POWER9 and use
this F/w-driven heuristic only from POWER10 onwards.

> It's not like the kernel would crash on the affected firmware.

Correct. We do not have a functional issue, but only a performance
regression observable on certain real workloads.

This is a minor change in cpuidle's H_CEDE usage which will show up
only in certain workload patterns where we need idle CPU threads to
wake up faster to get the job done, as compared to keeping busy CPU
threads in single-thread mode to get more execution slices.

This fix is primarily to ensure that a kernel update does not change
H_CEDE behaviour on the same hardware generation, thereby causing
performance variation and also regression in some cases.

Thanks for the questions and comments, I hope this gives additional
context for this fix.

--Vaidy



Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-23 Thread Rob Herring
On Fri, Apr 23, 2021 at 9:42 AM David Laight  wrote:
>
> From: Michael Ellerman
> > Sent: 23 April 2021 14:51
> ...
> > > (Does anyone - and by anyone I mean any large distro - compile with
> > > local variables inited by the compiler?)
> >
> > This is where I say, "yes, Android" and you say "ugh no I meant a real
> > distro", and I say "well ...".
> >
> > But yeah doesn't help us much.
>
> And it doesn't seem to stop my phone crashing either :-)
>
> Of course, I've absolutely no way of finding out where it is crashing.
> Nor where the massive memory leaks are that mean it needs rebooting
> every few days.

You need a new phone. :) My Pixel3 uptime is sitting at 194 hours
currently. It would be more, but those annoying security updates
require reboots. I have had phones that I set up to reboot every night
though.

Rob


Re: [PATCH 1/2] powerpc: Fix a memory leak in error handling paths

2021-04-23 Thread Rob Herring
On Fri, Apr 23, 2021 at 9:40 AM Christophe JAILLET
 wrote:
>
> If we exit the for_each_of_cpu_node loop early, the reference on the
> current node must be decremented, otherwise there is a leak.
>
> Fixes: a94fe366340a ("powerpc: use for_each_of_cpu_node iterator")
> Signed-off-by: Christophe JAILLET 
> ---
> Strangely, the commit above added the needed of_node_put in one place but
> missed 2 other places!

Well, maintained it in one place and forgot to add in the other two.

> This is strange, so maybe I misunderstand something. Review carefully
> ---
>  arch/powerpc/platforms/powermac/feature.c | 2 ++
>  1 file changed, 2 insertions(+)

Reviewed-by: Rob Herring 

I don't really think patch 2 is worthwhile but that's up to the
powerpc maintainers.

Rob


RE: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-23 Thread David Laight
From: Michael Ellerman
> Sent: 23 April 2021 14:51
...
> > (Does anyone - and by anyone I mean any large distro - compile with
> > local variables inited by the compiler?)
> 
> This is where I say, "yes, Android" and you say "ugh no I meant a real
> distro", and I say "well ...".
> 
> But yeah doesn't help us much.

And it doesn't seem to stop my phone crashing either :-)

Of course, I've absolutely no way of finding out where it is crashing.
Nor where the massive memory leaks are that mean it needs rebooting
every few days.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



[PATCH 2/2] powerpc: Save a few lines of code

2021-04-23 Thread Christophe JAILLET
'arch/powerpc/platforms/powermac/feature.c' triggers many checkpatch.pl
warnings.
The code looks old and not very active, so fixing them all does not make
much sense and would hide some 'git blame' information.

So only apply a few fixes that save a few lines of code.

Signed-off-by: Christophe JAILLET 
---
 arch/powerpc/platforms/powermac/feature.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powermac/feature.c b/arch/powerpc/platforms/powermac/feature.c
index e61f7d2e..155e8586692e 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -83,8 +83,7 @@ struct macio_chip *macio_find(struct device_node *child, int type)
 }
 EXPORT_SYMBOL_GPL(macio_find);
 
-static const char *macio_names[] =
-{
+static const char *macio_names[] = {
"Unknown",
"Grand Central",
"OHare",
@@ -119,8 +118,7 @@ struct feature_table_entry {
feature_callfunction;
 };
 
-struct pmac_mb_def
-{
+struct pmac_mb_def {
const char* model_string;
const char* model_name;
int model_id;
@@ -301,11 +299,10 @@ static long ohare_sleep_state(struct device_node *node, long param, long value)
 
if ((pmac_mb.board_flags & PMAC_MB_CAN_SLEEP) == 0)
return -EPERM;
-   if (value == 1) {
+   if (value == 1)
MACIO_BIC(OHARE_FCR, OH_IOBUS_ENABLE);
-   } else if (value == 0) {
+   else if (value == 0)
MACIO_BIS(OHARE_FCR, OH_IOBUS_ENABLE);
-   }
 
return 0;
 }
@@ -2992,7 +2989,6 @@ void pmac_register_agp_pm(struct pci_dev *bridge,
if (bridge != pmac_agp_bridge)
return;
pmac_agp_suspend = pmac_agp_resume = NULL;
-   return;
 }
 EXPORT_SYMBOL(pmac_register_agp_pm);
 
-- 
2.27.0



[PATCH 1/2] powerpc: Fix a memory leak in error handling paths

2021-04-23 Thread Christophe JAILLET
If we exit the for_each_of_cpu_node loop early, the reference on the
current node must be decremented, otherwise there is a leak.

Fixes: a94fe366340a ("powerpc: use for_each_of_cpu_node iterator")
Signed-off-by: Christophe JAILLET 
---
Strangely, the commit above added the needed of_node_put in one place but
missed 2 other places!
This is strange, so maybe I misunderstand something. Review carefully
---
 arch/powerpc/platforms/powermac/feature.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/platforms/powermac/feature.c b/arch/powerpc/platforms/powermac/feature.c
index 5c77b9a24c0e..e61f7d2e 100644
--- a/arch/powerpc/platforms/powermac/feature.c
+++ b/arch/powerpc/platforms/powermac/feature.c
@@ -1060,6 +1060,7 @@ core99_reset_cpu(struct device_node *node, long param, long value)
continue;
if (param == *num) {
reset_io = *rst;
+   of_node_put(np);
break;
}
}
@@ -1506,6 +1507,7 @@ static long g5_reset_cpu(struct device_node *node, long param, long value)
continue;
if (param == *num) {
reset_io = *rst;
+   of_node_put(np);
break;
}
}
-- 
2.27.0



[PATCH] powerpc/signal32: Fix erroneous SIGSEGV on RT signal return

2021-04-23 Thread Christophe Leroy
Return of user_read_access_begin() is tested the wrong way,
leading to a SIGSEGV when the user address is valid and likely
an Oops when the user address is bad.

Fix the test.

Fixes: 887f3ceb51cd ("powerpc/signal32: Convert do_setcontext[_tm]() to user access block")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/signal_32.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index fc41c58f0cbb..8f05ed0da292 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -967,7 +967,7 @@ static int do_setcontext(struct ucontext __user *ucp, struct pt_regs *regs, int
sigset_t set;
struct mcontext __user *mcp;
 
-   if (user_read_access_begin(ucp, sizeof(*ucp)))
+   if (!user_read_access_begin(ucp, sizeof(*ucp)))
return -EFAULT;
 
unsafe_get_sigset_t(&set, &ucp->uc_sigmask, failed);
@@ -1005,7 +1005,7 @@ static int do_setcontext_tm(struct ucontext __user *ucp,
u32 cmcp;
u32 tm_cmcp;
 
-   if (user_read_access_begin(ucp, sizeof(*ucp)))
+   if (!user_read_access_begin(ucp, sizeof(*ucp)))
return -EFAULT;
 
unsafe_get_sigset_t(&set, &ucp->uc_sigmask, failed);
-- 
2.25.0



Re: [PATCH] powerpc: Initialize local variable fdt to NULL in elf64_load()

2021-04-23 Thread Michael Ellerman
Daniel Axtens  writes:
> Daniel Axtens  writes:
>
>> Hi Lakshmi,
>>
>>> On 4/15/21 12:14 PM, Lakshmi Ramasubramanian wrote:
>>>
>>> Sorry - missed copying device-tree and powerpc mailing lists.
>>>
 There are a few "goto out;" statements before the local variable "fdt"
 is initialized through the call to of_kexec_alloc_and_setup_fdt() in
 elf64_load(). This will result in an uninitialized "fdt" being passed
 to kvfree() in this function if there is an error before the call to
 of_kexec_alloc_and_setup_fdt().
 
 Initialize the local variable "fdt" to NULL.

>> I'm a huge fan of initialising local variables! But I'm struggling to
>> find the code path that will lead to an uninit fdt being returned...
>
> OK, so perhaps this was putting it too strongly. I have been bitten
> by uninitialised things enough in C that I may have taken a slightly
> overly-agressive view of fixing them in the source rather than the
> compiler. I do think compiler-level mitigations are better, and I take
> the point that we don't want to defeat compiler checking.
>
> (Does anyone - and by anyone I mean any large distro - compile with
> local variables inited by the compiler?)

This is where I say, "yes, Android" and you say "ugh no I meant a real
distro", and I say "well ...".

But yeah doesn't help us much.

cheers


Re: [PATCH v5 14/16] dma-direct: Allocate memory from restricted DMA pool if available

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

The restricted DMA pool is preferred if available.

The restricted DMA pools provide a basic level of protection against the
DMA overwriting buffer contents at unexpected times. However, to protect
against general data leakage and system memory corruption, the system
needs to provide a way to lock down the memory access, e.g., MPU.

Signed-off-by: Claire Chang 
---
  kernel/dma/direct.c | 35 ++-
  1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 7a27f0510fcc..29523d2a9845 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -78,6 +78,10 @@ static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
  static void __dma_direct_free_pages(struct device *dev, struct page *page,
size_t size)
  {
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   if (swiotlb_free(dev, page, size))
+   return;
+#endif
dma_free_contiguous(dev, page, size);
  }
  
@@ -92,7 +96,17 @@ static struct page *__dma_direct_alloc_pages(struct device *dev, size_t size,
  
	gfp |= dma_direct_optimal_gfp_mask(dev, dev->coherent_dma_mask,
					   &phys_limit);
-   page = dma_alloc_contiguous(dev, size, gfp);
+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   page = swiotlb_alloc(dev, size);
+   if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
+   __dma_direct_free_pages(dev, page, size);
+   page = NULL;
+   }
+#endif
+
+   if (!page)
+   page = dma_alloc_contiguous(dev, size, gfp);
if (page && !dma_coherent_ok(dev, page_to_phys(page), size)) {
dma_free_contiguous(dev, page, size);
page = NULL;
@@ -148,7 +162,7 @@ void *dma_direct_alloc(struct device *dev, size_t size,
gfp |= __GFP_NOWARN;
  
  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&

-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
page = __dma_direct_alloc_pages(dev, size, gfp & ~__GFP_ZERO);
if (!page)
return NULL;
@@ -161,8 +175,8 @@ void *dma_direct_alloc(struct device *dev, size_t size,
}
  
  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&

-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev))
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev))
return arch_dma_alloc(dev, size, dma_handle, gfp, attrs);
  
  	/*

@@ -172,7 +186,9 @@ void *dma_direct_alloc(struct device *dev, size_t size,
if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&
!gfpflags_allow_blocking(gfp) &&
(force_dma_unencrypted(dev) ||
-(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev))))
+(IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
+ !dev_is_dma_coherent(dev))) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);
  
  	/* we always manually zero the memory once we are done */

@@ -253,15 +269,15 @@ void dma_direct_free(struct device *dev, size_t size,
unsigned int page_order = get_order(size);
  
  	if ((attrs & DMA_ATTR_NO_KERNEL_MAPPING) &&

-   !force_dma_unencrypted(dev)) {
+   !force_dma_unencrypted(dev) && !is_dev_swiotlb_force(dev)) {
/* cpu_addr is a struct page cookie, not a kernel address */
dma_free_contiguous(dev, cpu_addr, size);
return;
}
  
  	if (!IS_ENABLED(CONFIG_ARCH_HAS_DMA_SET_UNCACHED) &&

-   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) &&
-   !dev_is_dma_coherent(dev)) {
+   !IS_ENABLED(CONFIG_DMA_DIRECT_REMAP) && !dev_is_dma_coherent(dev) &&
+   !is_dev_swiotlb_force(dev)) {
arch_dma_free(dev, size, cpu_addr, dma_addr, attrs);
return;
}
@@ -289,7 +305,8 @@ struct page *dma_direct_alloc_pages(struct device *dev, size_t size,
void *ret;
  
  	if (IS_ENABLED(CONFIG_DMA_COHERENT_POOL) &&

-   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
+   force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp) &&
+   !is_dev_swiotlb_force(dev))
return dma_direct_alloc_from_pool(dev, size, dma_handle, gfp);


Wait, this seems broken for non-coherent devices - in that case we need 
to return a non-cacheable address, but we can't simply fall through into 
the remapping path below in GFP_ATOMIC context. That's why we need the 
atomic pool concept in the first place :/


Unless I've overlooked something, we're still using the regular 
cacheable linear map address of the dma_io_tlb_mem buffer, no?


Robin.

  
  	page = __dma_direct_alloc_pages(dev, size, 

Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-23 Thread Quentin Monnet
2021-04-23 12:46 UTC+0200 ~ Christophe Leroy 
> 
> 
> Le 23/04/2021 à 12:26, Quentin Monnet a écrit :
>> 2021-04-23 09:19 UTC+0200 ~ Christophe Leroy
>> 
>>
>> [...]
>>
>>> I finally managed to cross compile bpftool with libbpf, libopcodes,
>>> readline, ncurses, libcap, libz and all needed stuff. Was not easy but I
>>> made it.
>>
>> Libcap is optional and bpftool does not use readline or ncurses. May I
>> ask how you tried to build it?
> 
> cd tools/bpf/
> 
> make ARCH=powerpc CROSS_COMPILE=ppc-linux-

Ok, you could try running directly from tools/bpf/bpftool/ next time
instead.

Readline at least is for a different tool under tools/bpf/, bpf_dbg (But
I'm still not sure where that ncurses requirement was pulled from). The
requirements for specific kernel options probably came from yet another
tool (runqslower, I think).

Quentin


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-23 Thread Christophe Leroy




Le 23/04/2021 à 12:59, Quentin Monnet a écrit :

2021-04-23 12:46 UTC+0200 ~ Christophe Leroy 



Le 23/04/2021 à 12:26, Quentin Monnet a écrit :

2021-04-23 09:19 UTC+0200 ~ Christophe Leroy


[...]


I finally managed to cross compile bpftool with libbpf, libopcodes,
readline, ncurses, libcap, libz and all needed stuff. Was not easy but I
made it.


Libcap is optional and bpftool does not use readline or ncurses. May I
ask how you tried to build it?


cd tools/bpf/

make ARCH=powerpc CROSS_COMPILE=ppc-linux-


Ok, you could try running directly from tools/bpf/bpftool/ next time
instead.

Readline at least is for a different tool under tools/bpf/, bpf_dbg (But
I'm still not sure where that ncurses requirement was pulled from). The
requirements for specific kernel options probably came from yet another
tool (runqslower, I think).



ncurses (or termcap) is required by readline

Christophe


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-23 Thread Christophe Leroy




Le 23/04/2021 à 12:26, Quentin Monnet a écrit :

2021-04-23 09:19 UTC+0200 ~ Christophe Leroy 

[...]


I finally managed to cross compile bpftool with libbpf, libopcodes,
readline, ncurses, libcap, libz and all needed stuff. Was not easy but I
made it.


Libcap is optional and bpftool does not use readline or ncurses. May I
ask how you tried to build it?


cd tools/bpf/

make ARCH=powerpc CROSS_COMPILE=ppc-linux-


Christophe


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-23 Thread Quentin Monnet
2021-04-23 09:19 UTC+0200 ~ Christophe Leroy 

[...]

> I finally managed to cross compile bpftool with libbpf, libopcodes,
> readline, ncurses, libcap, libz and all needed stuff. Was not easy but I
> made it.

Libcap is optional and bpftool does not use readline or ncurses. May I
ask how you tried to build it?

> 
> Now, how do I use it ?
> 
> Let's say I want to dump the jitted code generated from a call to
> 'tcpdump'. How do I do that with 'bpftool prog dump jited'?
> 
> I thought by calling this line I would then get programs dumped one way
> or another, just like when setting 'bpf_jit_enable=2', but calling that
> line just provides me some bpftool help text.

Well the purpose of this text is to help you find the way to call
bpftool to do what you want :). For dumping your programs' instructions,
you need to tell bpftool what program to dump: Bpftool isn't waiting
until you load a program to dump it, instead you need to load your
program first and then tell bpftool to retrieve the instructions from
the kernel. To reference your program you could use a pinned path, or
first list the programs on your system with "bpftool prog show":


# bpftool prog show
138: tracing  name foo  tag e54c922dfa54f65f  gpl
loaded_at 2021-02-25T01:32:30+  uid 0
xlated 256B  jited 154B  memlock 4096B  map_ids 64
btf_id 235

Then you can use for example the program id displayed on the first line
to reference and dump your program:

# bpftool prog dump jited id 138
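
Or, if you have pinned the program somewhere, you can reference it by its
pin path instead of its id (the path below is only an example):

# bpftool prog dump jited pinned /sys/fs/bpf/my_prog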

You should find additional documentation under
tools/bpf/bpftool/Documentation.

> 
> By the way, it would be nice to have a kernel OPTION that selects all
> OPTIONS required for building bpftool, because you discover them one by
> one at every build failure. I had to add CONFIG_IPV6, CONFIG_DEBUG_BTF,
> CONFIG_CGROUPS, ... If there could be an option like "Build a 'bpftool'
> ready kernel" that selected all those, it would be great.
> 
> Christophe

I do not believe any of these are required to build bpftool.

Quentin


Re: [PATCH kernel] powerpc/makefile: Do not redefine $(CPP) for preprocessor

2021-04-23 Thread Michael Ellerman
Christophe Leroy  writes:
> Le 23/04/2021 à 00:58, Daniel Axtens a écrit :
>>> diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
>>> index c9d2c7825cd6..3a2f2001c62b 100644
>>> --- a/arch/powerpc/Makefile
>>> +++ b/arch/powerpc/Makefile
>>> @@ -214,7 +214,6 @@ KBUILD_CPPFLAGS += -I $(srctree)/arch/$(ARCH) $(asinstr)
>>>   KBUILD_AFLAGS += $(AFLAGS-y)
>>>   KBUILD_CFLAGS += $(call cc-option,-msoft-float)
>>>   KBUILD_CFLAGS += -pipe $(CFLAGS-y)
>>> -CPP= $(CC) -E $(KBUILD_CFLAGS)
>> 
>> I was trying to check the history to see why powerpc has its own
>> definition. It seems to date back at least as far as merging the two
>> powerpc platforms into one, maybe it was helpful then. I agree it
>> doesn't seem to be needed now.
>> 
>
> I dug a bit deeper. It has been there since the introduction of
> arch/ppc/ in Linux 1.3.45.
> At the time, there was the exact same CPP definition in arch/mips/Makefile.
>
> The CPP definition in mips disappeared in Linux 2.1.44pre3.
> Since that version, ppc has been the only one with such a CPP re-definition.
>
>> Snowpatch claims that this breaks the build, but I haven't been able to
>> reproduce the issue in either pmac32 or ppc64 defconfig. I have sent it
>> off to a fork of mpe's linux-ci repo to see if any of those builds hit
>> any issues: https://github.com/daxtens/linux-ci/actions
>> 
>
> By the way, I find snowpatch quite useless these days. It seems to
> delete the reports a few minutes after the test. We are less than one
> day after the patch was submitted and the reports of the build failures
> are already gone.

Yeah, it's pretty annoying. It needs to send emails to the list with its
results, with a link to a location that retains the logs for some
reasonable period.

cheers


Re: [PATCH v5 16/16] of: Add plumbing for restricted DMA pool

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

If a device is not behind an IOMMU, we look up the device node and set
up the restricted DMA when the restricted-dma-pool is present.

Signed-off-by: Claire Chang 
---
  drivers/of/address.c| 25 +
  drivers/of/device.c |  3 +++
  drivers/of/of_private.h |  5 +
  3 files changed, 33 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index 54f221dde267..fff3adfe4986 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -8,6 +8,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -1109,6 +1110,30 @@ bool of_dma_is_coherent(struct device_node *np)
  }
  EXPORT_SYMBOL_GPL(of_dma_is_coherent);
  
+int of_dma_set_restricted_buffer(struct device *dev)

+{
+   struct device_node *node;
+   int count, i;
+
+   if (!dev->of_node)
+   return 0;
+
+   count = of_property_count_elems_of_size(dev->of_node, "memory-region",
+   sizeof(phandle));
+   for (i = 0; i < count; i++) {
+   node = of_parse_phandle(dev->of_node, "memory-region", i);
+   /* There might be multiple memory regions, but only one
+* restricted-dma-pool region is allowed.
+*/


What's the use-case for having multiple regions if the restricted pool 
is by definition the only one accessible?


Robin.


+   if (of_device_is_compatible(node, "restricted-dma-pool") &&
+   of_device_is_available(node))
+   return of_reserved_mem_device_init_by_idx(
+   dev, dev->of_node, i);
+   }
+
+   return 0;
+}
+
  /**
   * of_mmio_is_nonposted - Check if device uses non-posted MMIO
   * @np:   device node
diff --git a/drivers/of/device.c b/drivers/of/device.c
index c5a9473a5fb1..d8d865223e51 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -165,6 +165,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
  
  	arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
  
+	if (!iommu)

+   return of_dma_set_restricted_buffer(dev);
+
return 0;
  }
  EXPORT_SYMBOL_GPL(of_dma_configure_id);
diff --git a/drivers/of/of_private.h b/drivers/of/of_private.h
index d717efbd637d..e9237f5eff48 100644
--- a/drivers/of/of_private.h
+++ b/drivers/of/of_private.h
@@ -163,12 +163,17 @@ struct bus_dma_region;
  #if defined(CONFIG_OF_ADDRESS) && defined(CONFIG_HAS_DMA)
  int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map);
+int of_dma_set_restricted_buffer(struct device *dev);
  #else
  static inline int of_dma_get_range(struct device_node *np,
const struct bus_dma_region **map)
  {
return -ENODEV;
  }
+static inline int of_dma_get_restricted_buffer(struct device *dev)
+{
+   return -ENODEV;
+}
  #endif
  
  #endif /* _LINUX_OF_PRIVATE_H */




Re: [PATCH v5 08/16] swiotlb: Update is_swiotlb_active to add a struct device argument

2021-04-23 Thread Robin Murphy

On 2021-04-22 09:15, Claire Chang wrote:

Update is_swiotlb_active to add a struct device argument. This will be
useful later to allow for restricted DMA pool.

Signed-off-by: Claire Chang 
---
  drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +-
  drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +-
  drivers/pci/xen-pcifront.c   | 2 +-
  include/linux/swiotlb.h  | 4 ++--
  kernel/dma/direct.c  | 2 +-
  kernel/dma/swiotlb.c | 4 ++--
  6 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index ce6b664b10aa..7d48c433446b 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj)
  
  	max_order = MAX_ORDER;

  #ifdef CONFIG_SWIOTLB
-   if (is_swiotlb_active()) {
+   if (is_swiotlb_active(NULL)) {
unsigned int max_segment;
  
  		max_segment = swiotlb_max_segment();

diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c
index e8b506a6685b..2a2ae6d6cf6d 100644
--- a/drivers/gpu/drm/nouveau/nouveau_ttm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c
@@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm)
}
  
  #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86)

-   need_swiotlb = is_swiotlb_active();
+   need_swiotlb = is_swiotlb_active(NULL);
  #endif
  
  	ret = ttm_device_init(&drm->ttm.bdev, &nouveau_bo_driver, drm->dev->dev,

diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c
index b7a8f3a1921f..6d548ce53ce7 100644
--- a/drivers/pci/xen-pcifront.c
+++ b/drivers/pci/xen-pcifront.c
@@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev)
  
  	spin_unlock(&pcifront_dev_lock);
  
-	if (!err && !is_swiotlb_active()) {

+   if (!err && !is_swiotlb_active(NULL)) {
err = pci_xen_swiotlb_init_late();
if (err)
dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n");
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 2a6cca07540b..c530c976d18b 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -123,7 +123,7 @@ static inline bool is_swiotlb_buffer(struct device *dev, phys_addr_t paddr)
  void __init swiotlb_exit(void);
  unsigned int swiotlb_max_segment(void);
  size_t swiotlb_max_mapping_size(struct device *dev);
-bool is_swiotlb_active(void);
+bool is_swiotlb_active(struct device *dev);
  void __init swiotlb_adjust_size(unsigned long size);
  #else
  #define swiotlb_force SWIOTLB_NO_FORCE
@@ -143,7 +143,7 @@ static inline size_t swiotlb_max_mapping_size(struct device *dev)
return SIZE_MAX;
  }
  
-static inline bool is_swiotlb_active(void)

+static inline bool is_swiotlb_active(struct device *dev)
  {
return false;
  }
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 84c9feb5474a..7a88c34d0867 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -495,7 +495,7 @@ int dma_direct_supported(struct device *dev, u64 mask)
  size_t dma_direct_max_mapping_size(struct device *dev)
  {
/* If SWIOTLB is active, use its maximum mapping size */
-   if (is_swiotlb_active() &&
+   if (is_swiotlb_active(dev) &&
(dma_addressing_limited(dev) || swiotlb_force == SWIOTLB_FORCE))


I wonder if it's worth trying to fold these other conditions into 
is_swiotlb_active() itself? I'm not entirely sure what matters for Xen, 
but for the other cases it seems like they probably only care about 
whether bouncing may occur for their particular device or not (possibly 
they want to be using dma_max_mapping_size() now anyway - TBH I'm 
struggling to make sense of what the swiotlb_max_segment business is 
supposed to mean).
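
Very roughly something like the below - just a sketch of the idea using
the helpers introduced in this series, not a worked-through proposal, and
the other callers would of course need auditing:

	static inline bool is_swiotlb_active(struct device *dev)
	{
		struct io_tlb_mem *mem = get_io_tlb_mem(dev);

		return mem && (dma_addressing_limited(dev) ||
			       swiotlb_force == SWIOTLB_FORCE ||
			       is_dev_swiotlb_force(dev));
	}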


Otherwise, patch #9 will need to touch here as well to make sure that 
per-device forced bouncing is reflected correctly.


Robin.


return swiotlb_max_mapping_size(dev);
return SIZE_MAX;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index ffbb8724e06c..1d221343f1c8 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -659,9 +659,9 @@ size_t swiotlb_max_mapping_size(struct device *dev)
return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE;
  }
  
-bool is_swiotlb_active(void)

+bool is_swiotlb_active(struct device *dev)
  {
-   return io_tlb_default_mem != NULL;
+   return get_io_tlb_mem(dev) != NULL;
  }
  EXPORT_SYMBOL_GPL(is_swiotlb_active);
  



Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-23 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> Michael Ellerman wrote:
>> "Naveen N. Rao"  writes:
>>> Daniel Axtens wrote:
 Sathvika Vasireddy  writes:
 
> This adds emulation support for the following instruction:
>* Set Boolean (setb)
>
> Signed-off-by: Sathvika Vasireddy 
>> ...
 
 If you do end up respinning the patch, I think it would be good to make
 the maths a bit clearer. I think it works because a left shift of 2 is
 the same as multiplying by 4, but it would be easier to follow if you
 used a temporary variable for btf.
>>>
>>> Indeed. I wonder if it is better to follow the ISA itself. Per the ISA, 
>>> the bit we are interested in is:
>>> 4 x BFA + 32
>>>
>>> So, if we use that along with the PPC_BIT() macro, we get:
>>> if (regs->ccr & PPC_BIT(ra + 32))
>> 
>> Use of PPC_BIT risks annoying your maintainer :)
>
> Uh oh... that isn't good :)
>
> I looked up previous discussions and I think I now understand why you 
> don't prefer it.

Hah, I'd forgotten I'd written (ranted :D) about this in the past.

> But, I feel it helps make it easy to follow the code when referring to 
> the ISA.

That's true. But I think that's much much less common than people
reading the code in isolation.

And ultimately it doesn't matter if the code (appears to) match the ISA,
it matters that the code works. My worry is that too much use of those
type of macros obscures what's actually happening.

> I'm wondering if it is just the name you dislike and if so, 
> does it make sense to rename PPC_BIT() to something else? We have 
> BIT_ULL(), so perhaps BIT_MSB_ULL() or MSB_BIT_ULL()?

The name is part of it. But I don't really like BIT_ULL() either, it
hides in a macro something that could just be there in front of you
ie. (1ull << x).


For this case of setb, I think I'd go with something like below. It
doesn't exactly match the ISA, but I think there's minimal obfuscation
of what's actually going on.

// ra is now bfa
ra = (ra >> 2);

// Extract 4-bit CR field
val = regs->ccr >> (CR0_SHIFT - 4 * ra);

if (val & 8)
op->val = -1;
else if (val & 4)
op->val = 1;
else
op->val = 0;


If anything could use a macro it would be the 8 and 4, eg. CR_LT, CR_GT.

Of course that's probably got a bug in it, because I just wrote it by
eye and it's 11:28 pm :)

cheers


Re: [PATCH v5 05/16] swiotlb: Add restricted DMA pool initialization

2021-04-23 Thread Steven Price

On 22/04/2021 09:14, Claire Chang wrote:

Add the initialization function to create restricted DMA pools from
matching reserved-memory nodes.

Signed-off-by: Claire Chang 
---
  include/linux/device.h  |  4 +++
  include/linux/swiotlb.h |  3 +-
  kernel/dma/swiotlb.c| 80 +
  3 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/include/linux/device.h b/include/linux/device.h
index 38a2071cf776..4987608ea4ff 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -416,6 +416,7 @@ struct dev_links_info {
   * @dma_pools:Dma pools (if dma'ble device).
   * @dma_mem:  Internal for coherent mem override.
   * @cma_area: Contiguous memory area for dma allocations
+ * @dma_io_tlb_mem: Internal for swiotlb io_tlb_mem override.
   * @archdata: For arch-specific additions.
   * @of_node:  Associated device tree node.
   * @fwnode:   Associated device node supplied by platform firmware.
@@ -521,6 +522,9 @@ struct device {
  #ifdef CONFIG_DMA_CMA
struct cma *cma_area;   /* contiguous memory area for dma
   allocations */
+#endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+   struct io_tlb_mem *dma_io_tlb_mem;
  #endif
/* arch specific additions */
struct dev_archdata archdata;
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 216854a5e513..03ad6e3b4056 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -72,7 +72,8 @@ extern enum swiotlb_force swiotlb_force;
   *range check to see if the memory was in fact allocated by this
   *API.
   * @nslabs:   The number of IO TLB blocks (in groups of 64) between @start and
- * @end. This is command line adjustable via setup_io_tlb_npages.
+ * @end. For default swiotlb, this is command line adjustable via
+ * setup_io_tlb_npages.
   * @used: The number of used IO TLB block.
   * @list: The free list describing the number of free entries available
   *from each index.
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 57a9adb920bf..ffbb8724e06c 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -39,6 +39,13 @@
  #ifdef CONFIG_DEBUG_FS
  #include 
  #endif
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+#include 
+#include 
+#include 
+#include 
+#include 
+#endif
  
  #include 

  #include 
@@ -681,3 +688,76 @@ static int __init swiotlb_create_default_debugfs(void)
  late_initcall(swiotlb_create_default_debugfs);
  
  #endif

+
+#ifdef CONFIG_DMA_RESTRICTED_POOL
+static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
+   struct device *dev)
+{
+   struct io_tlb_mem *mem = rmem->priv;
+   unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
+
+   if (dev->dma_io_tlb_mem)
+   return 0;
+
+   /* Since multiple devices can share the same pool, the private data,
+* io_tlb_mem struct, will be initialized by the first device attached
+* to it.
+*/
+   if (!mem) {
+   mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
+   if (!mem)
+   return -ENOMEM;
+#ifdef CONFIG_ARM
+   if (!PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
+   kfree(mem);
+   return -EINVAL;
+   }
+#endif /* CONFIG_ARM */
+   swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
+
+   rmem->priv = mem;
+   }
+
+#ifdef CONFIG_DEBUG_FS
+   if (!io_tlb_default_mem->debugfs)
+   io_tlb_default_mem->debugfs =
+   debugfs_create_dir("swiotlb", NULL);


At this point it's possible for io_tlb_default_mem to be NULL, leading 
to a splat.


But even then if it's not and we have the situation where debugfs==NULL 
then the debugfs_create_dir() here will cause a subsequent attempt in 
swiotlb_create_debugfs() to fail (directory already exists) leading to 
mem->debugfs being assigned an error value. I suspect the creation of 
the debugfs directory needs to be separated from io_tlb_default_mem 
being set.
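
For reference, the sort of thing I had in mind - only a rough sketch
(ignoring locking), not something I have tested:

	/* Create (or reuse) the top-level swiotlb debugfs directory without
	 * depending on io_tlb_default_mem having been set up. */
	static struct dentry *swiotlb_debugfs_dir(void)
	{
		static struct dentry *dir;

		if (!dir)
			dir = debugfs_create_dir("swiotlb", NULL);
		return dir;
	}

and then hang both the default pool's and the restricted pools' debugfs
entries off that.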


Other than that I gave this series a go with our prototype of Arm's 
Confidential Compute Architecture[1] - since the majority of the 
guest's memory is protected from the host, the restricted DMA pool allows 
(only) a small area to be shared with the host.


After fixing (well, hacking around) the above, it all seems to be working 
fine with virtio drivers.


Thanks,

Steve

[1] 
https://www.arm.com/why-arm/architecture/security-features/arm-confidential-compute-architecture


Re: [PATCH 1/2] powerpc/sstep: Add emulation support for ‘setb’ instruction

2021-04-23 Thread Gabriel Paubert
On Thu, Apr 22, 2021 at 06:26:16PM -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Fri, Apr 23, 2021 at 12:16:18AM +0200, Gabriel Paubert wrote:
> > On Thu, Apr 22, 2021 at 02:13:34PM -0500, Segher Boessenkool wrote:
> > > On Fri, Apr 16, 2021 at 05:44:52PM +1000, Daniel Axtens wrote:
> > > > Sathvika Vasireddy  writes:
> > > > > + if ((regs->ccr) & (1 << (31 - ra)))
> > > > > + op->val = -1;
> > > > > + else if ((regs->ccr) & (1 << (30 - ra)))
> > > > > + op->val = 1;
> > > > > + else
> > > > > + op->val = 0;
> > > 
> > > It imo is clearer if written
> > > 
> > >   if ((regs->ccr << ra) & 0x8000)
> > >   op->val = -1;
> > >   else if ((regs->ccr << ra) & 0x4000)
> > >   op->val = 1;
> > >   else
> > >   op->val = 0;
> > > 
> > > but I guess not everyone agrees :-)
> > 
> > But this can be made jump free :-):
> > 
> > int tmp = regs->ccr << ra;
> > op->val = (tmp >> 31) | ((tmp >> 30) & 1);
> 
> The compiler will do so automatically (or think of some better way to
> get the same result); in source code, what matters most is readability,
> or clarity in general (also clarity to the compiler).

I just did a test (trivial code attached) and the original code always
produces one conditional branch at -O2, at least with the cross-compiler
I have on Debian (gcc 8.3). I have tested both -m32 and -m64. The 64 bit
version produces an unnecessary "extsw", so I wrote the second version
splitting the setting of the return value which gets rid of it.

The second "if" is fairly simple to optimize and the compiler does it
properly.

Of course with my suggestion the compiler does not produce any branch. 
But it needs a really good comment.
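
Something like this, maybe (the same two lines as above, just with the
comment I would put next to them):

	/*
	 * Shift the CR field of interest up so that its LT bit lands in the
	 * sign bit: an arithmetic right shift of that bit then yields -1 or
	 * 0, and OR-ing in the GT bit (shifted down to bit 0) turns the
	 * "GT set, LT clear" case into +1.
	 */
	int tmp = regs->ccr << ra;
	op->val = (tmp >> 31) | ((tmp >> 30) & 1);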


> 
> (Right shifts of negative numbers are implementation-defined in C,
> fwiw -- but work like you expect in GCC).

Well, I'm not worried about it, since I'd expect a compiler that does
logical right shifts on signed values to break so much code that it
would be easily noticed (also in the kernel).


> 
> > (IIRC the srawi instruction sign-extends its result to 64 bits).
> 
> If you consider it to work on 32-bit inputs, yeah, that is a simple way
> to express it.
> 
> > > > CR field:  76543210
> > > > bit:  0123 0123 0123 0123 0123 0123 0123 0123
> > > > normal bit #: 0.31
> > > > ibm bit #:   31.0
> > > 
> > > The bit numbers in CR fields are *always* numbered left-to-right.  I
> > > have never seen anyone use LE for it, anyway.
> > > 
> > > Also, even people who write LE have the bigger end on the left normally
> > > (they just write some things right-to-left, and other things
> > > left-to-right).
> > 
> > Around 1985, I had documentation for National's 32032
> > (little-endian) processor family, and all the instruction encodings were
> > presented with the LSB on the left and MSB on the right.
> 
> Ouch!  Did they write "regular" numbers with the least significant digit
> on the left as well?

No, they were not that sadistic!

At least instructions were a whole number of bytes, unlike the iAPX432
where jumps needed to encode target addresses down to the bit level.

> 
> > BTW on these processors, the immediate operands and the offsets (1, 2 or
> > 4 bytes) for the addressing modes were encoded in big-endian byte order,
> > but I digress. Consistency is overrated ;-)
> 
> Inconsistency is the spice of life, yeah :-)

;-)

Gabriel

long setb_cond(int cr, int shift)
{
	long ret;
	if ((cr << shift) & 0x8000)
		ret = -1;
	else if ((cr << shift) & 0x4000)
		ret = 1;
	else
		ret = 0;
	return ret;
}

long setb_uncond(int cr, int shift)
{
	int tmp = cr << shift;
	long ret;
	ret = (tmp >> 31) | ((tmp >> 30) & 1);
	return ret;
}

long setb_uncond2(int cr, int shift)
{
	int tmp = cr << shift;
	long ret;
	ret = (tmp >> 31);
	ret |= ((tmp >> 30) & 1);
	return ret;
}


Re: [PATCH v3 06/11] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2021-04-23 Thread Alexey Kardashevskiy




On 22/04/2021 17:07, Leonardo Bras wrote:

Code used to create a ddw property that was previously scattered in
enable_ddw() is now gathered in ddw_property_create(), which deals with
allocating and filling the property, leaving it ready for
of_add_property(), which now occurs in sequence.

This created an opportunity to reorganize the second part of enable_ddw():

Without this patch enable_ddw() does, in order:
kzalloc() property & members, create_ddw(), fill ddwprop inside property,
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
of_add_property(), and list_add().

With this patch enable_ddw() does, in order:
create_ddw(), ddw_property_create(), of_add_property(),
ddw_list_new_entry(), do tce_setrange_multi_pSeriesLP_walk in all memory,
and list_add().

This change requires of_remove_property() in case anything fails after
of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
in all memory, which looks like the most expensive operation, only if
everything else succeeds.

Signed-off-by: Leonardo Bras 
---
  arch/powerpc/platforms/pseries/iommu.c | 93 --
  1 file changed, 57 insertions(+), 36 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 955cf095416c..48c029386d94 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1122,6 +1122,35 @@ static void reset_dma_window(struct pci_dev *dev, struct 
device_node *par_dn)
 ret);
  }
  
+static struct property *ddw_property_create(const char *propname, u32 liobn, u64 dma_addr,

+   u32 page_shift, u32 window_shift)
+{
+   struct dynamic_dma_window_prop *ddwprop;
+   struct property *win64;
+
+   win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
+   if (!win64)
+   return NULL;
+
+   win64->name = kstrdup(propname, GFP_KERNEL);
+   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
+   win64->value = ddwprop;
+   win64->length = sizeof(*ddwprop);
+   if (!win64->name || !win64->value) {
+   kfree(win64);
+   kfree(win64->name);
+   kfree(win64->value);



Wrong order.
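Presumably the intent is to free the members before the structure that holds
them; a minimal sketch of the corrected error path (not tested):

	if (!win64->name || !win64->value) {
		/* free the members first, then the property itself */
		kfree(win64->name);
		kfree(win64->value);
		kfree(win64);
		return NULL;
	}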





+   return NULL;
+   }
+
+   ddwprop->liobn = cpu_to_be32(liobn);
+   ddwprop->dma_base = cpu_to_be64(dma_addr);
+   ddwprop->tce_shift = cpu_to_be32(page_shift);
+   ddwprop->window_shift = cpu_to_be32(window_shift);
+
+   return win64;
+}
+
  /* Return largest page shift based on "IO Page Sizes" output of 
ibm,query-pe-dma-window. */
  static int iommu_get_page_shift(u32 query_page_size)
  {
@@ -1167,11 +1196,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
struct ddw_query_response query;
struct ddw_create_response create;
int page_shift;
+   u64 win_addr;
struct device_node *dn;
u32 ddw_avail[DDW_APPLICABLE_SIZE];
struct direct_window *window;
struct property *win64 = NULL;
-   struct dynamic_dma_window_prop *ddwprop;
struct failed_ddw_pdn *fpdn;
bool default_win_removed = false;
bool pmem_present;
@@ -1286,65 +1315,54 @@ static bool enable_ddw(struct pci_dev *dev, struct 
device_node *pdn)
1ULL << page_shift);
goto out_failed;
}
-   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
-   if (!win64) {
-   dev_info(&dev->dev,
-   "couldn't allocate property for 64bit dma window\n");
-   goto out_failed;
-   }
-   win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
-   win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
-   win64->length = sizeof(*ddwprop);
-   if (!win64->name || !win64->value) {
-   dev_info(&dev->dev,
-   "couldn't allocate property name and value\n");
-   goto out_free_prop;
-   }
  
   	ret = create_ddw(dev, ddw_avail, &create, page_shift, len);

if (ret != 0)
-   goto out_free_prop;
-
-   ddwprop->liobn = cpu_to_be32(create.liobn);
-   ddwprop->dma_base = cpu_to_be64(((u64)create.addr_hi << 32) |
-   create.addr_lo);
-   ddwprop->tce_shift = cpu_to_be32(page_shift);
-   ddwprop->window_shift = cpu_to_be32(len);
+   goto out_failed;
  
  	dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",

  create.liobn, dn);
  
-	window = ddw_list_new_entry(pdn, ddwprop);

+   win_addr = ((u64)create.addr_hi << 32) | create.addr_lo;
+   win64 = ddw_property_create(DIRECT64_PROPNAME, create.liobn, win_addr,
+   page_shift, len);
+   if (!win64) {
+   dev_info(&dev->dev,
+"couldn't allocate property, property name, or value\n");
+   goto out_del_win;
+   }
+
+   ret = 

Re: [PATCH 1/2] audit: add support for the openat2 syscall

2021-04-23 Thread Christian Brauner
On Thu, Apr 22, 2021 at 10:34:08PM -0400, Richard Guy Briggs wrote:
> On 2021-03-18 08:08, Richard Guy Briggs wrote:
> > On 2021-03-18 11:48, Christian Brauner wrote:
> > > [+Cc Aleksa, the author of openat2()]
> > 
> > Ah!  Thanks for pulling in Aleksa.  I thought I caught everyone...
> > 
> > > and a comment below. :)
> > 
> > Same...
> > 
> > > On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote:
> > > > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
> > > > ("open: introduce openat2(2) syscall")
> > > > 
> > > > Add the openat2(2) syscall to the audit syscall classifier.
> > > > 
> > > > See the github issue
> > > > https://github.com/linux-audit/audit-kernel/issues/67
> > > > 
> > > > Signed-off-by: Richard Guy Briggs 
> > > > ---
> > > >  arch/alpha/kernel/audit.c  | 2 ++
> > > >  arch/ia64/kernel/audit.c   | 2 ++
> > > >  arch/parisc/kernel/audit.c | 2 ++
> > > >  arch/parisc/kernel/compat_audit.c  | 2 ++
> > > >  arch/powerpc/kernel/audit.c| 2 ++
> > > >  arch/powerpc/kernel/compat_audit.c | 2 ++
> > > >  arch/s390/kernel/audit.c   | 2 ++
> > > >  arch/s390/kernel/compat_audit.c| 2 ++
> > > >  arch/sparc/kernel/audit.c  | 2 ++
> > > >  arch/sparc/kernel/compat_audit.c   | 2 ++
> > > >  arch/x86/ia32/audit.c  | 2 ++
> > > >  arch/x86/kernel/audit_64.c | 2 ++
> > > >  kernel/auditsc.c   | 3 +++
> > > >  lib/audit.c| 4 
> > > >  lib/compat_audit.c | 4 
> > > >  15 files changed, 35 insertions(+)
> > > > 
> > > > diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
> > > > index 96a9d18ff4c4..06a911b685d1 100644
> > > > --- a/arch/alpha/kernel/audit.c
> > > > +++ b/arch/alpha/kernel/audit.c
> > > > @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
> > > > index 5192ca899fe6..5eaa888c8fd3 100644
> > > > --- a/arch/ia64/kernel/audit.c
> > > > +++ b/arch/ia64/kernel/audit.c
> > > > @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
> > > > index 9eb47b2225d2..fc721a7727ba 100644
> > > > --- a/arch/parisc/kernel/audit.c
> > > > +++ b/arch/parisc/kernel/audit.c
> > > > @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/parisc/kernel/compat_audit.c 
> > > > b/arch/parisc/kernel/compat_audit.c
> > > > index 20c39c9d86a9..fc6d35918c44 100644
> > > > --- a/arch/parisc/kernel/compat_audit.c
> > > > +++ b/arch/parisc/kernel/compat_audit.c
> > > > @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 1;
> > > > }
> > > > diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
> > > > index a27f3d09..8f32700b0baa 100644
> > > > --- a/arch/powerpc/kernel/audit.c
> > > > +++ b/arch/powerpc/kernel/audit.c
> > > > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 4;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/powerpc/kernel/compat_audit.c 
> > > > b/arch/powerpc/kernel/compat_audit.c
> > > > index 55c6ccda0a85..ebe45534b1c9 100644
> > > > --- a/arch/powerpc/kernel/compat_audit.c
> > > > +++ b/arch/powerpc/kernel/compat_audit.c
> > > > @@ -38,6 +38,8 @@ int ppc32_classify_syscall(unsigned syscall)
> > > > return 4;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 1;
> > > > }
> > > > diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c
> > > > index d395c6c9944c..d964cb94cfaf 100644
> 

Re: [PATCH 1/2] powerpc/vdso64: link vdso64 with linker

2021-04-23 Thread Christophe Leroy




On 23/04/2021 at 00:44, Nick Desaulniers wrote:

On Wed, Sep 2, 2020 at 11:02 AM Christophe Leroy
 wrote:




On 02/09/2020 at 19:41, Nick Desaulniers wrote:

On Wed, Sep 2, 2020 at 5:14 AM Michael Ellerman  wrote:


Nick Desaulniers  writes:

Fixes: commit f2af201002a8 ("powerpc/build: vdso linker warning for orphan 
sections")


I think I'll just revert that for v5.9 ?


SGTM; you'll probably still want these changes with some modifications
at some point; vdso32 did have at least one orphaned section, and will
be important for hermetic builds.  Seeing crashes in supported
versions of the tools ties our hands at the moment.



Leaving aside the tool problem with binutils 2.26, do you have a way to
really link an elf32ppc object when building vdso32 for PPC64?


Sorry, I'm doing a bug scrub and found
https://github.com/ClangBuiltLinux/linux/issues/774 still open (and my
reply to this thread still in Drafts; never sent). With my patches
rebased:
$ file arch/powerpc/kernel/vdso32/vdso32.so
arch/powerpc/kernel/vdso32/vdso32.so: ELF 32-bit MSB shared object,
PowerPC or cisco 4500, version 1 (SYSV), dynamically linked, stripped

Are you still using 2.26?


Yes, our production kernels and applications are built with gcc 5.5 and 
binutils 2.26



I'm not able to repro Nathan's reported issue from
https://lore.kernel.org/lkml/20200902052123.GA2687902@ubuntu-n2-xlarge-x86/,
so I'm curious if I should resend the rebased patches as v2?



I can't remember what all this discussion was about.

I gave your rebased patches a try.

Still an issue with binutils 2.26:

  VDSO32L arch/powerpc/kernel/vdso32/vdso32.so.dbg
ppc-linux-ld: warning: orphan section `.rela.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' being 
placed in section `.rela.dyn'.
ppc-linux-ld: warning: orphan section `.rela.plt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being 
placed in section `.rela.dyn'.
ppc-linux-ld: warning: orphan section `.glink' from `arch/powerpc/kernel/vdso32/sigtramp.o' being 
placed in section `.glink'.
ppc-linux-ld: warning: orphan section `.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' being 
placed in section `.iplt'.
ppc-linux-ld: warning: orphan section `.rela.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.rela.dyn'.
ppc-linux-ld: warning: orphan section `.rela.text' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.rela.dyn'.
/bin/sh: line 1:  7850 Segmentation fault  (core dumped) ppc-linux-ld -EB -m elf32ppc -shared 
-soname linux-vdso32.so.1 --eh-frame-hdr --orphan-handling=warn -T 
arch/powerpc/kernel/vdso32/vdso32.lds arch/powerpc/kernel/vdso32/sigtramp.o 
arch/powerpc/kernel/vdso32/gettimeofday.o arch/powerpc/kernel/vdso32/datapage.o 
arch/powerpc/kernel/vdso32/cacheflush.o arch/powerpc/kernel/vdso32/note.o 
arch/powerpc/kernel/vdso32/getcpu.o arch/powerpc/kernel/vdso32/vgettimeofday.o -o 
arch/powerpc/kernel/vdso32/vdso32.so.dbg

make[2]: *** [arch/powerpc/kernel/vdso32/vdso32.so.dbg] Error 139
make[2]: *** Deleting file `arch/powerpc/kernel/vdso32/vdso32.so.dbg'



With gcc 10.1 and binutils 2.34 I get:

PPC32 build:

  VDSO32L arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: warning: orphan section `.rela.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.plt' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.glink' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.glink'
powerpc64-linux-ld: warning: orphan section `.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.iplt'
powerpc64-linux-ld: warning: orphan section `.rela.iplt' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.branch_lt' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.text' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'



PPC64 build:

  VDSO32L arch/powerpc/kernel/vdso32/vdso32.so.dbg
powerpc64-linux-ld: warning: orphan section `.rela.got' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.plt' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.glink' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.glink'
powerpc64-linux-ld: warning: orphan section `.iplt' from `arch/powerpc/kernel/vdso32/sigtramp.o' 
being placed in section `.iplt'
powerpc64-linux-ld: warning: orphan section `.rela.iplt' from 
`arch/powerpc/kernel/vdso32/sigtramp.o' being placed in section `.rela.dyn'
powerpc64-linux-ld: warning: orphan section `.rela.branch_lt' from 

Re: [PATCH] cpuidle/pseries: Fixup CEDE0 latency only for POWER10 onwards

2021-04-23 Thread Michal Suchánek
On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> From: "Gautham R. Shenoy" 
> 
> Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> of the Extended CEDE states advertised by the platform
> 
> On some of the POWER9 LPARs, the older firmwares advertise a very low
> value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
Can you be more specific about 'older firmwares'?

Also, while this is a performance regression on such firmwares, it
should be fixed by updating the firmware to the current version.

Having sub-optimal performance on obsolete firmware should not require a
kernel workaround, should it?

It's not like the kernel would crash on the affected firmware.

Thanks

Michal


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-23 Thread Christophe Leroy




On 20/04/2021 at 05:28, Alexei Starovoitov wrote:

On Sat, Apr 17, 2021 at 1:16 AM Christophe Leroy
 wrote:




On 16/04/2021 at 01:49, Alexei Starovoitov wrote:

On Thu, Apr 15, 2021 at 8:41 AM Quentin Monnet  wrote:


2021-04-15 16:37 UTC+0200 ~ Daniel Borkmann 

On 4/15/21 11:32 AM, Jianlin Lv wrote:

For debugging JITs, dumping the JITed image to the kernel log is discouraged;
"bpftool prog dump jited" is a much better way to examine JITed dumps.
This patch gets rid of the code related to bpf_jit_enable=2 mode and
updates the proc handler of bpf_jit_enable; it also adds auxiliary
information explaining how to use the bpf_jit_disasm tool after this change.

Signed-off-by: Jianlin Lv 


Hello,

For what it's worth, I have already seen people dump the JIT image in
kernel logs in Qemu VMs running with just a busybox, not for kernel
development, but in a context where building/using bpftool was not
possible.


If building/using bpftool is not possible then the majority of selftests won't
be exercised. I don't think such an environment is suitable for any kind
of bpf development, even more so for JIT debugging.
bpf_jit_enable=2 is nothing but a debugging tool for JIT developers, and
I'd rather nuke that code instead of carrying it from kernel to kernel.



When I implemented the JIT for PPC32, it was extremely helpful.

As far as I understand, for the time being bpftool is not usable in my
environment because it doesn't support cross compilation when the target's
endianness differs from the building host's endianness, see discussion at
https://lore.kernel.org/bpf/21e66a09-514f-f426-b9e2-13baab0b9...@csgroup.eu/

That's right that selftests can't be exercised because they don't build.

The question might be naive, as I didn't investigate much about the replacement
of "bpf_jit_enable=2 debugging mode" by bpftool: how do we use bpftool exactly
for that? Especially when using the BPF test module?


The kernel developers can add any amount of printk and dumps to debug
their code, but such debugging aids should not be part of the production
kernel.
That sysctl was two things at once: a debugging tool for kernel devs and
introspection for users.
The bpftool jited dump solves the second part; it provides JIT introspection
to users.
Debugging of the kernel can be done with any amount of auxiliary code,
including calling print_hex_dump() during JITing.
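For reference, a minimal sketch of that kind of throwaway debugging aid inside
an arch JIT (print_hex_dump() is declared in linux/printk.h; "image" and
"proglen" stand in for whatever names the JIT uses locally for the generated
buffer and its length):

	/* temporary, developer-only: hex-dump the freshly JITed image */
	print_hex_dump(KERN_DEBUG, "jit: ", DUMP_PREFIX_OFFSET,
		       16, 1, image, proglen, false);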



I finally managed to cross compile bpftool with libbpf, libopcodes, readline,
ncurses, libcap, libz and all the needed stuff. It was not easy, but I made it.


Now, how do I use it?

Let's say I want to dump the jitted code generated from a call to 'tcpdump'.
How do I do that with 'bpftool prog dump jited'?


I thought that by calling this line I would get programs dumped one way or
another, just like when setting 'bpf_jit_enable=2', but calling that line just
gives me some bpftool help text.


By the way, it would be nice to have a kernel OPTION that selects all the
OPTIONS required for building bpftool, because you discover them one by one at
every build failure. I had to add CONFIG_IPV6, CONFIG_DEBUG_BTF,
CONFIG_CGROUPS, ... If there were an option like "Build a 'bpftool'-ready
kernel" that selected all those, it would be great.


Christophe


Re: [PATCH v3 01/11] powerpc/pseries/iommu: Replace hard-coded page shift

2021-04-23 Thread Alexey Kardashevskiy




On 22/04/2021 17:07, Leonardo Bras wrote:

Some functions assume IOMMU page size can only be 4K (pageshift == 12).
Update them to accept any page size passed, so we can use 64K pages.

In the process, some defines like TCE_SHIFT were made obsolete, and then
removed.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figures 3.4 and 3.5, shows
a 52-bit RPN and considers a 12-bit pageshift, so there should be
no need to use TCE_RPN_MASK, which masks out any bit after 40 in rpn.
Its usage is removed from tce_build_pSeries(), tce_build_pSeriesLP(), and
tce_buildmulti_pSeriesLP().



After rereading the patch, I wonder why we had this TCE_RPN_MASK at all,
but what is certain is that this has nothing to do with IODA3, as these
TCEs are guest phys addresses in pseries and IODA3 is bare metal. Except...




Most places had a tbl struct, so using tbl->it_page_shift was simple.
tce_free_pSeriesLP() was a special case, since callers do not always have a
tbl struct, so adding a tceshift parameter seems the right thing to do.

Signed-off-by: Leonardo Bras 
Reviewed-by: Alexey Kardashevskiy 
---
  arch/powerpc/include/asm/tce.h |  8 --
  arch/powerpc/platforms/pseries/iommu.c | 39 +++---
  2 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
index db5fc2f2262d..0c34d2756d92 100644
--- a/arch/powerpc/include/asm/tce.h
+++ b/arch/powerpc/include/asm/tce.h
@@ -19,15 +19,7 @@
  #define TCE_VB		0
  #define TCE_PCI   1
  
-/* TCE page size is 4096 bytes (1 << 12) */

-
-#define TCE_SHIFT  12
-#define TCE_PAGE_SIZE  (1 << TCE_SHIFT)
-
  #define TCE_ENTRY_SIZE	8	/* each TCE is 64 bits */
-
-#define TCE_RPN_MASK	0xfffffffffful	/* 40-bit RPN (4K pages) */
-#define TCE_RPN_SHIFT  12
  #define TCE_VALID 0x800   /* TCE valid */
  #define TCE_ALLIO 0x400   /* TCE valid for all lpars */
  #define TCE_PCI_WRITE 0x2 /* write from PCI allowed */
diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 67c9953a6503..796ab356341c 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -107,6 +107,8 @@ static int tce_build_pSeries(struct iommu_table *tbl, long 
index,
u64 proto_tce;
__be64 *tcep;
u64 rpn;
+   const unsigned long tceshift = tbl->it_page_shift;
+   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);


(nit: only used once)

  
  	proto_tce = TCE_PCI_READ; // Read allowed
  
@@ -117,10 +119,10 @@ static int tce_build_pSeries(struct iommu_table *tbl, long index,



... this is pseries which is not pseriesLP, i.e. no LPAR == bare-metal
pseries such as ancient power5 or cellbe (I guess), and for those
TCE_RPN_MASK may actually make sense, so keep it.
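For illustration, keeping the mask on that path while still honouring the
table's page shift could look roughly like this (assuming TCE_RPN_MASK stays
defined in tce.h):

	rpn = __pa(uaddr) >> tceshift;
	*tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << tceshift);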


The rest of the patch looks good. Thanks,


  
  	while (npages--) {

/* can't move this out since we might cross MEMBLOCK boundary */
-   rpn = __pa(uaddr) >> TCE_SHIFT;
-   *tcep = cpu_to_be64(proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT);
+   rpn = __pa(uaddr) >> tceshift;
+   *tcep = cpu_to_be64(proto_tce | rpn << tceshift);
  
-		uaddr += TCE_PAGE_SIZE;

+   uaddr += pagesize;
tcep++;
}
return 0;
@@ -146,7 +148,7 @@ static unsigned long tce_get_pseries(struct iommu_table 
*tbl, long index)
return be64_to_cpu(*tcep);
  }
  
-static void tce_free_pSeriesLP(unsigned long liobn, long, long);

+static void tce_free_pSeriesLP(unsigned long liobn, long, long, long);
  static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
  
  static int tce_build_pSeriesLP(unsigned long liobn, long tcenum, long tceshift,

@@ -166,12 +168,12 @@ static int tce_build_pSeriesLP(unsigned long liobn, long 
tcenum, long tceshift,
proto_tce |= TCE_PCI_WRITE;
  
  	while (npages--) {

-   tce = proto_tce | (rpn & TCE_RPN_MASK) << tceshift;
+   tce = proto_tce | rpn << tceshift;
rc = plpar_tce_put((u64)liobn, (u64)tcenum << tceshift, tce);
  
  		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {

ret = (int)rc;
-   tce_free_pSeriesLP(liobn, tcenum_start,
+   tce_free_pSeriesLP(liobn, tcenum_start, tceshift,
   (npages_start - (npages + 1)));
break;
}
@@ -205,10 +207,11 @@ static int tce_buildmulti_pSeriesLP(struct iommu_table 
*tbl, long tcenum,
long tcenum_start = tcenum, npages_start = npages;
int ret = 0;
unsigned long flags;
+   const unsigned long tceshift = tbl->it_page_shift;
  
  	if ((npages == 1) || !firmware_has_feature(FW_FEATURE_PUT_TCE_IND)) {

return 

Re: [PATCH v5 01/16] swiotlb: Fix the type of index

2021-04-23 Thread Christoph Hellwig
On Thu, Apr 22, 2021 at 04:14:53PM +0800, Claire Chang wrote:
> Fix the type of index from unsigned int to int since find_slots() might
> return -1.
> 
> Fixes: 0774983bc923 ("swiotlb: refactor swiotlb_tbl_map_single")
> Signed-off-by: Claire Chang 

Looks good:

Reviewed-by: Christoph Hellwig 

it really should go into 5.12.  I'm not sure if Konrad is going to
be able to queue this up due to his vacation, so I'm tempted to just
queue it up in the dma-mapping tree.
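As an aside, a generic illustration of the signedness pitfall the patch
removes, not the swiotlb code itself (find_free_slot() is a hypothetical
stand-in for find_slots()):

int find_free_slot(void);	/* returns a slot number, or -1 on failure */

int buggy(void)
{
	unsigned int index = find_free_slot();

	if (index < 0)		/* never true: index is unsigned */
		return -1;	/* so the failure path is silently skipped */
	return index;
}

int fixed(void)
{
	int index = find_free_slot();

	if (index < 0)		/* now catches the -1 failure case */
		return -1;
	return index;
}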