Re: [RESEND PATCH net v4 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2024-04-09 Thread Christophe Leroy
Hi Vladimir,

Le 19/02/2024 à 16:30, Vladimir Oltean a écrit :
> Hi Sean,
> 
> On Thu, Feb 15, 2024 at 11:23:26AM -0500, Sean Anderson wrote:
>> smp_call_function_single disables IRQs when executing the callback. To
>> prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
>> This is already done by qman_update_cgr and qman_delete_cgr; fix the
>> other lockers.
>>
>> Fixes: 96f413f47677 ("soc/fsl/qbman: fix issue in qman_delete_cgr_safe()")
>> CC: sta...@vger.kernel.org
>> Signed-off-by: Sean Anderson 
>> Reviewed-by: Camelia Groza 
>> Tested-by: Vladimir Oltean 
>> ---
>> I got no response the first time I sent this, so I am resending to net.
>> This issue was introduced in a series which went through net, so I hope
>> it makes sense to take it via net.
>>
>> [1] 
>> https://lore.kernel.org/linux-arm-kernel/20240108161904.2865093-1-sean.ander...@seco.com/
>>
>> (no changes since v3)
>>
>> Changes in v3:
>> - Change blamed commit to something more appropriate
>>
>> Changes in v2:
>> - Fix one additional call to spin_unlock
> 
> Leo Li (Li Yang) is no longer with NXP. Until we figure out within NXP
> how to continue with the maintainership of drivers/soc/fsl/, yes, please
> continue to submit this series to 'net'. I would also like to point
> out to Arnd that this is the case.
> 
> Arnd, a large portion of drivers/soc/fsl/ is networking-related
> (dpio, qbman). Would it make sense to transfer the maintainership
> of these under the respective networking drivers, to simplify the
> procedures?

I see FREESCALE QUICC ENGINE LIBRARY (drivers/soc/fsl/qe/) is maintained 
by Qiang Zhao  but I can't find any mail from him in 
the past 4 years on the linuxppc-dev list, and every time I wanted to submit 
something I only got responses from Leo Li.

The last commit he reviewed is 661ea25e5319 ("soc: fsl: qe: Replace 
one-element array and use struct_size() helper"), back in May 2020.

Is he still working at NXP and actively maintaining that library? 
Keeping this part maintained is vital for me, as this SoC is embedded in 
the two powerpc platforms I maintain (8xx and 83xx).

If Qiang Zhao is not able to actively maintain that SoC anymore, I 
volunteer to maintain it.

Thanks
Christophe


Re: [PATCH v4] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-04-09 Thread Mahesh J Salgaonkar
On 2024-03-08 19:08:50 Fri, Michael Ellerman wrote:
> Aneesh Kumar K V  writes:
> > On 3/7/24 5:13 PM, Michael Ellerman wrote:
> >> Mahesh Salgaonkar  writes:
> >>> nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
> >>> crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
> >>> interrupt handler) if percpu allocation comes from vmalloc area.
> >>>
> >>> Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
> >>> wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue 
> >>> when
> >>> percpu allocation is from the embedded first chunk. However with
> >>> CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where 
> >>> percpu
> >>> allocation can come from the vmalloc area.
> >>>
> >>> With kernel command line "percpu_alloc=page" we can force percpu 
> >>> allocation
> >>> to come from vmalloc area and can see kernel crash in machine_check_early:
> >>>
> >>> [1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
> >>> [1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
> >>> [1.215719] --- interrupt: 200
> >>> [1.215720] [c00fffd73180] [] 0x0 (unreliable)
> >>> [1.215722] [c00fffd731b0] [] 0x0
> >>> [1.215724] [c00fffd73210] [c0008364] 
> >>> machine_check_early_common+0x134/0x1f8
> >>>
> >>> Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
> >>> first chunk is not embedded.
> >> 
> >> My system (powernv) doesn't even boot with percpu_alloc=page.
> >
> >
> > Can you share the crash details?
> 
> Yes but it's not pretty :)
> 
>   [1.725257][  T714] systemd-journald[714]: Collecting audit messages is 
> disabled.
>   [1.729401][T1] systemd[1]: Finished 
> systemd-tmpfiles-setup-dev.service - Create Static Device Nodes in /dev.
>   [  OK  ] Finished systemd-tmpfiles-…reate Static Device Nodes in /dev.
>   [1.773902][   C22] Disabling lock debugging due to kernel taint
>   [1.773905][   C23] Oops: Machine check, sig: 7 [#1]
>   [1.773911][   C23] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA 
> PowerNV
>   [1.773916][   C23] Modules linked in:
>   [1.773920][   C23] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G   M   
> 6.8.0-rc7-02500-g23515c370cbb #1
>   [1.773924][   C23] Hardware name: 8335-GTH POWER9 0x4e1202 
> opal:skiboot-v6.5.3-35-g1851b2a06 PowerNV
>   [1.773926][   C23] NIP:   LR:  CTR: 
> 
>   [1.773929][   C23] REGS: c00fffa6ef50 TRAP:    Tainted: G   M   
>  (6.8.0-rc7-02500-g23515c370cbb)
>   [1.773932][   C23] MSR:   <>  CR:   XER: 
> 
>   [1.773937][   C23] CFAR:  IRQMASK: 3 
>   [1.773937][   C23] GPR00:  c00fffa6efe0 
> c00fffa6efb0  
>   [1.773937][   C23] GPR04: c003d8c0 c1f5f000 
>  0103 
>   [1.773937][   C23] GPR08: 0003 653a0d962a590300 
>   
>   [1.773937][   C23] GPR12: c00fffa6f280  
> c00084a4  
>   [1.773937][   C23] GPR16: 53474552  
> c003d8c0 c00fffa6f280 
>   [1.773937][   C23] GPR20: c1f5f000 c00fffa6f340 
> c00fffa6f2e8  
>   [1.773937][   C23] GPR24: 0007fecf c65bbb80 
> 00550102 c2172b20 
>   [1.773937][   C23] GPR28:  53474552 
>  c00c6d80 
>   [1.773982][   C23] NIP [] 0x0
>   [1.773988][   C23] LR [] 0x0
>   [1.773990][   C23] Call Trace:
>   [1.773991][   C23] [c00fffa6efe0] [c1f5f000] 
> .TOC.+0x0/0xa1000 (unreliable)
>   [1.773999][   C23] Code:      
>         
>    
>   [1.774021][   C23] ---[ end trace  ]---
> 
> Something has gone badly wrong.
> 
> That was a test kernel with some other commits, but nothing that should
> cause that. Removing percpu_alloc=page fixes it.

So, when I try this without my patch "Avoid nmi_enter/nmi_exit in real
mode interrupt", I see the crash getting recreated. However, I was not
able to recreate it even once with my patch applied. Are you able to see
this crash with my patch?

Thanks,
-Mahesh.



[PATCH v5] powerpc: Avoid nmi_enter/nmi_exit in real mode interrupt.

2024-04-09 Thread Mahesh Salgaonkar
nmi_enter()/nmi_exit() touches per cpu variables which can lead to kernel
crash when invoked during real mode interrupt handling (e.g. early HMI/MCE
interrupt handler) if percpu allocation comes from vmalloc area.

Early HMI/MCE handlers are called through DEFINE_INTERRUPT_HANDLER_NMI()
wrapper which invokes nmi_enter/nmi_exit calls. We don't see any issue when
percpu allocation is from the embedded first chunk. However with
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there are chances where percpu
allocation can come from the vmalloc area.

With kernel command line "percpu_alloc=page" we can force percpu allocation
to come from vmalloc area and can see kernel crash in machine_check_early:

[1.215714] NIP [c0e49eb4] rcu_nmi_enter+0x24/0x110
[1.215717] LR [c00461a0] machine_check_early+0xf0/0x2c0
[1.215719] --- interrupt: 200
[1.215720] [c00fffd73180] [] 0x0 (unreliable)
[1.215722] [c00fffd731b0] [] 0x0
[1.215724] [c00fffd73210] [c0008364] 
machine_check_early_common+0x134/0x1f8

Fix this by avoiding use of nmi_enter()/nmi_exit() in real mode if percpu
first chunk is not embedded.

Reviewed-by: Christophe Leroy 
Tested-by: Shirisha Ganta 
Signed-off-by: Mahesh Salgaonkar 
---
Changes in v5:
- Invert the percpu chunk checks as suggested by mpe.
- Fix microwatt build; the microwatt config has CONFIG_PPC64=y and CONFIG_SMP=n.
- v4 at 
https://lore.kernel.org/linuxppc-dev/20240214095146.1527369-1-mah...@linux.ibm.com/

Changes in v4:
- Fix coding style issues.

Changes in v3:
- Address comments from Christophe Leroy to avoid using #ifdefs in the
  code
- v2 at 
https://lore.kernel.org/linuxppc-dev/20240205053647.1763446-1-mah...@linux.ibm.com/

Changes in v2:
- Rebase to upstream master
- Use jump_labels, if CONFIG_JUMP_LABEL is enabled, to avoid redoing the
  embed first chunk test at each interrupt entry.
- v1 is at 
https://lore.kernel.org/linuxppc-dev/164578465828.74956.6065296024817333750.stgit@jupiter/
---
 arch/powerpc/include/asm/interrupt.h | 10 ++
 arch/powerpc/include/asm/percpu.h| 10 ++
 arch/powerpc/kernel/setup_64.c   |  2 ++
 3 files changed, 22 insertions(+)

diff --git a/arch/powerpc/include/asm/interrupt.h 
b/arch/powerpc/include/asm/interrupt.h
index 7b610864b3645..2d6c886b40f44 100644
--- a/arch/powerpc/include/asm/interrupt.h
+++ b/arch/powerpc/include/asm/interrupt.h
@@ -336,6 +336,14 @@ static inline void interrupt_nmi_enter_prepare(struct 
pt_regs *regs, struct inte
if (IS_ENABLED(CONFIG_KASAN))
return;
 
+   /*
+* Likewise, do not use it in real mode if percpu first chunk is not
+* embedded. With CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK enabled there
+* are chances where percpu allocation can come from vmalloc area.
+*/
+   if (percpu_first_chunk_is_paged)
+   return;
+
/* Otherwise, it should be safe to call it */
nmi_enter();
 }
@@ -351,6 +359,8 @@ static inline void interrupt_nmi_exit_prepare(struct 
pt_regs *regs, struct inter
// no nmi_exit for a pseries hash guest taking a real mode 
exception
} else if (IS_ENABLED(CONFIG_KASAN)) {
// no nmi_exit for KASAN in real mode
+   } else if (percpu_first_chunk_is_paged) {
+   // no nmi_exit if percpu first chunk is not embedded
} else {
nmi_exit();
}
diff --git a/arch/powerpc/include/asm/percpu.h 
b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c6..634970ce13c6b 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -15,6 +15,16 @@
 #endif /* CONFIG_SMP */
 #endif /* __powerpc64__ */
 
+#if defined(CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK) && defined(CONFIG_SMP)
+#include 
+DECLARE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
+
+#define percpu_first_chunk_is_paged	\
+	(static_key_enabled(&__percpu_first_chunk_is_paged.key))
+#else
+#define percpu_first_chunk_is_paged	false
+#endif /* CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK && CONFIG_SMP */
+
 #include 
 
 #include 
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 2f19d5e944852..ae36a129789ff 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -834,6 +834,7 @@ static __init int pcpu_cpu_to_node(int cpu)
 
 unsigned long __per_cpu_offset[NR_CPUS] __read_mostly;
 EXPORT_SYMBOL(__per_cpu_offset);
+DEFINE_STATIC_KEY_FALSE(__percpu_first_chunk_is_paged);
 
 void __init setup_per_cpu_areas(void)
 {
@@ -876,6 +877,7 @@ void __init setup_per_cpu_areas(void)
if (rc < 0)
panic("cannot initialize percpu area (err=%d)", rc);
 
+   static_key_enable(&__percpu_first_chunk_is_paged.key);
delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
for_each_possible_cpu(cpu) {
 __per_cpu_offset[cpu] = delta + pcpu_unit_offsets[cpu];
-- 
2.44.0



Re: [PATCH] KVM: PPC: Book3S HV nestedv2: Cancel pending HDEC exception

2024-04-09 Thread Nicholas Piggin
On Mon Apr 8, 2024 at 3:20 PM AEST, Michael Ellerman wrote:
> Thorsten Leemhuis  writes:
> > On 05.04.24 05:20, Michael Ellerman wrote:
> >> "Linux regression tracking (Thorsten Leemhuis)"
> >>  writes:
> >>> Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> >>> for once, to make this easily accessible to everyone.
> >>>
> >>> Was this regression ever resolved? Doesn't look like it, but maybe I
> >>> just missed something.
> >> 
> >> I'm not sure how it ended up on the regression list.
> >
> > That is easy to explain: I let lei search for mails containing words
> > like regress, bisect, and revert to become aware of regressions that
> > might need tracking. And...
> >
> >> IMHO it's not really a regression.
> >
> > ...sometimes I misjudge or misinterpret something and add it to the
> > regression tracking. Looks like that happened here.
> >
> > Sorry for that and the noise it caused!
>
> No worries.

It is actually a regression. It only really affects performance,
but the logic is broken (my fault). We were going to revert it and
solve the initial regression in a better way.

Thanks,
Nick


Re: [PATCH] vdso: Fix powerpc build U64_MAX undeclared error

2024-04-09 Thread Stephen Rothwell
Hi Adrian,

On Tue,  9 Apr 2024 09:26:39 +0300 Adrian Hunter  
wrote:
>
> U64_MAX is not in include/vdso/limits.h, although that isn't noticed on x86
> because x86 includes include/linux/limits.h indirectly. However powerpc
> is more selective, resulting in the following build error:
> 
>   In file included from :
>   lib/vdso/gettimeofday.c: In function 'vdso_calc_ns':
>   lib/vdso/gettimeofday.c:11:33: error: 'U64_MAX' undeclared
>  11 | # define VDSO_DELTA_MASK(vd)U64_MAX
> | ^~~
> 
> Use ULLONG_MAX instead which will work just as well and is in
> include/vdso/limits.h.
> 
> Reported-by: Stephen Rothwell 
> Closes: https://lore.kernel.org/all/20240409124905.6816d...@canb.auug.org.au/
> Fixes: c8e3a8b6f2e6 ("vdso: Consolidate vdso_calc_delta()")
> Signed-off-by: Adrian Hunter 
> ---
>  lib/vdso/gettimeofday.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
> index 9c3a8d2440c9..899850bd6f0b 100644
> --- a/lib/vdso/gettimeofday.c
> +++ b/lib/vdso/gettimeofday.c
> @@ -8,7 +8,7 @@
>  #ifndef vdso_calc_ns
>  
>  #ifdef VDSO_DELTA_NOMASK
> -# define VDSO_DELTA_MASK(vd) U64_MAX
> +# define VDSO_DELTA_MASK(vd) ULLONG_MAX
>  #else
>  # define VDSO_DELTA_MASK(vd) (vd->mask)
>  #endif
> -- 
> 2.34.1
> 

I have applied that to linux-next today and it builds for me.

Tested-by: Stephen Rothwell  # build only

-- 
Cheers,
Stephen Rothwell




Re: [PATCH] powerpc/pseries: remove returning ENODEV when uevent is triggered

2024-04-09 Thread Lidong Zhong
Hi Michael,

Thanks for your reply.

On Tue, Apr 9, 2024 at 4:46 PM Michael Ellerman  wrote:

> Hi Lidong,
>
> Thanks for the patch.
>
> I'm not an expert on udev etc. so apologies if any of these questions
> are stupid.
>
> Lidong Zhong  writes:
> > We have noticed the following nuisance messages during boot
> >
> > [7.120610][ T1060] vio vio: uevent: failed to send synthetic uevent
> > [7.122281][ T1060] vio 4000: uevent: failed to send synthetic uevent
> > [7.122304][ T1060] vio 4001: uevent: failed to send synthetic uevent
> > [7.122324][ T1060] vio 4002: uevent: failed to send synthetic uevent
> > [7.122345][ T1060] vio 4004: uevent: failed to send synthetic uevent
> >
> > It's caused by vio_register_device_node() either failing to set
> > dev->of_node or the "compatible" property being missing. Try to
> > return as much information as possible instead of failing.
>
> Does udev etc. cope with that? Can we just change the content of the
> MODALIAS value like that?
>
> With this patch we'll start emitting uevents for devices we previously
> didn't. I guess that's OK because nothing is expecting them?
>
> Can you include a log of udev showing the event firing and that nothing
> breaks.
>
> On my system here I see nothing matches the devices except for libvpd,
> which seems to match lots of things.
>

It's an issue reported by our customer. I am sorry I can't provide more
information because I don't have the environment to reproduce this
issue. The only related log I got is shown below:

Feb 07 14:08:03 rb3i0060 udevadm[623]: vio: Failed to write 'add' to
'/sys/devices/vio/uevent', ignoring: No such device

Feb 07 14:08:03 rb3i0060 kernel: synth uevent: /devices/vio: failed to send
uevent

Feb 07 14:08:03 rb3i0060 kernel: vio vio: uevent: failed to send synthetic
uevent

Feb 07 14:08:03 rb3i0060 kernel: synth uevent: /devices/vio/4000: failed to
send uevent

Feb 07 14:08:03 rb3i0060 kernel: vio 4000: uevent: failed to send synthetic
uevent

Feb 07 14:08:03 rb3i0060 kernel: synth uevent: /devices/vio/4001: failed to
send uevent

Feb 07 14:08:03 rb3i0060 kernel: vio 4001: uevent: failed to send synthetic
uevent

Feb 07 14:08:03 rb3i0060 kernel: synth uevent: /devices/vio/4002: failed to
send uevent

Feb 07 14:08:03 rb3i0060 kernel: vio 4002: uevent: failed to send synthetic
uevent

Feb 07 14:08:03 rb3i0060 kernel: synth uevent: /devices/vio/4004: failed to
send uevent

Feb 07 14:08:03 rb3i0060 kernel: vio 4004: uevent: failed to send synthetic
uevent

Feb 07 14:08:03 rb3i0060 udevadm[623]: 4000: Failed to write 'add' to
'/sys/devices/vio/4000/uevent', ignoring: No such device

Feb 07 14:08:03 rb3i0060 udevadm[623]: 4001: Failed to write 'add' to
'/sys/devices/vio/4001/uevent', ignoring: No such device

Feb 07 14:08:03 rb3i0060 udevadm[623]: 4002: Failed to write 'add' to
'/sys/devices/vio/4002/uevent', ignoring: No such device

Feb 07 14:08:03 rb3i0060 udevadm[623]: 4004: Failed to write 'add' to
'/sys/devices/vio/4004/uevent', ignoring: No such device

The systemd-udev-trigger service calls 'udevadm trigger --type=devices
--action=add' and the kernel returns -ENODEV because either
dev->of_node is NULL or the 'compatible' property is not present.
Similar cases have been reported before, for example:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1827162
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1845319
I don't think it causes real problems, but it is confusing to users.


> > diff --git a/arch/powerpc/platforms/pseries/vio.c
> b/arch/powerpc/platforms/pseries/vio.c
> > index 90ff85c879bf..62961715ca24 100644
> > --- a/arch/powerpc/platforms/pseries/vio.c
> > +++ b/arch/powerpc/platforms/pseries/vio.c
> > @@ -1593,12 +1593,13 @@ static int vio_hotplug(const struct device *dev,
> struct kobj_uevent_env *env)
> >
> >   dn = dev->of_node;
> >   if (!dn)
> > - return -ENODEV;
> > + goto out;
> >   cp = of_get_property(dn, "compatible", NULL);
> >   if (!cp)
> > - return -ENODEV;
> > -
> > - add_uevent_var(env, "MODALIAS=vio:T%sS%s", vio_dev->type, cp);
> > + add_uevent_var(env, "MODALIAS=vio:T%s", vio_dev->type);
>
> If it's OK to skip the compatible property then we don't need the
> of_node at all, and we could always emit this, even when of_node is not
> available.
>

You mean something like this?

@@ -1592,13 +1592,10 @@ static int vio_hotplug(const struct device *dev,
struct kobj_uevent_env *env)
	const char *cp;

	dn = dev->of_node;
-	if (!dn)
-		return -ENODEV;
-	cp = of_get_property(dn, "compatible", NULL);
-	if (!cp)
-		return -ENODEV;
-
-	add_uevent_var(env, "MODALIAS=vio:T%sS%s", vio_dev->type, cp);
+	cp = dn ? of_get_property(dn, "compatible", NULL) : NULL;
+	if (cp)
+		add_uevent_var(env, "MODALIAS=vio:T%sS%s", vio_dev->type, cp);
+	else
+		add_uevent_var(env, "MODALIAS=vio:T%s", vio_dev->type);
	return 0;



Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2

2024-04-09 Thread Jason Gunthorpe
On Fri, Apr 05, 2024 at 05:42:44PM -0400, Peter Xu wrote:
> In short, hugetlb mappings shouldn't be special comparing to other huge pXd
> and large folio (cont-pXd) mappings for most of the walkers in my mind, if
> not all.  I need to look at all the walkers and there can be some tricky
> ones, but I believe that applies in general.  It's actually similar to what
> I did with slow gup here.

I think that is the big question, I also haven't done the research to
know the answer.

At this point focusing on moving what is reasonable to the pXX_* API
makes sense to me. Then reviewing what remains and making some
decision.

> Like this series, for cont-pXd we'll need multiple walks compared to
> before (when we had hugetlb_entry()), but for that part I'll provide some
> performance tests too, and we also have a fallback plan, which is to detect
> cont-pXd existence, which will also work for large folios.

I think we can optimize this pretty easily.
 
> > I think if you do the easy places for pXX conversion you will have a
> > good idea about what is needed for the hard places.
> 
> Here IMHO we don't need to understand "what is the size of this hugetlb
> vma"

> Yeah, I never really understood why hugetlb was linked to the VMA... The
> page table is self-describing, obviously.

> or "which level of pgtable does this hugetlb vma pages locate",

Ditto

> because we may not need that, e.g., when we only want to collect some smaps
> statistics.  "whether it's hugetlb" may matter, though. E.g. in the mm
> walker we see a huge pmd, it can be a thp, it can be a hugetlb (when
> hugetlb_entry removed), we may need extra check later to put things into
> the right bucket, but for the walker itself it doesn't necessarily need
> hugetlb_entry().

Right, places may still need to know it is part of a huge VMA because we
have special stuff linked to that.

> > But then again we come back to power and its big list of page sizes
> > and variety :( Looks like some there have huge sizes at the pgd level
> > at least.
> 
> Yeah this is something I want to be super clear, because I may miss
> something: we don't have real pgd pages, right?  Powerpc doesn't even
> define p4d_leaf(), AFAICT.

AFAICT it is because it hides it all in hugepd.

If the goal is to purge hugepd then some of the options might turn out
to convert hugepd into huge p4d/pgd, as I understand it. It would be
nice to have certainty on this at least.

We have effectively three APIs to parse a single page table and
currently none of the APIs can return 100% of the data for power.

Jason


RE: [EXT] [PATCH v8 3/6] KEYS: trusted: Introduce NXP DCP-backed trusted keys

2024-04-09 Thread Kshitiz Varshney
Hi David,

> -Original Message-
> From: David Gstir 
> Sent: Wednesday, April 3, 2024 12:51 PM
> To: Mimi Zohar ; James Bottomley
> ; Jarkko Sakkinen ; Herbert Xu
> ; David S. Miller 
> Cc: David Gstir ; Shawn Guo ;
> Jonathan Corbet ; Sascha Hauer
> ; Pengutronix Kernel Team
> ; Fabio Estevam ; dl-linux-
> imx ; Ahmad Fatoum ;
> sigma star Kernel Team ; David Howells
> ; Li Yang ; Paul Moore
> ; James Morris ; Serge E.
> Hallyn ; Paul E. McKenney ;
> Randy Dunlap ; Catalin Marinas
> ; Rafael J. Wysocki
> ; Tejun Heo ; Steven Rostedt
> (Google) ; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; linux-integr...@vger.kernel.org;
> keyri...@vger.kernel.org; linux-cry...@vger.kernel.org; linux-arm-
> ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org; linux-security-
> mod...@vger.kernel.org; Richard Weinberger ; David
> Oberhollenzer 
> Subject: [EXT] [PATCH v8 3/6] KEYS: trusted: Introduce NXP DCP-backed
> trusted keys
> 
> Caution: This is an external email. Please take care when clicking links or
> opening attachments. When in doubt, report the message using the 'Report
> this email' button
> 
> 
> DCP (Data Co-Processor) is the little brother of NXP's CAAM IP.
> Besides accelerated crypto operations, it also offers support for
> hardware-bound keys. Using this feature it is possible to implement a blob
> mechanism similar to what CAAM offers. Unlike on CAAM, constructing and
> parsing the blob has to happen in software (i.e. the kernel).
> 
> The software-based blob format used by DCP trusted keys encrypts the
> payload using AES-128-GCM with a freshly generated random key and
> nonce.
> The random key itself is AES-128-ECB encrypted using the DCP unique or
> OTP key.
> 
> The DCP trusted key blob format is:
> /*
>  * struct dcp_blob_fmt - DCP BLOB format.
>  *
>  * @fmt_version: Format version, currently being %1
>  * @blob_key: Random AES 128 key which is used to encrypt @payload,
>  *@blob_key itself is encrypted with OTP or UNIQUE device key in
>  *AES-128-ECB mode by DCP.
>  * @nonce: Random nonce used for @payload encryption.
>  * @payload_len: Length of the plain text @payload.
>  * @payload: The payload itself, encrypted using AES-128-GCM and
> @blob_key,
>  *   GCM auth tag of size AES_BLOCK_SIZE is attached at the end of it.
>  *
>  * The total size of a DCP BLOB is sizeof(struct dcp_blob_fmt) +
> @payload_len +
>  * AES_BLOCK_SIZE.
>  */
> struct dcp_blob_fmt {
> __u8 fmt_version;
> __u8 blob_key[AES_KEYSIZE_128];
> __u8 nonce[AES_KEYSIZE_128];
> __le32 payload_len;
> __u8 payload[];
> } __packed;
> 
> By default the unique key is used. It is also possible to use the OTP key.
> While the unique key should be unique, it is not documented how this key
> is derived. Therefore selecting the OTP key is supported as well, via the
> use_otp_key module parameter.
> 
> Co-developed-by: Richard Weinberger 
> Signed-off-by: Richard Weinberger 
> Co-developed-by: David Oberhollenzer 
> Signed-off-by: David Oberhollenzer 
> Signed-off-by: David Gstir 
> Reviewed-by: Jarkko Sakkinen 
> ---
>  include/keys/trusted_dcp.h|  11 +
>  security/keys/trusted-keys/Kconfig|   8 +
>  security/keys/trusted-keys/Makefile   |   2 +
>  security/keys/trusted-keys/trusted_core.c |   6 +-
>  security/keys/trusted-keys/trusted_dcp.c  | 313
> ++
>  5 files changed, 339 insertions(+), 1 deletion(-)  create mode 100644
> include/keys/trusted_dcp.h  create mode 100644 security/keys/trusted-
> keys/trusted_dcp.c
> 
> diff --git a/include/keys/trusted_dcp.h b/include/keys/trusted_dcp.h new
> file mode 100644 index ..9aaa42075b40
> --- /dev/null
> +++ b/include/keys/trusted_dcp.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (C) 2021 sigma star gmbh
> + */
> +
> +#ifndef TRUSTED_DCP_H
> +#define TRUSTED_DCP_H
> +
> +extern struct trusted_key_ops dcp_trusted_key_ops;
> +
> +#endif
> diff --git a/security/keys/trusted-keys/Kconfig b/security/keys/trusted-
> keys/Kconfig
> index 553dc117f385..1fb8aa001995 100644
> --- a/security/keys/trusted-keys/Kconfig
> +++ b/security/keys/trusted-keys/Kconfig
> @@ -39,6 +39,14 @@ config TRUSTED_KEYS_CAAM
>   Enable use of NXP's Cryptographic Accelerator and Assurance Module
>   (CAAM) as trusted key backend.
> 
> +config TRUSTED_KEYS_DCP
> +   bool "DCP-based trusted keys"
> +   depends on CRYPTO_DEV_MXS_DCP >= TRUSTED_KEYS
> +   default y
> +   select HAVE_TRUSTED_KEYS
> +   help
> + Enable use of NXP's DCP (Data Co-Processor) as trusted key backend.
> +
>  if !HAVE_TRUSTED_KEYS
> comment "No trust source selected!"
>  endif
> diff --git a/security/keys/trusted-keys/Makefile b/security/keys/trusted-
> keys/Makefile
> index 735aa0bc08ef..f0f3b27f688b 100644
> --- a/security/keys/trusted-keys/Makefile
> +++ b/security/keys/trusted-keys/Makefile
> @@ 

RE: [EXT] Re: [PATCH v8 6/6] docs: trusted-encrypted: add DCP as new trust source

2024-04-09 Thread Kshitiz Varshney
Hi Jarkko,


> -Original Message-
> From: Jarkko Sakkinen 
> Sent: Wednesday, April 3, 2024 9:18 PM
> To: David Gstir ; Mimi Zohar ;
> James Bottomley ; Herbert Xu
> ; David S. Miller 
> Cc: Shawn Guo ; Jonathan Corbet
> ; Sascha Hauer ; Pengutronix
> Kernel Team ; Fabio Estevam
> ; dl-linux-imx ; Ahmad Fatoum
> ; sigma star Kernel Team
> ; David Howells ; Li
> Yang ; Paul Moore ; James
> Morris ; Serge E. Hallyn ; Paul E.
> McKenney ; Randy Dunlap ;
> Catalin Marinas ; Rafael J. Wysocki
> ; Tejun Heo ; Steven Rostedt
> (Google) ; linux-...@vger.kernel.org; linux-
> ker...@vger.kernel.org; linux-integr...@vger.kernel.org;
> keyri...@vger.kernel.org; linux-cry...@vger.kernel.org; linux-arm-
> ker...@lists.infradead.org; linuxppc-dev@lists.ozlabs.org; linux-security-
> mod...@vger.kernel.org; Richard Weinberger ; David
> Oberhollenzer 
> Subject: [EXT] Re: [PATCH v8 6/6] docs: trusted-encrypted: add DCP as new
> trust source
> 
> 
> On Wed Apr 3, 2024 at 10:21 AM EEST, David Gstir wrote:
> > Update the documentation for trusted and encrypted KEYS with DCP as
> > new trust source:
> >
> > - Describe security properties of DCP trust source
> > - Describe key usage
> > - Document blob format
> >
> > Co-developed-by: Richard Weinberger 
> > Signed-off-by: Richard Weinberger 
> > Co-developed-by: David Oberhollenzer
> > 
> > Signed-off-by: David Oberhollenzer 
> > Signed-off-by: David Gstir 
> > ---
> >  .../security/keys/trusted-encrypted.rst   | 53 +++
> >  security/keys/trusted-keys/trusted_dcp.c  | 19 +++
> >  2 files changed, 72 insertions(+)
> >
> > diff --git a/Documentation/security/keys/trusted-encrypted.rst
> > b/Documentation/security/keys/trusted-encrypted.rst
> > index e989b9802f92..f4d7e162d5e4 100644
> > --- a/Documentation/security/keys/trusted-encrypted.rst
> > +++ b/Documentation/security/keys/trusted-encrypted.rst
> > @@ -42,6 +42,14 @@ safe.
> >   randomly generated and fused into each SoC at manufacturing time.
> >   Otherwise, a common fixed test key is used instead.
> >
> > + (4) DCP (Data Co-Processor: crypto accelerator of various i.MX
> > + SoCs)
> > +
> > + Rooted to a one-time programmable key (OTP) that is generally
> burnt
> > + in the on-chip fuses and is accessible to the DCP encryption 
> > engine
> only.
> > + DCP provides two keys that can be used as root of trust: the OTP
> key
> > + and the UNIQUE key. Default is to use the UNIQUE key, but 
> > selecting
> > + the OTP key can be done via a module parameter
> (dcp_use_otp_key).
> > +
> >*  Execution isolation
> >
> >   (1) TPM
> > @@ -57,6 +65,12 @@ safe.
> >
> >   Fixed set of operations running in isolated execution environment.
> >
> > + (4) DCP
> > +
> > + Fixed set of cryptographic operations running in isolated 
> > execution
> > + environment. Only basic blob key encryption is executed there.
> > + The actual key sealing/unsealing is done on main processor/kernel
> space.
> > +
> >* Optional binding to platform integrity state
> >
> >   (1) TPM
> > @@ -79,6 +93,11 @@ safe.
> >   Relies on the High Assurance Boot (HAB) mechanism of NXP SoCs
> >   for platform integrity.
> >
> > + (4) DCP
> > +
> > + Relies on Secure/Trusted boot process (called HAB by vendor) for
> > + platform integrity.
> > +
> >*  Interfaces and APIs
> >
> >   (1) TPM
> > @@ -94,6 +113,11 @@ safe.
> >
> >   Interface is specific to silicon vendor.
> >
> > + (4) DCP
> > +
> > + Vendor-specific API that is implemented as part of the DCP crypto
> driver in
> > + ``drivers/crypto/mxs-dcp.c``.
> > +
> >*  Threat model
> >
> >   The strength and appropriateness of a particular trust source
> > for a given @@ -129,6 +153,13 @@ selected trust source:
> >   CAAM HWRNG, enable CRYPTO_DEV_FSL_CAAM_RNG_API and ensure
> the device
> >   is probed.
> >
> > +  *  DCP (Data Co-Processor: crypto accelerator of various i.MX SoCs)
> > +
> > + The DCP hardware device itself does not provide a dedicated RNG
> interface,
> > + so the kernel default RNG is used. SoCs with DCP like the i.MX6ULL do
> have
> > + a dedicated hardware RNG that is independent from DCP which can be
> enabled
> > + to back the kernel RNG.
> > +
> >  Users may override this by specifying ``trusted.rng=kernel`` on the
> > kernel  command-line to override the used RNG with the kernel's random
> number pool.
> >
> > @@ -231,6 +262,19 @@ Usage::
> >  CAAM-specific format.  The key length for new keys is always in bytes.
> >  Trusted Keys can be 32 - 128 bytes (256 - 1024 bits).
> >
> > +Trusted Keys usage: DCP
> > +---
> > +
> > +Usage::
> > +
> > +keyctl add 

Re: [PATCH] MAINTAINERS: Drop Li Yang as their email address stopped working

2024-04-09 Thread Jakub Kicinski
On Fri,  5 Apr 2024 09:20:41 +0200 Uwe Kleine-König wrote:
> When sending a patch to (among others) Li Yang the nxp MTA replied that
> the address doesn't exist and so the mail couldn't be delivered. The
> error code was 550, so at least technically that's not a temporal issue.
> 
> Signed-off-by: Uwe Kleine-König 

FWIW it's eaac25d026a1 in net, thanks!


Re: [PATCH v3 03/15] kunit: Add test cases for backtrace warning suppression

2024-04-09 Thread Guenter Roeck
On Tue, Apr 09, 2024 at 04:29:42PM +0800, David Gow wrote:
> > +ifeq ($(CCONFIG_KUNIT_SUPPRESS_BACKTRACE),y)
> 
> s/CCONFIG_/CONFIG_/ ?
> 
> 
Odd, I know I tested this (and it still works ;-).
The additional "C" must have slipped in at some point.
Thanks for noticing!

Guenter


Re: [EXT] [PATCH v8 3/6] KEYS: trusted: Introduce NXP DCP-backed trusted keys

2024-04-09 Thread Ahmad Fatoum
Hello Kshitiz,

On 09.04.24 12:54, Kshitiz Varshney wrote:
> Hi David,
>> +   b->fmt_version = DCP_BLOB_VERSION;
>> +   get_random_bytes(b->nonce, AES_KEYSIZE_128);
>> +   get_random_bytes(b->blob_key, AES_KEYSIZE_128);
> 
> We can use the HWRNG instead of the kernel RNG. Please refer to
> drivers/char/hw_random/imx-rngc.c

imx-rngc can be enabled and used to seed the kernel entropy pool. Adding
direct calls into imx-rngc here only introduces duplicated code at no extra
benefit.

Cheers,
Ahmad
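
Ahmad's point — that a hardware RNG such as imx-rngc should seed the kernel
entropy pool rather than be called directly — has a userspace analogue in
getrandom(2), which draws from that same pool. A minimal sketch (the helper
name is illustrative, not from the patch):

```c
#include <sys/random.h>
#include <sys/types.h>

/* Draw key material from the kernel's entropy pool, which HWRNG
 * drivers such as imx-rngc seed; no driver-specific calls needed. */
static ssize_t get_key_material(unsigned char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		/* flags == 0: blocking pool, may return short reads */
		ssize_t n = getrandom(buf + done, len - done, 0);
		if (n < 0)
			return -1;	/* simplified error handling */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The same consumer code keeps working regardless of which hardware RNG backs
the pool, which is exactly the duplication argument made above.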

-- 
Pengutronix e.K.   | |
Steuerwalder Str. 21   | http://www.pengutronix.de/  |
31137 Hildesheim, Germany  | Phone: +49-5121-206917-0|
Amtsgericht Hildesheim, HRA 2686   | Fax:   +49-5121-206917- |



Re: [PATCH] vdso: Fix powerpc build U64_MAX undeclared error

2024-04-09 Thread John Stultz
On Mon, Apr 8, 2024 at 11:27 PM Adrian Hunter  wrote:
>
> U64_MAX is not in include/vdso/limits.h, although that isn't noticed on x86
> because x86 includes include/linux/limits.h indirectly. However powerpc
> is more selective, resulting in the following build error:
>
>   In file included from :
>   lib/vdso/gettimeofday.c: In function 'vdso_calc_ns':
>   lib/vdso/gettimeofday.c:11:33: error: 'U64_MAX' undeclared
>  11 | # define VDSO_DELTA_MASK(vd)U64_MAX
> | ^~~
>
> Use ULLONG_MAX instead which will work just as well and is in
> include/vdso/limits.h.
>
> Reported-by: Stephen Rothwell 
> Closes: https://lore.kernel.org/all/20240409124905.6816d...@canb.auug.org.au/
> Fixes: c8e3a8b6f2e6 ("vdso: Consolidate vdso_calc_delta()")
> Signed-off-by: Adrian Hunter 

Acked-by: John Stultz 
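
The substitution relies on ULLONG_MAX and U64_MAX being the same value
wherever unsigned long long is 64 bits wide, which holds on the targets in
question. A hypothetical userspace model of the masked delta (not the kernel
code itself):

```c
#include <limits.h>
#include <stdint.h>

/* Simplified model of vdso_calc_ns(): mask the cycle delta with an
 * all-ones 64-bit mask, as VDSO_DELTA_MASK(vd) does after the fix. */
static uint64_t masked_delta(uint64_t cycles, uint64_t last)
{
	const uint64_t mask = ULLONG_MAX;	/* stands in for U64_MAX */

	return (cycles - last) & mask;
}
```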


Re: [PATCH] powerpc/pseries: Enforce hcall result buffer validity and size

2024-04-09 Thread Nathan Lynch
Michael Ellerman  writes:
> Nathan Lynch via B4 Relay 
> writes:
>>
>> plpar_hcall(), plpar_hcall9(), and related functions expect callers to
>> provide valid result buffers of certain minimum size. Currently this
>> is communicated only through comments in the code and the compiler has
>> no idea.
>>
>> For example, if I write a bug like this:
>>
>>   long retbuf[PLPAR_HCALL_BUFSIZE]; // should be PLPAR_HCALL9_BUFSIZE
>>   plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, ...);
>>
>> This compiles with no diagnostics emitted, but likely results in stack
>> corruption at runtime when plpar_hcall9() stores results past the end
>> of the array. (To be clear this is a contrived example and I have not
>> found a real instance yet.)
>
> We did have some real stack corruption bugs in the past.
>
> I referred to them in my previous (much uglier) attempt at a fix:
>
>   
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1476780032-21643-2-git-send-email-...@ellerman.id.au/
>
> Annoyingly I didn't describe them in any detail, but at least one of them was:
>
>   24c65bc7037e ("hwrng: pseries - port to new read API and fix stack corruption")

Thanks for this background.


> Will this catch a case like that? Where the too-small buffer is not
> declared locally but rather comes into the function as a pointer?

No, unfortunately. But here's a sketch that forces retbuf to be an
array, along with the necessary changes to make pseries-rng build:

diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 39cd1ca4ccb9..4055e461dcfd 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -524,7 +524,11 @@ long plpar_hcall_norets_notrace(unsigned long opcode, ...);
  * Used for all but the craziest of phyp interfaces (see plpar_hcall9)
  */
 #define PLPAR_HCALL_BUFSIZE 4
-long plpar_hcall(unsigned long opcode, unsigned long retbuf[static PLPAR_HCALL_BUFSIZE], ...);
+long _plpar_hcall(unsigned long opcode, unsigned long retbuf[static PLPAR_HCALL_BUFSIZE], ...);
+#define plpar_hcall(opcode_, retbuf_, ...) ({  \
+   static_assert(ARRAY_SIZE(retbuf_) >= PLPAR_HCALL_BUFSIZE); \
+   _plpar_hcall(opcode_, retbuf_, ## __VA_ARGS__); \
+   })
 
 /**
  * plpar_hcall_raw: - Make a hypervisor call without calculating hcall stats
diff --git a/arch/powerpc/platforms/pseries/hvCall.S b/arch/powerpc/platforms/pseries/hvCall.S
index 2b0cac6fb61f..4570dc0648fc 100644
--- a/arch/powerpc/platforms/pseries/hvCall.S
+++ b/arch/powerpc/platforms/pseries/hvCall.S
@@ -147,7 +147,7 @@ plpar_hcall_norets_trace:
blr
 #endif
 
-_GLOBAL_TOC(plpar_hcall)
+_GLOBAL_TOC(_plpar_hcall)
HMT_MEDIUM
 
mfcrr0
diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 4e9916bb03d7..11738c40274c 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -54,7 +54,7 @@
 
 
 /* in hvCall.S */
-EXPORT_SYMBOL(plpar_hcall);
+EXPORT_SYMBOL(_plpar_hcall);
 EXPORT_SYMBOL(plpar_hcall9);
 EXPORT_SYMBOL(plpar_hcall_norets);
 
diff --git a/drivers/char/hw_random/pseries-rng.c b/drivers/char/hw_random/pseries-rng.c
index 62bdd5af1339..a8cc6b80cd76 100644
--- a/drivers/char/hw_random/pseries-rng.c
+++ b/drivers/char/hw_random/pseries-rng.c
@@ -15,10 +15,10 @@
 
static int pseries_rng_read(struct hwrng *rng, void *data, size_t max, bool wait)
 {
-   u64 buffer[PLPAR_HCALL_BUFSIZE];
+   unsigned long buffer[PLPAR_HCALL_BUFSIZE];
int rc;
 
-   rc = plpar_hcall(H_RANDOM, (unsigned long *)buffer);
+   rc = plpar_hcall(H_RANDOM, buffer);
if (rc != H_SUCCESS) {
pr_err_ratelimited("H_RANDOM call failed %d\n", rc);
return -EIO;


There may be other call sites to fix but this is enough for
ppc64le_defconfig.
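
The same two-layer approach — a `[static N]` parameter for the callee plus a
macro wrapper with a compile-time size check — can be tried outside the
kernel. A hypothetical userspace sketch (the `fill_results` names are made
up; only the technique mirrors the diff above):

```c
#include <assert.h>	/* static_assert (C11) */
#include <stddef.h>

#define ARRAY_SIZE(a)	(sizeof(a) / sizeof((a)[0]))
#define BUFSIZE9	9

/* The callee declares the minimum buffer size; -Warray-bounds can then
 * flag too-small locally declared arrays at the call site. */
static long _fill_results(long retbuf[static BUFSIZE9])
{
	for (int i = 0; i < BUFSIZE9; i++)
		retbuf[i] = i;
	return 0;
}

/* Macro wrapper adds a hard compile-time size check when the argument
 * is an actual array, mirroring the static_assert(ARRAY_SIZE(...))
 * sketch above.  (The kernel's ARRAY_SIZE additionally rejects plain
 * pointers via __must_be_array; this simple version does not.) */
#define fill_results(retbuf_) ({					\
	static_assert(ARRAY_SIZE(retbuf_) >= BUFSIZE9,			\
		      "result buffer too small");			\
	_fill_results(retbuf_);						\
})
```

With this wrapper, `long small[4]; fill_results(small);` fails to compile
instead of silently corrupting the stack — which is the class of bug
24c65bc7037e fixed.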


Re: [PATCH v2 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS

2024-04-09 Thread Catalin Marinas
On Wed, Apr 03, 2024 at 04:38:00PM +0800, Kefeng Wang wrote:
> The vm_flags of the vma are already checked under the per-VMA lock; if it
> is a bad access, directly set fault to VM_FAULT_BADACCESS and handle the
> error, with no need to retry with mmap_lock again. The latency is reduced
> by 34% in 'lat_sig -P 1 prot lat_sig' from the lmbench testcase.
> 
> Since the page fault is handled under the per-VMA lock, count it as a vma
> lock event with VMA_LOCK_SUCCESS.
> 
> Reviewed-by: Suren Baghdasaryan 
> Signed-off-by: Kefeng Wang 

Reviewed-by: Catalin Marinas 


Re: [PATCH v2 1/7] arm64: mm: cleanup __do_page_fault()

2024-04-09 Thread Catalin Marinas
On Wed, Apr 03, 2024 at 04:37:59PM +0800, Kefeng Wang wrote:
> __do_page_fault() only calls handle_mm_fault() after vm_flags is checked,
> and it is only called by do_page_fault(); let's squash it into
> do_page_fault() to clean up the code.
> 
> Reviewed-by: Suren Baghdasaryan 
> Signed-off-by: Kefeng Wang 

As I reviewed v1 and the changes are minimal:

Reviewed-by: Catalin Marinas 


Re: [PATCHv12 1/4] genirq: Provide a snapshot mechanism for interrupt statistics

2024-04-09 Thread Thomas Gleixner
On Wed, Mar 06 2024 at 20:52, Bitao Hu wrote:
> The soft lockup detector lacks a mechanism to identify interrupt storms
> as the root cause of a lockup. To enable this, the detector needs a
> mechanism to snapshot the interrupt count statistics on a CPU when the
> detector observes a potential lockup scenario and compare that against
> the interrupt count when it warns about the lockup later on. The number
> of interrupts in that period gives a hint whether the lockup might be
> caused by an interrupt storm.
>
> Instead of having extra storage in the lockup detector and accessing
> the internals of the interrupt descriptor directly, convert the per CPU
> irq_desc::kstat_irq member to a data structure which contains the
> counter plus a snapshot member and provide interfaces to take a
> snapshot of all interrupts on the current CPU and to retrieve the delta
> of a specific interrupt later on.
>
> Originally-by: Thomas Gleixner 
> Signed-off-by: Bitao Hu 
> Reviewed-by: Liu Song 
> Reviewed-by: Douglas Anderson 

This does not apply anymore.

Also can you please split this apart to convert kstat_irqs to a struct
with just the count in it and then add the snapshot mechanics on top.

Thanks,

tglx
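
The counter-plus-snapshot structure described in the changelog can be
modeled in a few lines of plain C; the names below are illustrative, not
the kernel's actual `irq_desc` fields:

```c
/* Per-interrupt count with a snapshot taken when a potential lockup is
 * first observed; the delta later hints at an interrupt storm. */
struct irq_count {
	unsigned int cnt;	/* running interrupt count */
	unsigned int ref;	/* snapshot reference */
};

static void irq_count_hit(struct irq_count *ic)
{
	ic->cnt++;
}

static void irq_count_snapshot(struct irq_count *ic)
{
	ic->ref = ic->cnt;
}

static unsigned int irq_count_delta(const struct irq_count *ic)
{
	/* unsigned subtraction stays correct across counter wraparound */
	return ic->cnt - ic->ref;
}
```

Splitting this into "convert the counter to a struct" and "add the snapshot
mechanics", as requested above, would keep each step trivially reviewable.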



Re: [PATCH net-next] net: wan: fsl_qmc_hdlc: Convert to platform remove callback returning void

2024-04-09 Thread Herve Codina
On Tue,  9 Apr 2024 11:12:04 +0200
Uwe Kleine-König  wrote:

> The .remove() callback for a platform driver returns an int which makes
> many driver authors wrongly assume it's possible to do error handling by
> returning an error code. However the value returned is ignored (apart
> from emitting a warning) and this typically results in resource leaks.
> 
> To improve here there is a quest to make the remove callback return
> void. In the first step of this quest all drivers are converted to
> .remove_new(), which already returns void. Eventually after all drivers
> are converted, .remove_new() will be renamed to .remove().
> 
> Trivially convert this driver from always returning zero in the remove
> callback to the void returning variant.
> 
> Signed-off-by: Uwe Kleine-König 
> ---
> Hello,
> 
> the drivers below of drivers/net were already converted to struct
> platform_driver::remove_new during the v6.9-rc1 development cycle. This
> driver was added for v6.9-rc1 and so missed during my conversion.
> 
> There are still more drivers to be converted, so there is from my side
> no need to get this into v6.9, but the next merge window would be nice.
> 
> Best regards
> Uwe
> 
>  drivers/net/wan/fsl_qmc_hdlc.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 

Acked-by: Herve Codina 

Best regards,
Hervé


[PATCH net-next] net: wan: fsl_qmc_hdlc: Convert to platform remove callback returning void

2024-04-09 Thread Uwe Kleine-König
The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König 
---
Hello,

the drivers below of drivers/net were already converted to struct
platform_driver::remove_new during the v6.9-rc1 development cycle. This
driver was added for v6.9-rc1 and so missed during my conversion.

There are still more drivers to be converted, so there is from my side
no need to get this into v6.9, but the next merge window would be nice.

Best regards
Uwe

 drivers/net/wan/fsl_qmc_hdlc.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wan/fsl_qmc_hdlc.c b/drivers/net/wan/fsl_qmc_hdlc.c
index f69b1f579a0c..c5e7ca793c43 100644
--- a/drivers/net/wan/fsl_qmc_hdlc.c
+++ b/drivers/net/wan/fsl_qmc_hdlc.c
@@ -765,15 +765,13 @@ static int qmc_hdlc_probe(struct platform_device *pdev)
return ret;
 }
 
-static int qmc_hdlc_remove(struct platform_device *pdev)
+static void qmc_hdlc_remove(struct platform_device *pdev)
 {
struct qmc_hdlc *qmc_hdlc = platform_get_drvdata(pdev);
 
unregister_hdlc_device(qmc_hdlc->netdev);
free_netdev(qmc_hdlc->netdev);
qmc_hdlc_framer_exit(qmc_hdlc);
-
-   return 0;
 }
 
 static const struct of_device_id qmc_hdlc_id_table[] = {
@@ -788,7 +786,7 @@ static struct platform_driver qmc_hdlc_driver = {
.of_match_table = qmc_hdlc_id_table,
},
.probe = qmc_hdlc_probe,
-   .remove = qmc_hdlc_remove,
+   .remove_new = qmc_hdlc_remove,
 };
 module_platform_driver(qmc_hdlc_driver);
 
base-commit: 11cb68ad52ac78c81e33b806b531f097e68edfa2
-- 
2.43.0



Re: [PATCH] powerpc/eeh: Permanently disable the removed device

2024-04-09 Thread Michael Ellerman
Hi Ganesh,

Ganesh Goudar  writes:
> When a device is hot removed on powernv, the hotplug
> driver clears the device's state. However, on pseries,
> if a device is removed by phyp after reaching the error
> threshold, the kernel remains unaware, leading to the
> device not being torn down. This prevents necessary
> remediation actions like failover.
>
> Permanently disable the device if the presence check
> fails.

You can wrap your changelogs a bit wider, 70 or 80 columns is fine.

> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index ab316e155ea9..8d1606406d3f 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -508,7 +508,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>* state, PE is in good state.
>*/
>   if ((ret < 0) ||
> - (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
> + (ret == EEH_STATE_NOT_SUPPORT &&
> +  dev->error_state == pci_channel_io_perm_failure) ||
> + eeh_state_active(ret)) {
>   eeh_stats.false_positives++;
>   pe->false_positives++;
>   rc = 0;

How does this hunk relate the changelog?

This is adding an extra condition to the false positive check, so
there's a risk this causes devices to go into failure when previously
they didn't, right? So please explain why it's a good change. The
comment above the if needs updating too.

> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 48773d2d9be3..10317badf471 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -867,7 +867,13 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>   if (!devices) {
>   pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
>   pe->phb->global_number, pe->addr);
> - goto out; /* nothing to recover */

The other cases that go to recover_failed usually print something at
warn level, so this probably should too. So either make the above a
pr_warn(), or change it to a warn with a more helpful message.

> + /*
> +  * The device is removed, Tear down its state,
> +  * On powernv hotplug driver would take care of
> +  * it but not on pseries, Permanently disable the
> +  * card as it is hot removed.
> +  */

Formatting and punctuation is weird. It can be wider, and capital letter
is only required after a full stop, not a comma.

Also you say that the powernv hotplug driver "would" take care of it,
that's past tense, is that what you mean? Does the powernv hotplug
driver still take care of it after this change? And (how) does that
driver cope with it happening here also?

> + goto recover_failed;
>   }
>   
cheers


Re: [PATCH] powerpc: Fix fatal warnings flag for LLVM's integrated assembler

2024-04-09 Thread Michael Ellerman
Nathan Chancellor  writes:
> When building with LLVM_IAS=1, there is an error because
> '-fatal-warnings' is not recognized as a valid flag:
>
>   clang: error: unsupported argument '-fatal-warnings' to option '-Wa,'
>
> Use the double hyphen version of the flag, '--fatal-warnings', which
> works with both the GNU assembler and LLVM's integrated assembler.
>
> Fixes: 608d4a5ca563 ("powerpc: Error on assembly warnings")

Thanks.

I guess I need to do some of my clang builds with LLVM_IAS=1 :}

cheers


Re: [PATCH v2 4/7] powerpc: mm: accelerate pagefault when badaccess

2024-04-09 Thread Michael Ellerman
Kefeng Wang  writes:
> The access_[pkey]_error() checks on the vma are already done under the
> per-VMA lock; if it is a bad access, directly handle the error, with no
> need to retry with mmap_lock again. In order to release the correct lock,
> pass the mm_struct into bad_access_pkey()/bad_access(): if mm is NULL,
> release the vma lock, otherwise release mmap_lock. Since the page fault
> is handled under the per-VMA lock, count it as a vma lock event with
> VMA_LOCK_SUCCESS.
>
> Signed-off-by: Kefeng Wang 
> ---
>  arch/powerpc/mm/fault.c | 33 -
>  1 file changed, 20 insertions(+), 13 deletions(-)

I thought there might be a nicer way to do this, plumbing the mm and vma
down through all those levels is a bit of a pain (vma->vm_mm exists
after all).

But I couldn't come up with anything obviously better, without doing
lots of refactoring first, which would be a pain to integrate into this
series.

So anyway, if the series goes ahead:

Acked-by: Michael Ellerman  (powerpc)

cheers

> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 53335ae21a40..215690452495 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -71,23 +71,26 @@ static noinline int bad_area_nosemaphore(struct pt_regs *regs, unsigned long add
>   return __bad_area_nosemaphore(regs, address, SEGV_MAPERR);
>  }
>  
> -static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code)
> +static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code,
> +   struct mm_struct *mm, struct vm_area_struct *vma)
>  {
> - struct mm_struct *mm = current->mm;
>  
>   /*
>* Something tried to access memory that isn't in our memory map..
>* Fix it, but check if it's kernel or user first..
>*/
> - mmap_read_unlock(mm);
> + if (mm)
> + mmap_read_unlock(mm);
> + else
> + vma_end_read(vma);
>  
>   return __bad_area_nosemaphore(regs, address, si_code);
>  }
>  
>  static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address,
> + struct mm_struct *mm,
>   struct vm_area_struct *vma)
>  {
> - struct mm_struct *mm = current->mm;
>   int pkey;
>  
>   /*
> @@ -109,7 +112,10 @@ static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address,
>*/
>   pkey = vma_pkey(vma);
>  
> - mmap_read_unlock(mm);
> + if (mm)
> + mmap_read_unlock(mm);
> + else
> + vma_end_read(vma);
>  
>   /*
>* If we are in kernel mode, bail out with a SEGV, this will
> @@ -124,9 +130,10 @@ static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long address,
>   return 0;
>  }
>  
> -static noinline int bad_access(struct pt_regs *regs, unsigned long address)
> +static noinline int bad_access(struct pt_regs *regs, unsigned long address,
> +struct mm_struct *mm, struct vm_area_struct *vma)
>  {
> - return __bad_area(regs, address, SEGV_ACCERR);
> + return __bad_area(regs, address, SEGV_ACCERR, mm, vma);
>  }
>  
>  static int do_sigbus(struct pt_regs *regs, unsigned long address,
> @@ -479,13 +486,13 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
>  
>   if (unlikely(access_pkey_error(is_write, is_exec,
>  (error_code & DSISR_KEYFAULT), vma))) {
> - vma_end_read(vma);
> - goto lock_mmap;
> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> + return bad_access_pkey(regs, address, NULL, vma);
>   }
>  
>   if (unlikely(access_error(is_write, is_exec, vma))) {
> - vma_end_read(vma);
> - goto lock_mmap;
> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
> + return bad_access(regs, address, NULL, vma);
>   }
>  
>   fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
> @@ -521,10 +528,10 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address,
>  
>   if (unlikely(access_pkey_error(is_write, is_exec,
>  (error_code & DSISR_KEYFAULT), vma)))
> - return bad_access_pkey(regs, address, vma);
> + return bad_access_pkey(regs, address, mm, vma);
>  
>   if (unlikely(access_error(is_write, is_exec, vma)))
> - return bad_access(regs, address);
> + return bad_access(regs, address, mm, vma);
>  
>   /*
>* If for any reason at all we couldn't handle the fault,
> -- 
> 2.27.0
>
>
> ___
> linux-riscv mailing list
> linux-ri...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv


Re: [PATCH] powerpc/pseries: Enforce hcall result buffer validity and size

2024-04-09 Thread Michael Ellerman
Nathan Lynch via B4 Relay 
writes:
> From: Nathan Lynch 
>
> plpar_hcall(), plpar_hcall9(), and related functions expect callers to
> provide valid result buffers of certain minimum size. Currently this
> is communicated only through comments in the code and the compiler has
> no idea.
>
> For example, if I write a bug like this:
>
>   long retbuf[PLPAR_HCALL_BUFSIZE]; // should be PLPAR_HCALL9_BUFSIZE
>   plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf, ...);
>
> This compiles with no diagnostics emitted, but likely results in stack
> corruption at runtime when plpar_hcall9() stores results past the end
> of the array. (To be clear this is a contrived example and I have not
> found a real instance yet.)

We did have some real stack corruption bugs in the past.

I referred to them in my previous (much uglier) attempt at a fix:

  
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/1476780032-21643-2-git-send-email-...@ellerman.id.au/

Annoyingly I didn't describe them in any detail, but at least one of them was:

  24c65bc7037e ("hwrng: pseries - port to new read API and fix stack corruption")

Will this catch a case like that? Where the too-small buffer is not
declared locally but rather comes into the function as a pointer?

> To make this class of error less likely, we can use explicitly-sized
> array parameters instead of pointers in the declarations for the hcall
> APIs. When compiled with -Warray-bounds[1], the code above now
> provokes a diagnostic like this:
>
> error: array argument is too small;
> is of size 32, callee requires at least 72 [-Werror,-Warray-bounds]
>60 | plpar_hcall9(H_ALLOCATE_VAS_WINDOW, retbuf,
>   | ^   ~~
>
> [1] Enabled for LLVM builds but not GCC for now. See commit
> 0da6e5fd6c37 ("gcc: disable '-Warray-bounds' for gcc-13 too") and
> related changes.

clang build coverage is pretty good these days, so I think it's still
worth doing.

cheers

> diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
> index a41e542ba94d..39cd1ca4ccb9 100644
> --- a/arch/powerpc/include/asm/hvcall.h
> +++ b/arch/powerpc/include/asm/hvcall.h
> @@ -524,7 +524,7 @@ long plpar_hcall_norets_notrace(unsigned long opcode, ...);
>   * Used for all but the craziest of phyp interfaces (see plpar_hcall9)
>   */
>  #define PLPAR_HCALL_BUFSIZE 4
> -long plpar_hcall(unsigned long opcode, unsigned long *retbuf, ...);
> +long plpar_hcall(unsigned long opcode, unsigned long retbuf[static PLPAR_HCALL_BUFSIZE], ...);
>  
>  /**
>   * plpar_hcall_raw: - Make a hypervisor call without calculating hcall stats
> @@ -538,7 +538,7 @@ long plpar_hcall(unsigned long opcode, unsigned long *retbuf, ...);
>   * plpar_hcall, but plpar_hcall_raw works in real mode and does not
>   * calculate hypervisor call statistics.
>   */
> -long plpar_hcall_raw(unsigned long opcode, unsigned long *retbuf, ...);
> +long plpar_hcall_raw(unsigned long opcode, unsigned long retbuf[static PLPAR_HCALL_BUFSIZE], ...);
>  
>  /**
>   * plpar_hcall9: - Make a pseries hypervisor call with up to 9 return arguments
> @@ -549,8 +549,8 @@ long plpar_hcall_raw(unsigned long opcode, unsigned long *retbuf, ...);
>   * PLPAR_HCALL9_BUFSIZE to size the return argument buffer.
>   */
>  #define PLPAR_HCALL9_BUFSIZE 9
> -long plpar_hcall9(unsigned long opcode, unsigned long *retbuf, ...);
> -long plpar_hcall9_raw(unsigned long opcode, unsigned long *retbuf, ...);
> +long plpar_hcall9(unsigned long opcode, unsigned long retbuf[static PLPAR_HCALL9_BUFSIZE], ...);
> +long plpar_hcall9_raw(unsigned long opcode, unsigned long retbuf[static PLPAR_HCALL9_BUFSIZE], ...);
>  
>  /* pseries hcall tracing */
>  extern struct static_key hcall_tracepoint_key;
>
> ---
> base-commit: bfe51886ca544956eb4ff924d1937ac01d0ca9c8
> change-id: 20240408-pseries-hvcall-retbuf-c47c4d70d847
>
> Best regards,
> -- 
> Nathan Lynch 


Re: [PATCH] powerpc/pseries: remove returning ENODEV when uevent is triggered

2024-04-09 Thread Michael Ellerman
Hi Lidong,

Thanks for the patch.

I'm not an expert on udev etc. so apologies if any of these questions
are stupid.

Lidong Zhong  writes:
> We have noticed the following nuisance messages during boot
>
> [7.120610][ T1060] vio vio: uevent: failed to send synthetic uevent
> [7.122281][ T1060] vio 4000: uevent: failed to send synthetic uevent
> [7.122304][ T1060] vio 4001: uevent: failed to send synthetic uevent
> [7.122324][ T1060] vio 4002: uevent: failed to send synthetic uevent
> [7.122345][ T1060] vio 4004: uevent: failed to send synthetic uevent
>
> It's caused by either vio_register_device_node() failing to set
> dev->of_node or a missing "compatible" property. Try to return as much
> information as possible instead of a failure.

Does udev etc. cope with that? Can we just change the content of the
MODALIAS value like that?

With this patch we'll start emitting uevents for devices we previously
didn't. I guess that's OK because nothing is expecting them?

Can you include a log of udev showing the event firing and that nothing
breaks.

On my system here I see nothing matches the devices except for libvpd,
which seems to match lots of things.

> diff --git a/arch/powerpc/platforms/pseries/vio.c b/arch/powerpc/platforms/pseries/vio.c
> index 90ff85c879bf..62961715ca24 100644
> --- a/arch/powerpc/platforms/pseries/vio.c
> +++ b/arch/powerpc/platforms/pseries/vio.c
> @@ -1593,12 +1593,13 @@ static int vio_hotplug(const struct device *dev, struct kobj_uevent_env *env)
>  
>   dn = dev->of_node;
>   if (!dn)
> - return -ENODEV;
> + goto out;
>   cp = of_get_property(dn, "compatible", NULL);
>   if (!cp)
> - return -ENODEV;
> -
> - add_uevent_var(env, "MODALIAS=vio:T%sS%s", vio_dev->type, cp);
> + add_uevent_var(env, "MODALIAS=vio:T%s", vio_dev->type);

If it's OK to skip the compatible property then we don't need the
of_node at all, and we could always emit this, even when of_node is not
available.

> +else
> + add_uevent_var(env, "MODALIAS=vio:T%sS%s", vio_dev->type, cp);
> +out:
>   return 0;
>  }

I think we also should update the vio modalias_show() to follow the same
logic, otherwise the uevent MODALIAS value and the modalias file won't
match which is confusing.

Preferably vio_hotplug() and modalias_show() would just call a common
helper.

cheers
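
The common helper suggested above could look roughly like this; a userspace
sketch with snprintf() standing in for add_uevent_var()/sysfs_emit(), the
function name being hypothetical, and the NULL-compatible fallback following
Lidong's proposal:

```c
#include <stdio.h>

/* Hypothetical shared formatter for both the uevent MODALIAS value and
 * the sysfs modalias file, so the two can never disagree.  cp may be
 * NULL when the device node has no "compatible" property. */
static int vio_format_modalias(char *buf, size_t len,
			       const char *type, const char *cp)
{
	if (cp)
		return snprintf(buf, len, "vio:T%sS%s", type, cp);
	return snprintf(buf, len, "vio:T%s", type);
}
```

Both vio_hotplug() and modalias_show() would then call the helper and only
differ in how the resulting string is emitted.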


Re: [PATCH v3 04/15] kunit: Add documentation for warning backtrace suppression API

2024-04-09 Thread David Gow
On Wed, 3 Apr 2024 at 21:19, Guenter Roeck  wrote:
>
> Document API functions for suppressing warning backtraces.
>
> Tested-by: Linux Kernel Functional Testing 
> Acked-by: Dan Carpenter 
> Reviewed-by: Kees Cook 
> Signed-off-by: Guenter Roeck 
> ---

This looks good to me: thanks for adding the documentation!

If we add integration between this and the KUnit resource system,
we'll need to add that to this documentation.

I wonder if it would make sense to have an example where the
DEFINE_SUPPRESSED_WARNING() is global, e.g., in the test init/exit
functions. That might overcomplicate it a bit.

It also might be nice to document the individual macros with kerneldoc
comments. (Though, that could equally fit in patch #1).

Still, this is the most important bit, so I'm happy to have it as-is.

Reviewed-by: David Gow 

Cheers,
-- David


> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> v3:
> - Rebased to v6.9-rc2
>
>  Documentation/dev-tools/kunit/usage.rst | 30 -
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/dev-tools/kunit/usage.rst b/Documentation/dev-tools/kunit/usage.rst
> index 22955d56b379..8d3d36d4103d 100644
> --- a/Documentation/dev-tools/kunit/usage.rst
> +++ b/Documentation/dev-tools/kunit/usage.rst
> @@ -157,6 +157,34 @@ Alternatively, one can take full control over the error message by using
> if (some_setup_function())
> KUNIT_FAIL(test, "Failed to setup thing for testing");
>
> +Suppressing warning backtraces
> +--
> +
> +Some unit tests trigger warning backtraces either intentionally or as side
> +effect. Such backtraces are normally undesirable since they distract from
> +the actual test and may result in the impression that there is a problem.
> +
> +Such backtraces can be suppressed. To suppress a backtrace in some_function(),
> +use the following code.
> +
> +.. code-block:: c
> +
> +   static void some_test(struct kunit *test)
> +   {
> +   DEFINE_SUPPRESSED_WARNING(some_function);
> +
> +   START_SUPPRESSED_WARNING(some_function);
> +   trigger_backtrace();
> +   END_SUPPRESSED_WARNING(some_function);
> +   }
> +
> +SUPPRESSED_WARNING_COUNT() returns the number of suppressed backtraces. If the
> +suppressed backtrace was triggered on purpose, this can be used to check if
> +the backtrace was actually triggered.
> +
> +.. code-block:: c
> +
> +   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(some_function), 1);
>
>  Test Suites
>  ~~~
> @@ -857,4 +885,4 @@ For example:
> dev_managed_string = devm_kstrdup(fake_device, "Hello, World!");
>
> // Everything is cleaned up automatically when the test ends.
> -   }
> \ No newline at end of file
> +   }
> --
> 2.39.2
>


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH v3 03/15] kunit: Add test cases for backtrace warning suppression

2024-04-09 Thread David Gow
On Wed, 3 Apr 2024 at 21:19, Guenter Roeck  wrote:
>
> Add unit tests to verify that warning backtrace suppression works.
>
> If backtrace suppression does _not_ work, the unit tests will likely
> trigger unsuppressed backtraces, which should actually help to get
> the affected architectures / platforms fixed.
>
> Tested-by: Linux Kernel Functional Testing 
> Acked-by: Dan Carpenter 
> Reviewed-by: Kees Cook 
> Signed-off-by: Guenter Roeck 
> ---

There's a typo in the Makefile, which stops this from being built at
all. Otherwise, it seems good to me.

-- David

> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
> v3:
> - Rebased to v6.9-rc2
>
>  lib/kunit/Makefile |   7 +-
>  lib/kunit/backtrace-suppression-test.c | 104 +
>  2 files changed, 109 insertions(+), 2 deletions(-)
>  create mode 100644 lib/kunit/backtrace-suppression-test.c
>
> diff --git a/lib/kunit/Makefile b/lib/kunit/Makefile
> index 545b57c3be48..3eee1bd0ce5e 100644
> --- a/lib/kunit/Makefile
> +++ b/lib/kunit/Makefile
> @@ -16,10 +16,13 @@ endif
>
>  # KUnit 'hooks' and bug handling are built-in even when KUnit is built
>  # as a module.
> -obj-y +=   hooks.o \
> -   bug.o
> +obj-y +=   hooks.o
> +obj-$(CONFIG_KUNIT_SUPPRESS_BACKTRACE) += bug.o
>
>  obj-$(CONFIG_KUNIT_TEST) +=kunit-test.o
> +ifeq ($(CCONFIG_KUNIT_SUPPRESS_BACKTRACE),y)

s/CCONFIG_/CONFIG_/ ?




> +obj-$(CONFIG_KUNIT_TEST) +=backtrace-suppression-test.o
> +endif
>
>  # string-stream-test compiles built-in only.
>  ifeq ($(CONFIG_KUNIT_TEST),y)
> diff --git a/lib/kunit/backtrace-suppression-test.c b/lib/kunit/backtrace-suppression-test.c
> new file mode 100644
> index ..47c619283802
> --- /dev/null
> +++ b/lib/kunit/backtrace-suppression-test.c
> @@ -0,0 +1,104 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * KUnit test for suppressing warning tracebacks
> + *
> + * Copyright (C) 2024, Guenter Roeck
> + * Author: Guenter Roeck 
> + */
> +
> +#include 
> +#include 
> +
> +static void backtrace_suppression_test_warn_direct(struct kunit *test)
> +{
> +   DEFINE_SUPPRESSED_WARNING(backtrace_suppression_test_warn_direct);
> +
> +   START_SUPPRESSED_WARNING(backtrace_suppression_test_warn_direct);
> +   WARN(1, "This backtrace should be suppressed");
> +   END_SUPPRESSED_WARNING(backtrace_suppression_test_warn_direct);
> +
> +   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(backtrace_suppression_test_warn_direct), 1);
> +}
> +
> +static void trigger_backtrace_warn(void)
> +{
> +   WARN(1, "This backtrace should be suppressed");
> +}
> +
> +static void backtrace_suppression_test_warn_indirect(struct kunit *test)
> +{
> +   DEFINE_SUPPRESSED_WARNING(trigger_backtrace_warn);
> +
> +   START_SUPPRESSED_WARNING(trigger_backtrace_warn);
> +   trigger_backtrace_warn();
> +   END_SUPPRESSED_WARNING(trigger_backtrace_warn);
> +
> +   KUNIT_EXPECT_EQ(test, SUPPRESSED_WARNING_COUNT(trigger_backtrace_warn), 1);
> +}
> +
> +static void backtrace_suppression_test_warn_multi(struct kunit *test)
> +{
> +   DEFINE_SUPPRESSED_WARNING(trigger_backtrace_warn);
> +   DEFINE_SUPPRESSED_WARNING(backtrace_suppression_test_warn_multi);
> +
> +   START_SUPPRESSED_WARNING(backtrace_suppression_test_warn_multi);
> +   START_SUPPRESSED_WARNING(trigger_backtrace_warn);
> +   WARN(1, "This backtrace should be suppressed");
> +   trigger_backtrace_warn();
> +   END_SUPPRESSED_WARNING(trigger_backtrace_warn);
> +   END_SUPPRESSED_WARNING(backtrace_suppression_test_warn_multi);
> +
> +   KUNIT_EXPECT_EQ(test, 
> SUPPRESSED_WARNING_COUNT(backtrace_suppression_test_warn_multi), 1);
> +   KUNIT_EXPECT_EQ(test, 
> SUPPRESSED_WARNING_COUNT(trigger_backtrace_warn), 1);
> +}
> +
> +static void backtrace_suppression_test_warn_on_direct(struct kunit *test)
> +{
> +   DEFINE_SUPPRESSED_WARNING(backtrace_suppression_test_warn_on_direct);
> +
> +   if (!IS_ENABLED(CONFIG_DEBUG_BUGVERBOSE) && 
> !IS_ENABLED(CONFIG_KALLSYMS))
> +   kunit_skip(test, "requires CONFIG_DEBUG_BUGVERBOSE or 
> CONFIG_KALLSYMS");
> +
> +   START_SUPPRESSED_WARNING(backtrace_suppression_test_warn_on_direct);
> +   WARN_ON(1);
> +   END_SUPPRESSED_WARNING(backtrace_suppression_test_warn_on_direct);
> +
> +   KUNIT_EXPECT_EQ(test,
> +   
> SUPPRESSED_WARNING_COUNT(backtrace_suppression_test_warn_on_direct), 1);
> +}
> +
> +static void trigger_backtrace_warn_on(void)
> +{
> +   WARN_ON(1);
> +}
> +
> +static void backtrace_suppression_test_warn_on_indirect(struct kunit *test)
> +{
> +   DEFINE_SUPPRESSED_WARNING(trigger_backtrace_warn_on);
> +
> +   if (!IS_ENABLED(CONFIG_DEBUG_BUGVERBOSE))
> +   kunit_skip(test, 

Re: [PATCH v3 02/15] kunit: bug: Count suppressed warning backtraces

2024-04-09 Thread David Gow
On Wed, 3 Apr 2024 at 21:19, Guenter Roeck  wrote:
>
> Count suppressed warning backtraces to enable code which suppresses
> warning backtraces to check if the expected backtrace(s) have been
> observed.
>
> Using atomics for the backtrace count resulted in build errors on some
> architectures due to include file recursion, so use a plain integer
> for now.
>
> Acked-by: Dan Carpenter 
> Reviewed-by: Kees Cook 
> Tested-by: Linux Kernel Functional Testing 
> Signed-off-by: Guenter Roeck 
> ---

Looks good to me, thanks.

Reviewed-by: David Gow 

Cheers,
-- David


> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> - Introduced KUNIT_SUPPRESS_BACKTRACE configuration option
> v3:
> - Rebased to v6.9-rc2
>
>  include/kunit/bug.h | 7 ++-
>  lib/kunit/bug.c | 4 +++-
>  2 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/include/kunit/bug.h b/include/kunit/bug.h
> index bd0fe047572b..72e9fb23bbd5 100644
> --- a/include/kunit/bug.h
> +++ b/include/kunit/bug.h
> @@ -20,6 +20,7 @@
>  struct __suppressed_warning {
> struct list_head node;
> const char *function;
> +   int counter;
>  };
>
>  void __start_suppress_warning(struct __suppressed_warning *warning);
> @@ -28,7 +29,7 @@ bool __is_suppressed_warning(const char *function);
>
>  #define DEFINE_SUPPRESSED_WARNING(func)\
> struct __suppressed_warning __kunit_suppress_##func = \
> -   { .function = __stringify(func) }
> +   { .function = __stringify(func), .counter = 0 }
>
>  #define START_SUPPRESSED_WARNING(func) \
> __start_suppress_warning(&__kunit_suppress_##func)
> @@ -39,12 +40,16 @@ bool __is_suppressed_warning(const char *function);
>  #define IS_SUPPRESSED_WARNING(func) \
> __is_suppressed_warning(func)
>
> +#define SUPPRESSED_WARNING_COUNT(func) \
> +   (__kunit_suppress_##func.counter)
> +
>  #else /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
>
>  #define DEFINE_SUPPRESSED_WARNING(func)
>  #define START_SUPPRESSED_WARNING(func)
>  #define END_SUPPRESSED_WARNING(func)
>  #define IS_SUPPRESSED_WARNING(func) (false)
> +#define SUPPRESSED_WARNING_COUNT(func) (0)
>
>  #endif /* CONFIG_KUNIT_SUPPRESS_BACKTRACE */
>  #endif /* __ASSEMBLY__ */
> diff --git a/lib/kunit/bug.c b/lib/kunit/bug.c
> index f93544d7a9d1..13b3d896c114 100644
> --- a/lib/kunit/bug.c
> +++ b/lib/kunit/bug.c
> @@ -32,8 +32,10 @@ bool __is_suppressed_warning(const char *function)
> return false;
>
> list_for_each_entry(warning, &suppressed_warnings, node) {
> -   if (!strcmp(function, warning->function))
> +   if (!strcmp(function, warning->function)) {
> +   warning->counter++;
> return true;
> +   }
> }
> return false;
>  }
> --
> 2.39.2
>
> --
> You received this message because you are subscribed to the Google Groups 
> "KUnit Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kunit-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/kunit-dev/20240403131936.787234-3-linux%40roeck-us.net.




Re: [PATCH v3 01/15] bug/kunit: Core support for suppressing warning backtraces

2024-04-09 Thread David Gow
On Wed, 3 Apr 2024 at 21:19, Guenter Roeck  wrote:
>
> Some unit tests intentionally trigger warning backtraces by passing
> bad parameters to API functions. Such unit tests typically check the
> return value from those calls, not the existence of the warning backtrace.
>
> Such intentionally generated warning backtraces are neither desirable
> nor useful for a number of reasons.
> - They can result in overlooked real problems.
> - A warning that suddenly starts to show up in unit tests needs to be
>   investigated and has to be marked to be ignored, for example by
>   adjusting filter scripts. Such filters are ad-hoc because there is
>   no real standard format for warnings. On top of that, such filter
>   scripts would require constant maintenance.
>
> One option to address the problem would be to add messages such as "expected
> warning backtraces start / end here" to the kernel log.  However, that
> would again require filter scripts, it might result in missing real
> problematic warning backtraces triggered while the test is running, and
> the irrelevant backtrace(s) would still clog the kernel log.
>
> Solve the problem by providing a means to identify and suppress specific
> warning backtraces while executing test code. Since the new functionality
> results in an image size increase of about 1% if CONFIG_KUNIT is enabled,
> provide configuration option KUNIT_SUPPRESS_BACKTRACE to be able to disable
> the new functionality. This option is by default enabled since almost all
> systems with CONFIG_KUNIT enabled will want to benefit from it.
>
> Cc: Dan Carpenter 
> Cc: Daniel Diaz 
> Cc: Naresh Kamboju 
> Cc: Kees Cook 
> Tested-by: Linux Kernel Functional Testing 
> Acked-by: Dan Carpenter 
> Reviewed-by: Kees Cook 
> Signed-off-by: Guenter Roeck 
> ---

Sorry it took so long to get to this.

I love the idea, we've needed this for a while.

There are some downsides to this being entirely based on the name of
the function which contains WARN(). Partly because there could be
several WARN()s within a function, and there'd be overlap, and partly
because the function name is never actually printed during a warning
(it may come from the stack trace, but that can be misleading with
inlined functions).  I don't think either of these are showstoppers,
though, but it'd be nice to extend this in the future with (a) other
ways of identifying warnings, such as the format string, and (b) print
the function name in the report, if it's present. The function name is
probably a good middle ground, complexity-wise, though, so I'm happy
to have it thus far.

I also think we're missing some opportunities to integrate this
better with existing KUnit infrastructure, like the
action/resource/cleanup system. In particular, it'd be nice to have a
way of ensuring that suppressions won't get leaked if the test aborts
between START_SUPPRESSED_WARNING() and END_SUPPRESSED_WARNING(). It's
not difficult to use this as-is, but it'd be nice to have some
helpers, rather than having to, for instance:
  KUNIT_DEFINE_ACTION_WRAPPER(kunit_stop_suppressing_warning,
                              __end_suppress_warning,
                              struct __suppressed_warning *);
  DEFINE_SUPPRESSED_WARNING(vfree);
  START_SUPPRESSED_WARNING(vfree);
  kunit_add_action(test, kunit_stop_suppressing_warning,
                   (void *)&__kunit_suppress_vfree);

(With the note that the DEFINE_SUPPRESSED_WARNING() will have to be
global, or put on the heap, lest it become a dangling pointer by the
time the suppression has stopped.)

Equally, do we want to make the
__{start,end,is}_suppress[ed]_warning()  functions KUnit 'hooks'? This
would allow them to be used in modules which don't depend directly on
KUnit. I suspect it's not important in this case: but worth keeping in
mind in case we find a situation where we'd need to suppress a warning
elsewhere.

These are all things which could be added/changed in follow-up
patches, though, so I don't think they're blockers. Otherwise, this
looks good: perhaps the naming could be a bit more consistent with
other KUnit things, but that depends on how much we want this to be 'a
part of KUnit' versus an independent bit of functionality.

> v2:
> - Rebased to v6.9-rc1
> - Added Tested-by:, Acked-by:, and Reviewed-by: tags
> - Added CONFIG_KUNIT_SUPPRESS_BACKTRACE configuration option,
>   enabled by default
> v3:
> - Rebased to v6.9-rc2
>
>  include/asm-generic/bug.h | 16 +---
>  include/kunit/bug.h   | 51 +++
>  include/kunit/test.h  |  1 +
>  include/linux/bug.h   | 13 ++
>  lib/bug.c | 51 ---
>  lib/kunit/Kconfig |  9 +++
>  lib/kunit/Makefile|  6 +++--
>  lib/kunit/bug.c   | 40 ++
>  8 files changed, 178 insertions(+), 9 deletions(-)
>  create mode 100644 include/kunit/bug.h
>  create mode 100644 lib/kunit/bug.c
>
> diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
> index 6e794420bd39..c170b6477689 100644

Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

2024-04-09 Thread Tobias Huschle
On Fri, Mar 22, 2024 at 06:02:05PM +0100, Vincent Guittot wrote:
> and then
> se->vruntime = max_vruntime(se->vruntime, vruntime)
> 

First things first, I was wrong to assume a "boost" in the CFS code. So I
dug a bit deeper and tried to pinpoint what the difference between CFS and
EEVDF actually is. I found the following:

Let's assume we have two tasks taking turns on a single CPU.
Task 1 is always runnable.
Task 2 gets woken up by task 1 and goes back to sleep when it is done.
This means, task 1 runs, wakes up task 2, task 2 runs, goes to sleep and
task 1 runs again and we repeat.
Most of the time: runtime(task1) > runtime(task2)
Rare occasions:   runtime(task1) < runtime(task2)
So task 1 usually consumes more of its designated time slice until it gets
rescheduled by the wakeup of task 2 than task 2 does. But neither ever
consumes its full time slice. Rather the opposite: both run for low
5-digit ns or less.

So something like this:

task 1 |------|    |------|    |--...
task 2        |--|        |--|

This creates different behaviors under CFS and EEVDF:

### CFS 

In CFS the difference in runtimes means that task 2 cannot catch up with 
task 1 vruntime-wise

With every exchange between task 1 and task 2, task 2 falls further behind
in vruntime. Once a difference on the order of sysctl_sched_latency is
established, the difference remains stable due to the max handling in
place_entity.

Occasionally, task 2 may run longer than task 1. In those cases, it
will catch up slightly. But in the majority of cases, task 2 runs
shorter, thereby increasing the difference in vruntime.

This would explain why task 2 always gets scheduled immediately on wakeup.

### EEVDF ##

The rare occasions where task 2 runs longer than task 1 seem to cause 
issues with EEVDF:

In the regular case, where task 1 runs longer than task 2, task 2 gets a
positive lag and is selected on wake up --> good.
In the irregular case, where task 2 runs longer than task 1, task 2 now
gets a negative lag and is no longer chosen on wakeup --> bad (in some
cases).

This would explain why task 2 occasionally does not get selected on wakeup.

### Summary 

So my wording that a woken-up task gets "boosted" was obviously wrong.
Task 2 is not getting boosted in CFS; it gets "outrun" by task 1, with
no chance of catching up, leaving it with a smaller vruntime value.

EEVDF, on the other hand, does not allow lag to accumulate if an entity,
like task 2 in this case, regularly dequeues itself. So its lag is always
bounded by whatever runtime difference it encountered in its latest
exchange with task 1.

The patch below allows tasks to accumulate lag over time. This fixes the
original regression that made me stumble into this topic. But this might
of course come with arbitrary side effects.

I'm not suggesting to actually implement this, but would like to confirm 
whether my understanding is correct that this is the aspect where CFS and 
EEVDF differ, where CFS is more aware of the past in this particular case
than EEVDF is.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
s64 lag, limit;
 
SCHED_WARN_ON(!se->on_rq);
-   lag = avg_vruntime(cfs_rq) - se->vruntime;
+   lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
se->vlag = clamp(lag, -limit, limit);


[PATCH] vdso: Fix powerpc build U64_MAX undeclared error

2024-04-09 Thread Adrian Hunter
U64_MAX is not in include/vdso/limits.h, although that isn't noticed on x86
because x86 includes include/linux/limits.h indirectly. However, powerpc
is more selective, resulting in the following build error:

  In file included from <command-line>:
  lib/vdso/gettimeofday.c: In function 'vdso_calc_ns':
  lib/vdso/gettimeofday.c:11:33: error: 'U64_MAX' undeclared
     11 | # define VDSO_DELTA_MASK(vd)    U64_MAX
        |                                 ^~~~~~~

Use ULLONG_MAX instead which will work just as well and is in
include/vdso/limits.h.

Reported-by: Stephen Rothwell 
Closes: https://lore.kernel.org/all/20240409124905.6816d...@canb.auug.org.au/
Fixes: c8e3a8b6f2e6 ("vdso: Consolidate vdso_calc_delta()")
Signed-off-by: Adrian Hunter 
---
 lib/vdso/gettimeofday.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c
index 9c3a8d2440c9..899850bd6f0b 100644
--- a/lib/vdso/gettimeofday.c
+++ b/lib/vdso/gettimeofday.c
@@ -8,7 +8,7 @@
 #ifndef vdso_calc_ns
 
 #ifdef VDSO_DELTA_NOMASK
-# define VDSO_DELTA_MASK(vd)   U64_MAX
+# define VDSO_DELTA_MASK(vd)   ULLONG_MAX
 #else
 # define VDSO_DELTA_MASK(vd)   (vd->mask)
 #endif
-- 
2.34.1



[PATCH RFC v2 5/5] arm64: mm: take DMA zone offset into account

2024-04-09 Thread Baruch Siach
Commit 791ab8b2e3db ("arm64: Ignore any DMA offsets in the
max_zone_phys() calculation") made DMA/DMA32 zones span the entire RAM
when RAM starts above 32 bits. This breaks hardware with a DMA area that
starts above 32 bits. The commit log says that "we haven't noticed
any such hardware". It turns out that such hardware does exist.

One such platform has RAM starting at 32GB with an internal bus that has
the following DMA limits:

  #address-cells = <2>;
  #size-cells = <2>;
  dma-ranges = <0x00 0xc0000000 0x08 0x00000000 0x00 0x40000000>;

Devices under this bus can see 1GB of DMA range between 3GB-4GB in each
device address space. This range is mapped to CPU memory at 32GB-33GB.
With current code DMA allocations for devices under this bus are not
limited to DMA area, leading to run-time allocation failure.

Modify 'zone_dma_limit' calculation (via dt_zone_dma_limit) to only cover
the actual DMA area starting at 'zone_dma_base'. Use the newly introduced
'min' parameter of of_dma_get_cpu_limits() to set 'zone_dma_base'.

DMA32 zone is useless in this configuration, so make its limit the same
as the DMA zone's when the lower DMA limit is above 32 bits.

The result is DMA zone that properly reflects the hardware constraints
as follows:

[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000800000000-0x000000083fffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000840000000-0x0000000bffffffff]

Suggested-by: Catalin Marinas 
Signed-off-by: Baruch Siach 
---
 arch/arm64/mm/init.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 77e942ca578b..cd283ae0178d 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -128,9 +128,11 @@ static void __init zone_sizes_init(void)
 
 #ifdef CONFIG_ZONE_DMA
acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address();
-   of_dma_get_cpu_limits(NULL, &dt_zone_dma_limit, NULL);
+   of_dma_get_cpu_limits(NULL, &dt_zone_dma_limit, &zone_dma_base);
zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit);
arm64_dma_phys_limit = max_zone_phys(zone_dma_limit);
+   if (zone_dma_base > U32_MAX)
+   dma32_phys_limit = arm64_dma_phys_limit;
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
-- 
2.43.0



[PATCH RFC v2 1/5] dma-mapping: replace zone_dma_bits by zone_dma_limit

2024-04-09 Thread Baruch Siach
From: Catalin Marinas 

A hardware DMA limit might not be a power of 2. When the RAM range starts
above 0, say at 4GB, a 30-bit DMA limit should end at 5GB. A single high
bit cannot encode this limit.

Use direct phys_addr_t limit address for DMA zone limit.

Following commits will add explicit base address to DMA zone.

---
Catalin,

This is taken almost verbatim from your email:

  https://lore.kernel.org/all/zz2hnhjv3gdzu...@arm.com/

Would you provide your sign-off?

Thanks,
baruch
---
 arch/arm64/mm/init.c   | 32 ++--
 arch/powerpc/mm/mem.c  |  9 -
 arch/s390/mm/init.c|  2 +-
 include/linux/dma-direct.h |  2 +-
 kernel/dma/direct.c|  6 +++---
 kernel/dma/pool.c  |  2 +-
 kernel/dma/swiotlb.c   |  4 ++--
 7 files changed, 22 insertions(+), 35 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 03efd86dce0a..00508c69ca9e 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -113,36 +113,24 @@ static void __init arch_reserve_crashkernel(void)
low_size, high);
 }
 
-/*
- * Return the maximum physical address for a zone accessible by the given bits
- * limit. If DRAM starts above 32-bit, expand the zone to the maximum
- * available memory, otherwise cap it at 32-bit.
- */
-static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
+static phys_addr_t __init max_zone_phys(phys_addr_t zone_limit)
 {
-   phys_addr_t zone_mask = DMA_BIT_MASK(zone_bits);
-   phys_addr_t phys_start = memblock_start_of_DRAM();
-
-   if (phys_start > U32_MAX)
-   zone_mask = PHYS_ADDR_MAX;
-   else if (phys_start > zone_mask)
-   zone_mask = U32_MAX;
-
-   return min(zone_mask, memblock_end_of_DRAM() - 1) + 1;
+   return min(zone_limit, memblock_end_of_DRAM() - 1) + 1;
 }
 
 static void __init zone_sizes_init(void)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
-   unsigned int __maybe_unused acpi_zone_dma_bits;
-   unsigned int __maybe_unused dt_zone_dma_bits;
-   phys_addr_t __maybe_unused dma32_phys_limit = max_zone_phys(32);
+   phys_addr_t __maybe_unused acpi_zone_dma_limit;
+   phys_addr_t __maybe_unused dt_zone_dma_limit;
+   phys_addr_t __maybe_unused dma32_phys_limit =
+   max_zone_phys(DMA_BIT_MASK(32));
 
 #ifdef CONFIG_ZONE_DMA
-   acpi_zone_dma_bits = fls64(acpi_iort_dma_get_max_cpu_address());
-   dt_zone_dma_bits = fls64(of_dma_get_max_cpu_address(NULL));
-   zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_zone_dma_bits);
-   arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
+   acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address();
+   dt_zone_dma_limit = of_dma_get_max_cpu_address(NULL);
+   zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit);
+   arm64_dma_phys_limit = max_zone_phys(zone_dma_limit);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3a440004b97d..4d6f575fd354 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -214,7 +214,7 @@ static int __init mark_nonram_nosave(void)
  * everything else. GFP_DMA32 page allocations automatically fall back to
  * ZONE_DMA.
  *
- * By using 31-bit unconditionally, we can exploit zone_dma_bits to inform the
+ * By using 31-bit unconditionally, we can exploit zone_dma_limit to inform the
  * generic DMA mapping code.  32-bit only devices (if not handled by an IOMMU
  * anyway) will take a first dip into ZONE_NORMAL and get otherwise served by
  * ZONE_DMA.
@@ -250,13 +250,12 @@ void __init paging_init(void)
 * powerbooks.
 */
if (IS_ENABLED(CONFIG_PPC32))
-   zone_dma_bits = 30;
+   zone_dma_limit = DMA_BIT_MASK(30);
else
-   zone_dma_bits = 31;
+   zone_dma_limit = DMA_BIT_MASK(31);
 
 #ifdef CONFIG_ZONE_DMA
-   max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
- 1UL << (zone_dma_bits - PAGE_SHIFT));
+   max_zone_pfns[ZONE_DMA] = min(max_low_pfn, zone_dma_limit >> 
PAGE_SHIFT);
 #endif
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
 #ifdef CONFIG_HIGHMEM
diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c
index f6391442c0c2..5feaa60933b7 100644
--- a/arch/s390/mm/init.c
+++ b/arch/s390/mm/init.c
@@ -95,7 +95,7 @@ void __init paging_init(void)
 
vmem_map_init();
sparse_init();
-   zone_dma_bits = 31;
+   zone_dma_limit = DMA_BIT_MASK(31);
memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
max_zone_pfns[ZONE_DMA] = virt_to_pfn(MAX_DMA_ADDRESS);
max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 3eb3589ff43e..7cf76f1d3239 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -12,7 +12,7 @@
 #include 
 

[PATCH RFC v2 2/5] of: get dma area lower limit

2024-04-09 Thread Baruch Siach
of_dma_get_max_cpu_address() returns the highest CPU address that
devices can use for DMA. The implicit assumption is that all CPU
addresses below that limit are suitable for DMA. However the
'dma-ranges' property this code uses also encodes a lower limit for DMA
that is potentially non zero.

Rename to of_dma_get_cpu_limits(), and extend to retrieve also the lower
limit for the same 'dma-ranges' property describing the high limit.

Update callers of of_dma_get_max_cpu_address(). No functional change
intended.

Signed-off-by: Baruch Siach 
---
 arch/arm64/mm/init.c  |  2 +-
 drivers/of/address.c  | 38 +++---
 drivers/of/unittest.c |  8 
 include/linux/of.h| 11 ---
 4 files changed, 40 insertions(+), 19 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 00508c69ca9e..77e942ca578b 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -128,7 +128,7 @@ static void __init zone_sizes_init(void)
 
 #ifdef CONFIG_ZONE_DMA
acpi_zone_dma_limit = acpi_iort_dma_get_max_cpu_address();
-   dt_zone_dma_limit = of_dma_get_max_cpu_address(NULL);
+   of_dma_get_cpu_limits(NULL, &dt_zone_dma_limit, NULL);
zone_dma_limit = min(dt_zone_dma_limit, acpi_zone_dma_limit);
arm64_dma_phys_limit = max_zone_phys(zone_dma_limit);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
diff --git a/drivers/of/address.c b/drivers/of/address.c
index ae46a3605904..ac009b3bb63b 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -964,21 +964,25 @@ int of_dma_get_range(struct device_node *np, const struct 
bus_dma_region **map)
 #endif /* CONFIG_HAS_DMA */
 
 /**
- * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
+ * of_dma_get_cpu_limits - Gets highest CPU address suitable for DMA
  * @np: The node to start searching from or NULL to start from the root
+ * @max: Pointer to high address limit or NULL if not needed
+ * @min: Pointer to low address limit or NULL if not needed
  *
  * Gets the highest CPU physical address that is addressable by all DMA masters
- * in the sub-tree pointed by np, or the whole tree if NULL is passed. If no
- * DMA constrained device is found, it returns PHYS_ADDR_MAX.
+ * in the sub-tree pointed by np, or the whole tree if @np is NULL. If no
+ * DMA constrained device is found, @*max is PHYS_ADDR_MAX, and @*min is 0.
  */
-phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
+void __init of_dma_get_cpu_limits(struct device_node *np,
+   phys_addr_t *max, phys_addr_t *min)
 {
phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
struct of_range_parser parser;
-   phys_addr_t subtree_max_addr;
+   phys_addr_t min_cpu_addr = 0;
struct device_node *child;
struct of_range range;
const __be32 *ranges;
+   u64 cpu_start = 0;
u64 cpu_end = 0;
int len;
 
@@ -988,21 +992,33 @@ phys_addr_t __init of_dma_get_max_cpu_address(struct 
device_node *np)
ranges = of_get_property(np, "dma-ranges", &len);
if (ranges && len) {
of_dma_range_parser_init(&parser, np);
-   for_each_of_range(&parser, &range)
-   if (range.cpu_addr + range.size > cpu_end)
+   for_each_of_range(&parser, &range) {
+   if (range.cpu_addr + range.size > cpu_end) {
cpu_end = range.cpu_addr + range.size - 1;
+   cpu_start = range.cpu_addr;
+   }
+   }
 
-   if (max_cpu_addr > cpu_end)
+   if (max_cpu_addr > cpu_end) {
max_cpu_addr = cpu_end;
+   min_cpu_addr = cpu_start;
+   }
}
 
for_each_available_child_of_node(np, child) {
-   subtree_max_addr = of_dma_get_max_cpu_address(child);
-   if (max_cpu_addr > subtree_max_addr)
+   phys_addr_t subtree_max_addr, subtree_min_addr;
+
+   of_dma_get_cpu_limits(child, &subtree_max_addr, &subtree_min_addr);
+   if (max_cpu_addr > subtree_max_addr) {
max_cpu_addr = subtree_max_addr;
+   min_cpu_addr = subtree_min_addr;
+   }
}
 
-   return max_cpu_addr;
+   if (max)
+   *max = max_cpu_addr;
+   if (min)
+   *min = min_cpu_addr;
 }
 
 /**
diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 6b5c36b6a758..2d632d4ec5b1 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -921,7 +921,7 @@ static void __init of_unittest_changeset(void)
 #endif
 }
 
-static void __init of_unittest_dma_get_max_cpu_address(void)
+static void __init of_unittest_dma_get_cpu_limits(void)
 {
struct device_node *np;
phys_addr_t cpu_addr;
@@ -935,9 +935,9 @@ static void __init of_unittest_dma_get_max_cpu_address(void)
return;
}
 
-   cpu_addr = 

[PATCH RFC v2 3/5] of: unittest: add test for of_dma_get_cpu_limits() 'min' param

2024-04-09 Thread Baruch Siach
Verify that of_dma_get_cpu_limits() sets this new parameter to the
expected result.

Signed-off-by: Baruch Siach 
---
 drivers/of/unittest.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 2d632d4ec5b1..8fabb445a62a 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -924,7 +924,7 @@ static void __init of_unittest_changeset(void)
 static void __init of_unittest_dma_get_cpu_limits(void)
 {
struct device_node *np;
-   phys_addr_t cpu_addr;
+   phys_addr_t cpu_addr_max, cpu_addr_min;
 
if (!IS_ENABLED(CONFIG_OF_ADDRESS))
return;
@@ -935,10 +935,13 @@ static void __init of_unittest_dma_get_cpu_limits(void)
return;
}
 
-   of_dma_get_cpu_limits(np, &cpu_addr, NULL);
-   unittest(cpu_addr == 0x4fffffff,
-"of_dma_get_cpu_limits: wrong CPU addr %pad (expecting %x)\n",
-&cpu_addr, 0x4fffffff);
+   of_dma_get_cpu_limits(np, &cpu_addr_max, &cpu_addr_min);
+   unittest(cpu_addr_max == 0x4fffffff,
+"of_dma_get_cpu_limits: wrong CPU max addr %pad (expecting %x)\n",
+&cpu_addr_max, 0x4fffffff);
+   unittest(cpu_addr_min == 0x40000000,
+"of_dma_get_cpu_limits: wrong CPU min addr %pad (expecting %x)\n",
+&cpu_addr_min, 0x40000000);
 }
 
 static void __init of_unittest_dma_ranges_one(const char *path,
-- 
2.43.0



[PATCH RFC v2 4/5] dma-direct: add base offset to zone_dma_bits

2024-04-09 Thread Baruch Siach
Current code using zone_dma_bits assumes that the entire address range
within the bit mask is suitable for DMA. For some existing platforms this
assumption is not correct: the DMA range might have a non-zero lower limit.

Add 'zone_dma_base' for platform code to set the base address of the DMA
zone.

Rename the dma_direct_supported() local 'min_mask' variable to better
describe its use as a limit.

Suggested-by: Catalin Marinas 
Signed-off-by: Baruch Siach 
---
 include/linux/dma-direct.h | 1 +
 kernel/dma/direct.c| 9 +
 kernel/dma/pool.c  | 2 +-
 kernel/dma/swiotlb.c   | 4 ++--
 4 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
index 7cf76f1d3239..dd0330cbef81 100644
--- a/include/linux/dma-direct.h
+++ b/include/linux/dma-direct.h
@@ -13,6 +13,7 @@
 #include 
 
 extern phys_addr_t zone_dma_limit;
+extern phys_addr_t zone_dma_base;
 
 /*
  * Record the mapping of CPU physical to DMA addresses for a given region.
diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 3b2ebcd4f576..92bb241645d6 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -21,6 +21,7 @@
  * override the variable below for dma-direct to work properly.
  */
 phys_addr_t zone_dma_limit __ro_after_init = DMA_BIT_MASK(24);
+phys_addr_t zone_dma_base __ro_after_init;
 
 static inline dma_addr_t phys_to_dma_direct(struct device *dev,
phys_addr_t phys)
@@ -59,7 +60,7 @@ static gfp_t dma_direct_optimal_gfp_mask(struct device *dev, 
u64 *phys_limit)
 * zones.
 */
*phys_limit = dma_to_phys(dev, dma_limit);
-   if (*phys_limit <= zone_dma_limit)
+   if (*phys_limit <= zone_dma_base + zone_dma_limit)
return GFP_DMA;
if (*phys_limit <= DMA_BIT_MASK(32))
return GFP_DMA32;
@@ -567,7 +568,7 @@ int dma_direct_mmap(struct device *dev, struct 
vm_area_struct *vma,
 
 int dma_direct_supported(struct device *dev, u64 mask)
 {
-   u64 min_mask = (max_pfn - 1) << PAGE_SHIFT;
+   u64 min_limit = (max_pfn - 1) << PAGE_SHIFT;
 
/*
 * Because 32-bit DMA masks are so common we expect every architecture
@@ -584,8 +585,8 @@ int dma_direct_supported(struct device *dev, u64 mask)
 * part of the check.
 */
if (IS_ENABLED(CONFIG_ZONE_DMA))
-   min_mask = min_t(u64, min_mask, zone_dma_limit);
-   return mask >= phys_to_dma_unencrypted(dev, min_mask);
+   min_limit = min_t(u64, min_limit, zone_dma_base + 
zone_dma_limit);
+   return mask >= phys_to_dma_unencrypted(dev, min_limit);
 }
 
 /*
diff --git a/kernel/dma/pool.c b/kernel/dma/pool.c
index 410a7b40e496..61a86f3d83ae 100644
--- a/kernel/dma/pool.c
+++ b/kernel/dma/pool.c
@@ -70,7 +70,7 @@ static bool cma_in_zone(gfp_t gfp)
/* CMA can't cross zone boundaries, see cma_activate_area() */
end = cma_get_base(cma) + size - 1;
if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp & GFP_DMA))
-   return end <= zone_dma_limit;
+   return end <= zone_dma_base + zone_dma_limit;
if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp & GFP_DMA32))
return end <= DMA_BIT_MASK(32);
return true;
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 96d6eee7d215..814052df07c5 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -446,7 +446,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask,
if (!remap)
io_tlb_default_mem.can_grow = true;
if (IS_ENABLED(CONFIG_ZONE_DMA) && (gfp_mask & __GFP_DMA))
-   io_tlb_default_mem.phys_limit = zone_dma_limit;
+   io_tlb_default_mem.phys_limit = zone_dma_base + zone_dma_limit;
else if (IS_ENABLED(CONFIG_ZONE_DMA32) && (gfp_mask & __GFP_DMA32))
io_tlb_default_mem.phys_limit = DMA_BIT_MASK(32);
else
@@ -625,7 +625,7 @@ static struct page *swiotlb_alloc_tlb(struct device *dev, 
size_t bytes,
}
 
gfp &= ~GFP_ZONEMASK;
-   if (phys_limit <= zone_dma_limit)
+   if (phys_limit <= zone_dma_base + zone_dma_limit)
gfp |= __GFP_DMA;
else if (phys_limit <= DMA_BIT_MASK(32))
gfp |= __GFP_DMA32;
-- 
2.43.0



[PATCH RFC v2 0/5] arm64: support DMA zone starting above 4GB

2024-04-09 Thread Baruch Siach
The DMA zones code assumes that the DMA lower limit is zero. When there is
no RAM below 4GB, arm64 platform code sets the DMA/DMA32 zone limits to
cover the entire RAM[0].

The platform I have has RAM starting at 32GB. Devices with 30-bit DMA mask are 
mapped to 1GB at the bottom of RAM, between 32GB - 33GB. A DMA zone over the 
entire RAM breaks DMA allocation for these devices.

In response to a previous RFC hack[1] Catalin Marinas suggested to add a
separate offset value as base address for the DMA zone. This RFC series 
attempts to implement that suggestion.

With this series applied, the DMA zone covers the right RAM range for my 
platform.

RFC v2:

  * Add patch from Catalin[2] changing zone_dma_bits to zone_dma_limit to 
simplify subsequent patches

  * Test on real hardware

RFC v1: https://lore.kernel.org/all/cover.1703683642.git.bar...@tkos.co.il/

[0] See commit 791ab8b2e3db ("arm64: Ignore any DMA offsets in the 
max_zone_phys() calculation")

[1] 
https://lore.kernel.org/all/9af8a19c3398e7dc09cfc1fbafed98d795d9f83e.1699464622.git.bar...@tkos.co.il/

[2] https://lore.kernel.org/all/zz2hnhjv3gdzu...@arm.com/

Baruch Siach (4):
  of: get dma area lower limit
  of: unittest: add test for of_dma_get_cpu_limits() 'min' param
  dma-direct: add base offset to zone_dma_bits
  arm64: mm: take DMA zone offset into account

Catalin Marinas (1):
  dma-mapping: replace zone_dma_bits by zone_dma_limit

 arch/arm64/mm/init.c   | 34 --
 arch/powerpc/mm/mem.c  |  9 -
 arch/s390/mm/init.c|  2 +-
 drivers/of/address.c   | 38 +++---
 drivers/of/unittest.c  | 17 ++---
 include/linux/dma-direct.h |  3 ++-
 include/linux/of.h | 11 ---
 kernel/dma/direct.c| 11 ++-
 kernel/dma/pool.c  |  2 +-
 kernel/dma/swiotlb.c   |  4 ++--
 10 files changed, 73 insertions(+), 58 deletions(-)

-- 
2.43.0