Re: [PATCH RESEND v8 16/16] bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of

2024-05-17 Thread Will Deacon
Hi Klara,

On Fri, May 17, 2024 at 01:00:31AM +0200, Klara Modin wrote:
> On 2024-05-05 18:06, Mike Rapoport wrote:
> > From: "Mike Rapoport (IBM)" 
> > 
> > BPF just-in-time compiler depended on CONFIG_MODULES because it used
> > module_alloc() to allocate memory for the generated code.
> > 
> > Since code allocations are now implemented with execmem, drop dependency of
> > CONFIG_BPF_JIT on CONFIG_MODULES and make it select CONFIG_EXECMEM.
> > 
> > Suggested-by: Björn Töpel 
> > Signed-off-by: Mike Rapoport (IBM) 
> > ---
> >   kernel/bpf/Kconfig | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> > index bc25f5098a25..f999e4e0b344 100644
> > --- a/kernel/bpf/Kconfig
> > +++ b/kernel/bpf/Kconfig
> > @@ -43,7 +43,7 @@ config BPF_JIT
> > bool "Enable BPF Just In Time compiler"
> > depends on BPF
> > depends on HAVE_CBPF_JIT || HAVE_EBPF_JIT
> > -   depends on MODULES
> > +   select EXECMEM
> > help
> >   BPF programs are normally handled by a BPF interpreter. This option
> >   allows the kernel to generate native code when a program is loaded
> 
> This does not seem to work entirely. If I build with BPF_JIT but without module
> support for my Raspberry Pi 3 B, I get warnings in my kernel log (the easiest way
> to trigger it seems to be trying to ssh into it, which fails).

Thanks for the report. I was able to reproduce this using QEMU and it
looks like the problem is that bpf_arch_text_copy() silently fails
to write to the read-only area as a result of patch_map() faulting and
the resulting -EFAULT being chucked away.

Please can you try the diff below?

Will

--->8

diff --git a/arch/arm64/kernel/patching.c b/arch/arm64/kernel/patching.c
index 255534930368..94b9fea65aca 100644
--- a/arch/arm64/kernel/patching.c
+++ b/arch/arm64/kernel/patching.c
@@ -36,7 +36,7 @@ static void __kprobes *patch_map(void *addr, int fixmap)
 
if (image)
page = phys_to_page(__pa_symbol(addr));
-   else if (IS_ENABLED(CONFIG_STRICT_MODULE_RWX))
+   else if (IS_ENABLED(CONFIG_EXECMEM))
page = vmalloc_to_page(addr);
else
return addr;
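
For anyone else following along, the decision that hunk changes boils down to
roughly the sketch below (a paraphrase with a made-up helper name, not the
actual kernel function). With CONFIG_MODULES=n, CONFIG_STRICT_MODULE_RWX is
never selected (it depends on MODULES), so execmem/BPF addresses fell through
to the "patch in place" case and the subsequent store to the read-only mapping
faulted, which would explain the -EFAULT above.

static struct page *patching_page(void *addr, bool in_kernel_image)
{
	if (in_kernel_image)
		return phys_to_page(__pa_symbol(addr));	/* kernel .text via the linear map */
	if (IS_ENABLED(CONFIG_EXECMEM))			/* previously CONFIG_STRICT_MODULE_RWX */
		return vmalloc_to_page(addr);		/* module/BPF/kprobe text in vmalloc space */
	return NULL;					/* no fixmap needed: patch addr in place */
}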




Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-19 Thread Will Deacon
On Thu, Apr 18, 2024 at 12:53:26PM -0700, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, Will Deacon wrote:
> > On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> > > On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson 
> > > >  wrote:
> > > > > 
> > > > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  
> > > > > > wrote:
> > > > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > > > Also, if you're in the business of hacking the MMU notifier code, 
> > > > > > > it
> > > > > > > would be really great to change the .clear_flush_young() callback 
> > > > > > > so
> > > > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > > > moment,
> > > > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > > > 'flush_on_ret'
> > > > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > > > lighter-weight and targeted TLBI in the architecture page-table 
> > > > > > > code
> > > > > > > when we actually update the ptes for small ranges.
> > > > > > 
> > > > > > Indeed, and I was looking at this earlier this week as it has a 
> > > > > > pretty
> > > > > > devastating effect with NV (it blows the shadow S2 for that VMID, 
> > > > > > with
> > > > > > costly consequences).
> > > > > > 
> > > > > > In general, it feels like the TLB invalidation should stay with the
> > > > > > code that deals with the page tables, as it has a pretty good idea 
> > > > > > of
> > > > > > what needs to be invalidated and how -- especially on architectures
> > > > > > that have a HW-broadcast facility like arm64.
> > > > > 
> > > > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > > > simpler, more
> > > > > straightforward solution would be to let architectures override 
> > > > > flush_on_ret,
> > > > > but I would prefer something like the below as x86 can also utilize a 
> > > > > range-based
> > > > > flush when running as a nested hypervisor.
> > > 
> > > ...
> > > 
> > > > I think this works for us on HW that has range invalidation, which
> > > > would already be a positive move.
> > > > 
> > > > For the lesser HW that isn't range capable, it also gives the
> > > > opportunity to perform the iteration ourselves or go for the nuclear
> > > > option if the range is larger than some arbitrary constant (though
> > > > this is additional work).
> > > > 
> > > > But this still considers the whole range as being affected by
> > > > range->handler(). It'd be interesting to try and see whether more
> > > > precise tracking is (or isn't) generally beneficial.
> > > 
> > > I assume the idea would be to let arch code do single-page invalidations 
> > > of
> > > stage-2 entries for each gfn?
> > 
> > Right, as it's the only code which knows which ptes actually ended up
> > being aged.
> > 
> > > Unless I'm having a brain fart, x86 can't make use of that functionality. 
> > >  Intel
> > > doesn't provide any way to do targeted invalidation of stage-2 mappings.  
> > > AMD
> > > provides an instruction to do broadcast invalidations, but it takes a 
> > > virtual
> > > address, i.e. a stage-1 address.  I can't tell if it's a host virtual 
> > > address or
> > > a guest virtual address, but it's a moot point because KVM doesn't have 
> > > the guest
> > > virtual address, and if it's a host virtual address, there would need to 
> > > be valid
> > > mappings in the host page tables for it to work, which KVM can't 
> > > guarantee.
> > 
> > Ah, so it sounds like it would need to be an arch opt-in then.
> 
> Even if x86 (or some other arch code) could use the precise tracking, I think 
> it
> would make sense to have the behavior be arch specific.  Adding infrastructure
> to get information from arch code, only to turn aroun

Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-18 Thread Will Deacon
On Mon, Apr 15, 2024 at 10:03:51AM -0700, Sean Christopherson wrote:
> On Sat, Apr 13, 2024, Marc Zyngier wrote:
> > On Fri, 12 Apr 2024 15:54:22 +0100, Sean Christopherson  
> > wrote:
> > > 
> > > On Fri, Apr 12, 2024, Marc Zyngier wrote:
> > > > On Fri, 12 Apr 2024 11:44:09 +0100, Will Deacon  wrote:
> > > > > On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> > > > > Also, if you're in the business of hacking the MMU notifier code, it
> > > > > would be really great to change the .clear_flush_young() callback so
> > > > > that the architecture could handle the TLB invalidation. At the 
> > > > > moment,
> > > > > the core KVM code invalidates the whole VMID courtesy of 
> > > > > 'flush_on_ret'
> > > > > being set by kvm_handle_hva_range(), whereas we could do a much
> > > > > lighter-weight and targeted TLBI in the architecture page-table code
> > > > > when we actually update the ptes for small ranges.
> > > > 
> > > > Indeed, and I was looking at this earlier this week as it has a pretty
> > > > devastating effect with NV (it blows the shadow S2 for that VMID, with
> > > > costly consequences).
> > > > 
> > > > In general, it feels like the TLB invalidation should stay with the
> > > > code that deals with the page tables, as it has a pretty good idea of
> > > > what needs to be invalidated and how -- especially on architectures
> > > > that have a HW-broadcast facility like arm64.
> > > 
> > > Would this be roughly on par with an in-line flush on arm64?  The 
> > > simpler, more
> > > straightforward solution would be to let architectures override 
> > > flush_on_ret,
> > > but I would prefer something like the below as x86 can also utilize a 
> > > range-based
> > > flush when running as a nested hypervisor.
> 
> ...
> 
> > I think this works for us on HW that has range invalidation, which
> > would already be a positive move.
> > 
> > For the lesser HW that isn't range capable, it also gives the
> > opportunity to perform the iteration ourselves or go for the nuclear
> > option if the range is larger than some arbitrary constant (though
> > this is additional work).
> > 
> > But this still considers the whole range as being affected by
> > range->handler(). It'd be interesting to try and see whether more
> > precise tracking is (or isn't) generally beneficial.
> 
> I assume the idea would be to let arch code do single-page invalidations of
> stage-2 entries for each gfn?

Right, as it's the only code which knows which ptes actually ended up
being aged.

> Unless I'm having a brain fart, x86 can't make use of that functionality.  
> Intel
> doesn't provide any way to do targeted invalidation of stage-2 mappings.  AMD
> provides an instruction to do broadcast invalidations, but it takes a virtual
> address, i.e. a stage-1 address.  I can't tell if it's a host virtual address 
> or
> a guest virtual address, but it's a moot point because KVM doesn't have the 
> guest
> virtual address, and if it's a host virtual address, there would need to be 
> valid
> mappings in the host page tables for it to work, which KVM can't guarantee.

Ah, so it sounds like it would need to be an arch opt-in then.

Will



Re: [PATCH 1/4] KVM: delete .change_pte MMU notifier callback

2024-04-12 Thread Will Deacon
On Fri, Apr 05, 2024 at 07:58:12AM -0400, Paolo Bonzini wrote:
> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> index dc04bc767865..ff17849be9f4 100644
> --- a/arch/arm64/kvm/mmu.c
> +++ b/arch/arm64/kvm/mmu.c
> @@ -1768,40 +1768,6 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct 
> kvm_gfn_range *range)
>   return false;
>  }
>  
> -bool kvm_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
> -{
> - kvm_pfn_t pfn = pte_pfn(range->arg.pte);
> -
> - if (!kvm->arch.mmu.pgt)
> - return false;
> -
> - WARN_ON(range->end - range->start != 1);
> -
> - /*
> -  * If the page isn't tagged, defer to user_mem_abort() for sanitising
> -  * the MTE tags. The S2 pte should have been unmapped by
> -  * mmu_notifier_invalidate_range_end().
> -  */
> - if (kvm_has_mte(kvm) && !page_mte_tagged(pfn_to_page(pfn)))
> - return false;
> -
> - /*
> -  * We've moved a page around, probably through CoW, so let's treat
> -  * it just like a translation fault and the map handler will clean
> -  * the cache to the PoC.
> -  *
> -  * The MMU notifiers will have unmapped a huge PMD before calling
> -  * ->change_pte() (which in turn calls kvm_set_spte_gfn()) and
> -  * therefore we never need to clear out a huge PMD through this
> -  * calling path and a memcache is not required.
> -  */
> - kvm_pgtable_stage2_map(kvm->arch.mmu.pgt, range->start << PAGE_SHIFT,
> -PAGE_SIZE, __pfn_to_phys(pfn),
> -KVM_PGTABLE_PROT_R, NULL, 0);
> -
> - return false;
> -}
> -
>  bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
>   u64 size = (range->end - range->start) << PAGE_SHIFT;

Thanks. It's nice to see this code retire:

Acked-by: Will Deacon 

Also, if you're in the business of hacking the MMU notifier code, it
would be really great to change the .clear_flush_young() callback so
that the architecture could handle the TLB invalidation. At the moment,
the core KVM code invalidates the whole VMID courtesy of 'flush_on_ret'
being set by kvm_handle_hva_range(), whereas we could do a much
lighter-weight and targeted TLBI in the architecture page-table code
when we actually update the ptes for small ranges.
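
To make that concrete, the rough shape could be something like the sketch
below (helper names are made up and this is not a patch, just an illustration
of where the invalidation would move):

bool kvm_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
{
	u64 start = range->start << PAGE_SHIFT;
	u64 size  = (range->end - range->start) << PAGE_SHIFT;
	bool young;

	/* Hypothetical: age the ptes and learn which ones were actually young. */
	young = stage2_test_and_clear_young(kvm, start, size);

	/*
	 * Hypothetical: targeted (ideally range-based) TLBI for just this
	 * range, instead of core KVM setting flush_on_ret and invalidating
	 * the whole VMID.
	 */
	if (young)
		stage2_flush_range_tlbi(kvm, start, size);

	return young;
}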

Will



Re: [PATCH untested] vhost: order avail ring reads after index updates

2024-03-27 Thread Will Deacon
On Wed, Mar 27, 2024 at 01:26:23PM -0400, Michael S. Tsirkin wrote:
> vhost_get_vq_desc (correctly) uses smp_rmb to order
> avail ring reads after index reads.
> However, over time we added two more places that read the
> index and do not bother with barriers.
> Since vhost_get_vq_desc, when it was written, assumed it is the
> only reader, it does not bother with a barrier either when it sees
> that a new index value is cached. As a result, on the
> nvidia-gracehopper platform (arm64), available ring entry reads have
> been observed bypassing ring reads, causing a ring corruption.
> 
> To fix, factor out the correct index access code from vhost_get_vq_desc.
> As a side benefit, we also validate the index on all paths now, which
> will hopefully help catch future errors earlier.
> 
> Note: current code is inconsistent in how it handles errors:
> some places treat it as an empty ring, others - non empty.
> This patch does not attempt to change the existing behaviour.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Gavin Shan 
> Reported-by: Will Deacon 
> Suggested-by: Will Deacon 
> Fixes: 275bf960ac69 ("vhost: better detection of available buffers")
> Cc: "Jason Wang" 
> Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> Cc: "Stefano Garzarella" 
> Signed-off-by: Michael S. Tsirkin 
> ---
> 
> I think it's better to bite the bullet and clean up the code.
> Note: this is still only built, not tested.
> Gavin could you help test please?
> Especially on the arm platform you have?
> 
> Will thanks so much for finding this race!

No problem, and I was also hoping that the smp_rmb() could be
consolidated into a single helper like you've done here.

One minor comment below:

>  drivers/vhost/vhost.c | 80 +++
>  1 file changed, 42 insertions(+), 38 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..26b70b1fd9ff 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1290,10 +1290,38 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>   mutex_unlock(&d->vqs[i]->mutex);
>  }
>  
> -static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> -   __virtio16 *idx)
> +static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
> - return vhost_get_avail(vq, *idx, &vq->avail->idx);
> + __virtio16 idx;
> + u16 avail_idx;
> + int r = vhost_get_avail(vq, idx, &vq->avail->idx);
> +
> + if (unlikely(r < 0)) {
> + vq_err(vq, "Failed to access avail idx at %p: %d\n",
> +        &vq->avail->idx, r);
> + return -EFAULT;
> + }
> +
> + avail_idx = vhost16_to_cpu(vq, idx);
> +
> + /* Check it isn't doing very strange things with descriptor numbers. */
> + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
> + vq_err(vq, "Guest moved used index from %u to %u",
> +vq->last_avail_idx, vq->avail_idx);
> + return -EFAULT;
> + }
> +
> + /* Nothing new? We are done. */
> + if (avail_idx == vq->avail_idx)
> + return 0;
> +
> + vq->avail_idx = avail_idx;
> +
> + /* We updated vq->avail_idx so we need a memory barrier between
> +  * the index read above and the caller reading avail ring entries.
> +  */
> + smp_rmb();

I think you could use smp_acquire__after_ctrl_dep() if you're feeling
brave, but to be honest I'd prefer we went in the opposite direction
and used READ/WRITE_ONCE + smp_load_acquire()/smp_store_release() across
the board. It's just a thankless, error-prone task to get there :(
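
As a plain illustration of that direction (ordinary kernel memory here, not
vhost's user-mapped ring, so this is only a sketch of the pattern rather than
a drop-in change):

struct demo_ring {
	u16 entries[256];
	u16 avail_idx;
};

/* Single producer: publish the entry, then the index, with release semantics. */
static void demo_publish(struct demo_ring *r, u16 val)
{
	u16 idx = r->avail_idx;

	r->entries[idx % 256] = val;
	smp_store_release(&r->avail_idx, idx + 1);
}

/* Consumer: the acquire load orders the entry read after the index read,
 * with no separate smp_rmb() needed. */
static u16 demo_read_newest(struct demo_ring *r)
{
	u16 idx = smp_load_acquire(&r->avail_idx);

	return r->entries[(idx - 1) % 256];
}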

So, for the patch as-is:

Acked-by: Will Deacon 

(I've not tested it either though, so definitely wait for Gavin on that!)

Cheers,

Will



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Will Deacon
On Tue, Mar 26, 2024 at 11:43:13AM +, Will Deacon wrote:
> On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> > On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > > Secondly, the debugging code is enhanced so that the available head for
> > > > (last_avail_idx - 1) is read for twice and recorded. It means the 
> > > > available
> > > > head for one specific available index is read for twice. I do see the
> > > > available heads are different from the consecutive reads. More details
> > > > are shared as below.
> > > > 
> > > > From the guest side
> > > > ===
> > > > 
> > > > virtio_net virtio0: output.0:id 86 is not a head!
> > > > head to be released: 047 062 112
> > > > 
> > > > avail_idx:
> > > > 000  49665
> > > > 001  49666  <--
> > > >  :
> > > > 015  49664
> > > 
> > > what are these #s 49665 and so on?
> > > and how large is the ring?
> > > I am guessing 49664 is the index ring size is 16 and
> > > 49664 % 16 == 0
> > 
> > More than that, 49664 % 256 == 0
> > 
> > So again there seems to be an error in the vicinity of roll-over of
> > the idx low byte, as I observed in the earlier log. Surely this is
> > more than coincidence?
> 
> Yeah, I'd still really like to see the disassembly for both sides of the
> protocol here. Gavin, is that something you're able to provide? Worst
> case, the host and guest vmlinux objects would be a starting point.
> 
> Personally, I'd be fairly surprised if this was a hardware issue.

Ok, long shot after eyeballing the vhost code, but does the diff below
help at all? It looks like vhost_vq_avail_empty() can advance the value
saved in 'vq->avail_idx' but without the read barrier, possibly confusing
vhost_get_vq_desc() in polling mode.

Will

--->8

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 045f666b4f12..87bff710331a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2801,6 +2801,7 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, struct 
vhost_virtqueue *vq)
return false;
vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
 
+   smp_rmb();
return vq->avail_idx == vq->last_avail_idx;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);




Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-26 Thread Will Deacon
On Tue, Mar 26, 2024 at 09:38:55AM +, Keir Fraser wrote:
> On Tue, Mar 26, 2024 at 03:49:02AM -0400, Michael S. Tsirkin wrote:
> > > Secondly, the debugging code is enhanced so that the available head for
> > > (last_avail_idx - 1) is read for twice and recorded. It means the 
> > > available
> > > head for one specific available index is read for twice. I do see the
> > > available heads are different from the consecutive reads. More details
> > > are shared as below.
> > > 
> > > From the guest side
> > > ===
> > > 
> > > virtio_net virtio0: output.0:id 86 is not a head!
> > > head to be released: 047 062 112
> > > 
> > > avail_idx:
> > > 000  49665
> > > 001  49666  <--
> > >  :
> > > 015  49664
> > 
> > what are these #s 49665 and so on?
> > and how large is the ring?
> > I am guessing 49664 is the index ring size is 16 and
> > 49664 % 16 == 0
> 
> More than that, 49664 % 256 == 0
> 
> So again there seems to be an error in the vicinity of roll-over of
> the idx low byte, as I observed in the earlier log. Surely this is
> more than coincidence?

Yeah, I'd still really like to see the disassembly for both sides of the
protocol here. Gavin, is that something you're able to provide? Worst
case, the host and guest vmlinux objects would be a starting point.

Personally, I'd be fairly surprised if this was a hardware issue.

Will



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Will Deacon
On Tue, Mar 19, 2024 at 02:59:23PM +1000, Gavin Shan wrote:
> On 3/19/24 02:59, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > The issue is reported by Yihuang Yu, who ran a 'netperf' test on
> > > NVidia's grace-grace and grace-hopper machines. The 'netperf'
> > > client is started in the VM hosted by grace-hopper machine,
> > > while the 'netperf' server is running on grace-grace machine.
> > > 
> > > The VM is started with virtio-net and vhost has been enabled.
> > > We observe an error message spew from the VM and then soft-lockup
> > > report. The error message indicates the data associated with
> > > the descriptor (index: 135) has been released, and the queue
> > > is marked as broken. It eventually leads to the endless effort
> > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> > > and soft-lockup. The stale index 135 is fetched from the available
> > > ring and published to the used ring by vhost, meaning we have
> > > disordered write to the available ring element and available index.
> > > 
> > >/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> > >-accel kvm -machine virt,gic-version=host\
> > >   : \
> > >-netdev tap,id=vnet0,vhost=on\
> > >-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> > > 
> > >[   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> > > 
> > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> > > virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> > > ARM64. It should work for other architectures, but performance loss is
> > > expected.
> > > 
> > > Cc: sta...@vger.kernel.org
> > > Reported-by: Yihuang Yu 
> > > Signed-off-by: Gavin Shan 
> > > ---
> > >   drivers/virtio/virtio_ring.c | 12 +---
> > >   1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, 
> > > head);
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensuere the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > >   vq->split.avail_idx_shadow++;
> > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   
> > > vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing difference from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smp_mb())?
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> > 
> 
> Thanks for your comments, Will.
> 
> Yes, DMB should work for us. However, it seems this instruction has issues on
> NVidia's grace-hopper. It's hard for me to understand how DMB and DSB works
> from hardware level. I agree it's not the solution to replace DMB with DSB
> before we fully understand the root cause.
> 
> I tried the possible replacement like below. __smp_mb() can avoid the issue 
&

Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-19 Thread Will Deacon
On Tue, Mar 19, 2024 at 03:36:31AM -0400, Michael S. Tsirkin wrote:
> On Mon, Mar 18, 2024 at 04:59:24PM +, Will Deacon wrote:
> > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 49299b1f9ec7..7d852811c912 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > >   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > >  
> > > - /* Descriptors and available array need to be set before we expose the
> > > -  * new available array entries. */
> > > - virtio_wmb(vq->weak_barriers);
> > > + /*
> > > +  * Descriptors and available array need to be set before we expose
> > > +  * the new available array entries. virtio_wmb() should be enough
> > > +  * to ensuere the order theoretically. However, a stronger barrier
> > > +  * is needed by ARM64. Otherwise, the stale data can be observed
> > > +  * by the host (vhost). A stronger barrier should work for other
> > > +  * architectures, but performance loss is expected.
> > > +  */
> > > + virtio_mb(false);
> > >   vq->split.avail_idx_shadow++;
> > >   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
> > >   vq->split.avail_idx_shadow);
> > 
> > Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
> > here, especially when ordering accesses to coherent memory.
> > 
> > In practice, either the larger timing difference from the DSB or the fact
> > that you're going from a Store->Store barrier to a full barrier is what
> > makes things "work" for you. Have you tried, for example, a DMB SY
> > (e.g. via __smp_mb())?
> > 
> > We definitely shouldn't take changes like this without a proper
> > explanation of what is going on.
> 
> Just making sure: so on this system, how do
> smp_wmb() and wmb() differ? smb_wmb is normally for synchronizing
> with kernel running on another CPU and we are doing something
> unusual in virtio when we use it to synchronize with host
> as opposed to the guest - e.g. CONFIG_SMP is special cased
> because of this:
> 
> #define virt_wmb() do { kcsan_wmb(); __smp_wmb(); } while (0)
> 
> Note __smp_wmb not smp_wmb which would be a NOP on UP.

I think that should be fine (as long as the NOP is a barrier() for the
compiler).

wmb() uses a DSB, but that is only really relevant to endpoint ordering
with real I/O devices, e.g. if for some reason you need to guarantee
that a write has made it all the way to the device before proceeding.
Even then you're slightly at the mercy of the memory type and the
interconnect not giving an early acknowledgement, so the extra ordering
is rarely needed in practice and we don't even provide it for our I/O
accessors (e.g. writel() and readl()).

So for virtio between two CPUs using coherent memory, DSB is a red
herring.
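
For reference, those macros roughly map as follows on arm64 (paraphrasing
arch/arm64/include/asm/barrier.h; check the header for the authoritative
definitions):

#define mb()		dsb(sy)		/* full-system DSB */
#define wmb()		dsb(st)		/* DSB, stores only */
#define __smp_mb()	dmb(ish)	/* inner-shareable DMB */
#define __smp_wmb()	dmb(ishst)	/* inner-shareable DMB, stores only */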

Will



Re: [PATCH] virtio_ring: Fix the stale index in available ring

2024-03-18 Thread Will Deacon
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
> The issue is reported by Yihuang Yu, who ran a 'netperf' test on
> NVidia's grace-grace and grace-hopper machines. The 'netperf'
> client is started in the VM hosted by grace-hopper machine,
> while the 'netperf' server is running on grace-grace machine.
> 
> The VM is started with virtio-net and vhost has been enabled.
> We observe an error message spew from the VM and then soft-lockup
> report. The error message indicates the data associated with
> the descriptor (index: 135) has been released, and the queue
> is marked as broken. It eventually leads to the endless effort
> to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
> and soft-lockup. The stale index 135 is fetched from the available
> ring and published to the used ring by vhost, meaning we have
> disordered write to the available ring element and available index.
> 
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host\
>  : \
>   -netdev tap,id=vnet0,vhost=on\
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \
> 
>   [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!
> 
> Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
> virtio_mb(false), equivalent to replacing the 'dmb' instruction with 'dsb' on
> ARM64. It should work for other architectures, but performance loss is
> expected.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/virtio/virtio_ring.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..7d852811c912 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue 
> *_vq,
>   avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>   vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
>  
> - /* Descriptors and available array need to be set before we expose the
> -  * new available array entries. */
> - virtio_wmb(vq->weak_barriers);
> + /*
> +  * Descriptors and available array need to be set before we expose
> +  * the new available array entries. virtio_wmb() should be enough
> +  * to ensuere the order theoretically. However, a stronger barrier
> +  * is needed by ARM64. Otherwise, the stale data can be observed
> +  * by the host (vhost). A stronger barrier should work for other
> +  * architectures, but performance loss is expected.
> +  */
> + virtio_mb(false);
>   vq->split.avail_idx_shadow++;
>   vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
>   vq->split.avail_idx_shadow);

Replacing a DMB with a DSB is _very_ unlikely to be the correct solution
here, especially when ordering accesses to coherent memory.

In practice, either the larger timing difference from the DSB or the fact
that you're going from a Store->Store barrier to a full barrier is what
makes things "work" for you. Have you tried, for example, a DMB SY
(e.g. via __smp_mb())?

We definitely shouldn't take changes like this without a proper
explanation of what is going on.

Will



Re: [PATCH] iommu/qcom: restore IOMMU state if needed

2023-12-12 Thread Will Deacon
On Wed, 11 Oct 2023 19:57:26 +0200, Luca Weiss wrote:
> From: Vladimir Lypak 
> 
> If the IOMMU has a power domain then some state will be lost in
> qcom_iommu_suspend and TZ will reset device if we don't call
> qcom_scm_restore_sec_cfg before accessing it again.
> 
> 
> [...]

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/1] iommu/qcom: restore IOMMU state if needed
  https://git.kernel.org/will/c/268dd4edb748

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev



Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-11-07 Thread Will Deacon
On Mon, Oct 30, 2023 at 09:00:53AM +0200, Mike Rapoport wrote:
> On Thu, Oct 26, 2023 at 11:24:39AM +0100, Will Deacon wrote:
> > On Thu, Oct 26, 2023 at 11:58:00AM +0300, Mike Rapoport wrote:
> > > On Mon, Oct 23, 2023 at 06:14:20PM +0100, Will Deacon wrote:
> > > > On Mon, Sep 18, 2023 at 10:29:46AM +0300, Mike Rapoport wrote:
> > > > > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > > > > index dd851297596e..cd6320de1c54 100644
> > > > > --- a/arch/arm64/kernel/module.c
> > > > > +++ b/arch/arm64/kernel/module.c
> 
> ...
> 
> > > > > - if (module_direct_base) {
> > > > > - p = __vmalloc_node_range(size, MODULE_ALIGN,
> > > > > -  module_direct_base,
> > > > > -  module_direct_base + SZ_128M,
> > > > > -  GFP_KERNEL | __GFP_NOWARN,
> > > > > -  PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > > > -  __builtin_return_address(0));
> > > > > - }
> > > > > + module_init_limits();
> > > > 
> > > > Hmm, this used to be run from subsys_initcall(), but now you're running
> > > > it _really_ early, before random_init(), so randomization of the module
> > > > space is no longer going to be very random if we don't have early 
> > > > entropy
> > > > from the firmware or the CPU, which is likely to be the case on most 
> > > > SoCs.
> > > 
> > > Well, it will be as random as KASLR. Won't that be enough?
> > 
> > I don't think that's true -- we have the 'kaslr-seed' property for KASLR,
> > but I'm not seeing anything like that for the module randomisation and I
> > also don't see why we need to set these limits so early.
> 
> x86 needs execmem initialized before ftrace_init() so I thought it would be
> best to setup execmem along with most of MM in mm_core_init().
> 
> I'll move execmem initialization for !x86 to a later point, say
> core_initcall.

Thanks, Mike.

Will



Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-10-26 Thread Will Deacon
On Thu, Oct 26, 2023 at 11:58:00AM +0300, Mike Rapoport wrote:
> On Mon, Oct 23, 2023 at 06:14:20PM +0100, Will Deacon wrote:
> > On Mon, Sep 18, 2023 at 10:29:46AM +0300, Mike Rapoport wrote:
> > > diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> > > index dd851297596e..cd6320de1c54 100644
> > > --- a/arch/arm64/kernel/module.c
> > > +++ b/arch/arm64/kernel/module.c
> > > @@ -20,6 +20,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  
> > >  #include 
> > >  #include 
> > > @@ -108,46 +109,38 @@ static int __init module_init_limits(void)
> > >  
> > >   return 0;
> > >  }
> > > -subsys_initcall(module_init_limits);
> > >  
> > > -void *module_alloc(unsigned long size)
> > > +static struct execmem_params execmem_params __ro_after_init = {
> > > + .ranges = {
> > > + [EXECMEM_DEFAULT] = {
> > > + .flags = EXECMEM_KASAN_SHADOW,
> > > + .alignment = MODULE_ALIGN,
> > > + },
> > > + },
> > > +};
> > > +
> > > +struct execmem_params __init *execmem_arch_params(void)
> > >  {
> > > - void *p = NULL;
> > > + struct execmem_range *r = &execmem_params.ranges[EXECMEM_DEFAULT];
> > >  
> > > - /*
> > > -  * Where possible, prefer to allocate within direct branch range of the
> > > -  * kernel such that no PLTs are necessary.
> > > -  */
> > 
> > Why are you removing this comment? I think you could just move it next
> > to the part where we set a 128MiB range.
>  
> Oops, my bad. Will add it back.

Thanks.

> > > - if (module_direct_base) {
> > > - p = __vmalloc_node_range(size, MODULE_ALIGN,
> > > -  module_direct_base,
> > > -  module_direct_base + SZ_128M,
> > > -  GFP_KERNEL | __GFP_NOWARN,
> > > -  PAGE_KERNEL, 0, NUMA_NO_NODE,
> > > -  __builtin_return_address(0));
> > > - }
> > > + module_init_limits();
> > 
> > Hmm, this used to be run from subsys_initcall(), but now you're running
> > it _really_ early, before random_init(), so randomization of the module
> > space is no longer going to be very random if we don't have early entropy
> > from the firmware or the CPU, which is likely to be the case on most SoCs.
> 
> Well, it will be as random as KASLR. Won't that be enough?

I don't think that's true -- we have the 'kaslr-seed' property for KASLR,
but I'm not seeing anything like that for the module randomisation and I
also don't see why we need to set these limits so early.

Will



Re: [PATCH v3 07/13] arm64, execmem: extend execmem_params for generated code allocations

2023-10-23 Thread Will Deacon
On Mon, Sep 18, 2023 at 10:29:49AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> The memory allocations for kprobes and BPF on arm64 can be placed
> anywhere in vmalloc address space and currently this is implemented with
> overrides of alloc_insn_page() and bpf_jit_alloc_exec() in arm64.
> 
> Define EXECMEM_KPROBES and EXECMEM_BPF ranges in arm64::execmem_params and
> drop overrides of alloc_insn_page() and bpf_jit_alloc_exec().
> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/arm64/kernel/module.c | 13 +
>  arch/arm64/kernel/probes/kprobes.c |  7 ---
>  arch/arm64/net/bpf_jit_comp.c  | 11 ---
>  3 files changed, 13 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index cd6320de1c54..d27db168d2a2 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -116,6 +116,16 @@ static struct execmem_params execmem_params 
> __ro_after_init = {
>   .flags = EXECMEM_KASAN_SHADOW,
>   .alignment = MODULE_ALIGN,
>   },
> + [EXECMEM_KPROBES] = {
> + .start = VMALLOC_START,
> + .end = VMALLOC_END,
> + .alignment = 1,
> + },
> + [EXECMEM_BPF] = {
> + .start = VMALLOC_START,
> + .end = VMALLOC_END,
> + .alignment = 1,
> + },
>   },
>  };
>  
> @@ -140,6 +150,9 @@ struct execmem_params __init *execmem_arch_params(void)
>   r->end = module_plt_base + SZ_2G;
>   }
>  
> + execmem_params.ranges[EXECMEM_KPROBES].pgprot = PAGE_KERNEL_ROX;
> + execmem_params.ranges[EXECMEM_BPF].pgprot = PAGE_KERNEL;
> +
>   return &execmem_params;
>  }
>  
> diff --git a/arch/arm64/kernel/probes/kprobes.c 
> b/arch/arm64/kernel/probes/kprobes.c
> index 70b91a8c6bb3..6fccedd02b2a 100644
> --- a/arch/arm64/kernel/probes/kprobes.c
> +++ b/arch/arm64/kernel/probes/kprobes.c
> @@ -129,13 +129,6 @@ int __kprobes arch_prepare_kprobe(struct kprobe *p)
>   return 0;
>  }
>  
> -void *alloc_insn_page(void)
> -{
> - return __vmalloc_node_range(PAGE_SIZE, 1, VMALLOC_START, VMALLOC_END,
> - GFP_KERNEL, PAGE_KERNEL_ROX, VM_FLUSH_RESET_PERMS,
> -     NUMA_NO_NODE, __builtin_return_address(0));
> -}

It's slightly curious that we didn't clear the tag here, so it's nice that
it all happens magically with your series:

Acked-by: Will Deacon 

Will
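
(For contrast, the module_alloc() path being converted in the same series --
quoted just below in the 04/13 thread -- does end with the tag reset before
returning executable memory:

	/* Memory is intended to be executable, reset the pointer tag. */
	return kasan_reset_tag(p);

so routing kprobes and BPF through execmem picks that up for free.)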



Re: [PATCH v3 04/13] mm/execmem, arch: convert remaining overrides of module_alloc to execmem

2023-10-23 Thread Will Deacon
Hi Mike,

On Mon, Sep 18, 2023 at 10:29:46AM +0300, Mike Rapoport wrote:
> From: "Mike Rapoport (IBM)" 
> 
> Extend execmem parameters to accommodate more complex overrides of
> module_alloc() by architectures.
> 
> This includes specification of a fallback range required by arm, arm64
> and powerpc and support for allocation of KASAN shadow required by
> arm64, s390 and x86.
> 
> The core implementation of execmem_alloc() takes care of suppressing
> warnings when the initial allocation fails but there is a fallback range
> defined.
> 
> Signed-off-by: Mike Rapoport (IBM) 
> ---
>  arch/arm/kernel/module.c | 38 -
>  arch/arm64/kernel/module.c   | 57 ++--
>  arch/powerpc/kernel/module.c | 52 ++---
>  arch/s390/kernel/module.c| 52 +++--
>  arch/x86/kernel/module.c | 64 +++-
>  include/linux/execmem.h  | 14 
>  mm/execmem.c | 43 ++--
>  7 files changed, 167 insertions(+), 153 deletions(-)

[...]

> diff --git a/arch/arm64/kernel/module.c b/arch/arm64/kernel/module.c
> index dd851297596e..cd6320de1c54 100644
> --- a/arch/arm64/kernel/module.c
> +++ b/arch/arm64/kernel/module.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -108,46 +109,38 @@ static int __init module_init_limits(void)
>  
>   return 0;
>  }
> -subsys_initcall(module_init_limits);
>  
> -void *module_alloc(unsigned long size)
> +static struct execmem_params execmem_params __ro_after_init = {
> + .ranges = {
> + [EXECMEM_DEFAULT] = {
> + .flags = EXECMEM_KASAN_SHADOW,
> + .alignment = MODULE_ALIGN,
> + },
> + },
> +};
> +
> +struct execmem_params __init *execmem_arch_params(void)
>  {
> - void *p = NULL;
> + struct execmem_range *r = &execmem_params.ranges[EXECMEM_DEFAULT];
>  
> - /*
> -  * Where possible, prefer to allocate within direct branch range of the
> -  * kernel such that no PLTs are necessary.
> -  */

Why are you removing this comment? I think you could just move it next
to the part where we set a 128MiB range.

> - if (module_direct_base) {
> - p = __vmalloc_node_range(size, MODULE_ALIGN,
> -  module_direct_base,
> -  module_direct_base + SZ_128M,
> -  GFP_KERNEL | __GFP_NOWARN,
> -  PAGE_KERNEL, 0, NUMA_NO_NODE,
> -  __builtin_return_address(0));
> - }
> + module_init_limits();

Hmm, this used to be run from subsys_initcall(), but now you're running
it _really_ early, before random_init(), so randomization of the module
space is no longer going to be very random if we don't have early entropy
from the firmware or the CPU, which is likely to be the case on most SoCs.

>  
> - if (!p && module_plt_base) {
> - p = __vmalloc_node_range(size, MODULE_ALIGN,
> -  module_plt_base,
> -  module_plt_base + SZ_2G,
> -  GFP_KERNEL | __GFP_NOWARN,
> -  PAGE_KERNEL, 0, NUMA_NO_NODE,
> -  __builtin_return_address(0));
> - }
> + r->pgprot = PAGE_KERNEL;
>  
> - if (!p) {
> - pr_warn_ratelimited("%s: unable to allocate memory\n",
> - __func__);
> - }
> + if (module_direct_base) {
> + r->start = module_direct_base;
> + r->end = module_direct_base + SZ_128M;
>  
> - if (p && (kasan_alloc_module_shadow(p, size, GFP_KERNEL) < 0)) {
> - vfree(p);
> - return NULL;
> + if (module_plt_base) {
> + r->fallback_start = module_plt_base;
> + r->fallback_end = module_plt_base + SZ_2G;
> + }
> + } else if (module_plt_base) {
> + r->start = module_plt_base;
> + r->end = module_plt_base + SZ_2G;
>   }
>  
> - /* Memory is intended to be executable, reset the pointer tag. */
> - return kasan_reset_tag(p);
> + return &execmem_params;
>  }
>  
>  enum aarch64_reloc_op {

[...]

> diff --git a/include/linux/execmem.h b/include/linux/execmem.h
> index 44e213625053..806ad1a0088d 100644
> --- a/include/linux/execmem.h
> +++ b/include/linux/execmem.h
> @@ -32,19 +32,33 @@ enum execmem_type {
>   EXECMEM_TYPE_MAX,
>  };
>  
> +/**
> + * enum execmem_module_flags - options for executable memory allocations
> + * @EXECMEM_KASAN_SHADOW: allocate kasan shadow
> + */
> +enum execmem_range_flags {
> + EXECMEM_KASAN_SHADOW = (1 << 0),
> +};
> +
>  /**
>   * struct execmem_range - definition of a memory range suitable for code and

Re: [PATCH v2 00/14] Clean up RPM bus clocks remnants

2023-09-18 Thread Will Deacon
On Tue, 12 Sep 2023 15:31:38 +0200, Konrad Dybcio wrote:
> After the recent cleanups ([1], [2]) some in-tree abusers that directly
> accessed the RPM bus clocks, effectively circumventing and working
> against the efforts of the interconnect framework, were found.
> 
> Patches 1-5 drop deprecated references and the rest attempt to stop
> direct bus clock abuses.
> 
> [...]

Applied SMMU bindings fix to will (for-joerg/arm-smmu/fixes), thanks!

[04/14] dt-bindings: arm-smmu: Fix SDM630 clocks description
https://git.kernel.org/will/c/938ba2f252a5

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-19 Thread Will Deacon
On Wed, Apr 14, 2021 at 12:16:38PM +0200, Peter Zijlstra wrote:
> How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index d74b13825501..a7a1296b0b4d 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -2,6 +2,36 @@
>  /*
>   * Queued spinlock
>   *
> + * A 'generic' spinlock implementation that is based on MCS locks. An
> + * architecture that's looking for a 'generic' spinlock, please first 
> consider
> + * ticket-lock.h and only come looking here when you've considered all the
> + * constraints below and can show your hardware does actually perform better
> + * with qspinlock.
> + *
> + *
> + * It relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no 
> weaker
> + * than RCtso if you're power), where regular code only expects atomic_t to 
> be
> + * RCpc.

Maybe capitalise "Power" to make it clear this is about the architecture?

> + *
> + * It relies on a far greater (compared to ticket-lock.h) set of atomic
> + * operations to behave well together, please audit them carefully to ensure
> + * they all have forward progress. Many atomic operations may default to
> + * cmpxchg() loops which will not have good forward progress properties on
> + * LL/SC architectures.
> + *
> + * One notable example is atomic_fetch_or_acquire(), which x86 cannot 
> (cheaply)
> + * do. Carefully read the patches that introduced 
> queued_fetch_set_pending_acquire().
> + *
> + * It also heavily relies on mixed size atomic operations, in specific it
> + * requires architectures to have xchg16; something which many LL/SC
> + * architectures need to implement as a 32bit and+or in order to satisfy the
> + * forward progress guarantees mentioned above.
> + *
> + * Further reading on mixed size atomics that might be relevant:
> + *
> + *   http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
> + *
> + *
>   * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
>   * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
>   *
> diff --git a/include/asm-generic/ticket-lock-types.h 
> b/include/asm-generic/ticket-lock-types.h
> new file mode 100644
> index ..829759aedda8
> --- /dev/null
> +++ b/include/asm-generic/ticket-lock-types.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __ASM_GENERIC_TICKET_LOCK_TYPES_H
> +#define __ASM_GENERIC_TICKET_LOCK_TYPES_H
> +
> +#include 
> +typedef atomic_t arch_spinlock_t;
> +
> +#define __ARCH_SPIN_LOCK_UNLOCKED	ATOMIC_INIT(0)
> +
> +#endif /* __ASM_GENERIC_TICKET_LOCK_TYPES_H */
> diff --git a/include/asm-generic/ticket-lock.h 
> b/include/asm-generic/ticket-lock.h
> new file mode 100644
> index ..3f0d53e21a37
> --- /dev/null
> +++ b/include/asm-generic/ticket-lock.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * 'Generic' ticket-lock implementation.
> + *
> + * It relies on atomic_fetch_add() having well defined forward progress
> + * guarantees under contention. If your architecture cannot provide this, 
> stick
> + * to a test-and-set lock.
> + *
> + * It also relies on atomic_fetch_add() being safe vs smp_store_release() on 
> a
> + * sub-word of the value. This is generally true for anything LL/SC although
> + * you'd be hard pressed to find anything useful in architecture 
> specifications
> + * about this. If your architecture cannot do this you might be better off 
> with
> + * a test-and-set.
> + *
> + * It further assumes atomic_*_release() + atomic_*_acquire() is RCpc and 
> hence
> + * uses atomic_fetch_add() which is SC to create an RCsc lock.
> + *
> + * The implementation uses smp_cond_load_acquire() to spin, so if the
> + * architecture has WFE like instructions to sleep instead of poll for word
> + * modifications be sure to implement that (see ARM64 for example).
> + *
> + */
> +
> +#ifndef __ASM_GENERIC_TICKET_LOCK_H
> +#define __ASM_GENERIC_TICKET_LOCK_H
> +
> +#include 
> +#include 
> +
> +static __always_inline void ticket_lock(arch_spinlock_t *lock)
> +{
> + u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */

I hate to say it, but smp_mb__after_unlock_lock() would make the intention
a lot clearer here :( That is, the implementation as you have it gives
stronger than RCsc semantics for all architectures.

Alternatively, we could write the thing RCpc and throw an smp_mb() into
the unlock path if CONFIG_ARCH_WEAK_RELEASE_ACQUIRE.
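
A rough sketch of that alternative, purely for illustration (untested, and
with the endianness handling of the 16-bit owner half glossed over):

static __always_inline void ticket_unlock_rcpc(arch_spinlock_t *lock)
{
	u32 val = atomic_read(lock);

	/* Architectures with weak release/acquire opt in to the upgrade. */
	if (IS_ENABLED(CONFIG_ARCH_WEAK_RELEASE_ACQUIRE))
		smp_mb();

	/* Bump the owner (low) half; assumes a little-endian layout here. */
	smp_store_release((u16 *)lock, (u16)val + 1);
}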

> + u16 ticket = val >> 16;
> +
> + if (ticket == (u16)val)
> + return;
> +
> + atomic_cond_read_acquire(lock, ticket == (u16)VAL);
> +}
> +
> +static __always_inline bool ticket_trylock(arch_spinlock_t *lock)
> +{
> + u32 old = atomic_read(lock);
> +
> + if ((old >> 16) != (old & 0xffff))
> + return false;
> +
> + return atomic_try_cmpxchg(lock, &old, old + (1<<16)); /* SC, for RCsc */
> +}
> +
> +static __always_inline void 

Re: [PATCH v5] arm64: Enable perf events based hard lockup detector

2021-04-19 Thread Will Deacon
On Mon, Apr 12, 2021 at 05:31:13PM +0530, Sumit Garg wrote:
> On Tue, 30 Mar 2021 at 18:00, Sumit Garg  wrote:
> > On Tue, 30 Mar 2021 at 14:07, Lecopzer Chen  
> > wrote:
> > > > > On Fri, 15 Jan 2021 at 17:32, Sumit Garg  
> > > > > wrote:
> > > > > >
> > > > > > With the recent feature added to enable perf events to use pseudo 
> > > > > > NMIs
> > > > > > as interrupts on platforms which support GICv3 or later, its now 
> > > > > > been
> > > > > > possible to enable hard lockup detector (or NMI watchdog) on arm64
> > > > > > platforms. So enable corresponding support.
> > > > > >
> > > > > > One thing to note here is that normally lockup detector is 
> > > > > > initialized
> > > > > > just after the early initcalls but PMU on arm64 comes up much later 
> > > > > > as
> > > > > > device_initcall(). So we need to re-initialize lockup detection once
> > > > > > PMU has been initialized.
> > > > > >
> > > > > > Signed-off-by: Sumit Garg 
> > > > > > ---
> > > > > >
> > > > > > Changes in v5:
> > > > > > - Fix lockup_detector_init() invocation to be rather invoked from 
> > > > > > CPU
> > > > > >   binded context as it makes heavy use of per-cpu variables and 
> > > > > > shouldn't
> > > > > >   be invoked from preemptible context.
> > > > > >
> > > > >
> > > > > Do you have any further comments on this?
> > > > >
> 
> Since there aren't any further comments, can you re-pick this feature for 
> 5.13?

I'd still like Mark's Ack on this, as the approach you have taken doesn't
really sit with what he was suggesting.

I also don't understand how all the CPUs get initialised with your patch,
since the PMU driver will be initialised after SMP is up and running.

Will


Re: [PATCH v2 1/3] iommu: io-pgtable: add DART pagetable format

2021-04-19 Thread Will Deacon
On Fri, Apr 09, 2021 at 09:38:15PM +0200, Arnd Bergmann wrote:
> On Fri, Apr 9, 2021 at 6:56 PM Sven Peter  wrote:
> > On Wed, Apr 7, 2021, at 12:44, Will Deacon wrote:
> > > On Sun, Mar 28, 2021 at 09:40:07AM +0200, Sven Peter wrote:
> > >
> > > > +   cfg->pgsize_bitmap &= SZ_16K;
> > > > +   if (!cfg->pgsize_bitmap)
> > > > +   return NULL;
> > >
> > > This is worrying (and again, I don't think this belongs here). How is this
> > > thing supposed to work if the CPU is using 4k pages?
> >
> > This SoC is just full of fun surprises!
> > I didn't even think about that case since I've always been using 16k pages 
> > so far.
> >
> > I've checked again and wasn't able to find any way to configure the pagesize
> > of the IOMMU. There seem to be variants of this IP in older iPhones which
> > support a 4k pagesize but to the best of my knowledge this is hard wired
> > and not configurable in software.
> >
> > When booting with 4k pages I hit the BUG_ON in iova.c that ensures that the
> > iommu pagesize has to be <= the cpu page size.
> >
> > I see two options here and I'm not sure I like either of them:
> >
> > 1) Just don't support 4k CPU pages together with IOMMU translations and only
> >allow full bypass mode there.
> >This would however mean that PCIe (i.e. ethernet, usb ports on the Mac
> >mini) and possibly Thunderbolt support would not be possible since these
> >devices don't seem to like iommu bypass mode at all.
> 
> It should be possible to do a fake bypass mode by just programming a
> static page table for as much address space as you can, and then
> use swiotlb to address any memory beyond that. This won't perform
> well because it requires bounce buffers for any high memory, but it
> can be a last resort if a dart instance cannot do normal bypass mode.
> 
> > 2) I've had a brief discussion on IRC with Arnd about this [1] and he 
> > pointed
> >out that the dma_map_sg API doesn't make any guarantees about the 
> > returned
> >iovas and that it might be possible to make this work at least for 
> > devices
> >that go through the normal DMA API.
> >
> >I've then replaced the page size check with a WARN_ON in iova.c just to 
> > see
> >what happens. At least normal devices that go through the DMA API seem to
> >work with my configuration. iommu_dma_alloc took the 
> > iommu_dma_alloc_remap
> >path which was called with the cpu page size but then used
> >domain->pgsize_bitmap to increase that to 16k. So this kinda works out, 
> > but
> >there are other functions in dma-iommu.c that I believe rely on the fact 
> > that
> >the iommu can map single cpu pages. This feels very fragile right now and
> >would probably require some rather invasive changes.
> 
> The other second-to-last resort here would be to duplicate the code from
> the dma-iommu code and implement the dma-mapping API directly on
> top of the dart hardware instead of the iommu layer. This would probably
> be much faster than the swiotlb on top of a bypass or a linear map,
> but it's a really awful abstraction that would require adding special cases
> into a lot of generic code.
> 
> >Any driver that tries to use the iommu API directly could be trouble
> >as well if they make similar assumptions.
> 
> I think pretty much all drivers using the iommu API directly already
> depends on having a matching page size.  I don't see any way to use
> e.g. PCI device assignment using vfio, or a GPU driver with per-process
> contexts when the iotlb page size is larger than the CPU's.
> 
> >Is this something you would even want to support in the iommu subsytem
> >and is it even possible to do this in a sane way?
> 
> I don't know how hard it is to do adjust the dma-iommu implementation
> to allow this, but as long as we can work out the DT binding to support
> both normal dma-iommu mode with 16KB pages and some kind of
> passthrough mode (emulated or not) with 4KB pages, it can be left
> as a possible optimization for later.

I think one of the main things to modify is the IOVA allocator
(drivers/iommu/iova.c). Once that is happy with pages bigger than the CPU
page size, then you could probably fake everything else in the DMA layer by
offsetting the returned DMA addresses into the 16K page which got mapped.
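
Something like the following for the address handed back to drivers, once the
allocator can hand out 16K-granule IOVAs (purely illustrative, with made-up
naming):

static dma_addr_t dma_addr_in_16k_granule(dma_addr_t granule_iova,
					  phys_addr_t phys)
{
	/* Map the containing 16K-aligned block, then offset into it. */
	return granule_iova + (phys & (SZ_16K - 1));
}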

Will


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-04-19 Thread Will Deacon
On Thu, Apr 08, 2021 at 01:38:17PM -0500, Rob Herring wrote:
> On Thu, Apr 8, 2021 at 6:08 AM Mark Rutland  wrote:
> > On Wed, Apr 07, 2021 at 01:44:37PM +0100, Will Deacon wrote:
> > > On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> > > > On Wed, Mar 31, 2021 at 11:01 AM Will Deacon  wrote:
> > > I guess I'm just worried about exposing the counters to userspace after
> > > the PMU driver (or perf core?) thinks that they're no longer exposed in
> > > case we leak other events.
> >
> > IMO that's not practically different from the single-PMU case (i.e.
> > multi-PMU isn't material, either we have a concern with leaking or we
> > don't); more on that below.

Well, maybe. It looks like the single-PMU case is exposed to the same issue,
but I think a solution needs to take into account the multi-PMU situation.

> > While it looks odd to place this on the mm, I don't think it's the end
> > of the world.
> >
> > > However, I'm not sure how this is supposed to work normally: what
> > > happens if e.g. a privileged user has a per-cpu counter for a kernel
> > > event while a task has a counter with direct access -- can that task
> > > read the kernel event out of the PMU registers from userspace?
> >
> > Yes -- userspace could go read any counters even though it isn't
> > supposed to, and could potentially infer information from those. It
> > won't have access to the config registers or kernel data structures, so
> > it isn't guaranteed to know what the even is or when it is
> > context-switched/reprogrammed/etc.
> >
> > If we believe that's a problem, then it's difficult to do anything
> > robust other than denying userspace access entirely, since disabling
> > userspace access while in use would surprise applications, and denying
> > privileged events would need some global state that we consult at event
> > creation time (in addition to being an inversion of privilege).
> >
> > IIRC there was some fuss about this a while back on x86; I'll go dig and
> > see what I can find, unless Peter has a memory...
> 
> Maybe this one[1].
> 
> Rob
> 
> [1] 
> https://lore.kernel.org/lkml/20200730123815.18518-1-kan.li...@linux.intel.com/

Going through the archives and talking to Peter, it looks like this is still
an active area of concern:

  - There are patches to clear "dirty" counters on context-switch. They were
queued for 5.13 but broke -tip on Friday:


https://lore.kernel.org/lkml/yhm%2fm4za2lpry...@hirez.programming.kicks-ass.net/

  - Per-cpu events cannot be protected in software:


https://lore.kernel.org/lkml/calcetrvvpzud_hq8xoomhn_wwrqjuvroect2do4_d4rozoa...@mail.gmail.com/

so without hardware support, we need a way to disable user access for
people that care about this leakage

x86 has an "rdpmc" file exposed for the PMU device in sysfs which allows
access to be disabled. I don't think these patches add such a thing, and
that's where the fun with multi-PMU machines would come into play.

Will


Re: [PATCH V6 1/4] arch_topology: Rename freq_scale as arch_freq_scale

2021-04-19 Thread Will Deacon
On Wed, Mar 10, 2021 at 10:53:23AM +0530, Viresh Kumar wrote:
> Rename freq_scale to a less generic name, as it will get exported soon
> for modules. Since x86 already names its own implementation of this as
> arch_freq_scale, lets stick to that.
> 
> Suggested-by: Will Deacon 
> Signed-off-by: Viresh Kumar 
> ---
>  arch/arm64/kernel/topology.c  | 6 +++---
>  drivers/base/arch_topology.c  | 4 ++--
>  include/linux/arch_topology.h | 4 ++--
>  3 files changed, 7 insertions(+), 7 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 1/3] arm64: ptrace: Add is_syscall_success to handle compat

2021-04-19 Thread Will Deacon
On Fri, Apr 16, 2021 at 02:34:41PM +0100, Mark Rutland wrote:
> On Fri, Apr 16, 2021 at 01:33:22PM +0100, Catalin Marinas wrote:
> > On Fri, Apr 16, 2021 at 03:55:31PM +0800, He Zhe wrote:
> > > The general version of is_syscall_success does not handle 32-bit
> > > compatible case, which would cause 32-bit negative return code to be
> > > recognized as a positive number later and seen as a "success".
> > > 
> > > Since is_compat_thread is defined in compat.h, implementing
> > > is_syscall_success in ptrace.h would introduce build failure due to
> > > recursive inclusion of some basic headers like mutex.h. We put the
> > > implementation to ptrace.c
> > > 
> > > Signed-off-by: He Zhe 
> > > ---
> > >  arch/arm64/include/asm/ptrace.h |  3 +++
> > >  arch/arm64/kernel/ptrace.c  | 10 ++
> > >  2 files changed, 13 insertions(+)
> > > 
> > > diff --git a/arch/arm64/include/asm/ptrace.h 
> > > b/arch/arm64/include/asm/ptrace.h
> > > index e58bca832dff..3c415e9e5d85 100644
> > > --- a/arch/arm64/include/asm/ptrace.h
> > > +++ b/arch/arm64/include/asm/ptrace.h
> > > @@ -328,6 +328,9 @@ static inline void regs_set_return_value(struct 
> > > pt_regs *regs, unsigned long rc)
> > >   regs->regs[0] = rc;
> > >  }
> > >  
> > > +extern inline int is_syscall_success(struct pt_regs *regs);
> > > +#define is_syscall_success(regs) is_syscall_success(regs)
> > > +
> > >  /**
> > >   * regs_get_kernel_argument() - get Nth function argument in kernel
> > >   * @regs:pt_regs of that context
> > > diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
> > > index 170f42fd6101..3266201f8c60 100644
> > > --- a/arch/arm64/kernel/ptrace.c
> > > +++ b/arch/arm64/kernel/ptrace.c
> > > @@ -1909,3 +1909,13 @@ int valid_user_regs(struct user_pt_regs *regs, 
> > > struct task_struct *task)
> > >   else
> > >   return valid_native_regs(regs);
> > >  }
> > > +
> > > +inline int is_syscall_success(struct pt_regs *regs)
> > > +{
> > > + unsigned long val = regs->regs[0];
> > > +
> > > + if (is_compat_thread(task_thread_info(current)))
> > > + val = sign_extend64(val, 31);
> > > +
> > > + return !IS_ERR_VALUE(val);
> > > +}
> > 
> > It's better to use compat_user_mode(regs) here instead of
> > is_compat_thread(). It saves us from worrying whether regs are for the
> > current context.
> > 
> > I think we should change regs_return_value() instead. This function
> > seems to be called from several other places and it has the same
> > potential problems if called on compat pt_regs.
> 
> I think this is a problem we created for ourselves back in commit:
> 
>   15956689a0e60aa0 ("arm64: compat: Ensure upper 32 bits of x0 are zero on 
> syscall return)
> 
> AFAICT, the perf regs samples are the only place this matters, since for
> ptrace the compat regs are implicitly truncated to compat_ulong_t, and
> audit expects the non-truncated return value. Other architectures don't
> truncate here, so I think we're setting ourselves up for a game of
> whack-a-mole to truncate and extend wherever we need to.
> 
> Given that, I suspect it'd be better to do something like the below.
> 
> Will, thoughts?

I think perf is one example, but this is also visible to userspace via the
native ptrace interface and I distinctly remember needing this for some
versions of arm64 strace to work correctly when tracing compat tasks.

So I do think that clearing the upper bits on the return path is the right
approach, but it sounds like we need some more work to handle syscall(-1)
and audit (what exactly is the problem here after these patches have been
applied?)

Will


Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Will Deacon
On Mon, Apr 19, 2021 at 11:40:54AM +0200, Paolo Bonzini wrote:
> On 19/04/21 11:36, Peter Zijlstra wrote:
> > On Mon, Apr 19, 2021 at 11:02:12AM +0200, Paolo Bonzini wrote:
> > > > void writer(void)
> > > > {
> > > >   atomic_store_explicit(&seq, seq+1, memory_order_relaxed);
> > > >   atomic_thread_fence(memory_order_acquire);
> > > 
> > > This needs to be memory_order_release.  The only change in the resulting
> > > assembly is that "dmb ishld" becomes "dmb ish", which is not as good as 
> > > the
> > > "dmb ishst" you get from smp_wmb() but not buggy either.
> > 
> > Yuck! And that is what requires the insides to be
> > atomic_store_explicit(), otherwise this fence doesn't have to affect
> > them.
> 
> Not just that, even the write needs to be atomic_store_explicit in order to
> avoid a data race.

https://wg21.link/P0690

was an attempt to address this, but I don't know if any of the ideas got
adopted in the end.

Will


Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Will Deacon
On Thu, Apr 15, 2021 at 04:26:46PM +, Ali Saidi wrote:
> 
> On Thu, 15 Apr 2021 16:02:29 +0100, Will Deacon wrote:
> > On Thu, Apr 15, 2021 at 02:25:52PM +, Ali Saidi wrote:
> > > While this code is executed with the wait_lock held, a reader can
> > > acquire the lock without holding wait_lock.  The writer side loops
> > > checking the value with the atomic_cond_read_acquire(), but only truly
> > > acquires the lock when the compare-and-exchange is completed
> > > successfully which isn’t ordered. The other atomic operations from this
> > > point are release-ordered and thus reads after the lock acquisition can
> > > be completed before the lock is truly acquired which violates the
> > > guarantees the lock should be making.
> > 
> > I think it would be worth spelling this out with an example. The issue
> > appears to be a concurrent reader in interrupt context taking and releasing
> > the lock in the window where the writer has returned from the
> > atomic_cond_read_acquire() but has not yet performed the cmpxchg(). Loads
> > can be speculated during this time, but the A-B-A of the lock word
> > from _QW_WAITING to (_QW_WAITING | _QR_BIAS) and back to _QW_WAITING allows
> > the atomic_cmpxchg_relaxed() to succeed. Is that right?
> 
> You're right. What we're seeing is an A-B-A problem that can allow 
> atomic_cond_read_acquire() to succeed and before the cmpxchg succeeds a reader
> performs an A-B-A on the lock which allows the core to observe a read that
> follows the cmpxchg ahead of the cmpxchg succeeding. 
> 
> We've seen a problem in epoll where the reader does a xchg while
> holding the read lock, but the writer can see a value change out from under 
> it. 
> 
> Writer                                  | Reader 2
> -----------------------------------------|----------------------------------------
> ep_scan_ready_list()                     |
> |- write_lock_irq()                      |
>    |- queued_write_lock_slowpath()       |
>       |- atomic_cond_read_acquire()      |
>                                          | read_lock_irqsave(&ep->lock, flags);
>                                          | chain_epi_lockless()
>                                          |     epi->next = xchg(&ep->ovflist, epi);
>                                          | read_unlock_irqrestore(&ep->lock, flags);
>                                          |
>       atomic_cmpxchg_relaxed()           |
>    READ_ONCE(ep->ovflist);


Please stick this in the commit message, preferably annotated a bit like
Peter's example to show the READ_ONCE() being speculated.

> > With that in mind, it would probably be a good idea to eyeball the qspinlock
> > slowpath as well, as that uses both atomic_cond_read_acquire() and
> > atomic_try_cmpxchg_relaxed().
> 
> It seems plausible that the same thing could occur here in qspinlock:
>   if ((val & _Q_TAIL_MASK) == tail) {
>   if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
>   goto release; /* No contention */
>   }

Just been thinking about this, but I don't see an issue here because
everybody is queuing the same way (i.e. we don't have a mechanism to jump
the queue like we do for qrwlock) and the tail portion of the lock word
isn't susceptible to ABA. That is, once we're at the head of the queue
and we've seen the lock become unlocked via atomic_cond_read_acquire(),
then we know we hold it.

So qspinlock looks fine to me, but I'd obviously value anybody else's
opinion on that.

Will


Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Will Deacon
On Thu, Apr 15, 2021 at 04:37:58PM +0100, Catalin Marinas wrote:
> On Thu, Apr 15, 2021 at 04:28:21PM +0100, Will Deacon wrote:
> > On Thu, Apr 15, 2021 at 05:03:58PM +0200, Peter Zijlstra wrote:
> > > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> > > index 4786dd271b45..22aeccc363ca 100644
> > > --- a/kernel/locking/qrwlock.c
> > > +++ b/kernel/locking/qrwlock.c
> > > @@ -60,6 +60,8 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);
> > >   */
> > >  void queued_write_lock_slowpath(struct qrwlock *lock)
> > >  {
> > > + u32 cnt;
> > > +
> > >   /* Put the writer into the wait queue */
>   arch_spin_lock(&lock->wait_lock);
> > >  
> > > @@ -73,9 +75,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> > >  
> > >   /* When no more readers or writers, set the locked flag */
> > >   do {
> > > - atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> > > - } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
> > > - _QW_LOCKED) != _QW_WAITING);
> > > + cnt = atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> > 
> > I think the issue is that >here< a concurrent reader in interrupt context
> > can take the lock and release it again, but we could speculate reads from
> > the critical section up over the later release and up before the control
> > dependency here...
> > 
> > > + } while (!atomic_try_cmpxchg_relaxed(&lock->cnts, &cnt, _QW_LOCKED));
> > 
> > ... and then this cmpxchg() will succeed, so our speculated stale reads
> > could be used.
> > 
> > *HOWEVER*
> > 
> > Speculating a read should be fine in the face of a concurrent _reader_,
> > so for this to be an issue it implies that the reader is also doing some
> > (atomic?) updates.
> 
> There's at least one such case: see chain_epi_lockless() updating
> epi->next, called from ep_poll_callback() with a read_lock held. This
> races with ep_done_scan() which has the write_lock held.

Do you know if that's the code which triggered this patch? If so, it would
be great to have this information in the changelog!

> I think the authors of the above code interpreted the read_lock as
> something that multiple threads can own disregarding the _read_ part.

Using RmW atomics should be fine for that; it's no worse than some of the
tricks pulled in RCU read context in the dentry cache (but then again, what
is? ;)

Will


Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Will Deacon
On Thu, Apr 15, 2021 at 05:03:58PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 15, 2021 at 02:25:52PM +, Ali Saidi wrote:
> > While this code is executed with the wait_lock held, a reader can
> > acquire the lock without holding wait_lock.  The writer side loops
> > checking the value with the atomic_cond_read_acquire(), but only truly
> > acquires the lock when the compare-and-exchange is completed
> > successfully which isn’t ordered. The other atomic operations from this
> > point are release-ordered and thus reads after the lock acquisition can
> > be completed before the lock is truly acquired which violates the
> > guarantees the lock should be making.
> 
> Should be per who? We explicitly do not order the lock acquire store.
> qspinlock has the exact same issue.
> 
> If you look in the git history surrounding spin_is_locked(), you'll find
> 'lovely' things.
> 
> Basically, if you're doing crazy things with spin_is_locked() you get to
> keep the pieces.
> 
> > Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when 
> > spinning in qrwloc")
> > Signed-off-by: Ali Saidi 
> > Cc: sta...@vger.kernel.org
> > ---
> >  kernel/locking/qrwlock.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> > index 4786dd271b45..10770f6ac4d9 100644
> > --- a/kernel/locking/qrwlock.c
> > +++ b/kernel/locking/qrwlock.c
> > @@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
> >  
> > /* When no more readers or writers, set the locked flag */
> > do {
> > -   atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> > -   } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
> > +   atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
> > +   } while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
> > _QW_LOCKED) != _QW_WAITING);
> >  unlock:
> > arch_spin_unlock(&lock->wait_lock);
> 
> This doesn't make sense, there is no such thing as a store-acquire. What
> you're doing here is moving the acquire from one load to the next. A
> load we know will load the exact same value.
> 
> Also see Documentation/atomic_t.txt:
> 
>   {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
> 
> 
> If anything this code wants to be written like so.
> 
> ---
> 
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 4786dd271b45..22aeccc363ca 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -60,6 +60,8 @@ EXPORT_SYMBOL(queued_read_lock_slowpath);
>   */
>  void queued_write_lock_slowpath(struct qrwlock *lock)
>  {
> + u32 cnt;
> +
>   /* Put the writer into the wait queue */
>   arch_spin_lock(&lock->wait_lock);
>  
> @@ -73,9 +75,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
>  
>   /* When no more readers or writers, set the locked flag */
>   do {
> - atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> - } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
> - _QW_LOCKED) != _QW_WAITING);
> + cnt = atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);

I think the issue is that >here< a concurrent reader in interrupt context
can take the lock and release it again, but we could speculate reads from
the critical section up over the later release and up before the control
dependency here...

> + } while (!atomic_try_cmpxchg_relaxed(&lock->cnts, &cnt, _QW_LOCKED));

... and then this cmpxchg() will succeed, so our speculated stale reads
could be used.

*HOWEVER*

Speculating a read should be fine in the face of a concurrent _reader_,
so for this to be an issue it implies that the reader is also doing some
(atomic?) updates.

Ali?

Will


Re: [PATCH] locking/qrwlock: Fix ordering in queued_write_lock_slowpath

2021-04-15 Thread Will Deacon
On Thu, Apr 15, 2021 at 02:25:52PM +, Ali Saidi wrote:
> While this code is executed with the wait_lock held, a reader can
> acquire the lock without holding wait_lock.  The writer side loops
> checking the value with the atomic_cond_read_acquire(), but only truly
> acquires the lock when the compare-and-exchange is completed
> successfully which isn’t ordered. The other atomic operations from this
> point are release-ordered and thus reads after the lock acquisition can
> be completed before the lock is truly acquired which violates the
> guarantees the lock should be making.

I think it would be worth spelling this out with an example. The issue
appears to be a concurrent reader in interrupt context taking and releasing
the lock in the window where the writer has returned from the
atomic_cond_read_acquire() but has not yet performed the cmpxchg(). Loads
can be speculated during this time, but the A-B-A of the lock word
from _QW_WAITING to (_QW_WAITING | _QR_BIAS) and back to _QW_WAITING allows
the atomic_cmpxchg_relaxed() to succeed. Is that right?

With that in mind, it would probably be a good idea to eyeball the qspinlock
slowpath as well, as that uses both atomic_cond_read_acquire() and
atomic_try_cmpxchg_relaxed().
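
For context, the rough shape such a fix could take is to keep the waiting
loop relaxed and put the acquire on the RMW that actually takes the lock.
This is only my sketch of the idea (assuming a local u32 'cnts'), not
necessarily the patch that ends up being merged:

	u32 cnts;

	/* When no more readers or writers, set the locked flag */
	do {
		cnts = atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
	} while (!atomic_try_cmpxchg_acquire(&lock->cnts, &cnts, _QW_LOCKED));

The try_cmpxchg() form also has the nice property of updating 'cnts' with
the value it observed on failure, which tends to generate better code.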

> Fixes: b519b56e378ee ("locking/qrwlock: Use atomic_cond_read_acquire() when 
> spinning in qrwloc")

Typo in the quoted subject ('qrwloc').

> Signed-off-by: Ali Saidi 
> Cc: sta...@vger.kernel.org
> ---
>  kernel/locking/qrwlock.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/locking/qrwlock.c b/kernel/locking/qrwlock.c
> index 4786dd271b45..10770f6ac4d9 100644
> --- a/kernel/locking/qrwlock.c
> +++ b/kernel/locking/qrwlock.c
> @@ -73,8 +73,8 @@ void queued_write_lock_slowpath(struct qrwlock *lock)
>  
>   /* When no more readers or writers, set the locked flag */
>   do {
> - atomic_cond_read_acquire(&lock->cnts, VAL == _QW_WAITING);
> - } while (atomic_cmpxchg_relaxed(&lock->cnts, _QW_WAITING,
> + atomic_cond_read_relaxed(&lock->cnts, VAL == _QW_WAITING);
> + } while (atomic_cmpxchg_acquire(&lock->cnts, _QW_WAITING,
>       _QW_LOCKED) != _QW_WAITING);

Patch looks good, so with an updated message:

Acked-by: Will Deacon 

Will


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-15 Thread Will Deacon
On Thu, Apr 15, 2021 at 10:02:18AM +0100, Catalin Marinas wrote:
> (fixed Will's email address)
> 
> On Thu, Apr 15, 2021 at 10:09:54AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 15, 2021 at 05:47:34AM +0900, Stafford Horne wrote:
> > > > How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> > > 
> > > I did my testing with this FPGA build SoC:
> > > 
> > >  https://github.com/stffrdhrn/de0_nano-multicore
> > > 
> > > Note, the CPU timer sync logic uses mb() and is a bit flaky.  So missing 
> > > mb()
> > > might be a reason.  I thought we had defined mb() and l.msync, but it 
> > > seems to
> > > have gotten lost.
> > > 
> > > With that said I could test out this ticket-lock implementation.  How 
> > > would I
> > > tell if its better than qspinlock?
> > 
> > Mostly if it isn't worse, it's better for being *much* simpler. As you
> > can see, the guts of ticket is like 16 lines of C (lock+unlock) and you
> > only need the behaviour of atomic_fetch_add() to reason about behaviour
> > of the whole thing. qspinlock OTOH is mind bending painful to reason
> > about.
> > 
> > There are some spinlock tests in locktorture; but back when I had a
> > userspace copy of the lot and would measure min,avg,max acquire times
> > under various contention loads (making sure to only run a single task
> > per CPU etc.. to avoid lock holder preemption and other such 'fun'
> > things).
> > 
> > It took us a fair amount of work to get qspinlock to compete with ticket
> > for low contention cases (by far the most common in the kernel), and it
> > took a fairly large amount of CPUs for qspinlock to really win from
> > ticket on the contended case. Your hardware may vary. In particular the
> > access to the external cacheline (for queueing, see the queue: label in
> > queued_spin_lock_slowpath) is a pain-point and the relative cost of
> > cacheline misses for your arch determines where (and if) low contention
> > behaviour is competitive.
> > 
> > Also, less variance (the reason for the min/max measure) is better.
> > Large variance is typically a sign of fwd progress trouble.
> 
> IIRC, one issue we had with ticket spinlocks on arm64 was on big.LITTLE
> systems where the little CPUs were always last to get a ticket when
> racing with the big cores. That was with load/store exclusives (LR/SC
> style) and would have probably got better with atomics but we moved to
> qspinlocks eventually (the Juno board didn't have atomics).
> 
> (leaving the rest of the text below for Will's convenience)

Yes, I think it was this thread:

https://lore.kernel.org/lkml/alpine.DEB.2.20.1707261548560.2186@nanos

but I don't think you can really fix such hardware by changing the locking
algorithm (although my proposed cpu_relax() hack was worryingly effective).
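
For reference, a minimal generic ticket lock along the lines Peter describes
really is only a handful of lines. The sketch below is purely illustrative
(it is not the RFC patch itself) and assumes a little-endian lock word with
the "next" ticket in the top half and the "owner" in the bottom half:

	typedef struct {
		atomic_t val;	/* next ticket in bits 31:16, owner in bits 15:0 */
	} ticket_lock_t;

	static inline void ticket_lock(ticket_lock_t *lock)
	{
		u32 val = atomic_fetch_add(1 << 16, &lock->val); /* take a ticket */
		u16 ticket = val >> 16;

		if (ticket == (u16)val)		/* owner == our ticket: uncontended */
			return;

		/* Spin until the owner field catches up with our ticket. */
		atomic_cond_read_acquire(&lock->val, ticket == (u16)VAL);
	}

	static inline void ticket_unlock(ticket_lock_t *lock)
	{
		u16 *owner = (u16 *)&lock->val;	/* low halfword on little-endian */
		u32 val = atomic_read(&lock->val);

		smp_store_release(owner, (u16)val + 1);
	}

All of the ordering hangs off the fully-ordered atomic_fetch_add() (or the
acquire read when we have to spin) on the lock side and the store-release
on the unlock side, which is what makes it so much easier to reason about
than qspinlock.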

Will


[GIT PULL] arm64: Fixes for -rc8

2021-04-14 Thread Will Deacon
Hi Linus,

Please pull these three arm64 fixes for -rc8; summary in the tag. We
don't have anything else on the horizon, although two of these issues
(the asm constraint and kprobes bugs) have been around for a while so
you never know.

Cheers,

Will

--->8

The following changes since commit 20109a859a9b514eb10c22b8a14b5704ffe93897:

  arm64: kernel: disable CNP on Carmel (2021-03-25 10:00:23 +)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git tags/arm64-fixes

for you to fetch changes up to 738fa58ee1328481d1d7889e7c430b3401c571b9:

  arm64: kprobes: Restore local irqflag if kprobes is cancelled (2021-04-13 
09:30:16 +0100)


arm64 fixes for -rc8

- Fix incorrect asm constraint for load_unaligned_zeropad() fixup

- Fix thread flag update when setting TIF_MTE_ASYNC_FAULT

- Fix restored irq state when handling fault on kprobe


Catalin Marinas (1):
  arm64: mte: Ensure TIF_MTE_ASYNC_FAULT is set atomically

Jisheng Zhang (1):
  arm64: kprobes: Restore local irqflag if kprobes is cancelled

Peter Collingbourne (1):
  arm64: fix inline asm in load_unaligned_zeropad()

 arch/arm64/Kconfig  |  6 +-
 arch/arm64/include/asm/word-at-a-time.h | 10 +-
 arch/arm64/kernel/entry.S   | 10 ++
 arch/arm64/kernel/probes/kprobes.c  |  6 --
 4 files changed, 20 insertions(+), 12 deletions(-)


Re: [PATCH] arm64: kprobes: Restore local irqflag if kprobes is cancelled

2021-04-13 Thread Will Deacon
On Mon, 12 Apr 2021 17:41:01 +0800, Jisheng Zhang wrote:
> If instruction being single stepped caused a page fault, the kprobes
> is cancelled to let the page fault handler continue as a normal page
> fault. But the local irqflags are disabled so cpu will restore pstate
> with DAIF masked. After pagefault is serviced, the kprobes is
> triggerred again, we overwrite the saved_irqflag by calling
> kprobes_save_local_irqflag(). NOTE, DAIF is masked in this new saved
> irqflag. After kprobes is serviced, the cpu pstate is retored with
> DAIF masked.
> 
> [...]

Applied to arm64 (for-next/fixes), thanks!

[1/1] arm64: kprobes: Restore local irqflag if kprobes is cancelled
  https://git.kernel.org/arm64/c/738fa58ee132

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-08 Thread Will Deacon
On Thu, Apr 08, 2021 at 04:06:23PM +0100, Mark Rutland wrote:
> On Thu, Apr 08, 2021 at 03:56:04PM +0100, Will Deacon wrote:
> > On Thu, Apr 08, 2021 at 03:37:23PM +0100, Vincenzo Frascino wrote:
> > > diff --git a/arch/arm64/kernel/entry-common.c 
> > > b/arch/arm64/kernel/entry-common.c
> > > index 9d3588450473..837d3624a1d5 100644
> > > --- a/arch/arm64/kernel/entry-common.c
> > > +++ b/arch/arm64/kernel/entry-common.c
> > > @@ -289,10 +289,16 @@ asmlinkage void noinstr enter_from_user_mode(void)
> > >   CT_WARN_ON(ct_state() != CONTEXT_USER);
> > >   user_exit_irqoff();
> > >   trace_hardirqs_off_finish();
> > > +
> > > + /* Check for asynchronous tag check faults in user space */
> > > + check_mte_async_tcf0();
> > >  }
> > 
> > Is enter_from_user_mode() always called when we enter the kernel from EL0?
> > afaict, some paths (e.g. el0_irq()) only end up calling it if
> > CONTEXT_TRACKING or TRACE_IRQFLAGS are enabled.
> 
> Currently everything that's in {enter,exit}_from_user_mode() only
> matters when either CONTEXT_TRACKING or TRACE_IRQFLAGS is selected (and
> expands to an empty stub otherwise).
> 
> We could drop the ifdeffery in user_{enter,exit}_irqoff() to have them
> called regardless, or add CONFIG_MTE to the list.

I'm always in favour of dropping ifdeffery if it's getting in the way.

> > >  asmlinkage void noinstr exit_to_user_mode(void)
> > >  {
> > > + /* Ignore asynchronous tag check faults in the uaccess routines */
> > > + clear_mte_async_tcf0();
> > > +
> > 
> > and this one seems to be called even less often.
> 
> This is always done in ret_to_user, so (modulo ifdeferry above) all
> returns to EL0 call this.

Right, I was just saying that if you disabled those CONFIG options then this
isn't called _at all_ whereas I think enter_from_user_mode() still is on
some paths.

Will


Re: [PATCH] arm64: mte: Move MTE TCF0 check in entry-common

2021-04-08 Thread Will Deacon
On Thu, Apr 08, 2021 at 03:37:23PM +0100, Vincenzo Frascino wrote:
> The check_mte_async_tcf macro sets the TIF flag non-atomically. This can
> race with another CPU doing a set_tsk_thread_flag() and the flag can be
> lost in the process.

Actually, it's all the *other* flags that get lost!
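
To illustrate (this is only a sketch of the race, not the entry asm itself):
a plain load/modify/store of the thread flags word can silently drop a flag
that another CPU sets concurrently via set_tsk_thread_flag(), which is an
atomic set_bit() underneath:

	/* Racy: what a non-atomic flag update boils down to. */
	static void set_async_tcf_nonatomic(unsigned long *flags)
	{
		unsigned long f = READ_ONCE(*flags);	/* load   */

		f |= _TIF_MTE_ASYNC_FAULT;		/* modify */
		WRITE_ONCE(*flags, f);			/* store: discards any flag set
							 * by another CPU since the
							 * load above */
	}

	/* Safe: atomic RMW, so concurrent updates cannot be lost. */
	static void set_async_tcf_atomic(unsigned long *flags)
	{
		set_bit(TIF_MTE_ASYNC_FAULT, flags);
	}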

> Move the tcf0 check to enter_from_user_mode() and clear tcf0 in
> exit_to_user_mode() to address the problem.
> 
> Note: Moving the check in entry-common allows to use set_thread_flag()
> which is safe.
> 
> Fixes: 637ec831ea4f ("arm64: mte: Handle synchronous and asynchronous
> tag check faults")
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Reported-by: Will Deacon 
> Signed-off-by: Vincenzo Frascino 
> ---
>  arch/arm64/include/asm/mte.h |  8 
>  arch/arm64/kernel/entry-common.c |  6 ++
>  arch/arm64/kernel/entry.S| 30 --
>  arch/arm64/kernel/mte.c  | 25 +++--
>  4 files changed, 37 insertions(+), 32 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 9b557a457f24..188f778c6f7b 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -31,6 +31,8 @@ void mte_invalidate_tags(int type, pgoff_t offset);
>  void mte_invalidate_tags_area(int type);
>  void *mte_allocate_tag_storage(void);
>  void mte_free_tag_storage(char *storage);
> +void check_mte_async_tcf0(void);
> +void clear_mte_async_tcf0(void);
>  
>  #ifdef CONFIG_ARM64_MTE
>  
> @@ -83,6 +85,12 @@ static inline int mte_ptrace_copy_tags(struct task_struct 
> *child,
>  {
>   return -EIO;
>  }
> +void check_mte_async_tcf0(void)
> +{
> +}
> +void clear_mte_async_tcf0(void)
> +{
> +}
>  
>  static inline void mte_assign_mem_tag_range(void *addr, size_t size)
>  {
> diff --git a/arch/arm64/kernel/entry-common.c 
> b/arch/arm64/kernel/entry-common.c
> index 9d3588450473..837d3624a1d5 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -289,10 +289,16 @@ asmlinkage void noinstr enter_from_user_mode(void)
>   CT_WARN_ON(ct_state() != CONTEXT_USER);
>   user_exit_irqoff();
>   trace_hardirqs_off_finish();
> +
> + /* Check for asynchronous tag check faults in user space */
> + check_mte_async_tcf0();
>  }

Is enter_from_user_mode() always called when we enter the kernel from EL0?
afaict, some paths (e.g. el0_irq()) only end up calling it if
CONTEXT_TRACKING or TRACE_IRQFLAGS are enabled.

>  
>  asmlinkage void noinstr exit_to_user_mode(void)
>  {
> + /* Ignore asynchronous tag check faults in the uaccess routines */
> + clear_mte_async_tcf0();
> +

and this one seems to be called even less often.

Will


[GIT PULL] iommu/arm-smmu: Updates for 5.13

2021-04-08 Thread Will Deacon
Hi Joerg,

There's hardly anything on the SMMU front for 5.13, but please pull
these regardless. Summary in the tag.

Cheers,

Will

--->8

The following changes since commit 1e28eed17697bcf343c6743f0028cc3b5dd88bf0:

  Linux 5.12-rc3 (2021-03-14 14:41:02 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git 
tags/arm-smmu-updates

for you to fetch changes up to e0bb4b73540495111ff2723e41cf5add2f031021:

  iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command 
(2021-04-07 11:30:40 +0100)


Arm SMMU updates for 5.13

- SMMUv3:

  * Drop vestigial PREFETCH_ADDR support

  * Elide TLB sync logic for empty gather

  * Fix "Service Failure Mode" handling

- SMMUv2:

  * New Qualcomm compatible string


Sai Prakash Ranjan (1):
  dt-bindings: arm-smmu: Add compatible for SC7280 SoC

Xiang Chen (1):
  iommu/arm-smmu-v3: Add a check to avoid invalid iotlb sync

Zenghui Yu (1):
  iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command

Zhen Lei (1):
  iommu/arm-smmu-v3: add bit field SFM into GERROR_ERR_MASK

 Documentation/devicetree/bindings/iommu/arm,smmu.yaml | 1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   | 5 +++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   | 4 +---
 3 files changed, 5 insertions(+), 5 deletions(-)


Re: [PATCH 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-08 Thread Will Deacon
Hi John,

On Thu, Apr 08, 2021 at 01:55:02PM +0100, John Garry wrote:
> On 08/04/2021 10:01, Jonathan Cameron wrote:
> > On Wed, 7 Apr 2021 21:40:05 +0100
> > Will Deacon  wrote:
> > 
> > > On Wed, Apr 07, 2021 at 05:49:02PM +0800, Qi Liu wrote:
> > > > PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
> > > > to sample bandwidth, latency, buffer occupation etc.
> > > > 
> > > > Each PMU RCiEP device monitors multiple root ports, and each RCiEP is
> > > > registered as a pmu in /sys/bus/event_source/devices, so users can
> > > > select target PMU, and use filter to do further sets.
> > > > 
> > > > Filtering options contains:
> > > > event- select the event.
> > > > subevent - select the subevent.
> > > > port - select target root ports. Information of root ports
> > > > are shown under sysfs.
> > > > bdf   - select requester_id of target EP device.
> > > > trig_len - set trigger condition for starting event statistics.
> > > > trigger_mode - set trigger mode. 0 means start counting when bigger
> > > > than the trigger condition, and 1 means smaller.
> > > > thr_len  - set threshold for statistics.
> > > > thr_mode - set threshold mode. 0 means count when bigger than
> > > > threshold, and 1 means smaller.
> > > > 
> > > > Reviewed-by: Jonathan Cameron 
> > > 
> > > Do you have a link to this review, please?
> > 
> > Internal review, so drop the tag.
> > 
> > Jonathan
> 
> Hi Will,
> 
> Are you implying that you would rather that any review for these drivers is
> done in public on the lists?

Absolutely! If I can see that you and/or Jonathan have given the thing a
good going through, then it's a lot easier to merge the patches. But just
having the tag doesn't help much, as I don't know whether it was a concerted
review effort or a "yeah, this function is about what I thought, cheers"
type of review.

That's not to say internal patch review isn't a useful tool in some
circumstances (e.g. somebody new to the kernel, confidential stuff,
prototyping), but the vast majority of the time I'd say having the review
on the public lists is the best bet.

Will


Re: [PATCH v10 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-04-07 Thread Will Deacon
On Thu, Apr 01, 2021 at 04:23:44PM -0700, Kees Cook wrote:
> This provides the ability for architectures to enable kernel stack base
> address offset randomization. This feature is controlled by the boot
> param "randomize_kstack_offset=on/off", with its default value set by
> CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT.
> 
> This feature is based on the original idea from the last public release
> of PaX's RANDKSTACK feature: https://pax.grsecurity.net/docs/randkstack.txt
> All the credit for the original idea goes to the PaX team. Note that
> the design and implementation of this upstream randomize_kstack_offset
> feature differs greatly from the RANDKSTACK feature (see below).

[...]

> diff --git a/include/linux/randomize_kstack.h 
> b/include/linux/randomize_kstack.h
> new file mode 100644
> index ..fd80fab663a9
> --- /dev/null
> +++ b/include/linux/randomize_kstack.h
> @@ -0,0 +1,54 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_RANDOMIZE_KSTACK_H
> +#define _LINUX_RANDOMIZE_KSTACK_H
> +
> +#include 
> +#include 
> +#include 
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> +  randomize_kstack_offset);
> +DECLARE_PER_CPU(u32, kstack_offset);
> +
> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */

This comment is out of date, as this is called from preemptible context on
arm64. Does that matter in terms of offset randomness?

Will


Re: [PATCH v10 5/6] arm64: entry: Enable random_kstack_offset support

2021-04-07 Thread Will Deacon
On Thu, Apr 01, 2021 at 04:23:46PM -0700, Kees Cook wrote:
> Allow for a randomized stack offset on a per-syscall basis, with roughly
> 5 bits of entropy. (And include the AAPCS rationale, thanks to Mark
> Rutland.)
> 
> In order to avoid unconditional stack canaries on syscall entry (due to
> the use of alloca()), also disable stack protector to avoid triggering
> needless checks and slowing down the entry path. As there is no general
> way to control stack protector coverage with a function attribute[1],
> this must be disabled at the compilation unit level. This isn't a problem
> here, though, since stack protector was not triggered before: examining
> the resulting syscall.o, there are no changes in canary coverage (none
> before, none now).
> 
> [1] a working __attribute__((no_stack_protector)) has been added to GCC
> and Clang but has not been released in any version yet:
> https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=346b302d09c1e6db56d9fe69048acb32fbb97845
> https://reviews.llvm.org/rG4fbf84c1732fca596ad1d6e96015e19760eb8a9b
> 
> Signed-off-by: Kees Cook 
> ---
>  arch/arm64/Kconfig  |  1 +
>  arch/arm64/kernel/Makefile  |  5 +
>  arch/arm64/kernel/syscall.c | 16 ++++
>  3 files changed, 22 insertions(+)

Acked-by: Will Deacon 

Will


Re: [PATCH v4 15/18] irqchip/apple-aic: Add support for the Apple Interrupt Controller

2021-04-07 Thread Will Deacon
On Fri, Apr 02, 2021 at 06:05:39PM +0900, Hector Martin wrote:
> This is the root interrupt controller used on Apple ARM SoCs such as the
> M1. This irqchip driver performs multiple functions:
> 
> * Handles both IRQs and FIQs
> 
> * Drives the AIC peripheral itself (which handles IRQs)
> 
> * Dispatches FIQs to downstream hard-wired clients (currently the ARM
>   timer).
> 
> * Implements a virtual IPI multiplexer to funnel multiple Linux IPIs
>   into a single hardware IPI
> 
> Signed-off-by: Hector Martin 
> ---
>  MAINTAINERS |   2 +
>  drivers/irqchip/Kconfig |   8 +
>  drivers/irqchip/Makefile|   1 +
>  drivers/irqchip/irq-apple-aic.c | 837 
>  include/linux/cpuhotplug.h  |   1 +
>  5 files changed, 849 insertions(+)
>  create mode 100644 drivers/irqchip/irq-apple-aic.c

Couple of stale comment nits:

> +static void aic_ipi_unmask(struct irq_data *d)
> +{
> + struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> + u32 irq_bit = BIT(irqd_to_hwirq(d));
> +
> + atomic_or(irq_bit, this_cpu_ptr(&aic_vipi_enable));
> +
> + /*
> +  * The atomic_or() above must complete before the atomic_read_acquire() 
> below to avoid
> +  * racing aic_ipi_send_mask().
> +  */

(the atomic_read_acquire() is now an atomic_read())

> + smp_mb__after_atomic();
> +
> + /*
> +  * If a pending vIPI was unmasked, raise a HW IPI to ourselves.
> +  * No barriers needed here since this is a self-IPI.
> +  */
> + if (atomic_read(this_cpu_ptr(&aic_vipi_flag)) & irq_bit)
> + aic_ic_write(ic, AIC_IPI_SEND, 
> AIC_IPI_SEND_CPU(smp_processor_id()));
> +}
> +
> +static void aic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> +{
> + struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> + u32 irq_bit = BIT(irqd_to_hwirq(d));
> + u32 send = 0;
> + int cpu;
> + unsigned long pending;
> +
> + for_each_cpu(cpu, mask) {
> + /*
> +  * This sequence is the mirror of the one in aic_ipi_unmask();
> +  * see the comment there. Additionally, release semantics
> +  * ensure that the vIPI flag set is ordered after any shared
> +  * memory accesses that precede it. This therefore also pairs
> +  * with the atomic_fetch_andnot in aic_handle_ipi().
> +  */
> + pending = atomic_fetch_or_release(irq_bit, per_cpu_ptr(&aic_vipi_flag, cpu));
> +
> + /*
> +  * The atomic_fetch_or_release() above must complete before the
> +  * atomic_read_acquire() below to avoid racing aic_ipi_unmask().
> +  */

(same here)

> + smp_mb__after_atomic();
> +
> + if (!(pending & irq_bit) &&
> + (atomic_read(per_cpu_ptr(&aic_vipi_enable, cpu)) & irq_bit))
> + send |= AIC_IPI_SEND_CPU(cpu);
> + }

But with that:

Acked-by: Will Deacon 

Will


Re: [PATCH v4 11/18] asm-generic/io.h: implement pci_remap_cfgspace using ioremap_np

2021-04-07 Thread Will Deacon
On Wed, Apr 07, 2021 at 04:27:42PM +0300, Andy Shevchenko wrote:
> On Fri, Apr 2, 2021 at 12:07 PM Hector Martin  wrote:
> >
> > Now that we have ioremap_np(), we can make pci_remap_cfgspace() default
> > to it, falling back to ioremap() on platforms where it is not available.
> >
> > Remove the arm64 implementation, since that is now redundant. Future
> > cleanups should be able to do the same for other arches, and eventually
> > make the generic pci_remap_cfgspace() unconditional.
> 
> ...
> 
> > +   void __iomem *ret = ioremap_np(offset, size);
> > +
> > +   if (!ret)
> > +   ret = ioremap(offset, size);
> > +
> > +   return ret;
> 
> Usually negative conditions are worse for cognitive functions of human beings.
> (On top of that some patterns are applied)
> 
> I would rewrite above as
> 
> void __iomem *ret;
> 
> ret = ioremap_np(offset, size);
> if (ret)
>   return ret;
> 
> return ioremap(offset, size);

Looks like it might be one of those rare occasions where the GCC ternary if
extension thingy comes in handy:

return ioremap_np(offset, size) ?: ioremap(offset, size);

but however it's done, the logic looks good to me and thanks Hector for
updating this:

Acked-by: Will Deacon 

Will


Re: [RFC PATCH v3 1/2] KVM: arm64: Move CMOs from user_mem_abort to the fault handlers

2021-04-07 Thread Will Deacon
On Wed, Apr 07, 2021 at 04:31:31PM +0100, Alexandru Elisei wrote:
> On 3/26/21 3:16 AM, Yanan Wang wrote:
> > We currently uniformly perform CMOs of D-cache and I-cache in function
> > user_mem_abort before calling the fault handlers. If we get concurrent
> > guest faults (e.g. translation faults, permission faults) or some really
> > unnecessary guest faults caused by BBM, CMOs for the first vcpu are
> 
> I can't figure out what BBM means.

Oh, I know that one! BBM means "Break Before Make". Not to be confused with
DBM (Dirty Bit Management) or BFM (Bit Field Move).
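
For anyone else wondering, the Break-Before-Make sequence for changing a
live page-table entry looks roughly like the sketch below. This is a
generic illustration (using kernel-mapping helpers for brevity), not the
KVM stage-2 code under discussion:

	static void bbm_replace_pte(pte_t *ptep, pte_t new, unsigned long addr)
	{
		/* 1. Break: replace the old valid entry with an invalid one. */
		WRITE_ONCE(*ptep, __pte(0));

		/* 2. Invalidate any TLB entries caching the old translation. */
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);

		/* 3. Make: install the new entry. */
		WRITE_ONCE(*ptep, new);
	}

The "really unnecessary guest faults" in the quoted text come from other
vCPUs faulting on the intermediate invalid entry between steps 1 and 3.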

Will


Re: [PATCH 1/2] drivers/perf: hisi: Add driver for HiSilicon PCIe PMU

2021-04-07 Thread Will Deacon
On Wed, Apr 07, 2021 at 05:49:02PM +0800, Qi Liu wrote:
> PCIe PMU Root Complex Integrated End Point(RCiEP) device is supported
> to sample bandwidth, latency, buffer occupation etc.
> 
> Each PMU RCiEP device monitors multiple root ports, and each RCiEP is
> registered as a pmu in /sys/bus/event_source/devices, so users can
> select target PMU, and use filter to do further sets.
> 
> Filtering options contains:
> event- select the event.
> subevent - select the subevent.
> port - select target root ports. Information of root ports
>are shown under sysfs.
> bdf   - select requester_id of target EP device.
> trig_len - set trigger condition for starting event statistics.
> trigger_mode - set trigger mode. 0 means start counting when bigger than
>the trigger condition, and 1 means smaller.
> thr_len  - set threshold for statistics.
> thr_mode - set threshold mode. 0 means count when bigger than
>threshold, and 1 means smaller.
> 
> Reviewed-by: Jonathan Cameron 

Do you have a link to this review, please?

Will


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-04-07 Thread Will Deacon
[Moving Mark to To: since I'd like his view on this]

On Thu, Apr 01, 2021 at 02:45:21PM -0500, Rob Herring wrote:
> On Wed, Mar 31, 2021 at 11:01 AM Will Deacon  wrote:
> >
> > On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon  wrote:
> > > >
> > > > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > > > From: Raphael Gault 
> > > > >
> > > > > Keep track of event opened with direct access to the hardware counters
> > > > > and modify permissions while they are open.
> > > > >
> > > > > The strategy used here is the same which x86 uses: every time an event
> > > > > is mapped, the permissions are set if required. The atomic field added
> > > > > in the mm_context helps keep track of the different event opened and
> > > > > de-activate the permissions when all are unmapped.
> > > > > We also need to update the permissions in the context switch code so
> > > > > that tasks keep the right permissions.
> > > > >
> > > > > In order to enable 64-bit counters for userspace when available, a new
> > > > > config1 bit is added for userspace to indicate it wants userspace 
> > > > > counter
> > > > > access. This bit allows the kernel to decide if chaining should be
> > > > > disabled and chaining and userspace access are incompatible.
> > > > > The modes for config1 are as follows:
> > > > >
> > > > > config1 = 0 or 2 : user access enabled and always 32-bit
> > > > > config1 = 1 : user access disabled and always 64-bit (using chaining 
> > > > > if needed)
> > > > > config1 = 3 : user access enabled and counter size matches underlying 
> > > > > counter.
> > > >
> > > > In this last case, how does userspace know whether it got a 32-bit or a
> > > > 64-bit counter?
> > >
> > > pmc_width in the user page. If the read loop is correct, then it
> > > doesn't matter to the user.
> >
> > Gotcha; please mention that in this comment,
> >
> > > > > +static void refresh_pmuserenr(void *mm)
> > > > > +{
> > > > > + if (mm == current->active_mm)
> > > > > + perf_switch_user_access(mm);
> > > > > +}
> > > > > +
> > > > > +static void armv8pmu_event_mapped(struct perf_event *event, struct 
> > > > > mm_struct *mm)
> > > > > +{
> > > > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > > > +
> > > > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > > > + return;
> > > > > +
> > > > > + /*
> > > > > +  * This function relies on not being called concurrently in two
> > > > > +  * tasks in the same mm.  Otherwise one task could observe
> > > > > +  * pmu_direct_access > 1 and return all the way back to
> > > > > +  * userspace with user access disabled while another task is 
> > > > > still
> > > > > +  * doing on_each_cpu_mask() to enable user access.
> > > > > +  *
> > > > > +  * For now, this can't happen because all callers hold mmap_lock
> > > > > +  * for write.  If this changes, we'll need a different solution.
> > > > > +  */
> > > > > + lockdep_assert_held_write(&mm->mmap_lock);
> > > > > +
> > > > > + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> > > > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > > > +}
> > > >
> > > > Why do we need to cross-call here? Seems like it would be a tonne 
> > > > simpler to
> > > > handle the trap. Is there a reason not to do that?
> > >
> > > Alignment with x86? You didn't like the trap handler either:
> > >
> > > > Hmm... this feels pretty fragile since, although we may expect 
> > > > userspace only
> > > > to trigger this in the context of the specific perf use-case, we don't 
> > > > have
> > > > a way to detect that, so the ABI we're exposing is that EL0 accesses to
> > > > non-existent counters will return 0. I don't really 

Re: [PATCH] iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command

2021-04-07 Thread Will Deacon
On Wed, 7 Apr 2021 16:44:48 +0800, Zenghui Yu wrote:
> Per SMMUv3 spec, there is no Size and Addr field in the PREFETCH_CONFIG
> command and they're not used by the driver. Remove them.
> 
> We can add them back if we're going to use PREFETCH_ADDR in the future.

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/1] iommu/arm-smmu-v3: Remove the unused fields for PREFETCH_CONFIG command
  https://git.kernel.org/will/c/e0bb4b735404

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v2 1/3] iommu: io-pgtable: add DART pagetable format

2021-04-07 Thread Will Deacon
On Sun, Mar 28, 2021 at 09:40:07AM +0200, Sven Peter wrote:
> Apple's DART iommu uses a pagetable format that shares some
> similarities with the ones already implemented by io-pgtable.c.
> Add a new format variant to support the required differences
> so that we don't have to duplicate the pagetable handling code.
> 
> Signed-off-by: Sven Peter 
> ---
>  drivers/iommu/io-pgtable-arm.c | 59 ++
>  drivers/iommu/io-pgtable.c |  1 +
>  include/linux/io-pgtable.h |  6 
>  3 files changed, 66 insertions(+)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index 87def58e79b5..2f63443fd115 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -127,6 +127,9 @@
>  #define ARM_MALI_LPAE_MEMATTR_IMP_DEF0x88ULL
>  #define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL
>  
> +#define APPLE_DART_PTE_PROT_NO_WRITE (1<<7)
> +#define APPLE_DART_PTE_PROT_NO_READ (1<<8)
> +
>  /* IOPTE accessors */
>  #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
>  
> @@ -381,6 +384,15 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
> arm_lpae_io_pgtable *data,
>  {
>   arm_lpae_iopte pte;
>  
> + if (data->iop.fmt == ARM_APPLE_DART) {
> + pte = 0;
> + if (!(prot & IOMMU_WRITE))
> + pte |= APPLE_DART_PTE_PROT_NO_WRITE;
> + if (!(prot & IOMMU_READ))
> + pte |= APPLE_DART_PTE_PROT_NO_READ;
> + return pte;
> + }
> +
>   if (data->iop.fmt == ARM_64_LPAE_S1 ||
>   data->iop.fmt == ARM_32_LPAE_S1) {
>   pte = ARM_LPAE_PTE_nG;
> @@ -1043,6 +1055,48 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg 
> *cfg, void *cookie)
>   return NULL;
>  }
>  
> +static struct io_pgtable *
> +apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
> +{
> + struct arm_lpae_io_pgtable *data;
> +
> + if (cfg->ias > 36)
> + return NULL;
> + if (cfg->oas > 36)
> + return NULL;
> +
> + if (!cfg->coherent_walk)
> + return NULL;

This all feels like IOMMU-specific limitations leaking into the page-table
code here; it doesn't feel so unlikely that future implementations of this
IP might have greater addressing capabilities, for example, and so I don't
see why the page-table code needs to police this.

> + cfg->pgsize_bitmap &= SZ_16K;
> + if (!cfg->pgsize_bitmap)
> + return NULL;

This is worrying (and again, I don't think this belongs here). How is this
thing supposed to work if the CPU is using 4k pages?

Will


Re: [PATCH v2 3/3] iommu: dart: Add DART iommu driver

2021-04-07 Thread Will Deacon
On Sun, Mar 28, 2021 at 09:40:09AM +0200, Sven Peter wrote:
> Apple's new SoCs use iommus for almost all peripherals. These Device
> Address Resolution Tables must be setup before these peripherals can
> act as DMA masters.
> 
> Signed-off-by: Sven Peter 
> ---
>  MAINTAINERS  |   1 +
>  drivers/iommu/Kconfig|  14 +
>  drivers/iommu/Makefile   |   1 +
>  drivers/iommu/apple-dart-iommu.c | 858 +++
>  4 files changed, 874 insertions(+)
>  create mode 100644 drivers/iommu/apple-dart-iommu.c

[...]

> +/* must be called with held domain->lock */
> +static int apple_dart_attach_stream(struct apple_dart_domain *domain,
> + struct apple_dart *dart, u32 sid)
> +{
> + unsigned long flags;
> + struct apple_dart_stream *stream;
> + struct io_pgtable_cfg *pgtbl_cfg;
> + int ret;
> +
> + list_for_each_entry(stream, &domain->streams, stream_head) {
> + if (stream->dart == dart && stream->sid == sid) {
> + stream->num_devices++;
> + return 0;
> + }
> + }
> +
> + spin_lock_irqsave(&dart->lock, flags);
> +
> + if (WARN_ON(dart->used_sids & BIT(sid))) {
> + ret = -EINVAL;
> + goto error;
> + }
> +
> + stream = kzalloc(sizeof(*stream), GFP_KERNEL);
> + if (!stream) {
> + ret = -ENOMEM;
> + goto error;
> + }

Just in case you missed it, a cocci bot noticed that you're using GFP_KERNEL
to allocate while holding a spinlock here:

https://lore.kernel.org/r/alpine.DEB.2.22.394.2104041724340.2958@hadrien
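
One illustrative way out (a sketch only, not necessarily how the driver
should fix it) is to do the allocation before taking dart->lock, so that
GFP_KERNEL remains legal, at the cost of freeing it on the error paths:

	stream = kzalloc(sizeof(*stream), GFP_KERNEL);
	if (!stream)
		return -ENOMEM;

	spin_lock_irqsave(&dart->lock, flags);

	if (WARN_ON(dart->used_sids & BIT(sid))) {
		ret = -EINVAL;
		goto error;	/* the error path must now kfree(stream) */
	}

Switching to GFP_ATOMIC would also silence the warning, but it dips into
the atomic reserves for no good reason here.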

Will


Re: [PATCH] arm64: perf: Remove redundant initialization in perf_event.c

2021-04-01 Thread Will Deacon
On Thu, 1 Apr 2021 19:16:41 +0800, Qi Liu wrote:
> The initialization of value in function armv8pmu_read_hw_counter()
> and armv8pmu_read_counter() seem redundant, as they are soon updated.
> So, We can remove them.

Applied to will (for-next/perf), thanks!

[1/1] arm64: perf: Remove redundant initialization in perf_event.c
  https://git.kernel.org/will/c/2c2e21e78a94

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-04-01 Thread Will Deacon
On Wed, Mar 31, 2021 at 12:52:11PM -0500, Rob Herring wrote:
> On Wed, Mar 31, 2021 at 10:38 AM Will Deacon  wrote:
> >
> > On Tue, Mar 30, 2021 at 04:08:11PM -0500, Rob Herring wrote:
> > > On Tue, Mar 30, 2021 at 12:09 PM Rob Herring  wrote:
> > > > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon  wrote:
> > > > > The logic here feels like it
> > > > > could do with a bit of untangling.
> > > >
> > > > Yes, I don't love it, but couldn't come up with anything better. It is
> > > > complicated by the fact that flags have to be set before we assign the
> > > > counter and can't set/change them when we assign the counter. It would
> > > > take a lot of refactoring with armpmu code to fix that.
> > >
> > > How's this instead?:
> > >
> > > if (armv8pmu_event_want_user_access(event) || 
> > > !armv8pmu_event_is_64bit(event))
> > > event->hw.flags |= ARMPMU_EL0_RD_CNTR;
> > >
> > > /*
> > > * At this point, the counter is not assigned. If a 64-bit counter is
> > > * requested, we must make sure the h/w has 64-bit counters if we set
> > > * the event size to 64-bit because chaining is not supported with
> > > * userspace access. This may still fail later on if the CPU cycle
> > > * counter is in use.
> > > */
> > > if (armv8pmu_event_is_64bit(event) &&
> > > (!armv8pmu_event_want_user_access(event) ||
> > >  armv8pmu_has_long_event(cpu_pmu) || (hw_event_id ==
> > > ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
> > > event->hw.flags |= ARMPMU_EVT_64BIT;
> >
> > I thought there were some cases where we could assign cycles event to an
> > event counter; does that not happen anymore?
> 
> Yes, but if we hit that scenario when the user has asked for 64-bit
> user access, then we return an error later when assigning the counter.
> I think we can assume if users have gone to the trouble of requesting
> 64-bit counters, then they can deal with ensuring they don't have
> multiple users.
> 
> Otherwise, the only way I see to simplify this is we only support
> 64-bit counters in userspace when we have v8.5 PMU.

I'm happy to start from that position, and then we can extend it later if
there's a need.

Will


Re: [PATCH v9 0/6] Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Will Deacon
Hi Kees,

On Wed, Mar 31, 2021 at 01:54:52PM -0700, Kees Cook wrote:
> Hi Will (and Mark and Catalin),
> 
> Can you take this via the arm64 tree for v5.13 please? Thomas has added
> his Reviewed-by, so it only leaves arm64's. :)

Sorry, these got mixed up in my inbox so I just replied to v7 and v8 and
then noticed v9. Argh!

However, I think the comments still apply: namely that the dummy "=m" looks
dangerous to me and I think you're accessing pcpu variables with preemption
enabled for the arm64 part:

https://lore.kernel.org/r/20210401083034.GA8554@willie-the-truck
https://lore.kernel.org/r/20210401083430.GB8554@willie-the-truck

Will


Re: [PATCH v7 5/6] arm64: entry: Enable random_kstack_offset support

2021-04-01 Thread Will Deacon
On Fri, Mar 19, 2021 at 02:28:34PM -0700, Kees Cook wrote:
> Allow for a randomized stack offset on a per-syscall basis, with roughly
> 5 bits of entropy. (And include the AAPCS rationale, thanks to Mark
> Rutland.)
> 
> In order to avoid unconditional stack canaries on syscall entry (due to
> the use of alloca()), also disable stack protector to avoid triggering
> needless checks and slowing down the entry path. As there is no general
> way to control stack protector coverage with a function attribute[1],
> this must be disabled at the compilation unit level. This isn't a problem
> here, though, since stack protector was not triggered before: examining
> the resulting syscall.o, there are no changes in canary coverage (none
> before, none now).
> 
> [1] a working __attribute__((no_stack_protector)) has been added to GCC
> and Clang but has not been released in any version yet:
> https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=346b302d09c1e6db56d9fe69048acb32fbb97845
> https://reviews.llvm.org/rG4fbf84c1732fca596ad1d6e96015e19760eb8a9b
> 
> Signed-off-by: Kees Cook 
> ---
>  arch/arm64/Kconfig  |  1 +
>  arch/arm64/kernel/Makefile  |  5 +
>  arch/arm64/kernel/syscall.c | 10 ++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1f212b47a48a..2d0e5f544429 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -146,6 +146,7 @@ config ARM64
>   select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
>   select HAVE_ARCH_PFN_VALID
>   select HAVE_ARCH_PREL32_RELOCATIONS
> + select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
>   select HAVE_ARCH_SECCOMP_FILTER
>   select HAVE_ARCH_STACKLEAK
>   select HAVE_ARCH_THREAD_STRUCT_WHITELIST
> diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
> index ed65576ce710..6cc97730790e 100644
> --- a/arch/arm64/kernel/Makefile
> +++ b/arch/arm64/kernel/Makefile
> @@ -9,6 +9,11 @@ CFLAGS_REMOVE_ftrace.o = $(CC_FLAGS_FTRACE)
>  CFLAGS_REMOVE_insn.o = $(CC_FLAGS_FTRACE)
>  CFLAGS_REMOVE_return_address.o = $(CC_FLAGS_FTRACE)
>  
> +# Remove stack protector to avoid triggering unneeded stack canary
> +# checks due to randomize_kstack_offset.
> +CFLAGS_REMOVE_syscall.o   = -fstack-protector -fstack-protector-strong
> +CFLAGS_syscall.o += -fno-stack-protector
> +
>  # Object file lists.
>  obj-y:= debug-monitors.o entry.o irq.o fpsimd.o  
> \
>  entry-common.o entry-fpsimd.o process.o ptrace.o 
> \
> diff --git a/arch/arm64/kernel/syscall.c b/arch/arm64/kernel/syscall.c
> index b9cf12b271d7..58227a1c207e 100644
> --- a/arch/arm64/kernel/syscall.c
> +++ b/arch/arm64/kernel/syscall.c
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  #include 
> @@ -43,6 +44,8 @@ static void invoke_syscall(struct pt_regs *regs, unsigned 
> int scno,
>  {
>   long ret;
>  
> + add_random_kstack_offset();
> +
>   if (scno < sc_nr) {
>   syscall_fn_t syscall_fn;
>   syscall_fn = syscall_table[array_index_nospec(scno, sc_nr)];
> @@ -55,6 +58,13 @@ static void invoke_syscall(struct pt_regs *regs, unsigned 
> int scno,
>   ret = lower_32_bits(ret);
>  
>   regs->regs[0] = ret;
> +
> + /*
> +  * The AAPCS mandates a 16-byte (i.e. 4-bit) aligned SP at
> +  * function boundaries. We want at least 5 bits of entropy so we
> +  * must randomize at least SP[8:4].
> +  */
> + choose_random_kstack_offset(get_random_int() & 0x1FF);

Not sure about either of these new calls -- aren't we preemptible in
invoke_syscall()?

Will


Re: [PATCH v8 3/6] stack: Optionally randomize kernel stack offset each syscall

2021-04-01 Thread Will Deacon
On Tue, Mar 30, 2021 at 01:57:47PM -0700, Kees Cook wrote:
> diff --git a/include/linux/randomize_kstack.h 
> b/include/linux/randomize_kstack.h
> new file mode 100644
> index ..351520803006
> --- /dev/null
> +++ b/include/linux/randomize_kstack.h
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _LINUX_RANDOMIZE_KSTACK_H
> +#define _LINUX_RANDOMIZE_KSTACK_H
> +
> +#include 
> +#include 
> +#include 
> +
> +DECLARE_STATIC_KEY_MAYBE(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT,
> +  randomize_kstack_offset);
> +DECLARE_PER_CPU(u32, kstack_offset);
> +
> +/*
> + * Do not use this anywhere else in the kernel. This is used here because
> + * it provides an arch-agnostic way to grow the stack with correct
> + * alignment. Also, since this use is being explicitly masked to a max of
> + * 10 bits, stack-clash style attacks are unlikely. For more details see
> + * "VLAs" in Documentation/process/deprecated.rst
> + * The asm statement is designed to convince the compiler to keep the
> + * allocation around even after "ptr" goes out of scope.
> + */
> +void *__builtin_alloca(size_t size);
> +/*
> + * Use, at most, 10 bits of entropy. We explicitly cap this to keep the
> + * "VLA" from being unbounded (see above). 10 bits leaves enough room for
> + * per-arch offset masks to reduce entropy (by removing higher bits, since
> + * high entropy may overly constrain usable stack space), and for
> + * compiler/arch-specific stack alignment to remove the lower bits.
> + */
> +#define KSTACK_OFFSET_MAX(x) ((x) & 0x3FF)
> +
> +/*
> + * These macros must be used during syscall entry when interrupts and
> + * preempt are disabled, and after user registers have been stored to
> + * the stack.
> + */
> +#define add_random_kstack_offset() do {  
> \
> + if (static_branch_maybe(CONFIG_RANDOMIZE_KSTACK_OFFSET_DEFAULT, \
> + &randomize_kstack_offset)) {\
> + u32 offset = __this_cpu_read(kstack_offset);\
> + u8 *ptr = __builtin_alloca(KSTACK_OFFSET_MAX(offset));  \
> + asm volatile("" : "=m"(*ptr) :: "memory");  \

Using the "m" constraint here is dangerous if you don't actually evaluate it
inside the asm. For example, if the compiler decides to generate an
addressing mode relative to the stack but with writeback (autodecrement), then
the stack pointer will be off by 8 bytes. Can you use "o" instead?

Will
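
For illustration, a minimal sketch of the suggested constraint change (assuming the same "keep the alloca alive" trick; not the final upstream form). "o" only matches offsettable memory operands, which rules out the writeback (auto-increment/decrement) addressing modes that make an unevaluated "=m" output risky:

#include <stddef.h>

void *__builtin_alloca(size_t size);

static inline void keep_alloca_alive(size_t len)
{
	unsigned char *ptr = __builtin_alloca(len);

	/*
	 * The empty asm body never actually writes the operand; "o" at
	 * least prevents the compiler from picking a writeback addressing
	 * mode for the stack-relative access.
	 */
	asm volatile("" : "=o"(*ptr) :: "memory");
}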


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 12:09:38PM -0500, Rob Herring wrote:
> On Tue, Mar 30, 2021 at 10:31 AM Will Deacon  wrote:
> >
> > On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> > > From: Raphael Gault 
> > >
> > > Keep track of event opened with direct access to the hardware counters
> > > and modify permissions while they are open.
> > >
> > > The strategy used here is the same which x86 uses: every time an event
> > > is mapped, the permissions are set if required. The atomic field added
> > > in the mm_context helps keep track of the different event opened and
> > > de-activate the permissions when all are unmapped.
> > > We also need to update the permissions in the context switch code so
> > > that tasks keep the right permissions.
> > >
> > > In order to enable 64-bit counters for userspace when available, a new
> > > config1 bit is added for userspace to indicate it wants userspace counter
> > > access. This bit allows the kernel to decide if chaining should be
> > > disabled and chaining and userspace access are incompatible.
> > > The modes for config1 are as follows:
> > >
> > > config1 = 0 or 2 : user access enabled and always 32-bit
> > > config1 = 1 : user access disabled and always 64-bit (using chaining if 
> > > needed)
> > > config1 = 3 : user access enabled and counter size matches underlying 
> > > counter.
> >
> > In this last case, how does userspace know whether it got a 32-bit or a
> > 64-bit counter?
> 
> pmc_width in the user page. If the read loop is correct, then it
> doesn't matter to the user.

Gotcha; please mention that in this comment.
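
For reference, a sketch of the documented mmap-page read loop showing where pmc_width comes in; read_counter() is a stand-in for the arch-specific counter read, not a real helper:

#include <stdint.h>
#include <linux/perf_event.h>

static uint64_t read_user_counter(volatile struct perf_event_mmap_page *pc,
				  uint64_t (*read_counter)(uint32_t idx))
{
	uint64_t count, pmc, width;
	uint32_t seq, idx;

	do {
		seq = pc->lock;
		__sync_synchronize();

		idx = pc->index;
		count = pc->offset;
		if (pc->cap_user_rdpmc && idx) {
			width = pc->pmc_width;	/* counter size exposed to userspace */
			pmc = read_counter(idx - 1);
			pmc <<= 64 - width;	/* truncate to the advertised width */
			pmc >>= 64 - width;
			count += pmc;
		}

		__sync_synchronize();
	} while (pc->lock != seq);

	return count;
}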

> > > +static void refresh_pmuserenr(void *mm)
> > > +{
> > > + if (mm == current->active_mm)
> > > + perf_switch_user_access(mm);
> > > +}
> > > +
> > > +static void armv8pmu_event_mapped(struct perf_event *event, struct 
> > > mm_struct *mm)
> > > +{
> > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > +
> > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > + return;
> > > +
> > > + /*
> > > +  * This function relies on not being called concurrently in two
> > > +  * tasks in the same mm.  Otherwise one task could observe
> > > +  * pmu_direct_access > 1 and return all the way back to
> > > +  * userspace with user access disabled while another task is still
> > > +  * doing on_each_cpu_mask() to enable user access.
> > > +  *
> > > +  * For now, this can't happen because all callers hold mmap_lock
> > > +  * for write.  If this changes, we'll need a different solution.
> > > +  */
> > > + lockdep_assert_held_write(&mm->mmap_lock);
> > > +
> > > + if (atomic_inc_return(&mm->context.pmu_direct_access) == 1)
> > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);
> > > +}
> >
> > Why do we need to cross-call here? Seems like it would be a tonne simpler to
> > handle the trap. Is there a reason not to do that?
> 
> Alignment with x86? You didn't like the trap handler either:
> 
> > Hmm... this feels pretty fragile since, although we may expect userspace 
> > only
> > to trigger this in the context of the specific perf use-case, we don't have
> > a way to detect that, so the ABI we're exposing is that EL0 accesses to
> > non-existent counters will return 0. I don't really think that's something
> > we want to commit to.
> 
> Full mail: https://lore.kernel.org/r/20200928182601.GA11974@willie-the-truck
> 
> The handler would be different, but we'd still have the problem of not
> being able to detect the use case, right?

Well hang on, the thing you link to was about _emulating_ the access. I'm
not saying we should do that, but rather just enable EL0 PMU access lazily
instead of using the IPI. We're going to enable it when we context-switch
anyway.

The alternative is enabling EL0 access unconditionally and leaving it
enabled.
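
For illustration, one possible shape of that lazy approach, reusing names from the quoted patch (perf_switch_user_access(), pmu_direct_access, ARMPMU_EL0_RD_CNTR); treat it as a sketch building on that patch context rather than a drop-in replacement:

static void armv8pmu_event_mapped_lazy(struct perf_event *event,
				       struct mm_struct *mm)
{
	if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
		return;

	/*
	 * No cross-call: update PMUSERENR_EL0 locally if this CPU is
	 * currently running the mm, and rely on the context-switch hook
	 * to pick it up on every other CPU the task subsequently runs on.
	 */
	if (atomic_inc_return(&mm->context.pmu_direct_access) == 1 &&
	    mm == current->active_mm)
		perf_switch_user_access(mm);
}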

> > > +
> > > +static void armv8pmu_event_unmapped(struct perf_event *event, struct 
> > > mm_struct *mm)
> > > +{
> > > + struct arm_pmu *armpmu = to_arm_pmu(event->pmu);
> > > +
> > > + if (!(event->hw.flags & ARMPMU_EL0_RD_CNTR))
> > > + return;
> > > +
> > > + if (atomic_dec_and_test(&mm->context.pmu_direct_access))
> > > + on_each_cpu_mask(&armpmu->supported_cpus, refresh_pmuserenr, mm, 1);

Re: [PATCH v6 10/10] Documentation: arm64: Document PMU counters access from userspace

2021-03-31 Thread Will Deacon
On Wed, Mar 10, 2021 at 05:08:37PM -0700, Rob Herring wrote:
> From: Raphael Gault 
> 
> Add a documentation file to describe the access to the pmu hardware
> counters from userspace
> 
> Signed-off-by: Raphael Gault 
> Signed-off-by: Rob Herring 
> ---
> v6:
>   - Update the chained event section with attr.config1 details
> v2:
>   - Update links to test examples
> 
> Changes from Raphael's v4:
>   - Convert to rSt
>   - Update chained event status
>   - Add section for heterogeneous systems
> ---
>  Documentation/arm64/index.rst |  1 +
>  .../arm64/perf_counter_user_access.rst| 60 +++

We already have Documentation/arm64/perf.rst so I think you can add this
in there as a new section.

Will


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 04:08:11PM -0500, Rob Herring wrote:
> On Tue, Mar 30, 2021 at 12:09 PM Rob Herring  wrote:
> > On Tue, Mar 30, 2021 at 10:31 AM Will Deacon  wrote:
> > > The logic here feels like it
> > > could do with a bit of untangling.
> >
> > Yes, I don't love it, but couldn't come up with anything better. It is
> > complicated by the fact that flags have to be set before we assign the
> > counter and can't set/change them when we assign the counter. It would
> > take a lot of refactoring with armpmu code to fix that.
> 
> How's this instead?:
> 
> if (armv8pmu_event_want_user_access(event) || !armv8pmu_event_is_64bit(event))
> event->hw.flags |= ARMPMU_EL0_RD_CNTR;
> 
> /*
> * At this point, the counter is not assigned. If a 64-bit counter is
> * requested, we must make sure the h/w has 64-bit counters if we set
> * the event size to 64-bit because chaining is not supported with
> * userspace access. This may still fail later on if the CPU cycle
> * counter is in use.
> */
> if (armv8pmu_event_is_64bit(event) &&
> (!armv8pmu_event_want_user_access(event) ||
>  armv8pmu_has_long_event(cpu_pmu) || (hw_event_id ==
> ARMV8_PMUV3_PERFCTR_CPU_CYCLES)))
> event->hw.flags |= ARMPMU_EVT_64BIT;

I thought there were some cases where we could assign cycles event to an
event counter; does that not happen anymore?

Will


Re: [PATCH v2 3/7] powerpc: convert config files to generic cmdline

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 10:35:21AM -0700, Daniel Walker wrote:
> On Mon, Mar 29, 2021 at 11:07:51AM +0100, Will Deacon wrote:
> > On Thu, Mar 25, 2021 at 12:59:56PM -0700, Daniel Walker wrote:
> > > On Thu, Mar 25, 2021 at 01:03:55PM +0100, Christophe Leroy wrote:
> > > > 
> > > > Ok, so you agree we don't need to provide two CMDLINE, one to be 
> > > > appended and one to be prepended.
> > > > 
> > > > Let's only provide once CMDLINE as of today, and ask the user to select
> > > > whether he wants it appended or prepended or replacee. Then no need to
> > > > change all existing config to rename CONFIG_CMDLINE into either of the 
> > > > new
> > > > ones.
> > > > 
> > > > That's the main difference between my series and Daniel's series. So 
> > > > I'll
> > > > finish taking Will's comment into account and we'll send out a v3 soon.
> > > 
> > > It doesn't solve the needs of Cisco, I've stated many times your changes 
> > > have
> > > little value. Please stop submitting them.
> > 
> > FWIW, they're useful for arm64 and I will gladly review the updated series.
> > 
> > I don't think asking people to stop submitting patches is ever the right
> > answer. Please don't do that.
> 
> Why ? It's me nacking his series, is that not allowed anymore ?

If you're that way inclined then you can "nack" whatever you want, but
please allow the rest of us to continue reviewing the patches. You don't
have any basis on which to veto other people's contributions and so
demanding that somebody stops posting code is neither constructive nor
meaningful.

Will


Re: [PATCH bpf-next] bpf: arm64: Redefine MOV consistent with arch insn

2021-03-31 Thread Will Deacon
On Wed, Mar 31, 2021 at 05:22:18PM +0800, Jianlin Lv wrote:
> On Tue, Mar 30, 2021 at 5:31 PM Will Deacon  wrote:
> >
> > On Tue, Mar 30, 2021 at 03:42:35PM +0800, Jianlin Lv wrote:
> > > A64_MOV is currently mapped to Add Instruction. Architecturally MOV
> > > (register) is an alias of ORR (shifted register) and MOV (to or from SP)
> > > is an alias of ADD (immediate).
> > > This patch redefines A64_MOV and uses existing functionality
> > > aarch64_insn_gen_move_reg() in insn.c to encode MOV (register) 
> > > instruction.
> > > For moving between register and stack pointer, rename macro to A64_MOV_SP.
> >
> > What does this gain us? There's no requirement for a BPF "MOV" to match an
> > arm64 architectural "MOV", so what's the up-side of aligning them like this?
> 
> According to the description in the Arm Software Optimization Guide,
> Arithmetic(basic) and Logical(basic) instructions have the same
> Exec Latency and Execution Throughput.
> This change did not bring about a performance improvement.
> The original intention was to make the instruction map more 'natively'.

I think we should leave the code as-is, then. Having a separate MOV_SP
macro is confusing and, worse, I worry that somebody passing A64_SP to
A64_MOV will end up using the zero register.

Will
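
For context, a rough sketch of the macro split under discussion (simplified; the real encodings live in arch/arm64/net/bpf_jit.h and insn.c, so treat this as illustrative):

/* Today: MOV is emitted as ADD Rd, Rn, #0, which also works when Rd/Rn is SP. */
#define A64_MOV(sf, Rd, Rn)	A64_ADD_I(sf, Rd, Rn, 0)

/*
 * Proposed split (sketch): the register form becomes an ORR-based alias via
 * aarch64_insn_gen_move_reg(), and SP moves keep the ADD form under a new
 * A64_MOV_SP macro -- the extra macro Will is worried about, since passing
 * A64_SP to the ORR-based A64_MOV would encode the zero register instead.
 */
#define A64_MOV_SP(sf, Rd, Rn)	A64_ADD_I(sf, Rd, Rn, 0)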


Re: [PATCH v6 02/10] arm64: perf: Enable PMU counter direct access for perf event

2021-03-30 Thread Will Deacon
On Wed, Mar 10, 2021 at 05:08:29PM -0700, Rob Herring wrote:
> From: Raphael Gault 
> 
> Keep track of event opened with direct access to the hardware counters
> and modify permissions while they are open.
> 
> The strategy used here is the same which x86 uses: every time an event
> is mapped, the permissions are set if required. The atomic field added
> in the mm_context helps keep track of the different event opened and
> de-activate the permissions when all are unmapped.
> We also need to update the permissions in the context switch code so
> that tasks keep the right permissions.
> 
> In order to enable 64-bit counters for userspace when available, a new
> config1 bit is added for userspace to indicate it wants userspace counter
> access. This bit allows the kernel to decide if chaining should be
> disabled, as chaining and userspace access are incompatible.
> The modes for config1 are as follows:
> 
> config1 = 0 or 2 : user access enabled and always 32-bit
> config1 = 1 : user access disabled and always 64-bit (using chaining if 
> needed)
> config1 = 3 : user access enabled and counter size matches underlying counter.

In this last case, how does userspace know whether it got a 32-bit or a
64-bit counter?

> User access is enabled with config1 == 0 so that we match x86 behavior
> and don't need Arm specific code (outside the counter read).
> 
> Signed-off-by: Raphael Gault 
> Signed-off-by: Rob Herring 
> ---
> I'm not completely sure if using current->active_mm in an IPI is okay?
> It seems to work in my testing.
> 
> Peter Z says (event->oncpu == smp_processor_id()) in the user page
> update is always true, but my testing says otherwise[1].

Peter? Sounds like there's either a misunderstanding here or we have some
fundamental issue elsewhere.

> v6:
>  - Add new attr.config1 rdpmc bit for userspace to hint it wants
>userspace access when also requesting 64-bit counters.
> 
> v5:
>  - Only set cap_user_rdpmc if event is on current cpu
>  - Limit enabling/disabling access to CPUs associated with the PMU
>(supported_cpus) and with the mm_struct matching current->active_mm.
> 
> v2:
>  - Move mapped/unmapped into arm64 code. Fixes arm32.
>  - Rebase on cap_user_time_short changes
> 
> Changes from Raphael's v4:
>   - Drop homogeneous check
>   - Disable access for chained counters
>   - Set pmc_width in user page
> 
> [1] 
> https://lore.kernel.org/lkml/cal_jsqk+ekef5navnbfarcjre3myhfbfe54f9yhkbstnwql...@mail.gmail.com/
> 
> user fix
> ---
>  arch/arm64/include/asm/mmu.h |  5 ++
>  arch/arm64/include/asm/mmu_context.h |  2 +
>  arch/arm64/include/asm/perf_event.h  | 14 +
>  arch/arm64/kernel/perf_event.c   | 86 +++-
>  4 files changed, 104 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/mmu.h b/arch/arm64/include/asm/mmu.h
> index 75beffe2ee8a..ee08447455da 100644
> --- a/arch/arm64/include/asm/mmu.h
> +++ b/arch/arm64/include/asm/mmu.h
> @@ -18,6 +18,11 @@
>  
>  typedef struct {
>   atomic64_t  id;
> + /*
> +  * non-zero if userspace has access to hardware
> +  * counters directly.
> +  */
> + atomic_tpmu_direct_access;
>  #ifdef CONFIG_COMPAT
>   void*sigpage;
>  #endif
> diff --git a/arch/arm64/include/asm/mmu_context.h 
> b/arch/arm64/include/asm/mmu_context.h
> index 70ce8c1d2b07..ccb5ff417b42 100644
> --- a/arch/arm64/include/asm/mmu_context.h
> +++ b/arch/arm64/include/asm/mmu_context.h
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -230,6 +231,7 @@ static inline void __switch_mm(struct mm_struct *next)
>   }
>  
>   check_and_switch_context(next);
> + perf_switch_user_access(next);
>  }
>  
>  static inline void
> diff --git a/arch/arm64/include/asm/perf_event.h 
> b/arch/arm64/include/asm/perf_event.h
> index 60731f602d3e..112f3f63b79e 100644
> --- a/arch/arm64/include/asm/perf_event.h
> +++ b/arch/arm64/include/asm/perf_event.h
> @@ -8,6 +8,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define  ARMV8_PMU_MAX_COUNTERS  32
>  #define  ARMV8_PMU_COUNTER_MASK  (ARMV8_PMU_MAX_COUNTERS - 1)
> @@ -254,4 +255,17 @@ extern unsigned long perf_misc_flags(struct pt_regs 
> *regs);
>   (regs)->pstate = PSR_MODE_EL1h; \
>  }
>  
> +static inline void perf_switch_user_access(struct mm_struct *mm)
> +{
> + if (!IS_ENABLED(CONFIG_PERF_EVENTS))
> + return;

CONFIG_HW_PERF_EVENTS might be a better fit here.

> +
> + if (atomic_read(&mm->context.pmu_direct_access)) {
> + write_sysreg(ARMV8_PMU_USERENR_ER|ARMV8_PMU_USERENR_CR,
> +  pmuserenr_el0);
> + } else {
> + write_sysreg(0, pmuserenr_el0);
> + }
> +}
> +
>  #endif
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index 387838496955..9ad3cc523ef4 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ 

Re: [PATCH 1/2] iommu/arm-smmu-qcom: Skip the TTBR1 quirk for db820c.

2021-03-30 Thread Will Deacon
On Tue, Mar 30, 2021 at 08:03:36AM -0700, Rob Clark wrote:
> On Tue, Mar 30, 2021 at 2:34 AM Will Deacon  wrote:
> >
> > On Mon, Mar 29, 2021 at 09:02:50PM -0700, Rob Clark wrote:
> > > On Mon, Mar 29, 2021 at 7:47 AM Will Deacon  wrote:
> > > >
> > > > On Fri, Mar 26, 2021 at 04:13:02PM -0700, Eric Anholt wrote:
> > > > > db820c wants to use the qcom smmu path to get HUPCF set (which keeps
> > > > > the GPU from wedging and then sometimes wedging the kernel after a
> > > > > page fault), but it doesn't have separate pagetables support yet in
> > > > > drm/msm so we can't go all the way to the TTBR1 path.
> > > >
> > > > What do you mean by "doesn't have separate pagetables support yet"? The
> > > > compatible string doesn't feel like the right way to determine this.
> > >
> > > the compatible string identifies what it is, not what the sw
> > > limitations are, so in that regard it seems right to me..
> >
> > Well it depends on what "doesn't have separate pagetables support yet"
> > means. I can't tell if it's a hardware issue, a firmware issue or a driver
> > issue.
> 
> Just a driver issue (and the fact that currently we don't have
> physical access to a device... debugging a5xx per-process-pgtables by
> pushing untested things to the CI farm is kind of a difficult way to
> work)

But then in that case, this is using the compatible string to identify a
driver issue, no?

Will


Re: [PATCH v6 01/10] arm64: pmu: Add function implementation to update event index in userpage

2021-03-30 Thread Will Deacon
On Wed, Mar 10, 2021 at 05:08:28PM -0700, Rob Herring wrote:
> From: Raphael Gault 
> 
> In order to be able to access the counter directly for userspace,
> we need to provide the index of the counter using the userpage.
> We thus need to override the event_idx function to retrieve and
> convert the perf_event index to armv8 hardware index.
> 
> Since the arm_pmu driver can be used by any implementation, even
> if not armv8, two components play a role into making sure the
> behaviour is correct and consistent with the PMU capabilities:
> 
> * the ARMPMU_EL0_RD_CNTR flag which denotes the capability to access
> counter from userspace.
> * the event_idx call back, which is implemented and initialized by
> the PMU implementation: if no callback is provided, the default
> behaviour applies, returning 0 as index value.
> 
> Signed-off-by: Raphael Gault 
> Signed-off-by: Rob Herring 
> ---
>  arch/arm64/kernel/perf_event.c | 18 ++
>  include/linux/perf/arm_pmu.h   |  2 ++
>  2 files changed, 20 insertions(+)

Acked-by: Will Deacon 

Will


Re: [PATCH] arm64/kernel/probes: Use BUG_ON instead of if condition followed by BUG.

2021-03-30 Thread Will Deacon
On Tue, Mar 30, 2021 at 04:57:50AM -0700, zhouchuangao wrote:
> It can be optimized at compile time.

Hmm, I don't see it (and I also don't understand why we care). Do you have
numbers showing that this is worthwhile?

Will


Re: [PATCH] docs: perf: Address some html build warnings

2021-03-30 Thread Will Deacon
On Mon, 29 Mar 2021 20:32:01 +0800, Qi Liu wrote:
> Fix following html build warnings:
> Documentation/admin-guide/perf/hisi-pmu.rst:61: WARNING: Unexpected 
> indentation.
> Documentation/admin-guide/perf/hisi-pmu.rst:62: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> Documentation/admin-guide/perf/hisi-pmu.rst:69: WARNING: Unexpected 
> indentation.
> Documentation/admin-guide/perf/hisi-pmu.rst:70: WARNING: Block quote ends 
> without a blank line; unexpected unindent.
> Documentation/admin-guide/perf/hisi-pmu.rst:83: WARNING: Unexpected 
> indentation.

Applied to will (for-next/perf), thanks!

[1/1] docs: perf: Address some html build warnings
  https://git.kernel.org/will/c/b88f5e9792cc

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH 1/2] iommu/arm-smmu-qcom: Skip the TTBR1 quirk for db820c.

2021-03-30 Thread Will Deacon
On Mon, Mar 29, 2021 at 09:02:50PM -0700, Rob Clark wrote:
> On Mon, Mar 29, 2021 at 7:47 AM Will Deacon  wrote:
> >
> > On Fri, Mar 26, 2021 at 04:13:02PM -0700, Eric Anholt wrote:
> > > db820c wants to use the qcom smmu path to get HUPCF set (which keeps
> > > the GPU from wedging and then sometimes wedging the kernel after a
> > > page fault), but it doesn't have separate pagetables support yet in
> > > drm/msm so we can't go all the way to the TTBR1 path.
> >
> > What do you mean by "doesn't have separate pagetables support yet"? The
> > compatible string doesn't feel like the right way to determine this.
> 
> the compatible string identifies what it is, not what the sw
> limitations are, so in that regard it seems right to me..

Well it depends on what "doesn't have separate pagetables support yet"
means. I can't tell if it's a hardware issue, a firmware issue or a driver
issue.

Will


Re: [PATCH bpf-next] bpf: arm64: Redefine MOV consistent with arch insn

2021-03-30 Thread Will Deacon
On Tue, Mar 30, 2021 at 03:42:35PM +0800, Jianlin Lv wrote:
> A64_MOV is currently mapped to Add Instruction. Architecturally MOV
> (register) is an alias of ORR (shifted register) and MOV (to or from SP)
> is an alias of ADD (immediate).
> This patch redefines A64_MOV and uses existing functionality
> aarch64_insn_gen_move_reg() in insn.c to encode MOV (register) instruction.
> For moving between register and stack pointer, rename macro to A64_MOV_SP.

What does this gain us? There's no requirement for a BPF "MOV" to match an
arm64 architectural "MOV", so what's the up-side of aligning them like this?

Cheers,

Will


Re: [PATCH] locking/mutex: initialize osq lock in __MUTEX_INITIALIZER()

2021-03-29 Thread Will Deacon
On Mon, Mar 29, 2021 at 12:15:16AM -0700, Nikitas Angelinas wrote:
> Since __MUTEX_INITIALIZER() is used on memory that is initialized to 0
> anyway this change should not have an effect, but it seems better to
> initialize osq explicitly for completeness, as done in other macros and
> functions that initialize mutex and rwsem.
> 
> Signed-off-by: Nikitas Angelinas 
> ---
>  include/linux/mutex.h | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/include/linux/mutex.h b/include/linux/mutex.h
> index 515cff7..bff47f8 100644
> --- a/include/linux/mutex.h
> +++ b/include/linux/mutex.h
> @@ -129,10 +129,18 @@ do {
> \
>  # define __DEP_MAP_MUTEX_INITIALIZER(lockname)
>  #endif
>  
> +#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
> +# define __OSQ_MUTEX_INITIALIZER(lockname)   \
> + , .osq = OSQ_LOCK_UNLOCKED
> +#else
> +# define __OSQ_MUTEX_INITIALIZER(lockname)
> +#endif
> +
>  #define __MUTEX_INITIALIZER(lockname) \
>   { .owner = ATOMIC_LONG_INIT(0) \
>   , .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \
>   , .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
> + __OSQ_MUTEX_INITIALIZER(lockname) \

You don't need the lockname parameter for this macro.

Will
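
Concretely, the suggestion amounts to something like this (a sketch of the quoted hunk with the unused parameter dropped):

#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
# define __OSQ_MUTEX_INITIALIZER	, .osq = OSQ_LOCK_UNLOCKED
#else
# define __OSQ_MUTEX_INITIALIZER
#endif

#define __MUTEX_INITIALIZER(lockname) \
	{ .owner = ATOMIC_LONG_INIT(0) \
	, .wait_lock = __SPIN_LOCK_UNLOCKED(lockname.wait_lock) \
	, .wait_list = LIST_HEAD_INIT(lockname.wait_list) \
	__OSQ_MUTEX_INITIALIZER \
	__DEP_MAP_MUTEX_INITIALIZER(lockname) }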


Re: [PATCH 1/2] iommu/arm-smmu-qcom: Skip the TTBR1 quirk for db820c.

2021-03-29 Thread Will Deacon
On Fri, Mar 26, 2021 at 04:13:02PM -0700, Eric Anholt wrote:
> db820c wants to use the qcom smmu path to get HUPCF set (which keeps
> the GPU from wedging and then sometimes wedging the kernel after a
> page fault), but it doesn't have separate pagetables support yet in
> drm/msm so we can't go all the way to the TTBR1 path.

What do you mean by "doesn't have separate pagetables support yet"? The
compatible string doesn't feel like the right way to determine this.

Will


Re: [PATCH v4 5/5] arm64: Kconfig: select KASAN_VMALLOC if KASAN_GENERIC is enabled

2021-03-29 Thread Will Deacon
On Wed, Mar 24, 2021 at 12:05:22PM +0800, Lecopzer Chen wrote:
> Before this patch, someone who wants to use VMAP_STACK when
> KASAN_GENERIC is enabled must explicitly select KASAN_VMALLOC.
> 
> From Will's suggestion [1]:
>   > I would _really_ like to move to VMAP stack unconditionally, and
>   > that would effectively force KASAN_VMALLOC to be set if KASAN is in use
> 
> Because VMAP_STACK now depends on either HW_TAGS or KASAN_VMALLOC if
> KASAN enabled, in order to make VMAP_STACK selected unconditionally,
> we bind KASAN_GENERIC and KASAN_VMALLOC together.
> 
> Note that SW_TAGS supports neither VMAP_STACK nor KASAN_VMALLOC now,
> so this is the first step to make VMAP_STACK selected unconditionally.

Do you know if anybody is working on this? It's really unfortunate that
we can't move exclusively to VMAP_STACK just because of SW_TAGS KASAN.

That said, what is there to do? As things stand, won't kernel stack
addresses end up using KASAN_TAG_KERNEL?

Will


Re: [RFT PATCH v3 16/27] irqchip/apple-aic: Add support for the Apple Interrupt Controller

2021-03-29 Thread Will Deacon
Hi Hector,

On Fri, Mar 26, 2021 at 05:58:15PM +0900, Hector Martin wrote:
> On 25/03/2021 04.57, Will Deacon wrote:
> > > + event = readl(ic->base + AIC_EVENT);
> > > + type = FIELD_GET(AIC_EVENT_TYPE, event);
> > > + irq = FIELD_GET(AIC_EVENT_NUM, event);
> > > +
> > > + if (type == AIC_EVENT_TYPE_HW)
> > > + handle_domain_irq(aic_irqc->hw_domain, irq, regs);
> > > + else if (type == AIC_EVENT_TYPE_IPI && irq == 1)
> > > + aic_handle_ipi(regs);
> > > + else if (event != 0)
> > > + pr_err("Unknown IRQ event %d, %d\n", type, irq);
> > > + } while (event);
> > > +
> > > + /*
> > > +  * vGIC maintenance interrupts end up here too, so we need to check
> > > +  * for them separately. Just report and disable vGIC for now, until
> > > +  * we implement this properly.
> > > +  */
> > > + if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
> > > + read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
> > > + pr_err("vGIC IRQ fired, disabling.\n");
> > > + sysreg_clear_set_s(SYS_ICH_HCR_EL2, ICH_HCR_EN, 0);
> > > + }
> > 
> > What prevents all these system register accesses being speculated up before
> > the handler?
> 
> Nothing, but that's not a problem, is it? If the condition is met, it means
> the vGIC IRQ *is* firing and needs clearing. We don't particularly care if
> this happens before, after, or during the rest of the IRQ handling.
> 
> I changed the message to this, because we actually should never hit this
> path with correctly-working KVM code (it takes care of it before this
> handler runs):
> 
> pr_err_ratelimited("vGIC IRQ fired and not handled by KVM, disabling.\n");

Looks good.

> > > +static struct irq_chip aic_chip = {
> > > + .name = "AIC",
> > > + .irq_mask = aic_irq_mask,
> > > + .irq_unmask = aic_irq_unmask,
> > 
> > I know these are driven by the higher-level irq chip code, but I'm a bit
> > confused as to what provides ordering if, e.g. something ends up calling:
> > 
> > aic_chip.irq_mask(d);
> > ...
> > aic_chip.irq_unmask(d);
> > 
> > I can't see any ISBs in here and they're writing to two different registers,
> > so can we end up with the IRQ masked after this sequence?
> 
> Wait, aren't MMIO writes to the same peripheral using device-nGnRnE memory
> modes always ordered with respect to each other? I thought the _relaxed
> versions were only trouble when mixed with memory/DMA buffers, and MMIO for
> any given peripheral always takes effect in program order.

Sorry, this was my mistake -- I seem to have mixed up the MMIO parts with
the system register parts. In this case, aic_irq_[un]mask() are MMIO writes,
so things work out. It's the FIQ mask/unmask code that needs the ISBs.
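
As a generic illustration of that last point (not the AIC driver code; the architected timer control register is just a convenient example of a sysreg-based mask, and CNT_CTL_IMASK is defined locally for the sketch):

#include <asm/barrier.h>
#include <asm/sysreg.h>

#define CNT_CTL_IMASK	(1UL << 1)	/* CNTx_CTL_EL0.IMASK */

static inline void timer_fiq_mask_sketch(void)
{
	/*
	 * Unlike the MMIO mask/unmask path, a system register write is
	 * only guaranteed to take effect after the next context
	 * synchronization event, so an ISB is needed before relying on
	 * the FIQ actually being masked.
	 */
	sysreg_clear_set(cntv_ctl_el0, 0, CNT_CTL_IMASK);
	isb();
}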

> > > +static void aic_ipi_mask(struct irq_data *d)
> > > +{
> > > + u32 irq_bit = BIT(irqd_to_hwirq(d));
> > > + int this_cpu = smp_processor_id();
> > > +
> > > + /* No specific ordering requirements needed here. */
> > > + atomic_andnot(irq_bit, &aic_vipi_enable[this_cpu]);
> > > +}
> > 
> > Why not use a per-cpu variable here instead of an array of atomics? The pcpu
> > API has things for atomic updates (e.g. or, and, xchg).
> 
> One CPU still needs to be able to mutate the flags of another CPU to fire an
> IPI; AIUI the per-cpu ops are *not* atomic for concurrent access by multiple
> CPUs, and in fact there is no API for that, only for "this CPU".

Huh, I really thought we had an API for that, but you're right. Oh well! But
I'd still suggest a per-cpu atomic_t in that case, rather than the array.
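
For illustration, the per-CPU variant being suggested might look roughly like this (the names are made up; the point is that per_cpu_ptr() still lets one CPU poke another CPU's word):

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <linux/types.h>

static DEFINE_PER_CPU(atomic_t, aic_vipi_flag);

/* Any CPU can set bits in another CPU's flag word... */
static void vipi_set_for_cpu(int cpu, u32 irq_bit)
{
	atomic_or(irq_bit, per_cpu_ptr(&aic_vipi_flag, cpu));
}

/* ...while mask/ack on the local CPU stays a plain this-CPU access. */
static void vipi_clear_local(u32 irq_bit)
{
	atomic_andnot(irq_bit, this_cpu_ptr(&aic_vipi_flag));
}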

> > > +static void aic_ipi_unmask(struct irq_data *d)
> > > +{
> > > + struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> > > + u32 irq_bit = BIT(irqd_to_hwirq(d));
> > > + int this_cpu = smp_processor_id();
> > > +
> > > + /*
> > > +  * This must complete before the atomic_read_acquire() below to avoid
> > > +  * racing aic_ipi_send_mask(). Use a dummy fetch op with release
> > > +  * semantics for this. This is arch-specific: ARMv8 B2.3.3 specifies
> > > +  * that writes with Release semantics are Barrier-ordered-before reads
> > > +  * with Acquire semantics, even though the Linux arch-independent
> > > +  * definition of these atomic ops does not.
> > > +  */
> > 
> > I think a more idio

Re: [PATCH v2 3/7] powerpc: convert config files to generic cmdline

2021-03-29 Thread Will Deacon
On Thu, Mar 25, 2021 at 12:59:56PM -0700, Daniel Walker wrote:
> On Thu, Mar 25, 2021 at 01:03:55PM +0100, Christophe Leroy wrote:
> > 
> > Ok, so you agree we don't need to provide two CMDLINE, one to be appended 
> > and one to be prepended.
> > 
> > Let's only provide once CMDLINE as of today, and ask the user to select
> > whether he wants it appended or prepended or replacee. Then no need to
> > change all existing config to rename CONFIG_CMDLINE into either of the new
> > ones.
> > 
> > That's the main difference between my series and Daniel's series. So I'll
> > finish taking Will's comment into account and we'll send out a v3 soon.
> 
> It doesn't solve the needs of Cisco, I've stated many times your changes have
> little value. Please stop submitting them.

FWIW, they're useful for arm64 and I will gladly review the updated series.

I don't think asking people to stop submitting patches is ever the right
answer. Please don't do that.

Will


Re: linux-next: build warnings after merge of the arm-perf tree

2021-03-29 Thread Will Deacon
On Mon, Mar 29, 2021 at 05:06:20PM +0800, liuqi (BA) wrote:
> 
> 
> On 2021/3/29 16:47, Will Deacon wrote:
> > On Fri, Mar 26, 2021 at 05:07:41PM +0800, Shaokun Zhang wrote:
> > > Apologies for the mistake.
> > > 
> > > Will, shall I send a new version v5 to fix this issue or other?
> > 
> > Please send additional patches on top now that these are queued.
> > 
> > Thanks,
> > 
> > Will
> > 
> > .
> > 
> We'll send a new patch to fix these warnings today, apologies for the
> mistake again.

Really no need to apologise; we're fixing things before they land in Linus'
tree -- that's exactly how it's supposed to work!

Will


Re: linux-next: build warnings after merge of the arm-perf tree

2021-03-29 Thread Will Deacon
On Fri, Mar 26, 2021 at 05:07:41PM +0800, Shaokun Zhang wrote:
> Apologies for the mistake.
> 
> Will, shall I send a new version v5 to fix this issue or other?

Please send additional patches on top now that these are queued.

Thanks,

Will


Re: [PATCH v2 6/7] cmdline: Gives architectures opportunity to use generically defined boot cmdline manipulation

2021-03-25 Thread Will Deacon
On Thu, Mar 25, 2021 at 12:18:38PM +0100, Christophe Leroy wrote:
> 
> 
> Le 03/03/2021 à 18:57, Will Deacon a écrit :
> > On Tue, Mar 02, 2021 at 05:25:22PM +, Christophe Leroy wrote:
> > > Most architectures have similar boot command line manipulation
> > > options. This patchs adds the definition in init/Kconfig, gated by
> > > CONFIG_HAVE_CMDLINE that the architectures can select to use them.
> > > 
> > > In order to use this, a few architectures will have to change their
> > > CONFIG options:
> > > - riscv has to replace CMDLINE_FALLBACK by CMDLINE_FROM_BOOTLOADER
> > > - architectures using CONFIG_CMDLINE_OVERRIDE or
> > > CONFIG_CMDLINE_OVERWRITE have to replace them by CONFIG_CMDLINE_FORCE.
> > > 
> > > Architectures also have to define CONFIG_DEFAULT_CMDLINE.
> > > 
> > > Signed-off-by: Christophe Leroy 
> > > ---
> > >   init/Kconfig | 56 
> > >   1 file changed, 56 insertions(+)
> > > 
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index 22946fe5ded9..a0f2ad9467df 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -117,6 +117,62 @@ config INIT_ENV_ARG_LIMIT
> > > Maximum of each of the number of arguments and environment
> > > variables passed to init from the kernel command line.
> > > +config HAVE_CMDLINE
> > > + bool
> > > +
> > > +config CMDLINE_BOOL
> > > + bool "Default bootloader kernel arguments"
> > > + depends on HAVE_CMDLINE
> > > + help
> > > +   On some platforms, there is currently no way for the boot loader to
> > > +   pass arguments to the kernel. For these platforms, you can supply
> > > +   some command-line options at build time by entering them here.  In
> > > +   most cases you will need to specify the root device here.
> > 
> > Why is this needed as well as CMDLINE_FROM_BOOTLOADER? IIUC, the latter
> > will use CONFIG_CMDLINE if it fails to get anything from the bootloader,
> > which sounds like the same scenario.
> > 
> > > +config CMDLINE
> > > + string "Initial kernel command string"
> > 
> > s/Initial/Default
> > 
> > which is then consistent with the rest of the text here.
> > 
> > > + depends on CMDLINE_BOOL
> > 
> > Ah, so this is a bit different and I don't think lines-up with the
> > CMDLINE_BOOL help text.
> > 
> > > + default DEFAULT_CMDLINE
> > > + help
> > > +   On some platforms, there is currently no way for the boot loader to
> > > +   pass arguments to the kernel. For these platforms, you can supply
> > > +   some command-line options at build time by entering them here.  In
> > > +   most cases you will need to specify the root device here.
> > 
> > (same stale text)
> > 
> > > +choice
> > > + prompt "Kernel command line type" if CMDLINE != ""
> > > + default CMDLINE_FROM_BOOTLOADER
> > > + help
> > > +   Selects the way you want to use the default kernel arguments.
> > 
> > How about:
> > 
> > "Determines how the default kernel arguments are combined with any
> >   arguments passed by the bootloader"
> > 
> > > +config CMDLINE_FROM_BOOTLOADER
> > > + bool "Use bootloader kernel arguments if available"
> > > + help
> > > +   Uses the command-line options passed by the boot loader. If
> > > +   the boot loader doesn't provide any, the default kernel command
> > > +   string provided in CMDLINE will be used.
> > > +
> > > +config CMDLINE_EXTEND
> > 
> > Can we rename this to CMDLINE_APPEND, please? There is code in the tree
> > which disagrees about what CMDLINE_EXTEND means, so that will need be
> > to be updated to be consistent (e.g. the EFI stub parsing order). Having
> > the generic option with a different name means we won't accidentally end
> > up with the same inconsistent behaviours.
> 
> Argh, yes. Seems like the problem is even larger than that IIUC:
> 
> - For ARM it means to append the bootloader arguments to the CONFIG_CMDLINE
> - For Powerpc it means to append the CONFIG_CMDLINE to the bootloader 
> arguments
> - For SH  it means to append the CONFIG_CMDLINE to the bootloader arguments
> - For EFI it means to append the bootloader arguments to the CONFIG_CMDLINE
> - For OF it means to append the CONFIG_CMDLINE to the bootloader arguments
> 
> So what happens o

Re: [syzbot] BUG: soft lockup in do_wp_page (4)

2021-03-25 Thread Will Deacon
On Thu, Mar 25, 2021 at 08:24:53PM +0100, Dmitry Vyukov wrote:
> On Thu, Mar 25, 2021 at 8:10 PM Will Deacon  wrote:
> > On Thu, Mar 25, 2021 at 07:34:54PM +0100, Dmitry Vyukov wrote:
> > > On Thu, Mar 25, 2021 at 7:20 PM Will Deacon  wrote:
> > > > On Thu, Mar 18, 2021 at 08:34:16PM +0100, Dmitry Vyukov wrote:
> > > > > On Thu, Mar 18, 2021 at 8:31 PM syzbot
> > > > >  wrote:
> > > > > commit cae118b6acc309539b9e846cbb19187c164c
> > > > > Author: Will Deacon
> > > > > Date:   Wed Mar 3 13:49:27 2021 +
> > > > > arm64: Drop support for CMDLINE_EXTEND
> > > > >
> > > > > syzbot passes lots of critical things in CONFIG_CMDLINE:
> > > > > https://github.com/google/syzkaller/blob/c3c81c94865791469d376eba84f4a2d7763d3f71/dashboard/config/linux/upstream-arm64-kasan.config#L495
> > > > > but also wants the bootloader args to be appended.
> > > > > What is the way to do it now?
> > > >
> > > > For now, there isn't a way to do it with CONFIG_CMDLINE, so I think you 
> > > > can
> > > > either:
> > > >
> > > >   * Revert my patch for your kernels
> > > >   * Pass the arguments via QEMU's -append option
> > > >   * Take a look at one of the series which should hopefully add this
> > > > functionality back (but with well-defined semantics) [1] [2]
> > >
> > > Unfortunately none of these work for syzbot (and I assume other
> > > testing environments).
> > >
> > > syzbot does not support custom patches by design:
> > > http://bit.do/syzbot#no-custom-patches
> > > As any testing system, it tests the official trees.
> > >
> > > It's not humans who start these VMs, so it's not as easy as changing
> > > the command line after typing...
> > > There is no support for passing args specifically to qemu, syzkaller
> > > support not just qemu, so these things are specifically localized in
> > > the config. Additionally there is an issue of communicating all these
> > > scattered details to developers in bug reports. Currently syzbot
> > > reports the kernel config and it as well captures command line.
> > >
> > > Could you revert the patch? Is there any point in removing the
> > > currently supported feature before the new feature lands?
> >
> > Well, we only just merged it (in 5.10 I think?), and the semantics of the
> > new version will be different, so I really don't see the value in supporting
> > both (even worse, Android has its own implementation which is different
> > again). The timeline was: we merged CMDLINE_EXTEND, then we noticed it was
> > broken, my fixes were rejected, so we removed the feature rather than
> > support the broken version. In the relatively small window while it was
> > merged, syzbot started using it :(
> 
> I didn't realize it was just introduced :)
> We used CMDLINE_EXTEND on x86, and I looked for a similar option for
> arm64 and found it.
> 
> > So I really think the best bet is to wait until the patches are sorted out.
> > I think Christophe is about to spin a new version, and I reviewed his last
> > copy, so I don't see this being far off.
> 
> If it's expected to be merged soon, let's wait.

Thanks, and knowing that we have a keen user helps to prioritise the review
:)

Will


Re: [syzbot] BUG: soft lockup in do_wp_page (4)

2021-03-25 Thread Will Deacon
On Thu, Mar 25, 2021 at 07:34:54PM +0100, Dmitry Vyukov wrote:
> On Thu, Mar 25, 2021 at 7:20 PM Will Deacon  wrote:
> >
> > On Thu, Mar 18, 2021 at 08:34:16PM +0100, Dmitry Vyukov wrote:
> > > On Thu, Mar 18, 2021 at 8:31 PM syzbot
> > >  wrote:
> > > >
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit:bf152b0b Merge tag 'for_linus' of 
> > > > git://git.kernel.org/pub..
> > > > git tree:   upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=17d5264ed0
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=2c9917c41f0bc04b
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=0b036374a865ba0efa8e
> > > > userspace arch: arm64
> > > >
> > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > > commit:
> > > > Reported-by: syzbot+0b036374a865ba0ef...@syzkaller.appspotmail.com
> > > >
> > > > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [syz-executor.1:3684]
> > >
> > > +Will, arm
> > >
> > > If I am reading this commit correctly, this is caused by:
> > >
> > > commit cae118b6acc309539b9e846cbb19187c164c
> > > Author: Will Deacon
> > > Date:   Wed Mar 3 13:49:27 2021 +
> > > arm64: Drop support for CMDLINE_EXTEND
> > >
> > > syzbot passes lots of critical things in CONFIG_CMDLINE:
> > > https://github.com/google/syzkaller/blob/c3c81c94865791469d376eba84f4a2d7763d3f71/dashboard/config/linux/upstream-arm64-kasan.config#L495
> > > but also wants the bootloader args to be appended.
> > > What is the way to do it now?
> >
> > For now, there isn't a way to do it with CONFIG_CMDLINE, so I think you can
> > either:
> >
> >   * Revert my patch for your kernels
> >   * Pass the arguments via QEMU's -append option
> >   * Take a look at one of the series which should hopefully add this
> > functionality back (but with well-defined semantics) [1] [2]
> 
> Unfortunately none of these work for syzbot (and I assume other
> testing environments).
> 
> syzbot does not support custom patches by design:
> http://bit.do/syzbot#no-custom-patches
> As any testing system, it tests the official trees.
> 
> It's not humans who start these VMs, so it's not as easy as changing
> the command line after typing...
> There is no support for passing args specifically to qemu, syzkaller
> support not just qemu, so these things are specifically localized in
> the config. Additionally there is an issue of communicating all these
> scattered details to developers in bug reports. Currently syzbot
> reports the kernel config and it as well captures command line.
> 
> Could you revert the patch? Is there any point in removing the
> currently supported feature before the new feature lands?

Well, we only just merged it (in 5.10 I think?), and the semantics of the
new version will be different, so I really don't see the value in supporting
both (even worse, Android has its own implementation which is different
again). The timeline was: we merged CMDLINE_EXTEND, then we noticed it was
broken, my fixes were rejected, so we removed the feature rather than
support the broken version. In the relatively small window while it was
merged, syzbot started using it :(

So I really think the best bet is to wait until the patches are sorted out.
I think Christophe is about to spin a new version, and I reviewed his last
copy, so I don't see this being far off.

Will


Re: [syzbot] BUG: soft lockup in do_wp_page (4)

2021-03-25 Thread Will Deacon
On Thu, Mar 18, 2021 at 08:34:16PM +0100, Dmitry Vyukov wrote:
> On Thu, Mar 18, 2021 at 8:31 PM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:bf152b0b Merge tag 'for_linus' of git://git.kernel.org/pub..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17d5264ed0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=2c9917c41f0bc04b
> > dashboard link: https://syzkaller.appspot.com/bug?extid=0b036374a865ba0efa8e
> > userspace arch: arm64
> >
> > Unfortunately, I don't have any reproducer for this issue yet.
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+0b036374a865ba0ef...@syzkaller.appspotmail.com
> >
> > watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [syz-executor.1:3684]
> 
> +Will, arm
> 
> If I am reading this commit correctly, this is caused by:
> 
> commit cae118b6acc309539b9e846cbb19187c164c
> Author: Will Deacon
> Date:   Wed Mar 3 13:49:27 2021 +
> arm64: Drop support for CMDLINE_EXTEND
> 
> syzbot passes lots of critical things in CONFIG_CMDLINE:
> https://github.com/google/syzkaller/blob/c3c81c94865791469d376eba84f4a2d7763d3f71/dashboard/config/linux/upstream-arm64-kasan.config#L495
> but also wants the bootloader args to be appended.
> What is the way to do it now?

For now, there isn't a way to do it with CONFIG_CMDLINE, so I think you can
either:

  * Revert my patch for your kernels
  * Pass the arguments via QEMU's -append option
  * Take a look at one of the series which should hopefully add this
functionality back (but with well-defined semantics) [1] [2]

Sorry for the nuisance; I did try to fix this [3] but it's a bit of a
mess.

Will

[1] 
https://lore.kernel.org/linux-arch/cover.1614705851.git.christophe.le...@csgroup.eu/
[2] 
https://lore.kernel.org/linux-mips/20210308235319.2988609-2-danie...@cisco.com/#t
[3] https://lore.kernel.org/r/20210225125921.13147-1-w...@kernel.org


Re: [PATCH 1/1] iommu/arm-smmu-v3: add bit field SFM into GERROR_ERR_MASK

2021-03-25 Thread Will Deacon
On Wed, 24 Mar 2021 16:16:03 +0800, Zhen Lei wrote:
> In arm_smmu_gerror_handler(), the value of the SMMU_GERROR register is
> filtered by GERROR_ERR_MASK. However, the GERROR_ERR_MASK does not contain
> the SFM bit. As a result, the subsequent error processing is not performed
> when only the SFM error occurs.

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/1] iommu/arm-smmu-v3: add bit field SFM into GERROR_ERR_MASK
  https://git.kernel.org/will/c/655c447c97d7

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH] dt-bindings: arm-smmu: Add compatible for SC7280 SoC

2021-03-25 Thread Will Deacon
On Mon, 15 Mar 2021 11:32:24 +0530, Rajendra Nayak wrote:
> Add the SoC specific compatible for SC7280 implementing
> arm,mmu-500.

Applied to will (for-joerg/arm-smmu/updates), thanks!

[1/1] dt-bindings: arm-smmu: Add compatible for SC7280 SoC
  https://git.kernel.org/will/c/a9aa2bb18ecb

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v2 0/4] iommu/iova: Add CPU hotplug handler to flush rcaches to core code

2021-03-25 Thread Will Deacon
On Thu, Mar 25, 2021 at 08:29:57PM +0800, John Garry wrote:
> The Intel IOMMU driver supports flushing the per-CPU rcaches when a CPU is
> offlined.
> 
> Let's move it to core code, so everyone can take advantage.
> 
> Also throw in a patch to stop exporting free_iova_fast().
> 
> Differences to v1:
> - Add RB tags (thanks!)
> - Add patch to stop exporting free_iova_fast()
> - Drop patch to correct comment
> - Add patch to delete iommu_dma_free_cpu_cached_iovas() and associated
>   changes
> 
> John Garry (4):
>   iova: Add CPU hotplug handler to flush rcaches
>   iommu/vt-d: Remove IOVA domain rcache flushing for CPU offlining
>   iommu: Delete iommu_dma_free_cpu_cached_iovas()
>   iommu: Stop exporting free_iova_fast()

Looks like this is all set for 5.13, so hopefully Joerg can stick it in
-next for a bit more exposure.

Will


Re: [PATCH 2/3] iommu/io-pgtable-arm: Add IOMMU_LLC page protection flag

2021-03-25 Thread Will Deacon
On Tue, Mar 09, 2021 at 12:10:44PM +0530, Sai Prakash Ranjan wrote:
> On 2021-02-05 17:38, Sai Prakash Ranjan wrote:
> > On 2021-02-04 03:16, Will Deacon wrote:
> > > On Tue, Feb 02, 2021 at 11:56:27AM +0530, Sai Prakash Ranjan wrote:
> > > > On 2021-02-01 23:50, Jordan Crouse wrote:
> > > > > On Mon, Feb 01, 2021 at 08:20:44AM -0800, Rob Clark wrote:
> > > > > > On Mon, Feb 1, 2021 at 3:16 AM Will Deacon  wrote:
> > > > > > > On Fri, Jan 29, 2021 at 03:12:59PM +0530, Sai Prakash Ranjan 
> > > > > > > wrote:
> > > > > > > > On 2021-01-29 14:35, Will Deacon wrote:
> > > > > > > > > On Mon, Jan 11, 2021 at 07:45:04PM +0530, Sai Prakash Ranjan 
> > > > > > > > > wrote:
> > > > > > > > > > +#define IOMMU_LLC(1 << 6)
> > > > > > > > >
> > > > > > > > > On reflection, I'm a bit worried about exposing this because 
> > > > > > > > > I think it
> > > > > > > > > will
> > > > > > > > > introduce a mismatched virtual alias with the CPU (we don't 
> > > > > > > > > even have a
> > > > > > > > > MAIR
> > > > > > > > > set up for this memory type). Now, we also have that issue 
> > > > > > > > > for the PTW,
> > > > > > > > > but
> > > > > > > > > since we always use cache maintenance (i.e. the streaming 
> > > > > > > > > API) for
> > > > > > > > > publishing the page-tables to a non-coheren walker, it works 
> > > > > > > > > out.
> > > > > > > > > However,
> > > > > > > > > if somebody expects IOMMU_LLC to be coherent with a DMA API 
> > > > > > > > > coherent
> > > > > > > > > allocation, then they're potentially in for a nasty surprise 
> > > > > > > > > due to the
> > > > > > > > > mismatched outer-cacheability attributes.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Can't we add the syscached memory type similar to what is done 
> > > > > > > > on android?
> > > > > > >
> > > > > > > Maybe. How does the GPU driver map these things on the CPU side?
> > > > > >
> > > > > > Currently we use writecombine mappings for everything, although 
> > > > > > there
> > > > > > are some cases that we'd like to use cached (but have not merged
> > > > > > patches that would give userspace a way to flush/invalidate)
> > > > > >
> > > > >
> > > > > LLC/system cache doesn't have a relationship with the CPU cache.  It's
> > > > > just a
> > > > > little accelerator that sits on the connection from the GPU to DDR and
> > > > > caches
> > > > > accesses. The hint that Sai is suggesting is used to mark the buffers 
> > > > > as
> > > > > 'no-write-allocate' to prevent GPU write operations from being cached 
> > > > > in
> > > > > the LLC
> > > > > which a) isn't interesting and b) takes up cache space for read
> > > > > operations.
> > > > >
> > > > > It's easiest to think of the LLC as a bonus accelerator that has no
> > > > > cost
> > > > > for
> > > > > us to use outside of the unfortunate per buffer hint.
> > > > >
> > > > > We do have to worry about the CPU cache w.r.t I/O coherency (which is 
> > > > > a
> > > > > different hint) and in that case we have all of concerns that Will
> > > > > identified.
> > > > >
> > > > 
> > > > For mismatched outer cacheability attributes which Will
> > > > mentioned, I was
> > > > referring to [1] in android kernel.
> > > 
> > > I've lost track of the conversation here :/
> > > 
> > > When the GPU has a buffer mapped with IOMMU_LLC, is the buffer also
> > > mapped
> > > into the CPU and with what attributes? Rob said "writecombine for
> > > everything" -- does that mean ioremap_wc() / MEMREMAP_WC?
> > > 
> > 
> > Rob answered this.
> > 
> > > Finally, we need to be careful when we use the word "hint" as
> > > "allocation
> > > hint" has a specific meaning in the architecture, and if we only
> > > mismatch on
> > > those then we're actually ok. But I think IOMMU_LLC is more than
> > > just a
> > > hint, since it actually drives eviction policy (i.e. it enables
> > > writeback).
> > > 
> > > Sorry for the pedantry, but I just want to make sure we're all talking
> > > about the same things!
> > > 
> > 
> > Sorry for the confusion which probably was caused by my mentioning of
> > android, NWA(no write allocate) is an allocation hint which we can
> > ignore
> > for now as it is not introduced yet in upstream.
> > 
> 
> Any chance of taking this forward? We do not want to miss out on small fps
> gain when the product gets released.

Do we have a solution to the mismatched virtual alias?

Will


Re: [PATCH 1/2] iommu/mediatek-v1: Alloc building as module

2021-03-25 Thread Will Deacon
On Tue, Mar 23, 2021 at 01:58:00PM +0800, Yong Wu wrote:
> This patch only adds support for building the IOMMU-v1 driver as module.
> Correspondingly switch the config to tristate.
> 
> Signed-off-by: Yong Wu 
> ---
> rebase on v5.12-rc2.
> ---
>  drivers/iommu/Kconfig| 2 +-
>  drivers/iommu/mtk_iommu_v1.c | 9 -
>  2 files changed, 5 insertions(+), 6 deletions(-)

Both of these patches look fine to me, but you probably need to check
the setting of MODULE_OWNER after:

https://lore.kernel.org/r/f4de29d8330981301c1935e667b507254a2691ae.1616157612.git.robin.mur...@arm.com

Will


Re: [PATCH 2/2] iommu: Streamline registration interface

2021-03-25 Thread Will Deacon
On Fri, Mar 19, 2021 at 12:52:02PM +, Robin Murphy wrote:
> Rather than have separate opaque setter functions that are easy to
> overlook and lead to repetitive boilerplate in drivers, let's pass the
> relevant initialisation parameters directly to iommu_device_register().
> 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/amd/init.c|  3 +--
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  5 +---
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   |  5 +---
>  drivers/iommu/arm/arm-smmu/qcom_iommu.c |  5 +---
>  drivers/iommu/exynos-iommu.c|  5 +---
>  drivers/iommu/fsl_pamu_domain.c |  4 +--
>  drivers/iommu/intel/dmar.c  |  4 +--
>  drivers/iommu/intel/iommu.c |  3 +--
>  drivers/iommu/iommu.c   |  7 -
>  drivers/iommu/ipmmu-vmsa.c  |  6 +
>  drivers/iommu/msm_iommu.c   |  5 +---
>  drivers/iommu/mtk_iommu.c   |  5 +---
>  drivers/iommu/mtk_iommu_v1.c|  4 +--
>  drivers/iommu/omap-iommu.c  |  5 +---
>  drivers/iommu/rockchip-iommu.c  |  5 +---
>  drivers/iommu/s390-iommu.c  |  4 +--
>  drivers/iommu/sprd-iommu.c  |  5 +---
>  drivers/iommu/sun50i-iommu.c|  5 +---
>  drivers/iommu/tegra-gart.c  |  5 +---
>  drivers/iommu/tegra-smmu.c  |  5 +---
>  drivers/iommu/virtio-iommu.c|  5 +---
>  include/linux/iommu.h   | 29 -
>  22 files changed, 31 insertions(+), 98 deletions(-)

I was worried this might blow up with !CONFIG_IOMMU_API, but actually
it all looks fine and is much cleaner imo so:

Acked-by: Will Deacon 

Will


Re: [PATCH 1/2] iommu: Statically set module owner

2021-03-25 Thread Will Deacon
On Fri, Mar 19, 2021 at 12:52:01PM +, Robin Murphy wrote:
> It happens that the 3 drivers which first supported being modular are
> also ones which play games with their pgsize_bitmap, so have non-const
> iommu_ops where dynamically setting the owner manages to work out OK.
> However, it's less than ideal to force that upon all drivers which want
> to be modular - like the new sprd-iommu driver which now has a potential
> bug in that regard - so let's just statically set the module owner and
> let ops remain const wherever possible.
> 
> Signed-off-by: Robin Murphy 
> ---
> 
> This is something I hadn't got round to sending earlier, so now rebased
> onto iommu/next to accommodate the new driver :)
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 +
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 +
>  drivers/iommu/sprd-iommu.c  | 1 +
>  drivers/iommu/virtio-iommu.c| 1 +
>  include/linux/iommu.h   | 9 +
>  5 files changed, 5 insertions(+), 8 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCHv2 2/2] iommu/arm-smmu-qcom: Move the adreno smmu specific impl earlier

2021-03-25 Thread Will Deacon
On Thu, Mar 25, 2021 at 01:10:12PM +0530, Sai Prakash Ranjan wrote:
> On 2021-03-15 00:31, Sai Prakash Ranjan wrote:
> > On 2021-03-12 04:59, Bjorn Andersson wrote:
> > > On Sat 27 Feb 07:53 CST 2021, Sai Prakash Ranjan wrote:
> > > > On 2021-02-27 00:44, Bjorn Andersson wrote:
> > > > > On Fri 26 Feb 12:23 CST 2021, Rob Clark wrote:
> > > > >
> > > > >
> > > > > The current logic picks one of:
> > > > > 1) is the compatible mentioned in qcom_smmu_impl_of_match[]
> > > > > 2) is the compatible an adreno
> > > > > 3) no quirks needed
> > > > >
> > > > > The change flips the order of these, so the only way I can see this
> > > > > change affecting things is if we expected a match on #2, but we got 
> > > > > one
> > > > > on #1.
> > > > >
> > > > > Which implies that the instance that we want to act according to the
> > > > > adreno impl was listed in qcom_smmu_impl_of_match[] - which either is
> > > > > wrong, or there's a single instance that needs both behaviors.
> > > > >
> > > > > (And I believe Jordan's answer confirms the latter - there's a single
> > > > > SMMU instance that needs all them quirks at once)
> > > > >
> > > > 
> > > > Let me go through the problem statement in case my commit
> > > > message wasn't
> > > > clear. There are two SMMUs (APSS and GPU) on SC7280 and both are
> > > > SMMU500
> > > > (ARM SMMU IP).
> > > > 
> > > > APSS SMMU compatible - ("qcom,sc7280-smmu-500", "arm,mmu-500")
> > > > GPU SMMU compatible - ("qcom,sc7280-smmu-500",
> > > > "qcom,adreno-smmu", "arm,mmu-500")
> > > > 
> > > > Now if we take SC7180 as an example, GPU SMMU was QSMMU(QCOM SMMU IP)
> > > > and APSS SMMU was SMMU500(ARM SMMU IP).
> > > > 
> > > > APSS SMMU compatible - ("qcom,sc7180-smmu-500", "arm,mmu-500")
> > > > GPU SMMU compatible - ("qcom,sc7180-smmu-v2",
> > > > "qcom,adreno-smmu", "qcom,smmu-v2")
> > > > 
> > > > Current code sequence without this patch,
> > > > 
> > > > if (of_match_node(qcom_smmu_impl_of_match, np))
> > > > >  return qcom_smmu_create(smmu, &qcom_smmu_impl);
> > > > 
> > > > if (of_device_is_compatible(np, "qcom,adreno-smmu"))
> > > > > return qcom_smmu_create(smmu, &qcom_adreno_smmu_impl);
> > > > 
> > > > Now if we look at the compatible for SC7180, there is no problem
> > > > because
> > > > the APSS SMMU will match the one in qcom_smmu_impl_of_match[]
> > > > and GPU SMMU
> > > > will match "qcom,adreno-smmu" because the compatible strings are
> > > > different.
> > > > But for SC7280, both the APSS SMMU and GPU SMMU
> > > > compatible("qcom,sc7280-smmu-500")
> > > > are same. So GPU SMMU will match with the one in
> > > > qcom_smmu_impl_of_match[]
> > > > i.e., "qcom,sc7280-smmu-500" which functionally doesn't cause
> > > > any problem
> > > > but we will miss settings for split pagetables which are part of
> > > > GPU SMMU
> > > > specific implementation.
> > > > 
> > > > We can avoid this with yet another new compatible for GPU SMMU
> > > > something like
> > > > "qcom,sc7280-adreno-smmu-500" but since we can handle this
> > > > easily in the
> > > > driver and since the IPs are same, meaning if there was a
> > > > hardware quirk
> > > > required, then we would need to apply to both of them and would
> > > > this additional
> > > > compatible be of any help?
> > > > 
> > > 
> > > No, I think you're doing the right thing of having them both. I just
> > > didn't remember us doing that.
> > > 
> > > > Coming to the part of quirks now, you are right saying both
> > > > SMMUs will need
> > > > to have the same quirks in SC7280 and similar others where both
> > > > are based on
> > > > same IPs but those should probably be *hardware quirks* and if
> > > > they are
> > > > software based like the S2CR quirk depending on the firmware,
> > > > then it might
> > > > not be applicable to both. In case if it is applicable, then as
> > > > Jordan mentioned,
> > > > we can add the same quirks in GPU SMMU implementation.
> > > > 
> > > 
> > > I do suspect that at some point (probably sooner than later) we'd have
> > > to support both inheriting of stream from the bootloader and the
> > > Adreno
> > > "quirks" in the same instance.
> > > 
> > > But for now this is okay to me.
> > > 
> > 
> > Sure, let me know if you or anyone face any issues without it and I will
> > add it. I will resend this series with the dt-bindings patch for sc7280
> > smmu
> > which wasn't cc'd to smmu folks by mistake.
> > 
> 
> I think there is consensus on this series. I can resend if required but it
> still applies cleanly, let me know if you have any comments?

Please resend with the bindings patch, and I'd like Bjorn's Ack as well.

Will


Re: [PATCH 1/1] iommu: Don't use lazy flush for untrusted device

2021-03-25 Thread Will Deacon
On Thu, Feb 25, 2021 at 02:14:54PM +0800, Lu Baolu wrote:
> The lazy IOTLB flushing setup leaves a time window in which the device
> can still access system memory that has already been unmapped by the
> device driver. It's not suitable for untrusted devices. A malicious
> device might use this window to attack the system by obtaining data
> that it shouldn't be able to access.
> 
> Fixes: c588072bba6b5 ("iommu/vt-d: Convert intel iommu driver to the iommu 
> ops")
> Signed-off-by: Lu Baolu 
> ---
>  drivers/iommu/dma-iommu.c | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
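
For context, the shape of the fix is roughly the guard below in
iommu_dma_init_domain() (helper names as in dma-iommu.c of that
vintage; a simplified sketch, not the exact hunk):

	/*
	 * Never defer IOTLB invalidation for untrusted devices: they
	 * must not keep a window onto memory which the driver has
	 * already unmapped.
	 */
	if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
	    domain->ops->flush_iotlb_all) {
		if (init_iova_flush_queue(iovad, iommu_dma_flush_iotlb_all,
					  iommu_dma_entry_dtor))
			pr_warn("iova flush queue initialization failed\n");
		else
			cookie->fq_domain = domain;
	}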

Acked-by: Will Deacon 

Will


[GIT PULL] arm64: Fixes for -rc5

2021-03-25 Thread Will Deacon
Hi Linus,

Please pull these arm64 fixes for -rc5. Minor fixes all over, ranging
from typos to tests to errata workarounds. Summary in the tag.

Cheers,

Will

--->8

The following changes since commit c8e3866836528a4ba3b0535834f03768d74f7d8e:

  perf/arm_dmc620_pmu: Fix error return code in dmc620_pmu_device_probe() 
(2021-03-12 11:30:31 +)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git tags/arm64-fixes

for you to fetch changes up to 20109a859a9b514eb10c22b8a14b5704ffe93897:

  arm64: kernel: disable CNP on Carmel (2021-03-25 10:00:23 +)


arm64 fixes for -rc5

- Fix possible memory hotplug failure with KASLR

- Fix FFR value in SVE kselftest

- Fix backtraces reported in /proc/$pid/stack

- Disable broken CnP implementation on NVIDIA Carmel

- Typo fixes and ACPI documentation clarification

- Fix some W=1 warnings


Alex Elder (1):
  arm64: csum: cast to the proper type

Andre Przywara (1):
  kselftest/arm64: sve: Do not use non-canonical FFR register value

Bhaskar Chowdhury (1):
  arm64: cpuinfo: Fix a typo

Maninder Singh (1):
  arm64/process.c: fix Wmissing-prototypes build warnings

Mark Rutland (1):
  arm64: stacktrace: don't trace arch_stack_walk()

Pavel Tatashin (2):
  arm64: kdump: update ppos when reading elfcorehdr
  arm64: mm: correct the inside linear map range during hotplug check

Rich Wiley (1):
  arm64: kernel: disable CNP on Carmel

Tom Saeger (1):
  Documentation: arm64/acpi : clarify arm64 support of IBFT

 Documentation/arm64/acpi_object_usage.rst   | 10 +-
 Documentation/arm64/silicon-errata.rst  |  3 +++
 arch/arm64/Kconfig  | 10 ++
 arch/arm64/include/asm/checksum.h   |  2 +-
 arch/arm64/include/asm/cpucaps.h|  3 ++-
 arch/arm64/include/asm/processor.h  |  2 ++
 arch/arm64/include/asm/thread_info.h|  2 ++
 arch/arm64/kernel/cpu_errata.c  |  8 
 arch/arm64/kernel/cpufeature.c  |  5 -
 arch/arm64/kernel/cpuinfo.c |  2 +-
 arch/arm64/kernel/crash_dump.c  |  2 ++
 arch/arm64/kernel/process.c |  2 ++
 arch/arm64/kernel/stacktrace.c  |  9 +
 arch/arm64/mm/mmu.c | 21 +++--
 tools/testing/selftests/arm64/fp/sve-test.S | 22 +-
 15 files changed, 83 insertions(+), 20 deletions(-)


Re: [RFT PATCH v3 08/27] asm-generic/io.h: Add a non-posted variant of ioremap()

2021-03-25 Thread Will Deacon
On Thu, Mar 25, 2021 at 11:07:40PM +0900, Hector Martin wrote:
> On 25/03/2021 04.09, Arnd Bergmann wrote:
> > On Wed, Mar 24, 2021 at 7:12 PM Will Deacon  wrote:
> > > 
> > > > +/*
> > > > + * ioremap_np needs an explicit architecture implementation, as it
> > > > + * requests stronger semantics than regular ioremap(). Portable drivers
> > > > + * should instead use one of the higher-level abstractions, like
> > > > + * devm_ioremap_resource(), to choose the correct variant for any given
> > > > + * device and bus. Portable drivers with a good reason to want 
> > > > non-posted
> > > > + * write semantics should always provide an ioremap() fallback in case
> > > > + * ioremap_np() is not available.
> > > > + */
> > > > +#ifndef ioremap_np
> > > > +#define ioremap_np ioremap_np
> > > > +static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
> > > > +{
> > > > + return NULL;
> > > > +}
> > > > +#endif
> > > 
> > > Can we implement the generic pci_remap_cfgspace() in terms of ioremap_np()
> > > if it is supported by the architecture? That way, we could avoid defining
> > > both on arm64.
> > 
> > Good idea. It needs a fallback in case the ioremap_np() fails on most
> > architectures, but that sounds easy enough.
> > 
> > Since pci_remap_cfgspace() only has custom implementations, it sounds like
> > we can actually make the generic implementation unconditional in the end,
> > but that requires adding ioremap_np() on 32-bit as well, and I would keep
> > that separate from this series.
> 
> Sounds good; I'm adding a patch to adjust the generic implementation and
> remove the arm64 one in v4, and we can then complete the cleanup for other
> arches later.

Cheers. Don't forget to update the comment in the generic code which
complains about the lack of an ioremap() API for non-posted writes ;)
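
For the curious, a minimal sketch of what such a generic fallback could
look like, sitting next to the other ioremap wrappers (an assumption on
my part, not the final patch):

#ifndef pci_remap_cfgspace
#define pci_remap_cfgspace pci_remap_cfgspace
static inline void __iomem *pci_remap_cfgspace(phys_addr_t offset,
					       size_t size)
{
	/*
	 * Prefer a non-posted mapping where the architecture provides
	 * one, otherwise fall back to the default ioremap() attributes.
	 */
	return ioremap_np(offset, size) ?: ioremap(offset, size);
}
#endif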

Will


Re: [PATCH] drivers/perf: Simplify the SMMUv3 PMU event attributes

2021-03-25 Thread Will Deacon
On Mon, 8 Feb 2021 21:04:58 +0800, Qi Liu wrote:
> For each PMU event, there is a SMMU_EVENT_ATTR(xx, XX) and
> _event_attr_xx.attr.attr. Let's redefine the SMMU_EVENT_ATTR
> to simplify the smmu_pmu_events.

Applied to will (for-next/perf), thanks!

[1/1] drivers/perf: Simplify the SMMUv3 PMU event attributes
  https://git.kernel.org/will/c/174744136dcb

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH v2 0/3] drivers/perf: convert sysfs sprintf/snprintf/scnprintf to sysfs_emit

2021-03-25 Thread Will Deacon
On Fri, 19 Mar 2021 18:04:30 +0800, Qi Liu wrote:
> Use the generic sysfs_emit() and sysfs_emit_at() functions in place
> of sprintf/snprintf/scnprintf, to avoid buffer overruns.
> 
> Qi Liu (2):
>   drivers/perf: convert sysfs scnprintf family to sysfs_emit_at() and 
> sysfs_emit()
>   drivers/perf: convert sysfs sprintf family to sysfs_emit
> 
> [...]

Applied to will (for-next/perf), thanks!

[1/3] drivers/perf: convert sysfs snprintf family to sysfs_emit
  https://git.kernel.org/will/c/700a9cf0527c
[2/3] drivers/perf: convert sysfs scnprintf family to sysfs_emit_at() and 
sysfs_emit()
  https://git.kernel.org/will/c/9ec9f9cf8660
[3/3] drivers/perf: convert sysfs sprintf family to sysfs_emit
  https://git.kernel.org/will/c/fb62d67586af

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [PATCH 1/2] Perf: Clean generated directory, other archs.

2021-03-25 Thread Will Deacon
On Sun, Mar 07, 2021 at 11:19:42AM +0100, Gon Solo wrote:
> After a make clean there are generated directories left in the arch
> directories of perf. Clean them up.
> 
> Suggested-by: Arnaldo Carvalho de Melo 
> Signed-off-by: Andreas Wendleder 
> ---
>  tools/perf/arch/arm64/Makefile   | 5 +++--
>  tools/perf/arch/powerpc/Makefile | 5 +++--
>  tools/perf/arch/s390/Makefile| 5 +++--
>  3 files changed, 9 insertions(+), 6 deletions(-)

For arm64:

Acked-by: Will Deacon 

Will


Re: (subset) [PATCH 1/2] arm64/process.c: fix Wmissing-prototypes build warnings

2021-03-25 Thread Will Deacon
On Wed, 24 Mar 2021 12:24:58 +0530, Maninder Singh wrote:
> function protypes are missed before defination, which
> leads to compilation warning with "-Wmissing-prototypes"
> flag.
> 
> https://lkml.org/lkml/2021/3/19/840
> 
> arch/arm64/kernel/process.c:261:6: warning: no previous prototype for 
> '__show_regs' [-Wmissing-prototypes]
> 261 | void __show_regs(struct pt_regs *regs)
> |  ^~~
> arch/arm64/kernel/process.c:307:6: warning: no previous prototype for 
> '__show_regs_alloc_free' [-Wmissing-prototypes]
> 307 | void __show_regs_alloc_free(struct pt_regs *regs)
> |  ^~
> arch/arm64/kernel/process.c:365:5: warning: no previous prototype for 
> 'arch_dup_task_struct' [-Wmissing-prototypes]
>365 | int arch_dup_task_struct(struct task_struct *dst, struct task_struct 
> *src)
> | ^~~~
> arch/arm64/kernel/process.c:546:41: warning: no previous prototype for 
> '__switch_to' [-Wmissing-prototypes]
>546 | __notrace_funcgraph struct task_struct *__switch_to(struct 
> task_struct *prev,
> | ^~~
> arch/arm64/kernel/process.c:710:25: warning: no previous prototype for 
> 'arm64_preempt_schedule_irq' [-Wmissing-prototypes]
>710 | asmlinkage void __sched arm64_preempt_schedule_irq(void)
> | ^~

Applied first patch ONLY to arm64 (for-next/fixes), thanks!

[1/2] arm64/process.c: fix Wmissing-prototypes build warnings
  https://git.kernel.org/arm64/c/baa96377bc7b

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [RFT PATCH v3 16/27] irqchip/apple-aic: Add support for the Apple Interrupt Controller

2021-03-24 Thread Will Deacon
Hi Hector,

Sorry it took me so long to get to this. Some comments below.

On Fri, Mar 05, 2021 at 06:38:51AM +0900, Hector Martin wrote:
> This is the root interrupt controller used on Apple ARM SoCs such as the
> M1. This irqchip driver performs multiple functions:
> 
> * Handles both IRQs and FIQs
> 
> * Drives the AIC peripheral itself (which handles IRQs)
> 
> * Dispatches FIQs to downstream hard-wired clients (currently the ARM
>   timer).
> 
> * Implements a virtual IPI multiplexer to funnel multiple Linux IPIs
>   into a single hardware IPI
> 
> Signed-off-by: Hector Martin 
> ---
>  MAINTAINERS |   2 +
>  drivers/irqchip/Kconfig |   8 +
>  drivers/irqchip/Makefile|   1 +
>  drivers/irqchip/irq-apple-aic.c | 710 
>  include/linux/cpuhotplug.h  |   1 +
>  5 files changed, 722 insertions(+)
>  create mode 100644 drivers/irqchip/irq-apple-aic.c

[...]

> + * Implementation notes:
> + *
> + * - This driver creates two IRQ domains, one for HW IRQs and internal FIQs,
> + *   and one for IPIs.
> + * - Since Linux needs more than 2 IPIs, we implement a software IRQ 
> controller
> + *   and funnel all IPIs into one per-CPU IPI (the second "self" IPI is 
> unused).
> + * - FIQ hwirq numbers are assigned after true hwirqs, and are per-cpu.
> + * - DT bindings use 3-cell form (like GIC):
> + *   - <0 nr flags> - hwirq #nr
> + *   - <1 nr flags> - FIQ #nr
> + * - nr=0  Physical HV timer
> + * - nr=1  Virtual HV timer
> + * - nr=2  Physical guest timer
> + * - nr=3  Virtual guest timer
> + *
> + */
> +
> +#define pr_fmt(fmt) "%s: " fmt, __func__

General nit: but I suspect many of the prints in here probably want to be
using the *_ratelimited variants.

> +static void __exception_irq_entry aic_handle_irq(struct pt_regs *regs)
> +{
> + struct aic_irq_chip *ic = aic_irqc;
> + u32 event, type, irq;
> +
> + do {
> + /*
> +  * We cannot use a relaxed read here, as DMA needs to be
> +  * ordered with respect to the IRQ firing.
> +  */

I think this could be a bit clearer: the readl() doesn't order any DMA
accesses, but instead means that subsequent reads by the CPU (which may
be from a buffer which was DMA'd to) are ordered after the read of the
MMIO register.

> + event = readl(ic->base + AIC_EVENT);
> + type = FIELD_GET(AIC_EVENT_TYPE, event);
> + irq = FIELD_GET(AIC_EVENT_NUM, event);
> +
> + if (type == AIC_EVENT_TYPE_HW)
> + handle_domain_irq(aic_irqc->hw_domain, irq, regs);
> + else if (type == AIC_EVENT_TYPE_IPI && irq == 1)
> + aic_handle_ipi(regs);
> + else if (event != 0)
> + pr_err("Unknown IRQ event %d, %d\n", type, irq);
> + } while (event);
> +
> + /*
> +  * vGIC maintenance interrupts end up here too, so we need to check
> +  * for them separately. Just report and disable vGIC for now, until
> +  * we implement this properly.
> +  */
> + if ((read_sysreg_s(SYS_ICH_HCR_EL2) & ICH_HCR_EN) &&
> + read_sysreg_s(SYS_ICH_MISR_EL2) != 0) {
> + pr_err("vGIC IRQ fired, disabling.\n");
> + sysreg_clear_set_s(SYS_ICH_HCR_EL2, ICH_HCR_EN, 0);
> + }

What prevents all these system register accesses being speculated up before
the handler?

> +}
> +
> +static int aic_irq_set_affinity(struct irq_data *d,
> + const struct cpumask *mask_val, bool force)
> +{
> + irq_hw_number_t hwirq = irqd_to_hwirq(d);
> + struct aic_irq_chip *ic = irq_data_get_irq_chip_data(d);
> + int cpu;
> +
> + if (hwirq > ic->nr_hw)
> + return -EINVAL;
> +
> + if (force)
> + cpu = cpumask_first(mask_val);
> + else
> + cpu = cpumask_any_and(mask_val, cpu_online_mask);
> +
> + aic_ic_write(ic, AIC_TARGET_CPU + hwirq * 4, BIT(cpu));
> + irq_data_update_effective_affinity(d, cpumask_of(cpu));
> +
> + return IRQ_SET_MASK_OK;
> +}
> +
> +static int aic_irq_set_type(struct irq_data *d, unsigned int type)
> +{
> + return (type == IRQ_TYPE_LEVEL_HIGH) ? 0 : -EINVAL;
> +}
> +
> +static struct irq_chip aic_chip = {
> + .name = "AIC",
> + .irq_mask = aic_irq_mask,
> + .irq_unmask = aic_irq_unmask,

I know these are driven by the higher-level irq chip code, but I'm a bit
confused as to what provides ordering if, e.g. something ends up calling:

aic_chip.irq_mask(d);
...
aic_chip.irq_unmask(d);

I can't see any ISBs in here and they're writing to two different registers,
so can we end up with the IRQ masked after this sequence?

> +/*
> + * IPI irqchip
> + */
> +
> +static void aic_ipi_mask(struct irq_data *d)
> +{
> + u32 irq_bit = BIT(irqd_to_hwirq(d));
> + int this_cpu = smp_processor_id();
> +
> + /* No specific ordering requirements needed here. */
> + 

Re: [RFT PATCH v3 13/27] arm64: Add Apple vendor-specific system registers

2021-03-24 Thread Will Deacon
On Wed, Mar 24, 2021 at 06:59:21PM +, Mark Rutland wrote:
> On Wed, Mar 24, 2021 at 06:38:18PM +, Will Deacon wrote:
> > On Fri, Mar 05, 2021 at 06:38:48AM +0900, Hector Martin wrote:
> > > Apple ARM64 SoCs have a ton of vendor-specific registers we're going to
> > > have to deal with, and those don't really belong in sysreg.h with all
> > > the architectural registers. Make a new home for them, and add some
> > > registers which are useful for early bring-up.
> > > 
> > > Signed-off-by: Hector Martin 
> > > ---
> > >  MAINTAINERS   |  1 +
> > >  arch/arm64/include/asm/sysreg_apple.h | 69 +++
> > >  2 files changed, 70 insertions(+)
> > >  create mode 100644 arch/arm64/include/asm/sysreg_apple.h
> > > 
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index aec14fbd61b8..3a352c687d4b 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -1646,6 +1646,7 @@ B:  https://github.com/AsahiLinux/linux/issues
> > >  C:   irc://chat.freenode.net/asahi-dev
> > >  T:   git https://github.com/AsahiLinux/linux.git
> > >  F:   Documentation/devicetree/bindings/arm/apple.yaml
> > > +F:   arch/arm64/include/asm/sysreg_apple.h
> > 
> > (this isn't needed with my suggestion below).
> > 
> > >  ARM/ARTPEC MACHINE SUPPORT
> > >  M:   Jesper Nilsson 
> > > diff --git a/arch/arm64/include/asm/sysreg_apple.h 
> > > b/arch/arm64/include/asm/sysreg_apple.h
> > > new file mode 100644
> > > index ..48347a51d564
> > > --- /dev/null
> > > +++ b/arch/arm64/include/asm/sysreg_apple.h
> > 
> > I doubt apple are the only folks doing this, so can we instead have
> > sysreg-impdef.h please, and then have an Apple section in there for these
> > registers? That way, we could also have an imp_sys_reg() macro to limit
> > CRn to 11 or 15, which is the reserved encoding space for these registers.
> > 
> > We'll cc you for any patches touching the Apple parts, as we don't have
> > the first clue about what's hiding in there.
> 
> For existing IMP-DEF sysregs (e.g. the Kryo L2 control registers), we've
> put the definitions in the drivers, rather than collating
> non-architectural bits under arch/arm64/.

Yeah, but we could include those here as well.

> So far we've kept arch/arm64/ largely devoid of IMP-DEF bits, and it
> seems a shame to add something with the sole purpose of collating that,
> especially given arch code shouldn't need to touch these if FW and
> bootloader have done their jobs right.
> 
> Can we put the definitions in the relevant drivers? That would sidestep
> any pain with MAINTAINERS, too.

If we can genuinely ignore these in arch code, then sure. I just don't know
how long that is going to be the case, and ending up in a situation where
these are scattered randomly throughout the tree sounds horrible to me.

Will


Re: [RFT PATCH v3 13/27] arm64: Add Apple vendor-specific system registers

2021-03-24 Thread Will Deacon
On Fri, Mar 05, 2021 at 06:38:48AM +0900, Hector Martin wrote:
> Apple ARM64 SoCs have a ton of vendor-specific registers we're going to
> have to deal with, and those don't really belong in sysreg.h with all
> the architectural registers. Make a new home for them, and add some
> registers which are useful for early bring-up.
> 
> Signed-off-by: Hector Martin 
> ---
>  MAINTAINERS   |  1 +
>  arch/arm64/include/asm/sysreg_apple.h | 69 +++
>  2 files changed, 70 insertions(+)
>  create mode 100644 arch/arm64/include/asm/sysreg_apple.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index aec14fbd61b8..3a352c687d4b 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1646,6 +1646,7 @@ B:  https://github.com/AsahiLinux/linux/issues
>  C:   irc://chat.freenode.net/asahi-dev
>  T:   git https://github.com/AsahiLinux/linux.git
>  F:   Documentation/devicetree/bindings/arm/apple.yaml
> +F:   arch/arm64/include/asm/sysreg_apple.h

(this isn't needed with my suggestion below).

>  ARM/ARTPEC MACHINE SUPPORT
>  M:   Jesper Nilsson 
> diff --git a/arch/arm64/include/asm/sysreg_apple.h 
> b/arch/arm64/include/asm/sysreg_apple.h
> new file mode 100644
> index ..48347a51d564
> --- /dev/null
> +++ b/arch/arm64/include/asm/sysreg_apple.h

I doubt apple are the only folks doing this, so can we instead have
sysreg-impdef.h please, and then have an Apple section in there for these
registers? That way, we could also have an imp_sys_reg() macro to limit
CRn to 11 or 15, which is the reserved encoding space for these registers.

We'll cc you for any patches touching the Apple parts, as we don't have
the first clue about what's hiding in there.

> @@ -0,0 +1,69 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Apple SoC vendor-defined system register definitions
> + *
> + * Copyright The Asahi Linux Contributors
> +
> + * This file contains only well-understood registers that are useful to
> + * Linux. If you are looking for things to add here, you should visit:
> + *
> + * https://github.com/AsahiLinux/docs/wiki/HW:ARM-System-Registers
> + */
> +
> +#ifndef __ASM_SYSREG_APPLE_H
> +#define __ASM_SYSREG_APPLE_H
> +
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * Keep these registers in encoding order, except for register arrays;
> + * those should be listed in array order starting from the position of
> + * the encoding of the first register.
> + */
> +
> +#define SYS_APL_PMCR0_EL1sys_reg(3, 1, 15, 0, 0)
> +#define PMCR0_IMODE  GENMASK(10, 8)
> +#define PMCR0_IMODE_OFF  0
> +#define PMCR0_IMODE_PMI  1
> +#define PMCR0_IMODE_AIC  2
> +#define PMCR0_IMODE_HALT 3
> +#define PMCR0_IMODE_FIQ  4
> +#define PMCR0_IACT   BIT(11)

The Arm ARM says this about imp-def sysregs:

  | The Arm architecture guarantees not to define any register name
  | prefixed with IMP_ as part of the standard Arm architecture.
  |
  | Note
  | Arm strongly recommends that any register names created in the
  | IMPLEMENTATION DEFINED register spaces be prefixed with IMP_ and
  | postfixed with _ELx, where appropriate.

and it seems like we could follow that here without much effort, if you
don't mind.
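
Putting those two suggestions together, the header could end up looking
roughly like this (a sketch only: the file name, macro name and renamed
register below are illustrative, and no build-time CRn check is shown):

/* arch/arm64/include/asm/sysreg-impdef.h */

/* IMPLEMENTATION DEFINED encodings use op0 == 3 and CRn == 11 or 15 */
#define imp_sys_reg(op1, crn, crm, op2) \
	sys_reg(3, (op1), (crn), (crm), (op2))

/* Apple */
#define SYS_IMP_APL_PMCR0_EL1	imp_sys_reg(1, 15, 0, 0)
#define IMP_APL_PMCR0_IMODE	GENMASK(10, 8)
#define IMP_APL_PMCR0_IACT	BIT(11)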

Will


Re: [RFT PATCH v3 14/27] arm64: move ICH_ sysreg bits from arm-gic-v3.h to sysreg.h

2021-03-24 Thread Will Deacon
On Fri, Mar 05, 2021 at 06:38:49AM +0900, Hector Martin wrote:
> These definitions are in arm-gic-v3.h for historical reasons which no
> longer apply. Move them to sysreg.h so the AIC driver can use them, as
> it needs to peek into vGIC registers to deal with the GIC maintenance
> interrupt.
> 
> Signed-off-by: Hector Martin 
> ---
>  arch/arm64/include/asm/sysreg.h| 60 ++
>  include/linux/irqchip/arm-gic-v3.h | 56 
>  2 files changed, 60 insertions(+), 56 deletions(-)

This would be much easier to review if you just moved the definitions,
rather than reordered them and changed the comments at the same time!

But I *think* nothing had changed, so:

Acked-by: Will Deacon 

Will


Re: [RFT PATCH v3 11/27] arm64: Implement ioremap_np() to map MMIO as nGnRnE

2021-03-24 Thread Will Deacon
On Fri, Mar 05, 2021 at 06:38:46AM +0900, Hector Martin wrote:
> This is used on Apple ARM platforms, which require most MMIO
> (except PCI devices) to be mapped as nGnRnE.
> 
> Signed-off-by: Hector Martin 
> ---
>  arch/arm64/include/asm/io.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
> index 5ea8656a2030..953b8703af60 100644
> --- a/arch/arm64/include/asm/io.h
> +++ b/arch/arm64/include/asm/io.h
> @@ -169,6 +169,7 @@ extern void __iomem *ioremap_cache(phys_addr_t phys_addr, 
> size_t size);
>  
>  #define ioremap(addr, size)  __ioremap((addr), (size), 
> __pgprot(PROT_DEVICE_nGnRE))
>  #define ioremap_wc(addr, size)   __ioremap((addr), (size), 
> __pgprot(PROT_NORMAL_NC))
> +#define ioremap_np(addr, size)   __ioremap((addr), (size), 
> __pgprot(PROT_DEVICE_nGnRnE))

Probably worth noting that whether or not this actually results in a
non-posted mapping depends on the system architecture, but this is the
best we can do, so:

Acked-by: Will Deacon 

I would personally prefer that drivers didn't have to care about this, and
ioremap on arm64 figured out the right attributes based on the region being
mapped, but I haven't followed the discussion closely so I won't die on that
hill.

Will


Re: [RFT PATCH v3 05/27] arm64: cputype: Add CPU implementor & types for the Apple M1 cores

2021-03-24 Thread Will Deacon
On Fri, Mar 05, 2021 at 06:38:40AM +0900, Hector Martin wrote:
> The implementor will be used to condition the FIQ support quirk.
> 
> The specific CPU types are not used at the moment, but let's add them
> for documentation purposes.
> 
> Signed-off-by: Hector Martin 
> ---
>  arch/arm64/include/asm/cputype.h | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/cputype.h 
> b/arch/arm64/include/asm/cputype.h
> index ef5b040dee44..6231e1f0abe7 100644
> --- a/arch/arm64/include/asm/cputype.h
> +++ b/arch/arm64/include/asm/cputype.h
> @@ -59,6 +59,7 @@
>  #define ARM_CPU_IMP_NVIDIA   0x4E
>  #define ARM_CPU_IMP_FUJITSU  0x46
>  #define ARM_CPU_IMP_HISI 0x48
> +#define ARM_CPU_IMP_APPLE0x61
>  
>  #define ARM_CPU_PART_AEM_V8  0xD0F
>  #define ARM_CPU_PART_FOUNDATION  0xD00
> @@ -99,6 +100,9 @@
>  
>  #define HISI_CPU_PART_TSV110 0xD01
>  
> +#define APPLE_CPU_PART_M1_ICESTORM   0x022
> +#define APPLE_CPU_PART_M1_FIRESTORM  0x023
> +
>  #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, 
> ARM_CPU_PART_CORTEX_A53)
>  #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, 
> ARM_CPU_PART_CORTEX_A57)
>  #define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, 
> ARM_CPU_PART_CORTEX_A72)
> @@ -127,6 +131,8 @@
>  #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, 
> NVIDIA_CPU_PART_CARMEL)
>  #define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, 
> FUJITSU_CPU_PART_A64FX)
>  #define MIDR_HISI_TSV110 MIDR_CPU_MODEL(ARM_CPU_IMP_HISI, 
> HISI_CPU_PART_TSV110)
> +#define MIDR_APPLE_M1_ICESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, 
> APPLE_CPU_PART_M1_ICESTORM)
> +#define MIDR_APPLE_M1_FIRESTORM MIDR_CPU_MODEL(ARM_CPU_IMP_APPLE, 
> APPLE_CPU_PART_M1_FIRESTORM)

We usually only merge these when they're needed, but this SoC seems broken
enough that I can see the value in having them from the start :(

Acked-by: Will Deacon 

Will


Re: [RFT PATCH v3 08/27] asm-generic/io.h: Add a non-posted variant of ioremap()

2021-03-24 Thread Will Deacon
On Fri, Mar 05, 2021 at 06:38:43AM +0900, Hector Martin wrote:
> ARM64 currently defaults to posted MMIO (nGnRnE), but some devices
> require the use of non-posted MMIO (nGnRE). Introduce a new ioremap()
> variant to handle this case. ioremap_np() is aliased to ioremap() by
> default on arches that do not implement this variant.
> 
> sparc64 is the only architecture that needs to be touched directly,
> because it includes neither of the generic io.h or iomap.h headers.
> 
> This adds the IORESOURCE_MEM_NONPOSTED flag, which maps to this
> variant and marks a given resource as requiring non-posted mappings.
> This is implemented in the resource system because it is a SoC-level
> requirement, so existing drivers do not need special-case code to pick
> this ioremap variant.
> 
> Then this is implemented in devres by introducing devm_ioremap_np(),
> and making devm_ioremap_resource() automatically select this variant
> when the resource has the IORESOURCE_MEM_NONPOSTED flag set.
> 
> Signed-off-by: Hector Martin 
> ---
>  .../driver-api/driver-model/devres.rst|  1 +
>  arch/sparc/include/asm/io_64.h|  4 
>  include/asm-generic/io.h  | 22 ++-
>  include/asm-generic/iomap.h   |  9 
>  include/linux/io.h|  2 ++
>  include/linux/ioport.h|  1 +
>  lib/devres.c  | 22 +++
>  7 files changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/driver-api/driver-model/devres.rst 
> b/Documentation/driver-api/driver-model/devres.rst
> index cd8b6e657b94..2f45877a539d 100644
> --- a/Documentation/driver-api/driver-model/devres.rst
> +++ b/Documentation/driver-api/driver-model/devres.rst
> @@ -309,6 +309,7 @@ IOMAP
>devm_ioremap()
>devm_ioremap_uc()
>devm_ioremap_wc()
> +  devm_ioremap_np()
>devm_ioremap_resource() : checks resource, requests memory region, ioremaps
>devm_ioremap_resource_wc()
>devm_platform_ioremap_resource() : calls devm_ioremap_resource() for 
> platform device
> diff --git a/arch/sparc/include/asm/io_64.h b/arch/sparc/include/asm/io_64.h
> index 9bb27e5c22f1..9fbfc9574432 100644
> --- a/arch/sparc/include/asm/io_64.h
> +++ b/arch/sparc/include/asm/io_64.h
> @@ -409,6 +409,10 @@ static inline void __iomem *ioremap(unsigned long 
> offset, unsigned long size)
>  #define ioremap_uc(X,Y)  ioremap((X),(Y))
>  #define ioremap_wc(X,Y)  ioremap((X),(Y))
>  #define ioremap_wt(X,Y)  ioremap((X),(Y))
> +static inline void __iomem *ioremap_np(unsigned long offset, unsigned long 
> size)
> +{
> + return NULL;
> +}
>  
>  static inline void iounmap(volatile void __iomem *addr)
>  {
> diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
> index c6af40ce03be..082e0c96db6e 100644
> --- a/include/asm-generic/io.h
> +++ b/include/asm-generic/io.h
> @@ -942,7 +942,9 @@ static inline void *phys_to_virt(unsigned long address)
>   *
>   * ioremap_wc() and ioremap_wt() can provide more relaxed caching attributes
>   * for specific drivers if the architecture choses to implement them.  If 
> they
> - * are not implemented we fall back to plain ioremap.
> + * are not implemented we fall back to plain ioremap. Conversely, 
> ioremap_np()
> + * can provide stricter non-posted write semantics if the architecture
> + * implements them.
>   */
>  #ifndef CONFIG_MMU
>  #ifndef ioremap
> @@ -993,6 +995,24 @@ static inline void __iomem *ioremap_uc(phys_addr_t 
> offset, size_t size)
>  {
>   return NULL;
>  }
> +
> +/*
> + * ioremap_np needs an explicit architecture implementation, as it
> + * requests stronger semantics than regular ioremap(). Portable drivers
> + * should instead use one of the higher-level abstractions, like
> + * devm_ioremap_resource(), to choose the correct variant for any given
> + * device and bus. Portable drivers with a good reason to want non-posted
> + * write semantics should always provide an ioremap() fallback in case
> + * ioremap_np() is not available.
> + */
> +#ifndef ioremap_np
> +#define ioremap_np ioremap_np
> +static inline void __iomem *ioremap_np(phys_addr_t offset, size_t size)
> +{
> + return NULL;
> +}
> +#endif

Can we implement the generic pci_remap_cfgspace() in terms of ioremap_np()
if it is supported by the architecture? That way, we could avoid defining
both on arm64.

Will


Re: [RFT PATCH v3 01/27] arm64: Cope with CPUs stuck in VHE mode

2021-03-24 Thread Will Deacon
On Fri, Mar 05, 2021 at 06:38:36AM +0900, Hector Martin wrote:
> From: Marc Zyngier 
> 
> It seems that the CPU known as Apple M1 has the terrible habit
> of being stuck with HCR_EL2.E2H==1, in violation of the architecture.
> 
> Try and work around this deplorable state of affairs by detecting
> the stuck bit early and short-circuit the nVHE dance. It is still
> unknown whether there are many more such nuggets to be found...
> 
> Reported-by: Hector Martin 
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm64/kernel/head.S | 33 ++---
>  arch/arm64/kernel/hyp-stub.S | 28 
>  2 files changed, 54 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S
> index 66b0e0b66e31..673002b11865 100644
> --- a/arch/arm64/kernel/head.S
> +++ b/arch/arm64/kernel/head.S
> @@ -477,14 +477,13 @@ EXPORT_SYMBOL(kimage_vaddr)
>   * booted in EL1 or EL2 respectively.
>   */
>  SYM_FUNC_START(init_kernel_el)
> - mov_q   x0, INIT_SCTLR_EL1_MMU_OFF
> - msr sctlr_el1, x0
> -
>   mrs x0, CurrentEL
>   cmp x0, #CurrentEL_EL2
>   b.eqinit_el2
>  
>  SYM_INNER_LABEL(init_el1, SYM_L_LOCAL)
> + mov_q   x0, INIT_SCTLR_EL1_MMU_OFF
> + msr sctlr_el1, x0
>   isb
>   mov_q   x0, INIT_PSTATE_EL1
>   msr spsr_el1, x0
> @@ -504,6 +503,34 @@ SYM_INNER_LABEL(init_el2, SYM_L_LOCAL)
>   msr vbar_el2, x0
>   isb
>  
> + /*
> +  * Fruity CPUs seem to have HCR_EL2.E2H set to RES1,
> +  * making it impossible to start in nVHE mode. Is that
> +  * compliant with the architecture? Absolutely not!
> +  */
> + mrs x0, hcr_el2
> + and x0, x0, #HCR_E2H
> + cbz x0, 1f
> +
> + /* Switching to VHE requires a sane SCTLR_EL1 as a start */
> + mov_q   x0, INIT_SCTLR_EL1_MMU_OFF
> + msr_s   SYS_SCTLR_EL12, x0
> +
> + /*
> +  * Force an eret into a helper "function", and let it return
> +  * to our original caller... This makes sure that we have
> +  * initialised the basic PSTATE state.
> +  */
> + mov x0, #INIT_PSTATE_EL2
> + msr spsr_el1, x0
> + adr_l   x0, stick_to_vhe
> + msr elr_el1, x0
> + eret
> +
> +1:
> + mov_q   x0, INIT_SCTLR_EL1_MMU_OFF
> + msr sctlr_el1, x0
> +
>   msr elr_el2, lr
>   mov w0, #BOOT_CPU_MODE_EL2
>   eret
> diff --git a/arch/arm64/kernel/hyp-stub.S b/arch/arm64/kernel/hyp-stub.S
> index 5eccbd62fec8..c7601030ee82 100644
> --- a/arch/arm64/kernel/hyp-stub.S
> +++ b/arch/arm64/kernel/hyp-stub.S
> @@ -27,12 +27,12 @@ SYM_CODE_START(__hyp_stub_vectors)
>   ventry  el2_fiq_invalid // FIQ EL2t
>   ventry  el2_error_invalid   // Error EL2t
>  
> - ventry  el2_sync_invalid// Synchronous EL2h
> + ventry  elx_sync// Synchronous EL2h
>   ventry  el2_irq_invalid // IRQ EL2h
>   ventry  el2_fiq_invalid // FIQ EL2h
>   ventry  el2_error_invalid   // Error EL2h
>  
> - ventry  el1_sync// Synchronous 64-bit EL1
> + ventry  elx_sync// Synchronous 64-bit EL1
>   ventry  el1_irq_invalid // IRQ 64-bit EL1
>   ventry  el1_fiq_invalid // FIQ 64-bit EL1
>   ventry  el1_error_invalid   // Error 64-bit EL1
> @@ -45,7 +45,7 @@ SYM_CODE_END(__hyp_stub_vectors)
>  
>   .align 11
>  
> -SYM_CODE_START_LOCAL(el1_sync)
> +SYM_CODE_START_LOCAL(elx_sync)
>   cmp x0, #HVC_SET_VECTORS
>   b.ne1f
>   msr vbar_el2, x1
> @@ -71,7 +71,7 @@ SYM_CODE_START_LOCAL(el1_sync)
>  
>  9:   mov x0, xzr
>   eret
> -SYM_CODE_END(el1_sync)
> +SYM_CODE_END(elx_sync)
>  
>  // nVHE? No way! Give me the real thing!
>  SYM_CODE_START_LOCAL(mutate_to_vhe)
> @@ -243,3 +243,23 @@ SYM_FUNC_START(switch_to_vhe)
>  #endif
>   ret
>  SYM_FUNC_END(switch_to_vhe)
> +
> +SYM_FUNC_START(stick_to_vhe)
> + /*
> +  * Make sure the switch to VHE cannot fail, by overriding the
> +  * override. This is hilarious.
> +  */
> + adr_l   x1, id_aa64mmfr1_override
> + add x1, x1, #FTR_OVR_MASK_OFFSET
> + dc  civac, x1
> + dsb sy
> + isb

Why do we need an ISB here?

> + ldr x0, [x1]
> + bic x0, x0, #(0xf << ID_AA64MMFR1_VHE_SHIFT)
> + str x0, [x1]

I find it a bit bizarre doing this here, as for the primary CPU we're still
a way away from parsing the early parameters and for secondary CPUs this
doesn't need to be done for each one. Furthermore, this same code is run
on the resume path, which can probably then race with itself.

Is it possible to do it later on the boot CPU only, e.g. in
init_feature_override()? We should probably also log a warning that we're
ignoring the option because nVHE is not available.

Will


Re: [PATCHv3 6/6] arm64: irq: allow FIQs to be handled

2021-03-24 Thread Will Deacon
On Mon, Mar 15, 2021 at 11:56:29AM +, Mark Rutland wrote:
> On contemporary platforms we don't use FIQ, and treat any stray FIQ as a
> fatal event. However, some platforms have an interrupt controller wired
> to FIQ, and need to handle FIQ as part of regular operation.
> 
> So that we can support both cases dynamically, this patch updates the
> FIQ exception handling code to operate the same way as the IRQ handling
> code, with its own handle_arch_fiq handler. Where a root FIQ handler is
> not registered, an unexpected FIQ exception will trigger the default FIQ
> handler, which will panic() as today. Where a root FIQ handler is
> registered, handling of the FIQ is deferred to that handler.
> 
> As el0_fiq_invalid_compat is supplanted by el0_fiq, the former is
> removed. For !CONFIG_COMPAT builds we never expect to take an exception
> from AArch32 EL0, so we keep the common el0_fiq_invalid handler.
> 
> Signed-off-by: Mark Rutland 
> Tested-by: Hector Martin 
> Cc: Catalin Marinas 
> Cc: James Morse 
> Cc: Marc Zyngier 
> Cc: Thomas Gleixner 
> Cc: Will Deacon 
> ---
>  arch/arm64/include/asm/irq.h |  1 +
>  arch/arm64/kernel/entry.S| 30 +-
>  arch/arm64/kernel/irq.c  | 16 ++++
>  3 files changed, 38 insertions(+), 9 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCHv3 5/6] arm64: Always keep DAIF.[IF] in sync

2021-03-24 Thread Will Deacon
On Mon, Mar 15, 2021 at 11:56:28AM +, Mark Rutland wrote:
> From: Hector Martin 
> 
> Apple SoCs (A11 and newer) have some interrupt sources hardwired to the
> FIQ line. We implement support for this by simply treating IRQs and FIQs
> the same way in the interrupt vectors.
> 
> To support these systems, the FIQ mask bit needs to be kept in sync with
> the IRQ mask bit, so both kinds of exceptions are masked together. No
> other platforms should be delivering FIQ exceptions right now, and we
> already unmask FIQ in normal process context, so this should not have an
> effect on other systems - if spurious FIQs were arriving, they would
> already panic the kernel.
> 
> Signed-off-by: Hector Martin 
> Signed-off-by: Mark Rutland 
> Tested-by: Hector Martin 
> Cc: Catalin Marinas 
> Cc: James Morse 
> Cc: Marc Zyngier 
> Cc: Thomas Gleixner 
> Cc: Will Deacon 
> ---
>  arch/arm64/include/asm/arch_gicv3.h |  2 +-
>  arch/arm64/include/asm/assembler.h  |  8 
>  arch/arm64/include/asm/daifflags.h  | 10 +-
>  arch/arm64/include/asm/irqflags.h   | 16 +++-
>  arch/arm64/kernel/entry.S   | 12 +++-
>  arch/arm64/kernel/process.c |  2 +-
>  arch/arm64/kernel/smp.c     |  1 +
>  7 files changed, 26 insertions(+), 25 deletions(-)
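
The mechanical effect is easiest to see in the plain (non-PMR) irqflags
helpers, roughly (a sketch of the idea rather than the full diff, which
also touches the pseudo-NMI alternatives and the entry code):

static inline void arch_local_irq_enable(void)
{
	/* Unmask IRQ and FIQ together: DAIF bit 1 is I, bit 0 is F. */
	asm volatile("msr daifclr, #3" ::: "memory");
}

static inline void arch_local_irq_disable(void)
{
	/* Mask IRQ and FIQ together. */
	asm volatile("msr daifset, #3" ::: "memory");
}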

Acked-by: Will Deacon 

Will


Re: [PATCHv3 4/6] arm64: entry: factor irq triage logic into macros

2021-03-24 Thread Will Deacon
On Mon, Mar 15, 2021 at 11:56:27AM +, Mark Rutland wrote:
> From: Marc Zyngier 
> 
> In subsequent patches we'll allow an FIQ handler to be registered, and
> FIQ exceptions will need to be triaged very similarly to IRQ exceptions.
> So that we can reuse the existing logic, this patch factors the IRQ
> triage logic out into macros that can be reused for FIQ.
> 
> The macros are named to follow the elX_foo_handler scheme used by the C
> exception handlers. For consistency with other top-level exception
> handlers, the kernel_entry/kernel_exit logic is not moved into the
> macros. As FIQ will use a different C handler, this handler name is
> provided as an argument to the macros.
> 
> There should be no functional change as a result of this patch.
> 
> Signed-off-by: Marc Zyngier 
> [Mark: rework macros, commit message, rebase before DAIF rework]
> Signed-off-by: Mark Rutland 
> Tested-by: Hector Martin 
> Cc: Catalin Marinas 
> Cc: James Morse 
> Cc: Thomas Gleixner 
> Cc: Will Deacon 
> ---
>  arch/arm64/kernel/entry.S | 80 
> +------
>  1 file changed, 43 insertions(+), 37 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCHv3 3/6] arm64: irq: rework root IRQ handler registration

2021-03-24 Thread Will Deacon
On Mon, Mar 15, 2021 at 11:56:26AM +, Mark Rutland wrote:
> If we accidentally unmask IRQs before we've registered a root IRQ
> handler, handle_arch_irq will be NULL, and the IRQ exception handler
> will branch to a bogus address.
> 
> To make this easier to debug, this patch initialises handle_arch_irq to
> a default handler which will panic(), making such problems easier to
> debug. When we add support for FIQ handlers, we can follow the same
> approach.
> 
> When we add support for a root FIQ handler, it's possible to have a root
> IRQ handler without a root FIQ handler, and in theory the inverse is
> also possible. To permit this, and to keep the IRQ/FIQ registration
> logic similar, this patch removes the panic in the absence of a root IRQ
> controller. Instead, set_handle_irq() logs when a handler is registered,
> which is sufficient for debug purposes.
> 
> Signed-off-by: Mark Rutland 
> Tested-by: Hector Martin 
> Cc: Catalin Marinas 
> Cc: James Morse 
> Cc: Marc Zyngier 
> Cc: Thomas Gleixner 
> Cc: Will Deacon 
> ---
>  arch/arm64/kernel/irq.c | 12 ++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
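
In other words, something along the lines of the sketch below in
arch/arm64/kernel/irq.c (paraphrased from the description above, not
the literal patch):

static void default_handle_irq(struct pt_regs *regs)
{
	panic("IRQ taken without a registered root IRQ handler\n");
}

void (*handle_arch_irq)(struct pt_regs *) __ro_after_init = default_handle_irq;

void __init set_handle_irq(void (*handle_irq)(struct pt_regs *))
{
	if (WARN_ON(handle_arch_irq != default_handle_irq))
		return;

	handle_arch_irq = handle_irq;
	pr_info("Root IRQ handler: %ps\n", handle_irq);
}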

Acked-by: Will Deacon 

Will


Re: [PATCH] arm64: move --fix-cortex-a53-843419 linker test to Kconfig

2021-03-24 Thread Will Deacon
On Wed, Mar 24, 2021 at 04:11:28PM +0900, Masahiro Yamada wrote:
> $(call ld-option, --fix-cortex-a53-843419) in arch/arm64/Makefile is
> evaluated every time even for Make targets that do not need the linker,
> such as "make ARCH=arm64 install".
> 
> Recently, the Kbuild tree queued up a patch to avoid needless
> compiler/linker flag evaluation. I believe it is a good improvement in
> itself, but it causes a false-positive warning for arm64 installation
> in linux-next. (Thanks to Nathan for the report)
> 
> Kconfig can test the linker capability just once, and store it in the
> .config file. The build and installation steps that follow do not need
> to test the linker over again.
> 
> Reported-by: Nathan Chancellor 
> Signed-off-by: Masahiro Yamada 
> ---
> 
> I was not sure what the preferred CONFIG option name is.
> Please suggest one if you have a better idea.
> 
> 
>  arch/arm64/Kconfig  | 3 +++
>  arch/arm64/Makefile | 2 +-
>  2 files changed, 4 insertions(+), 1 deletion(-)
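
Concretely, the idea is to let Kconfig probe the linker once and have
the Makefile key off the result, along these lines (the config symbol
name is just a placeholder here):

config ARM64_LD_HAS_FIX_ERRATUM_843419
	def_bool $(ld-option,--fix-cortex-a53-843419)

and then in arch/arm64/Makefile:

ifeq ($(CONFIG_ARM64_ERRATUM_843419),y)
  ifeq ($(CONFIG_ARM64_LD_HAS_FIX_ERRATUM_843419),y)
    LDFLAGS_vmlinux	+= --fix-cortex-a53-843419
  endif
endif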

Acked-by: Will Deacon 

Will

