Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-04-01 Thread Will Deacon
On Thu, Apr 01, 2021 at 11:59:45AM +0200, Christoph Hellwig wrote:
> For now I'll just pass the iommu_domain to iommu_get_dma_strict,
> so that we can check for it.  We can do additional cleanups on top
> of that later.

Sounds good to me, cheers!

Will


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Will Deacon
On Wed, Mar 31, 2021 at 02:09:37PM +0100, Robin Murphy wrote:
> On 2021-03-31 12:49, Will Deacon wrote:
> > On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:
> > > On 2021-03-30 14:58, Will Deacon wrote:
> > > > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> > > > > On 2021-03-30 14:11, Will Deacon wrote:
> > > > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > > > > > From: Robin Murphy 
> > > > > > > 
> > > > > > > Instead make the global iommu_dma_strict parameter in iommu.c
> > > > > > > canonical by exporting helpers to get and set it and use those
> > > > > > > directly in the drivers.
> > > > > > > 
> > > > > > > This makes sure that the iommu.strict parameter also works for
> > > > > > > the AMD and Intel IOMMU drivers on x86.  As those default to
> > > > > > > lazy flushing a new IOMMU_CMD_LINE_STRICT is used to turn the
> > > > > > > value into a tristate to represent the default if not
> > > > > > > overridden by an explicit parameter.
> > > > > > > 
> > > > > > > Signed-off-by: Robin Murphy .
> > > > > > > [ported on top of the other iommu_attr changes and added a few
> > > > > > > small missing bits]
> > > > > > > Signed-off-by: Christoph Hellwig 
> > > > > > > ---
> > > > > > > drivers/iommu/amd/iommu.c   | 23 +---
> > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
> > > > > > > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > > > > > > drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > > > > > > drivers/iommu/dma-iommu.c   |  9 +--
> > > > > > > drivers/iommu/intel/iommu.c | 64 -
> > > > > > > drivers/iommu/iommu.c   | 27 ++---
> > > > > > > include/linux/iommu.h   |  4 +-
> > > > > > > 8 files changed, 40 insertions(+), 165 deletions(-)
> > > > > > 
> > > > > > I really like this cleanup, but I can't help wonder if it's going
> > > > > > in the wrong direction. With SoCs often having multiple IOMMU
> > > > > > instances and a distinction between "trusted" and "untrusted"
> > > > > > devices, then having the flush-queue enabled on a per-IOMMU or
> > > > > > per-domain basis doesn't sound unreasonable to me, but this change
> > > > > > makes it a global property.
> > > > > 
> > > > > The intent here was just to streamline the existing behaviour of
> > > > > stuffing a global property into a domain attribute then pulling it
> > > > > out again in the illusion that it was in any way per-domain. We're
> > > > > still checking dev_is_untrusted() before making an actual decision,
> > > > > and it's not like we can't add more factors at that point if we
> > > > > want to.
> > > > 
> > > > Like I say, the cleanup is great. I'm just wondering whether there's a
> > > > better way to express the complicated logic to decide whether or not
> > > > to use the flush queue than what we end up with:
> > > > 
> > > > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
> > > > domain->ops->flush_iotlb_all && !iommu_get_dma_strict())
> > > > 
> > > > which is mixing up globals, device properties and domain properties.
> > > > The result is that the driver code ends up just using the global to
> > > > determine whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the
> > > > page-table code, which is a departure from the current way of doing
> > > > things.
> > > 
> > > But previously, SMMU only ever saw th

Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-31 Thread Will Deacon
On Tue, Mar 30, 2021 at 05:28:19PM +0100, Robin Murphy wrote:
> On 2021-03-30 14:58, Will Deacon wrote:
> > On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> > > On 2021-03-30 14:11, Will Deacon wrote:
> > > > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > > > From: Robin Murphy 
> > > > > 
> > > > > Instead make the global iommu_dma_strict parameter in iommu.c
> > > > > canonical by exporting helpers to get and set it and use those
> > > > > directly in the drivers.
> > > > > 
> > > > > This makes sure that the iommu.strict parameter also works for the
> > > > > AMD and Intel IOMMU drivers on x86.  As those default to lazy
> > > > > flushing a new IOMMU_CMD_LINE_STRICT is used to turn the value into
> > > > > a tristate to represent the default if not overridden by an
> > > > > explicit parameter.
> > > > > 
> > > > > Signed-off-by: Robin Murphy .
> > > > > [ported on top of the other iommu_attr changes and added a few small
> > > > >missing bits]
> > > > > Signed-off-by: Christoph Hellwig 
> > > > > ---
> > > > >drivers/iommu/amd/iommu.c   | 23 +---
> > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
> > > > >drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > > > >drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > > > >drivers/iommu/dma-iommu.c   |  9 +--
> > > > >drivers/iommu/intel/iommu.c | 64 -
> > > > >drivers/iommu/iommu.c   | 27 ++---
> > > > >include/linux/iommu.h   |  4 +-
> > > > >8 files changed, 40 insertions(+), 165 deletions(-)
> > > > 
> > > > I really like this cleanup, but I can't help wonder if it's going in the
> > > > wrong direction. With SoCs often having multiple IOMMU instances and a
> > > > distinction between "trusted" and "untrusted" devices, then having the
> > > > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
> > > > unreasonable to me, but this change makes it a global property.
> > > 
> > > The intent here was just to streamline the existing behaviour of stuffing a
> > > global property into a domain attribute then pulling it out again in the
> > > illusion that it was in any way per-domain. We're still checking
> > > dev_is_untrusted() before making an actual decision, and it's not like we
> > > can't add more factors at that point if we want to.
> > 
> > Like I say, the cleanup is great. I'm just wondering whether there's a
> > better way to express the complicated logic to decide whether or not to use
> > the flush queue than what we end up with:
> > 
> > if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
> > domain->ops->flush_iotlb_all && !iommu_get_dma_strict())
> > 
> > which is mixing up globals, device properties and domain properties. The
> > result is that the driver code ends up just using the global to determine
> > whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
> > which is a departure from the current way of doing things.
> 
> But previously, SMMU only ever saw the global policy piped through the
> domain attribute by iommu_group_alloc_default_domain(), so there's no
> functional change there.

For DMA domains sure, but I don't think that's the case for unmanaged
domains such as those used by VFIO.

> Obviously some of the above checks could be factored out into some kind of
> iommu_use_flush_queue() helper that IOMMU drivers can also call if they need
> to keep in sync. Or maybe we just allow iommu-dma to set
> IO_PGTABLE_QUIRK_NON_STRICT directly via iommu_set_pgtable_quirks() if we're
> treating that as a generic thing now.

I think a helper that takes a domain would be a good starting point.
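
Not even compile-tested, but roughly the shape I have in mind (reusing your
iommu_use_flush_queue() name; this assumes the iommu_get_dma_strict() helper
from this patch and a driver-visible equivalent of iommu-dma's
dev_is_untrusted() check):

	static bool iommu_use_flush_queue(struct iommu_domain *domain,
					  struct device *dev)
	{
		if (iommu_get_dma_strict())
			return false;
		if (dev && dev_is_untrusted(dev))
			return false;
		return domain->ops->flush_iotlb_all != NULL;
	}

iommu-dma would then additionally check cookie->fq_domain, and the drivers
could call the same helper when deciding whether to set
IO_PGTABLE_QUIRK_NON_STRICT.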

Will


Re: [PATCH bpf-next] bpf: arm64: Redefine MOV consistent with arch insn

2021-03-31 Thread Will Deacon
On Wed, Mar 31, 2021 at 05:22:18PM +0800, Jianlin Lv wrote:
> On Tue, Mar 30, 2021 at 5:31 PM Will Deacon  wrote:
> >
> > On Tue, Mar 30, 2021 at 03:42:35PM +0800, Jianlin Lv wrote:
> > > A64_MOV is currently mapped to Add Instruction. Architecturally MOV
> > > (register) is an alias of ORR (shifted register) and MOV (to or from SP)
> > > is an alias of ADD (immediate).
> > > This patch redefines A64_MOV and uses existing functionality
> > > aarch64_insn_gen_move_reg() in insn.c to encode MOV (register) 
> > > instruction.
> > > For moving between register and stack pointer, rename macro to A64_MOV_SP.
> >
> > What does this gain us? There's no requirement for a BPF "MOV" to match an
> > arm64 architectural "MOV", so what's the up-side of aligning them like this?
> 
> According to the description in the Arm Software Optimization Guide,
> Arithmetic(basic) and Logical(basic) instructions have the same
> Exec Latency and Execution Throughput.
> This change did not bring about a performance improvement.
> The original intention was to make the instruction map more 'natively'.

I think we should leave the code as-is, then. Having a separate MOV_SP
macro is confusing and, worse, I worry that somebody passing A64_SP to
A64_MOV will end up using the zero register.
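
(For background: in the underlying encodings, register number 31 means
XZR/WZR in ORR (shifted register) but SP in ADD (immediate), so a MOV
built on ORR would silently use the zero register if handed A64_SP.)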

Will


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-30 Thread Will Deacon
On Tue, Mar 30, 2021 at 02:19:38PM +0100, Robin Murphy wrote:
> On 2021-03-30 14:11, Will Deacon wrote:
> > On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> > > From: Robin Murphy 
> > > 
> > > Instead make the global iommu_dma_strict parameter in iommu.c canonical by
> > > exporting helpers to get and set it and use those directly in the drivers.
> > > 
> > > This makes sure that the iommu.strict parameter also works for the AMD and
> > > Intel IOMMU drivers on x86.  As those default to lazy flushing a new
> > > IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
> > > represent the default if not overridden by an explicit parameter.
> > > 
> > > Signed-off-by: Robin Murphy .
> > > [ported on top of the other iommu_attr changes and added a few small
> > >   missing bits]
> > > Signed-off-by: Christoph Hellwig 
> > > ---
> > >   drivers/iommu/amd/iommu.c   | 23 +---
> > >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
> > >   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> > >   drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
> > >   drivers/iommu/dma-iommu.c   |  9 +--
> > >   drivers/iommu/intel/iommu.c | 64 -
> > >   drivers/iommu/iommu.c   | 27 ++---
> > >   include/linux/iommu.h   |  4 +-
> > >   8 files changed, 40 insertions(+), 165 deletions(-)
> > 
> > I really like this cleanup, but I can't help wonder if it's going in the
> > wrong direction. With SoCs often having multiple IOMMU instances and a
> > distinction between "trusted" and "untrusted" devices, then having the
> > flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
> > unreasonable to me, but this change makes it a global property.
> 
> The intent here was just to streamline the existing behaviour of stuffing a
> global property into a domain attribute then pulling it out again in the
> illusion that it was in any way per-domain. We're still checking
> dev_is_untrusted() before making an actual decision, and it's not like we
> can't add more factors at that point if we want to.

Like I say, the cleanup is great. I'm just wondering whether there's a
better way to express the complicated logic to decide whether or not to use
the flush queue than what we end up with:

if (!cookie->fq_domain && (!dev || !dev_is_untrusted(dev)) &&
domain->ops->flush_iotlb_all && !iommu_get_dma_strict())

which is mixing up globals, device properties and domain properties. The
result is that the driver code ends up just using the global to determine
whether or not to pass IO_PGTABLE_QUIRK_NON_STRICT to the page-table code,
which is a departure from the current way of doing things.

> > For example, see the recent patch from Lu Baolu:
> > 
> > https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com
> 
> Erm, this patch is based on that one, it's right there in the context :/

Ah, sorry, I didn't spot that! I was just trying to illustrate that this
is per-device.

Will


Re: [PATCH 18/18] iommu: remove iommu_domain_{get,set}_attr

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:24PM +0100, Christoph Hellwig wrote:
> Remove the now unused iommu attr infrastructure.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/iommu/iommu.c | 26 --
>  include/linux/iommu.h | 36 
>  2 files changed, 62 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 17/18] iommu: remove DOMAIN_ATTR_IO_PGTABLE_CFG

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:23PM +0100, Christoph Hellwig wrote:
> Use an explicit set_pgtable_quirks method instead that just passes
> the actual quirk bitmask instead.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/gpu/drm/msm/adreno/adreno_gpu.c |  5 +-
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 64 +
>  drivers/iommu/arm/arm-smmu/arm-smmu.h   |  2 +-
>  drivers/iommu/iommu.c   | 11 +
>  include/linux/io-pgtable.h  |  4 --
>  include/linux/iommu.h   | 12 -
>  6 files changed, 35 insertions(+), 63 deletions(-)

I'm fine with this for now, although there has been talk about passing
things other than boolean flags as page-table quirks. We can cross that
bridge when we get there, so:

Acked-by: Will Deacon 

Will


Re: [PATCH 16/18] iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:22PM +0100, Christoph Hellwig wrote:
> From: Robin Murphy 
> 
> Instead make the global iommu_dma_strict parameter in iommu.c canonical by
> exporting helpers to get and set it and use those directly in the drivers.
> 
> This makes sure that the iommu.strict parameter also works for the AMD and
> Intel IOMMU drivers on x86.  As those default to lazy flushing a new
> IOMMU_CMD_LINE_STRICT is used to turn the value into a tristate to
> represent the default if not overridden by an explicit parameter.
> 
> Signed-off-by: Robin Murphy .
> [ported on top of the other iommu_attr changes and added a few small
>  missing bits]
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/iommu/amd/iommu.c   | 23 +---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 50 +---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 27 +
>  drivers/iommu/dma-iommu.c   |  9 +--
>  drivers/iommu/intel/iommu.c | 64 -
>  drivers/iommu/iommu.c   | 27 ++---
>  include/linux/iommu.h   |  4 +-
>  8 files changed, 40 insertions(+), 165 deletions(-)

I really like this cleanup, but I can't help wonder if it's going in the
wrong direction. With SoCs often having multiple IOMMU instances and a
distinction between "trusted" and "untrusted" devices, then having the
flush-queue enabled on a per-IOMMU or per-domain basis doesn't sound
unreasonable to me, but this change makes it a global property.

For example, see the recent patch from Lu Baolu:

https://lore.kernel.org/r/20210225061454.2864009-1-baolu...@linux.intel.com

Will


Re: [PATCH 15/18] iommu: remove iommu_set_cmd_line_dma_api and iommu_cmd_line_dma_api

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:21PM +0100, Christoph Hellwig wrote:
> Don't obfuscate the trivial bit flag check.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/iommu/iommu.c | 23 +--
>  1 file changed, 5 insertions(+), 18 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 14/18] iommu: remove DOMAIN_ATTR_NESTING

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:20PM +0100, Christoph Hellwig wrote:
> Use an explicit enable_nesting method instead.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 43 -
>  drivers/iommu/arm/arm-smmu/arm-smmu.c   | 30 +++---
>  drivers/iommu/intel/iommu.c | 31 +--
>  drivers/iommu/iommu.c   | 10 +
>  drivers/vfio/vfio_iommu_type1.c |  5 +--
>  include/linux/iommu.h   |  4 +-
>  6 files changed, 55 insertions(+), 68 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 13/18] iommu: remove DOMAIN_ATTR_GEOMETRY

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:19PM +0100, Christoph Hellwig wrote:
> The geometry information can be trivially queried from the iommu_domain
> structure.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/iommu.c   | 20 +++-
>  drivers/vfio/vfio_iommu_type1.c | 26 --
>  drivers/vhost/vdpa.c| 10 +++---
>  include/linux/iommu.h   |  1 -
>  4 files changed, 18 insertions(+), 39 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 12/18] iommu: remove DOMAIN_ATTR_PAGING

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:18PM +0100, Christoph Hellwig wrote:
> DOMAIN_ATTR_PAGING is never used.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/iommu.c | 5 -
>  include/linux/iommu.h | 1 -
>  2 files changed, 6 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 11/18] iommu/fsl_pamu: remove the snoop_id field

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:17PM +0100, Christoph Hellwig wrote:
> The snoop_id is always set to ~(u32)0.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 5 ++---
>  drivers/iommu/fsl_pamu_domain.h | 1 -
>  2 files changed, 2 insertions(+), 4 deletions(-)

pamu_config_ppaace() takes quite a few useless parameters at this stage,
but anyway:

Acked-by: Will Deacon 

Do you know if this driver is actually useful? Once the complexity has been
stripped back, the stubs and default values really stand out.

Will


Re: [PATCH 10/18] iommu/fsl_pamu: enable the liodn when attaching a device

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:16PM +0100, Christoph Hellwig wrote:
> Instead of a separate call to enable all devices from the list, just
> enablde the liodn one the device is attached to the iommu domain.

(typos: "enablde" and "one" probably needs to be "once"?)

> This also remove the DOMAIN_ATTR_FSL_PAMU_ENABLE iommu_attr.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 47 ++---
>  drivers/iommu/fsl_pamu_domain.h | 10 --
>  drivers/soc/fsl/qbman/qman_portal.c | 11 ---
>  include/linux/iommu.h   |  1 -
>  4 files changed, 3 insertions(+), 66 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 09/18] iommu/fsl_pamu: merge handle_attach_device into fsl_pamu_attach_device

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:15PM +0100, Christoph Hellwig wrote:
> No good reason to split this functionality over two functions.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 59 +++--
>  1 file changed, 20 insertions(+), 39 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 08/18] iommu/fsl_pamu: merge pamu_set_liodn and map_liodn

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:14PM +0100, Christoph Hellwig wrote:
> Merge the two functions that configure the ppaace into a single coherent
> function.  I somehow doubt we need the two pamu_config_ppaace calls,
> but keep the existing behavior just to be on the safe side.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 65 +
>  1 file changed, 17 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
> index 40eff4b7bc5d42..4a4944332674f7 100644
> --- a/drivers/iommu/fsl_pamu_domain.c
> +++ b/drivers/iommu/fsl_pamu_domain.c
> @@ -54,25 +54,6 @@ static int __init iommu_init_mempool(void)
>   return 0;
>  }
>  
> -/* Map the DMA window corresponding to the LIODN */
> -static int map_liodn(int liodn, struct fsl_dma_domain *dma_domain)
> -{
> - int ret;
> - struct iommu_domain_geometry *geom = &dma_domain->iommu_domain.geometry;
> - unsigned long flags;
> -
> - spin_lock_irqsave(&iommu_lock, flags);
> - ret = pamu_config_ppaace(liodn, geom->aperture_start,
> -  geom->aperture_end - 1, ~(u32)0,
> -  0, dma_domain->snoop_id, dma_domain->stash_id,
> -  PAACE_AP_PERMS_QUERY | PAACE_AP_PERMS_UPDATE);
> - spin_unlock_irqrestore(&iommu_lock, flags);
> - if (ret)
> - pr_debug("PAACE configuration failed for liodn %d\n", liodn);
> -
> - return ret;
> -}
> -
>  static int update_liodn_stash(int liodn, struct fsl_dma_domain *dma_domain,
> u32 val)
>  {
> @@ -94,11 +75,11 @@ static int update_liodn_stash(int liodn, struct 
> fsl_dma_domain *dma_domain,
>  }
>  
>  /* Set the geometry parameters for a LIODN */
> -static int pamu_set_liodn(int liodn, struct device *dev,
> -   struct fsl_dma_domain *dma_domain,
> -   struct iommu_domain_geometry *geom_attr)
> +static int pamu_set_liodn(struct fsl_dma_domain *dma_domain, struct device 
> *dev,
> +   int liodn)
>  {
> - phys_addr_t window_addr, window_size;
> + struct iommu_domain *domain = &dma_domain->iommu_domain;
> + struct iommu_domain_geometry *geom = &domain->geometry;
>   u32 omi_index = ~(u32)0;
>   unsigned long flags;
>   int ret;
> @@ -110,22 +91,25 @@ static int pamu_set_liodn(int liodn, struct device *dev,
>*/
>   get_ome_index(&omi_index, dev);
>  
> - window_addr = geom_attr->aperture_start;
> - window_size = geom_attr->aperture_end + 1;
> -
>   spin_lock_irqsave(&iommu_lock, flags);
>   ret = pamu_disable_liodn(liodn);
> - if (!ret)
> - ret = pamu_config_ppaace(liodn, window_addr, window_size, 
> omi_index,
> -  0, dma_domain->snoop_id,
> -  dma_domain->stash_id, 0);
> + if (ret)
> + goto out_unlock;
> + ret = pamu_config_ppaace(liodn, geom->aperture_start,
> +  geom->aperture_end - 1, omi_index, 0,
> +  dma_domain->snoop_id, dma_domain->stash_id, 0);
> + if (ret)
> + goto out_unlock;
> + ret = pamu_config_ppaace(liodn, geom->aperture_start,
> +  geom->aperture_end - 1, ~(u32)0,
> +  0, dma_domain->snoop_id, dma_domain->stash_id,
> +  PAACE_AP_PERMS_QUERY | PAACE_AP_PERMS_UPDATE);

There's more '+1' / '-1' confusion here with aperture_end which I'm not
managing to follow. What am I missing?

Will


Re: [PATCH 07/18] iommu/fsl_pamu: replace DOMAIN_ATTR_FSL_PAMU_STASH with a direct call

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:13PM +0100, Christoph Hellwig wrote:
> Add a fsl_pamu_configure_l1_stash API that qman_portal can call directly
> instead of indirecting through the iommu attr API.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  arch/powerpc/include/asm/fsl_pamu_stash.h | 12 +++-
>  drivers/iommu/fsl_pamu_domain.c   | 16 +++-
>  drivers/iommu/fsl_pamu_domain.h   |  2 --
>  drivers/soc/fsl/qbman/qman_portal.c   | 18 +++---
>  include/linux/iommu.h |  1 -
>  5 files changed, 9 insertions(+), 40 deletions(-)

Heh, this thing is so over-engineered.

Acked-by: Will Deacon 

Will


Re: [PATCH 06/18] iommu/fsl_pamu: remove ->domain_window_enable

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:12PM +0100, Christoph Hellwig wrote:
> The only thing that fsl_pamu_window_enable does for the current caller
> is to fill in the prot value in the only dma_window structure, and to
> propagate a few values from the iommu_domain_geometry structure into the
> dma_window.  Remove the dma_window entirely, hardcode the prot value and
> otherwise use the iommu_domain_geometry structure instead.
> 
> Remove the now unused ->domain_window_enable iommu method.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 182 +++-
>  drivers/iommu/fsl_pamu_domain.h |  17 ---
>  drivers/iommu/iommu.c   |  11 --
>  drivers/soc/fsl/qbman/qman_portal.c |   7 --
>  include/linux/iommu.h   |  17 ---
>  5 files changed, 14 insertions(+), 220 deletions(-)
> 
> diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
> index e6bdd38fc18409..fd2bc88b690465 100644
> --- a/drivers/iommu/fsl_pamu_domain.c
> +++ b/drivers/iommu/fsl_pamu_domain.c
> @@ -54,34 +54,18 @@ static int __init iommu_init_mempool(void)
>   return 0;
>  }
>  
> -static phys_addr_t get_phys_addr(struct fsl_dma_domain *dma_domain, 
> dma_addr_t iova)
> -{
> - struct dma_window *win_ptr = &dma_domain->win_arr[0];
> - struct iommu_domain_geometry *geom;
> -
> - geom = &dma_domain->iommu_domain.geometry;
> -
> - if (win_ptr->valid)
> - return win_ptr->paddr + (iova & (win_ptr->size - 1));
> -
> - return 0;
> -}
> -
>  /* Map the DMA window corresponding to the LIODN */
>  static int map_liodn(int liodn, struct fsl_dma_domain *dma_domain)
>  {
>   int ret;
> - struct dma_window *wnd = &dma_domain->win_arr[0];
> - phys_addr_t wnd_addr = dma_domain->iommu_domain.geometry.aperture_start;
> + struct iommu_domain_geometry *geom = &dma_domain->iommu_domain.geometry;
>   unsigned long flags;
>  
>   spin_lock_irqsave(&iommu_lock, flags);
> - ret = pamu_config_ppaace(liodn, wnd_addr,
> -  wnd->size,
> -  ~(u32)0,
> -  wnd->paddr >> PAMU_PAGE_SHIFT,
> -  dma_domain->snoop_id, dma_domain->stash_id,
> -  wnd->prot);
> + ret = pamu_config_ppaace(liodn, geom->aperture_start,
> +  geom->aperture_end - 1, ~(u32)0,

You're passing 'geom->aperture_end - 1' as the size here, but the old code
seemed to _add_ 1:

> -static int fsl_pamu_window_enable(struct iommu_domain *domain, u32 wnd_nr,
> -   phys_addr_t paddr, u64 size, int prot)
> -{
> - struct fsl_dma_domain *dma_domain = to_fsl_dma_domain(domain);
> - struct dma_window *wnd;
> - int pamu_prot = 0;
> - int ret;
> - unsigned long flags;
> - u64 win_size;
> -
> - if (prot & IOMMU_READ)
> - pamu_prot |= PAACE_AP_PERMS_QUERY;
> - if (prot & IOMMU_WRITE)
> - pamu_prot |= PAACE_AP_PERMS_UPDATE;
> -
> - spin_lock_irqsave(&dma_domain->domain_lock, flags);
> - if (wnd_nr > 0) {
> - pr_debug("Invalid window index\n");
> - spin_unlock_irqrestore(&dma_domain->domain_lock, flags);
> - return -EINVAL;
> - }
> -
> - win_size = (domain->geometry.aperture_end + 1) >> ilog2(1);

here ^^ when calculating the exclusive upper bound. In other words, I think
'1ULL << 36' used to get passed to pamu_config_ppaace(). Is that an
intentional change?
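
(For concreteness, assuming the default geometry from earlier in the series
where aperture_end is (1ULL << 36) - 1: the old path computed
win_size = aperture_end + 1 = 1ULL << 36, whereas the new call passes
aperture_end - 1 = (1ULL << 36) - 2 in that slot, which is neither the old
size nor the inclusive end address.)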

Will


Re: [PATCH 05/18] iommu/fsl_pamu: remove support for multiple windows

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:11PM +0100, Christoph Hellwig wrote:
> The only domains allocated forces use of a single window.  Remove all
> the code related to multiple window support, as well as the need for
> qman_portal to force a single window.
> 
> Remove the now unused DOMAIN_ATTR_WINDOWS iommu_attr.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu.c| 264 +-
>  drivers/iommu/fsl_pamu.h|  10 +-
>  drivers/iommu/fsl_pamu_domain.c | 275 +---
>  drivers/iommu/fsl_pamu_domain.h |  12 +-
>  drivers/soc/fsl/qbman/qman_portal.c |   7 -
>  include/linux/iommu.h   |   1 -
>  6 files changed, 59 insertions(+), 510 deletions(-)

[...]

> + set_bf(ppaace->impl_attr, PAACE_IA_ATM, PAACE_ATM_WINDOW_XLATE);
> + ppaace->twbah = rpn >> 20;
> + set_bf(ppaace->win_bitfields, PAACE_WIN_TWBAL, rpn);
> + set_bf(ppaace->addr_bitfields, PAACE_AF_AP, prot);
> + set_bf(ppaace->impl_attr, PAACE_IA_WCE, 0);
> + set_bf(ppaace->addr_bitfields, PPAACE_AF_MW, 0);
>   mb();

(I wonder what on Earth that mb() is doing...)

> diff --git a/drivers/iommu/fsl_pamu_domain.h b/drivers/iommu/fsl_pamu_domain.h
> index 53d359d66fe577..b9236fb5a8f82e 100644
> --- a/drivers/iommu/fsl_pamu_domain.h
> +++ b/drivers/iommu/fsl_pamu_domain.h
> @@ -17,23 +17,13 @@ struct dma_window {
>  };
>  
>  struct fsl_dma_domain {
> - /*
> -  * Number of windows assocaited with this domain.
> -  * During domain initialization, it is set to the
> -  * the maximum number of subwindows allowed for a LIODN.
> -  * Minimum value for this is 1 indicating a single PAMU
> -  * window, without any sub windows. Value can be set/
> -  * queried by set_attr/get_attr API for DOMAIN_ATTR_WINDOWS.
> -  * Value can only be set once the geometry has been configured.
> -  */
> - u32 win_cnt;
>   /*
>* win_arr contains information of the configured
>* windows for a domain. This is allocated only
>* when the number of windows for the domain are
>* set.
>*/

The last part of this comment is now stale ^^

> - struct dma_window   *win_arr;
> + struct dma_window       win_arr[1];
>   /* list of devices associated with the domain */
>   struct list_headdevices;
>   /* dma_domain states:

Acked-by: Will Deacon 

Will


Re: [PATCH 04/18] iommu/fsl_pamu: merge iommu_alloc_dma_domain into fsl_pamu_domain_alloc

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:10PM +0100, Christoph Hellwig wrote:
> Keep the functionality to allocate the domain together.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 34 ++---
>  1 file changed, 10 insertions(+), 24 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 03/18] iommu/fsl_pamu: remove support for setting DOMAIN_ATTR_GEOMETRY

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:09PM +0100, Christoph Hellwig wrote:
> The default geometry is the same as the one set by qman_port given
> that FSL_PAMU depends on having 64-bit physical and thus DMA addresses.
> 
> Remove the support to update the geometry and remove the now pointless
> geom_size field.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 55 +++--
>  drivers/iommu/fsl_pamu_domain.h |  6 
>  drivers/soc/fsl/qbman/qman_portal.c | 12 ---
>  3 files changed, 5 insertions(+), 68 deletions(-)

Took me a minute to track down the other magic '36' which ends up in
aperture_end, but I found it eventually so:

Acked-by: Will Deacon 

(It does make me wonder what all this glue was intended to be used for)

Will


Re: [PATCH 02/18] iommu/fsl_pamu: remove fsl_pamu_get_domain_attr

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:08PM +0100, Christoph Hellwig wrote:
> None of the values returned by this function are ever queried.  Also
> remove the DOMAIN_ATTR_FSL_PAMUV1 enum value that is not otherwise used.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 30 --
>  include/linux/iommu.h   |  4 
>  2 files changed, 34 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH 01/18] iommu: remove the unused domain_window_disable method

2021-03-30 Thread Will Deacon
On Tue, Mar 16, 2021 at 04:38:07PM +0100, Christoph Hellwig wrote:
> domain_window_disable is wired up by fsl_pamu, but never actually called.
> 
> Signed-off-by: Christoph Hellwig 
> Acked-by: Li Yang 
> ---
>  drivers/iommu/fsl_pamu_domain.c | 48 -
>  include/linux/iommu.h   |  2 --
>  2 files changed, 50 deletions(-)

Acked-by: Will Deacon 

Will


Re: [PATCH bpf-next] bpf: arm64: Redefine MOV consistent with arch insn

2021-03-30 Thread Will Deacon
On Tue, Mar 30, 2021 at 03:42:35PM +0800, Jianlin Lv wrote:
> A64_MOV is currently mapped to Add Instruction. Architecturally MOV
> (register) is an alias of ORR (shifted register) and MOV (to or from SP)
> is an alias of ADD (immediate).
> This patch redefines A64_MOV and uses existing functionality
> aarch64_insn_gen_move_reg() in insn.c to encode MOV (register) instruction.
> For moving between register and stack pointer, rename macro to A64_MOV_SP.

What does this gain us? There's no requirement for a BPF "MOV" to match an
arm64 architectural "MOV", so what's the up-side of aligning them like this?

Cheers,

Will


Re: [Freedreno] [PATCH 16/17] iommu: remove DOMAIN_ATTR_IO_PGTABLE_CFG

2021-03-10 Thread Will Deacon
On Wed, Mar 10, 2021 at 09:58:06AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 05, 2021 at 10:00:12AM +0000, Will Deacon wrote:
> > > But one thing I'm not sure about is whether
> > > IO_PGTABLE_QUIRK_ARM_OUTER_WBWA is something that other devices
> > > *should* be using as well, but just haven't gotten around to yet.
> > 
> > The intention is certainly that this would be a place to collate per-domain
> > pgtable quirks, so I'd prefer not to tie that to the GPU.
> 
> So the overall consensus is to just keep this as-is for now?

Yes, please. If it doesn't see wider adoption then we can revisit it.

Will


Re: [Freedreno] [PATCH 16/17] iommu: remove DOMAIN_ATTR_IO_PGTABLE_CFG

2021-03-05 Thread Will Deacon
On Thu, Mar 04, 2021 at 03:11:08PM -0800, Rob Clark wrote:
> On Thu, Mar 4, 2021 at 7:48 AM Robin Murphy  wrote:
> >
> > On 2021-03-01 08:42, Christoph Hellwig wrote:
> > > Signed-off-by: Christoph Hellwig 
> >
> > Moreso than the previous patch, where the feature is at least relatively
> > generic (note that there's a bunch of in-flight development around
> > DOMAIN_ATTR_NESTING), I'm really not convinced that it's beneficial to
> > bloat the generic iommu_ops structure with private driver-specific
> > interfaces. The attribute interface is a great compromise for these
> > kinds of things, and you can easily add type-checked wrappers around it
> > for external callers (maybe even make the actual attributes internal
> > between the IOMMU core and drivers) if that's your concern.
> 
> I suppose if this is *just* for the GPU we could move it into 
> adreno_smmu_priv..
> 
> But one thing I'm not sure about is whether
> IO_PGTABLE_QUIRK_ARM_OUTER_WBWA is something that other devices
> *should* be using as well, but just haven't gotten around to yet.

The intention is certainly that this would be a place to collate per-domain
pgtable quirks, so I'd prefer not to tie that to the GPU.

Will


Re: [PATCH bpf-next 2/2] libbpf, xsk: add libbpf_smp_store_release libbpf_smp_load_acquire

2021-03-03 Thread Will Deacon
On Tue, Mar 02, 2021 at 10:13:21AM +0100, Daniel Borkmann wrote:
> On 3/2/21 9:05 AM, Björn Töpel wrote:
> > On 2021-03-01 17:10, Toke Høiland-Jørgensen wrote:
> > > Björn Töpel  writes:
> > > > From: Björn Töpel 
> > > > 
> > > > Now that the AF_XDP rings have load-acquire/store-release semantics,
> > > > move libbpf to that as well.
> > > > 
> > > > The library-internal libbpf_smp_{load_acquire,store_release} are only
> > > > valid for 32-bit words on ARM64.
> > > > 
> > > > Also, remove the barriers that are no longer in use.
> > > 
> > > So what happens if an updated libbpf is paired with an older kernel (or
> > > vice versa)?
> > 
> > "This is fine." ;-) This was briefly discussed in [1], outlined by the
> > previous commit!
> > 
> > ...even on POWER.
> 
> Could you put a summary or quote of that discussion on 'why it is okay and 
> does not
> cause /forward or backward/ compat issues with user space' directly into 
> patch 1's
> commit message?
> 
> I feel just referring to a link is probably less suitable in this case as it 
> should
> rather be part of the commit message that contains the justification on why 
> it is
> waterproof - at least it feels that specific area may be a bit 
> under-documented, so
> having it as direct part certainly doesn't hurt.
> 
> Would also be great to get Will's ACK on that when you have a v2. :)

Please stick me on CC for that and I'll take a look as I've forgotten pretty
much everything about this since last time :)

Will


Re: [PATCH v17 1/7] arm/arm64: Probe for the presence of KVM hypervisor

2021-02-05 Thread Will Deacon
On Fri, Feb 05, 2021 at 04:50:27PM +, Marc Zyngier wrote:
> On 2021-02-05 11:19, Will Deacon wrote:
> > On Fri, Feb 05, 2021 at 09:11:00AM +, Steven Price wrote:
> > > On 02/02/2021 14:11, Marc Zyngier wrote:
> > > > +   for (i = 0; i < 32; ++i) {
> > > > +   if (res.a0 & (i))
> > > > +   set_bit(i + (32 * 0), __kvm_arm_hyp_services);
> > > > +   if (res.a1 & (i))
> > > > +   set_bit(i + (32 * 1), __kvm_arm_hyp_services);
> > > > +   if (res.a2 & (i))
> > > > +   set_bit(i + (32 * 2), __kvm_arm_hyp_services);
> > > > +   if (res.a3 & (i))
> > > > +   set_bit(i + (32 * 3), __kvm_arm_hyp_services);
> > > 
> > > The bit shifts are missing, the tests should be of the form:
> > > 
> > >   if (res.a0 & (1 << i))
> > > 
> > > Or indeed using a BIT() macro.
> > 
> > Maybe even test_bit()?
> 
> Actually, maybe not doing things a-bit-at-a-time is less error prone.
> See below what I intend to fold in.
> 
> Thanks,
> 
> M.
> 
> diff --git a/drivers/firmware/smccc/kvm_guest.c
> b/drivers/firmware/smccc/kvm_guest.c
> index 00bf3c7969fc..08836f2f39ee 100644
> --- a/drivers/firmware/smccc/kvm_guest.c
> +++ b/drivers/firmware/smccc/kvm_guest.c
> @@ -2,8 +2,8 @@
> 
>  #define pr_fmt(fmt) "smccc: KVM: " fmt
> 
> -#include 
>  #include 
> +#include 
>  #include 
>  #include 
> 
> @@ -13,8 +13,8 @@ static DECLARE_BITMAP(__kvm_arm_hyp_services,
> ARM_SMCCC_KVM_NUM_FUNCS) __ro_afte
> 
>  void __init kvm_init_hyp_services(void)
>  {
> - int i;
>   struct arm_smccc_res res;
> + u32 val[4];
> 
>   if (arm_smccc_1_1_get_conduit() != SMCCC_CONDUIT_HVC)
>   return;
> @@ -28,16 +28,13 @@ void __init kvm_init_hyp_services(void)
> 
>   memset(&res, 0, sizeof(res));
>   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID, &res);
> - for (i = 0; i < 32; ++i) {
> - if (res.a0 & (i))
> - set_bit(i + (32 * 0), __kvm_arm_hyp_services);
> - if (res.a1 & (i))
> - set_bit(i + (32 * 1), __kvm_arm_hyp_services);
> - if (res.a2 & (i))
> - set_bit(i + (32 * 2), __kvm_arm_hyp_services);
> - if (res.a3 & (i))
> - set_bit(i + (32 * 3), __kvm_arm_hyp_services);
> - }
> +
> + val[0] = lower_32_bits(res.a0);
> + val[1] = lower_32_bits(res.a1);
> + val[2] = lower_32_bits(res.a2);
> + val[3] = lower_32_bits(res.a3);
> +
> + bitmap_from_arr32(__kvm_arm_hyp_services, val, ARM_SMCCC_KVM_NUM_FUNCS);

Nice! That's loads better.

Will


Re: [PATCH v17 1/7] arm/arm64: Probe for the presence of KVM hypervisor

2021-02-05 Thread Will Deacon
On Fri, Feb 05, 2021 at 09:11:00AM +, Steven Price wrote:
> On 02/02/2021 14:11, Marc Zyngier wrote:
> > diff --git a/drivers/firmware/smccc/kvm_guest.c 
> > b/drivers/firmware/smccc/kvm_guest.c
> > new file mode 100644
> > index ..23ce1ded88b4
> > --- /dev/null
> > +++ b/drivers/firmware/smccc/kvm_guest.c
> > @@ -0,0 +1,51 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +#define pr_fmt(fmt) "smccc: KVM: " fmt
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +static DECLARE_BITMAP(__kvm_arm_hyp_services, ARM_SMCCC_KVM_NUM_FUNCS) 
> > __ro_after_init = { };
> > +
> > +void __init kvm_init_hyp_services(void)
> > +{
> > +   int i;
> > +   struct arm_smccc_res res;
> > +
> > +   if (arm_smccc_1_1_get_conduit() != SMCCC_CONDUIT_HVC)
> > +   return;
> > +
> > +   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
> > +   if (res.a0 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_0 ||
> > +   res.a1 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_1 ||
> > +   res.a2 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_2 ||
> > +   res.a3 != ARM_SMCCC_VENDOR_HYP_UID_KVM_REG_3)
> > +   return;
> > +
> > +   memset(&res, 0, sizeof(res));
> > +   arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_KVM_FEATURES_FUNC_ID, &res);
> > +   for (i = 0; i < 32; ++i) {
> > +   if (res.a0 & (i))
> > +   set_bit(i + (32 * 0), __kvm_arm_hyp_services);
> > +   if (res.a1 & (i))
> > +   set_bit(i + (32 * 1), __kvm_arm_hyp_services);
> > +   if (res.a2 & (i))
> > +   set_bit(i + (32 * 2), __kvm_arm_hyp_services);
> > +   if (res.a3 & (i))
> > +   set_bit(i + (32 * 3), __kvm_arm_hyp_services);
> 
> The bit shifts are missing, the tests should be of the form:
> 
>   if (res.a0 & (1 << i))
> 
> Or indeed using a BIT() macro.

Maybe even test_bit()?
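
Just as a sketch of what I mean (not tested), treating the returned words
as a local bitmap:

	unsigned long vals[] = { res.a0, res.a1, res.a2, res.a3 };

	for (i = 0; i < 4 * 32; ++i)
		if (test_bit(i % 32, &vals[i / 32]))
			set_bit(i, __kvm_arm_hyp_services);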

Will


Re: tools/bpf: Compilation issue on powerpc: unknown type name '__vector128'

2020-10-26 Thread Will Deacon
On Mon, Oct 26, 2020 at 03:45:55PM +1100, Michael Ellerman wrote:
> Vitaly Chikunov  writes:
> > Adding netdev and PowerPC maintainers JFYI.
> 
> Thanks.
> 
> > On Sat, Oct 24, 2020 at 11:23:19AM +0300, Dmitry V. Levin wrote:
> >> Hi,
> >> 
> >> On Sat, Oct 24, 2020 at 02:06:41AM +0300, Vitaly Chikunov wrote:
> >> > Hi,
> >> > 
> >> > Commit f143c11bb7b9 ("tools: bpf: Use local copy of headers including
> >> > uapi/linux/filter.h") introduces compilation issue on powerpc:
> >> >  
> >> >   builder@powerpc64le:~/linux$ make -C tools/bpf V=1
> >> >   make: Entering directory '/usr/src/linux/tools/bpf'
> >> >   gcc -Wall -O2 -D__EXPORTED_HEADERS__ 
> >> > -I/usr/src/linux/tools/include/uapi -I/usr/src/linux/tools/include 
> >> > -DDISASM_FOUR_ARGS_SIGNATURE -c -o bpf_dbg.o 
> >> > /usr/src/linux/tools/bpf/bpf_dbg.c
> 
> Defining __EXPORTED_HEADERS__ is a hack to circumvent the checks in the
> uapi headers.
> 
> So first comment is to stop doing that, although it doesn't actually fix
> this issue.
> 
> >> >   In file included from /usr/include/asm/sigcontext.h:14,
> >> > from /usr/include/bits/sigcontext.h:30,
> >> > from /usr/include/signal.h:291,
> >> > from /usr/src/linux/tools/bpf/bpf_dbg.c:51:
> >> >   /usr/include/asm/elf.h:160:9: error: unknown type name '__vector128'
> >> > 160 | typedef __vector128 elf_vrreg_t;
> >> >  | ^~~
> >> >   make: *** [Makefile:67: bpf_dbg.o] Error 1
> >> >   make: Leaving directory '/usr/src/linux/tools/bpf'
> >> 
> >> __vector128 is defined in arch/powerpc/include/uapi/asm/types.h;
> >> while include/uapi/linux/types.h does #include ,
> >> tools/include/uapi/linux/types.h doesn't, resulting to this
> >> compilation error.
> >
> > This is too puzzling to fix portably.
> 
> I don't really understand how this is expected to work.
> 
> We have tools/include/uapi/linux/types.h which is some sort of hand
> hacked types.h, but doesn't match the real types.h from
> include/uapi/linux.
> 
> In particular the tools/include types.h doesn't include asm/types.h,
> which is why this breaks.
> 
> I can build bpf_dbg if I copy the properly exported header in:
> 
>   $ make INSTALL_HDR_PATH=$PWD/headers headers_install
>   $ cp headers/include/linux/types.h tools/include/uapi/linux/
>   $ make -C tools/bpf bpf_dbg
>   make: Entering directory '/home/michael/linux/tools/bpf'
>   
>   Auto-detecting system features:
>   ...libbfd: [ on  ]
>   ...disassembler-four-args: [ on  ]
>   
> CC   bpf_dbg.o
> LINK bpf_dbg
>   make: Leaving directory '/home/michael/linux/tools/bpf
> 
> 
> I'm not sure what the proper fix is.
> 
> Maybe sync the tools/include types.h with the real one?

Yeah, all f143c11bb7b9 did was sync the local copy with the result of
'headers_install', so if that's sufficient then I think that's the right
way to fix the immediate breakage.

> Or TBH I would have thought the best option is to not have
> tools/include/uapi at all, but instead just run headers_install before
> building and use the properly exported headers.

Agreed, although in some cases I suspect that kernel-internal parts are
being used.

Will


Re: WARNING in sta_info_alloc

2020-10-06 Thread Will Deacon
On Tue, Oct 06, 2020 at 01:08:23AM -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:549738f1 Linux 5.9-rc8
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15b97ba390
> kernel config:  https://syzkaller.appspot.com/x/.config?x=c06bcf3cc963d91c
> dashboard link: https://syzkaller.appspot.com/bug?extid=45d7c243c006f39dc55a
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12bae9c050
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1099b1c050
> 
> The issue was bisected to:
> 
> commit 643c332d519bdfbf80d21f40d1c0aa0ccf3ec1cb
> Author: Zi Shen Lim 
> Date:   Thu Jun 9 04:18:50 2016 +
> 
> arm64: bpf: optimize LD_ABS, LD_IND
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=11d4447790
> final oops: https://syzkaller.appspot.com/x/report.txt?x=13d4447790
> console output: https://syzkaller.appspot.com/x/log.txt?x=15d4447790
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+45d7c243c006f39dc...@syzkaller.appspotmail.com
> Fixes: 643c332d519b ("arm64: bpf: optimize LD_ABS, LD_IND")
> 
> [ cut here ]
> WARNING: CPU: 0 PID: 6879 at net/mac80211/ieee80211_i.h:1447 
> ieee80211_get_sband net/mac80211/ieee80211_i.h:1447 [inline]
> WARNING: CPU: 0 PID: 6879 at net/mac80211/ieee80211_i.h:1447 
> sta_info_alloc+0x1900/0x1f90 net/mac80211/sta_info.c:469
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 0 PID: 6879 Comm: syz-executor071 Not tainted 5.9.0-rc8-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x198/0x1fd lib/dump_stack.c:118
>  panic+0x382/0x7fb kernel/panic.c:231
>  __warn.cold+0x20/0x4b kernel/panic.c:600
>  report_bug+0x1bd/0x210 lib/bug.c:198
>  handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234
>  exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254
>  asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536
> RIP: 0010:ieee80211_get_sband net/mac80211/ieee80211_i.h:1447 [inline]

The patch fingered by the bisection only affects arm64, but this is an x86
box. So this is clearly bogus.

Will


Re: [PATCH v3] arm64: bpf: Fix branch offset in JIT

2020-09-17 Thread Will Deacon
On Thu, Sep 17, 2020 at 11:49:25AM +0300, Ilias Apalodimas wrote:
> Running the eBPF test_verifier leads to random errors looking like this:
> 
> [ 6525.735488] Unexpected kernel BRK exception at EL1
> [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> [ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver 
> fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd 
> cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 
> sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs 
> blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic 
> ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore 
> nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy 
> libphy gpio_mb86s7x
> [ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW   
>   5.9.0-rc1+ #47
> [ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS 
> build #1 Jun  6 2020
> [ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
> [ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> [ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> [ 6525.820832] sp : 8000130cbb80
> [ 6525.824141] x29: 8000130cbbb0 x28: 
> [ 6525.829451] x27: 05ef6fcbf39b x26: 
> [ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
> [ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
> [ 6525.845375] x21: 0001 x20: 800011dc7000
> [ 6525.850682] x19:  x18: 
> [ 6525.855990] x17:  x16: 
> [ 6525.861298] x15:  x14: 
> [ 6525.866606] x13:  x12: 
> [ 6525.871913] x11: 0001 x10: 800a660c
> [ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
> [ 6525.882528] x7 :  x6 : 009864cfa881
> [ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
> [ 6525.893144] x3 : 0018 x2 : 800a4374
> [ 6525.898452] x1 : 000a x0 : 0009
> [ 6525.903760] Call trace:
> [ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> [ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> [ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
> [ 6525.920398]  bpf_test_run+0x70/0x1b0
> [ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
> [ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
> [ 6525.932072]  __arm64_sys_bpf+0x24/0x30
> [ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
> [ 6525.940607]  do_el0_svc+0x28/0x88
> [ 6525.943920]  el0_sync_handler+0x88/0x190
> [ 6525.947838]  el0_sync+0x140/0x180
> [ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> [ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
> 
> The reason is the offset[] creation and later usage, while building
> the eBPF body. The code currently omits the first instruction, since
> build_insn() will increase our ctx->idx before saving it.
> That was fine up until bounded eBPF loops were introduced. After that
> introduction, offset[0] must be the offset of the end of prologue which
> is the start of the 1st insn while, offset[n] holds the
> offset of the end of n-th insn.
> 
> When "taken loop with back jump to 1st insn" test runs, it will
> eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
> permitted, the current outcome depends on the value stored in
> ctx->offset[-1], which has nothing to do with our array.
> If the value happens to be 0 the tests will work. If not this error
> triggers.
> 
> commit 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st insn")
> fixed an indentical bug on x86 when eBPF bounded loops were introduced.
> 
> So let's fix it by creating the ctx->offset[] differently. Track the
> beginning of instruction and account for the extra instruction while
> calculating the arm instruction offsets.
> 
> Fixes: 2589726d12a1 ("bpf: introduce bounded loops")
> Reported-by: Naresh Kamboju 
> Reported-by: Jiri Olsa 
> Co-developed-by: Jean-Philippe Brucker 
> Signed-off-by: Jean-Philippe Brucker 
> Co-developed-by: Yauheni Kaliuta 
> Signed-off-by: Yauheni Kaliuta 
> Signed-off-by: Ilias Apalodimas 

Acked-by: Will Deacon 

Catalin -- do you want to take this as a fix?

Will


Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-15 Thread Will Deacon
On Tue, Sep 15, 2020 at 04:53:44PM +0300, Ilias Apalodimas wrote:
> On Tue, Sep 15, 2020 at 02:11:03PM +0100, Will Deacon wrote:
> > Hi Ilias,
> > 
> > On Mon, Sep 14, 2020 at 07:03:55PM +0300, Ilias Apalodimas wrote:
> > > Running the eBPF test_verifier leads to random errors looking like this:
> > > 
> > > [ 6525.735488] Unexpected kernel BRK exception at EL1
> > > [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> > 
> > Does this happen because we poison the BPF memory with BRK instructions?
> > Maybe we should look at using a special immediate so we can detect this,
> > rather than end up in the ptrace handler.
> 
> As discussed offline this is what aarch64_insn_gen_branch_imm() will return 
> for
> offsets > 128M and yes replacing the handler with a more suitable message 
> would 
> be good.

Can you give the diff below a shot, please? Hopefully printing a more useful
message will mean these things get triaged/debugged better in future.

Will

--->8

diff --git a/arch/arm64/include/asm/extable.h b/arch/arm64/include/asm/extable.h
index 840a35ed92ec..b15eb4a3e6b2 100644
--- a/arch/arm64/include/asm/extable.h
+++ b/arch/arm64/include/asm/extable.h
@@ -22,6 +22,15 @@ struct exception_table_entry
 
 #define ARCH_HAS_RELATIVE_EXTABLE
 
+static inline bool in_bpf_jit(struct pt_regs *regs)
+{
+   if (!IS_ENABLED(CONFIG_BPF_JIT))
+   return false;
+
+   return regs->pc >= BPF_JIT_REGION_START &&
+  regs->pc < BPF_JIT_REGION_END;
+}
+
 #ifdef CONFIG_BPF_JIT
 int arm64_bpf_fixup_exception(const struct exception_table_entry *ex,
  struct pt_regs *regs);
diff --git a/arch/arm64/kernel/debug-monitors.c b/arch/arm64/kernel/debug-monitors.c
index 7310a4f7f993..fa76151de6ff 100644
--- a/arch/arm64/kernel/debug-monitors.c
+++ b/arch/arm64/kernel/debug-monitors.c
@@ -384,7 +384,7 @@ void __init debug_traps_init(void)
hook_debug_fault_code(DBG_ESR_EVT_HWSS, single_step_handler, SIGTRAP,
  TRAP_TRACE, "single-step handler");
hook_debug_fault_code(DBG_ESR_EVT_BRK, brk_handler, SIGTRAP,
- TRAP_BRKPT, "ptrace BRK handler");
+ TRAP_BRKPT, "BRK handler");
 }
 
 /* Re-enable single step for syscall restarting. */
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index 13ebd5ca2070..9f7fde99eda2 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -994,6 +995,21 @@ static struct break_hook bug_break_hook = {
.imm = BUG_BRK_IMM,
 };
 
+static int reserved_fault_handler(struct pt_regs *regs, unsigned int esr)
+{
+   pr_err("%s generated an invalid instruction at %pS!\n",
+   in_bpf_jit(regs) ? "BPF JIT" : "Kernel runtime patching",
+   instruction_pointer(regs));
+
+   /* We cannot handle this */
+   return DBG_HOOK_ERROR;
+}
+
+static struct break_hook fault_break_hook = {
+   .fn = reserved_fault_handler,
+   .imm = FAULT_BRK_IMM,
+};
+
 #ifdef CONFIG_KASAN_SW_TAGS
 
 #define KASAN_ESR_RECOVER  0x20
@@ -1059,6 +1075,7 @@ int __init early_brk64(unsigned long addr, unsigned int esr,
 void __init trap_init(void)
 {
register_kernel_break_hook(&bug_break_hook);
+   register_kernel_break_hook(&fault_break_hook);
 #ifdef CONFIG_KASAN_SW_TAGS
register_kernel_break_hook(&kasan_break_hook);
 #endif
diff --git a/arch/arm64/mm/extable.c b/arch/arm64/mm/extable.c
index eee1732ab6cd..aa0060178343 100644
--- a/arch/arm64/mm/extable.c
+++ b/arch/arm64/mm/extable.c
@@ -14,9 +14,7 @@ int fixup_exception(struct pt_regs *regs)
if (!fixup)
return 0;
 
-   if (IS_ENABLED(CONFIG_BPF_JIT) &&
-   regs->pc >= BPF_JIT_REGION_START &&
-   regs->pc < BPF_JIT_REGION_END)
+   if (in_bpf_jit(regs))
return arm64_bpf_fixup_exception(fixup, regs);
 
regs->pc = (unsigned long)&fixup->fixup + fixup->fixup;


Re: [PATCH v2] arm64: bpf: Fix branch offset in JIT

2020-09-15 Thread Will Deacon
Hi Ilias,

On Mon, Sep 14, 2020 at 07:03:55PM +0300, Ilias Apalodimas wrote:
> Running the eBPF test_verifier leads to random errors looking like this:
> 
> [ 6525.735488] Unexpected kernel BRK exception at EL1
> [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP

Does this happen because we poison the BPF memory with BRK instructions?
Maybe we should look at using a special immediate so we can detect this,
rather than end up in the ptrace handler.

> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index f8912e45be7a..0974e58c 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -143,9 +143,13 @@ static inline void emit_addr_mov_i64(const int reg, 
> const u64 val,
>   }
>  }
>  
> -static inline int bpf2a64_offset(int bpf_to, int bpf_from,
> +static inline int bpf2a64_offset(int bpf_insn, int off,
>const struct jit_ctx *ctx)
>  {
> + /* arm64 offset is relative to the branch instruction */
> + int bpf_from = bpf_insn + 1;
> + /* BPF JMP offset is relative to the next instruction */
> + int bpf_to = bpf_insn + off + 1;
>   int to = ctx->offset[bpf_to];
>   /* -1 to account for the Branch instruction */
>   int from = ctx->offset[bpf_from] - 1;

I think this is a bit confusing with all the variables. How about just
doing:

/* BPF JMP offset is relative to the next BPF instruction */
bpf_insn++;

/*
 * Whereas arm64 branch instructions encode the offset from the
 * branch itself, so we must subtract 1 from the instruction offset.
 */
return ctx->offset[bpf_insn + off] - ctx->offset[bpf_insn] - 1;

> @@ -642,7 +646,7 @@ static int build_insn(const struct bpf_insn *insn, struct 
> jit_ctx *ctx,
>  
>   /* JUMP off */
>   case BPF_JMP | BPF_JA:
> - jmp_offset = bpf2a64_offset(i + off, i, ctx);
> + jmp_offset = bpf2a64_offset(i, off, ctx);
>   check_imm26(jmp_offset);
>   emit(A64_B(jmp_offset), ctx);
>   break;
> @@ -669,7 +673,7 @@ static int build_insn(const struct bpf_insn *insn, struct 
> jit_ctx *ctx,
>   case BPF_JMP32 | BPF_JSLE | BPF_X:
>   emit(A64_CMP(is64, dst, src), ctx);
>  emit_cond_jmp:
> - jmp_offset = bpf2a64_offset(i + off, i, ctx);
> + jmp_offset = bpf2a64_offset(i, off, ctx);
>   check_imm19(jmp_offset);
>   switch (BPF_OP(code)) {
>   case BPF_JEQ:
> @@ -912,18 +916,26 @@ static int build_body(struct jit_ctx *ctx, bool 
> extra_pass)
>   const struct bpf_insn *insn = &prog->insnsi[i];
>   int ret;
>  
> + /*
> +  * offset[0] offset of the end of prologue, start of the
> +  * first insn.
> +  * offset[x] - offset of the end of x insn.

So does offset[1] point at the last arm64 instruction for the first BPF
instruction, or does it point to the first arm64 instruction for the second
BPF instruction?

> +  */
> + if (ctx->image == NULL)
> + ctx->offset[i] = ctx->idx;
> +
>   ret = build_insn(insn, ctx, extra_pass);
>   if (ret > 0) {
>   i++;
>   if (ctx->image == NULL)
> - ctx->offset[i] = ctx->idx;
> + ctx->offset[i] = ctx->offset[i - 1];

Does it matter that we set the offset for both halves of a 16-byte BPF
instruction? I think that's a change in behaviour here.

>   continue;
>   }
> - if (ctx->image == NULL)
> - ctx->offset[i] = ctx->idx;
>   if (ret)
>   return ret;
>   }
> + if (ctx->image == NULL)
> + ctx->offset[i] = ctx->idx;

I think it would be clearer to set ctx->offset[0] before the for loop (with
a comment about what it is) and then change the for loop to iterate from 1
all the way to prog->len.
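
i.e. something like this (not even compile-tested, and assuming offset[]
is allocated with prog->len + 1 entries):

	if (ctx->image == NULL)
		ctx->offset[0] = ctx->idx;	/* end of prologue == start of 1st insn */

	for (i = 1; i <= prog->len; i++) {
		int ret = build_insn(&prog->insnsi[i - 1], ctx, extra_pass);

		if (ret < 0)
			return ret;
		if (ctx->image == NULL)
			ctx->offset[i] = ctx->idx;	/* end of insn i - 1 */
		if (ret > 0) {				/* ld_imm64 takes two slots */
			i++;
			if (ctx->image == NULL)
				ctx->offset[i] = ctx->idx;
		}
	}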

Will


Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Will Deacon
Hi Ilias,

On Mon, Sep 14, 2020 at 04:23:50PM +0300, Ilias Apalodimas wrote:
> On Mon, Sep 14, 2020 at 03:35:04PM +0300, Ilias Apalodimas wrote:
> > On Mon, Sep 14, 2020 at 01:20:43PM +0100, Will Deacon wrote:
> > > On Mon, Sep 14, 2020 at 11:36:21AM +0300, Ilias Apalodimas wrote:
> > > > Running the eBPF test_verifier leads to random errors looking like this:

[...]

> > > > The reason seems to be the offset[] creation and usage ctx->offset[]
> > > 
> > > "seems to be"? Are you unsure?
> > 
> > Reading the history and other ports of the JIT implementation, I couldn't 
> > tell if the decision to skip the 1st entry was deliberate or not on 
> > AArch64. Reading through the mailing list didn't help either [1].
> > Skipping the 1st entry does indeed seem to cause the problem.
> > I did run the patch through the BPF tests and it showed no regressions + 
> > fixed the error.
> 
> I'll correct myself here.
> Looking into 7c2e988f400e ("bpf: fix x64 JIT code generation for jmp to 1st 
> insn")
> explains things a bit better.
> Jumping back to the 1st insn wasn't allowed until eBPF bounded loops were 
> introduced. That's why the 1st instruction was not saved in the original 
> code.
> 
> > > 
> > > No Fixes: tag?
> > 
> > I'll re-spin and apply one.
> > 
> Any suggestion on any Fixes I should apply? The original code was 'correct' 
> and
> broke only when bounded loops and their self-tests were introduced.

Ouch, that's pretty bad as it means nobody is regression testing BPF on
arm64 with mainline. Damn.

The Fixes: tag should identify the commit beyond which we don't need to
backport the fix, so it sounds like introduction of bounded loops, according
to your analysis.

Will


Re: [PATCH] arm64: bpf: Fix branch offset in JIT

2020-09-14 Thread Will Deacon
On Mon, Sep 14, 2020 at 11:36:21AM +0300, Ilias Apalodimas wrote:
> Running the eBPF test_verifier leads to random errors looking like this:
> 
> [ 6525.735488] Unexpected kernel BRK exception at EL1
> [ 6525.735502] Internal error: ptrace BRK handler: f2000100 [#1] SMP
> [ 6525.741609] Modules linked in: nls_utf8 cifs libdes libarc4 dns_resolver 
> fscache binfmt_misc nls_ascii nls_cp437 vfat fat aes_ce_blk crypto_simd 
> cryptd aes_ce_cipher ghash_ce gf128mul efi_pstore sha2_ce sha256_arm64 
> sha1_ce evdev efivars efivarfs ip_tables x_tables autofs4 btrfs 
> blake2b_generic xor xor_neon zstd_compress raid6_pq libcrc32c crc32c_generic 
> ahci xhci_pci libahci xhci_hcd igb libata i2c_algo_bit nvme realtek usbcore 
> nvme_core scsi_mod t10_pi netsec mdio_devres of_mdio gpio_keys fixed_phy 
> libphy gpio_mb86s7x
> [ 6525.787760] CPU: 3 PID: 7881 Comm: test_verifier Tainted: GW   
>   5.9.0-rc1+ #47
> [ 6525.796111] Hardware name: Socionext SynQuacer E-series DeveloperBox, BIOS 
> build #1 Jun  6 2020
> [ 6525.804812] pstate: 2005 (nzCv daif -PAN -UAO BTYPE=--)
> [ 6525.810390] pc : bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> [ 6525.815613] lr : bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> [ 6525.820832] sp : 8000130cbb80
> [ 6525.824141] x29: 8000130cbbb0 x28: 
> [ 6525.829451] x27: 05ef6fcbf39b x26: 
> [ 6525.834759] x25: 8000130cbb80 x24: 800011dc7038
> [ 6525.840067] x23: 8000130cbd00 x22: 0008f624d080
> [ 6525.845375] x21: 0001 x20: 800011dc7000
> [ 6525.850682] x19:  x18: 
> [ 6525.855990] x17:  x16: 
> [ 6525.861298] x15:  x14: 
> [ 6525.866606] x13:  x12: 
> [ 6525.871913] x11: 0001 x10: 800a660c
> [ 6525.877220] x9 : 800010951810 x8 : 8000130cbc38
> [ 6525.882528] x7 :  x6 : 009864cfa881
> [ 6525.887836] x5 : 00ff x4 : 002880ba1a0b3e9f
> [ 6525.893144] x3 : 0018 x2 : 800a4374
> [ 6525.898452] x1 : 000a x0 : 0009
> [ 6525.903760] Call trace:
> [ 6525.906202]  bpf_prog_c3d01833289b6311_F+0xc8/0x9f4
> [ 6525.911076]  bpf_prog_d53bb52e3f4483f9_F+0x38/0xc8c
> [ 6525.915957]  bpf_dispatcher_xdp_func+0x14/0x20
> [ 6525.920398]  bpf_test_run+0x70/0x1b0
> [ 6525.923969]  bpf_prog_test_run_xdp+0xec/0x190
> [ 6525.928326]  __do_sys_bpf+0xc88/0x1b28
> [ 6525.932072]  __arm64_sys_bpf+0x24/0x30
> [ 6525.935820]  el0_svc_common.constprop.0+0x70/0x168
> [ 6525.940607]  do_el0_svc+0x28/0x88
> [ 6525.943920]  el0_sync_handler+0x88/0x190
> [ 6525.947838]  el0_sync+0x140/0x180
> [ 6525.951154] Code: d4202000 d4202000 d4202000 d4202000 (d4202000)
> [ 6525.957249] ---[ end trace cecc3f93b14927e2 ]---
> 
> The reason seems to be the offset[] creation and the usage of ctx->offset[]

"seems to be"? Are you unsure?

> while building the eBPF body.  The code currently omits the first 
> instruction, since build_insn() will increase our ctx->idx before saving 
> it.  When "taken loop with back jump to 1st insn" test runs it will
> eventually call bpf2a64_offset(-1, 2, ctx). Since negative indexing is
> permitted, the current outcome depends on the value stored in
> ctx->offset[-1], which has nothing to do with our array.
> If the value happens to be 0 the tests will work. If not this error
> triggers.
> 
> So let's fix it by creating the ctx->offset[] correctly in the first
> place and accounting for the extra instruction while calculating the arm
> instruction offsets.

No Fixes: tag?

> Signed-off-by: Ilias Apalodimas 
> Signed-off-by: Jean-Philippe Brucker 
> Signed-off-by: Yauheni Kaliuta 

Non-author signoffs here. What's going on?

Will


Re: [PATCH v2 1/1] netfilter: nat: add a range check for l3/l4 protonum

2020-09-01 Thread Will Deacon
Hi Will, Pablo,

On Tue, Aug 04, 2020 at 01:37:11PM +0200, Pablo Neira Ayuso wrote:
> This patch is much smaller and if you confirm this is address the
> issue, then this is awesome.

Did that ever get confirmed? AFAICT, nothing ended up landing in the stable
trees for this.

Cheers,

Will


> On Mon, Aug 03, 2020 at 06:31:56PM +, William Mcvicker wrote:
> [...]
> > diff --git a/net/netfilter/nf_conntrack_netlink.c 
> > b/net/netfilter/nf_conntrack_netlink.c
> > index 31fa94064a62..56d310f8b29a 100644
> > --- a/net/netfilter/nf_conntrack_netlink.c
> > +++ b/net/netfilter/nf_conntrack_netlink.c
> > @@ -1129,6 +1129,8 @@ ctnetlink_parse_tuple(const struct nlattr * const 
> > cda[],
> > if (!tb[CTA_TUPLE_IP])
> > return -EINVAL;
> >  
> > +   if (l3num >= NFPROTO_NUMPROTO)
> > +   return -EINVAL;
> 
> l3num can only be either NFPROTO_IPV4 or NFPROTO_IPV6.
> 
> Other than that, bail out with EOPNOTSUPP.
> 
> Thank you.


Re: [PATCH v13 2/9] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

2020-08-20 Thread Will Deacon
On Tue, Jul 28, 2020 at 01:07:14AM +, Jianyong Wu wrote:
> 
> 
> > -Original Message-
> > From: Will Deacon 
> > Sent: Monday, July 27, 2020 7:38 PM
> > To: Jianyong Wu 
> > Cc: netdev@vger.kernel.org; yangbo...@nxp.com; john.stu...@linaro.org;
> > t...@linutronix.de; pbonz...@redhat.com; sean.j.christopher...@intel.com;
> > m...@kernel.org; richardcoch...@gmail.com; Mark Rutland
> > ; Suzuki Poulose ;
> > Steven Price ; linux-ker...@vger.kernel.org; linux-
> > arm-ker...@lists.infradead.org; kvm...@lists.cs.columbia.edu;
> > k...@vger.kernel.org; Steve Capper ; Kaly Xin
> > ; Justin He ; Wei Chen
> > ; nd 
> > Subject: Re: [PATCH v13 2/9] arm/arm64: KVM: Advertise KVM UID to guests
> > via SMCCC
> > 
> > On Mon, Jul 27, 2020 at 03:45:37AM +, Jianyong Wu wrote:
> > > > From: Will Deacon 
> > > >
> > > > We can advertise ourselves to guests as KVM and provide a basic
> > > > features bitmap for discoverability of future hypervisor services.
> > > >
> > > > Cc: Marc Zyngier 
> > > > Signed-off-by: Will Deacon 
> > > > Signed-off-by: Jianyong Wu 
> > > > ---
> > > >  arch/arm64/kvm/hypercalls.c | 29 +++--
> > > >  1 file changed, 19 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/arch/arm64/kvm/hypercalls.c
> > > > b/arch/arm64/kvm/hypercalls.c index 550dfa3e53cd..db6dce3d0e23
> > > > 100644
> > > > --- a/arch/arm64/kvm/hypercalls.c
> > > > +++ b/arch/arm64/kvm/hypercalls.c
> > > > @@ -12,13 +12,13 @@
> > > >  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)  {
> > > > u32 func_id = smccc_get_function(vcpu);
> > > > -   long val = SMCCC_RET_NOT_SUPPORTED;
> > > > +   u32 val[4] = {SMCCC_RET_NOT_SUPPORTED};
> > >
> > > There is a risk as this u32 value will be returned here and a u64 value
> > > will be obtained in the guest. For example, val[0] is initialized as
> > > -1 of 0x and the guest gets 0x then it will be compared
> > > with -1 of 0x Also this problem exists for the
> > > transfer of addresses in u64 type. So the following assignment to "val"
> > > should be split into two
> > > u32 values and assigned to val[0] and val[1] respectively.
> > > WDYT?
> > 
> > Yes, I think you're right that this is a bug, but isn't the solution just 
> > to make
> > that an array of 'long'?
> > 
> > long val [4];
> > 
> > That will sign-extend the negative error codes as required, while leaving 
> > the
> > explicitly unsigned UID constants alone.
> 
> Ok, that's much better. I will fix it at next version.
> 
> By the way, I wonder when will you update this patch set. I see someone like 
> me
> adopt this patch set as code base and need rebase it every time, so expect 
> your update.

I'm not working on it, so please feel free to include it along with the
patches that add an upstream user.

Will


Re: [PATCH v13 2/9] arm/arm64: KVM: Advertise KVM UID to guests via SMCCC

2020-07-27 Thread Will Deacon
On Mon, Jul 27, 2020 at 03:45:37AM +, Jianyong Wu wrote:
> > From: Will Deacon 
> > 
> > We can advertise ourselves to guests as KVM and provide a basic features
> > bitmap for discoverability of future hypervisor services.
> > 
> > Cc: Marc Zyngier 
> > Signed-off-by: Will Deacon 
> > Signed-off-by: Jianyong Wu 
> > ---
> >  arch/arm64/kvm/hypercalls.c | 29 +++--
> >  1 file changed, 19 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/hypercalls.c b/arch/arm64/kvm/hypercalls.c
> > index 550dfa3e53cd..db6dce3d0e23 100644
> > --- a/arch/arm64/kvm/hypercalls.c
> > +++ b/arch/arm64/kvm/hypercalls.c
> > @@ -12,13 +12,13 @@
> >  int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)  {
> > u32 func_id = smccc_get_function(vcpu);
> > -   long val = SMCCC_RET_NOT_SUPPORTED;
> > +   u32 val[4] = {SMCCC_RET_NOT_SUPPORTED};
> 
> There is a risk as this u32 value will be returned here and a u64 value will be
> obtained in the guest. For example, val[0] is initialized as -1 of
> 0x and the guest gets 0x then it will be compared with -1
> of 0x Also this problem exists for the transfer of addresses
> in u64 type. So the following assignment to "val" should be split into two
> u32 values and assigned to val[0] and val[1] respectively.
> WDYT?

Yes, I think you're right that this is a bug, but isn't the solution just
to make that an array of 'long'?

long val [4];

That will sign-extend the negative error codes as required, while leaving
the explicitly unsigned UID constants alone.
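
To illustrate (purely a sketch; the values assume SMCCC_RET_NOT_SUPPORTED is
-1, as it is today):

        u32  val32 = SMCCC_RET_NOT_SUPPORTED;   /* stored as 0xffffffff */
        long val64 = SMCCC_RET_NOT_SUPPORTED;   /* sign-extends to all-ones */

        u64 x0_from_u32  = val32;       /* guest reads 0x00000000ffffffff */
        u64 x0_from_long = val64;       /* guest reads 0xffffffffffffffff, i.e. -1 */

Only the 'long' version survives the trip into a 64-bit guest register as a
recognisable error code.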

Will


Re: WARNING in dev_change_net_namespace

2020-06-08 Thread Will Deacon
On Sat, Jun 06, 2020 at 07:03:03AM -0700, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 13dc4d836179444f0ca90188cfccd23f9cd9ff05
> Author: Will Deacon 
> Date:   Tue Apr 21 14:29:18 2020 +
> 
> arm64: cpufeature: Remove redundant call to id_aa64pfr0_32bit_el0()
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=109aa3b110
> start commit:   7ae77150 Merge tag 'powerpc-5.8-1' of git://git.kernel.org..
> git tree:   upstream
> final crash:https://syzkaller.appspot.com/x/report.txt?x=129aa3b110
> console output: https://syzkaller.appspot.com/x/log.txt?x=149aa3b110
> kernel config:  https://syzkaller.appspot.com/x/.config?x=be4578b3f1083656
> dashboard link: https://syzkaller.appspot.com/bug?extid=830c6dbfc71edc4f0b8f
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1203283210
> 
> Reported-by: syzbot+830c6dbfc71edc4f0...@syzkaller.appspotmail.com
> Fixes: 13dc4d836179 ("arm64: cpufeature: Remove redundant call to 
> id_aa64pfr0_32bit_el0()")


Yeah... I doubt that very much.

Will


Re: [PATCH bpf-next v2 0/3] arm64 BPF JIT Optimizations

2020-05-11 Thread Will Deacon
On Fri, 8 May 2020 11:15:43 -0700, Luke Nelson wrote:
> This patch series introduces several optimizations to the arm64 BPF JIT.
> The optimizations make use of arm64 immediate instructions to avoid
> loading BPF immediates to temporary registers, when possible.
> 
> In the process, we discovered two bugs in the logical immediate encoding
> function in arch/arm64/kernel/insn.c using Serval. The series also fixes
> the two bugs before introducing the optimizations.
> 
> [...]

Applied to arm64 (for-next/bpf), thanks!

[1/3] arm64: insn: Fix two bugs in encoding 32-bit logical immediates
  https://git.kernel.org/arm64/c/579d1b3faa37
[2/3] bpf, arm64: Optimize AND,OR,XOR,JSET BPF_K using arm64 logical immediates
  https://git.kernel.org/arm64/c/fd49591cb49b
[3/3] bpf, arm64: Optimize ADD,SUB,JMP BPF_K using arm64 add/sub immediates
  https://git.kernel.org/arm64/c/fd868f148189

Cheers,
-- 
Will

https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev


Re: [RFC PATCH bpf-next 1/3] arm64: insn: Fix two bugs in encoding 32-bit logical immediates

2020-05-08 Thread Will Deacon
On Thu, May 07, 2020 at 02:48:07PM -0700, Luke Nelson wrote:
> Thanks for the comments! Responses below:
> 
> > It's a bit grotty spreading the checks out now. How about we tweak things
> > slightly along the lines of:
> >
> >
> > diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> > index 4a9e773a177f..60ec788eaf33 100644
> > --- a/arch/arm64/kernel/insn.c
> > +++ b/arch/arm64/kernel/insn.c
> > [...]
> 
> Agreed; this new version looks much cleaner. I re-ran all the tests /
> verification and everything seems good. Would you like me to submit a
> v2 of this series with this new code?

Yes, please! And please include Daniel's acks on the BPF changes too. It's a
public holiday here in the UK today, but I can pick this up next week.

> >> We tested the new code against llvm-mc with all 1,302 encodable 32-bit
> >> logical immediates and all 5,334 encodable 64-bit logical immediates.
> >
> > That, on its own, is awesome information. Do you have any pointer on
> > how to set this up?
> 
> Sure! The process of running the tests is pretty involved, but I'll
> describe it below and give some links here.
> 
> We found the bugs in insn.c while adding support for logical immediates
> to the BPF JIT and verifying the changes with our tool, Serval:
> https://github.com/uw-unsat/serval-bpf. The basic idea for how we tested /
> verified logical immediates is the following:
> 
> First, we have a Python script [1] for generating every encodable
> logical immediate and the corresponding instruction fields that encode
> that immediate. The script validates the list by checking that llvm-mc
> decodes each instruction back to the expected immediate.
> 
> Next, we use the list [2] from the first step to check a Racket
> translation [3] of the logical immediate encoding function in insn.c.
> We found the second mask bug by noticing that some (encodable) 32-bit
> immediates were being rejected by the encoding function.
> 
> Last, we use the Racket translation of the encoding function to verify
> the correctness of the BPF JIT implementation [4], i.e., the JIT
> correctly compiles BPF_{AND,OR,XOR,JSET} BPF_K instructions to arm64
> instructions with equivalent semantics. We found the first bug as the
> verifier complained that the function was producing invalid encodings
> for 32-bit -1 immediates, and we were able to reproduce a kernel crash
> using the BPF tests.
> 
> We manually translated the C code to Racket because our verifier, Serval,
> currently only works on Racket code.

Nice! Two things:

(1) I really think you should give a talk on this at a Linux conference
(2) Did you look at any instruction generation functions other than the
logical immediate encoding function?

Cheers,

Will


Re: [RFC PATCH bpf-next 1/3] arm64: insn: Fix two bugs in encoding 32-bit logical immediates

2020-05-07 Thread Will Deacon
Hi Luke,

Thanks for the patches.

On Wed, May 06, 2020 at 06:05:01PM -0700, Luke Nelson wrote:
> This patch fixes two issues present in the current function for encoding
> arm64 logical immediates when using the 32-bit variants of instructions.
> 
> First, the code does not correctly reject an all-ones 32-bit immediate
> and returns an undefined instruction encoding, which can crash the kernel.
> The fix is to add a check for this case.
> 
> Second, the code incorrectly rejects some 32-bit immediates that are
> actually encodable as logical immediates. The root cause is that the code
> uses a default mask of 64-bit all-ones, even for 32-bit immediates. This
> causes an issue later on when the mask is used to fill the top bits of
> the immediate with ones, shown here:
> 
>   /*
>* Pattern: 0..01..10..01..1
>*
>* Fill the unused top bits with ones, and check if
>* the result is a valid immediate (all ones with a
>* contiguous ranges of zeroes).
>*/
>   imm |= ~mask;
>   if (!range_of_ones(~imm))
>   return AARCH64_BREAK_FAULT;
> 
> To see the problem, consider an immediate of the form 0..01..10..01..1,
> where the upper 32 bits are zero, such as 0x8001. The code checks
> if ~(imm | ~mask) contains a range of ones: the incorrect mask yields
> 1..10..01..10..0, which fails the check; the correct mask yields
> 0..01..10..0, which succeeds.
> 
> The fix is to use a 32-bit all-ones default mask for 32-bit immediates.
> 
> Currently, the only user of this function is in
> arch/arm64/kvm/va_layout.c, which uses 64-bit immediates and won't
> trigger these bugs.

Ah, so this isn't a fix or a bpf patch ;)

I can queue it via arm64 for 5.8, along with the bpf patches since there
are some other small changes pending in the arm64 bpf backend for BTI.

> We tested the new code against llvm-mc with all 1,302 encodable 32-bit
> logical immediates and all 5,334 encodable 64-bit logical immediates.
> 
> Fixes: ef3935eeebff ("arm64: insn: Add encoder for bitwise operations using 
> literals")
> Co-developed-by: Xi Wang 
> Signed-off-by: Xi Wang 
> Signed-off-by: Luke Nelson 
> ---
>  arch/arm64/kernel/insn.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
> index 4a9e773a177f..42fad79546bb 100644
> --- a/arch/arm64/kernel/insn.c
> +++ b/arch/arm64/kernel/insn.c
> @@ -1535,7 +1535,7 @@ static u32 aarch64_encode_immediate(u64 imm,
>   u32 insn)
>  {
>   unsigned int immr, imms, n, ones, ror, esz, tmp;
> - u64 mask = ~0UL;
> + u64 mask;
>  
>   /* Can't encode full zeroes or full ones */
>   if (!imm || !~imm)

It's a bit grotty spreading the checks out now. How about we tweak things
slightly along the lines of:


diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c
index 4a9e773a177f..60ec788eaf33 100644
--- a/arch/arm64/kernel/insn.c
+++ b/arch/arm64/kernel/insn.c
@@ -1535,16 +1535,10 @@ static u32 aarch64_encode_immediate(u64 imm,
u32 insn)
 {
unsigned int immr, imms, n, ones, ror, esz, tmp;
-   u64 mask = ~0UL;
-
-   /* Can't encode full zeroes or full ones */
-   if (!imm || !~imm)
-   return AARCH64_BREAK_FAULT;
+   u64 mask;
 
switch (variant) {
case AARCH64_INSN_VARIANT_32BIT:
-   if (upper_32_bits(imm))
-   return AARCH64_BREAK_FAULT;
esz = 32;
break;
case AARCH64_INSN_VARIANT_64BIT:
@@ -1556,6 +1550,12 @@ static u32 aarch64_encode_immediate(u64 imm,
return AARCH64_BREAK_FAULT;
}
 
+   mask = GENMASK(esz - 1, 0);
+
+   /* Can't encode full zeroes or full ones */
+   if (imm & ~mask || !imm || imm == mask)
+   return AARCH64_BREAK_FAULT;
+
/*
 * Inverse of Replicate(). Try to spot a repeating pattern
 * with a pow2 stride.


What do you think?

Will


Re: [PATCH] arm64: do_csum: implement accelerated scalar version

2019-08-15 Thread Will Deacon
On Thu, May 16, 2019 at 11:14:35AM +0800, Zhangshaokun wrote:
> On 2019/5/15 17:47, Will Deacon wrote:
> > On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
> >> On 12/04/2019 10:52, Will Deacon wrote:
> >>> I'm waiting for Robin to come back with numbers for a C implementation.
> >>>
> >>> Robin -- did you get anywhere with that?
> >>
> >> Still not what I would call finished, but where I've got so far (besides an
> >> increasingly elaborate test rig) is as below - it still wants some 
> >> unrolling
> >> in the middle to really fly (and actual testing on BE), but the worst-case
> >> performance already equals or just beats this asm version on Cortex-A53 
> >> with
> >> GCC 7 (by virtue of being alignment-insensitive and branchless except for
> >> the loop). Unfortunately, the advantage of C code being instrumentable does
> >> also come around to bite me...
> > 
> > Is there any interest from anybody in spinning a proper patch out of this?
> > Shaokun?
> 
> HiSilicon's Kunpeng920 (Hi1620) benefits from the do_csum optimization. If Ard and
> Robin are OK with it, Lingyan or I can try to do it.
> Of course, if anyone posts the patch, we are happy to test it.
> Either way is fine.

I don't mind who posts it, but Robin is super busy with SMMU stuff at the
moment so it probably makes more sense for you or Lingyan to do it.

Will


Re: [PATCH] arm64: do_csum: implement accelerated scalar version

2019-05-15 Thread Will Deacon
On Mon, Apr 15, 2019 at 07:18:22PM +0100, Robin Murphy wrote:
> On 12/04/2019 10:52, Will Deacon wrote:
> > I'm waiting for Robin to come back with numbers for a C implementation.
> > 
> > Robin -- did you get anywhere with that?
> 
> Still not what I would call finished, but where I've got so far (besides an
> increasingly elaborate test rig) is as below - it still wants some unrolling
> in the middle to really fly (and actual testing on BE), but the worst-case
> performance already equals or just beats this asm version on Cortex-A53 with
> GCC 7 (by virtue of being alignment-insensitive and branchless except for
> the loop). Unfortunately, the advantage of C code being instrumentable does
> also come around to bite me...

Is there any interest from anybody in spinning a proper patch out of this?
Shaokun?

Will

> /* Looks dumb, but generates nice-ish code */
> static u64 accumulate(u64 sum, u64 data)
> {
>   __uint128_t tmp = (__uint128_t)sum + data;
>   return tmp + (tmp >> 64);
> }
> 
> unsigned int do_csum_c(const unsigned char *buff, int len)
> {
>   unsigned int offset, shift, sum, count;
>   u64 data, *ptr;
>   u64 sum64 = 0;
> 
>   offset = (unsigned long)buff & 0x7;
>   /*
>* This is to all intents and purposes safe, since rounding down cannot
>* result in a different page or cache line being accessed, and @buff
>* should absolutely not be pointing to anything read-sensitive.
>* It does, however, piss off KASAN...
>*/
>   ptr = (u64 *)(buff - offset);
>   shift = offset * 8;
> 
>   /*
>* Head: zero out any excess leading bytes. Shifting back by the same
>* amount should be at least as fast as any other way of handling the
>* odd/even alignment, and means we can ignore it until the very end.
>*/
>   data = *ptr++;
> #ifdef __LITTLE_ENDIAN
>   data = (data >> shift) << shift;
> #else
>   data = (data << shift) >> shift;
> #endif
>   count = 8 - offset;
> 
>   /* Body: straightforward aligned loads from here on... */
>   //TODO: fancy stuff with larger strides and uint128s?
>   while(len > count) {
>   sum64 = accumulate(sum64, data);
>   data = *ptr++;
>   count += 8;
>   }
>   /*
>* Tail: zero any over-read bytes similarly to the head, again
>* preserving odd/even alignment.
>*/
>   shift = (count - len) * 8;
> #ifdef __LITTLE_ENDIAN
>   data = (data << shift) >> shift;
> #else
>   data = (data >> shift) << shift;
> #endif
>   sum64 = accumulate(sum64, data);
> 
>   /* Finally, folding */
>   sum64 += (sum64 >> 32) | (sum64 << 32);
>   sum = sum64 >> 32;
>   sum += (sum >> 16) | (sum << 16);
>   if (offset & 1)
>   return (u16)swab32(sum);
> 
>   return sum >> 16;
> }


Re: [PATCH] arm64: do_csum: implement accelerated scalar version

2019-04-12 Thread Will Deacon
On Fri, Apr 12, 2019 at 10:31:16AM +0800, Zhangshaokun wrote:
> On 2019/2/19 7:08, Ard Biesheuvel wrote:
> > It turns out that the IP checksumming code is still exercised often,
> > even though one might expect that modern NICs with checksum offload
> > have no use for it. However, as Lingyan points out, there are
> > combinations of features where the network stack may still fall back
> > to software checksumming, and so it makes sense to provide an
> > optimized implementation in software as well.
> > 
> > So provide an implementation of do_csum() in scalar assembler, which,
> > unlike C, gives direct access to the carry flag, making the code run
> > substantially faster. The routine uses overlapping 64 byte loads for
> > all input size > 64 bytes, in order to reduce the number of branches
> > and improve performance on cores with deep pipelines.
> > 
> > On Cortex-A57, this implementation is on par with Lingyan's NEON
> > implementation, and roughly 7x as fast as the generic C code.
> > 
> > Cc: "huanglingyan (A)" 
> > Signed-off-by: Ard Biesheuvel 
> > ---
> > Test code after the patch.
>
> Hi maintainers and Ard,
> 
> Any update on it?

I'm waiting for Robin to come back with numbers for a C implementation.

Robin -- did you get anywhere with that?

Will


Re: bpf memory model. Was: [PATCH v4 bpf-next 1/9] bpf: introduce bpf_spin_lock

2019-01-30 Thread Will Deacon
Hi Alexei,

On Mon, Jan 28, 2019 at 01:56:24PM -0800, Alexei Starovoitov wrote:
> On Mon, Jan 28, 2019 at 10:24:08AM +0100, Peter Zijlstra wrote:
> > On Fri, Jan 25, 2019 at 04:17:26PM -0800, Alexei Starovoitov wrote:
> > > What I want to avoid is to define the whole execution ordering model 
> > > upfront.
> > > We cannot say that BPF ISA is weakly ordered like alpha.
> > > Most of the bpf progs are written and running on x86. We shouldn't
> > > twist bpf developer's arm by artificially relaxing memory model.
> > > BPF memory model is equal to memory model of underlying architecture.
> > > What we can do is to make it bpf progs a bit more portable with
> > > smp_rmb instructions, but we must not force weak execution on the 
> > > developer.
> > 
> > Well, I agree with only introducing bits you actually need, and my
> > smp_rmb() example might have been poorly chosen, smp_load_acquire() /
> > smp_store_release() might have been a far more useful example.
> > 
> > But I disagree with the last part; we have to pick a model now;
> > otherwise you'll pain yourself into a corner.
> > 
> > Also; Alpha isn't very relevant these days; however ARM64 does seem to
> > be gaining a lot of attention and that is very much a weak architecture.
> > Adding strongly ordered assumptions to BPF now, will penalize them in
> > the long run.
> 
> arm64 is gaining attention, just like RISC-V is.
> The BPF JIT for arm64 is very solid, while the BPF JIT for RISC-V is being worked on.
> BPF is not picking sides in CPU HW and ISA battles.

It's not about picking a side, it's about providing an abstraction of the
various CPU architectures out there so that the programmer doesn't need to
worry about where their program may run. Hell, even if you just said "eBPF
follows x86 semantics" that would be better than saying nothing (and then we
could have a discussion about whether x86 semantics are really what you
want).

> Memory model is CPU HW design decision. BPF ISA cannot dictate HW design.
> We're not saying today that BPF is strongly ordered.
> BPF load/stores are behaving differently on x86 vs arm64.
> We can add new instructions, but we cannot 'define' how load/stores behave
> from memory model perspective.
> For example, take atomicity of single byte load/store.
> Not all archs have them atomic, but we cannot say to bpf programmers
> to always assume non-atomic byte loads.

Hmm, I don't think this is a good approach to take for the future of eBPF.
Assuming that a desirable property of an eBPF program is portability between
CPU architectures, then you're effectively forcing the programmer to "assume
the worst", where the worst is almost certainly unusable for practical
purposes.

One easy thing you could consider would be to allow tagging of an eBPF
program with its supported target architectures (the JIT will refuse to
accept it for other architectures). This would at least remove the
illusion of portability and force the programmer to be explicit.

However, I think we'd much better off if we defined some basic ordering
primitives such as relaxed and RCpc-style acquire/release operations
(including atomic RmW), single-copy atomic memory accesses up to the native
machine size and a full-fence instruction. If your program uses something
that the underlying arch doesn't support, then it is rejected (e.g. 64-bit
READ_ONCE on a 32-bit arch).

That should map straightforwardly to all modern architectures and allow for
efficient codegen on x86 and arm64. It would probably require a bunch of new
BPF instructions that would be defined to be atomic (you already have XADD
as a relaxed atomic add).
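
As a purely illustrative example of what those primitives buy you, the usual
message-passing idiom becomes expressible portably ('msg' is a shared
structure made up for the example):

        /* producer */
        WRITE_ONCE(msg->payload, 42);
        smp_store_release(&msg->ready, 1);      /* payload visible before flag */

        /* consumer */
        if (smp_load_acquire(&msg->ready))      /* flag ordered before payload */
                data = READ_ONCE(msg->payload);

and the JITs could then map the acquire/release accesses to LDAR/STLR on
arm64 and plain loads/stores on x86.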

Apologies if this sounds patronising, but I'm keen to help figure out the
semantics *now* so that we don't end up having to infer them later on, which
is the usual painful case for memory models. I suspect Peter and Paul would
also prefer to attack it that way around. I appreciate that the temptation
is to avoid the problem by deferring to the underlying hardware memory
model, but I think that will create more problems than it solves and we're
here to help you get this right.

Will


Re: [PATCH RFC 2/4] include/linux/compiler.h: allow memory operands

2019-01-07 Thread Will Deacon
On Wed, Jan 02, 2019 at 03:57:54PM -0500, Michael S. Tsirkin wrote:
> We don't really care whether the variable is in-register
> or in-memory. Relax the constraint accordingly.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  include/linux/compiler.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/compiler.h b/include/linux/compiler.h
> index 1ad367b4cd8d..6601d39e8c48 100644
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -154,7 +154,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, 
> int val,
>  #ifndef OPTIMIZER_HIDE_VAR
>  /* Make the optimizer believe the variable can be manipulated arbitrarily. */
>  #define OPTIMIZER_HIDE_VAR(var)  
> \
> - __asm__ ("" : "=r" (var) : "0" (var))
> + __asm__ ("" : "=rm" (var) : "0" (var))
>  #endif

I think this can break for architectures with write-back addressing modes
such as arm, where the "m" constraint is assumed to be evaluated precisely
once in the asm block.

Will


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Will Deacon
On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski  wrote:
> >
> > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> >  wrote:
> > >
> > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski  wrote:
> > > >
> >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data?  Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses.  I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > >
> > > The crypto code shouldn't care, but I think it will probably break 
> > > hibernate :-(
> >
> > How so?  Hibernate works (or at least should work) on x86 PAE, where
> > __va doesn't work on module data, and, on x86, the direct map has some
> > RO parts with where the module is, so hibernate can't be writing to
> > the memory through the direct map with its final permissions.
> 
> On arm64 at least, hibernate reads the contents of memory via the
> linear mapping. Not sure about other arches.

Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
the pte when we see that it's PROT_NONE?

Will


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-06 Thread Will Deacon
On Thu, Dec 06, 2018 at 08:29:03AM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski  wrote:
> >
> > On Wed, Dec 5, 2018 at 3:41 AM Will Deacon  wrote:
> > >
> > > On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > > > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > > >  wrote:
> > > > >
> > > > > On Tue, 2018-12-04 at 16:03 +, Will Deacon wrote:
> > > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe 
> > > > > > > > 
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Since vfree will lazily flush the TLB, but not lazily free the 
> > > > > > > > underlying
> > > > > > > > pages,
> > > > > > > > it often leaves stale TLB entries to freed pages that could get 
> > > > > > > > re-used.
> > > > > > > > This is
> > > > > > > > undesirable for cases where the memory being freed has special 
> > > > > > > > permissions
> > > > > > > > such
> > > > > > > > as executable.
> > > > > > >
> > > > > > > So I am trying to finish my patch-set for preventing transient 
> > > > > > > W+X mappings
> > > > > > > from taking space, by handling kprobes & ftrace that I missed 
> > > > > > > (thanks again
> > > > > > > for
> > > > > > > pointing it out).
> > > > > > >
> > > > > > > But all of the sudden, I don’t understand why we have the problem 
> > > > > > > that this
> > > > > > > (your) patch-set deals with at all. We already change the 
> > > > > > > mappings to make
> > > > > > > the memory writable before freeing the memory, so why can’t we make it
> > > > > > > non-executable at the same time? Actually, why do we make the 
> > > > > > > module memory,
> > > > > > > including its data executable before freeing it???
> > > > > >
> > > > > > Yeah, this is really confusing, but I have a suspicion it's a 
> > > > > > combination
> > > > > > of the various different configurations and hysterical raisins. We 
> > > > > > can't
> > > > > > rely on module_alloc() allocating from the vmalloc area (see nios2) 
> > > > > > nor
> > > > > > can we rely on disable_ro_nx() being available at build time.
> > > > > >
> > > > > > If we *could* rely on module allocations always using vmalloc(), 
> > > > > > then
> > > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > > afaict -- who cares about the memory attributes of a mapping that's 
> > > > > > about
> > > > > > to disappear anyway?
> > > > > >
> > > > > > Is it just nios2 that does something different?
> > > > > >
> > > > > Yea it is really intertwined. I think for x86, set_memory_nx 
> > > > > everywhere would
> > > > > solve it as well, in fact that was what I first thought the solution 
> > > > > should be
> > > > > until this was suggested. It's interesting that from the other thread 
> > > > > Masami
> > > > > Hiramatsu referenced, set_memory_nx was suggested last year and would 
> > > > > have
> > > > > inadvertently blocked this on x86. But, on the other architectures I 
> > > > > have since
> > > > > learned it is a bit different.
> > > > >
> > > > > It looks like actually most arch's don't re-define set_memory_*, and 
> > > > > so all of
> > > > > the frob_* functions are actually just noops. In which case 
> > > > > allocating RWX is
> > > > > needed to make it work at all, because that is what the allocation is 
> > > > > going to
> > > > > stay at. So in these archs, set_memory_nx won't solve it because it 
> > > > > will do
> > > > > nothing.

Re: [PATCH v4 2/2] arm64/bpf: don't allocate BPF JIT programs in module memory

2018-12-05 Thread Will Deacon
On Wed, Dec 05, 2018 at 01:24:17PM +0100, Daniel Borkmann wrote:
> On 12/04/2018 04:45 PM, Ard Biesheuvel wrote:
> > On Mon, 3 Dec 2018 at 13:49, Will Deacon  wrote:
> >> On Fri, Nov 30, 2018 at 08:20:06PM +0100, Ard Biesheuvel wrote:
> >>> On Fri, 30 Nov 2018 at 19:26, Will Deacon  wrote:
> >>>> On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> >>>>> diff --git a/arch/arm64/net/bpf_jit_comp.c 
> >>>>> b/arch/arm64/net/bpf_jit_comp.c
> >>>>> index a6fdaea07c63..76c2ab40c02d 100644
> >>>>> --- a/arch/arm64/net/bpf_jit_comp.c
> >>>>> +++ b/arch/arm64/net/bpf_jit_comp.c
> >>>>> @@ -940,3 +940,16 @@ struct bpf_prog *bpf_int_jit_compile(struct 
> >>>>> bpf_prog *prog)
> >>>>>  tmp : orig_prog);
> >>>>>   return prog;
> >>>>>  }
> >>>>> +
> >>>>> +void *bpf_jit_alloc_exec(unsigned long size)
> >>>>> +{
> >>>>> + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> >>>>> + BPF_JIT_REGION_END, GFP_KERNEL,
> >>>>> + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> >>>>> + __builtin_return_address(0));
> >>>>
> >>>> I guess we'll want VM_IMMEDIATE_UNMAP here if Rich gets that merged.
> >>>
> >>> I think akpm already queued up that patch.
> >>>
> >>>> In the
> >>>> meantime, I wonder if it's worth zeroing the region in 
> >>>> bpf_jit_free_exec()?
> >>>> (although we'd need the size information...).
> >>>>
> >>>
> >>> Not sure. What exactly would that achieve?
> >>
> >> I think the zero encoding is guaranteed to be undefined, so it would limit
> >> the usefulness of any stale, executable TLB entries. However, we'd also 
> >> need
> >> cache maintenance to make that stuff visible to the I side, so it's 
> >> probably
> >> not worth it, especially if akpm has queued the stuff from Rich.
> >>
> >> Maybe just add an:
> >>
> >> /* FIXME: Remove this when VM_IMMEDIATE_UNMAP is supported */
> >> #ifndef VM_IMMEDIATE_UNMAP
> >> #define VM_IMMEDIATE_UNMAP  0
> >> #endif
> >>
> >> so we remember to come back and sort this out? Up to you.
> > 
> > I'll just make a note to send out that patch once the definition lands via 
> > -akpm
> 
> Could I get an ACK from you for this patch, then I'd take the series into 
> bpf-next.

Gah, thanks for the ping: I thought I acked this initially, but turns out I
didn't.

Acked-by: Will Deacon 

Will


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-05 Thread Will Deacon
On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>  wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe 
> > > > > 
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the 
> > > > > underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get 
> > > > > re-used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special 
> > > > > permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X 
> > > > mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks 
> > > > again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that 
> > > > this
> > > > (your) patch-set deals with at all. We already change the mappings to 
> > > > make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module 
> > > > memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere 
> > would
> > solve it as well, in fact that was what I first thought the solution should 
> > be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have 
> > since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all 
> > of
> > the frob_* functions are actually just noops. In which case allocating RWX 
> > is
> > needed to make it work at all, because that is what the allocation is going 
> > to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is 
> > the
> > changing of the permissions on the directmap as well. You don't want some 
> > other
> > caller getting a page that was left RO when freed and then trying to write 
> > to
> > it, if I understand this.
> >
> 
> Exactly.

Of course, I forgot about the linear mapping. On arm64, we've just queued
support for reflecting changes to read-only permissions in the linear map
[1]. So, whilst the linear map is always non-executable, we will need to
make parts of it writable again when freeing the module.

> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> VM_MAY_ADJUST_PERMS or similar.  It would have the semantics you want,
> but it would also call some arch hooks to put back the direct map
> permissions before the flush.  Does that seem reasonable?  It would
> need to be hooked up that implement set_memory_ro(), but that should
> be quite easy.  If nothing else, it could fall back to set_memory_ro()
> in the absence of a better implementation.

You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
would open up a window where the vmap mapping is executable and the linear
mapping is writable, which is a bit rubbish.

Will

[1]
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/core&id=c55191e96caa9d787e8f682c5e525b7f8172a3b4


Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages

2018-12-04 Thread Will Deacon
On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe  
> > wrote:
> > 
> > Since vfree will lazily flush the TLB, but not lazily free the underlying 
> > pages,
> > it often leaves stale TLB entries to freed pages that could get re-used. 
> > This is
> > undesirable for cases where the memory being freed has special permissions 
> > such
> > as executable.
> 
> So I am trying to finish my patch-set for preventing transient W+X mappings
> from taking space, by handling kprobes & ftrace that I missed (thanks again 
> for
> pointing it out).
> 
> But all of the sudden, I don’t understand why we have the problem that this
> (your) patch-set deals with at all. We already change the mappings to make
> the memory writable before freeing the memory, so why can’t we make it
> non-executable at the same time? Actually, why do we make the module memory,
> including its data executable before freeing it???

Yeah, this is really confusing, but I have a suspicion it's a combination
of the various different configurations and hysterical raisins. We can't
rely on module_alloc() allocating from the vmalloc area (see nios2) nor
can we rely on disable_ro_nx() being available at build time.

If we *could* rely on module allocations always using vmalloc(), then
we could pass in Rick's new flag and drop disable_ro_nx() altogether
afaict -- who cares about the memory attributes of a mapping that's about
to disappear anyway?

Is it just nios2 that does something different?

Will


Re: [PATCH v4 2/2] arm64/bpf: don't allocate BPF JIT programs in module memory

2018-12-03 Thread Will Deacon
On Fri, Nov 30, 2018 at 08:20:06PM +0100, Ard Biesheuvel wrote:
> On Fri, 30 Nov 2018 at 19:26, Will Deacon  wrote:
> >
> > On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> > > The arm64 module region is a 128 MB region that is kept close to
> > > the core kernel, in order to ensure that relative branches are
> > > always in range. So using the same region for programs that do
> > > not have this restriction is wasteful, and preferably avoided.
> > >
> > > Now that the core BPF JIT code permits the alloc/free routines to
> > > be overridden, implement them by vmalloc()/vfree() calls from a
> > > dedicated 128 MB region set aside for BPF programs. This ensures
> > > that BPF programs are still in branching range of each other, which
> > > is something the JIT currently depends upon (and is not guaranteed
> > > when using module_alloc() on KASLR kernels like we do currently).
> > > It also ensures that placement of BPF programs does not correlate
> > > with the placement of the core kernel or modules, making it less
> > > likely that leaking the former will reveal the latter.
> > >
> > > This also solves an issue under KASAN, where shadow memory is
> > > needlessly allocated for all BPF programs (which don't require KASAN
> > > shadow pages since they are not KASAN instrumented)
> > >
> > > Signed-off-by: Ard Biesheuvel 
> > > ---
> > >  arch/arm64/include/asm/memory.h |  5 -
> > >  arch/arm64/net/bpf_jit_comp.c   | 13 +
> > >  2 files changed, 17 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/arch/arm64/include/asm/memory.h 
> > > b/arch/arm64/include/asm/memory.h
> > > index b96442960aea..ee20fc63899c 100644
> > > --- a/arch/arm64/include/asm/memory.h
> > > +++ b/arch/arm64/include/asm/memory.h
> > > @@ -62,8 +62,11 @@
> > >  #define PAGE_OFFSET  (UL(0x) - \
> > >   (UL(1) << (VA_BITS - 1)) + 1)
> > >  #define KIMAGE_VADDR (MODULES_END)
> > > +#define BPF_JIT_REGION_START (VA_START + KASAN_SHADOW_SIZE)
> > > +#define BPF_JIT_REGION_SIZE  (SZ_128M)
> > > +#define BPF_JIT_REGION_END   (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
> > >  #define MODULES_END  (MODULES_VADDR + MODULES_VSIZE)
> > > -#define MODULES_VADDR(VA_START + KASAN_SHADOW_SIZE)
> > > +#define MODULES_VADDR(BPF_JIT_REGION_END)
> > >  #define MODULES_VSIZE(SZ_128M)
> > >  #define VMEMMAP_START(PAGE_OFFSET - VMEMMAP_SIZE)
> > >  #define PCI_IO_END   (VMEMMAP_START - SZ_2M)
> > > diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> > > index a6fdaea07c63..76c2ab40c02d 100644
> > > --- a/arch/arm64/net/bpf_jit_comp.c
> > > +++ b/arch/arm64/net/bpf_jit_comp.c
> > > @@ -940,3 +940,16 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> > > *prog)
> > >  tmp : orig_prog);
> > >   return prog;
> > >  }
> > > +
> > > +void *bpf_jit_alloc_exec(unsigned long size)
> > > +{
> > > + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> > > + BPF_JIT_REGION_END, GFP_KERNEL,
> > > + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> > > + __builtin_return_address(0));
> >
> > I guess we'll want VM_IMMEDIATE_UNMAP here if Rich gets that merged.
> 
> I think akpm already queued up that patch.
> 
> > In the
> > meantime, I wonder if it's worth zeroing the region in bpf_jit_free_exec()?
> > (although we'd need the size information...).
> >
> 
> Not sure. What exactly would that achieve?

I think the zero encoding is guaranteed to be undefined, so it would limit
the usefulness of any stale, executable TLB entries. However, we'd also need
cache maintenance to make that stuff visible to the I side, so it's probably
not worth it, especially if akpm has queued the stuff from Rich.

Maybe just add an:

/* FIXME: Remove this when VM_IMMEDIATE_UNMAP is supported */
#ifndef VM_IMMEDIATE_UNMAP
#define VM_IMMEDIATE_UNMAP  0
#endif

so we remember to come back and sort this out? Up to you.

Will


Re: [PATCH v4 2/2] arm64/bpf: don't allocate BPF JIT programs in module memory

2018-11-30 Thread Will Deacon
On Fri, Nov 23, 2018 at 11:18:04PM +0100, Ard Biesheuvel wrote:
> The arm64 module region is a 128 MB region that is kept close to
> the core kernel, in order to ensure that relative branches are
> always in range. So using the same region for programs that do
> not have this restriction is wasteful, and preferably avoided.
> 
> Now that the core BPF JIT code permits the alloc/free routines to
> be overridden, implement them by vmalloc()/vfree() calls from a
> dedicated 128 MB region set aside for BPF programs. This ensures
> that BPF programs are still in branching range of each other, which
> is something the JIT currently depends upon (and is not guaranteed
> when using module_alloc() on KASLR kernels like we do currently).
> It also ensures that placement of BPF programs does not correlate
> with the placement of the core kernel or modules, making it less
> likely that leaking the former will reveal the latter.
> 
> This also solves an issue under KASAN, where shadow memory is
> needlessly allocated for all BPF programs (which don't require KASAN
> shadow pages since they are not KASAN instrumented)
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/arm64/include/asm/memory.h |  5 -
>  arch/arm64/net/bpf_jit_comp.c   | 13 +
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/memory.h b/arch/arm64/include/asm/memory.h
> index b96442960aea..ee20fc63899c 100644
> --- a/arch/arm64/include/asm/memory.h
> +++ b/arch/arm64/include/asm/memory.h
> @@ -62,8 +62,11 @@
>  #define PAGE_OFFSET  (UL(0x) - \
>   (UL(1) << (VA_BITS - 1)) + 1)
>  #define KIMAGE_VADDR (MODULES_END)
> +#define BPF_JIT_REGION_START (VA_START + KASAN_SHADOW_SIZE)
> +#define BPF_JIT_REGION_SIZE  (SZ_128M)
> +#define BPF_JIT_REGION_END   (BPF_JIT_REGION_START + BPF_JIT_REGION_SIZE)
>  #define MODULES_END  (MODULES_VADDR + MODULES_VSIZE)
> -#define MODULES_VADDR(VA_START + KASAN_SHADOW_SIZE)
> +#define MODULES_VADDR(BPF_JIT_REGION_END)
>  #define MODULES_VSIZE(SZ_128M)
>  #define VMEMMAP_START(PAGE_OFFSET - VMEMMAP_SIZE)
>  #define PCI_IO_END   (VMEMMAP_START - SZ_2M)
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index a6fdaea07c63..76c2ab40c02d 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -940,3 +940,16 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog 
> *prog)
>  tmp : orig_prog);
>   return prog;
>  }
> +
> +void *bpf_jit_alloc_exec(unsigned long size)
> +{
> + return __vmalloc_node_range(size, PAGE_SIZE, BPF_JIT_REGION_START,
> + BPF_JIT_REGION_END, GFP_KERNEL,
> + PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
> + __builtin_return_address(0));

I guess we'll want VM_IMMEDIATE_UNMAP here if Rich gets that merged. In the
meantime, I wonder if it's worth zeroing the region in bpf_jit_free_exec()?
(although we'd need the size information...).

Will


Re: [PATCH 0/2] Don’t leave executable TLB entries to freed pages

2018-11-28 Thread Will Deacon
On Tue, Nov 27, 2018 at 05:21:08PM -0800, Nadav Amit wrote:
> > On Nov 27, 2018, at 5:06 PM, Nadav Amit  wrote:
> > 
> >> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe  
> >> wrote:
> >> 
> >> Sometimes when memory is freed via the module subsystem, an executable
> >> permissioned TLB entry can remain to a freed page. If the page is re-used 
> >> to
> >> back an address that will receive data from userspace, it can result in 
> >> user
> >> data being mapped as executable in the kernel. The root of this behavior is
> >> vfree lazily flushing the TLB, but not lazily freeing the underlying 
> >> pages. 
> >> 
> >> There are sort of three categories of this which show up across modules, 
> >> bpf,
> >> kprobes and ftrace:
> >> 
> >> 1. When executable memory is touched and then immediatly freed
> >> 
> >>  This shows up in a couple error conditions in the module loader and BPF 
> >> JIT
> >>  compiler.
> > 
> > Interesting!
> > 
> > Note that this may conflict with "x86: avoid W^X being broken during
> > modules loading", which I recently submitted.
> 
> I actually have not looked at the vmalloc() code much recently, but it
> seems … strange:
> 
>   void vm_unmap_aliases(void)
>   {   
> 
>   ...
>   mutex_lock(&vmap_purge_lock);
>   purge_fragmented_blocks_allcpus();
>   if (!__purge_vmap_area_lazy(start, end) && flush)
>   flush_tlb_kernel_range(start, end);
>   mutex_unlock(&vmap_purge_lock);
>   }
> 
> Since __purge_vmap_area_lazy() releases the memory, it seems there is a time
> window between the release of the region and the TLB flush, in which the
> area can be allocated for another purpose. This can result in a
> (theoretical) correctness issue. No?

If __purge_vmap_area_lazy() returns false, then it hasn't freed the memory,
so we only invalidate the TLB if 'flush' is true in that case. If
__purge_vmap_area_lazy() returns true instead, then it takes care of the TLB
invalidation before the freeing.

Will


Re: [PATCH bpf-next 2/3] tools, perf: use smp_{rmb,mb} barriers instead of {rmb,mb}

2018-10-19 Thread Will Deacon
On Thu, Oct 18, 2018 at 08:53:42PM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 18, 2018 at 09:00:46PM +0200, Daniel Borkmann wrote:
> > On 10/18/2018 05:33 PM, Alexei Starovoitov wrote:
> > > On Thu, Oct 18, 2018 at 05:04:34PM +0200, Daniel Borkmann wrote:
> > >>  #endif /* _TOOLS_LINUX_ASM_IA64_BARRIER_H */
> > >> diff --git a/tools/arch/powerpc/include/asm/barrier.h 
> > >> b/tools/arch/powerpc/include/asm/barrier.h
> > >> index a634da0..905a2c6 100644
> > >> --- a/tools/arch/powerpc/include/asm/barrier.h
> > >> +++ b/tools/arch/powerpc/include/asm/barrier.h
> > >> @@ -27,4 +27,20 @@
> > >>  #define rmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > >>  #define wmb()  __asm__ __volatile__ ("sync" : : : "memory")
> > >>
> > >> +#if defined(__powerpc64__)
> > >> +#define smp_lwsync()	__asm__ __volatile__ ("lwsync" : : : "memory")
> > >> +
> > >> +#define smp_store_release(p, v)			\
> > >> +do {							\
> > >> +	smp_lwsync();					\
> > >> +	WRITE_ONCE(*p, v);				\
> > >> +} while (0)
> > >> +
> > >> +#define smp_load_acquire(p)				\
> > >> +({							\
> > >> +	typeof(*p) ___p1 = READ_ONCE(*p);		\
> > >> +	smp_lwsync();					\
> > >> +	___p1;						\
> > > 
> > > I don't like this proliferation of asm.
> > > Why do we think that we can do better job than compiler?
> > > can we please use gcc builtins instead?
> > > https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> > > __atomic_load_n(ptr, __ATOMIC_ACQUIRE);
> > > __atomic_store_n(ptr, val, __ATOMIC_RELEASE);
> > > are done specifically for this use case if I'm not mistaken.
> > > I think it pays to learn what compiler provides.
> > 
> > But are you sure the C11 memory model matches exact same model as kernel?
> > Seems like last time Will looked into it [0] it wasn't the case ...
> 
> I'm only suggesting equivalence of __atomic_load_n(ptr, __ATOMIC_ACQUIRE)
> with kernel's smp_load_acquire().
> I've seen a bunch of user space ring buffer implementations implemented
> with __atomic_load_n() primitives.
> But let's ask experts who live in both worlds.

One thing to be wary of is if there is an implementation choice between
how to implement load-acquire and store-release for a given architecture.
In these situations, it's often important that concurrent software agrees
on the "mapping", so we'd need to be sure that (a) All userspace compilers
that we care about have compatible mappings and (b) These mappings are
compatible with the kernel code.
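
As a sketch of the kind of pairing that has to line up (field names borrowed
from the perf mmap page just for illustration; this is not the actual code on
either side):

        /* kernel: publish new data to userspace */
        smp_store_release(&pg->data_head, head);

        /* userspace: consumer built with the compiler builtins */
        head = __atomic_load_n(&pg->data_head, __ATOMIC_ACQUIRE);
        /* ... consume records up to head ... */
        __atomic_store_n(&pg->data_tail, tail, __ATOMIC_RELEASE);

If the kernel's smp_store_release() and the compiler's __ATOMIC_ACQUIRE were
to pick incompatible mappings on some architecture, the protocol above would
silently break, which is why both (a) and (b) matter.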

Will


Re: [RFC PATCH] skb: Define NET_IP_ALIGN based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

2018-10-05 Thread Will Deacon
On Thu, Oct 04, 2018 at 07:43:59PM +0200, Ard Biesheuvel wrote:
> (+ Arnd, Russell, Catalin, Will)
> 
> On 4 October 2018 at 19:36, Ben Hutchings  
> wrote:
> > NET_IP_ALIGN is supposed to be defined as 0 if DMA writes to an
> > unaligned buffer would be more expensive than CPU access to unaligned
> > header fields, and otherwise defined as 2.
> >
> > Currently only ppc64 and x86 configurations define it to be 0.
> > However several other architectures (conditionally) define
> > CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, which seems to imply that
> > NET_IP_ALIGN should be 0.
> >
> > Remove the overriding definitions for ppc64 and x86 and define
> > NET_IP_ALIGN solely based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
> >
> > Signed-off-by: Ben Hutchings 
> 
> While this makes sense for arm64, I don't think it is appropriate for
> ARM per se.

Agreed that this makes sense for arm64, and I'd be happy to take a patch
defining it as 0 there.

Will


Re: RFC on writel and writel_relaxed

2018-03-29 Thread Will Deacon
On Thu, Mar 29, 2018 at 08:31:32AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> >   This is a variation on the mandatory write barrier that causes writes to 
> > weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond 
> > the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.
> 
> If that's indeed the case, I would suggest ia64 switches to a similar
> per-cpu flag trick powerpc uses.

... or we could remove ia64.

/me runs for cover

Will


Re: RFC on writel and writel_relaxed

2018-03-28 Thread Will Deacon
On Wed, Mar 28, 2018 at 05:42:56PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
> >  wrote:
> > > 
> > > This is why, I want (with your agreement) to define clearly and once
> > > and for all, that the Linux semantics of writel are that it is ordered
> > > with previous writes to coherent memory (*)
> > 
> > Honestly, I think those are the sane semantics. In fact, make it
> > "ordered with previous writes" full stop, since it's not only ordered
> > wrt previous writes to memory, but also previous writel's.
> 
> Of course. It was somewhat a given that it's ordered vs. any previous
> MMIO actually, but it doesn't hurt to spell it out once more.

Good. So I think this confirms our understanding so far.
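
That is, the portable expectation is simply (sketch, with made-up names):

dma_buf->head = new_head;                  /* plain store to coherent memory      */
writel(new_head, ioaddr + RING_DOORBELL);  /* ordered after that store, no wmb()  */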

> 
> > > Also, can I assume the above ordering with writel() equally applies to
> > > readl() or not ?
> > > 
> > > IE:
> > > dma_buf->foo = 1;
> > > readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> > 
> > If that KICK_ON_READ is UC, then that's definitely the case. And
> > honestly, status registers like that really should always be UC.
> > 
> > But if somebody sets the area WC (which is crazy), then I think it
> > might be at least debatable. x86 semantics does allow reads to be done
> > before previous writes (or, put another way, writes to be buffered -
> > the buffers are ordered so writes don't get re-ordered, but reads can
> > happen during the buffering).
> 
> Right, for now I worry about UC semantics. Once we have nailed that, we
> can look at WC, which is a lot more tricky as archs differs more
> widely, but one thing at a time.
> 
> > But UC accesses are always  done entirely ordered, and honestly, any
> > status register that starts a DMA would not make sense any other way.
> > 
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
> 
> I have bad memories of old adaptec controllers ...
> 
> That said, I think the above might not be right on ARM if we want to
> make it the rule, Will, what do you reckon ?

So there are two cases to consider:

1.
if (readl(DEVICE_DMA_STATUS) == DMA_DONE)
mydata = *dma_bufp;



2.
*dma_bufp = 42;
readl(DEVICE_DMA_KICK_ON_READ);


For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
add an mb() to make it work.
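
i.e. to make (2) reliable on arm/arm64 you would need something like:

*dma_bufp = 42;
mb();                                   /* order the buffer write before the MMIO read */
readl(DEVICE_DMA_KICK_ON_READ);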

Do both of these work on power? If so, I guess I can make readl even more
expensive :/ Feels a bit like the tail wagging the dog, though.

Another thing I just realised is that we restrict the barriers we use in
readl/writel on arm64 so that they don't necessarily apply to both loads and
stores. To be specific:

   writel is ordered against prior writes to memory, but not reads

   readl is ordered against subsequent reads of memory, but not writes (but
   note that in example (1) above, the control dependency ensures that).

If necessary, I could move the barrier in our readl implementation to be
before the read, then play the control-dependency + instruction-sync (ISB)
trick that you do on power.

Will


Re: RFC on writel and writel_relaxed

2018-03-27 Thread Will Deacon
Hi Alex,

On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
> +netdev, +Alex
> 
> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
> >>  Most of the drivers have a unwound loop with writeq() or something to
> >>> do it.
> >>
> >> But isn't the writeq() barrier much more expensive than anything you'd
> >> do in function calls?
> > 
> > It is for us, and will break any write combining.
> > 
> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
>  I was always puzzled by this: The intention of _relaxed() on ARM
>  (where it originates) was to skip the barrier that serializes DMA
>  with MMIO, not to skip the serialization between MMIO and locks.
> >>>
> >>> But that was never a requirement of writel(),
> >>> Documentation/memory-barriers.txt gives an explicit example demanding
> >>> the wmb() before writel() for ordering system memory against writel.
> > 
> > This is a bug in the documentation.
> > 
> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> >> Adding Alexander Duyck to Cc, he added that section as part of
> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> >> dma_wmb()"). Also adding the other people that were involved with that.
> > 
> > Linus himself made it very clear years ago. readl and writel have to
> > order vs memory accesses.
> > 
> >>> I actually have no idea why ARM had that barrier, I always assumed it
> >>> was to give program ordering to the accesses and that _relaxed allowed
> >>> re-ordering (the usual meaning of relaxed)..
> >>>
> >>> But the barrier document makes it pretty clear that the only
> >>> difference between the two is spinlock containment, and WillD wrote
> >>> this text, so I belive it is accurate for ARM.
> >>>
> >>> Very confusing.
> >>
> >> It does mention serialization with both DMA and locks in the
> >> section about  readX_relaxed()/writeX_relaxed(). The part
> >> about DMA is very clear here, and I must have just forgotten
> >> the exact semantics with regards to spinlocks. I'm still not
> >> sure what prevents a writel() from leaking out the end of a
> >> spinlock section that doesn't happen with writel_relaxed(), since
> >> the barrier in writel() comes before the access, and the
> >> spin_unlock() shouldn't affect the external buses.
> > 
> > So...
> > 
> > Historically, what happened is that we (we means whoever participated
> > in the discussion on the list with Linus calling the shots really)
> > decided that there was no sane way for drivers to understand a world
> > where readl/writel didn't fully order things vs. memory accesses (ie,
> > DMA).
> > 
> > So it should always be correct to do:
> > 
> > - Write to some in-memory buffer
> > - writel() to kick the DMA read of that buffer
> > 
> > without any extra barrier.
> > 
> > The spinlock situation however got murky. Mostly that came up because
> > on architecture (I forgot who, might have been ia64) has a hard time
> > providing that consistency without making writel insanely expensive.
> > 
> > Thus they created mmiowb whose main purpose was precisely to order
> > writel with a following spin_unlock.
> > 
> > I decided not to go down that path on power because getting all drivers
> > "fixed" to do the right thing was going to be a losing battle, and
> > instead added per-cpu tracking of writel in order to "escalate" to a
> > heavier barrier in spin_unlock itself when necessary.
> > 
> > Now, all this happened more than a decade ago and it's possible that
> > the understanding or expectations "shifted" over time...
> 
> Alex is raising concerns on the netdev list.
> 
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
> 
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
> 
> > Jason is seeking behavior clarification for write combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

To clarify: are you saying that on x86 you need a wmb() prior to a writel
if you want that writel to be ordered after prior writes to memory? Is this
specific to WC memory or some other non-standard attribute?

The only reason we have wmb() inside writel() on arm, arm64 and power is for
parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
memory by default so that it was easier to write portable drivers. The
performance impact of that implicit barrier is non-trivial, but 

Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-02-05 Thread Will Deacon
Hi all,

On Wed, Jan 10, 2018 at 07:47:33PM +, Will Deacon wrote:
> On Tue, Jan 09, 2018 at 10:21:29AM +0000, Will Deacon wrote:
> > On Mon, Jan 08, 2018 at 10:49:01AM -0800, Linus Torvalds wrote:
> > > In this particular case, we should be very much aware of future CPU's
> > > being more _constrained_, because CPU vendors had better start taking
> > > this thing into account.
> > > 
> > > So the masking approach is FUNDAMENTALLY SAFER than the "let's try to
> > > limit control speculation".
> > > 
> > > If somebody can point to a CPU that actually speculates across an
> > > address masking operation, I will be very surprised. And unless you
> > > can point to that, then stop trying to dismiss the masking approach.
> > 
> > Whilst I agree with your comments about future CPUs, this stuff is further
> > out of academia than you might think. We're definitely erring on the
> > belt-and-braces side of things at the moment, so let me go check what's
> > *actually* been built and I suspect we'll be able to make the masking work.
> > 
> > Stay tuned...
> 
> I can happily confirm that there aren't any (ARM architecture) CPUs where
> the masking approach is not sufficient, so there's no need to worry about
> value speculation breaking this.

Unfortunately, thanks to some internal miscommunication, my previous assertion
that no implementations of the Armv8 architecture utilise data value prediction
has turned out to be incorrect. I received confirmation last week that this has
been deployed in production silicon and has shipped widely in various products,
so the horse really has bolted and this isn't confined to academia as was
suggested previously.

We're still investigating whether this affects the mask-based mitigation used
by eBPF, but we'll definitely be adding a CSDB-based sequence to our nospec
helpers to ensure that array_index_nospec is robust for arm64: the CSDB
instruction follows the generation of the mask (see patches at [1]).
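
For anybody needing a refresher, the mask-based clamp is roughly the following
(a generic C sketch of the array_index_nospec idea; on arm64 the CSDB then
follows the mask generation):

static inline unsigned long array_index_mask_nospec(unsigned long idx,
                                                    unsigned long sz)
{
        /* ~0UL when idx < sz, 0 when idx >= sz, computed without a branch */
        return ~(long)(idx | (sz - 1 - idx)) >> (BITS_PER_LONG - 1);
}

/* usage: an out-of-bounds idx is forced to 0, even under speculation */
idx &= array_index_mask_nospec(idx, sz);
val = array[idx];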

In the meantime, I wanted to correct my previous claim in case anybody else
is using that as a basis to push ahead with the bare masking approach
elsewhere for arm64.

Sorry for the confusion,

Will

[1] 
http://lists.infradead.org/pipermail/linux-arm-kernel/2018-February/557825.html


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-18 Thread Will Deacon
On Thu, Jan 18, 2018 at 08:58:08AM -0800, Dan Williams wrote:
> On Thu, Jan 18, 2018 at 5:18 AM, Will Deacon  wrote:
> > On Thu, Jan 11, 2018 at 05:41:08PM -0800, Dan Williams wrote:
> >> On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
> >>  wrote:
> >> > On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  
> >> > wrote:
> >> >>
> >> >> This series incorporates Mark Rutland's latest ARM changes and adds
> >> >> the x86 specific implementation of 'ifence_array_ptr'. That ifence
> >> >> based approach is provided as an opt-in fallback, but the default
> >> >> mitigation, '__array_ptr', uses a 'mask' approach that removes
> >> >> conditional branches instructions, and otherwise aims to redirect
> >> >> speculation to use a NULL pointer rather than a user controlled value.
> >> >
> >> > Do you have any performance numbers and perhaps example code
> >> > generation? Is this noticeable? Are there any microbenchmarks showing
> >> > the difference between lfence use and the masking model?
> >>
> >> I don't have performance numbers, but here's a sample code generation
> >> from __fcheck_files, where the 'and; lea; and' sequence is portion of
> >> array_ptr() after the mask generation with 'sbb'.
> >>
> >> fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
> >>  8e7:   8b 02                   mov    (%rdx),%eax
> >>  8e9:   48 39 c7                cmp    %rax,%rdi
> >>  8ec:   48 19 c9                sbb    %rcx,%rcx
> >>  8ef:   48 8b 42 08             mov    0x8(%rdx),%rax
> >>  8f3:   48 89 fe                mov    %rdi,%rsi
> >>  8f6:   48 21 ce                and    %rcx,%rsi
> >>  8f9:   48 8d 04 f0             lea    (%rax,%rsi,8),%rax
> >>  8fd:   48 21 c8                and    %rcx,%rax
> >>
> >>
> >> > Having both seems good for testing, but wouldn't we want to pick one in 
> >> > the end?
> >>
> >> I was thinking we'd keep it as a 'just in case' sort of thing, at
> >> least until the 'probably safe' assumption of the 'mask' approach has
> >> more time to settle out.
> >
> > From the arm64 side, the only concern I have (and this actually applies to
> > our CSDB sequence as well) is the calculation of the array size by the
> > caller. As Linus mentioned at the end of [1], if the determination of the
> > size argument is based on a conditional branch, then masking doesn't help
> > because you bound within the wrong range under speculation.
> >
> > We ran into this when trying to use masking to protect our uaccess routines
> > where the conditional bound is either KERNEL_DS or USER_DS. It's possible
> > that a prior conditional set_fs(KERNEL_DS) could defeat the masking and so
> > we'd need to throw some heavy barriers in set_fs to make it robust.
> 
> At least in the conditional mask case near set_fs() usage the approach
> we are taking is to use a barrier. I.e. the following guidance from
> Linus:
> 
> "Basically, the rule is trivial: find all 'stac' users, and use address
> masking if those users already integrate the limit check, and lfence
> they don't."
> 
> ...which translates to narrow the pointer for get_user() and use a
> barrier  for __get_user().

Great, that matches my thinking re set_fs but I'm still worried about
finding all the places where the bound is conditional for other users
of the macro. Then again, finding the places that need this macro in the
first place is tough enough so perhaps analysing the bound calculation
doesn't make it much worse.

Will


Re: [PATCH v2 00/19] prevent bounds-check bypass via speculative execution

2018-01-18 Thread Will Deacon
Hi Dan, Linus,

On Thu, Jan 11, 2018 at 05:41:08PM -0800, Dan Williams wrote:
> On Thu, Jan 11, 2018 at 5:19 PM, Linus Torvalds
>  wrote:
> > On Thu, Jan 11, 2018 at 4:46 PM, Dan Williams  
> > wrote:
> >>
> >> This series incorporates Mark Rutland's latest ARM changes and adds
> >> the x86 specific implementation of 'ifence_array_ptr'. That ifence
> >> based approach is provided as an opt-in fallback, but the default
> >> mitigation, '__array_ptr', uses a 'mask' approach that removes
> >> conditional branches instructions, and otherwise aims to redirect
> >> speculation to use a NULL pointer rather than a user controlled value.
> >
> > Do you have any performance numbers and perhaps example code
> > generation? Is this noticeable? Are there any microbenchmarks showing
> > the difference between lfence use and the masking model?
> 
> I don't have performance numbers, but here's a sample code generation
> from __fcheck_files, where the 'and; lea; and' sequence is portion of
> array_ptr() after the mask generation with 'sbb'.
> 
> fdp = array_ptr(fdt->fd, fd, fdt->max_fds);
>  8e7:   8b 02                   mov    (%rdx),%eax
>  8e9:   48 39 c7                cmp    %rax,%rdi
>  8ec:   48 19 c9                sbb    %rcx,%rcx
>  8ef:   48 8b 42 08             mov    0x8(%rdx),%rax
>  8f3:   48 89 fe                mov    %rdi,%rsi
>  8f6:   48 21 ce                and    %rcx,%rsi
>  8f9:   48 8d 04 f0             lea    (%rax,%rsi,8),%rax
>  8fd:   48 21 c8                and    %rcx,%rax
> 
> 
> > Having both seems good for testing, but wouldn't we want to pick one in the 
> > end?
> 
> I was thinking we'd keep it as a 'just in case' sort of thing, at
> least until the 'probably safe' assumption of the 'mask' approach has
> more time to settle out.

>From the arm64 side, the only concern I have (and this actually applies to
our CSDB sequence as well) is the calculation of the array size by the
caller. As Linus mentioned at the end of [1], if the determination of the
size argument is based on a conditional branch, then masking doesn't help
because you bound within the wrong range under speculation.

We ran into this when trying to use masking to protect our uaccess routines
where the conditional bound is either KERNEL_DS or USER_DS. It's possible
that a prior conditional set_fs(KERNEL_DS) could defeat the masking and so
we'd need to throw some heavy barriers in set_fs to make it robust.

The question then is whether or not we're likely to run into this elsewhere,
and I don't have a good feel for that.

Will

[1] 
http://lkml.kernel.org/r/CA+55aFz0tsreoa=5ud2nofcpng-dizlbht9wu9asyhplfjd...@mail.gmail.com


Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-01-10 Thread Will Deacon
Hi again Linus, Alexei,

On Tue, Jan 09, 2018 at 10:21:29AM +, Will Deacon wrote:
> On Mon, Jan 08, 2018 at 10:49:01AM -0800, Linus Torvalds wrote:
> > In this particular case, we should be very much aware of future CPU's
> > being more _constrained_, because CPU vendors had better start taking
> > this thing into account.
> > 
> > So the masking approach is FUNDAMENTALLY SAFER than the "let's try to
> > limit control speculation".
> > 
> > If somebody can point to a CPU that actually speculates across an
> > address masking operation, I will be very surprised. And unless you
> > can point to that, then stop trying to dismiss the masking approach.
> 
> Whilst I agree with your comments about future CPUs, this stuff is further
> out of academia than you might think. We're definitely erring on the
> belt-and-braces side of things at the moment, so let me go check what's
> *actually* been built and I suspect we'll be able to make the masking work.
> 
> Stay tuned...

I can happily confirm that there aren't any (ARM architecture) CPUs where
the masking approach is not sufficient, so there's no need to worry about
value speculation breaking this.

Will


Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-01-09 Thread Will Deacon
Hi Linus,

On Mon, Jan 08, 2018 at 10:49:01AM -0800, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 9:05 AM, Mark Rutland  wrote:
> >
> > I'm a little worried that in the presence of some CPU/compiler
> > optimisations, the masking may effectively be skipped under speculation.
> > So I'm not sure how robust this is going to be.
> 
> Honestly, I think the masking is a hell of a lot more robust than any
> of the "official" fixes.
> 
> More generic data speculation (as opposed to control speculation) is
> 
>  (a) mainly academic masturbation
> 
>  (b) hasn't even been shown to be a good idea even in _theory_ yet
> (except for the "completely unreal hardware" kind of theory where
> people assume some data oracle)
> 
>  (c) isn't actually done in any real CPU's today that I'm aware of
> (unless you want to call the return stack data speculation).
> 
> and the thing is, we should really not then worry about "... but maybe
> future CPU's will be more aggressive", which is the traditional worry
> in these kinds of cases.
> 
> Why? Because quite honestly, any future CPU's that are more aggressive
> about speculating things like this are broken shit that we should call
> out as such, and tell people not to use.
> 
> Seriously.
> 
> In this particular case, we should be very much aware of future CPU's
> being more _constrained_, because CPU vendors had better start taking
> this thing into account.
> 
> So the masking approach is FUNDAMENTALLY SAFER than the "let's try to
> limit control speculation".
> 
> If somebody can point to a CPU that actually speculates across an
> address masking operation, I will be very surprised. And unless you
> can point to that, then stop trying to dismiss the masking approach.

Whilst I agree with your comments about future CPUs, this stuff is further
out of academia than you might think. We're definitely erring on the
belt-and-braces side of things at the moment, so let me go check what's
*actually* been built and I suspect we'll be able to make the masking work.

Stay tuned...

Will


Re: [PATCH 11/31] nds32: Atomic operations

2017-11-20 Thread Will Deacon
Hi Greentime,

On Wed, Nov 08, 2017 at 01:54:59PM +0800, Greentime Hu wrote:
> From: Greentime Hu 
> 
> Signed-off-by: Vincent Chen 
> Signed-off-by: Greentime Hu 
> ---
>  arch/nds32/include/asm/futex.h|  116 
>  arch/nds32/include/asm/spinlock.h |  178 
> +
>  2 files changed, 294 insertions(+)
>  create mode 100644 arch/nds32/include/asm/futex.h
>  create mode 100644 arch/nds32/include/asm/spinlock.h

[...]

> +static inline int
> +futex_atomic_cmpxchg_inatomic(u32 * uval, u32 __user * uaddr,
> +   u32 oldval, u32 newval)
> +{
> + int ret = 0;
> + u32 val, tmp, flags;
> +
> + if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
> + return -EFAULT;
> +
> + smp_mb();
> + asm volatile ("   movi$ta, #0\n"
> +   "1: llw %1, [%6 + $ta]\n"
> +   "   sub %3, %1, %4\n"
> +   "   cmovz   %2, %5, %3\n"
> +   "   cmovn   %2, %1, %3\n"
> +   "2: scw %2, [%6 + $ta]\n"
> +   "   beqz%2, 1b\n"
> +   "3:\n   " __futex_atomic_ex_table("%7")
> +   :"+&r"(ret), "=&r"(val), "=&r"(tmp), "=&r"(flags)
> +   :"r"(oldval), "r"(newval), "r"(uaddr), "i"(-EFAULT)
> +   :"$ta", "memory");
> + smp_mb();
> +
> + *uval = val;
> + return ret;
> +}

I see you rely on asm-generic/barrier.h for your barrier definitions, which
suggests that you only need to prevent reordering by the compiler because
you're not SMP. Is that right? If so, using smp_mb() is a little weird.
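
(For a !SMP build picking up asm-generic/barrier.h, the definitions reduce to
roughly the following, so the smp_mb() above is really just a compiler
barrier -- sketch:)

#ifdef CONFIG_SMP
#define smp_mb()        mb()            /* real barrier instruction */
#else
#define smp_mb()        barrier()       /* compiler barrier only */
#endif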

What about DMA transactions? I imagine you might need some extra
instructions for the mandatory barriers there.

Also:

> +static inline void arch_spin_lock(arch_spinlock_t * lock)
> +{
> + unsigned long tmp;
> +
> + __asm__ __volatile__("1:\n"
> +  "\tllw\t%0, [%1]\n"
> +  "\tbnez\t%0, 1b\n"
> +  "\tmovi\t%0, #0x1\n"
> +  "\tscw\t%0, [%1]\n"
> +  "\tbeqz\t%0, 1b\n"
> +  :"=&r"(tmp)
> +  :"r"(&lock->lock)
> +  :"memory");
> +}

Here it looks like you're eliding an explicit barrier because you
already have a "memory" clobber. Can't you do the same for the futex code
above?

Will


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-20 Thread Will Deacon
On Thu, Oct 19, 2017 at 10:34:54PM -0700, Eric Dumazet wrote:
> On Thu, Oct 19, 2017 at 8:13 PM, Wei Wei  wrote:
> > Code: f9406680 8b01 91009000 f9800011 (885f7c01)
> > All code
> > 
> >0:   80 66 40 f9 andb   $0xf9,0x40(%rsi)
> >4:   00 00   add%al,(%rax)
> >6:   01 8b 00 90 00 91   add%ecx,-0x6eff7000(%rbx)
> >c:   11 00   adc%eax,(%rax)
> >e:   80 f9 01cmp$0x1,%cl
> >   11:   7c 5f   jl 0x72
> >   13:*  88 00   mov%al,(%rax)   <-- 
> > trapping instruction
> >   15:   00 00   add%al,(%rax)
> > ...
> >
> > Code starting with the faulting instruction
> > ===
> >0:   01 7c 5f 88 add%edi,-0x78(%rdi,%rbx,2)
> >4:   00 00   add%al,(%rax)
> > ...
> > —[ end trace 261e7ac1458ccc0a ]---
> >
> 
> I thought it was happening on arm64 ?
> 
> This is x86_64 disassembly :/

I guess they forgot the ARCH/CROSS_COMPILE env vars for decodecode. Here
you go:

Code: f9406680 8b01 91009000 f9800011 (885f7c01)
All code

   0:   f9406680        ldr     x0, [x20,#200]
   4:   8b01            add     x0, x0, x1
   8:   91009000        add     x0, x0, #0x24
   c:   f9800011        prfm    pstl1strm, [x0]
  10:*  885f7c01        ldxr    w1, [x0]        <-- trapping instruction

Code starting with the faulting instruction
===
   0:   885f7c01        ldxr    w1, [x0]

so it's faulting on the load part of an atomic rmw.
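
For reference, that's the LL/SC expansion of atomic_inc(); the rest of the
sequence after the faulting load would be along these lines:

        prfm    pstl1strm, [x0]
1:      ldxr    w1, [x0]                // faulting load
        add     w1, w1, #1
        stxr    w2, w1, [x0]
        cbnz    w2, 1b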

Will


Re: [PATCH net] xgene: Always get clk source, but ignore if it's missing for SGMII ports

2017-08-04 Thread Will Deacon
On Thu, Aug 03, 2017 at 03:43:14PM +0200, Thomas Bogendoerfer wrote:
> From: Thomas Bogendoerfer 
> 
> Even though the driver doesn't do anything with the clk source for SGMII
> ports, it needs to be enabled by doing a devm_clk_get(), if there is
> a clk source in DT.
> 
> Fixes: 0db01097cabd ('xgene: Don't fail probe, if there is no clk resource 
> for SGMII interfaces')
> Signed-off-by: Thomas Bogendoerfer 
> Tested-by: Laura Abbott 
> Acked-by: Iyappan Subramanian 
> ---
>  drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)

Thanks, this fixes a boot-time crash (below) when bringing up the interface
on my xgene board with -rc3, so:

Tested-by: Will Deacon 

Will

--->8

[8.076583] synchronous external abort: synchronous external abort 
(0x9610) at 0x0aca0010
[8.085777] internal error: : 9610 [#1] preempt smp
[8.090977] modules linked in:
[8.094020] cpu: 4 pid: 2345 comm: ip not tainted 
4.13.0-rc3-2-g93bd9a45f608 #2
[8.101638] hardware name: appliedmicro x-gene mustang board/x-gene mustang 
board, bios 3.06.25 oct 17 2016
[8.111330] task: 8003e7181c00 task.stack: 8003e5afc000
[8.117227] pc is at xgene_enet_rd_mac+0x80/0x108
[8.121909] lr is at xgene_enet_rd_mac+0x50/0x108
[8.126589] pc : [] lr : [] pstate: 
a145
[8.133948] sp : 8003e5aff570
[8.137244] x29: 8003e5aff570 x28: 0001 
[8.142533] x27: 8003eb9cc838 x26: 08663970 
[8.147820] x25: 0aca000c x24: 8003eb9cc000 
[8.153107] x23:  x22: 0aca0004 
[8.158394] x21: 8003eb9cc9f0 x20: 0aca0010 
[8.163680] x19: 000b x18: 0001 
[8.168966] x17: 000120177170 x16: 08889dc8 
[8.174253] x15:  x14: 0001e15a50b8 
[8.179539] x13:  x12: 8003e8cb0a00 
[8.184825] x11:  x10: 0004 
[8.190111] x9 :  x8 : 8003fff75468 
[8.195398] x7 : 000d7c8c x6 : 089c 
[8.200684] x5 : 0001 x4 : 8003e7181c00 
[8.205969] x3 :  x2 :  
[8.211255] x1 : 00330033 x0 : 4000 
[8.216543] process ip (pid: 2345, stack limit = 0x8003e5afc000)
[8.222865] stack: (0x8003e5aff570 to 0x8003e5b0)
[8.228582] f560:   8003e5aff5c0 
086600f4
[8.236374] f580: 0001 8003eb9cc800 0001 
8003eb9cc000
[8.244164] f5a0: 8003eca16a2e 8003ec93a410 08acdf08 
0863b684
[8.251955] f5c0: 8003e5aff5f0 086601f4 8003eb9cc800 

[8.259745] f5e0:  08663ce0 8003e5aff600 
08663cf0
[8.267536] f600: 8003e5aff670 088aa3b8 8003eb9cc000 
1003
[8.275326] f620: 08ace018 8003eb9cc048 1002 
8003e5aff8e0
[8.283116] f640: 08ace018 8003e8dd8010 8003e5affb78 

[8.290907] f660: 08663000 08d185f0 8003e5aff6b0 
088aa6b8
[8.298697] f680: 8003eb9cc000 1003 0001 

[8.306488] f6a0: 8003eb9cc000 8003eb9cc000 8003e5aff6f0 
088aa798
[8.314278] f6c0: 8003eb9cc000 1002  

[8.322069] f6e0: 8003e6868000 002b 8003e5aff720 
088bc168
[8.329859] f700: 8003e8dd8000 8003eb9cc000  
8003e5aff968
[8.337650] f720: 8003e5aff840 088be918 08d5e000 
8003eb9cc000
[8.345441] f740:  8003e5affb78  

[8.353231] f760: 8003e5aff840 08b159c0 8003e8dd8020 
081b3a90
[8.361022] f780: 8003e7037b80 7e000f9b0e00 00e80043e6c38f53 
8003ecf49450
[8.368812] f7a0: 8003e7037b80 0001 8003e5aff7d0 
081bf0cc
[8.376603] f7c0: 7e000f9b0e00 081b39c0 8003e5aff7e0 
0818ef70
[8.384393] f7e0: 8003e5aff800 081b3a9c 8003e5aff840 
088be3e0
[8.392184] f800: 8003e8dd8000 081b44ec  
8003e5affb78
[8.399974] f820: 8003e5aff840 088be438 8003e8dd8000 
0858169c
[8.407765] f840: 8003e5affad0 088bec24  
8003e8dd8000
[8.41] f860: 8003e5affb78  8003e6868000 

[8.423346] f880:  0008 089b1000 
8003e7181c00
[8.431136] f8a0: 8003e5aff8c0 8003e6868000  
08fc7e80
[8.438927] f8c0: 08b15a80 8003e8dd8010 800

Re: [PATCH v2 0/9] Remove spin_unlock_wait()

2017-07-06 Thread Will Deacon
On Thu, Jul 06, 2017 at 06:50:36PM +0200, Peter Zijlstra wrote:
> On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 02:12:24PM +, David Laight wrote:
> > > > From: Paul E. McKenney
> > 
> > [ . . . ]
> > 
> > > Now on the one hand I feel like Oleg that it would be a shame to loose
> > > the optimization, OTOH this thing is really really tricky to use,
> > > and has lead to a number of bugs already.
> > 
> > I do agree, it is a bit sad to see these optimizations go.  So, should
> > this make mainline, I will be tagging the commits that spin_unlock_wait()
> > so that they can be easily reverted should someone come up with good
> > semantics and a compelling use case with compelling performance benefits.
> 
> Ha!, but what would constitute 'good semantics' ?
> 
> The current thing is something along the lines of:
> 
>   "Waits for the currently observed critical section
>to complete with ACQUIRE ordering such that it will observe
>whatever state was left by said critical section."
> 
> With the 'obvious' benefit of limited interference on those actually
> wanting to acquire the lock, and a shorter wait time on our side too,
> since we only need to wait for completion of the current section, and
> not for however many contender are before us.
> 
> Not sure I have an actual (micro) benchmark that shows a difference
> though.
> 
> 
> 
> Is this all good enough to retain the thing, I dunno. Like I said, I'm
> conflicted on the whole thing. On the one hand its a nice optimization,
> on the other hand I don't want to have to keep fixing these bugs.

As I've said, I'd be keen to see us drop this and bring it back if/when we
get a compelling use-case along with performance numbers. At that point,
we'd be in a better position to define the semantics anyway, knowing what
exactly is expected by the use-case.
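
(The conversions in the series essentially reduce to the degenerate pattern
below, which has well-defined acquire/release semantics at the cost of
actually taking the lock:)

        spin_lock(&lock);       /* waits for any current holder to drop the lock */
        spin_unlock(&lock);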

Will


Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Will Deacon
On Mon, Jul 03, 2017 at 09:40:22AM -0700, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
>  wrote:
> >
> > Agreed, and my next step is to look at spin_lock() followed by
> > spin_is_locked(), not necessarily the same lock.
> 
> Hmm. Most (all?) "spin_is_locked()" really should be about the same
> thread that took the lock (ie it's about asserts and lock debugging).
> 
> The optimistic ABBA avoidance pattern for spinlocks *should* be
> 
> spin_lock(inner)
> ...
> if (!try_lock(outer)) {
>spin_unlock(inner);
>.. do them in the right order ..
> 
> so I don't think spin_is_locked() should have any memory barriers.
> 
> In fact, the core function for spin_is_locked() is arguably
> arch_spin_value_unlocked() which doesn't even do the access itself.

Yeah, but there's some spaced-out stuff going on in kgdb_cpu_enter where
it looks to me like raw_spin_is_locked is used for synchronization. My
eyes are hurting looking at it, though.

Will


Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Will Deacon
On Fri, Jun 30, 2017 at 03:18:40PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 30, 2017 at 02:13:39PM +0100, Will Deacon wrote:
> > On Fri, Jun 30, 2017 at 05:38:15AM -0700, Paul E. McKenney wrote:
> > > I also need to check all uses of spin_is_locked().  There might no
> > > longer be any that rely on any particular ordering...
> > 
> > Right. I think we're looking for the "insane case" as per 38b850a73034
> > (which was apparently used by ipc/sem.c at the time, but no longer).
> > 
> > There's a usage in kernel/debug/debug_core.c, but it doesn't fill me with
> > joy.
> 
> That is indeed an interesting one...  But my first round will be what
> semantics the implementations seem to provide:
> 
> Acquire courtesy of TSO: s390, sparc, x86.
> Acquire: ia64 (in reality fully ordered).
> Control dependency: alpha, arc, arm, blackfin, hexagon, m32r, mn10300, tile,
>   xtensa.
> Control dependency plus leading full barrier: arm64, powerpc.
> UP-only: c6x, cris, frv, h8300, m68k, microblaze nios2, openrisc, um, 
> unicore32.
> 
> Special cases:
>   metag: Acquire if !CONFIG_METAG_SMP_WRITE_REORDERING.
>  Otherwise control dependency?
>   mips: Control dependency, acquire if CONFIG_CPU_CAVIUM_OCTEON.
>   parisc: Acquire courtesy of TSO, but why barrier in smp_load_acquire?
>   sh: Acquire if one of SH4A, SH5, or J2, otherwise acquire?  UP-only?
> 
> Are these correct, or am I missing something with any of them?

That looks about right but, at least on ARM, I think we have to consider
the semantics of spin_is_locked with respect to the other spin_* functions,
rather than in isolation.

For example, ARM only has a control dependency, but spin_lock has a trailing
smp_mb() and spin_unlock has both leading and trailing smp_mb().

Will


Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-06-30 Thread Will Deacon
On Fri, Jun 30, 2017 at 05:38:15AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 30, 2017 at 10:19:29AM +0100, Will Deacon wrote:
> > On Thu, Jun 29, 2017 at 05:01:16PM -0700, Paul E. McKenney wrote:
> > > There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> > > and it appears that all callers could do just as well with a lock/unlock
> > > pair.  This commit therefore removes spin_unlock_wait() and related
> > > definitions from core code.
> > > 
> > > Signed-off-by: Paul E. McKenney 
> > > Cc: Arnd Bergmann 
> > > Cc: Ingo Molnar 
> > > Cc: Will Deacon 
> > > Cc: Peter Zijlstra 
> > > Cc: Alan Stern 
> > > Cc: Andrea Parri 
> > > Cc: Linus Torvalds 
> > > ---
> > >  include/asm-generic/qspinlock.h |  14 -
> > >  include/linux/spinlock.h|  31 ---
> > >  include/linux/spinlock_up.h |   6 ---
> > >  kernel/locking/qspinlock.c  | 117 
> > > 
> > >  4 files changed, 168 deletions(-)
> > 
> > [...]
> > 
> > > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> > > index b2caec7315af..64a9051e4c2c 100644
> > > --- a/kernel/locking/qspinlock.c
> > > +++ b/kernel/locking/qspinlock.c
> > > @@ -267,123 +267,6 @@ static __always_inline u32  
> > > __pv_wait_head_or_lock(struct qspinlock *lock,
> > >  #define queued_spin_lock_slowpathnative_queued_spin_lock_slowpath
> > >  #endif
> > >  
> > > -/*
> > > - * Various notes on spin_is_locked() and spin_unlock_wait(), which are
> > > - * 'interesting' functions:
> > > - *
> > > - * PROBLEM: some architectures have an interesting issue with atomic 
> > > ACQUIRE
> > > - * operations in that the ACQUIRE applies to the LOAD _not_ the STORE 
> > > (ARM64,
> > > - * PPC). Also qspinlock has a similar issue per construction, the 
> > > setting of
> > > - * the locked byte can be unordered acquiring the lock proper.
> > > - *
> > > - * This gets to be 'interesting' in the following cases, where the 
> > > /should/s
> > > - * end up false because of this issue.
> > > - *
> > > - *
> > > - * CASE 1:
> > > - *
> > > - * So the spin_is_locked() correctness issue comes from something like:
> > > - *
> > > - *   CPU0CPU1
> > > - *
> > > - *   global_lock();  local_lock(i)
> > > - * spin_lock(&G)   spin_lock(&L[i])
> > > - * for (i) if (!spin_is_locked(&G)) {
> > > - *   spin_unlock_wait(&L[i]);
> > > smp_acquire__after_ctrl_dep();
> > > - *   return;
> > > - * }
> > > - * // deal with fail
> > > - *
> > > - * Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
> > > - * that there is exclusion between the two critical sections.
> > > - *
> > > - * The load from spin_is_locked(&G) /should/ be constrained by the 
> > > ACQUIRE from
> > > - * spin_lock(&L[i]), and similarly the load(s) from 
> > > spin_unlock_wait(&L[i])
> > > - * /should/ be constrained by the ACQUIRE from spin_lock(&G).
> > > - *
> > > - * Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.
> > 
> > Might be worth keeping this comment about spin_is_locked, since we're not
> > removing that guy just yet!
> 
> Ah, all the examples had spin_unlock_wait() in them.  So what I need to
> do is to create a spin_unlock_wait()-free example to illustrate the
> text starting with "The load from spin_is_locked(", correct?

Yeah, I think so.

> I also need to check all uses of spin_is_locked().  There might no
> longer be any that rely on any particular ordering...

Right. I think we're looking for the "insane case" as per 38b850a73034
(which was apparently used by ipc/sem.c at the time, but no longer).

There's a usage in kernel/debug/debug_core.c, but it doesn't fill me with
joy.

Will


Re: [PATCH RFC 12/26] arm64: Remove spin_unlock_wait() arch-specific definitions

2017-06-30 Thread Will Deacon
On Thu, Jun 29, 2017 at 05:01:20PM -0700, Paul E. McKenney wrote:
> There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> and it appears that all callers could do just as well with a lock/unlock
> pair.  This commit therefore removes the underlying arch-specific
> arch_spin_unlock_wait().
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: 
> Cc: Peter Zijlstra 
> Cc: Alan Stern 
> Cc: Andrea Parri 
> Cc: Linus Torvalds 
> ---
>  arch/arm64/include/asm/spinlock.h | 58 
> ---
>  1 file changed, 5 insertions(+), 53 deletions(-)

I'm going to miss this code.

Acked-by: Will Deacon 

Will


Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-06-30 Thread Will Deacon
On Thu, Jun 29, 2017 at 05:01:16PM -0700, Paul E. McKenney wrote:
> There is no agreed-upon definition of spin_unlock_wait()'s semantics,
> and it appears that all callers could do just as well with a lock/unlock
> pair.  This commit therefore removes spin_unlock_wait() and related
> definitions from core code.
> 
> Signed-off-by: Paul E. McKenney 
> Cc: Arnd Bergmann 
> Cc: Ingo Molnar 
> Cc: Will Deacon 
> Cc: Peter Zijlstra 
> Cc: Alan Stern 
> Cc: Andrea Parri 
> Cc: Linus Torvalds 
> ---
>  include/asm-generic/qspinlock.h |  14 -
>  include/linux/spinlock.h|  31 ---
>  include/linux/spinlock_up.h |   6 ---
>  kernel/locking/qspinlock.c  | 117 
> 
>  4 files changed, 168 deletions(-)

[...]

> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index b2caec7315af..64a9051e4c2c 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -267,123 +267,6 @@ static __always_inline u32  
> __pv_wait_head_or_lock(struct qspinlock *lock,
>  #define queued_spin_lock_slowpathnative_queued_spin_lock_slowpath
>  #endif
>  
> -/*
> - * Various notes on spin_is_locked() and spin_unlock_wait(), which are
> - * 'interesting' functions:
> - *
> - * PROBLEM: some architectures have an interesting issue with atomic ACQUIRE
> - * operations in that the ACQUIRE applies to the LOAD _not_ the STORE (ARM64,
> - * PPC). Also qspinlock has a similar issue per construction, the setting of
> - * the locked byte can be unordered acquiring the lock proper.
> - *
> - * This gets to be 'interesting' in the following cases, where the /should/s
> - * end up false because of this issue.
> - *
> - *
> - * CASE 1:
> - *
> - * So the spin_is_locked() correctness issue comes from something like:
> - *
> - *   CPU0CPU1
> - *
> - *   global_lock();  local_lock(i)
> - * spin_lock(&G)   spin_lock(&L[i])
> - * for (i) if (!spin_is_locked(&G)) {
> - *   spin_unlock_wait(&L[i]);smp_acquire__after_ctrl_dep();
> - *   return;
> - * }
> - * // deal with fail
> - *
> - * Where it is important CPU1 sees G locked or CPU0 sees L[i] locked such
> - * that there is exclusion between the two critical sections.
> - *
> - * The load from spin_is_locked(&G) /should/ be constrained by the ACQUIRE 
> from
> - * spin_lock(&L[i]), and similarly the load(s) from spin_unlock_wait(&L[i])
> - * /should/ be constrained by the ACQUIRE from spin_lock(&G).
> - *
> - * Similarly, later stuff is constrained by the ACQUIRE from CTRL+RMB.

Might be worth keeping this comment about spin_is_locked, since we're not
removing that guy just yet!

Will


Re: [PATCH net] bpf, arm64: use separate register for state in stxr

2017-06-07 Thread Will Deacon
Hi Daniel,

On Wed, Jun 07, 2017 at 01:45:37PM +0200, Daniel Borkmann wrote:
> Will reported that in BPF_XADD we must use a different register in stxr
> instruction for the status flag due to otherwise CONSTRAINED UNPREDICTABLE
> behavior per architecture. Reference manual says [1]:
> 
>   If s == t, then one of the following behaviors must occur:
> 
>* The instruction is UNDEFINED.
>* The instruction executes as a NOP.
>* The instruction performs the store to the specified address, but
>  the value stored is UNKNOWN.
> 
> Thus, use a different temporary register for the status flag to fix it.
> 
> Disassembly extract from test 226/STX_XADD_DW from test_bpf.ko:
> 
>   [...]
>   003c:  c85f7d4b  ldxr x11, [x10]
>   0040:  8b07016b  add x11, x11, x7
>   0044:  c80c7d4b  stxr w12, x11, [x10]
>   0048:  35ac  cbnz w12, 0x003c
>   [...]
> 
>   [1] https://static.docs.arm.com/ddi0487/b/DDI0487B_a_armv8_arm.pdf, p.6132
> 
> Fixes: 85f68fe89832 ("bpf, arm64: implement jiting of BPF_XADD")
> Reported-by: Will Deacon 
> Signed-off-by: Daniel Borkmann 
> ---
>  arch/arm64/net/bpf_jit_comp.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Cheers for fixing this up:

Acked-by: Will Deacon 

Will


Re: [PATCH net-next] bpf, arm64: implement jiting of BPF_XADD

2017-06-02 Thread Will Deacon
Hi Daniel,

[sorry, only just noticed that was queued]

On Mon, May 01, 2017 at 02:57:20AM +0200, Daniel Borkmann wrote:
> This work adds BPF_XADD for BPF_W/BPF_DW to the arm64 JIT and therefore
> completes JITing of all BPF instructions, meaning we can thus also remove
> the 'notyet' label and do not need to fall back to the interpreter when
> BPF_XADD is used in a program!
> 
> This now also brings arm64 JIT in line with x86_64, s390x, ppc64, sparc64,
> where all current eBPF features are supported.
> 
> BPF_W example from test_bpf:
> 
>   .u.insns_int = {
> BPF_ALU32_IMM(BPF_MOV, R0, 0x12),
> BPF_ST_MEM(BPF_W, R10, -40, 0x10),
> BPF_STX_XADD(BPF_W, R10, R0, -40),
> BPF_LDX_MEM(BPF_W, R0, R10, -40),
> BPF_EXIT_INSN(),
>   },
> 
>   [...]
>   0020:  52800247  mov w7, #0x12 // #18
>   0024:  928004eb  mov x11, #0xffd8 // #-40
>   0028:  d280020a  mov x10, #0x10 // #16
>   002c:  b82b6b2a  str w10, [x25,x11]
>   // start of xadd mapping:
>   0030:  928004ea  mov x10, #0xffd8 // #-40
>   0034:  8b19014a  add x10, x10, x25
>   0038:  f9800151  prfm pstl1strm, [x10]
>   003c:  885f7d4b  ldxr w11, [x10]
>   0040:  0b07016b  add w11, w11, w7
>   0044:  880b7d4b  stxr w11, w11, [x10]

This form of STXR (where s == t) is CONSTRAINED UNPREDICTABLE per the
architecture; you need to use separate registers for the data and the
status flag. You might also be interested in the atomic instructions
introduced in ARMv8.1, which includes the LDADD instruction. You can
check elf_hwcap & HWCAP_ATOMICS to see if it's supported.
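
A sketch of how the JIT could key off that (the emit_* helpers below are
made-up placeholders, not existing functions):

static void emit_xadd(const int dst, const int src, struct jit_ctx *ctx)
{
        if (elf_hwcap & HWCAP_ATOMICS)
                emit_lse_ldadd(dst, src, ctx);  /* single LDADD, no retry loop */
        else
                emit_llsc_xadd(dst, src, ctx);  /* LDXR/ADD/STXR loop, status register
                                                   distinct from the data register */
}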

Also, did we get a conclusion on the barrier semantics for this? Currently
you don't have any here: is that ok?

Will


Re: Re: [PATCH net-next 1/4] ixgbe: sparc: rename the ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER

2017-04-24 Thread Will Deacon
On Wed, Apr 19, 2017 at 02:46:19PM +, Gabriele Paoloni wrote:
> > From: Amir Ancel [mailto:am...@mellanox.com]
> > Sent: 18 April 2017 21:18
> > To: David Laight; Gabriele Paoloni; da...@davemloft.net
> > Cc: Catalin Marinas; Will Deacon; Mark Rutland; Robin Murphy;
> > jeffrey.t.kirs...@intel.com; alexander.du...@gmail.com; linux-arm-
> > ker...@lists.infradead.org; netdev@vger.kernel.org; Dingtianhong;
> > Linuxarm
> > Subject: Re: Re: [PATCH net-next 1/4] ixgbe: sparc: rename the
> > ARCH_WANT_RELAX_ORDER to IXGBE_ALLOW_RELAXED_ORDER
> > 
> > Hi,
> > mlx5 driver is planned to have RO support this year.
> > I believe drivers should be able to query whether the arch supports it
> 
> I guess that here when you say query you mean having a config symbol
> that is set accordingly to the host architecture, right?
> 
> As already said I have looked around a bit and other drivers do not seem
> to enable/disable RO for their EP on the basis of the host architecture.
> So why should mlx5 do it according to the host?
> 
> Also my understating is that some architectures (like ARM64 for example)
> can have different PCI host controller implementations depending on the
> vendor...therefore maybe it is not appropriate there to have a Kconfig
> symbol selected by the architecture...  

Indeed. We're not able to determine whether or not RO is supported at
compile time, so we'd have to detect this dynamically if we want to support
it for arm64 with a single kernel Image. That means either passing something
through firmware, having the PCI host controller opt-in or something coarse
like a command-line option.

Will


Re: [PATCH 4/4] net: smsc911x: Move interrupt allocation to open/stop

2016-09-02 Thread Will Deacon
On Thu, Sep 01, 2016 at 03:15:09PM -0500, Jeremy Linton wrote:
> The /proc/irq/xx information is incorrect for smsc911x because
> the request_irq is happening before the register_netdev has the
> proper device name. Moving it to the open also fixes the case
> of when the device is renamed.
> 
> Reported-by: Will Deacon 
> Signed-off-by: Jeremy Linton 
> ---
>  drivers/net/ethernet/smsc/smsc911x.c | 47 
> ++--
>  1 file changed, 18 insertions(+), 29 deletions(-)

FWIW, this fixes the interface naming under /proc//ethX for me:

Tested-by: Will Deacon 

Will


Re: [PATCH net-next 2/3] arm64: bpf: optimize JMP_CALL

2016-06-07 Thread Will Deacon
On Mon, Jun 06, 2016 at 09:36:03PM -0700, Z Lim wrote:
> On Mon, Jun 6, 2016 at 10:05 AM, Will Deacon  wrote:
> > On Sat, Jun 04, 2016 at 03:00:29PM -0700, Zi Shen Lim wrote:
> >> Remove superfluous stack frame, saving us 3 instructions for
> >> every JMP_CALL.
> >>
> >> Signed-off-by: Zi Shen Lim 
> >> ---
> >>  arch/arm64/net/bpf_jit_comp.c | 3 ---
> >>  1 file changed, 3 deletions(-)
> >>
> >> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> >> index 51abc97..7ae304e 100644
> >> --- a/arch/arm64/net/bpf_jit_comp.c
> >> +++ b/arch/arm64/net/bpf_jit_comp.c
> >> @@ -578,11 +578,8 @@ emit_cond_jmp:
> >>   const u64 func = (u64)__bpf_call_base + imm;
> >>
> >>   emit_a64_mov_i64(tmp, func, ctx);
> >> - emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
> >> - emit(A64_MOV(1, A64_FP, A64_SP), ctx);
> >>   emit(A64_BLR(tmp), ctx);
> >>   emit(A64_MOV(1, r0, A64_R(0)), ctx);
> >> - emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
> >>   break;
> >>   }
> >
> > Is the jitted code intended to be unwindable by standard tools?
> 
> Before this patch:
> bpf_prologue => push stack frame
> ...
> jmp_call => push stack frame, call bpf_helper*, pop stack frame
> ...
> bpf_epilogue => pop stack frame, ret
> 
> Now:
> bpf_prologue => push stack frame
> ...
> jmp_call => call bpf_helper*
> ...
> bpf_epilogue => pop stack frame, ret
> 
> *Note: bpf_helpers in kernel/bpf/helper.c
> 
> So yes, it's still unwindable.

Sure, I'm not disputing that. I just wondered whether or not it needs to
be unwindable at all...

Will


Re: [PATCH net-next 2/3] arm64: bpf: optimize JMP_CALL

2016-06-06 Thread Will Deacon
On Sat, Jun 04, 2016 at 03:00:29PM -0700, Zi Shen Lim wrote:
> Remove superfluous stack frame, saving us 3 instructions for
> every JMP_CALL.
> 
> Signed-off-by: Zi Shen Lim 
> ---
>  arch/arm64/net/bpf_jit_comp.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 51abc97..7ae304e 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -578,11 +578,8 @@ emit_cond_jmp:
>   const u64 func = (u64)__bpf_call_base + imm;
>  
>   emit_a64_mov_i64(tmp, func, ctx);
> - emit(A64_PUSH(A64_FP, A64_LR, A64_SP), ctx);
> - emit(A64_MOV(1, A64_FP, A64_SP), ctx);
>   emit(A64_BLR(tmp), ctx);
>   emit(A64_MOV(1, r0, A64_R(0)), ctx);
> - emit(A64_POP(A64_FP, A64_LR, A64_SP), ctx);
>   break;
>   }

Is the jitted code intended to be unwindable by standard tools?

Will


Re: [PATCH] arm64: bpf: jit JMP_JSET_{X,K}

2016-05-13 Thread Will Deacon
On Fri, May 13, 2016 at 10:57:18AM +0100, Will Deacon wrote:
> On Thu, May 12, 2016 at 11:37:58PM -0700, Zi Shen Lim wrote:
> > Original implementation commit e54bcde3d69d ("arm64: eBPF JIT compiler")
> > had the relevant code paths, but due to an oversight always fail jiting.
> > 
> > As a result, we had been falling back to BPF interpreter whenever a BPF
> > program has JMP_JSET_{X,K} instructions.
> > 
> > With this fix, we confirm that the corresponding tests in lib/test_bpf
> > continue to pass, and also jited.
> > 
> > ...
> > [2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
> > [2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
> > [2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
> > ...
> > [3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 
> > 110 PASS
> > [3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0x) return 1 
> > jited:1 98 PASS
> > [3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 
> > 120 PASS
> > [3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0x) return 1 
> > jited:1 89 PASS
> > ...
> > 
> > Fixes: e54bcde3d69d ("arm64: eBPF JIT compiler")
> > Signed-off-by: Zi Shen Lim 
> > ---
> >  arch/arm64/net/bpf_jit_comp.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> > index 031ed08..d0d5190 100644
> > --- a/arch/arm64/net/bpf_jit_comp.c
> > +++ b/arch/arm64/net/bpf_jit_comp.c
> > @@ -478,6 +478,7 @@ emit_cond_jmp:
> > case BPF_JGE:
> > jmp_cond = A64_COND_CS;
> > break;
> > +   case BPF_JSET:
> > case BPF_JNE:
> > jmp_cond = A64_COND_NE;
> > break;
> 
> Are you sure about this? filter.txt says:
> 
> jne  - Jump on K != A
> ...
> jset - Jump on k & A
> 
> so it looks weird wiring them both to the same thing. I'm not sure you
> can express this as a simple CMP + B..

Ah, sorry, I see how this works now. I overlooked the BPF_JMP | BPF_JSET |
BPF_X case emitting a TST instruction. In which case:

Acked-by: Will Deacon 
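
(For reference, the reason the NE condition covers both: the BPF_X path emits
TST rather than CMP, so the flags come from the AND -- roughly, with
illustrative register names:)

        JNE:    cmp   x_dst, x_src      // flags from (dst - src)
                b.ne  <target>          // taken when dst != src
        JSET:   tst   x_dst, x_src      // flags from (dst & src)
                b.ne  <target>          // taken when (dst & src) != 0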

I'm assuming David will queue this?

Will


Re: [PATCH] arm64: bpf: jit JMP_JSET_{X,K}

2016-05-13 Thread Will Deacon
On Thu, May 12, 2016 at 11:37:58PM -0700, Zi Shen Lim wrote:
> Original implementation commit e54bcde3d69d ("arm64: eBPF JIT compiler")
> had the relevant code paths, but due to an oversight always fail jiting.
> 
> As a result, we had been falling back to BPF interpreter whenever a BPF
> program has JMP_JSET_{X,K} instructions.
> 
> With this fix, we confirm that the corresponding tests in lib/test_bpf
> continue to pass, and also jited.
> 
> ...
> [2.784553] test_bpf: #30 JSET jited:1 188 192 197 PASS
> [2.791373] test_bpf: #31 tcpdump port 22 jited:1 325 677 625 PASS
> [2.808800] test_bpf: #32 tcpdump complex jited:1 323 731 991 PASS
> ...
> [3.190759] test_bpf: #237 JMP_JSET_K: if (0x3 & 0x2) return 1 jited:1 110 
> PASS
> [3.192524] test_bpf: #238 JMP_JSET_K: if (0x3 & 0x) return 1 
> jited:1 98 PASS
> [3.211014] test_bpf: #249 JMP_JSET_X: if (0x3 & 0x2) return 1 jited:1 120 
> PASS
> [3.212973] test_bpf: #250 JMP_JSET_X: if (0x3 & 0x) return 1 
> jited:1 89 PASS
> ...
> 
> Fixes: e54bcde3d69d ("arm64: eBPF JIT compiler")
> Signed-off-by: Zi Shen Lim 
> ---
>  arch/arm64/net/bpf_jit_comp.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
> index 031ed08..d0d5190 100644
> --- a/arch/arm64/net/bpf_jit_comp.c
> +++ b/arch/arm64/net/bpf_jit_comp.c
> @@ -478,6 +478,7 @@ emit_cond_jmp:
>   case BPF_JGE:
>   jmp_cond = A64_COND_CS;
>   break;
> + case BPF_JSET:
>   case BPF_JNE:
>   jmp_cond = A64_COND_NE;
>   break;

Are you sure about this? filter.txt says:

jne  - Jump on K != A
...
jset - Jump on k & A

so it looks weird wiring them both to the same thing. I'm not sure you
can express this as a simple CMP + B..

Will


Re: [RESEND PATCH] arm64: bpf: add 'store immediate' instruction

2015-12-02 Thread Will Deacon
On Tue, Dec 01, 2015 at 02:20:40PM -0800, Shi, Yang wrote:
> On 11/30/2015 2:24 PM, Yang Shi wrote:
> >aarch64 doesn't have native store immediate instruction, such operation
> >has to be implemented by the below instruction sequence:
> >
> >Load immediate to register
> >Store register
> >
> >Signed-off-by: Yang Shi 
> >CC: Zi Shen Lim 
> 
> Had email exchange offline with Zi Shen Lim since he is traveling and cannot
> send text-only mail, quoted below for his reply:
> 
> "I've given reviewed-by in response to original posting. Unless something
> has changed, feel free to add it."
> 
> Since there is nothing changed, added his reviewed-by.
> 
> Reviewed-by: Zi Shen Lim 

I assume David will take this via netdev.

Will


Re: [PATCH] net: tulip: turn compile-time warning into dev_warn()

2015-11-19 Thread Will Deacon
On Thu, Nov 19, 2015 at 11:42:26AM +0100, Arnd Bergmann wrote:
> The tulip driver causes annoying build-time warnings for allmodconfig
> builds for all recent architectures:
> 
> dec/tulip/winbond-840.c:910:2: warning: #warning Processor architecture 
> undefined
> dec/tulip/tulip_core.c:101:2: warning: #warning Processor architecture 
> undefined!
> 
> This is the last remaining warning for arm64, and I'd like to get rid of
> it. We don't really know the cache line size, architecturally it would
> be at least 16 bytes, but all implementations I found have 64 or 128
> bytes. Configuring tulip for 32-byte lines as we do on ARM32 seems to
> be the safe but slow default, and nobody who cares about performance these
> days would use a tulip chip anyway, so we can just use that.
> 
> To save the next person the job of trying to find out what this is for
> and picking a default for their architecture just to kill off the warning,
> I'm now removing the preprocessor #warning and turning it into a pr_warn
> or dev_warn that prints the equivalent information when the driver gets
> loaded.
> 
> Signed-off-by: Arnd Bergmann 
> 
> diff --git a/drivers/net/ethernet/dec/tulip/tulip_core.c 
> b/drivers/net/ethernet/dec/tulip/tulip_core.c
> index ed41559bae77..b553409e04ad 100644
> --- a/drivers/net/ethernet/dec/tulip/tulip_core.c
> +++ b/drivers/net/ethernet/dec/tulip/tulip_core.c
> @@ -98,8 +98,7 @@ static int csr0 = 0x01A0 | 0x4800;
>  #elif defined(__mips__)
>  static int csr0 = 0x0020 | 0x4000;
>  #else
> -#warning Processor architecture undefined!
> -static int csr0 = 0x00A0 | 0x4800;
> +static int csr0;
>  #endif
>  
>  /* Operational parameters that usually are not changed. */
> @@ -1982,6 +1981,12 @@ static int __init tulip_init (void)
>   pr_info("%s", version);
>  #endif
>  
> + if (!csr0) {
> + pr_warn("tulip: unknown CPU architecture, using default 
> csr0\n");
> + /* default to 8 longword cache line alignment */
> + csr0 = 0x00A0 | 0x4800;

Maybe print "defaulting to 8 longword cache line alignment" instead of
"default csr0"?

> diff --git a/drivers/net/ethernet/dec/tulip/winbond-840.c 
> b/drivers/net/ethernet/dec/tulip/winbond-840.c
> index 9beb3d34d4ba..3c0e4d5c5fef 100644
> --- a/drivers/net/ethernet/dec/tulip/winbond-840.c
> +++ b/drivers/net/ethernet/dec/tulip/winbond-840.c
> @@ -907,7 +907,7 @@ static void init_registers(struct net_device *dev)
>  #elif defined(CONFIG_SPARC) || defined (CONFIG_PARISC) || defined(CONFIG_ARM)
>   i |= 0x4800;
>  #else
> -#warning Processor architecture undefined
> + dev_warn(&dev->dev, "unknown CPU architecture, using default csr0 
> setting\n");
>   i |= 0x4800;

Then we could print the default csr0 value here.

But, to be honest, this patch fixes a #warning on arm64 for a driver that
I never expect to be used. So whatever you do to silence it:

  Acked-by: Will Deacon 

/me waits for on-soc tulip integration.

Will


Re: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-18 Thread Will Deacon
On Wed, Nov 18, 2015 at 03:21:19PM +, David Laight wrote:
> From: Will Deacon
> > Sent: 18 November 2015 12:28
> > On Wed, Nov 18, 2015 at 12:11:25PM +, David Laight wrote:
> > > From: Will Deacon
> > > >   
> > > > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/386094.html
> > >
> > > That patch forces a memory write-read and returns uninitialised stack
> > > for short reads.
> > 
> > Really? The disassembly looks fine to me. Do you have a concrete example
> > of where you think it goes wrong, please?
> > 
> > > Who knows what happens on big-endian systems.
> > 
> > The same thing as READ_ONCE? I'll test it there to make sure, but I
> > don't see a problem.
> 
> Ah, god, it is absolutely horrid. But probably right :-(

Yeah, I wasn't pretending it was nice :) FWIW, I've given it a reasonable
testing in both little-endian and big-endian configurations and it seems
to be happy.

> Do all the lda variants zero extend to 64 bits ?

Yes.

> If so maybe you could use a single 64 bit variable for the result of the read
> and then cast it to typeof(*p) to get the required sign extension for
> small integer types.

That was the original proposal from Arnd, but I want this to work with
structures smaller than 64-bit (e.g. arch_spinlock_t), so that's why
I decided to follow the approach laid down by READ_ONCE.
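
A minimal sketch of what that READ_ONCE-style union trick can look like for
the acquire case (illustrative only; cases 1 and 2 omitted, the patch linked
elsewhere in this thread is the real thing):

	#define smp_load_acquire(p)						\
	({									\
		union { typeof(*p) __val; char __c[1]; } __u;			\
		compiletime_assert_atomic_type(*p);				\
		switch (sizeof(*p)) {						\
		case 4:								\
			asm volatile ("ldar %w0, %1"				\
				: "=r" (*(__u32 *)__u.__c)			\
				: "Q" (*p) : "memory");				\
			break;							\
		case 8:								\
			asm volatile ("ldar %0, %1"				\
				: "=r" (*(__u64 *)__u.__c)			\
				: "Q" (*p) : "memory");				\
			break;							\
		}								\
		__u.__val;							\
	})

The asm output operand is the union's char member (cast to the right width),
so it is never const-qualified even when *p is, and the typed __val member is
what the expression yields, which also keeps sub-64-bit structure types
working.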

Will


Re: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-18 Thread Will Deacon
On Wed, Nov 18, 2015 at 12:11:25PM +, David Laight wrote:
> From: Will Deacon
> > Sent: 18 November 2015 10:14
> > On Tue, Nov 17, 2015 at 08:17:17PM +0100, Arnd Bergmann wrote:
> > > On Tuesday 17 November 2015 17:12:37 Will Deacon wrote:
> > > > On Tue, Nov 17, 2015 at 06:03:40PM +0100, Arnd Bergmann wrote:
> > > > > On Tuesday 17 November 2015 16:44:53 Will Deacon wrote:
> > > > > > > 8<
> > > > > > > Subject: ARM64: make smp_load_acquire() work with const arguments
> > > > > > >
> > > > > > > smp_load_acquire() uses typeof() to declare a local variable for 
> > > > > > > temporarily
> > > > > > > storing the output of the memory access. This fails when the 
> > > > > > > argument is
> > > > > > > constant, because the assembler complains about using a constant 
> > > > > > > register
> > > > > > > as output:
> > > > > > >
> > > > > > >  arch/arm64/include/asm/barrier.h:71:3: error: read-only variable 
> > > > > > > '___p1'
> > > > > > >  used as 'asm' output
> > > > > >
> > > > > > Do you know the usage in the kernel causing this warning?
> > > > >
> > > > > A newly introduced function in include/net/sock.h:
> > > > >
> > > > > static inline int sk_state_load(const struct sock *sk)
> > > > > {
> > > > > return smp_load_acquire(&sk->sk_state);
> > > > > }
> > > >
> > > > Hmm, maybe we could play a similar trick to READ_ONCE by declaring an
> > > > anonymous union and writing through the non-const member?
> > >
> > > Yes, I think that would work, if you think we need to care about the
> > > case where we read into a structure.
> > >
> > > Can you come up with a patch for that?
> > 
> > Done:
> > 
> >   
> > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/386094.html
> 
> That patch forces a memory write-read and returns uninitialised stack
> for short reads.

Really? The disassembly looks fine to me. Do you have a concrete example
of where you think it goes wrong, please?

> Who knows what happens on big-endian systems.

The same thing as READ_ONCE? I'll test it there to make sure, but I
don't see a problem.

> You need a static inline function with separate temporaries in each
> branch.

I'm just following the precedent set out by READ_ONCE, since that's
tackling exactly the same problem and appears to be working for people.

Will


Re: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-18 Thread Will Deacon
On Tue, Nov 17, 2015 at 08:17:17PM +0100, Arnd Bergmann wrote:
> On Tuesday 17 November 2015 17:12:37 Will Deacon wrote:
> > On Tue, Nov 17, 2015 at 06:03:40PM +0100, Arnd Bergmann wrote:
> > > On Tuesday 17 November 2015 16:44:53 Will Deacon wrote:
> > > > > 8<
> > > > > Subject: ARM64: make smp_load_acquire() work with const arguments
> > > > > 
> > > > > smp_load_acquire() uses typeof() to declare a local variable for 
> > > > > temporarily
> > > > > storing the output of the memory access. This fails when the argument 
> > > > > is
> > > > > constant, because the assembler complains about using a constant 
> > > > > register
> > > > > as output:
> > > > > 
> > > > >  arch/arm64/include/asm/barrier.h:71:3: error: read-only variable 
> > > > > '___p1'
> > > > >  used as 'asm' output
> > > > 
> > > > Do you know the usage in the kernel causing this warning?
> > > 
> > > A newly introduced function in include/net/sock.h:
> > > 
> > > static inline int sk_state_load(const struct sock *sk)
> > > {
> > > return smp_load_acquire(&sk->sk_state);
> > > }
> > 
> > Hmm, maybe we could play a similar trick to READ_ONCE by declaring an
> > anonymous union and writing through the non-const member?
> 
> Yes, I think that would work, if you think we need to care about the
> case where we read into a structure.
> 
> Can you come up with a patch for that?

Done:

  
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-November/386094.html

Will


Re: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-17 Thread Will Deacon
On Tue, Nov 17, 2015 at 06:03:40PM +0100, Arnd Bergmann wrote:
> On Tuesday 17 November 2015 16:44:53 Will Deacon wrote:
> > > 8<
> > > Subject: ARM64: make smp_load_acquire() work with const arguments
> > > 
> > > smp_load_acquire() uses typeof() to declare a local variable for 
> > > temporarily
> > > storing the output of the memory access. This fails when the argument is
> > > constant, because the assembler complains about using a constant register
> > > as output:
> > > 
> > >  arch/arm64/include/asm/barrier.h:71:3: error: read-only variable '___p1'
> > >  used as 'asm' output
> > 
> > Do you know the usage in the kernel causing this warning?
> 
> A newly introduced function in include/net/sock.h:
> 
> static inline int sk_state_load(const struct sock *sk)
> {
> return smp_load_acquire(&sk->sk_state);
> }

Hmm, maybe we could play a similar trick to READ_ONCE by declaring an
anonymous union and writing through the non-const member?

Will


Re: next build: 235 warnings 3 failures (next/next-20151117)

2015-11-17 Thread Will Deacon
Hi Arnd,

On Tue, Nov 17, 2015 at 09:57:30AM +0100, Arnd Bergmann wrote:
> On Monday 16 November 2015 19:05:05 Olof's autobuilder wrote:
> > 
> > Errors:
> > 
> > arm64.allmodconfig:
> > arch/arm64/include/asm/barrier.h:71:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:75:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:79:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:83:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:71:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:75:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:79:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:83:3: error: read-only variable '___p1' 
> > used as 'asm' output
> > arch/arm64/include/asm/barrier.h:71:3: error: read-only variable '___p1' 
> > used as 'asm' output
> 
> The patch below seems to fix it. Please review/apply.
> 
> 8<
> Subject: ARM64: make smp_load_acquire() work with const arguments
> 
> smp_load_acquire() uses typeof() to declare a local variable for temporarily
> storing the output of the memory access. This fails when the argument is
> constant, because the assembler complains about using a constant register
> as output:
> 
>  arch/arm64/include/asm/barrier.h:71:3: error: read-only variable '___p1'
>  used as 'asm' output

Do you know the usage in the kernel causing this warning?

> This changes the implementation to use an 'unsigned long' for the temporary
> value and only cast it to the original type in the end.
> 
> Signed-off-by: Arnd Bergmann 
> 
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index 624f9679f4b0..05fa329467f6 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -64,7 +64,7 @@ do {								\
>  
>  #define smp_load_acquire(p)  \
>  ({   \
> - typeof(*p) ___p1;   \
> + unsigned long ___p1;\
>   compiletime_assert_atomic_type(*p); \
>   switch (sizeof(*p)) {   \
>   case 1: \
> @@ -84,7 +84,7 @@ do {								\
>   : "=r" (___p1) : "Q" (*p) : "memory");  \
>   break;  \
>   }   \
> - ___p1;  \
> +	(typeof(*p))___p1;						\

My worry about having the cast here is if somebody decides to
smp_load_acquire from a packed 64-bit structure (e.g. some sort of lock),
then we'll get a conversion to non-scalar type error from the compiler.
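
For instance (hypothetical type, just to spell out the failure mode):

	typedef struct {
		u32 owner;
		u32 next;
	} some_lock_t;			/* 64-bit, but not a scalar type */

	some_lock_t lock;
	some_lock_t v = smp_load_acquire(&lock);
	/*
	 * With the plain 'unsigned long' temporary this expands to
	 * (some_lock_t)___p1, an integer-to-struct cast, which the
	 * compiler rejects ("conversion to non-scalar type requested").
	 */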

Will


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Will Deacon
On Wed, Nov 11, 2015 at 10:11:33AM -0800, Alexei Starovoitov wrote:
> On Wed, Nov 11, 2015 at 06:57:41PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 11, 2015 at 12:35:48PM -0500, David Miller wrote:
> > > From: Alexei Starovoitov 
> > > Date: Wed, 11 Nov 2015 09:27:00 -0800
> > > 
> > > > BPF_XADD == atomic_add() in kernel. period.
> > > > we are not going to deprecate it or introduce something else.
> > > 
> > > Agreed, it makes no sense to try and tie C99 or whatever atomic
> > > semantics to something that is already clearly defined to have
> > > exactly kernel atomic_add() semantics.
> > 
> > Dave, this really doesn't make any sense to me. __sync primitives have
> > well defined semantics and (e)BPF is violating this.
> 
> bpf_xadd was never meant to be __sync_fetch_and_add equivalent.
> From day one it was meant to be atomic_add() as the kernel does it.
> I did piggy back on __sync in the llvm backend because it was the quick
> and dirty way to move forward.
> In retrospect I should have introduced a clean intrinsic for that instead,
> but it's not too late to do it now. user space we can change at any time
> unlike kernel.

But it's not just "user space", it's the source language definition!
I also don't see how you can change it now, without simply rejecting
the __sync primitives outright.

> > Furthermore, the fetch_and_add (or XADD) name has well defined
> > semantics, which (e)BPF also violates.
> 
> bpf_xadd also wasn't meant to be a 'fetch'. It has returned void from the
> beginning.

Right, so it's just a misnomer.

> > Atomicy is hard enough as it is, backends giving random interpretations
> > to them isn't helping anybody.
> 
> no randomness. bpf_xadd == atomic_add() in kernel.
> imo that is the simplest and cleanest interpretation one can have, no?

I don't really mind, as long as there is a semantic that everybody agrees
on. Really, I just want this to be consistent because memory models are
a PITA enough without having multiple interpretations flying around.

> > It also baffles me that Alexei is seemingly unwilling to change/rev the
> > (e)BPF instructions, which would be invisible to the regular user, he
> > does want to change the language itself, which will impact all
> > 'scripts'.
> 
> well, we cannot change it in kernel because it's ABI.
> I'm not against adding new insns. We definitely can, but let's figure out why?
> Is anything broken? No. So what new insns make sense?

If you end up needing a suite of atomics, I would suggest the __atomic
builtins because they are likely to be more portable and more flexible
than trying to use the kernel memory model outside of the environment
for which it was developed. However, I agree with you that we can cross
that bridge when we get there.
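
That is, something along the lines of (illustrative only):

	long counter;

	/* The caller picks the ordering explicitly instead of getting an
	 * implicit full barrier as with the __sync_* builtins. */
	__atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);	/* no ordering */
	__atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);	/* fully ordered */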

> Adding new intrinsic to llvm is not a big deal. I'll add it as soon
> as I have time to work on it or if somebody beats me to it I would be
> glad to test it and apply it.

I'm more interested in what you do about the existing intrinsic. Anyway,
I'll raise a ticket against LLVM so that they're aware (and maybe
somebody else will fix it :).

Will


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Will Deacon
On Wed, Nov 11, 2015 at 12:35:48PM -0500, David Miller wrote:
> From: Alexei Starovoitov 
> Date: Wed, 11 Nov 2015 09:27:00 -0800
> 
> > BPF_XADD == atomic_add() in kernel. period.
> > we are not going to deprecate it or introduce something else.
> 
> Agreed, it makes no sense to try and tie C99 or whatever atomic
> semantics to something that is already clearly defined to have
> exactly kernel atomic_add() semantics.

... and which is emitted by LLVM when asked to compile __sync_fetch_and_add,
which has clearly defined (yet conflicting) semantics.

If the discrepancy is in LLVM (and it sounds like it is), then I'll raise
a bug over there instead.

Will


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Will Deacon
Hi Daniel,

Thanks for investigating this further.

On Wed, Nov 11, 2015 at 04:52:00PM +0100, Daniel Borkmann wrote:
> I played a bit around with eBPF code to assign the __sync_fetch_and_add()
> return value to a var and dump it to trace pipe, or use it as return code.
> llvm compiles it (with the result assignment) and it looks like:
> 
> [...]
> 206: (b7) r3 = 3
> 207: (db) lock *(u64 *)(r0 +0) += r3
> 208: (bf) r1 = r10
> 209: (07) r1 += -16
> 210: (b7) r2 = 10
> 211: (85) call 6 // r3 dumped here
> [...]
> 
> [...]
> 206: (b7) r5 = 3
> 207: (db) lock *(u64 *)(r0 +0) += r5
> 208: (bf) r1 = r10
> 209: (07) r1 += -16
> 210: (b7) r2 = 10
> 211: (b7) r3 = 43
> 212: (b7) r4 = 42
> 213: (85) call 6 // r5 dumped here
> [...]
> 
> [...]
> 11: (b7) r0 = 3
> 12: (db) lock *(u64 *)(r1 +0) += r0
> 13: (95) exit // r0 returned here
> [...]
> 
> What it seems is that we 'get back' the value (== 3 here in r3, r5, r0)
> that we're adding, at least that's what seems to be generated wrt
> register assignments. Hmm, the semantic differences of bpf target
> should be documented somewhere for people writing eBPF programs to
> be aware of.

If we're going to document it, a bug tracker might be a good place to
start. The behaviour, as it stands, is broken wrt the definition of the
__sync primitives. That is, there is no way to build __sync_fetch_and_add
out of BPF_XADD without changing its semantics.
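
To spell the mismatch out (illustrative):

	__u64 val = 1;

	/* What the builtin promises: return the value *before* the add,
	 * with full-barrier semantics. */
	__u64 old = __sync_fetch_and_add(&val, 3);	/* old == 1, val == 4 */

	/* What BPF_XADD provides: a void, in-place add (the interpreter
	 * maps it to atomic_add()/atomic64_add()), e.g.
	 *	lock *(u64 *)(r0 +0) += r3
	 * so there is no old value to hand back. */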

We could fix this by either:

(1) Defining BPF_XADD to match __sync_fetch_and_add (including memory
barriers).

(2) Introducing some new BPF_ atomics, that map to something like the
C11 __atomic builtins and deprecating BPF_XADD in favour of these.

(3) Introducing new source-language intrinsics to match what BPF can do
(unlikely to be popular).

As it stands, I'm not especially keen on adding BPF_XADD to the arm64
JIT backend until we have at least (1) and preferably (2) as well.

Will


Re: [PATCH 1/2] arm64: bpf: add 'store immediate' instruction

2015-11-11 Thread Will Deacon
On Wed, Nov 11, 2015 at 12:12:56PM +, Will Deacon wrote:
> On Tue, Nov 10, 2015 at 06:45:39PM -0800, Z Lim wrote:
> > On Tue, Nov 10, 2015 at 2:41 PM, Yang Shi  wrote:
> > > aarch64 doesn't have native store immediate instruction, such operation
> > 
> > Actually, aarch64 does have "STR (immediate)". For arm64 JIT, we can
> > consider using it as an optimization.
> 
> Yes, I'd definitely like to see that in preference to moving via a
> temporary register.

Wait a second, we're both talking rubbish here :) The STR (immediate)
form is referring to the addressing mode, whereas this patch wants to
store an immediate value to memory, which does need moving to a register
first.

So the original patch is fine.

Will


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Will Deacon
On Wed, Nov 11, 2015 at 01:21:04PM +0100, Daniel Borkmann wrote:
> On 11/11/2015 12:58 PM, Will Deacon wrote:
> >On Wed, Nov 11, 2015 at 11:42:11AM +0100, Daniel Borkmann wrote:
> >>On 11/11/2015 11:24 AM, Will Deacon wrote:
> >>>On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote:
> >>>>On Tuesday 10 November 2015 18:52:45 Z Lim wrote:
> >>>>>On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov
> >>>>> wrote:
> >>>>>>On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote:
> >>>>>>>On 11/10/2015 4:08 PM, Eric Dumazet wrote:
> >>>>>>>>On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote:
> >>>>>>>>>aarch64 doesn't have native support for XADD instruction, implement 
> >>>>>>>>>it by
> >>>>>>>>>the below instruction sequence:
> >>>>>
> >>>>>aarch64 supports atomic add in ARMv8.1.
> >>>>>For ARMv8(.0), please consider using LDXR/STXR sequence.
> >>>>
> >>>>Is it worth optimizing for the 8.1 case? It would add a bit of complexity
> >>>>to make the code depend on the CPU feature, but it's certainly doable.
> >>>
> >>>What's the atomicity required for? Put another way, what are we racing
> >>>with (I thought bpf was single-threaded)? Do we need to worry about
> >>>memory barriers?
> >>>
> >>>Apologies if these are stupid questions, but all I could find was
> >>>samples/bpf/sock_example.c and it didn't help much :(
> >>
> >>The equivalent code more readable in restricted C syntax (that can be
> >>compiled by llvm) can be found in samples/bpf/sockex1_kern.c. So the
> >>built-in __sync_fetch_and_add() will be translated into a BPF_XADD
> >>insn variant.
> >
> >Yikes, so the memory-model for BPF is based around the deprecated GCC
> >__sync builtins, that inherit their semantics from ia64? Any reason not
> >to use the C11-compatible __atomic builtins[1] as a base?
> 
> Hmm, gcc doesn't have an eBPF compiler backend, so this won't work on
> gcc at all. The eBPF backend in LLVM recognizes the __sync_fetch_and_add()
> keyword and maps that to a BPF_XADD version (BPF_W or BPF_DW). In the
> interpreter (__bpf_prog_run()), as Eric mentioned, this maps to atomic_add()
> and atomic64_add(), respectively. So the struct bpf_insn prog[] you saw
> from sock_example.c can be regarded as one possible equivalent program
> section output from the compiler.

Ok, so if I understand you correctly, then __sync_fetch_and_add() has
different semantics depending on the backend target. That seems counter
to the LLVM atomics Documentation:

  http://llvm.org/docs/Atomics.html

which specifically calls out the __sync_* primitives as being
sequentially-consistent and requiring barriers on ARM (which isn't the
case for atomic[64]_add in the kernel).

If we re-use the __sync_* naming scheme in the source language, I don't
think we can overlay our own semantics in the backend. The
__sync_fetch_and_add primitive is also expected to return the old value,
which doesn't appear to be the case for BPF_XADD.

Will


Re: [PATCH 1/2] arm64: bpf: add 'store immediate' instruction

2015-11-11 Thread Will Deacon
On Tue, Nov 10, 2015 at 06:45:39PM -0800, Z Lim wrote:
> On Tue, Nov 10, 2015 at 2:41 PM, Yang Shi  wrote:
> > aarch64 doesn't have native store immediate instruction, such operation
> 
> Actually, aarch64 does have "STR (immediate)". For arm64 JIT, we can
> consider using it as an optimization.

Yes, I'd definitely like to see that in preference to moving via a
temporary register.

Will


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Will Deacon
Hi Daniel,

On Wed, Nov 11, 2015 at 11:42:11AM +0100, Daniel Borkmann wrote:
> On 11/11/2015 11:24 AM, Will Deacon wrote:
> >On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote:
> >>On Tuesday 10 November 2015 18:52:45 Z Lim wrote:
> >>>On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov
> >>> wrote:
> >>>>On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote:
> >>>>>On 11/10/2015 4:08 PM, Eric Dumazet wrote:
> >>>>>>On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote:
> >>>>>>>aarch64 doesn't have native support for XADD instruction, implement it 
> >>>>>>>by
> >>>>>>>the below instruction sequence:
> >>>
> >>>aarch64 supports atomic add in ARMv8.1.
> >>>For ARMv8(.0), please consider using LDXR/STXR sequence.
> >>
> >>Is it worth optimizing for the 8.1 case? It would add a bit of complexity
> >>to make the code depend on the CPU feature, but it's certainly doable.
> >
> >What's the atomicity required for? Put another way, what are we racing
> >with (I thought bpf was single-threaded)? Do we need to worry about
> >memory barriers?
> >
> >Apologies if these are stupid questions, but all I could find was
> >samples/bpf/sock_example.c and it didn't help much :(
> 
> The equivalent code more readable in restricted C syntax (that can be
> compiled by llvm) can be found in samples/bpf/sockex1_kern.c. So the
> built-in __sync_fetch_and_add() will be translated into a BPF_XADD
> insn variant.

Yikes, so the memory-model for BPF is based around the deprecated GCC
__sync builtins, that inherit their semantics from ia64? Any reason not
to use the C11-compatible __atomic builtins[1] as a base?

> What you can race against is that an eBPF map can be _shared_ by
> multiple eBPF programs that are attached somewhere in the system, and
> they could all update a particular entry/counter from the map at the
> same time.

Ok, so it does sound like eBPF needs to define/choose a memory-model and
I worry that riding on the back of __sync isn't necessarily the right
thing to do, particularly as its fallen out of favour with the compiler
folks. On weakly-ordered architectures, it's also going to result in
heavy-weight barriers for all atomic operations.

Will

[1] https://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html


Re: [PATCH 2/2] arm64: bpf: add BPF XADD instruction

2015-11-11 Thread Will Deacon
On Wed, Nov 11, 2015 at 09:49:48AM +0100, Arnd Bergmann wrote:
> On Tuesday 10 November 2015 18:52:45 Z Lim wrote:
> > On Tue, Nov 10, 2015 at 4:42 PM, Alexei Starovoitov
> >  wrote:
> > > On Tue, Nov 10, 2015 at 04:26:02PM -0800, Shi, Yang wrote:
> > >> On 11/10/2015 4:08 PM, Eric Dumazet wrote:
> > >> >On Tue, 2015-11-10 at 14:41 -0800, Yang Shi wrote:
> > >> >>aarch64 doesn't have native support for XADD instruction, implement it 
> > >> >>by
> > >> >>the below instruction sequence:
> > 
> > aarch64 supports atomic add in ARMv8.1.
> > For ARMv8(.0), please consider using LDXR/STXR sequence.
> 
> Is it worth optimizing for the 8.1 case? It would add a bit of complexity
> to make the code depend on the CPU feature, but it's certainly doable.

What's the atomicity required for? Put another way, what are we racing
with (I thought bpf was single-threaded)? Do we need to worry about
memory barriers?

Apologies if these are stupid questions, but all I could find was
samples/bpf/sock_example.c and it didn't help much :(

Will