Re: [PATCH 1/2] drivers: net: xgene: fix: Derive prefetch number from irq

2016-03-09 Thread Mark Rutland
On Tue, Mar 08, 2016 at 09:49:36PM -0800, Iyappan Subramanian wrote:
> Prefetch buffer numbers are mapped to hardware irqs. Currently
> they are statically assigned to match with firmware irqs.
> 
> If the irq on firmware changes, prefetch buffer number on the driver
> also needs to be updated to match with the firmware.
> 
> This patch removes this static association between firmware and the
> driver by deriving the prefetch buffer number from the Linux irq.

This driver should neither care about the Linux IRQ number nor the interrupt
controller's physical IRQ number. Neither of the two are properties of the
device.

What you actually care about is the index of the interrupt pin _out_ of the
device, not _into_ the interrupt controller. As with other drivers, that should be
described explicitly in the device binding. Either the interrupts should be
listed in a fixed order, or they should be given names.
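
For illustration, a minimal sketch of the named-interrupt approach (the
"rx" name and the helper function below are assumptions, not the driver's
existing code):

#include <linux/platform_device.h>
#include "xgene_enet_main.h"	/* for struct xgene_enet_pdata */

static int xgene_enet_get_irqs(struct platform_device *pdev,
			       struct xgene_enet_pdata *pdata)
{
	int irq;

	/* Ask for the interrupt by the name given in the binding... */
	irq = platform_get_irq_byname(pdev, "rx");
	if (irq < 0)
		return irq;

	/* ...and never look at Linux or hardware IRQ numbers directly. */
	pdata->irqs[0] = irq;

	return 0;
}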

> +static u8 xgene_start_cpu_bufnum(struct xgene_enet_pdata *pdata)
> +{
> + struct irq_desc *desc;
> + int hwirq;
> +
> + desc = irq_to_desc(pdata->irqs[0]);
> + hwirq = desc->irq_data.hwirq;
> +
> + return hwirq - XGENE2_RM0_START_IRQ;
> +}

No device driver should be doing this.

Mark.


Re: [v8, 1/7] Documentation: DT: update Freescale DCFG compatible

2016-04-22 Thread Mark Rutland
On Fri, Apr 22, 2016 at 02:27:38PM +0800, Yangbo Lu wrote:
> Update Freescale DCFG compatible with 'fsl,<chip>-dcfg' instead
> of 'fsl,ls1021a-dcfg' to include more chips.
> 
> Signed-off-by: Yangbo Lu 
> ---
> Changes for v8:
>   - Added this patch
> ---
>  Documentation/devicetree/bindings/arm/fsl.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/arm/fsl.txt 
> b/Documentation/devicetree/bindings/arm/fsl.txt
> index 752a685..1d5f512 100644
> --- a/Documentation/devicetree/bindings/arm/fsl.txt
> +++ b/Documentation/devicetree/bindings/arm/fsl.txt
> @@ -119,7 +119,7 @@ Freescale DCFG
>  configuration and status for the device. Such as setting the secondary
>  core start address and release the secondary core from holdoff and startup.
>Required properties:
> -  - compatible: should be "fsl,ls1021a-dcfg"
> +  - compatible: should be "fsl,<chip>-dcfg"

Please list the specific values expected for <chip>. While just saying
<chip> may be more generic, it makes it practically impossible to search
for the correct binding given a compatible string, and it's vague as to
exactly what <chip> should be.
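
As a sketch of why this matters: kernel code matches on literal compatible
strings, so the binding needs to enumerate them (the extra chip names below
are placeholders, not a real list):

#include <linux/mod_devicetable.h>

static const struct of_device_id fsl_dcfg_of_match[] = {
	{ .compatible = "fsl,ls1021a-dcfg" },
	{ .compatible = "fsl,ls1043a-dcfg" },	/* placeholder */
	{ .compatible = "fsl,ls2080a-dcfg" },	/* placeholder */
	{ /* sentinel */ }
};

A wildcard like "fsl,<chip>-dcfg" can't be grepped for or matched against.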

Thanks,
Mark.



>- reg : should contain base address and length of DCFG memory-mapped 
> registers
>  
>  Example:
> -- 
> 2.1.0.27.g96db324
> 


Re: [PATCH net] bpf, arm: start flushing icache range from header

2015-11-16 Thread Mark Rutland
On Sat, Nov 14, 2015 at 01:26:53AM +0100, Daniel Borkmann wrote:
> During review I noticed that the icache range we're flushing should
> start at header already and not at ctx.image.
> 
> Reason is that after 55309dd3d4cd ("net: bpf: arm: address randomize
> and write protect JIT code"), we also want to make sure to flush the
> random-sized trap in front of the start of the actual program (analogous
> to x86). No operational differences from user side.
> 
> Signed-off-by: Daniel Borkmann 
> Tested-by: Nicolas Schichan 
> Cc: Alexei Starovoitov 
> ---
>  ( As arm32 fixes usually go via Dave's tree, targeting -net. )
> 
>  arch/arm/net/bpf_jit_32.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> index 2f4b14c..591f9db 100644
> --- a/arch/arm/net/bpf_jit_32.c
> +++ b/arch/arm/net/bpf_jit_32.c
> @@ -1061,7 +1061,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
>   }
>   build_epilogue(&ctx);
>  
> - flush_icache_range((u32)ctx.target, (u32)(ctx.target + ctx.idx));
> + flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));

As with the arm64 patch, doesn't this prevent us from flushing the end
of the image? ctx.idx doesn't seem to take into account the header size.

Mark.


Re: [PATCH net] bpf, arm: start flushing icache range from header

2015-11-16 Thread Mark Rutland
On Mon, Nov 16, 2015 at 11:40:55AM +, Mark Rutland wrote:
> On Sat, Nov 14, 2015 at 01:26:53AM +0100, Daniel Borkmann wrote:
> > During review I noticed that the icache range we're flushing should
> > start at header already and not at ctx.image.
> > 
> > Reason is that after 55309dd3d4cd ("net: bpf: arm: address randomize
> > and write protect JIT code"), we also want to make sure to flush the
> > random-sized trap in front of the start of the actual program (analogous
> > to x86). No operational differences from user side.
> > 
> > Signed-off-by: Daniel Borkmann 
> > Tested-by: Nicolas Schichan 
> > Cc: Alexei Starovoitov 
> > ---
> >  ( As arm32 fixes usually go via Dave's tree, targeting -net. )
> > 
> >  arch/arm/net/bpf_jit_32.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
> > index 2f4b14c..591f9db 100644
> > --- a/arch/arm/net/bpf_jit_32.c
> > +++ b/arch/arm/net/bpf_jit_32.c
> > @@ -1061,7 +1061,7 @@ void bpf_jit_compile(struct bpf_prog *fp)
> > }
> > build_epilogue(&ctx);
> >  
> > -   flush_icache_range((u32)ctx.target, (u32)(ctx.target + ctx.idx));
> > +   flush_icache_range((u32)header, (u32)(ctx.target + ctx.idx));
> 
> As with the arm64 patch, doesn't this prevent us from flushing the end
> of the image? ctx.idx doesn't seem to take into account the header size.

I'd misread the patch; it is fine.

Sorry for the noise.

Mark.


KASAN failures in X-Gene ethernet driver in v4.4-rc2

2015-11-25 Thread Mark Rutland
While testing a v4.4-rc2 defconfig + KASAN_INLINE kernel on an X-Gene
platform, I spotted the KASAN warnings below. I'm using the Linaro 15.08
little-endian AArch64 GCC [1] to enable KASAN_INLINE. My rootfs is an
NFS mount.

Most of the time I can trigger the issue by grabbing the kernel source
tarball:

$ wget https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.4-rc2.tar.xz

This doesn't seem to trigger for small files (< 30K or so at least), and
I don't see similar issues triggered by my NFS root during boot.

When running the same kernel and workload on a Juno platform using
SMSC911x for networking I do not see similar issues.

Any idea what's to blame?

Thanks,
Mark.

[1] 
https://releases.linaro.org/components/toolchain/binaries/latest-5.1/arm-linux-gnueabihf/gcc-linaro-5.1-2015.08-x86_64_arm-linux-gnueabihf.tar.xz

==
BUG: KASAN: use-after-free in xgene_enet_start_xmit+0x1a04/0x22c0 at addr 
ffc36c220cb8
Read of size 8 by task kworker/5:2H/864
=
BUG skbuff_head_cache (Not tainted): kasan: bad access detected
-

Disabling lock debugging due to kernel taint
INFO: Allocated in __alloc_skb+0x8c/0x448 age=3 cpu=5 pid=864
INFO: Freed in kfree_skbmem+0xc4/0xf0 age=2 cpu=0 pid=0
INFO: Slab 0xffbecdb08800 objects=32 used=3 fp=0xffc36c220c00 
flags=0x40004080
INFO: Object 0xffc36c220c00 @offset=3072 fp=0xffc36c221500

Bytes b4 ffc36c220bf0: 05 00 00 00 60 03 00 00 8a ab ff ff 00 00 00 00  
`...
Object ffc36c220c00: 00 15 22 6c c3 ff ff ff 00 00 00 00 00 00 00 00  
.."l
Object ffc36c220c10: 00 00 00 00 00 00 00 00 00 f0 0d 6d c3 ff ff ff  
...m
Object ffc36c220c20: 00 21 06 6d c3 ff ff ff 5e 00 00 00 14 00 00 00  
.!.m^...
Object ffc36c220c30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object ffc36c220c40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object ffc36c220c50: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object ffc36c220c60: 98 ce da 00 c0 ff ff ff 00 00 00 00 00 00 00 00  

Object ffc36c220c70: 82 05 00 00 40 05 00 00 0e 00 00 00 00 00 00 00  
@...
Object ffc36c220c80: 00 16 00 00 80 00 10 00 00 00 00 00 00 00 00 00  

Object ffc36c220c90: 95 6e 6a c8 00 00 00 00 06 00 00 00 00 00 00 00  
.nj.
Object ffc36c220ca0: 00 00 00 00 00 00 00 00 08 00 80 00 6c 00 5e 00  
l.^.
Object ffc36c220cb0: a0 00 00 00 80 02 00 00 80 ba c7 6c c3 ff ff ff  
...l
Object ffc36c220cc0: de ba c7 6c c3 ff ff ff 40 0a 00 00 01 00 00 00  
...l@...
CPU: 5 PID: 864 Comm: kworker/5:2H Tainted: GB   4.4.0-rc2 #4
Hardware name: APM X-Gene Mustang board (DT)
Workqueue: rpciod rpc_async_schedule
Call trace:
[] dump_backtrace+0x0/0x280
[] show_stack+0x14/0x20
[] dump_stack+0x100/0x188
[] print_trailer+0xfc/0x168
[] object_err+0x3c/0x50
[] kasan_report_error+0x244/0x558
[] __asan_report_load8_noabort+0x48/0x50
[] xgene_enet_start_xmit+0x1a04/0x22c0
[] dev_hard_start_xmit+0x5bc/0xa70
[] sch_direct_xmit+0x2d8/0x5d0
[] __dev_queue_xmit+0x6a8/0x10d0
[] dev_queue_xmit+0x10/0x18
[] ip_finish_output2+0x5f4/0x1010
[] ip_finish_output+0x48c/0x688
[] ip_output+0x278/0x358
[] ip_local_out+0xa4/0xc8
[] ip_queue_xmit+0x534/0x1368
[] tcp_transmit_skb+0x10cc/0x27c8
[] tcp_write_xmit+0x53c/0x4788
[] __tcp_push_pending_frames+0x8c/0x1e0
[] tcp_push+0x37c/0x550
[] tcp_sendpage+0xdc8/0x1428
[] inet_sendpage+0x208/0x338
[] xs_sendpages+0x378/0x4b8
[] xs_tcp_send_request+0x1f4/0x4b0
[] xprt_transmit+0xe0/0x6f8
[] call_transmit+0x6f4/0xcd8
[] __rpc_execute+0x104/0x590
[] rpc_async_schedule+0x10/0x18
[] process_one_work+0x3d0/0xc80
[] worker_thread+0x348/0xd90
[] kthread+0x1f4/0x258
[] ret_from_fork+0x10/0x40
Memory state around the buggy address:
 ffc36c220b80: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc
 ffc36c220c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffc36c220c80: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc
^
 ffc36c220d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffc36c220d80: fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc
==
==
BUG: KASAN: use-after-free in xgene_enet_start_xmit+0x19f8/0x22c0 at addr 
ffc36c220cb4
Read of size 4 by task kworker/5:2H/864
=
BUG skbuff_head_cache (Tainted: GB  ): kasan: bad access detected
-

INFO: Allocated in __alloc_skb+0x8c/0x

Re: KASAN failures in X-Gene ethernet driver in v4.4-rc2

2015-11-25 Thread Mark Rutland
On Wed, Nov 25, 2015 at 08:17:36AM -0800, Eric Dumazet wrote:
> On Wed, 2015-11-25 at 15:59 +0000, Mark Rutland wrote:
> > xgene_enet_start_xmit
> 
> Please try following trivial fix

With that applied KASAN is silent, despite my efforts to trigger the
issue, so it looks like that fixes it.

Thanks,
Mark.

> diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c 
> b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
> index 1adfe7036843..9147a0107c44 100644
> --- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
> +++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
> @@ -450,12 +450,12 @@ static netdev_tx_t xgene_enet_start_xmit(struct sk_buff 
> *skb,
>   return NETDEV_TX_OK;
>   }
>  
> - pdata->ring_ops->wr_cmd(tx_ring, count);
>   skb_tx_timestamp(skb);
>  
>   pdata->stats.tx_packets++;
>   pdata->stats.tx_bytes += skb->len;
>  
> + pdata->ring_ops->wr_cmd(tx_ring, count);
>   return NETDEV_TX_OK;
>  }
>  
> 
> 


Re: [PATCH 2/2] net, thunder, bgx: Add support for ACPI binding.

2015-08-07 Thread Mark Rutland
On Fri, Aug 07, 2015 at 01:33:10AM +0100, David Daney wrote:
> From: David Daney 
> 
> Find out which PHYs belong to which BGX instance in the ACPI way.
> 
> Set the MAC address of the device as provided by ACPI tables. This is
> similar to the implementation for devicetree in
> of_get_mac_address(). The table is searched for the device property
> entries "mac-address", "local-mac-address" and "address" in that
> order. The address is provided in a u64 variable and must contain a
> valid 6-byte MAC address.
> 
> Based on code from: Narinder Dhillon 
> Tomasz Nowicki 
> Robert Richter 
> 
> Signed-off-by: Tomasz Nowicki 
> Signed-off-by: Robert Richter 
> Signed-off-by: David Daney 
> ---
>  drivers/net/ethernet/cavium/thunder/thunder_bgx.c | 137 
> +-
>  1 file changed, 135 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c 
> b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
> index 615b2af..2056583 100644
> --- a/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
> +++ b/drivers/net/ethernet/cavium/thunder/thunder_bgx.c
> @@ -6,6 +6,7 @@
>   * as published by the Free Software Foundation.
>   */
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -26,7 +27,7 @@
>  struct lmac {
>   struct bgx  *bgx;
>   int dmac;
> - unsigned char   mac[ETH_ALEN];
> + u8  mac[ETH_ALEN];
>   boollink_up;
>   int lmacid; /* ID within BGX */
>   int lmacid_bd; /* ID on board */
> @@ -835,6 +836,133 @@ static void bgx_get_qlm_mode(struct bgx *bgx)
>   }
>  }
>  
> +#ifdef CONFIG_ACPI
> +
> +static int bgx_match_phy_id(struct device *dev, void *data)
> +{
> + struct phy_device *phydev = to_phy_device(dev);
> + u32 *phy_id = data;
> +
> + if (phydev->addr == *phy_id)
> + return 1;
> +
> + return 0;
> +}
> +
> +static const char * const addr_propnames[] = {
> + "mac-address",
> + "local-mac-address",
> + "address",
> +};

If these are going to be generally necessary, then we should get them
adopted as standardised _DSD properties (ideally just one of them).

[...]

> +static acpi_status bgx_acpi_register_phy(acpi_handle handle,
> +  u32 lvl, void *context, void **rv)
> +{
> + struct acpi_reference_args args;
> + const union acpi_object *prop;
> + struct bgx *bgx = context;
> + struct acpi_device *adev;
> + struct device *phy_dev;
> + u32 phy_id;
> +
> + if (acpi_bus_get_device(handle, &adev))
> + goto out;
> +
> + SET_NETDEV_DEV(&bgx->lmac[bgx->lmac_count].netdev, &bgx->pdev->dev);
> +
> + acpi_get_mac_address(adev, bgx->lmac[bgx->lmac_count].mac);
> +
> + bgx->lmac[bgx->lmac_count].lmacid = bgx->lmac_count;
> +
> + if (acpi_dev_get_property_reference(adev, "phy-handle", 0, &args))
> + goto out;
> +
> + if (acpi_dev_get_property(args.adev, "phy-channel", ACPI_TYPE_INTEGER, 
> &prop))
> + goto out;

Likewise for any inter-device properties, so that we can actually handle
them in a generic fashion, and avoid / learn from the mistakes we've
already handled with DT.

Mark.


Re: [PATCH 2/2] net, thunder, bgx: Add support for ACPI binding.

2015-08-07 Thread Mark Rutland
[Correcting the devicetree list address, which I typo'd in my original
reply]

> >> +static const char * const addr_propnames[] = {
> >> +  "mac-address",
> >> +  "local-mac-address",
> >> +  "address",
> >> +};
> >
> > If these are going to be generally necessary, then we should get them
> > adopted as standardised _DSD properties (ideally just one of them).
> 
> As far as I can tell, and please correct me if I am wrong, ACPI-6.0 
> doesn't contemplate MAC addresses.
> 
> Today we are using "mac-address", which is an Integer containing the MAC 
> address in its lowest order 48 bits in Little-Endian byte order.
> 
> The hardware and ACPI tables are here today, and we would like to 
> support it.  If some future ACPI specification specifies a standard way 
> to do this, we will probably adapt the code to do this in a standard manner.
> 
> 
> >
> > [...]
> >
> >> +static acpi_status bgx_acpi_register_phy(acpi_handle handle,
> >> +   u32 lvl, void *context, void **rv)
> >> +{
> >> +  struct acpi_reference_args args;
> >> +  const union acpi_object *prop;
> >> +  struct bgx *bgx = context;
> >> +  struct acpi_device *adev;
> >> +  struct device *phy_dev;
> >> +  u32 phy_id;
> >> +
> >> +  if (acpi_bus_get_device(handle, &adev))
> >> +  goto out;
> >> +
> >> +  SET_NETDEV_DEV(&bgx->lmac[bgx->lmac_count].netdev, &bgx->pdev->dev);
> >> +
> >> +  acpi_get_mac_address(adev, bgx->lmac[bgx->lmac_count].mac);
> >> +
> >> +  bgx->lmac[bgx->lmac_count].lmacid = bgx->lmac_count;
> >> +
> >> +  if (acpi_dev_get_property_reference(adev, "phy-handle", 0, &args))
> >> +  goto out;
> >> +
> >> +  if (acpi_dev_get_property(args.adev, "phy-channel", ACPI_TYPE_INTEGER, 
> >> &prop))
> >> +  goto out;
> >
> > Likewise for any inter-device properties, so that we can actually handle
> > them in a generic fashion, and avoid / learn from the mistakes we've
> > already handled with DT.
> 
> This is the fallacy of the ACPI is superior to DT argument.  The 
> specification of PHY topology and MAC addresses is well standardized in 
> DT, there is no question about what the proper way to specify it is. 
> Under ACPI, it is the Wild West, there is no specification, so each 
> system design is forced to invent something, and everybody comes up with 
> an incompatible implementation.

Indeed.

If ACPI is going to handle it, it should handle it properly. I really
don't see the point in bodging properties together in a less standard
manner than DT, especially for inter-device relationships.

Doing so is painful for _everyone_, and it's extremely unlikely that
other ACPI-aware OSs will actually support these custom descriptions,
making this Linux-specific, and breaking the rationale for using ACPI in
the first place -- a standard that says "just do non-standard stuff" is
not a usable standard.

For intra-device properties, we should standardise what we can, but
vendor-specific stuff is ok -- this can be self-contained within a
driver.

For inter-device relationships ACPI _must_ gain a better model of
componentised devices. It's simply unworkable otherwise, and as you
point out it's fallacious to say that because ACPI is being used that
something is magically industry standard, portable, etc.

This is not your problem in particular; the entire handling of _DSD so
far is a joke IMO.

Thanks,
Mark.


v4.16 in_dev_finish_destroy() accessing freed idev->dev->pcpu_refcnt

2018-04-09 Thread Mark Rutland
Hi all,

As a heads-up, while fuzzing v4.16 on arm64, I'm hitting an intermittent
issue where in_dev_finish_destroy() calls dev_put() on idev->dev, where
idev->dev->pcpu_refcnt is NULL. Apparently idev->dev has already been
freed.

This results in a fault where we try to access the percpu offset of
NULL, such as in the example splat at the end of this mail.

So far I've had no luck in extracting a reliable reproducer, but I've
uploaded the relevant syzkaller logs (and reports) to my kernel.org
webspace [1].

Thanks,
Mark.

[1] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20180409-in_dev_finish_destroy-null-pcpu_refcnt/

Unable to handle kernel paging request at virtual address 60002be8
Mem abort info:
  ESR = 0x9604
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0004
  CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp = fb56b784
[60002be8] pgd=
Internal error: Oops: 9604 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.16.0-6-g6eee13dfb529 #1
Hardware name: linux,dummy-virt (DT)
pstate: 8045 (Nzcv daif +PAN -UAO)
pc : __percpu_add arch/arm64/include/asm/percpu.h:102 [inline]
pc : dev_put include/linux/netdevice.h:3395 [inline]
pc : in_dev_finish_destroy+0xcc/0x230 net/ipv4/devinet.c:231
lr : dev_put include/linux/netdevice.h:3395 [inline]
lr : in_dev_finish_destroy+0xa4/0x230 net/ipv4/devinet.c:231
sp : 800036317b70
x29: 800036317b70 x28: 80001f39d338 
x27: 2a917460 x26: 2a917000 
x25: 000a x24: 2a919000 
x23: 800036317c90 x22: 29b84da8 
x21: 13e73a68 x20: 80006e7a9100 
x19: 80001f39d180 x18:  
x17:  x16: 282951f0 
x15: 0001 x14: f200 
x13: 2a9fb560 x12: 0002 
x11: 1c147fbb x10: 0004 
x9 :  x8 : 1fffe4000178908c 
x7 :  x6 :  
x5 :  x4 : 1fffe4000178908c 
x3 : 1fffe4000178908c x2 :  
x1 : 60002be8 x0 : 60002be8 
Process ksoftirqd/2 (pid: 22, stack limit = 0xe96adbb2)
Call trace:
 __percpu_add arch/arm64/include/asm/percpu.h:102 [inline]
 dev_put include/linux/netdevice.h:3395 [inline]
 in_dev_finish_destroy+0xcc/0x230 net/ipv4/devinet.c:231
 in_dev_put include/linux/inetdevice.h:248 [inline]
 in_dev_rcu_put+0x34/0x48 net/ipv4/devinet.c:287
 __rcu_reclaim kernel/rcu/rcu.h:172 [inline]
 rcu_do_batch kernel/rcu/tree.c:2674 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2933 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2900 [inline]
 rcu_process_callbacks+0x350/0x988 kernel/rcu/tree.c:2917
 __do_softirq+0x318/0x734 kernel/softirq.c:285
 run_ksoftirqd+0x70/0xa8 kernel/softirq.c:666
 smpboot_thread_fn+0x544/0x9d0 kernel/smpboot.c:164
 kthread+0x2f8/0x380 kernel/kthread.c:238
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1158
Code: d538d081 f9425a80 9282 8b01 (885f7c04) 
---[ end trace 577c2c0fbbf0d4d4 ]---


Re: [PATCH v3 17/33] nds32: VDSO support

2017-12-08 Thread Mark Rutland
On Fri, Dec 08, 2017 at 05:12:00PM +0800, Greentime Hu wrote:
> From: Greentime Hu 
> 
> This patch adds VDSO support. The VDSO code is currently used for
> sys_rt_sigreturn() and optimised gettimeofday() (using the SoC timer counter).

[...]

> +static int grab_timer_node_info(void)
> +{
> + struct device_node *timer_node;
> +
> + timer_node = of_find_node_by_name(NULL, "timer");

Please use a compatible string, rather than matching the timer by name.

It's plausible that you have multiple nodes called "timer" in the DT,
under different parent nodes, and this might not be the device you
think it is. I see your dt in patch 24 has two timer nodes.

It would be best if your clocksource driver exposed some struct that you
looked at here, so that you're guaranteed to use the same device.
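
Even the minimal improvement of matching by compatible string rather than
by name would look something like this (a sketch; the function name is made
up):

#include <linux/of.h>

static struct device_node *nds32_vdso_find_timer(void)
{
	/* Match the exact device, not any node that happens to be
	 * called "timer". */
	return of_find_compatible_node(NULL, NULL, "andestech,atcpit100");
}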

> + of_property_read_u32(timer_node, "cycle-count-offset",
> +  &vdso_data->cycle_count_offset);
> + vdso_data->cycle_count_down =
> + of_property_read_bool(timer_node, "cycle-count-down");

... and then you'd only need to parse these in one place, too.

IIUC these are properties for the atcpit device, which has no
documentation or driver in this series.

So I'm rather confused as to what's going on here.

> + return of_address_to_resource(timer_node, 0, &timer_res);
> +}

> +int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
> +{

> + /*Map timer to user space */
> + vdso_base += PAGE_SIZE;
> + prot = __pgprot(_PAGE_V | _PAGE_M_UR_KR | _PAGE_D |
> + _PAGE_G | _PAGE_C_DEV);
> + ret = io_remap_pfn_range(vma, vdso_base, timer_res.start >> PAGE_SHIFT,
> +  PAGE_SIZE, prot);
> + if (ret)
> + goto up_fail;

Maybe this is fine, but it looks a bit suspicious.

Is it safe to map IO memory to a userspace process like this?

In general that isn't safe, since userspace could access other registers
(if those exist), perform accesses that change the state of hardware, or
make unsupported access types (e.g. unaligned, atomic) that result in
errors the kernel can't handle.

Does none of that apply here?

Thanks,
Mark.


Re: [PATCH v3 24/33] nds32: Device tree support

2017-12-08 Thread Mark Rutland
On Fri, Dec 08, 2017 at 05:12:07PM +0800, Greentime Hu wrote:
> + timer0: timer@f040 {
> + compatible = "andestech,atcpit100";
> + reg = <0xf040 0x1000>;
> + interrupts = <2>;
> + clocks = <&clk_pll>;
> + clock-names = "apb_pclk";
> + cycle-count-offset = <0x38>;
> + cycle-count-down;
> + };

The cover leter mentioned the atcpit100 driver had been removed, but the
node is still here, and the vdso code seems to look for this node.

This binding needs documentation.

Thanks,
Mark.


Re: [PATCH v3 24/33] nds32: Device tree support

2017-12-08 Thread Mark Rutland
On Fri, Dec 08, 2017 at 05:12:07PM +0800, Greentime Hu wrote:
> + timer0: timer@9840 {
> + compatible = "andestech,atftmr010";
> + reg = <0x9840 0x1000>;
> + interrupts = <19>;
> + clocks = <&clk_pll>;
> + clock-names = "apb_pclk";
> + cycle-count-offset = <0x20>;
> + };

Looking through the series, I can't find a binding or driver for this
timer, either.

Does timekeeping work with this series, or are additional patches
necessary?

Thanks,
Mark.


Re: [PATCH v3 17/33] nds32: VDSO support

2017-12-08 Thread Mark Rutland
On Fri, Dec 08, 2017 at 07:54:42PM +0800, Greentime Hu wrote:
> 2017-12-08 18:21 GMT+08:00 Mark Rutland :
> > On Fri, Dec 08, 2017 at 05:12:00PM +0800, Greentime Hu wrote:
> >> +static int grab_timer_node_info(void)
> >> +{
> >> + struct device_node *timer_node;
> >> +
> >> + timer_node = of_find_node_by_name(NULL, "timer");
> >
> > Please use a compatible string, rather than matching the timer by name.
> >
> > It's plausible that you have multiple nodes called "timer" in the DT,
> > under different parent nodes, and this might not be the device you
> > think it is. I see your dt in patch 24 has two timer nodes.
> >
> > It would be best if your clocksource driver exposed some struct that you
> > looked at here, so that you're guaranteed to use the same device.
> 
> We'd like to use "timer" here because there are 2 different timer IPs
> and we are sure that they won't be in the same SoC.
> We think this implementation in the VDSO should be platform independent so
> that it can read the cycle-count register.
> Our customers or other SoC providers can use "timer" and define
> cycle-count-offset or cycle-count-down, and then we can get the correct
> cycle-count.

This is not the right way to do things.

So from a DT perspective, NAK. 

You should not add properties to arbitrary DT bindings to handle a Linux
implementation detail.

Please remove this DT code, and have the drivers for those timer blocks
export this information to your vdso code somehow.
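
Something along these lines would do; this is only a sketch, and every name
in it is hypothetical:

#include <linux/types.h>
#include <linux/ioport.h>

/* Filled in by the timer driver at probe time; the vdso setup code reads
 * it instead of re-parsing the device tree. */
struct nds32_vdso_timer_info {
	resource_size_t	phys_base;
	u32		cycle_count_offset;
	bool		cycle_count_down;
};

extern struct nds32_vdso_timer_info nds32_vdso_timer_info;

static void atcpit100_export_vdso_info(struct resource *res)
{
	nds32_vdso_timer_info.phys_base = res->start;
	nds32_vdso_timer_info.cycle_count_offset = 0x38;
	nds32_vdso_timer_info.cycle_count_down = true;
}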

> We sent the atcpit100 patch last time along with our arch port; however,
> we'd like to send it to its subsystem this time, and my colleague is still
> working on it.
> He may send the timer patch next week.

I think that it would make sense for that patch to be part of the arch
port, especially given that (AFAICT) there is no driver for the other
timer IP that you mention.

[...]

> >> +int arch_setup_additional_pages(struct linux_binprm *bprm, int 
> >> uses_interp)
> >> +{
> >
> >> + /*Map timer to user space */
> >> + vdso_base += PAGE_SIZE;
> >> + prot = __pgprot(_PAGE_V | _PAGE_M_UR_KR | _PAGE_D |
> >> + _PAGE_G | _PAGE_C_DEV);
> >> + ret = io_remap_pfn_range(vma, vdso_base, timer_res.start >> 
> >> PAGE_SHIFT,
> >> +  PAGE_SIZE, prot);
> >> + if (ret)
> >> + goto up_fail;
> >
> > Maybe this is fine, but it looks a bit suspicious.
> >
> > Is it safe to map IO memory to a userspace process like this?
> >
> > In general that isn't safe, since userspace could access other registers
> > (if those exist), perform accesses that change the state of hardware, or
> > make unsupported access types (e.g. unaligned, atomic) that result in
> > errors the kernel can't handle.
> >
> > Does none of that apply here?
> 
> We only provide read permission to this page, so hardware state won't
> be changed. It will trigger an exception if we try to write.
> We will check the alignment/atomic issues for this region.

Ok, thanks.

This is another reason to only do this for devices/drivers that we have
drivers for, since we can't know that this is safe in general.

Thanks,
Mark.


Re: [PATCHv2 1/3] dt-bindings: net: Add DT bindings for Socionext Netsec

2017-12-12 Thread Mark Rutland
Hi,

On Tue, Dec 12, 2017 at 10:45:21PM +0530, jassisinghb...@gmail.com wrote:
> From: Jassi Brar 
> 
> This patch adds documentation for Device-Tree bindings for the
> Socionext NetSec Controller driver.
> 
> Signed-off-by: Ard Biesheuvel 
> Signed-off-by: Jassi Brar 
> ---
>  .../devicetree/bindings/net/socionext-netsec.txt   | 43 
> ++
>  1 file changed, 43 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/socionext-netsec.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/socionext-netsec.txt 
> b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> new file mode 100644
> index 000..4695969
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> @@ -0,0 +1,45 @@
> +* Socionext NetSec Ethernet Controller IP
> +
> +Required properties:
> +- compatible: Should be "socionext,synquacer-netsec"
> +- reg: Address and length of the control register area, followed by the
> +   address and length of the EEPROM holding the MAC address and
> +   microengine firmware
> +- interrupts: Should contain ethernet controller interrupt
> +- clocks: phandle to the PHY reference clock, and any other clocks to be
> +  switched by runtime_pm
> +- clock-names: Required only if more than a single clock is listed in 
> 'clocks'.
> +   The PHY reference clock must be named 'phy_refclk'

Please define the full set of clocks (and their names) explicitly. This
should be well-known.

Otherwise, this looks ok.

Thanks,
Mark.

> +- phy-mode: See ethernet.txt file in the same directory
> +- phy-handle: phandle to select child phy
> +
> +Optional properties: (See ethernet.txt file in the same directory)
> +- dma-coherent: Boolean property, must only be present if memory
> +  accesses performed by the device are cache coherent
> +- local-mac-address
> +- mac-address
> +- max-speed
> +- max-frame-size
> +
> +Required properties for the child phy:
> +- reg: phy address
> +
> +Example:
> + eth0: netsec@522D {
> + compatible = "socionext,synquacer-netsec";
> + reg = <0 0x522D 0x0 0x1>, <0 0x1000 0x0 0x1>;
> + interrupts = ;
> + clocks = <&clk_netsec>;
> + phy-mode = "rgmii";
> + max-speed = <1000>;
> + max-frame-size = <9000>;
> + phy-handle = <&ethphy0>;
> +
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + ethphy0: ethernet-phy@1 {
> + compatible = "ethernet-phy-ieee802.3-c22";
> + reg = <1>;
> + };
> + };
> -- 
> 2.7.4
> 


v4.16-rc1 misaligned atomics in skb__clone / __napi_alloc_skb

2018-02-15 Thread Mark Rutland
Hi,

While fuzzing arm64 v4.16-rc1 with Syzkaller, I've been hitting a
misaligned atomic in __skb_clone:

atomic_inc(&(skb_shinfo(skb)->dataref));

.. where dataref doesn't have the required natural alignment, and the
atomic operation faults. e.g. i often see it aligned to a single byte
boundary rather than a four byte boundary.

AFAICT, the skb_shared_info is misaligned at the instant it's allocated
in __napi_alloc_skb(). With the patch at the end of this mail, the
atomic_set() (which is a WRITE_ONCE()) in __build_skb() blows up, e.g.

WARNING: CPU: 0 PID: 8457 at mm/access_once.c:12 
access_once_alignment_check+0x34/0x40 mm/access_once.c:12
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 8457 Comm: syz-executor1 Not tainted 4.16.0-rc1-2-gb03ae7b8b0de 
#9
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x390 arch/arm64/kernel/time.c:52
 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:151
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0xd0/0x130 lib/dump_stack.c:53
 panic+0x220/0x3fc kernel/panic.c:183
 __warn+0x270/0x2bc kernel/panic.c:547
 report_bug+0x1dc/0x2d0 lib/bug.c:184
 bug_handler+0x7c/0x128 arch/arm64/kernel/traps.c:758
 call_break_hook arch/arm64/kernel/debug-monitors.c:305 [inline]
 brk_handler+0x1a0/0x300 arch/arm64/kernel/debug-monitors.c:320
 do_debug_exception+0x15c/0x408 arch/arm64/mm/fault.c:808
 el1_dbg+0x18/0x78
 access_once_alignment_check+0x34/0x40 mm/access_once.c:12
 __napi_alloc_skb+0x18c/0x2b8 net/core/skbuff.c:482
 napi_alloc_skb include/linux/skbuff.h:2643 [inline]
 napi_get_frags+0x68/0x120 net/core/dev.c:5108
 tun_napi_alloc_frags drivers/net/tun.c:1477 [inline]
 tun_get_user+0x13b0/0x3fe8 drivers/net/tun.c:1820
 tun_chr_write_iter+0xa8/0x158 drivers/net/tun.c:1988
 call_write_iter include/linux/fs.h:1781 [inline]
 do_iter_readv_writev+0x2f8/0x490 fs/read_write.c:653
 do_iter_write+0x14c/0x4b0 fs/read_write.c:932
 vfs_writev+0x130/0x288 fs/read_write.c:977
 do_writev+0xe0/0x248 fs/read_write.c:1012
 SYSC_writev fs/read_write.c:1085 [inline]
 SyS_writev+0x34/0x48 fs/read_write.c:1082
 el0_svc_naked+0x30/0x34
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1002082
Memory Limit: none
Rebooting in 86400 seconds..

... I see these splats with both tun and virtio-net.

I have some Syzkaller logs, and can reproduce the problem locally, but
unfortunately the C reproducer it generated doesn't seem to work on its
own.

Any ideas as to how this could happen?

Thanks,
Mark.

>8
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index c2cc57a2f508..c06b810a3b3b 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -163,6 +163,8 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int 
val,
 
 #include 
 
+void access_once_alignment_check(const volatile void *ptr, int size);
+
 #define __READ_ONCE_SIZE   \
 ({ \
switch (size) { \
@@ -180,6 +182,7 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int 
val,
 static __always_inline
 void __read_once_size(const volatile void *p, void *res, int size)
 {
+   access_once_alignment_check(p, size);
__READ_ONCE_SIZE;
 }
 
@@ -203,6 +206,8 @@ void __read_once_size_nocheck(const volatile void *p, void 
*res, int size)
 
 static __always_inline void __write_once_size(volatile void *p, void *res, int 
size)
 {
+   access_once_alignment_check(p, size);
+
switch (size) {
case 1: *(volatile __u8 *)p = *(__u8 *)res; break;
case 2: *(volatile __u16 *)p = *(__u16 *)res; break;
diff --git a/mm/Makefile b/mm/Makefile
index e669f02c5a54..604d269d7d57 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -3,6 +3,7 @@
 # Makefile for the linux memory manager.
 #
 
+KASAN_SANITIZE_access_once.o := n
 KASAN_SANITIZE_slab_common.o := n
 KASAN_SANITIZE_slab.o := n
 KASAN_SANITIZE_slub.o := n
@@ -10,6 +11,7 @@ KASAN_SANITIZE_slub.o := n
 # These files are disabled because they produce non-interesting and/or
 # flaky coverage that is not a function of syscall inputs. E.g. slab is out of
 # free pages, or a task is migrated between nodes.
+KCOV_INSTRUMENT_access_once.o := n
 KCOV_INSTRUMENT_slab_common.o := n
 KCOV_INSTRUMENT_slob.o := n
 KCOV_INSTRUMENT_slab.o := n
@@ -39,7 +41,7 @@ obj-y := filemap.o mempool.o oom_kill.o \
   mm_init.o mmu_context.o percpu.o slab_common.o \
   compaction.o vmacache.o swap_slots.o \
   interval_tree.o list_lru.o workingset.o \
-  debug.o $(mmu-y)
+  debug.o access_once.o $(mmu-y)
 
 obj-y += init-mm.o
 
diff --git a/mm/access_once.c b/mm/access_once.c
new file mode 100644
index ..42ee35d171c4
--- /dev/null
+++ b/mm/access_once.c
@@ -0,0 +1,15 @@
+#include 
+#include 
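
(The new mm/access_once.c is truncated by the archive here; a helper
consistent with the splats earlier in this mail would look roughly like the
sketch below, which is an assumption, not necessarily the code that was
actually used.)

#include <linux/kernel.h>
#include <linux/bug.h>

void access_once_alignment_check(const volatile void *ptr, int size)
{
	/* Warn when a READ_ONCE()/WRITE_ONCE() target is not naturally
	 * aligned for its access size. */
	WARN_ONCE(size > 1 && !IS_ALIGNED((unsigned long)ptr, size),
		  "misaligned %d-byte access at %p\n", size, ptr);
}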

Re: v4.16-rc1 misaligned atomics in skb__clone / __napi_alloc_skb

2018-02-15 Thread Mark Rutland
On Thu, Feb 15, 2018 at 09:24:36AM -0800, Eric Dumazet wrote:
> On Thu, Feb 15, 2018 at 9:20 AM, Eric Dumazet  wrote:
> >
> > Yes, it seems tun.c breaks the assumptions.
> >
> > If it really wants to provide arbitrary fragments and alignments, it
> > should use a separate
> 
> Sorry, I have sent the message too soon.
> 
> tun.c should use a private 'struct page_frag_cache' to deliver
> arbitrary frags/alignments,
> so that syzkaller might catch interesting bugs in the stack.
>
> > Please try :
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index 
> > 81e6cc951e7fc7c983919365c34842c34bcaedcf..92c6b6d02f7c18b63c42ffe1d9cb7286975e1263
> > 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -1500,7 +1500,7 @@ static struct sk_buff
> > *tun_napi_alloc_frags(struct tun_file *tfile,
> > }
> >
> > local_bh_disable();
> > -   data = napi_alloc_frag(fragsz);
> > +   data = napi_alloc_frag(SKB_DATA_ALIGN(fragsz));
> > local_bh_enable();
> > if (!data) {
> > err = -ENOMEM;
> 
> This patch should solve your immediate problem, but would lower fuzzer
> abilities to find bugs.

So far so good, it seems!

> I will send something more suited to original intent of these commits :
> 
> 90e33d45940793def6f773b2d528e9f3c84ffdc7 tun: enable napi_gro_frags()
> for TUN/TAP driver
> 943170998b200190f99d3fe7e771437e2c51f319 tun: enable NAPI for TUN/TAP driver

Thanks! I'd be more than happy to test any such patches.

As I mentioned, I'm also seeing similar in virtio-net, e.g.

WARNING: CPU: 0 PID: 8 at mm/access_once.c:12 
access_once_alignment_check+0x34/0x40 mm/access_once.c:12
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 8 Comm: ksoftirqd/0 Not tainted 4.16.0-rc1-2-gb03ae7b8b0de #9
Hardware name: linux,dummy-virt (DT)
Call trace:
 dump_backtrace+0x0/0x390 arch/arm64/kernel/time.c:52
 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:151
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0xd0/0x130 lib/dump_stack.c:53
 panic+0x220/0x3fc kernel/panic.c:183
 __warn+0x270/0x2bc kernel/panic.c:547
 report_bug+0x1dc/0x2d0 lib/bug.c:184
 bug_handler+0x7c/0x128 arch/arm64/kernel/traps.c:758
 call_break_hook arch/arm64/kernel/debug-monitors.c:305 [inline]
 brk_handler+0x1a0/0x300 arch/arm64/kernel/debug-monitors.c:320
 do_debug_exception+0x15c/0x408 arch/arm64/mm/fault.c:808
 el1_dbg+0x18/0x78
 access_once_alignment_check+0x34/0x40 mm/access_once.c:12
 __napi_alloc_skb+0x18c/0x2b8 net/core/skbuff.c:482
 napi_alloc_skb include/linux/skbuff.h:2643 [inline]
 page_to_skb.isra.17+0x58/0x610 drivers/net/virtio_net.c:345
 receive_mergeable drivers/net/virtio_net.c:783 [inline]
 receive_buf+0x978/0x2a70 drivers/net/virtio_net.c:888
 virtnet_receive drivers/net/virtio_net.c:1160 [inline]
 virtnet_poll+0x24c/0x850 drivers/net/virtio_net.c:1240
 napi_poll net/core/dev.c:5690 [inline]
 net_rx_action+0x324/0xa50 net/core/dev.c:5756
 __do_softirq+0x318/0x734 kernel/softirq.c:285
 run_ksoftirqd+0x70/0xa8 kernel/softirq.c:666
 smpboot_thread_fn+0x544/0x9d0 kernel/smpboot.c:164
 kthread+0x2f8/0x380 kernel/kthread.c:238
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1158
SMP: stopping secondary CPUs
Kernel Offset: disabled
CPU features: 0x1002082
Memory Limit: none
Rebooting in 86400 seconds..

... does similar apply there?

Thanks,
Mark.


Re: v4.16-rc1 misaligned atomics in skb__clone / __napi_alloc_skb

2018-02-15 Thread Mark Rutland
On Thu, Feb 15, 2018 at 09:43:06AM -0800, Eric Dumazet wrote:
> On Thu, 2018-02-15 at 09:24 -0800, Eric Dumazet wrote:
> > 
> > I will send something more suited to original intent of these commits :
> > 
> > 90e33d45940793def6f773b2d528e9f3c84ffdc7 tun: enable napi_gro_frags()
> > for TUN/TAP driver
> > 943170998b200190f99d3fe7e771437e2c51f319 tun: enable NAPI for TUN/TAP driver
> 
> Can you try this patch ?

Looks good! No splats after 10 minutes with a test that usually fails in
a few seconds.

FWIW:

Tested-by: Mark Rutland 

Thanks,
Mark.

>  drivers/net/tun.c |   16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 
> 81e6cc951e7fc7c983919365c34842c34bcaedcf..b52258c327d2e1d7c7476de345e49f082909c246
>  100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1489,27 +1489,23 @@ static struct sk_buff *tun_napi_alloc_frags(struct 
> tun_file *tfile,
>   skb->truesize += skb->data_len;
>  
>   for (i = 1; i < it->nr_segs; i++) {
> > + struct page_frag *pfrag = &current->task_frag;
>   size_t fragsz = it->iov[i].iov_len;
> - unsigned long offset;
> - struct page *page;
> - void *data;
>  
>   if (fragsz == 0 || fragsz > PAGE_SIZE) {
>   err = -EINVAL;
>   goto free;
>   }
>  
> - local_bh_disable();
> - data = napi_alloc_frag(fragsz);
> - local_bh_enable();
> - if (!data) {
> + if (!skb_page_frag_refill(fragsz, pfrag, GFP_KERNEL)) {
>   err = -ENOMEM;
>   goto free;
>   }
>  
> - page = virt_to_head_page(data);
> - offset = data - page_address(page);
> - skb_fill_page_desc(skb, i - 1, page, offset, fragsz);
> + skb_fill_page_desc(skb, i - 1, pfrag->page,
> +pfrag->offset, fragsz);
> + page_ref_inc(pfrag->page);
> + pfrag->offset += fragsz;
>   }
>  
>   return skb;
> 


dns_resolver_preparse tries to print arbitrarily-large user-provided strings

2018-02-27 Thread Mark Rutland
Hi,

As a heads-up, while fuzzing v4.16-rc3 on arm64 with Syzkaller, I hit a
system hang which I was able to minize to the reproducer below. It looks
like the system hang is an artifact of Syzkaller using panic_on_warn, as
dns_resolver_preparse can trigger a WARN_ONCE() in the bowels of
printk(), and we recurse on a lock that's already held.

The underlying issue is that dns_resolver_preparse() may try to dump an
arbitrarily large chunk of user-provided data, even from an unprivileged
user, while printk() has some internal limit on string precision.

On bare metal (without panic_on_warn), this means I get the following in
my dmesg:

$ ./repro
[   56.870339] Option ' 
[   56.870350] [ cut here ]
[   56.870355] precision 1044478 too large
[   56.870362] WARNING: CPU: 2 PID: 2450 at lib/vsprintf.c:2179 
vsnprintf+0xdf0/0x1268
[   56.870366] Modules linked in:
[   56.870376] CPU: 2 PID: 2450 Comm: repro Not tainted 
4.16.0-rc3-2-g544c20187688-dirty #4
[   56.870382] Hardware name: ARM Juno development board (r1) (DT)
[   56.870386] pstate: 8085 (Nzcv daIf -PAN -UAO)
[   56.870390] pc : vsnprintf+0xdf0/0x1268
[   56.870394] lr : vsnprintf+0xdf0/0x1268
[   56.870398] sp : 80092b41f6f0
[   56.870401] x29: 80092b41f6f0 x28: 2b093f36
[   56.870411] x27: 03e0 x26: 0002
[   56.870421] x25: 000feffe x24: 100125683eee
[   56.870431] x23: 2a433200 x22: 2bc5f3c0
[   56.870440] x21: dfff2000 x20: 2bc5efea
[   56.870450] x19: 2a4329ce x18: c51e5860
[   56.870459] x17: 00411018 x16: 28957d28
[   56.870469] x15: f200 x14: 2aa0a6c0
[   56.870479] x13:  x12: 2aa0a000
[   56.870488] x11: 100126eda462 x10: 100126eda462
[   56.870498] x9 : dfff2000 x8 : 
[   56.870507] x7 : 8009376d2311 x6 : 8009376d2312
[   56.870517] x5 : 100126eda463 x4 : 100126eda463
[   56.870526] x3 :  x2 : 0001
[   56.870536] x1 : fdb9418241605c00 x0 : 
[   56.870545] Call trace:
[   56.870549]  vsnprintf+0xdf0/0x1268
[   56.870553]  vscnprintf+0x48/0x88
[   56.870556]  vprintk_emit+0xac/0x5f0
[   56.870560]  vprintk_default+0x44/0x50
[   56.870564]  vprintk_func+0x3dc/0x630
[   56.870568]  printk+0xbc/0xec
[   56.870572]  dns_resolver_preparse+0x5bc/0x6e8
[   56.870576]  key_create_or_update+0x298/0x828
[   56.870580]  SyS_add_key+0x110/0x3c8
[   56.870584]  el0_svc_naked+0x30/0x34
[   56.870589] ---[ end trace 2b46801cfa43d927 ]---

Any ideas on how to avoid this?

Thanks,
Mark.

>8
#include <unistd.h>
#include <sys/syscall.h>

#define BUF_SIZE 0xff000

static char buf[BUF_SIZE] = {
[0] = '#',
[1 ... BUF_SIZE - 2] = 'a'
};

int main()
{
syscall(__NR_add_key, "dns_resolver", "a", buf, BUF_SIZE, -1);
return 0;
}
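
One way to avoid the oversized dump would be to clamp the precision handed
to printk() in dns_resolver_preparse(); a rough sketch (the helper name,
variable names and message text are assumptions, not the actual code or the
eventual fix):

#include <linux/kernel.h>
#include <linux/printk.h>

static void dns_resolver_warn_opt(const char *opt, int opt_len)
{
	/* The option can be ~1MiB of user-supplied data; cap what is
	 * passed as printk() precision. */
	int len = min_t(int, opt_len, 64);

	pr_warn("Option '%*.*s': bad/missing value\n", len, len, opt);
}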



Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-01-09 Thread Mark Rutland
On Mon, Jan 08, 2018 at 10:49:01AM -0800, Linus Torvalds wrote:
> On Mon, Jan 8, 2018 at 9:05 AM, Mark Rutland  wrote:
> >
> > I'm a little worried that in the presence of some CPU/compiler
> > optimisations, the masking may effectively be skipped under speculation.
> > So I'm not sure how robust this is going to be.
> 
> Honestly, I think the masking is a hell of a lot more robust than any
> of the "official" fixes.
> 
> More generic data speculation (as opposed to control speculation) is

>  (c) isn't actually done in any real CPU's today that I'm aware of
> (unless you want to call the return stack data speculation).

Maybe not generally, but the GPZ writeup claims that when a load
misses in the cache, some CPUs speculate the value (as all-zeroes).

See "Variant 3: Rogue data cache load" in:

  
https://googleprojectzero.blogspot.co.uk/2018/01/reading-privileged-memory-with-side.html

If a CPU speculates a load of a mask as all-zeroes, we're fine. If a CPU
can speculate the load of a mask as all-ones, the AND is effectively a
NOP.

I'll wait for Will to find out what's actually been built...

Thanks,
Mark.


Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-01-05 Thread Mark Rutland
Hi Alexei,

On Thu, Jan 04, 2018 at 08:28:11PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov 
> 
> Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
> memory accesses under a bounds check may be speculated even if the
> bounds check fails, providing a primitive for building a side channel.
> 
> To avoid leaking kernel data round up array-based maps and mask the index
> after bounds check, so speculated load with out of bounds index will load
> either valid value from the array or zero from the padded area.

This is on my radar, and I'll try to review / test this next week.

Thanks,
Mark.


Re: [PATCH 01/18] asm-generic/barrier: add generic nospec helpers

2018-01-06 Thread Mark Rutland
On Fri, Jan 05, 2018 at 09:23:06PM -0800, Dan Williams wrote:
> On Fri, Jan 5, 2018 at 6:55 PM, Linus Torvalds
>  wrote:
> > On Fri, Jan 5, 2018 at 5:09 PM, Dan Williams  
> > wrote:
> >> +#ifndef nospec_ptr
> >> +#define nospec_ptr(ptr, lo, hi)   
> >>  \
> >
> > Do we actually want this horrible interface?
> >
> > It just causes the compiler - or inline asm - to generate worse code,
> > because it needs to compare against both high and low limits.
> >
> > Basically all users are arrays that are zero-based, and where a
> > comparison against the high _index_ limit would be sufficient.
> >
> > But the way this is all designed, it's literally designed for bad code
> > generation for the unusual case, and the usual array case is written
> > in the form of the unusual and wrong non-array case. That really seems
> > excessively stupid.
> 
> Yes, it appears we can kill nospec_ptr() and move nospec_array_ptr()
> to assume 0 based arrays rather than use nospec_ptr.

Sounds good to me; I can respin the arm/arm64 implementations accordingly.

We can always revisit that if we have non-array cases that need to be covered.
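
For reference, the zero-based form the kernel ended up with is
array_index_nospec(), as used in the find_and_alloc_map() fix later in this
digest. A rough usage sketch for a fixed-size table indexed by a
user-supplied value (table and function names here are made up):

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/nospec.h>

static const int foo_table[16];

static int foo_lookup(unsigned int idx)
{
	if (idx >= ARRAY_SIZE(foo_table))
		return -EINVAL;

	/* Clamp the index under speculation as well as architecturally. */
	idx = array_index_nospec(idx, ARRAY_SIZE(foo_table));

	return foo_table[idx];
}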

Thanks,
Mark.


Re: [PATCH bpf] bpf: prevent out-of-bounds speculation

2018-01-08 Thread Mark Rutland
Hi Alexei,

On Thu, Jan 04, 2018 at 08:28:11PM -0800, Alexei Starovoitov wrote:
> From: Alexei Starovoitov 
> 
> Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
> memory accesses under a bounds check may be speculated even if the
> bounds check fails, providing a primitive for building a side channel.
> 
> To avoid leaking kernel data round up array-based maps and mask the index
> after bounds check, so speculated load with out of bounds index will load
> either valid value from the array or zero from the padded area.

Thanks for putting this together, this certainly looks neat.

I'm a little worried that in the presence of some CPU/compiler
optimisations, the masking may effectively be skipped under speculation.
So I'm not sure how robust this is going to be.

More on that below.

> To avoid duplicating map_lookup functions for root/unpriv always generate
> a sequence of bpf instructions equivalent to map_lookup function for
> array and array_of_maps map types when map was created by unpriv user.
> And unconditionally mask index for percpu_array, since it's fast enough,
> even when max_entries are not rounded to power of 2 for root user,
> since percpu_array doesn't have map_gen_lookup callback yet.

Is there a noticeable slowdown from the masking? Can't we always have
that in place?

> @@ -157,7 +175,7 @@ static void *percpu_array_map_lookup_elem(struct bpf_map 
> *map, void *key)
>   if (unlikely(index >= array->map.max_entries))
>   return NULL;
>  
> - return this_cpu_ptr(array->pptrs[index]);
> + return this_cpu_ptr(array->pptrs[index & array->index_mask]);

As above, I think this isn't necessarily robust, as CPU/compiler
optimisations can break the dependency on the index_mask, allowing
speculation without a mask.

e.g. a compiler could re-write this as:

if (array->index_mask != 0x)
index &= array->index_mask;
return this_cpu_ptr(array->pptrs[index]);

... which would allow an unmasked index to be used in speculated paths.

Similar cases could occur with some CPU implementations. For example, HW
value-prediction could result in the use of an all-ones mask under
speculation.

I think that we may need to be able to provide an arch-specific
pointer sanitization sequence (though we could certainly have masking as
the default).

I have a rough idea as to how that could be plumbed into the JIT. First
I need to verify the sequence I have in mind for arm/arm64 is
sufficient.

Thanks,
Mark.


Re: [PATCH 02/18] Documentation: document nospec helpers

2018-01-08 Thread Mark Rutland
Hi Jon,

On Mon, Jan 08, 2018 at 09:29:17AM -0700, Jonathan Corbet wrote:
> On Fri, 05 Jan 2018 17:10:03 -0800
> Dan Williams  wrote:
> 
> > Document the rationale and usage of the new nospec*() helpers.
> 
> I have just a couple of overall comments.
> 
>  - It would be nice if the document were done in RST and placed in the
>core-API manual, perhaps using kerneldoc comments for the macros
>themselves.  It's already 99.9% RST now, so the changes required would
>be minimal.

Is there any quickstart guide to RST that you can recommend?

I'm happy to clean up the documentation; I'm just not familiar with RST.

>  - More importantly: is there any way at all to give guidance to
>developers wondering *when* they should use these primitives?  I think
>it would be easy to create a situation where they don't get used where
>they are really needed; meanwhile, there may well be a flood of
>"helpful" patches adding them where they make no sense at all.

This is on my TODO list.

The unfortunate truth is that it's likely to be a subjective judgement
call in many cases, depending on how likely it is that the user can
influence the code in question, so it's difficult to provide
hard-and-fast rules.

Thanks,
Mark.


[PATCH RESEND] SUNRPC: fix include for cmpxchg_relaxed()

2018-05-03 Thread Mark Rutland
Currently net/sunrpc/xprtmultipath.c is the only file outside of arch/
headers and asm-generic/ headers to include <asm/cmpxchg.h>, apparently
for the use of cmpxchg_relaxed().

However, many architectures do not provide cmpxchg_relaxed() in their
<asm/cmpxchg.h>, and it is necessary to include <linux/atomic.h> to get
this definition, as noted in Documentation/core-api/atomic_ops.rst:

  If someone wants to use xchg(), cmpxchg() and their variants,
  linux/atomic.h should be included rather than asm/cmpxchg.h, unless
  the code is in arch/* and can take care of itself.

Evidently we're getting the right header via some transitive
include today, but this isn't something we can/should rely upon,
especially with ongoing rework of the atomic headers for KASAN
instrumentation.

Let's fix the code to include <linux/atomic.h>, avoiding fragility.

Signed-off-by: Mark Rutland 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: J. Bruce Fields 
Cc: Jeff Layton 
Cc: David S. Miller 
Cc: linux-...@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 net/sunrpc/xprtmultipath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

I sent this about a year ago [1], but got no response. This still applies atop
of v4.17-rc3.

I'm currently trying to implement instrumented atomics for arm64, and it would
be great to have this fixed.

Mark.

[1] 
https://lkml.kernel.org/r/1489574142-20856-1-git-send-email-mark.rutl...@arm.com

diff --git a/net/sunrpc/xprtmultipath.c b/net/sunrpc/xprtmultipath.c
index e2d64c7138c3..d897f41be244 100644
--- a/net/sunrpc/xprtmultipath.c
+++ b/net/sunrpc/xprtmultipath.c
@@ -13,7 +13,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
-- 
2.11.0



[PATCH] bpf: fix possible spectre-v1 in find_and_alloc_map()

2018-05-03 Thread Mark Rutland
It's possible for userspace to control attr->map_type. Sanitize it when
using it as an array index to prevent an out-of-bounds value being used
under speculation.

Found by smatch.

Signed-off-by: Mark Rutland 
Cc: Alexei Starovoitov 
Cc: Dan Carpenter 
Cc: Daniel Borkmann 
Cc: Peter Zijlstra 
Cc: netdev@vger.kernel.org
---
 kernel/bpf/syscall.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

I found this when running smatch over a v4.17-rc2 arm64 allyesconfig kernel.

IIUC this may allow for a speculative branch to an arbitrary gadget when we
subsequently call ops->map_alloc_check(attr).

Mark.

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 4ca46df19c9a..8a7acd0dbeb6 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/nospec.h>
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PROG_ARRAY || \
   (map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
@@ -102,12 +103,14 @@ const struct bpf_map_ops bpf_map_offload_ops = {
 static struct bpf_map *find_and_alloc_map(union bpf_attr *attr)
 {
const struct bpf_map_ops *ops;
+   u32 type = attr->map_type;
struct bpf_map *map;
int err;
 
-   if (attr->map_type >= ARRAY_SIZE(bpf_map_types))
+   if (type >= ARRAY_SIZE(bpf_map_types))
return ERR_PTR(-EINVAL);
-   ops = bpf_map_types[attr->map_type];
+   type = array_index_nospec(type, ARRAY_SIZE(bpf_map_types));
+   ops = bpf_map_types[type];
if (!ops)
return ERR_PTR(-EINVAL);
 
@@ -122,7 +125,7 @@ static struct bpf_map *find_and_alloc_map(union bpf_attr 
*attr)
if (IS_ERR(map))
return map;
map->ops = ops;
-   map->map_type = attr->map_type;
+   map->map_type = type;
return map;
 }
 
-- 
2.11.0



Re: [PATCH] bpf: fix possible spectre-v1 in find_and_alloc_map()

2018-05-04 Thread Mark Rutland
On Fri, May 04, 2018 at 02:16:31AM +0200, Daniel Borkmann wrote:
> On 05/03/2018 06:04 PM, Mark Rutland wrote:
> > It's possible for userspace to control attr->map_type. Sanitize it when
> > using it as an array index to prevent an out-of-bounds value being used
> > under speculation.
> > 
> > Found by smatch.
> > 
> > Signed-off-by: Mark Rutland 
> > Cc: Alexei Starovoitov 
> > Cc: Dan Carpenter 
> > Cc: Daniel Borkmann 
> > Cc: Peter Zijlstra 
> > Cc: netdev@vger.kernel.org
> 
> Applied to bpf tree, thanks Mark!

Cheers!

> I've also just submitted one for BPF progs
> (http://patchwork.ozlabs.org/patch/908385/) which is same situation.

That looks good to me. That case doesn't show up in my smatch results so
far, but I might just need a few more build iterations.

Thanks,
Mark.


Re: [PATCH bpf] bpf: fix ldx in ld_abs rewrite for large offsets

2018-07-10 Thread Mark Rutland
On Tue, Jul 10, 2018 at 12:43:22AM +0200, Daniel Borkmann wrote:
> Mark reported that syzkaller triggered a KASAN detected slab-out-of-bounds
> bug in ___bpf_prog_run() with a BPF_LD | BPF_ABS word load at offset 0x8001.
> After further investigation it became clear that the issue was the
> BPF_LDX_MEM() which takes offset as an argument whereas it cannot encode
> larger than S16_MAX offsets into it. For this synthetical case we need to
> move the full address into tmp register instead and do the LDX without
> immediate value.
> 
> Fixes: e0cea7ce988c ("bpf: implement ld_abs/ld_ind in native bpf")
> Reported-by: syzbot 
> Reported-by: Mark Rutland 
> Signed-off-by: Daniel Borkmann 
> ---
>  net/core/filter.c | 16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 5fa66a3..a13f5b1 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -459,11 +459,21 @@ static bool convert_bpf_ld_abs(struct sock_filter *fp, 
> struct bpf_insn **insnp)
>(!unaligned_ok && offset >= 0 &&
> offset + ip_align >= 0 &&
> offset + ip_align % size == 0))) {
> + bool ldx_off_ok = offset <= S16_MAX;
> +

Given offset is a (signed) int, is it possible for that to be a negative
value less than S16_MIN? ... or is that ruled out elsewhere?

Thanks,
Mark.

>   *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_H);
>   *insn++ = BPF_ALU64_IMM(BPF_SUB, BPF_REG_TMP, offset);
> - *insn++ = BPF_JMP_IMM(BPF_JSLT, BPF_REG_TMP, size, 2 + endian);
> - *insn++ = BPF_LDX_MEM(BPF_SIZE(fp->code), BPF_REG_A, BPF_REG_D,
> -   offset);
> + *insn++ = BPF_JMP_IMM(BPF_JSLT, BPF_REG_TMP,
> +   size, 2 + endian + (!ldx_off_ok * 2));
> + if (ldx_off_ok) {
> + *insn++ = BPF_LDX_MEM(BPF_SIZE(fp->code), BPF_REG_A,
> +   BPF_REG_D, offset);
> + } else {
> + *insn++ = BPF_MOV64_REG(BPF_REG_TMP, BPF_REG_D);
> + *insn++ = BPF_ALU64_IMM(BPF_ADD, BPF_REG_TMP, offset);
> + *insn++ = BPF_LDX_MEM(BPF_SIZE(fp->code), BPF_REG_A,
> +   BPF_REG_TMP, 0);
> + }
>   if (endian)
>   *insn++ = BPF_ENDIAN(BPF_FROM_BE, BPF_REG_A, size * 8);
>   *insn++ = BPF_JMP_A(8);
> -- 
> 2.9.5
> 


Re: [PATCH bpf] bpf: fix ldx in ld_abs rewrite for large offsets

2018-07-10 Thread Mark Rutland
On Tue, Jul 10, 2018 at 02:13:42PM +0200, Daniel Borkmann wrote:
> On 07/10/2018 12:14 PM, Mark Rutland wrote:
> > On Tue, Jul 10, 2018 at 12:43:22AM +0200, Daniel Borkmann wrote:
> >> Mark reported that syzkaller triggered a KASAN detected slab-out-of-bounds
> >> bug in ___bpf_prog_run() with a BPF_LD | BPF_ABS word load at offset 
> >> 0x8001.
> >> After further investigation it became clear that the issue was the
> >> BPF_LDX_MEM() which takes offset as an argument whereas it cannot encode
> >> larger than S16_MAX offsets into it. For this synthetical case we need to
> >> move the full address into tmp register instead and do the LDX without
> >> immediate value.
> >>
> >> Fixes: e0cea7ce988c ("bpf: implement ld_abs/ld_ind in native bpf")
> >> Reported-by: syzbot 
> >> Reported-by: Mark Rutland 
> >> Signed-off-by: Daniel Borkmann 
> >> ---
> >>  net/core/filter.c | 16 +---
> >>  1 file changed, 13 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index 5fa66a3..a13f5b1 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -459,11 +459,21 @@ static bool convert_bpf_ld_abs(struct sock_filter 
> >> *fp, struct bpf_insn **insnp)
> >> (!unaligned_ok && offset >= 0 &&
> >>  offset + ip_align >= 0 &&
> >>  offset + ip_align % size == 0))) {
> >> +  bool ldx_off_ok = offset <= S16_MAX;
> >> +
> > 
> > Given offset is a (signed) int, is it possible for that to be a negative
> > value less than S16_MIN? ... or is that ruled out elsewhere?
> 
> This branch here handles only positive offset.

Ah, sorry. I missed the "offset >= 0" check in the context above.

> offset can be negative,
> but in that case these insns won't be emitted and handling will be done
> in 'slow path' via bpf_skb_load_helper_*().

Ok; looks good to me, then.

Thanks,
Mark.


Re: [PATCH 3/3] net: smsc911x: add u16 workaround for pxa platforms

2016-10-03 Thread Mark Rutland
On Mon, Oct 03, 2016 at 11:05:53AM +0200, Robert Jarzmik wrote:
> Add a workaround for mainstone, idp and stargate2 boards, for u16 writes
> which must be aligned on 32 bits addresses.
> 
> Signed-off-by: Robert Jarzmik 
> ---
>  Documentation/devicetree/bindings/net/smsc911x.txt | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/smsc911x.txt 
> b/Documentation/devicetree/bindings/net/smsc911x.txt
> index 3fed3c124411..224965b7453c 100644
> --- a/Documentation/devicetree/bindings/net/smsc911x.txt
> +++ b/Documentation/devicetree/bindings/net/smsc911x.txt
> @@ -13,6 +13,8 @@ Optional properties:
>  - reg-io-width : Specify the size (in bytes) of the IO accesses that
>should be performed on the device.  Valid value for SMSC LAN is
>2 or 4.  If it's omitted or invalid, the size would be 2.
> +- reg-u16-align4 : Boolean, put in place the workaround the force all
> +u16 writes to be 32 bits aligned

This property name and description are confusing.

How exactly does this differ from having reg-io-width = <4>, which is
documented immediately above?

Thanks,
Mark.

>  - smsc,irq-active-high : Indicates the IRQ polarity is active-high
>  - smsc,irq-push-pull : Indicates the IRQ type is push-pull
>  - smsc,force-internal-phy : Forces SMSC LAN controller to use
> -- 
> 2.1.4
> 


Re: [PATCH 3/3] net: smsc911x: add u16 workaround for pxa platforms

2016-10-03 Thread Mark Rutland
On Mon, Oct 03, 2016 at 05:09:13PM +0100, Russell King - ARM Linux wrote:
> Please note that the binding doc for smsc,lan91c111.txt is slightly wrong
> on two counts:
> 
> 1) compatible property:
> 
> compatible = "smsc,lan91c111";
> 
> vs the code:
> 
> static const struct of_device_id smc91x_match[] = {
> { .compatible = "smsc,lan91c94", },
> { .compatible = "smsc,lan91c111", },
> {},
> };
> MODULE_DEVICE_TABLE(of, smc91x_match);
> 
> So the binding document needs to mention that smsc,lan91c94 is a valid
> compatible for this device.

Yes, it should.

> 2) reg-io-width property:
> 
> - reg-io-width : Mask of sizes (in bytes) of the IO accesses that
>   are supported on the device.  Valid value for SMSC LAN91c111 are
>   1, 2 or 4.  If it's omitted or invalid, the size would be 2 meaning
>   16-bit access only.
 
> Moreover, look at the property name vs the binding description.  It's
> property name says it's a width, but the description says it's a mask
> of sizes - these really aren't the same thing.  Once you start
> specifying these other legal masks, it makes a nonsense of the "width"
> part of the name.  It's too late to try and fix this now though.

Indeed, as-is this is nonsense. :(

The best we can do here is to add a big fat notice regarding the
misnaming; adding a new property is only going to cause more confusion.

> The binding document really needs to get fixed - I'll try to cook up a
> patch during this week to correct these points, but it probably needs
> coordination if others are going to be changing this as well.

Thanks for handling both of these.

Given the historical rate of change of the binding document, I suspect
the stuff for pxa platforms is going to be the only potential conflict.

Either all of that can go via the DT tree (independent of any new code),
or we can ack the whole lot and it can all go via the net tree in one
go.

Thanks,
Mark.


Re: [PATCH 3/3] net: smsc911x: add u16 workaround for pxa platforms

2016-10-03 Thread Mark Rutland
On Mon, Oct 03, 2016 at 06:11:23PM +0200, Robert Jarzmik wrote:
> Mark Rutland  writes:
> 
> > On Mon, Oct 03, 2016 at 11:05:53AM +0200, Robert Jarzmik wrote:
> >> Add a workaround for mainstone, idp and stargate2 boards, for u16 writes
> >> which must be aligned on 32 bits addresses.
> >> 
> >> Signed-off-by: Robert Jarzmik 
> >> ---
> >>  Documentation/devicetree/bindings/net/smsc911x.txt | 2 ++
> >>  1 file changed, 2 insertions(+)
> >> 
> >> diff --git a/Documentation/devicetree/bindings/net/smsc911x.txt 
> >> b/Documentation/devicetree/bindings/net/smsc911x.txt
> >> index 3fed3c124411..224965b7453c 100644
> >> --- a/Documentation/devicetree/bindings/net/smsc911x.txt
> >> +++ b/Documentation/devicetree/bindings/net/smsc911x.txt
> >> @@ -13,6 +13,8 @@ Optional properties:
> >>  - reg-io-width : Specify the size (in bytes) of the IO accesses that
> >>should be performed on the device.  Valid value for SMSC LAN is
> >>2 or 4.  If it's omitted or invalid, the size would be 2.
> >> +- reg-u16-align4 : Boolean, put in place the workaround the force all
> >> + u16 writes to be 32 bits aligned
> >
> > This property name and description is confusing.
> >
> > How exactly does this differ from having reg-io-width = <4>, which is
> > documented immediately above?
> 
> reg-io-width specifies the IO size, ie. how many data lines are physically
> connected from the system bus to the lan adapter.
> 
> reg-u16-align4 tells that a specific hardware doesn't support 16 bit writes 
> not
> being 32 bits aligned, or said differently that a "store" 16 bits wide on an
> address of the format 4*n + 2 deserves a special handling in the driver, 
> while a
> store 16 bits wide on an address of the format 4*n can follow the simple 
> casual
> case.

If I've understood correctly, effectively the low 2 address lines to the
device are hard-wired to zero, e.g. a 16-bit access to 4*n + 2 would go
to 4*n + 0 on the device? Or is the failure case distinct from that?

Do we have other platforms where similar is true? e.g. u8 accesses
requiring 16-bit alignment?

Thanks,
Mark.


[PATCH 1/2] ath9k: ar9002_mac: kill off ACCESS_ONCE()

2016-12-27 Thread Mark Rutland
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some new features (e.g. KTSAN / Kernel Thread Sanitizer),
it is necessary to instrument reads and writes separately, which is not
possible with ACCESS_ONCE(). This distinction is critical to correct
operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, for some files (including the ath9k ar9002 mac
driver), this mangles the formatting. As a preparatory step, this patch
converts the driver to use {READ,WRITE}_ONCE() without said mangling.


virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)


Signed-off-by: Mark Rutland 
Cc: ath9k-de...@qca.qualcomm.com
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: ath9k-de...@lists.ath9k.org
Cc: netdev@vger.kernel.org
---
 drivers/net/wireless/ath/ath9k/ar9002_mac.c | 64 ++---
 1 file changed, 32 insertions(+), 32 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/ar9002_mac.c 
b/drivers/net/wireless/ath/ath9k/ar9002_mac.c
index f816909..4b3c9b1 100644
--- a/drivers/net/wireless/ath/ath9k/ar9002_mac.c
+++ b/drivers/net/wireless/ath/ath9k/ar9002_mac.c
@@ -220,8 +220,8 @@ ar9002_set_txdesc(struct ath_hw *ah, void *ds, struct 
ath_tx_info *i)
ads->ds_txstatus6 = ads->ds_txstatus7 = 0;
ads->ds_txstatus8 = ads->ds_txstatus9 = 0;
 
-   ACCESS_ONCE(ads->ds_link) = i->link;
-   ACCESS_ONCE(ads->ds_data) = i->buf_addr[0];
+   WRITE_ONCE(ads->ds_link, i->link);
+   WRITE_ONCE(ads->ds_data, i->buf_addr[0]);
 
ctl1 = i->buf_len[0] | (i->is_last ? 0 : AR_TxMore);
ctl6 = SM(i->keytype, AR_EncrType);
@@ -235,26 +235,26 @@ ar9002_set_txdesc(struct ath_hw *ah, void *ds, struct 
ath_tx_info *i)
 
if ((i->is_first || i->is_last) &&
i->aggr != AGGR_BUF_MIDDLE && i->aggr != AGGR_BUF_LAST) {
-   ACCESS_ONCE(ads->ds_ctl2) = set11nTries(i->rates, 0)
+   WRITE_ONCE(ads->ds_ctl2, set11nTries(i->rates, 0)
| set11nTries(i->rates, 1)
| set11nTries(i->rates, 2)
| set11nTries(i->rates, 3)
| (i->dur_update ? AR_DurUpdateEna : 0)
-   | SM(0, AR_BurstDur);
+   | SM(0, AR_BurstDur));
 
-   ACCESS_ONCE(ads->ds_ctl3) = set11nRate(i->rates, 0)
+   WRITE_ONCE(ads->ds_ctl3, set11nRate(i->rates, 0)
| set11nRate(i->rates, 1)
| set11nRate(i->rates, 2)
-   | set11nRate(i->rates, 3);
+   | set11nRate(i->rates, 3));
} else {
-   ACCESS_ONCE(ads->ds_ctl2) = 0;
-   ACCESS_ONCE(ads->ds_ctl3) = 0;
+   WRITE_ONCE(ads->ds_ctl2, 0);
+   WRITE_ONCE(ads->ds_ctl3, 0);
}
 
if (!i->is_first) {
-   ACCESS_ONCE(ads->ds_ctl0) = 0;
-   ACCESS_ONCE(ads->ds_ctl1) = ctl1;
-   ACCESS_ONCE(ads->ds_ctl6) = ctl6;
+   WRITE_ONCE(ads->ds_ctl0, 0);
+   WRITE_ONCE(ads->ds_ctl1, ctl1);
+   WRITE_ONCE(ads->ds_ctl6, ctl6);
return;
}
 
@@ -279,7 +279,7 @@ ar9002_set_txdesc(struct ath_hw *ah, void *ds, struct 
ath_tx_info *i)
break;
}
 
-   ACCESS_ONCE(ads->ds_ctl0) = (i->pkt_len & AR_FrameLen)
+   WRITE_ONCE(ads->ds_ctl0, (i->pkt_len & AR_FrameLen)
| (i->flags & ATH9K_TXDESC_VMF ? AR_VirtMoreFrag : 0)
| SM(i->txpower[0], AR_XmitPower0)
| (i->flags & ATH9K_TXDESC_VEOL ? AR_VEOL : 0)
@@ -287,29 +287,29 @@ ar9002_set_txdesc(struct ath_hw *ah, void *ds, struct 
ath_tx_info *i)
| (i->keyix != ATH9K_TXKEYIX_INVALID ? AR_DestIdxValid : 0)
| (i->flags & ATH9K_TXDESC_CLRDMASK ? AR_ClrDestMask : 0)
| (i->flags & ATH9K_TXDESC_RTSENA ? AR_RTSEnable :
-  (i->flags & ATH9K_TXDESC_CTSENA ? AR_CTSEnable : 0));
+  (i->flags & ATH9K_TXDESC_CTSENA ? AR_CTSEnable : 0)));
 
-   ACCESS_ONCE(ads->ds_ctl1) = ctl1;
-   ACCESS_ONCE(ads->ds_ctl6) = ctl6;
+   WRITE_ONCE(ads->ds_ctl1, ctl1);
+   WRITE_ONCE(ads->ds_ctl6, ctl6);
 
if (i->aggr == AGGR_BUF_MIDDLE || i->aggr == AGGR_BUF_LAST)
return;
 
-   ACCESS_ONCE(ads->ds_ctl4) = 

[PATCH 0/2] ath9k: kill off ACCESS_ONCE() in MAC drivers

2016-12-27 Thread Mark Rutland
For several reasons, it would be beneficial to kill off ACCESS_ONCE()
tree-wide, in favour of {READ,WRITE}_ONCE(). These work with aggregate
types, more obviously document their intended behaviour, and are
necessary for tools like KTSAN to work correctly (as otherwise reads and
writes cannot be instrumented separately).

While it's possible to script a tree-wide conversion using Coccinelle,
some cases such as the ath9k MAC drivers require some manual
intervention to ensure that the resulting code remains legible. This
series moves the ath9k MAC drivers over to {READ,WRITE}_ONCE(). In both
cases this is functionally equivalent to the below Coccinelle script
being applied, though the existing formatting is retained.


virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
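
As a concrete illustration of the aggregate-type point above (a minimal,
hypothetical sketch, not code taken from the ath9k driver):

#include <linux/compiler.h>
#include <linux/types.h>

/* Hypothetical descriptor layout, purely for illustration. */
struct dma_desc {
        u32 ctl;
        u32 addr;
};

static struct dma_desc snapshot_desc(struct dma_desc *d)
{
        /*
         * READ_ONCE() accepts an aggregate argument; ACCESS_ONCE(*d) would
         * rely on a volatile cast of a non-scalar type, which GCC does not
         * reliably honour, and which is exactly the usage being removed.
         */
        return READ_ONCE(*d);
}

static void set_desc_ctl(struct dma_desc *d, u32 val)
{
        /* A scalar store; the compiler must emit it exactly once. */
        WRITE_ONCE(d->ctl, val);
}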


Thanks,
Mark.

Mark Rutland (2):
  ath9k: ar9002_mac: kill off ACCESS_ONCE()
  ath9k: ar9003_mac: kill off ACCESS_ONCE()

 drivers/net/wireless/ath/ath9k/ar9002_mac.c | 64 ++--
 drivers/net/wireless/ath/ath9k/ar9003_mac.c | 92 ++---
 2 files changed, 78 insertions(+), 78 deletions(-)

-- 
2.7.4



[PATCH 2/2] ath9k: ar9003_mac: kill off ACCESS_ONCE()

2016-12-27 Thread Mark Rutland
For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't currently harmful.

However, for some new features (e.g. KTSAN / Kernel Thread Sanitizer),
it is necessary to instrument reads and writes separately, which is not
possible with ACCESS_ONCE(). This distinction is critical to correct
operation.

It's possible to transform the bulk of kernel code using the Coccinelle
script below. However, for some files (including the ath9k ar9003 mac
driver), this mangles the formatting. As a preparatory step, this patch
converts the driver to use {READ,WRITE}_ONCE() without said mangling.


virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)


Signed-off-by: Mark Rutland 
Cc: ath9k-de...@qca.qualcomm.com
Cc: Kalle Valo 
Cc: linux-wirel...@vger.kernel.org
Cc: ath9k-de...@lists.ath9k.org
Cc: netdev@vger.kernel.org
---
 drivers/net/wireless/ath/ath9k/ar9003_mac.c | 92 ++---
 1 file changed, 46 insertions(+), 46 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/ar9003_mac.c 
b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
index da84b70..cc5bb0a 100644
--- a/drivers/net/wireless/ath/ath9k/ar9003_mac.c
+++ b/drivers/net/wireless/ath/ath9k/ar9003_mac.c
@@ -39,47 +39,47 @@ ar9003_set_txdesc(struct ath_hw *ah, void *ds, struct 
ath_tx_info *i)
  (i->qcu << AR_TxQcuNum_S) | desc_len;
 
checksum += val;
-   ACCESS_ONCE(ads->info) = val;
+   WRITE_ONCE(ads->info, val);
 
checksum += i->link;
-   ACCESS_ONCE(ads->link) = i->link;
+   WRITE_ONCE(ads->link, i->link);
 
checksum += i->buf_addr[0];
-   ACCESS_ONCE(ads->data0) = i->buf_addr[0];
+   WRITE_ONCE(ads->data0, i->buf_addr[0]);
checksum += i->buf_addr[1];
-   ACCESS_ONCE(ads->data1) = i->buf_addr[1];
+   WRITE_ONCE(ads->data1, i->buf_addr[1]);
checksum += i->buf_addr[2];
-   ACCESS_ONCE(ads->data2) = i->buf_addr[2];
+   WRITE_ONCE(ads->data2, i->buf_addr[2]);
checksum += i->buf_addr[3];
-   ACCESS_ONCE(ads->data3) = i->buf_addr[3];
+   WRITE_ONCE(ads->data3, i->buf_addr[3]);
 
checksum += (val = (i->buf_len[0] << AR_BufLen_S) & AR_BufLen);
-   ACCESS_ONCE(ads->ctl3) = val;
+   WRITE_ONCE(ads->ctl3, val);
checksum += (val = (i->buf_len[1] << AR_BufLen_S) & AR_BufLen);
-   ACCESS_ONCE(ads->ctl5) = val;
+   WRITE_ONCE(ads->ctl5, val);
checksum += (val = (i->buf_len[2] << AR_BufLen_S) & AR_BufLen);
-   ACCESS_ONCE(ads->ctl7) = val;
+   WRITE_ONCE(ads->ctl7, val);
checksum += (val = (i->buf_len[3] << AR_BufLen_S) & AR_BufLen);
-   ACCESS_ONCE(ads->ctl9) = val;
+   WRITE_ONCE(ads->ctl9, val);
 
checksum = (u16) (((checksum & 0xffff) + (checksum >> 16)) & 0xffff);
-   ACCESS_ONCE(ads->ctl10) = checksum;
+   WRITE_ONCE(ads->ctl10, checksum);
 
if (i->is_first || i->is_last) {
-   ACCESS_ONCE(ads->ctl13) = set11nTries(i->rates, 0)
+   WRITE_ONCE(ads->ctl13, set11nTries(i->rates, 0)
| set11nTries(i->rates, 1)
| set11nTries(i->rates, 2)
| set11nTries(i->rates, 3)
| (i->dur_update ? AR_DurUpdateEna : 0)
-   | SM(0, AR_BurstDur);
+   | SM(0, AR_BurstDur));
 
-   ACCESS_ONCE(ads->ctl14) = set11nRate(i->rates, 0)
+   WRITE_ONCE(ads->ctl14, set11nRate(i->rates, 0)
| set11nRate(i->rates, 1)
| set11nRate(i->rates, 2)
-   | set11nRate(i->rates, 3);
+   | set11nRate(i->rates, 3));
} else {
-   ACCESS_ONCE(ads->ctl13) = 0;
-   ACCESS_ONCE(ads->ctl14) = 0;
+   WRITE_ONCE(ads->ctl13, 0);
+   WRITE_ONCE(ads->ctl14, 0);
}
 
ads->ctl20 = 0;
@@ -89,17 +89,17 @@ ar9003_set_txdesc(struct ath_hw *ah, void *ds, struct 
ath_tx_info *i)
 
ctl17 = SM(i->keytype, AR_EncrType);
if (!i->is_first) {
-   ACCESS_ONCE(ads->ctl11) = 0;
-   ACCESS_ONCE(ads->ctl12) = i->is_last ? 0 : AR_TxMore;
-   ACCESS_ONCE(ads->ctl15) = 0;
-   ACCESS_ONCE(ads->ctl16) = 0;
-   ACCESS_ONCE(ads->ctl17) = ctl17;
-   ACCESS_ONCE(ads->ctl18) = 0;
-   ACCESS_ONCE(ads-

[PATCH 3/3] tools/virtio: use {READ,WRITE}_ONCE() in uaccess.h

2016-11-24 Thread Mark Rutland
As a step towards killing off ACCESS_ONCE, use {READ,WRITE}_ONCE() for the
virtio tools uaccess primitives, pulling these in from <linux/compiler.h>.

With this done, we can kill off the now-unused ACCESS_ONCE() definition.

Signed-off-by: Mark Rutland 
Cc: Jason Wang 
Cc: Michael S. Tsirkin 
Cc: linux-ker...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
---
 tools/virtio/linux/uaccess.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/tools/virtio/linux/uaccess.h b/tools/virtio/linux/uaccess.h
index 0a578fe..fa05d01 100644
--- a/tools/virtio/linux/uaccess.h
+++ b/tools/virtio/linux/uaccess.h
@@ -1,8 +1,9 @@
 #ifndef UACCESS_H
 #define UACCESS_H
-extern void *__user_addr_min, *__user_addr_max;
 
-#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
+#include <linux/compiler.h>
+
+extern void *__user_addr_min, *__user_addr_max;
 
 static inline void __chk_user_ptr(const volatile void *p, size_t size)
 {
@@ -13,7 +14,7 @@ static inline void __chk_user_ptr(const volatile void *p, 
size_t size)
 ({ \
typeof(ptr) __pu_ptr = (ptr);   \
__chk_user_ptr(__pu_ptr, sizeof(*__pu_ptr));\
-   ACCESS_ONCE(*(__pu_ptr)) = x;   \
+   WRITE_ONCE(*(__pu_ptr), x); \
0;  \
 })
 
@@ -21,7 +22,7 @@ static inline void __chk_user_ptr(const volatile void *p, 
size_t size)
 ({ \
typeof(ptr) __pu_ptr = (ptr);   \
__chk_user_ptr(__pu_ptr, sizeof(*__pu_ptr));\
-   x = ACCESS_ONCE(*(__pu_ptr));   \
+   x = READ_ONCE(*(__pu_ptr)); \
0;  \
 })
 
-- 
2.7.4



[PATCH 1/3] tools/virtio: fix READ_ONCE()

2016-11-24 Thread Mark Rutland
The virtio tools implementation of READ_ONCE() has a single parameter called
'var', but erroneously refers to 'val' for its cast, and thus won't work unless
there's a variable of the correct type that happens to be called 'val'.

Fix this with s/val/var/, making READ_ONCE() work as expected regardless.

Fixes: a7c490333df3cff5 ("tools/virtio: use virt_xxx barriers")
Signed-off-by: Mark Rutland 
Cc: Jason Wang 
Cc: Michael S. Tsirkin 
Cc: linux-ker...@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
---
 tools/virtio/linux/compiler.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/virtio/linux/compiler.h b/tools/virtio/linux/compiler.h
index 845960e..c9ccfd4 100644
--- a/tools/virtio/linux/compiler.h
+++ b/tools/virtio/linux/compiler.h
@@ -4,6 +4,6 @@
 #define WRITE_ONCE(var, val) \
(*((volatile typeof(val) *)(&(var))) = (val))
 
-#define READ_ONCE(var) (*((volatile typeof(val) *)(&(var))))
+#define READ_ONCE(var) (*((volatile typeof(var) *)(&(var))))
 
 #endif
-- 
2.7.4



[PATCH 2/3] vringh: kill off ACCESS_ONCE()

2016-11-24 Thread Mark Rutland
Despite living under drivers/, vringh.c is also used as part of the userspace
virtio tools. Before we can kill off the ACCESS_ONCE() definition in the tools,
we must convert vringh.c to use {READ,WRITE}_ONCE().

This patch does so, along with the required include of <linux/compiler.h> for
the relevant definitions. The userspace tools provide their own definitions in
their own <linux/compiler.h>.

Signed-off-by: Mark Rutland 
Cc: Jason Wang 
Cc: Michael S. Tsirkin 
Cc: k...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
---
 drivers/vhost/vringh.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vringh.c b/drivers/vhost/vringh.c
index 3bb02c6..bb8971f 100644
--- a/drivers/vhost/vringh.c
+++ b/drivers/vhost/vringh.c
@@ -3,6 +3,7 @@
  *
  * Since these may be in userspace, we use (inline) accessors.
  */
+#include <linux/compiler.h>
 #include 
 #include 
 #include 
@@ -820,13 +821,13 @@ EXPORT_SYMBOL(vringh_need_notify_user);
 static inline int getu16_kern(const struct vringh *vrh,
  u16 *val, const __virtio16 *p)
 {
-   *val = vringh16_to_cpu(vrh, ACCESS_ONCE(*p));
+   *val = vringh16_to_cpu(vrh, READ_ONCE(*p));
return 0;
 }
 
 static inline int putu16_kern(const struct vringh *vrh, __virtio16 *p, u16 val)
 {
-   ACCESS_ONCE(*p) = cpu_to_vringh16(vrh, val);
+   WRITE_ONCE(*p, cpu_to_vringh16(vrh, val));
return 0;
 }
 
-- 
2.7.4



[PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-24 Thread Mark Rutland
For several reasons, it would be beneficial to kill off ACCESS_ONCE()
tree-wide, in favour of {READ,WRITE}_ONCE(). These work with aggregate types,
more obviously document their intended behaviour, and are necessary for tools
like KTSAN to work correctly (as otherwise reads and writes cannot be
instrumented separately).

While it's possible to script the bulk of this tree-wide conversion, some cases,
such as the virtio code, require some manual intervention. This series moves
the virtio and vringh code over to {READ,WRITE}_ONCE(), in the process fixing a
bug in the virtio headers.

Thanks,
Mark.

Mark Rutland (3):
  tools/virtio: fix READ_ONCE()
  vringh: kill off ACCESS_ONCE()
  tools/virtio: use {READ,WRITE}_ONCE() in uaccess.h

 drivers/vhost/vringh.c| 5 +++--
 tools/virtio/linux/compiler.h | 2 +-
 tools/virtio/linux/uaccess.h  | 9 +
 3 files changed, 9 insertions(+), 7 deletions(-)

-- 
2.7.4



Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Thu, Nov 24, 2016 at 10:36:58PM +0200, Michael S. Tsirkin wrote:
> On Thu, Nov 24, 2016 at 10:25:11AM +0000, Mark Rutland wrote:
> > For several reasons, it would be beneficial to kill off ACCESS_ONCE()
> > tree-wide, in favour of {READ,WRITE}_ONCE(). These work with aggregate 
> > types,
> > more obviously document their intended behaviour, and are necessary for 
> > tools
> > like KTSAN to work correctly (as otherwise reads and writes cannot be
> > instrumented separately).
> > 
> > While it's possible to script the bulk of this tree-wide conversion, some 
> > cases
> > such as the virtio code, require some manual intervention. This series moves
> > the virtio and vringh code over to {READ,WRITE}_ONCE(), in the process 
> > fixing a
> > bug in the virtio headers.
> > 
> > Thanks,
> > Mark.
> 
> I don't have a problem with this specific patchset.

Good to hear. :)

Does that mean you're happy to queue these patches? Or would you prefer
a new posting at some later point, with ack/review tags accumulated?

> Though I really question the whole _ONCE APIs esp with
> aggregate types - these seem to generate a memcpy and
> an 8-byte read/writes sometimes, and I'm pretty sure this simply
> can't be read/written at once on all architectures.

Yes, in cases where the access is larger than the machine can perform in
a single access, this will result in a memcpy.

My understanding is that this has always been the case with
ACCESS_ONCE(), where multiple accesses were silently/implicitly
generated by the compiler.

We could add some compile-time warnings for those cases. I'm not sure if
there's a reason we avoided doing that so far; perhaps Christian has
some idea.
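
For example, the kind of check I have in mind would be something like the
below (hypothetical macro name; relying on compiletime_assert() from
<linux/compiler.h>, and only a sketch):

/*
 * Complain at build time if a *_ONCE() access is wider than the machine
 * word, and so would silently degrade to a (non-atomic) memcpy.
 */
#define ASSERT_ONCE_IS_ATOMIC(x)                                       \
        compiletime_assert(sizeof(x) <= sizeof(long),                  \
                           "*_ONCE() access is not single-copy atomic")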

Thanks,
Mark.


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 12:33:48PM +0100, Christian Borntraeger wrote:
> On 11/25/2016 12:22 PM, Mark Rutland wrote:
> > On Thu, Nov 24, 2016 at 10:36:58PM +0200, Michael S. Tsirkin wrote:
> >> Though I really question the whole _ONCE APIs esp with
> >> aggregate types - these seem to generate a memcpy and
> >> an 8-byte read/writes sometimes, and I'm pretty sure this simply
> >> can't be read/written at once on all architectures.
> > 
> > Yes, in cases where the access is larger than the machine can perform in
> > a single access, this will result in a memcpy.
> > 
> > My understanding is that this has always been the case with
> > ACCESS_ONCE(), where multiple accesses were silently/implicitly
> > generated by the compiler.
> > 
> > We could add some compile-time warnings for those cases. I'm not sure if
> > there's a reason we avoided doing that so far; perhaps Christian has a
> > some idea.
> 
> My first version had this warning, but it was removed later on as requested
> by Linus
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02670.html
> ---snip---
> 
> Get rid of the f*cking size checks etc on READ_ONCE() and friends.
> 
> They are about - wait for it - "reading a value once".
> 
> Note how it doesn't say ANYTHING about "atomic" or anything like that.
> It's about reading *ONCE*.
> 
> ---snip---

I see. That's unfortunate, given that practically every use I'm aware of
assumes some atomicity (e.g. freedom from tearing when loading/storing
pointers or values up to the native width of the machine). I believe
that's the case here, for virtio, for example.

Perhaps we can add new accessors that are supposed to guarantee that,
into which we can drop appropriate warnings.

Naming will be problematic; calling them ATOMIC_* makes them sound like
they work on atomic_t. That and I have no idea how to ensure correct
usage tree-wide; I'm not sure if/how Coccinelle can help.

Peter, thoughts?

Thanks,
Mark.


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 01:40:44PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 25, 2016 at 12:23:56PM +0000, Mark Rutland wrote:
> > Naming will be problematic; calling them ATOMIC_* makes tham sound like
> > they work on atomic_t. That and I have no idea how to ensure correct
> > usage tree-wide; I'm not sure if/how Coccinelle can help.
> > 
> > Peter, thoughts?
> 
> Something like so perhaps?

> /*
>  * Provide accessors for Single-Copy atomicy.
>  *
>  * That is, ensure that machine word sized loads/stores to naturally
>  * aligned variables are single instructions.

Minor nit: this sounds like we *only* support the machine word size,
whereas (excluding alpha IIRC) we can generally access power-of-two
sizes from byte up to that.

So perhaps:

That is, ensure that loads/stores are made with single
instructions, where the machine can perform a tear-free access
of that size.

>  * By reason of not being able to use C11 atomic crud, use our beloved
>  * volatile qualifier. Since volatile tells the compiler the value can
>  * be changed behind its back, it must use Single-Copy atomic loads and
>  * stores to access them, otherwise it runs the risk of load/store
>  * tearing.
>  */
> 
> #define SINGLE_LOAD(x)\
> {(\
>   compiletime_assert_atomic_type(typeof(x));  \
>   WARN_SINGLE_COPY_ALIGNMENT(&(x));   \
>   READ_ONCE(x);   \
> })
> 
> #define SINGLE_STORE(x, v)\
> ({\
>   compiletime_assert_atomic_type(typeof(x));  \
>   WARN_SINGLE_COPY_ALIGNMENT(&(x));   \
>   WRITE_ONCE(x, v);   \
> })

Modulo your type comment, and mine above, this looks good to me.

Thanks,
Mark.


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> 
> READ/WRITE_ONCE imply atomicity. Even if their names don't spell it (a
> function name doesn't have to spell all of its guarantees). Most of
> the uses of READ/WRITE_ONCE will be broken if they are not atomic.

In practice, this is certainly the assumption made by many/most users of
the *_ONCE() accessors.

Looking again, Linus does seem to agree that word-sized accesses should
result in single instructions (and be single-copy atomic) [1], so in
contrast to [2], that's clearly *part* of the point of the *_ONCE()
accessors...

> "Read once but not necessary atomically" is a very subtle primitive
> which is very easy to misuse.

I agree. Unfortunately, Linus does not appear to [2].

> What are use cases for such primitive that won't be OK with "read once
> _and_ atomically"?

I have none to hand.

Thanks,
Mark.

[1] http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02674.html
[2] http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02670.html


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 05:17:09PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 25, 2016 at 04:10:04PM +0000, Mark Rutland wrote:
> > On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> 
> > > What are use cases for such primitive that won't be OK with "read once
> > > _and_ atomically"?
> > 
> > I have none to hand.
> 
> Whatever triggers the __builtin_memcpy() paths, and even the size==8
> paths on 32bit.

Lockref, per:

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02294.html

In that specific case, a torn value just means we'll retry until we get
a non-torn value, due to the cmpxchg. For that case, all we need is the
value to be reloaded per invocation of READ_ONCE().
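
For illustration, the general shape of that pattern is roughly the below
(a minimal sketch, not the real lockref code, and glossing over which
cmpxchg flavour a given architecture provides):

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/types.h>

static void inc_counter(u64 *val)
{
        u64 old, new;

        do {
                /* On 32-bit this load may be torn; that's tolerable here... */
                old = READ_ONCE(*val);
                new = old + 1;
                /* ...since cmpxchg64() fails on a stale/torn 'old' and we retry. */
        } while (cmpxchg64(val, old, new) != old);
}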

This guy seems to have the full story:

http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02389.html
http://lkml.iu.edu/hypermail/linux/kernel/1503.3/02558.html

Thanks,
Mark.


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 05:49:45PM +0100, Christian Borntraeger wrote:
> On 11/25/2016 05:17 PM, Peter Zijlstra wrote:
> > On Fri, Nov 25, 2016 at 04:10:04PM +0000, Mark Rutland wrote:
> >> On Fri, Nov 25, 2016 at 04:21:39PM +0100, Dmitry Vyukov wrote:
> > 
> >>> What are use cases for such primitive that won't be OK with "read once
> >>> _and_ atomically"?
> >>
> >> I have none to hand.
> > 
> > Whatever triggers the __builtin_memcpy() paths, and even the size==8
> > paths on 32bit.
> > 
> > You could put a WARN in there to easily find them.
> 
> There were several cases that I found during writing the *ONCE stuff.
> For example there are some 32bit ppc variants with 64bit PTEs. Some for
> others (I think sparc).

We have similar on 32-bit ARM w/ LPAE. LPAE implies that a naturally
aligned 64-bit access is single-copy atomic, which is what makes that
ok.

> And the mm/ code is perfectly fine with these PTE accesses being done
> NOT atomic.

That strikes me as surprising. Is there some mutual exclusion that
prevents writes from occurring wherever a READ_ONCE() happens to a PTE?

Otherwise, how is tearing not a problem? Does it have some pattern like
the lockref cmpxchg?

Thanks,
Mark.


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 06:28:53PM +0100, Dmitry Vyukov wrote:
> On Fri, Nov 25, 2016 at 5:17 PM, Peter Zijlstra  wrote:
> >> > What are use cases for such primitive that won't be OK with "read once
> >> > _and_ atomically"?
> >>
> >> I have none to hand.
> >
> > Whatever triggers the __builtin_memcpy() paths, and even the size==8
> > paths on 32bit.
> >
> > You could put a WARN in there to easily find them.
> >
> > The advantage of introducing the SINGLE_{LOAD,STORE}() helpers is that
> > they compiletime validate this the size is 'right' and can runtime check
> > alignment constraints.
> >
> > IE, they are strictly stronger than {READ,WRITE}_ONCE().
> 
> Uh, so, READ/WRITE_ONCE are non-atomic now. I missed that.

Yes, but *only* for types larger than word size. That has *always* been
the case.

It's still assumed that *_ONCE are single-copy-atomic for word size (or
smaller). Some architectures may also provide that guarantee for
accesses larger than word size (e.g. 32-bit ARM w/ LPAE).

... It's just that as things stand we can't put checks in *_ONCE() for
the access size, since they're *also* used for larger accesses that
don't need atomicity.

> If READ/WRITE_ONCE are non-atomic, half of kernel is broken. All these
> loads of flags, ringbuffer positions, pointers, etc are broken.

Most of these will be fine, as above.

> What about restoring READ/WRITE_ONCE as atomic, and introducing
> separate primitives for _non_ atomic loads/stores?

Having a separate *_ONCE_TEARABLE() would certainly limit the number of
things we have to fix up, and would also make it clear that atomicity is
not expected.

... but we might have to go with SINGLE_*() if we can't convince Linus.

Thanks,
Mark.


Re: [PATCH 0/3] virtio/vringh: kill off ACCESS_ONCE()

2016-11-25 Thread Mark Rutland
On Fri, Nov 25, 2016 at 09:52:50AM -0800, Linus Torvalds wrote:
> READ/WRITE_ONCE() are atomic *WHEN*THAT*IS*POSSIBLE*.

> But sometimes it's not going to be atomic.

That's the problem.

Common code may rely on something being atomic when that's only true on
a subset of platforms. On others, it's silently "fixed" into something
that isn't atomic, and we get no diagnostic. The bug lurks beneath the
surface.

Thanks,
Mark.


v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Mark Rutland
Hi all,

I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
skb_copy_and_csum_bits().

I've uploaded a copy of the splat, my config, and (full) Syzkaller log
to my kernel.org web space [1]. I haven't had the opportunity to
reproduce this yet. 

This isn't a pure v4.14-rc2, as I have a not-yet-upstream fix [2]
applied to avoid a userfaultfd bug. However, per the Syzkaller log, the
userfaultfd syscall wasn't invoked, so I don't believe that should
matter.

Thanks,
Mark.

[1] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skbuff-bug/
[2] https://lkml.kernel.org/r/20170920180413.26713-1-aarca...@redhat.com

[ cut here ]
kernel BUG at net/core/skbuff.c:2626!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.0-rc2-1-gd7ad33d #115
Hardware name: linux,dummy-virt (DT)
task: 80003a901a80 task.stack: 80003a908000
PC is at skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626
LR is at skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626
pc : [] lr : [] pstate: 0145
sp : 80003efd7b50
x29: 80003efd7b50 x28: 003c 
x27: 01e8 x26: 80003a901a90 
x25: 003c x24: dfff2000 
x23: 800035723a80 x22: 003c 
x21:  x20:  
x19: 3a6d x18: 2da58140 
x17:  x16: 0001 
x15: 2e1485a0 x14: 282f8980 
x13: 29fc73d0 x12: 29fc707c 
x11: 12c2a3fc x10: 12c2a3fc 
x9 : dfff2000 x8 : 07030301a8ff1127 
x7 : edff11270a080204 x6 : 800016151fe8 
x5 : 12c2a3fd x4 : 000c 
x3 : 0030 x2 : 16ae47a1 
x1 : 01f6cee936b5bc00 x0 :  
Process swapper/3 (pid: 0, stack limit = 0x80003a908000)
Call trace:
Exception stack(0x80003efd7a10 to 0x80003efd7b50)
7a00:    01f6cee936b5bc00
7a20: 16ae47a1 0030 000c 12c2a3fd
7a40: 800016151fe8 edff11270a080204 07030301a8ff1127 dfff2000
7a60: 12c2a3fc 12c2a3fc 29fc707c 29fc73d0
7a80: 282f8980 2e1485a0 0001 
7aa0: 2da58140 3a6d  
7ac0: 003c 800035723a80 dfff2000 003c
7ae0: 80003a901a90 01e8 003c 80003efd7b50
7b00: 29e03214 80003efd7b50 29e03214 0145
7b20: 3a6d  0001 003c
7b40: 80003efd7b50 29e03214
[] skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626
[] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
[] __ip_append_data+0x10e4/0x20a8 net/ipv4/ip_output.c:1018
[] ip_append_data.part.3+0xe8/0x1a0 net/ipv4/ip_output.c:1170
[] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
[] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
[] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
[] ip_fragment.constprop.4+0x208/0x340 
net/ipv4/ip_output.c:552
[] ip_finish_output+0x3a8/0xab0 net/ipv4/ip_output.c:315
[] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
[] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
[] dst_output include/net/dst.h:458 [inline]
[] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
[] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
[] tcp_transmit_skb+0x107c/0x3338 net/ipv4/tcp_output.c:1123
[] __tcp_retransmit_skb+0x614/0x1d18 
net/ipv4/tcp_output.c:2847
[] tcp_send_loss_probe+0x478/0x7d0 net/ipv4/tcp_output.c:2457
[] tcp_write_timer_handler+0x50c/0x7e8 
net/ipv4/tcp_timer.c:557
[] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
[] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
[] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
[] __run_timers kernel/time/timer.c:1620 [inline]
[] run_timer_softirq+0x214/0x5f0 kernel/time/timer.c:1646
[] __do_softirq+0x350/0xc0c kernel/softirq.c:284
[] do_softirq_own_stack include/linux/interrupt.h:498 [inline]
[] invoke_softirq kernel/softirq.c:371 [inline]
[] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
[] __handle_domain_irq+0xdc/0x230 kernel/irq/irqdesc.c:647
[] handle_domain_irq include/linux/irqdesc.h:175 [inline]
[] gic_handle_irq+0x6c/0xe0 drivers/irqchip/irq-gic.c:367
Exception stack(0x80003a90bb70 to 0x80003a90bcb0)
bb60:   80003a90234c 0007
bb80:  17520469 1fffe400017ad00c dfff2000
bba0: dfff2000  80003a902350 17520469
bbc0: 80003a902348 80003a902368 1752046c 1752046e
bbe0: 1752046d 2e1485a0  0001
bc00: 2da58140 80003efd9800 80003efd9800 2ae6
bc20: 80003a971a80 175217

v4.14-rc2/arm64 misaligned atomic in ip_expire() / skb_clone()

2017-10-02 Thread Mark Rutland
Hi all,

I'm intermittently hitting splats like below in skb_clone() while
fuzzing v4.14-rc2 on arm64 with Syzkaller. It looks like the
atomic_inc() at the end of __skb_clone() is being passed a misaligned
pointer.

I've uploaded a number of splats and their associated (full) Syzkaller
logs, along with my kernel config to my kernel.org webspace [1]. It
might take a while for that to appear.

This isn't a pure v4.14-rc2, as I have a not-yet-upstream fix [2]
applied to avoid a userfaultfd bug. The userfaultfd syscall appears in
all of the Syzkaller logs, so there is the chance that this is related,
but as I've not seen any other issues I suspect that's unlikely.

Thanks,
Mark.

[1] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skb_clone-misaligned-atomic
[2] https://lkml.kernel.org/r/20170920180413.26713-1-aarca...@redhat.com

Unable to handle kernel paging request at virtual address 80002fd714a2
Mem abort info:
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0033
  CM = 0, WnR = 0
swapper pgtable: 4k pages, 48-bit VAs, pgd = 2eeb2000
[80002fd714a2] *pgd=7eff7003, *pud=7eff6003, 
*pmd=00f86fc00711
Internal error: Oops: 9621 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.0-rc2-1-gd7ad33d #115
Hardware name: linux,dummy-virt (DT)
task: 80003a901a80 task.stack: 80003a908000
PC is at __ll_sc_atomic_add+0x4/0x18 arch/arm64/include/asm/atomic_ll_sc.h:113
LR is at atomic_add arch/arm64/include/asm/atomic_lse.h:45 [inline]
LR is at __skb_clone+0x4a8/0x6c0 net/core/skbuff.c:873
pc : [] lr : [] pstate: 1145
sp : 80003efd86e0
x29: 80003efd86e0 x28: 60003418b000 
x27: 2ae55360 x26: 8000182c1608 
x25: 80002fd7137e x24: 8000182c1610 
x23: 2ae6 x22: 80001577871c 
x21: 17dfb0e8 x20: 8000182c1540 
x19: 800015778640 x18: 2da58140 
x17:  x16: 0002 
x15: 2e1485a0 x14: 282f912c 
x13: 282f8dcc x12: 282f8980 
x11: 12aef0df x10: 12aef0df 
x9 : dfff2000 x8 : 0082009000a40008 
x7 :  x6 : 800015778700 
x5 : 12aef0e0 x4 :  
x3 : 12aef0e3 x2 : 80002fd7147e 
x1 : 80002fd714a2 x0 : 0001 
Process swapper/3 (pid: 0, stack limit = 0x80003a908000)
Call trace:
Exception stack(0x80003efd85a0 to 0x80003efd86e0)
85a0: 0001 80002fd714a2 80002fd7147e 12aef0e3
85c0:  12aef0e0 800015778700 
85e0: 0082009000a40008 dfff2000 12aef0df 12aef0df
8600: 282f8980 282f8dcc 282f912c 2e1485a0
8620: 0002  2da58140 800015778640
8640: 8000182c1540 17dfb0e8 80001577871c 2ae6
8660: 8000182c1610 80002fd7137e 8000182c1608 2ae55360
8680: 60003418b000 80003efd86e0 29dffb58 80003efd86e0
86a0: 2a30ce44 1145 800015778640 8000182c1540
86c0: 0001 8000182c15ce 80003efd86e0 2a30ce44
[] __ll_sc_atomic_add+0x4/0x18 
arch/arm64/include/asm/atomic_ll_sc.h:113
[] skb_clone+0x1c4/0x3b0 net/core/skbuff.c:1286
[] ip_expire+0x4e8/0x7c0 net/ipv4/ip_fragment.c:239
[] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
[] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
[] __run_timers kernel/time/timer.c:1620 [inline]
[] run_timer_softirq+0x214/0x5f0 kernel/time/timer.c:1646
[] __do_softirq+0x350/0xc0c kernel/softirq.c:284
[] do_softirq_own_stack include/linux/interrupt.h:498 [inline]
[] invoke_softirq kernel/softirq.c:371 [inline]
[] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
[] __handle_domain_irq+0xdc/0x230 kernel/irq/irqdesc.c:647
[] handle_domain_irq include/linux/irqdesc.h:175 [inline]
[] gic_handle_irq+0x6c/0xe0 drivers/irqchip/irq-gic.c:367
Exception stack(0x80003a90bd70 to 0x80003a90beb0)
bd60:   80003a90234c 0007
bd80:  17520469 1fffe400017ad00c e540
bda0:   80003a902350 17520469
bdc0: 80003a902348 80003a902368 1752046c 1752046e
bde0: 1752046d 2e1485a0  00029d44
be00: 2da58140 80003a901a80 80003a901a80 dfff2000
be20: 2ae60e98 0400015cc1d3  2ae60df8
be40: 2ae60df8   80003a90beb0
be60: 28089b50 80003a90beb0 28089b54 1145
be80: 80003a901a80 80003a901a80  01f6cee936b5bc00
bea0: 80003a90beb0 28089b54
[] el1_irq+0xb4/0x12c arch/arm64/kernel/entry.S:569
[] arch_local_irq_enable arch/arm64/include/asm/irqfla

Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Mark Rutland
Hi Eric,

On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
> On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland  wrote:
> > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
> > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
> > skb_copy_and_csum_bits().

> > kernel BUG at net/core/skbuff.c:2626!

> > [] skb_copy_and_csum_bits+0x8dc/0xae0 
> > net/core/skbuff.c:2626
> > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
> > [] __ip_append_data+0x10e4/0x20a8 
> > net/ipv4/ip_output.c:1018
> > [] ip_append_data.part.3+0xe8/0x1a0 
> > net/ipv4/ip_output.c:1170
> > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
> > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
> > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
> > [] ip_fragment.constprop.4+0x208/0x340 
> > net/ipv4/ip_output.c:552
> > [] ip_finish_output+0x3a8/0xab0 net/ipv4/ip_output.c:315
> > [] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
> > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
> > [] dst_output include/net/dst.h:458 [inline]
> > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
> > [] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
> > [] tcp_transmit_skb+0x107c/0x3338 
> > net/ipv4/tcp_output.c:1123
> > [] __tcp_retransmit_skb+0x614/0x1d18 
> > net/ipv4/tcp_output.c:2847
> > [] tcp_send_loss_probe+0x478/0x7d0 
> > net/ipv4/tcp_output.c:2457
> > [] tcp_write_timer_handler+0x50c/0x7e8 
> > net/ipv4/tcp_timer.c:557
> > [] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
> > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
> > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
> > [] __run_timers kernel/time/timer.c:1620 [inline]
> > [] run_timer_softirq+0x214/0x5f0 kernel/time/timer.c:1646
> > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
> > [] do_softirq_own_stack include/linux/interrupt.h:498 
> > [inline]
> > [] invoke_softirq kernel/softirq.c:371 [inline]
> > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
> > [] __handle_domain_irq+0xdc/0x230 kernel/irq/irqdesc.c:647
> > [] handle_domain_irq include/linux/irqdesc.h:175 [inline]
> > [] gic_handle_irq+0x6c/0xe0 drivers/irqchip/irq-gic.c:367

> This is most likely a bug caused by syzkaller setting a ridiculous MTU
> on loopback device, below minimum size of ipv4 MTU.

> I tried to track it in August [1], but it seems hard to find all the
> issues with this.
> 
> commit c780a049f9bf442314335372c9abc4548bfe3e44
> Author: Eric Dumazet 
> Date:   Wed Aug 16 11:09:12 2017 -0700
> 
> ipv4: better IP_MAX_MTU enforcement
> 
> While working on yet another syzkaller report, I found
> that our IP_MAX_MTU enforcements were not properly done.
> 
> gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
> final result can be bigger than IP_MAX_MTU :/
> 
> This is a problem because device mtu can be changed on other cpus or
> threads.
> 
> While this patch does not fix the issue I am working on, it is
> probably worth addressing it.

Just to check I've understood correctly, are you suggesting that the
IPv4 code should also check dev->mtu against an IP_MIN_MTU (which
doesn't seem to exist today)?

Otherwise, I do spot another potential issue. The writer side (e.g. most
net_device::ndo_change_mtu implementations and the __dev_set_mtu()
fallback) doesn't use WRITE_ONCE().

IIUC, that means that the write could be torn across multiple accesses,
and we could see dev->mtu < dev->min_mtu on the read side, even if we
use READ_ONCE(), and sanity check the mtu value before calling
__dev_set_mtu().
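
To illustrate the pairing I mean (hypothetical helpers, not the existing
__dev_set_mtu() code):

#include <linux/compiler.h>
#include <linux/netdevice.h>

/* Writer side: a single store the compiler may not split or duplicate. */
static void dev_mtu_store(struct net_device *dev, unsigned int new_mtu)
{
        WRITE_ONCE(dev->mtu, new_mtu);
}

/* Reader side: a single load, so any sanity check sees one consistent value. */
static unsigned int dev_mtu_load(struct net_device *dev)
{
        return READ_ONCE(dev->mtu);
}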

Thanks,
Mark.


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Mark Rutland
On Mon, Oct 02, 2017 at 07:42:17AM -0700, Eric Dumazet wrote:
> On Mon, Oct 2, 2017 at 7:21 AM, Mark Rutland  wrote:
> > Just to check I've understood correctly, are you suggesting that the
> > IPv4 code should also check the dev->mtu against a IP_MIN_MTU (which
> > doesn't seem to exist today)?
> 
> We have plenty of places this is checked.
> 
> For example, trying to set MTU < 68 usually removes IPv4 addresses and routes.
> 
> Problem is : these checks are not fool proof yet.
> 
> ( Only the admin was supposed to play these games )

Sorry, I meant that there was no constant called IP_MIN_MTU, and I was
just looking at the sites fixed up by c780a049f9bf4423.

I appreciate that, given this requires admin privileges, it's not exactly
high priority. I didn't mean for the above to sound like some kind of
accusation!

> > Otherwise, I do spot another potential issue. The writer side (e.g. most
> > net_device::ndo_change_mtu implementations and the __dev_set_mtu()
> > fallback) doesn't use WRITE_ONCE().
> 
> It does not matter how many strange values can be observed by the reader :
> We must be fool proof anyway from reader point of view, so the
> WRITE_ONCE() is not strictly needed.

Ok. If we expect to always check somewhere on the reader side I guess
that makes sense.

Thanks,
Mark.


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Mark Rutland
On Mon, Oct 02, 2017 at 07:48:28AM -0700, Eric Dumazet wrote:
> Please try the following fool proof patch.
> 
> This is what I had in my local tree back in August but could not
> conclude on the syzkaller bug I was working on.

Thanks, I'll give this a go shortly.

I'm currently minimizing the Syzkaller log so that I can trigger the
issue more quickly (and have some confidence in a Tested-by)!

Thanks,
Mark.

> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 
> 681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
>  100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int 
> code, __be32 info)
>   room = 576;
>   room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
>   room -= sizeof(struct icmphdr);
> -
> + if (room < 0)
> + goto ende;
>   icmp_param.data_len = skb_in->len - icmp_param.offset;
>   if (icmp_param.data_len > room)
>   icmp_param.data_len = room;
> 
> 
> 


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Mark Rutland
On Mon, Oct 02, 2017 at 07:48:28AM -0700, Eric Dumazet wrote:
> Please try the following fool proof patch.
>
> This is what I had in my local tree back in August but could not
> conclude on the syzkaller bug I was working on.
> 
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index 
> 681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
>  100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int 
> code, __be32 info)
>   room = 576;
>   room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
>   room -= sizeof(struct icmphdr);
> -
> + if (room < 0)
> + goto ende;
>   icmp_param.data_len = skb_in->len - icmp_param.offset;
>   if (icmp_param.data_len > room)
>   icmp_param.data_len = room;
> 

Unfortunately, with this applied I still see the issue.

Syzkaller came up with a minimized reproducer [1], which can trigger the
issue near instantly under syz-execprog. If there's anything that would
help to narrow this down, I'm more than happy to give it a go.

Thanks,
Mark.

[1] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skb_clone-misaligned-atomic/syzkaller.repro


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Mark Rutland
On Mon, Oct 02, 2017 at 10:27:15AM -0700, Eric Dumazet wrote:
> On Mon, Oct 2, 2017 at 10:21 AM, Mark Rutland  wrote:
> > On Mon, Oct 02, 2017 at 07:48:28AM -0700, Eric Dumazet wrote:
> >> Please try the following fool proof patch.
> >>
> >> This is what I had in my local tree back in August but could not
> >> conclude on the syzkaller bug I was working on.
> >>
> >> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> >> index 
> >> 681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
> >>  100644
> >> --- a/net/ipv4/icmp.c
> >> +++ b/net/ipv4/icmp.c
> >> @@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int 
> >> code, __be32 info)
> >>   room = 576;
> >>   room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
> >>   room -= sizeof(struct icmphdr);
> >> -
> >> + if (room < 0)
> >> + goto ende;
> >>   icmp_param.data_len = skb_in->len - icmp_param.offset;
> >>   if (icmp_param.data_len > room)
> >>   icmp_param.data_len = room;
> >>
> >
> > Unfortuantely, with this applied I still see the issue.
> >
> > Syzkaller came up with a minimized reproducer [1], which can trigger the
> > issue near instantly under syz-execprog. If there's anything that would
> > help to narrow this down, I'm more than happy to give it a go.
> >
> > Thanks,
> > Mark.
> >
> > [1] 
> > https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skb_clone-misaligned-atomic/syzkaller.repro
> 
> Note that I was not trying to address the misaligned stuff.

Aargh, I put the reproducer in the wrong folder thanks to tab-completing
my kup command. :/

The reproducer linked above is for the kernel BUG at
net/core/skbuff.c:2626.

I've uploaded a copy into the relevant bug directory [1], but that'll
take a little while to sync out. I'll drop it from the misalignment bug
folder once that's visible to all.

Sorry about that!

Thanks,
Mark.

[1] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skbuff-bug/


EBPF-triggered WARNING at mm/percpu.c:1361 in v4-14-rc2

2017-09-28 Thread Mark Rutland
Hi,

While fuzzing v4.14-rc2 with Syzkaller, I found it was possible to trigger the
warning at mm/percpu.c:1361, on both arm64 and x86_64. This appears to require
increasing RLIMIT_MEMLOCK, so to the best of my knowledge this cannot be
triggered by an unprivileged user.

I've included example splats for both x86_64 and arm64, along with a C
reproducer, inline below.

It looks like dev_map_alloc() requests a percpu allocation of 32776 bytes, which
is larger than the maximum supported allocation size of 32768 bytes.

I wonder if it would make more sense to pr_warn() for sizes that are too
large, so that callers don't have to roll their own checks against
PCPU_MIN_UNIT_SIZE?

e.g. something like:


diff --git a/mm/percpu.c b/mm/percpu.c
index 59d44d6..f731c45 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1355,8 +1355,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
bits = size >> PCPU_MIN_ALLOC_SHIFT;
bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
 
-   if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
-!is_power_of_2(align))) {
+   if (unlikely(size > PCPU_MIN_UNIT_SIZE)) {
+   pr_warn("cannot allocate pcpu chunk of size %zu (max %zu)\n",
+   size, PCPU_MIN_UNIT_SIZE);
+   return NULL;
+   }
+
+   if (unlikely(!size || align > PAGE_SIZE || !is_power_of_2(align))) {
WARN(true, "illegal size (%zu) or align (%zu) for percpu 
allocation\n",
 size, align);
return NULL;


Thanks,
Mark.



Example splat(x86_64)

[  138.144185] illegal size (32776) or align (8) for percpu allocation
[  138.150452] [ cut here ]
[  138.155074] WARNING: CPU: 1 PID: 2223 at mm/percpu.c:1361 
pcpu_alloc+0x7c/0x5f0
[  138.162369] Modules linked in:
[  138.165423] CPU: 1 PID: 2223 Comm: repro Not tainted 4.14.0-rc2 #3
[  138.171593] Hardware name: LENOVO 7484A3G/LENOVO, BIOS 5CKT54AUS 09/07/2009
[  138.178543] task: 881b73069980 task.stack: a36f40f9
[  138.184455] RIP: 0010:pcpu_alloc+0x7c/0x5f0
[  138.188633] RSP: 0018:a36f40f93e00 EFLAGS: 00010286
[  138.193853] RAX: 0037 RBX:  RCX: 
[  138.200974] RDX: 881b7ec94a40 RSI: 881b7ec8cbb8 RDI: 881b7ec8cbb8
[  138.208097] RBP: a36f40f93e68 R08: 0001 R09: 02c4
[  138.215219] R10: 562a577047f0 R11: a10ad7cd R12: 881b73216cc0
[  138.222343] R13: 0014 R14: 7ffebeed0900 R15: ffea
[  138.229463] FS:  7fef84a15700() GS:881b7ec8() 
knlGS:
[  138.237538] CS:  0010 DS:  ES:  CR0: 80050033
[  138.243274] CR2: 7fef84497ba0 CR3: 0001b3235000 CR4: 000406e0
[  138.250397] Call Trace:
[  138.252844]  __alloc_percpu+0x10/0x20
[  138.256508]  dev_map_alloc+0x122/0x1b0
[  138.260255]  SyS_bpf+0x8f9/0x10b0
[  138.263570]  ? security_task_setrlimit+0x3e/0x60
[  138.268184]  ? do_prlimit+0xa6/0x1f0
[  138.271760]  entry_SYSCALL_64_fastpath+0x13/0x94
[  138.276372] RIP: 0033:0x7fef84546259
[  138.279946] RSP: 002b:7ffebeed09b8 EFLAGS: 0206 ORIG_RAX: 
0141
[  138.287503] RAX: ffda RBX:  RCX: 7fef84546259
[  138.294627] RDX: 0014 RSI: 7ffebeed09d0 RDI: 
[  138.301749] RBP: 562a57704780 R08: 7fef84810cb0 R09: 7ffebeed0ae8
[  138.308874] R10: 562a577047f0 R11: 0206 R12: 562a577045d0
[  138.315997] R13: 7ffebeed0ae0 R14:  R15: 
[  138.323122] Code: fe 00 10 00 00 77 10 48 8b 4d b8 48 89 c8 48 83 e8 01 48 
85 c1 74 1e 48 8b 55 b8 48 8b 75 c0 48 c7 c7 90 5e be a0 e8 40 88 f3 ff <0f> ff 
45 31 ed e9 5e 02 00 00 4c 8b 6d c0 49 89 cc 49 c1 ec 02 
[  138.341953] ---[ end trace b6e380365bfb8a36 ]---




Example splat (arm64)

[   17.287365] illegal size (32776) or align (8) for percpu allocation
[   17.295347] [ cut here ]
[   17.297191] WARNING: CPU: 1 PID: 1440 at mm/percpu.c:1361 
pcpu_alloc+0x120/0x9f0
[   17.307723] Kernel panic - not syncing: panic_on_warn set ...
[   17.307723] 
[   17.311755] CPU: 1 PID: 1440 Comm: repro Not tainted 
4.14.0-rc2-1-gd7ad33d #115
[   17.320675] Hardware name: linux,dummy-virt (DT)
[   17.323858] Call trace:
[   17.325246] [] dump_backtrace+0x0/0x558
[   17.332538] [] show_stack+0x20/0x30
[   17.340391] [] dump_stack+0x128/0x1a0
[   17.342081] [] panic+0x250/0x518
[   17.344096] [] __warn+0x2a4/0x310
[   17.345654] [] report_bug+0x1d4/0x290
[   17.348652] [] bug_handler.part.1+0x40/0xf8
[   17.356873] [] bug_handler+0x4c/0x88
[   17.360543] [] brk_handler+0x1c4/0x360
[   17.365076] [] do_debug_exception+0x118/0x398
[   17.368297] Exception stack(0x80001c82b930 to 0x80001c82ba70)
[   17.372981] b920:   0037 
0

Re: EBPF-triggered WARNING at mm/percpu.c:1361 in v4-14-rc2

2017-09-28 Thread Mark Rutland
On Thu, Sep 28, 2017 at 04:37:46PM +0200, Daniel Borkmann wrote:
> On 09/28/2017 01:27 PM, Mark Rutland wrote:
> >Hi,
> >
> >While fuzzing v4.14-rc2 with Syzkaller, I found it was possible to trigger 
> >the
> >warning at mm/percpu.c:1361, on both arm64 and x86_64. This appears to 
> >require
> >increasing RLIMIT_MEMLOCK, so to the best of my knowledge this cannot be
> >triggered by an unprivileged user.
> >
> >I've included example splats for both x86_64 and arm64, along with a C
> >reproducer, inline below.
> >
> >It looks like dev_map_alloc() requests a percpu alloction of 32776 bytes, 
> >which
> >is larger than the maximum supported allocation size of 32768 bytes.
> >
> >I wonder if it would make more sense to pr_warn() for sizes that are too
> >large, so that callers don't have to roll their own checks against
> >PCPU_MIN_UNIT_SIZE?
> 
> Perhaps the pr_warn() should be ratelimited; or could there be an
> option where we only return NULL, not triggering a warn at all (which
> would likely be what callers might do anyway when checking against
> PCPU_MIN_UNIT_SIZE and then bailing out)?

Those both make sense to me; checking __GFP_NOWARN should be easy
enough.

Just to check, do you think that dev_map_alloc() should explicitly test
the size against PCPU_MIN_UNIT_SIZE, prior to calling pcpu_alloc()?
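
e.g. something like the below in dev_map_alloc() -- a rough sketch only,
with the size expression approximated from memory rather than taken
verbatim from devmap.c:

        /* per-cpu bitmap with one bit per possible netdev (approximate) */
        size_t bitmap_sz = BITS_TO_LONGS(attr->max_entries) *
                           sizeof(unsigned long);

        /* bail out before pcpu_alloc() ever sees an over-sized request */
        if (!bitmap_sz || bitmap_sz > PCPU_MIN_UNIT_SIZE)
                return ERR_PTR(-EINVAL);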

I can spin both patches if so.

Thanks,
Mark.


Re: [PATCH v2 29/35] dt-bindings: nds32 CPU Bindings

2017-11-27 Thread Mark Rutland
Hi,

On Mon, Nov 27, 2017 at 08:28:16PM +0800, Greentime Hu wrote:
> From: Greentime Hu 
> 
> This patch adds nds32 CPU binding documents.
> 
> Signed-off-by: Vincent Chen 
> Signed-off-by: Rick Chen 
> Signed-off-by: Zong Li 
> Signed-off-by: Greentime Hu 
> ---
>  Documentation/devicetree/bindings/nds32/cpus.txt |   32 
> ++
>  1 file changed, 32 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/nds32/cpus.txt
> 
> diff --git a/Documentation/devicetree/bindings/nds32/cpus.txt 
> b/Documentation/devicetree/bindings/nds32/cpus.txt
> new file mode 100644
> index 000..c302c89
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/nds32/cpus.txt
> @@ -0,0 +1,32 @@
> +* Andestech Processor Binding
> +
> +This binding specifies what properties must be available in the device tree
> +representation of a Andestech Processor Core, which is the root node in the
> +tree.
> +
> +Required properties:
> +
> + - compatible:
> + Usage: required
> + Value type: 
> + Definition: should be one of:
> + "andestech,n13"
> + "andestech,n15"
> + "andestech,d15"
> + "andestech,n10"
> + "andestech,d10"
> + "andestech,nds32v3"

Are these specific parts, or architecture versions?

I guess "andestech,nds32v3" is an architecture version, and the rest are
specific parts?

> + - device_type
> + Usage: required
> + Value type: 
> + Definition: must be "cpu"
> + - clock-frequency: Contains the clock frequency for CPU, in Hz.
> +
> +* Examples
> +
> +/ {
> + cpu {
> + device_type = "cpu";
> + compatible = "andestech,n13", "andestech,nds32v3";
> + };

This is missing the required clock-frequency property.

There should be a /cpus node, with each CPU listed under that, to align
with the devicetree spec. Even if you never support SMP, there's no
reason to be different from other architectures.

You should have something like:

/ {
cpus {
cpu {
device_type = "cpu";
compatible = "andestech,n13";
clock-frequency = <...>;
};
};
};

Thanks,
Mark.


Re: [PATCH v2 06/35] nds32: MMU fault handling and page table management

2017-11-27 Thread Mark Rutland
Hi,

On Mon, Nov 27, 2017 at 08:27:53PM +0800, Greentime Hu wrote:
> +void do_page_fault(unsigned long entry, unsigned long addr,
> +unsigned int error_code, struct pt_regs *regs)
> +{

> + /*
> +  * As per x86, we may deadlock here. However, since the kernel only
> +  * validly references user space from well defined areas of the code,
> +  * we can bug out early if this is from code which shouldn't.
> +  */
> + if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> + if (!user_mode(regs) &&
> + !search_exception_tables(instruction_pointer(regs)))
> + goto no_context;
> +retry:
> + down_read(&mm->mmap_sem);
> + } else {
> + /*
> +  * The above down_read_trylock() might have succeeded in which
> +  * case, we'll have missed the might_sleep() from down_read().
> +  */
> + might_sleep();
> + if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> + if (!user_mode(regs) &&
> + !search_exception_tables(instruction_pointer(regs)))
> + goto no_context;
> + }
> + }

> + fault = handle_mm_fault(vma, addr, flags);
> +
> + /*
> +  * If we need to retry but a fatal signal is pending, handle the
> +  * signal first. We do not need to release the mmap_sem because it
> +  * would already be released in __lock_page_or_retry in mm/filemap.c.
> +  */
> + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> + return;

I believe you can get stuck in a livelock here (with an unkillable
task), if a uaccess primitive tries to access a region protected by a
userfaultfd. Please see:

  
https://lkml.kernel.org/r/1499782590-31366-1-git-send-email-mark.rutl...@arm.com

... for details and a test case.
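
The shape of that fix is roughly the below (paraphrased; see the link
above for the real patch) -- only bail out of the retry early for
user-mode faults, and take the fixup path for kernel-mode ones:

        if (fault & VM_FAULT_RETRY) {
                if (fatal_signal_pending(current)) {
                        /*
                         * A fatal signal won't be delivered to a kernel
                         * context, so take the fixup path rather than
                         * livelocking in the uaccess retry loop.
                         */
                        if (!user_mode(regs))
                                goto no_context;
                        return;
                }
                /* ... retry handling as before ... */
        }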

Thanks,
Mark.


Re: [PATCH v2 10/35] nds32: Atomic operations

2017-11-27 Thread Mark Rutland
Hi,

On Mon, Nov 27, 2017 at 08:27:57PM +0800, Greentime Hu wrote:
> +static inline void arch_spin_unlock(arch_spinlock_t * lock)
> +{
> + asm volatile(
> + "xor$r15,  $r15, $r15\n"
> + "swi$r15, [%0]\n"
> + :
> + :"r"(&lock->lock)
> + :"memory");
> +}

This looks suspicious. There's no clobber for $r15, so isn't this
corrupting a GPR behind the back of the compiler?

Shouldn't there be a tmp variable for the register allocation rather
than hard-coding $r15?

> +static inline void arch_write_unlock(arch_rwlock_t * rw)
> +{
> + asm volatile(
> + "xor$r15, $r15, $r15\n"
> + "swi$r15, [%0]\n"
> + :
> + :"r"(&rw->lock)
> + :"memory","$r15");
> +}

This time you have a clobber, but it's still not clear to me why you
don't use a tmp variable and leave the register allocation to the
compiler.
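
i.e. something like the below (untested sketch, same asm as above but
with the compiler allocating the temporary):

static inline void arch_write_unlock(arch_rwlock_t *rw)
{
        unsigned long tmp;

        asm volatile(
                "xor    %0, %0, %0\n"
                "swi    %0, [%1]\n"
                : "=&r" (tmp)
                : "r" (&rw->lock)
                : "memory");
}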

Thanks,
Mark.


Re: [PATCH] bpf: devmap: Check attr->max_entries more carefully

2017-10-17 Thread Mark Rutland
On Mon, Oct 16, 2017 at 08:52:13PM +0200, Daniel Borkmann wrote:
> [ +Tejun, Mark, John ]
> 
> On 10/16/2017 12:00 AM, Richard Weinberger wrote:
> > max_entries is user controlled and used as input for __alloc_percpu().
> > This function expects that the allocation size is a power of two and
> > less than PCPU_MIN_UNIT_SIZE.
> > Otherwise a WARN() is triggered.
> > 
> > Fixes: 11393cc9b9be ("xdp: Add batching support to redirect map")
> > Reported-by: Shankara Pailoor 
> > Reported-by: syzkaller 
> > Signed-off-by: Richard Weinberger 
> 
> Thanks for the patch, Richard. There was a prior discussion here [1] on
> the same issue, I thought this would have been resolved by now, but looks
> like it's still open and there was never a follow-up, at least I don't see
> it in the percpu tree if I didn't miss anything.

Sorry, this was on my todo list, but I've been bogged down with some
other work.

> I would suggest, we do the following below and pass __GFP_NOWARN from BPF
> side to the per-cpu allocs. This is kind of a generic 'issue' and we shouldn't
> add more code which bails out anyway just to work around the WARN(). Lets
> handle it properly instead.

Agreed. The below patch looks good to me (with the suggested change to
the BPF side).
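
For the BPF side I'd imagine something like the below at the devmap
percpu allocation (illustrative only; the field and size names may not
match the current code exactly):

        dtab->flush_needed = __alloc_percpu_gfp(bitmap_size,
                                                __alignof__(unsigned long),
                                                GFP_KERNEL | __GFP_NOWARN);
        if (!dtab->flush_needed)
                goto free_dtab;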

> If Tejun is fine with the one below, I could cook and official patch and
> cleanup the remaining call-sites from BPF which have similar pattern.

That would be great; thanks for taking this on.

Thanks,
Mark.

> 
>   [1] https://patchwork.kernel.org/patch/9975851/
> 
> Thanks,
> Daniel
> 
>  mm/percpu.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 59d44d6..5d9414e 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1357,7 +1357,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
> align, bool reserved,
> 
>   if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
>!is_power_of_2(align))) {
> - WARN(true, "illegal size (%zu) or align (%zu) for percpu 
> allocation\n",
> + WARN(!(gfp & __GFP_NOWARN),
> +  "illegal size (%zu) or align (%zu) for percpu 
> allocation\n",
>size, align);
>   return NULL;
>   }
> @@ -1478,7 +1479,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
> align, bool reserved,
>  fail:
>   trace_percpu_alloc_percpu_fail(reserved, is_atomic, size, align);
> 
> - if (!is_atomic && warn_limit) {
> + if (!is_atomic && warn_limit && !(gfp & __GFP_NOWARN)) {
>   pr_warn("allocation failed, size=%zu align=%zu atomic=%d, %s\n",
>   size, align, is_atomic, err);
>   dump_stack();
> -- 
> 1.9.3


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-20 Thread Mark Rutland
On Thu, Oct 19, 2017 at 10:16:08PM -0400, Wei Wei wrote:
> Hi all,

Hi,

> I have fuzzed v4.14-rc3 using syzkaller and found a bug similar to that one 
> [1].
> But the call trace isn’t the same. The atomic_inc() might handle a corrupted 
> skb_buff.
> 
> The logs and config have been uploaded to my github repo [2].
> 
> [1] https://lkml.org/lkml/2017/10/2/216
> [2] https://github.com/dotweiba/skb_clone_atomic_inc_bug

These do look very similar to what I was hitting; all appear to be
misaligned atomics in the same path.

I see that you have some empty repro files in [2]. If you have any
reproducers, would you mind sharing them?

If any of those are smaller or more reliable than the one I was able to
generate [3], it might make it more obvious what's going on, and/or make
it simpler to come up with a plain C reproducer.

Thanks,
Mark.

[3] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skb_clone-misaligned-atomic/syzkaller.repro


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-20 Thread Mark Rutland
On Fri, Oct 20, 2017 at 10:40:38AM -0400, Wei Wei wrote:
> Sadly, the syzkaller characterized it as a non-reproducible bug and there 
> were empty
>  repro files. But if manually executing in VM like this “./syz-execprog 
> -executor=
> ./syz-executor -repeat=0 -procs=16 -cover=0 crash-log”, it crashed when 
> executing exactly 
> program 1056 using log0 provided.
> 
> I failed to generate the C reproducer with syz-repro as it said “no target 
> compiler”
> in the final step. I would appreciate if you could give some hints.

syz-repro should produce a smaller syzkaller log before it tries to
generate a C file.

I use:

$ syz-repro -config qemu.cfg logN

... and in most cases it will eventually print a smaller log to the
console.

Thanks,
Mark.


Re: [PATCH RFC 3/4] dt-bindings: correct marvell orion MDIO binding document

2017-01-09 Thread Mark Rutland
On Sat, Jan 07, 2017 at 11:28:30AM +, Russell King wrote:
> Correct the Marvell Orion MDIO binding document to properly reflect the
> cases where an interrupt is present.  Augment the examples to show this.
> 
> Signed-off-by: Russell King 

This looks fine to me.

Acked-by: Mark Rutland 

Mark.

> ---
>  .../devicetree/bindings/net/marvell-orion-mdio.txt  | 17 
> +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt 
> b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
> index 9417e54c26c0..ca733ff68ab9 100644
> --- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
> +++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
> @@ -7,7 +7,10 @@ interface.
>  
>  Required properties:
>  - compatible: "marvell,orion-mdio"
> -- reg: address and length of the SMI register
> +- reg: address and length of the MDIO registers.  When an interrupt is
> +  not present, the length is the size of the SMI register (4 bytes)
> +  otherwise it must be 0x84 bytes to cover the interrupt control
> +  registers.
>  
>  Optional properties:
>  - interrupts: interrupt line number for the SMI error/done interrupt
> @@ -17,7 +20,7 @@ The child nodes of the MDIO driver are the individual PHY 
> devices
>  connected to this MDIO bus. They must have a "reg" property given the
>  PHY address on the MDIO bus.
>  
> -Example at the SoC level:
> +Example at the SoC level without an interrupt property:
>  
>  mdio {
>   #address-cells = <1>;
> @@ -26,6 +29,16 @@ mdio {
>   reg = <0xd0072004 0x4>;
>  };
>  
> +Example with an interrupt property:
> +
> +mdio {
> + #address-cells = <1>;
> + #size-cells = <0>;
> + compatible = "marvell,orion-mdio";
> + reg = <0xd0072004 0x84>;
> + interrupts = <30>;
> +};
> +
>  And at the board level:
>  
>  mdio {
> -- 
> 2.7.4
> 


[PATCH] SUNRPC: fix include for cmpxchg_relaxed()

2017-03-15 Thread Mark Rutland
Currently net/sunrpc/xprtmultipath.c is the only file outside of arch/
headers and asm-generic/ headers to include <asm/cmpxchg.h>, apparently
for the use of cmpxchg_relaxed().

However, many architectures do not provide cmpxchg_relaxed() in their
<asm/cmpxchg.h>, and it is necessary to include <linux/atomic.h> to get
this definition, as noted in Documentation/core-api/atomic_ops.rst:

  If someone wants to use xchg(), cmpxchg() and their variants,
  linux/atomic.h should be included rather than asm/cmpxchg.h, unless
  the code is in arch/* and can take care of itself.

Evidently we're getting the right header via some transitive
include today, but this isn't something we can/should rely upon,
especially with ongoing rework of the atomic headers for KASAN
instrumentation.

Let's fix the code to include <linux/atomic.h>, avoiding fragility.

Signed-off-by: Mark Rutland 
Cc: Trond Myklebust 
Cc: Anna Schumaker 
Cc: J. Bruce Fields 
Cc: Jeff Layton 
Cc: David S. Miller 
Cc: linux-...@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 net/sunrpc/xprtmultipath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtmultipath.c b/net/sunrpc/xprtmultipath.c
index ae92a9e..3b1e856 100644
--- a/net/sunrpc/xprtmultipath.c
+++ b/net/sunrpc/xprtmultipath.c
@@ -12,7 +12,7 @@
 #include 
 #include 
 #include 
-#include <asm/cmpxchg.h>
+#include <linux/atomic.h>
 #include 
 #include 
 #include 
-- 
1.9.1



Re: [PATCH V8 1/3] irq: Add flags to request_percpu_irq function

2017-03-23 Thread Mark Rutland
Hi Daniel,

On Thu, Mar 23, 2017 at 06:42:01PM +0100, Daniel Lezcano wrote:
> In the next changes, we track the interrupts but we discard the timers as
> that does not make sense. The next interrupt on a timer is predictable.

Sorry, but I could not parse this. 

[...]

> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 9612b84..0f5ab4a 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -661,7 +661,7 @@ static int cpu_pmu_request_irq(struct arm_pmu *cpu_pmu, 
> irq_handler_t handler)
>  
>   irq = platform_get_irq(pmu_device, 0);
>   if (irq > 0 && irq_is_percpu(irq)) {
> - err = request_percpu_irq(irq, handler, "arm-pmu",
> + err = request_percpu_irq(irq, 0, handler, "arm-pmu",
>&hw_events->percpu_pmu);
>   if (err) {
>   pr_err("unable to request IRQ%d for ARM PMU counters\n",

Please Cc myself and Will Deacon when modifying the arm_pmu driver, as
per MAINTAINERS. I only spotted this patch by chance.

This conflicts with arm_pmu changes I have queued for v4.12 [1].

So, can we leave the prototype of request_percpu_irq() as-is?

Why not add a new request_percpu_irq_flags() function, and leave
request_percpu_irq() as a wrapper for that? e.g.

static inline int
request_percpu_irq(unsigned int irq, irq_handler_t handler,
   const char *devname, void __percpu *percpu_dev_id)
{
return request_percpu_irq_flags(irq, handler, devname,
percpu_dev_id, 0);
}

... that would avoid having to touch any non-timer driver for now.

[...]

> -request_percpu_irq(unsigned int irq, irq_handler_t handler,
> -const char *devname, void __percpu *percpu_dev_id);
> +request_percpu_irq(unsigned int irq, unsigned long flags,
> +irq_handler_t handler,  const char *devname,
> +void __percpu *percpu_dev_id);
>  

Looking at request_irq, the prototype is:

int __must_check
request_irq(unsigned int irq, irq_handler_t handler,
unsigned long flags, const char *name,
void *dev);

... surely it would be better to share the same argument order? i.e.

int __must_check
request_percpu_irq(unsigned int irq, irq_handler_t handler,
   unsigned long flags, const char *devname,
   void __percpu *percpu_dev_id);

Thanks,
Mark.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm/perf/refactoring


[PATCH] net: ipconfig: fix ic_close_devs() use-after-free

2017-03-27 Thread Mark Rutland
Our chosen ic_dev may be anywhere in our list of ic_devs, and we may
free it before attempting to close others. When we compare d->dev and
ic_dev->dev, we're potentially dereferencing memory returned to the
allocator. This causes KASAN to scream for each subsequent ic_dev we
check.

As there's a 1-1 mapping between ic_devs and netdevs, we can instead
compare d and ic_dev directly, which implicitly handles the !ic_dev
case, and avoids the use-after-free. The ic_dev pointer may be stale,
but we will not dereference it.

Original splat:

[6.487446] 
==
[6.494693] BUG: KASAN: use-after-free in ic_close_devs+0xc4/0x154 at addr 
800367efa708
[6.503013] Read of size 8 by task swapper/0/1
[6.507452] CPU: 5 PID: 1 Comm: swapper/0 Not tainted 
4.11.0-rc3-2-gda42158 #8
[6.514993] Hardware name: AppliedMicro Mustang/Mustang, BIOS 
3.05.05-beta_rc Jan 27 2016
[6.523138] Call trace:
[6.525590] [] dump_backtrace+0x0/0x570
[6.530976] [] show_stack+0x20/0x30
[6.536017] [] dump_stack+0x120/0x188
[6.541231] [] kasan_object_err+0x24/0xa0
[6.546790] [] kasan_report_error+0x244/0x738
[6.552695] [] __asan_report_load8_noabort+0x54/0x80
[6.559204] [] ic_close_devs+0xc4/0x154
[6.564590] [] ip_auto_config+0x2ed4/0x2f1c
[6.570321] [] do_one_initcall+0xcc/0x370
[6.575882] [] kernel_init_freeable+0x5f8/0x6c4
[6.581959] [] kernel_init+0x18/0x190
[6.587171] [] ret_from_fork+0x10/0x40
[6.592468] Object at 800367efa700, in cache kmalloc-128 size: 128
[6.598969] Allocated:
[6.601324] PID = 1
[6.603427]  save_stack_trace_tsk+0x0/0x418
[6.607603]  save_stack_trace+0x20/0x30
[6.611430]  kasan_kmalloc+0xd8/0x188
[6.615087]  ip_auto_config+0x8c4/0x2f1c
[6.619002]  do_one_initcall+0xcc/0x370
[6.622832]  kernel_init_freeable+0x5f8/0x6c4
[6.627178]  kernel_init+0x18/0x190
[6.630660]  ret_from_fork+0x10/0x40
[6.634223] Freed:
[6.636233] PID = 1
[6.638334]  save_stack_trace_tsk+0x0/0x418
[6.642510]  save_stack_trace+0x20/0x30
[6.646337]  kasan_slab_free+0x88/0x178
[6.650167]  kfree+0xb8/0x478
[6.653131]  ic_close_devs+0x130/0x154
[6.656875]  ip_auto_config+0x2ed4/0x2f1c
[6.660875]  do_one_initcall+0xcc/0x370
[6.664705]  kernel_init_freeable+0x5f8/0x6c4
[6.669051]  kernel_init+0x18/0x190
[6.672534]  ret_from_fork+0x10/0x40
[6.676098] Memory state around the buggy address:
[6.680880]  800367efa600: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[6.688078]  800367efa680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
fc
[6.695276] >800367efa700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb
[6.702469]   ^
[6.705952]  800367efa780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc 
fc
[6.713149]  800367efa800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 
fb
[6.720343] 
==
[6.727536] Disabling lock debugging due to kernel taint

Signed-off-by: Mark Rutland 
Cc: Alexey Kuznetsov 
Cc: David S. Miller 
Cc: Hideaki YOSHIFUJI 
Cc: James Morris 
Cc: Patrick McHardy 
Cc: netdev@vger.kernel.org
---
 net/ipv4/ipconfig.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index fd9f34b..dfb2ab2 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -306,7 +306,7 @@ static void __init ic_close_devs(void)
while ((d = next)) {
next = d->next;
dev = d->dev;
-   if ((!ic_dev || dev != ic_dev->dev) && !netdev_uses_dsa(dev)) {
+   if (d != ic_dev && !netdev_uses_dsa(dev)) {
pr_debug("IP-Config: Downing %s\n", dev->name);
dev_change_flags(dev, d->flags);
}
-- 
1.9.1



amd-xgbe: unbalanced irq enable in v4.11-rc1

2017-03-10 Thread Mark Rutland
Hi,

I'm seeing the below splat when transferring data over the network, using
v4.11-rc1 on an AMD Seattle platform. I don't see the splat with v4.10.

Looking at just the driver, I couldn't see any suspicious changes. Reverting
commit 402168b4c2dc0734 ("amd-xgbe: Stop the PHY before releasing interrupts")
doesn't change matters.

Any ideas?

Thanks,
Mark.

[  106.114135] Unbalanced enable for IRQ 34
[  106.118157] [ cut here ]
[  106.122793] WARNING: CPU: 0 PID: 0 at kernel/irq/manage.c:529 
__enable_irq+0x168/0x1d8
[  106.130708] Modules linked in:
[  106.133766] 
[  106.135262] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.11.0-rc1-1-g2411f70 #5
[  106.142830] Hardware name: AMD Seattle (Rev.B0) Development Board 
(Overdrive) (DT)
[  106.150398] task: 2a9e2000 task.stack: 2a9c
[  106.156320] PC is at __enable_irq+0x168/0x1d8
[  106.160679] LR is at __enable_irq+0x168/0x1d8
[  106.165036] pc : [] lr : [] pstate: 
01c5
[  106.172428] sp : 8003fff17c50
[  106.175742] x29: 8003fff17c50 x28: 800359255160 
[  106.181063] x27: 10006b24aa2d x26: 800359255170 
[  106.186385] x25: 2a9ca000 x24: 800359255168 
[  106.191706] x23: 80036902ca80 x22: 80036bbe8f78 
[  106.197027] x21: 10006d77d1ef x20:  
[  106.202348] x19: 80036bbe8f00 x18: 2a9e27b8 
[  106.207668] x17: 7e2b5d90 x16:  
[  106.212989] x15: 1fffe4000153c4f7 x14: 6003f559 
[  106.218310] x13: 2ac45000 x12: 2b803000 
[  106.223631] x11: 1fffe40001ba7a8d x10:  
[  106.228951] x9 : 10007ffe2ee0 x8 : 2a9e27e8 
[  106.234272] x7 : 2d2d3000 x6 :  
[  106.239592] x5 : 41b58ab3 x4 : 1fffe40001ba3a10 
[  106.244913] x3 : 1fffe40001ba3a10 x2 : dfff2000 
[  106.250234] x1 : 10007ffe2f58 x0 : 001c 
[  106.23] 
[  106.257044] ---[ end trace 86935ef2e5e5c498 ]---
[  106.261658] Call trace:
[  106.264106] Exception stack(0x8003fff17a00 to 0x8003fff17b30)
[  106.270550] 7a00: 80036bbe8f00 0001 8003fff17c50 
282a1340
[  106.278383] 7a20: 01c5 003d 2a9ca000 
800359255170
[  106.286216] 7a40: 10006b24aa2d 2a9e2000 8003fff17b60 
73870f20
[  106.294048] 7a60: 41b58ab3 2a7164f0 28081ac8 
2a05d4a0
[  106.301881] 7a80: 80036902ca80 800359255168 2a9ca000 
800359255170
[  106.309713] 7aa0: 8003fff17c50 8003fff17c50 8003fff17c10 
ffc8
[  106.317546] 7ac0: 41b58ab3 2a72a340 28299380 
2a9e2000
[  106.325379] 7ae0: 8003fff17b20 28bf8f08 80036bbe8f00 
80036d8002c0
[  106.333211] 7b00: 80036d8003f8 10007ffe2f70  

[  106.341040] 7b20: 001c 10007ffe2f58
[  106.345921] [] __enable_irq+0x168/0x1d8
[  106.351323] [] enable_irq+0xa8/0x118
[  106.356467] [] xgbe_one_poll+0x190/0x280
[  106.361956] [] net_rx_action+0x6ac/0xd88
[  106.367444] [] __do_softirq+0x324/0xb70
[  106.372845] [] irq_exit+0x1cc/0x338
[  106.377900] [] __handle_domain_irq+0xdc/0x230
[  106.383821] [] gic_handle_irq+0x6c/0xe0
[  106.389220] Exception stack(0x2a9c3d40 to 0x2a9c3e70)
[  106.395663] 3d40: 2a9e27bc 0007  
1fffe4000153c4f7
[  106.403495] 3d60: 0004   
2d2d3000
[  106.411327] 3d80: 2a9e27c0 040001538762  
0003
[  106.419160] 3da0: 2b803000 2ac45000 10007ffe4281 
1fffe4000153c4f7
[  106.426992] 3dc0:  7e2b5d90 2a9e27b8 
2a9e2000
[  106.434824] 3de0: 2a9e2000  2a9cfeb0 
0003
[  106.442656] 3e00: 00ff 2a9cfe18 2a901058 

[  106.450488] 3e20: 008004810018 2a9c3e70 2808a1d8 
2a9c3e70
[  106.458321] 3e40: 2808a1dc 6145  
2a901058
[  106.466150] 3e60:  2808a1d8
[  106.471029] [] el1_irq+0xb8/0x130
[  106.475909] [] arch_cpu_idle+0x1c/0x28
[  106.481225] [] default_idle_call+0x34/0x78
[  106.486888] [] do_idle+0x21c/0x3c0
[  106.491856] [] cpu_startup_entry+0x24/0x28
[  106.497517] [] rest_init+0x1ec/0x200
[  106.502659] [] start_kernel+0x5d8/0x604
[  106.508061] [] __primary_switched+0x6c/0x74



Re: amd-xgbe: unbalanced irq enable in v4.11-rc1

2017-03-10 Thread Mark Rutland
On Fri, Mar 10, 2017 at 11:39:42AM -0600, Tom Lendacky wrote:
> On 3/10/2017 11:19 AM, Mark Rutland wrote:
> >Hi,
> >
> >I'm seeing the below splat when transferring data over the network, using
> >v4.11-rc1 on an AMD Seattle platform. I don't see the splat with v4.10.
> >
> >Looking at just the driver, I couldn't see any suspicious changes. Reverting
> >commit 402168b4c2dc0734 ("amd-xgbe: Stop the PHY before releasing 
> >interrupts")
> >doesn't change matters.
> >
> >Any ideas?
> 
> Yes, patch submitted.  Please see:
> 
>   http://marc.info/?l=linux-netdev&m=148910333810442&w=2

Ah, thanks for the pointer!

That appears to solve the issue for me.

Thanks,
Mark.


Re: Initializing MAC address at run-time

2017-01-18 Thread Mark Rutland
On Wed, Jan 18, 2017 at 03:03:57PM +0100, Mason wrote:
> Hello,
> 
> When my system boots up, eth0 is given a seemingly random MAC address.
> 
> [0.950734] nb8800 26000.ethernet eth0: MAC address ba:de:d6:38:b8:38
> [0.957334] nb8800 26000.ethernet eth0: MAC address 6e:f1:48:de:d6:c4
> 
> 
> The DT node for eth0 is:
> 
>   eth0: ethernet@26000 {
>   compatible = "sigma,smp8734-ethernet";
>   reg = <0x26000 0x800>;
>   interrupts = <38 IRQ_TYPE_LEVEL_HIGH>;
>   clocks = <&clkgen SYS_CLK>;
>   };
> 
> Documentation/devicetree/bindings/net/ethernet.txt mentions
> - local-mac-address: array of 6 bytes, specifies the MAC address that was
>   assigned to the network device;
> 
> And indeed, if I define this property, eth0 ends up with the MAC address
> I specify in the device tree. But of course, I don't want all my boards
> to share the same MAC address. Every interface has a unique MAC address.
> 
> In fact, the boot loader (not Uboot, a custom non-DT boot loader) stores
> the MAC address somewhere in MMIO space, in some weird custom format.
> 
> So, at init, I can find the MAC address, and dynamically insert the
> "local-mac-address" property in the eth0 node.

To me it sounds very convoluted to do this from the kernel, to pass
information back to itself. I don't think this is the best way to handle
this.

> Is there another (better) way to do this?
> 
> I'll post my code below, for illustration purpose.
> 
> Mark suggested this can be done from user-space, but I can't do that,
> because I'm using an NFS rootfs, so I need the network before I even
> have a user-space. And the DHCP server is configured to serve different
> root filesystems, based on the MAC address.

That's not quite what I said. I asked whether your information was
coming from userspace or from a kernel driver.

My suggestion was that this should be done in the probe path somehow,
by describing the relationship between the ethernet controller and the
device containing the MAC information.

e.g. on the ethernet device, have a phandle (and perhaps some other
args) describing the device containing the MAC, and how to extract it.

That way, in the ethernet probe path we can go and look up the MAC
address from the provider of that information.
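
As a rough sketch (the property name and decode step below are invented
purely for illustration; the real binding would need to be defined and
documented):

static int nb8800_read_boot_mac(struct device *dev, u8 *addr)
{
        struct device_node *np;
        void __iomem *base;

        /* "mac-storage" is an invented property naming the provider node */
        np = of_parse_phandle(dev->of_node, "mac-storage", 0);
        if (!np)
                return -ENODEV;

        base = of_iomap(np, 0);
        of_node_put(np);
        if (!base)
                return -ENOMEM;

        /* decode the vendor-specific format into addr[] here */

        iounmap(base);
        return is_valid_ether_addr(addr) ? 0 : -EINVAL;
}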

> I need to do something similar with the NAND partitions. The boot loader
> stores the partition offsets somewhere, and I need to pass this info
> to the NAND framework, so I assumed that inserting the corresponding
> properties at run-time was the correct way to do it.

I would say similar could happen here.

Thanks,
Mark.


Re: arm64: next-20170428 hangs on boot

2017-04-28 Thread Mark Rutland
On Fri, Apr 28, 2017 at 04:24:29PM +0300, Yury Norov wrote:
> Hi all,

Hi,

[adding Dave Miller, netdev, lkml]

> On QEMU the next-20170428 hangs on boot for me due to kernel panic in 
> rtnetlink_init():
> 
> void __init rtnetlink_init(void)
> {
> if (register_pernet_subsys(&rtnetlink_net_ops))
> panic("rtnetlink_init: cannot initialize rtnetlink\n");
> 
> ...
> }

I see the same thing with a next-20170428 arm64 defconfig, on a Juno R1
system:

[0.531949] Kernel panic - not syncing: rtnetlink_init: cannot initialize 
rtnetlink
[0.531949] 
[0.541271] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
4.11.0-rc8-next-20170428-2-g6ee3799 #10
[0.550307] Hardware name: ARM Juno development board (r1) (DT)
[0.556332] Call trace:
[0.558833] [] dump_backtrace+0x0/0x238
[0.564332] [] show_stack+0x14/0x20
[0.569477] [] dump_stack+0x9c/0xc0
[0.574622] [] panic+0x11c/0x28c
[0.579505] [] rtnetlink_init+0x2c/0x1d0
[0.585092] [] netlink_proto_init+0x14c/0x17c
[0.591119] [] do_one_initcall+0x38/0x120
[0.596796] [] kernel_init_freeable+0x1a0/0x240
[0.603003] [] kernel_init+0x10/0x100
[0.608324] [] ret_from_fork+0x10/0x50
[0.613736] SMP: stopping secondary CPUs
[0.617738] ---[ end Kernel panic - not syncing: rtnetlink_init: cannot 
initialize rtnetlink

If this isn't a known issue, it would be worth trying to bisect this.

Thanks,
Mark.

> The backtrace is:
> #0  arch_counter_get_cntvct () at ./arch/arm64/include/asm/arch_timer.h:160
> #1  __delay (cycles=62500) at arch/arm64/lib/delay.c:31
> #2  0x0838a430 in __const_udelay (xloops=) at 
> arch/arm64/lib/delay.c:41
> #3  0x08165eac in panic (fmt=) at kernel/panic.c:297
> #4  0x08b5b9c8 in rtnetlink_init () at net/core/rtnetlink.c:4196
> #5  0x08b5be08 in netlink_proto_init () at 
> net/netlink/af_netlink.c:2730
> #6  0x08083158 in do_one_initcall (fn=0x08b5bcc4 
> ) at init/main.c:795
> #7  0x08b20d04 in do_initcall_level (level=) at 
> init/main.c:861
> #8  do_initcalls () at init/main.c:869
> #9  do_basic_setup () at init/main.c:887
> #10 kernel_init_freeable () at init/main.c:1039
> #11 0x08817bb0 in kernel_init (unused=) at 
> init/main.c:962
> #12 0x08082ec0 in ret_from_fork () at arch/arm64/kernel/entry.S:789
> Backtrace stopped: previous frame identical to this frame (corrupt stack?) 
> 
> next-20170426 is OK though.
> 
> Yury
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality

2020-12-02 Thread Mark Rutland
Hi Alex,

On Mon, Nov 23, 2020 at 05:58:06PM +, Alex Belits wrote:
> In do_notify_resume(), call task_isolation_before_pending_work_check()
> first, to report isolation breaking, then after handling all pending
> work, call task_isolation_start() for TIF_TASK_ISOLATION tasks.
> 
> Add _TIF_TASK_ISOLATION to _TIF_WORK_MASK, and _TIF_SYSCALL_WORK,
> define local NOTIFY_RESUME_LOOP_FLAGS to check in the loop, since we
> don't clear _TIF_TASK_ISOLATION in the loop.
> 
> Early kernel entry code calls task_isolation_kernel_enter(). In
> particular:
> 
> Vectors:
> el1_sync -> el1_sync_handler() -> task_isolation_kernel_enter()
> el1_irq -> asm_nmi_enter(), handle_arch_irq()
> el1_error -> do_serror()
> el0_sync -> el0_sync_handler()
> el0_irq -> handle_arch_irq()
> el0_error -> do_serror()
> el0_sync_compat -> el0_sync_compat_handler()
> el0_irq_compat -> handle_arch_irq()
> el0_error_compat -> do_serror()
> 
> SDEI entry:
> __sdei_asm_handler -> __sdei_handler() -> nmi_enter()

As a heads-up, the arm64 entry code is changing, as we found that our
lockdep, RCU, and context-tracking management wasn't quite right. I have
a series of patches:

  https://lore.kernel.org/r/20201130115950.22492-1-mark.rutl...@arm.com

... which are queued in the arm64 for-next/fixes branch. I intend to
have some further rework ready for the next cycle. I'd appreciate if you
could Cc me on any patches altering the arm64 entry code, as I have a
vested interest.

That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
chosen and context tracking was in use (e.g. with
CONTEXT_TRACKING_FORCE), so I'm assuming that this series has not been
tested in that configuration. What sort of testing has this seen?

It would be very helpful for the next posting if you could provide any
instructions on how to test this series (e.g. with pointers to any test
suite that you have), since it's very easy to introduce subtle breakage
in this area without realising it.

> 
> Functions called from there:
> asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> 
> Handlers:
> do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
>   or task_isolation_kernel_enter()
> el1_sync_handler() -> task_isolation_kernel_enter()
> el0_sync_handler() -> task_isolation_kernel_enter()
> el0_sync_compat_handler() -> task_isolation_kernel_enter()
> 
> handle_arch_irq() is irqchip-specific, most call handle_domain_irq()
> There is a separate patch for irqchips that do not follow this rule.
> 
> handle_domain_irq() -> task_isolation_kernel_enter()
> do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> nmi_enter() -> task_isolation_kernel_enter()

The IRQ cases look very odd to me. With the rework I've just done for
arm64, we'll do the regular context tracking accounting before we ever
get into handle_domain_irq() or similar, so I suspect that's not
necessary at all?

> 
> Signed-off-by: Chris Metcalf 
> [abel...@marvell.com: simplified to match kernel 5.10]
> Signed-off-by: Alex Belits 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/barrier.h |  1 +
>  arch/arm64/include/asm/thread_info.h |  7 +--
>  arch/arm64/kernel/entry-common.c |  7 +++
>  arch/arm64/kernel/ptrace.c   | 10 ++
>  arch/arm64/kernel/signal.c   | 13 -
>  arch/arm64/kernel/smp.c  |  3 +++
>  7 files changed, 39 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1515f6f153a0..fc958d8d8945 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -141,6 +141,7 @@ config ARM64
>   select HAVE_ARCH_PREL32_RELOCATIONS
>   select HAVE_ARCH_SECCOMP_FILTER
>   select HAVE_ARCH_STACKLEAK
> + select HAVE_ARCH_TASK_ISOLATION
>   select HAVE_ARCH_THREAD_STRUCT_WHITELIST
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE
> diff --git a/arch/arm64/include/asm/barrier.h 
> b/arch/arm64/include/asm/barrier.h
> index c3009b0e5239..ad5a6dd380cf 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -49,6 +49,7 @@
>  #define dma_rmb()dmb(oshld)
>  #define dma_wmb()dmb(oshst)
>  
> +#define instr_sync() isb()

I think I've asked on prior versions of the patchset, but what is this
for? Where is it going to be used, and what are the expected semantics?
I'm wary of exposing this outside of arch code because there aren't
strong cross-architectural semantics, and at the least this requires
some documentation.

If it's unused, please delete it.

[...]

> diff --git a/arch/arm64/kernel/entry-common.c 
> b/arch/arm64/kernel/entry-common.c
> index 43d4c329775f..8152760de683 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -77,6 +78,8 @@ asmlink

Re: [EXT] Re: [PATCH v5 0/9] "Task_isolation" mode

2020-12-02 Thread Mark Rutland
On Tue, Nov 24, 2020 at 05:40:49PM +, Alex Belits wrote:
> 
> On Tue, 2020-11-24 at 08:36 -0800, Tom Rix wrote:
> > External Email
> > 
> > ---
> > ---
> > 
> > On 11/23/20 9:42 AM, Alex Belits wrote:
> > > This is an update of task isolation work that was originally done
> > > by
> > > Chris Metcalf  and maintained by him until
> > > November 2017. It is adapted to the current kernel and cleaned up
> > > to
> > > implement its functionality in a more complete and cleaner manner.
> > 
> > I am having problems applying the patchset to today's linux-next.
> > 
> > Which kernel should I be using ?
> 
> The patches are against Linus' tree, in particular, commit
> a349e4c659609fd20e4beea89e5c4a4038e33a95

Is there any reason to base on that commit in particular?

Generally it's preferred that a series is based on a tag (so either a
release or an -rc kernel), and that the cover letter explains what the
base is. If you can do that in future it'll make the series much easier
to work with.

Thanks,
Mark.


Re: [PATCH v5 5/9] task_isolation: Add driver-specific hooks

2020-12-02 Thread Mark Rutland
On Mon, Nov 23, 2020 at 05:57:42PM +, Alex Belits wrote:
> Some drivers don't call functions that call
> task_isolation_kernel_enter() in interrupt handlers. Call it
> directly.

I don't think putting this in drivers is the right approach. IIUC we
only need to track user<->kernel transitions, and we can do that within
the architectural entry code before we ever reach irqchip code. I
suspect the current approacch is an artifact of that being difficult in
the old structure of the arch code; recent rework should address that,
and we can restruecture things further in future.
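
By way of illustration, a sketch against the arm64 entry code (the
function name and the exact placement relative to the lockdep and
context-tracking accounting are illustrative, not a final patch):

asmlinkage void noinstr enter_from_user_mode(void)
{
        task_isolation_kernel_enter();

        lockdep_hardirqs_off(CALLER_ADDR0);
        user_exit_irqoff();
        trace_hardirqs_off_finish();
}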

Thanks,
Mark.

> Signed-off-by: Alex Belits 
> ---
>  drivers/irqchip/irq-armada-370-xp.c | 6 ++
>  drivers/irqchip/irq-gic-v3.c| 3 +++
>  drivers/irqchip/irq-gic.c   | 3 +++
>  drivers/s390/cio/cio.c  | 3 +++
>  4 files changed, 15 insertions(+)
> 
> diff --git a/drivers/irqchip/irq-armada-370-xp.c 
> b/drivers/irqchip/irq-armada-370-xp.c
> index d7eb2e93db8f..4ac7babe1abe 100644
> --- a/drivers/irqchip/irq-armada-370-xp.c
> +++ b/drivers/irqchip/irq-armada-370-xp.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -572,6 +573,7 @@ static const struct irq_domain_ops 
> armada_370_xp_mpic_irq_ops = {
>  static void armada_370_xp_handle_msi_irq(struct pt_regs *regs, bool 
> is_chained)
>  {
>   u32 msimask, msinr;
> + int isol_entered = 0;
>  
>   msimask = readl_relaxed(per_cpu_int_base +
>   ARMADA_370_XP_IN_DRBEL_CAUSE_OFFS)
> @@ -588,6 +590,10 @@ static void armada_370_xp_handle_msi_irq(struct pt_regs 
> *regs, bool is_chained)
>   continue;
>  
>   if (is_chained) {
> + if (!isol_entered) {
> + task_isolation_kernel_enter();
> + isol_entered = 1;
> + }
>   irq = irq_find_mapping(armada_370_xp_msi_inner_domain,
>  msinr - PCI_MSI_DOORBELL_START);
>   generic_handle_irq(irq);
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 16fecc0febe8..ded26dd4da0f 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -18,6 +18,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -646,6 +647,8 @@ static asmlinkage void __exception_irq_entry 
> gic_handle_irq(struct pt_regs *regs
>  {
>   u32 irqnr;
>  
> + task_isolation_kernel_enter();
> +
>   irqnr = gic_read_iar();
>  
>   if (gic_supports_nmi() &&
> diff --git a/drivers/irqchip/irq-gic.c b/drivers/irqchip/irq-gic.c
> index 6053245a4754..bb482b4ae218 100644
> --- a/drivers/irqchip/irq-gic.c
> +++ b/drivers/irqchip/irq-gic.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -337,6 +338,8 @@ static void __exception_irq_entry gic_handle_irq(struct 
> pt_regs *regs)
>   struct gic_chip_data *gic = &gic_data[0];
>   void __iomem *cpu_base = gic_data_cpu_base(gic);
>  
> + task_isolation_kernel_enter();
> +
>   do {
>   irqstat = readl_relaxed(cpu_base + GIC_CPU_INTACK);
>   irqnr = irqstat & GICC_IAR_INT_ID_MASK;
> diff --git a/drivers/s390/cio/cio.c b/drivers/s390/cio/cio.c
> index 6d716db2a46a..beab1b6d 100644
> --- a/drivers/s390/cio/cio.c
> +++ b/drivers/s390/cio/cio.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -584,6 +585,8 @@ void cio_tsch(struct subchannel *sch)
>   struct irb *irb;
>   int irq_context;
>  
> + task_isolation_kernel_enter();
> +
>   irb = this_cpu_ptr(&cio_irb);
>   /* Store interrupt response block to lowcore. */
>   if (tsch(sch->schid, irb) != 0)
> -- 
> 2.20.1
> 


Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-12-02 Thread Mark Rutland
On Mon, Nov 23, 2020 at 05:58:22PM +, Alex Belits wrote:
> From: Yuri Norov 
> 
> For nohz_full CPUs the desirable behavior is to receive interrupts
> generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> obviously not desirable because it breaks isolation.
> 
> This patch adds check for it.
> 
> Signed-off-by: Yuri Norov 
> [abel...@marvell.com: updated, only exclude CPUs running isolated tasks]
> Signed-off-by: Alex Belits 
> ---
>  kernel/time/tick-sched.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index a213952541db..6c8679e200f0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
>   */
>  void tick_nohz_full_kick_cpu(int cpu)
>  {
> - if (!tick_nohz_full_cpu(cpu))
> + smp_rmb();

What does this barrier pair with? The commit message doesn't mention it,
and it's not clear in-context.

Thanks,
Mark.

> + if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
>   return;
>  
>   irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
> -- 
> 2.20.1
> 


Re: [EXT] Re: [PATCH v5 6/9] task_isolation: arch/arm64: enable task isolation functionality

2020-12-07 Thread Mark Rutland
On Fri, Dec 04, 2020 at 12:37:32AM +, Alex Belits wrote:
> On Wed, 2020-12-02 at 13:59 +0000, Mark Rutland wrote:
> > On Mon, Nov 23, 2020 at 05:58:06PM +, Alex Belits wrote:

> > As a heads-up, the arm64 entry code is changing, as we found that
> > our lockdep, RCU, and context-tracking management wasn't quite
> > right. I have a series of patches:
> > 
> > https://lore.kernel.org/r/20201130115950.22492-1-mark.rutl...@arm.com
> > 
> > ... which are queued in the arm64 for-next/fixes branch. I intend to
> > have some further rework ready for the next cycle.

> > That was quite obviously broken if PROVE_LOCKING and NO_HZ_FULL were
> > chosen and context tracking was in use (e.g. with
> > CONTEXT_TRACKING_FORCE),
> 
> I am not yet sure about TRACE_IRQFLAGS, however NO_HZ_FULL and
> CONTEXT_TRACKING have to be enabled for it to do anything.
> 
> I will check it with PROVE_LOCKING and your patches.

Thanks. In future, please do test this functionality with PROVE_LOCKING,
because otherwise bugs with RCU and IRQ state management will easily be
missed (as has been the case until very recently).

Testing with all those debug options enabled (and stating that you have
done so) will give reviewers much greater confidence that this works,
and if that does start spewing errors it saves everyone the time of
identifying them.

> Entry code only adds an inline function that, if task isolation is
> enabled, uses raw_local_irq_save() / raw_local_irq_restore(), low-level 
> operations and accesses per-CPU variabled by offset, so at very least
> it should not add any problems. Even raw_local_irq_save() /
> raw_local_irq_restore() probably should be removed, however I wanted to
> have something that can be safely called if by whatever reason
> interrupts were enabled before kernel was fully entered.

Sure. In the new flows we have new enter_from_*() and exit_to_*()
functions where these calls should be able to live (and so we should be
able to ensure a more consistent environment).

The near-term plan for arm64 is to migrate more of the exception triage
assembly to C, then to rework the arm64 entry code and generic entry
code to be more similar, then to migrate as much as possible to the
generic entry code. So please bear in mind that anything that adds to
the differences between the two is going to be problematic.

> >  so I'm assuming that this series has not been
> > tested in that configuration. What sort of testing has this seen?
> 
> On various available arm64 hardware, with enabled
> 
> CONFIG_TASK_ISOLATION
> CONFIG_NO_HZ_FULL
> CONFIG_HIGH_RES_TIMERS
> 
> and disabled:
> 
> CONFIG_HZ_PERIODIC
> CONFIG_NO_HZ_IDLE
> CONFIG_NO_HZ

Ok. I'd recommend looking at the various debug options under the "kernel
hacking" section in kconfig, and enabling some of those. At the very
least PROVE_LOCKING, ideally also using the lockup detectors and anything
else for debugging RCU, etc.

[...]

> > > Functions called from there:
> > > asm_nmi_enter() -> nmi_enter() -> task_isolation_kernel_enter()
> > > asm_nmi_exit() -> nmi_exit() -> task_isolation_kernel_return()
> > > 
> > > Handlers:
> > > do_serror() -> nmi_enter() -> task_isolation_kernel_enter()
> > >   or task_isolation_kernel_enter()
> > > el1_sync_handler() -> task_isolation_kernel_enter()
> > > el0_sync_handler() -> task_isolation_kernel_enter()
> > > el0_sync_compat_handler() -> task_isolation_kernel_enter()
> > > 
> > > handle_arch_irq() is irqchip-specific, most call
> > > handle_domain_irq()
> > > There is a separate patch for irqchips that do not follow this
> > > rule.
> > > 
> > > handle_domain_irq() -> task_isolation_kernel_enter()
> > > do_handle_IPI() -> task_isolation_kernel_enter() (may be redundant)
> > > nmi_enter() -> task_isolation_kernel_enter()
> > 
> > The IRQ cases look very odd to me. With the rework I've just done
> > for arm64, we'll do the regular context tracking accounting before
> > we ever get into handle_domain_irq() or similar, so I suspect that's
> > not necessary at all?
> 
> The goal is to call task_isolation_kernel_enter() before anything that
> depends on a CPU state, including pipeline, that could remain un-
> synchronized when the rest of the kernel was sending synchronization
> IPIs. Similarly task_isolation_kernel_return() should be called when it
> is safe to turn off synchronization. If rework allows it to be done
> earlier, there is no need to touch more specific functions.

Sure; I think that's sorted as a result of the changes I made recently.

> 
> > 

Re: [EXT] Re: [PATCH v5 7/9] task_isolation: don't interrupt CPUs with tick_nohz_full_kick_cpu()

2020-12-07 Thread Mark Rutland
On Fri, Dec 04, 2020 at 12:54:29AM +, Alex Belits wrote:
> 
> On Wed, 2020-12-02 at 14:20 +, Mark Rutland wrote:
> > External Email
> > 
> > ---
> > ---
> > On Mon, Nov 23, 2020 at 05:58:22PM +, Alex Belits wrote:
> > > From: Yuri Norov 
> > > 
> > > For nohz_full CPUs the desirable behavior is to receive interrupts
> > > generated by tick_nohz_full_kick_cpu(). But for hard isolation it's
> > > obviously not desirable because it breaks isolation.
> > > 
> > > This patch adds check for it.
> > > 
> > > Signed-off-by: Yuri Norov 
> > > [abel...@marvell.com: updated, only exclude CPUs running isolated
> > > tasks]
> > > Signed-off-by: Alex Belits 
> > > ---
> > >  kernel/time/tick-sched.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index a213952541db..6c8679e200f0 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -20,6 +20,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -268,7 +269,8 @@ static void tick_nohz_full_kick(void)
> > >   */
> > >  void tick_nohz_full_kick_cpu(int cpu)
> > >  {
> > > - if (!tick_nohz_full_cpu(cpu))
> > > + smp_rmb();
> > 
> > What does this barrier pair with? The commit message doesn't mention
> > it,
> > and it's not clear in-context.
> 
> With barriers in task_isolation_kernel_enter()
> and task_isolation_exit_to_user_mode().

Please add a comment in the code as to what it pairs with.
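
For example (the wording below is illustrative only -- you'll need to
state the actual pairing, which I can't determine from this patch
alone):

void tick_nohz_full_kick_cpu(int cpu)
{
        /*
         * Pairs with the write barrier in task_isolation_kernel_enter() /
         * task_isolation_exit_to_user_mode(), so that the target CPU's
         * isolation state is observed before deciding whether to kick it.
         */
        smp_rmb();
        if (!tick_nohz_full_cpu(cpu) || task_isolation_on_cpu(cpu))
                return;

        irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
}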

Thanks,
Mark.


Preemptible idr_alloc() in QRTR code

2021-01-26 Thread Mark Rutland
Hi,

When fuzzing arm64 with Syzkaller, I'm seeing some splats where
this_cpu_ptr() is used in the bowels of idr_alloc(), by way of
radix_tree_node_alloc(), in a preemptible context:

| BUG: using smp_processor_id() in preemptible [] code: 
syz-executor.1/32582
| caller is debug_smp_processor_id+0x24/0x30
| CPU: 3 PID: 32582 Comm: syz-executor.1 Not tainted 
5.11.0-rc4-next-20210125-1-gf57e7edf910d #3
| Hardware name: linux,dummy-virt (DT)
| Call trace:
|  dump_backtrace+0x0/0x4a8
|  show_stack+0x34/0x88
|  dump_stack+0x1d4/0x2a0
|  check_preemption_disabled+0x1b8/0x210
|  debug_smp_processor_id+0x24/0x30
|  radix_tree_node_alloc.constprop.17+0x26c/0x3d0
|  radix_tree_extend+0x200/0x420
|  idr_get_free+0x63c/0xa38
|  idr_alloc_u32+0x164/0x2a0
|  __qrtr_bind.isra.8+0x350/0x658
|  qrtr_bind+0x18c/0x218
|  __sys_bind+0x1fc/0x238
|  __arm64_sys_bind+0x78/0xb0
|  el0_svc_common+0x1ac/0x4c8
|  do_el0_svc+0xfc/0x150
|  el0_svc+0x24/0x38
|  el0_sync_handler+0x134/0x180
|  el0_sync+0x154/0x180

It's not clear to me whether this is a bug in the caller or a bug in the
implementation of idr_alloc(). The kerneldoc for idr_alloc() mentions
that callers must provide their own locking (and in this case a mutex is
used), but doesn't mention that preemption must be disabled.

Is this an intentional requirement that's simply missing from the
documentation and requires a change to the QRTR code, or is this
something to fix within the bowels of idr_alloc() and its callees?

Thanks,
Mark.


Re: Preemptible idr_alloc() in QRTR code

2021-01-26 Thread Mark Rutland
On Tue, Jan 26, 2021 at 02:58:33PM +, Matthew Wilcox wrote:
> On Tue, Jan 26, 2021 at 10:47:34AM +0000, Mark Rutland wrote:
> > Hi,
> > 
> > When fuzzing arm64 with Syzkaller, I'm seeing some splats where
> > this_cpu_ptr() is used in the bowels of idr_alloc(), by way of
> > radix_tree_node_alloc(), in a preemptible context:
> 
> I sent a patch to fix this last June.  The maintainer seems to be
> under the impression that I care an awful lot more about their
> code than I do.
> 
> https://lore.kernel.org/netdev/20200605120037.17427-1-wi...@infradead.org/

Ah; I hadn't spotted the (glaringly obvious) GFP_ATOMIC abuse, thanks
for the pointer, and sorry for the noise.

It looks like Eric was after a fix that could be trivially backported to v4.7
(and hence couldn't rely on xarray) but instead it just got left broken
for months. :/

Bjorn, is this something you care about? You seem to have the most
commits to the file, and otherwise the official maintainer is Dave
Miller per get_maintainer.pl.

It is very tempting to make the config option depend on BROKEN...

Thanks,
Mark.


Re: Preemptible idr_alloc() in QRTR code

2021-01-26 Thread Mark Rutland
On Tue, Jan 26, 2021 at 11:00:05AM -0600, Bjorn Andersson wrote:
> On Tue 26 Jan 10:21 CST 2021, Mark Rutland wrote:
> 
> > On Tue, Jan 26, 2021 at 02:58:33PM +, Matthew Wilcox wrote:
> > > On Tue, Jan 26, 2021 at 10:47:34AM +, Mark Rutland wrote:
> > > > Hi,
> > > > 
> > > > When fuzzing arm64 with Syzkaller, I'm seeing some splats where
> > > > this_cpu_ptr() is used in the bowels of idr_alloc(), by way of
> > > > radix_tree_node_alloc(), in a preemptible context:
> > > 
> > > I sent a patch to fix this last June.  The maintainer seems to be
> > > under the impression that I care an awful lot more about their
> > > code than I do.
> > > 
> > > https://lore.kernel.org/netdev/20200605120037.17427-1-wi...@infradead.org/
> > 
> > Ah; I hadn't spotted the (glaringly obvious) GFP_ATOMIC abuse, thanks
> > for the pointer, and sorry for the noise.
> > 
> 
> I'm afraid this isn't as obvious to me as it is to you. Are you saying
> that one must not use GFP_ATOMIC in non-atomic contexts?
> 
> That said, glancing at the code I'm puzzled to why it would use
> GFP_ATOMIC.

I'm also not entirely sure about the legitimacy of GFP_ATOMIC outside of
atomic contexts -- I couldn't spot any documentation saying that wasn't
legitimate, but Matthew's commit message implies so, and it sticks out
as odd.

> > It looks like Eric was after a fix that trivially backported to v4.7
> > (and hence couldn't rely on xarray) but instead it just got left broken
> > for months. :/
> > 
> > Bjorn, is this something you care about? You seem to have the most
> > commits to the file, and otherwise the official maintainer is Dave
> > Miller per get_maintainer.pl.
> 
> I certainly care about qrtr working and remember glancing at Matthew's
> patch, but seems like I never found time to properly review it.
> 
> > It is very tempting to make the config option depend on BROKEN...
> 
> I hear you and that would be bad, so I'll make sure to take a proper
> look at this and Matthew's patch.

Thanks! I'm happy to try/test patches if that's any help. My main
concern here is that this can be triggered on arbitrary platforms so
long as the driver is built in (e.g. my Syzkaller instance is hitting
this within a VM).

Thanks,
Mark.


Re: [PATCH net] flow_dissector: disable preemption around BPF calls

2019-05-13 Thread Mark Rutland
On Mon, May 13, 2019 at 09:38:55AM -0700, 'Eric Dumazet' via syzkaller wrote:
> Various things in eBPF really require us to disable preemption
> before running an eBPF program.

Is that true for all eBPF uses? I note that we don't disable preemption
in the lib/test_bpf.c module, for example.

If it's a general requirement, perhaps it's worth an assertion within
BPF_PROG_RUN()?

Thanks,
Mark.

> 
> syzbot reported :
> 
> BUG: assuming atomic context at net/core/flow_dissector.c:737
> in_atomic(): 0, irqs_disabled(): 0, pid: 24710, name: syz-executor.3
> 2 locks held by syz-executor.3/24710:
>  #0: e81a4bf1 (&tfile->napi_mutex){+.+.}, at: 
> tun_get_user+0x168e/0x3ff0 drivers/net/tun.c:1850
>  #1: 254afebd (rcu_read_lock){}, at: 
> __skb_flow_dissect+0x1e1/0x4bb0 net/core/flow_dissector.c:822
> CPU: 1 PID: 24710 Comm: syz-executor.3 Not tainted 5.1.0+ #6
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
>  __cant_sleep kernel/sched/core.c:6165 [inline]
>  __cant_sleep.cold+0xa3/0xbb kernel/sched/core.c:6142
>  bpf_flow_dissect+0xfe/0x390 net/core/flow_dissector.c:737
>  __skb_flow_dissect+0x362/0x4bb0 net/core/flow_dissector.c:853
>  skb_flow_dissect_flow_keys_basic include/linux/skbuff.h:1322 [inline]
>  skb_probe_transport_header include/linux/skbuff.h:2500 [inline]
>  skb_probe_transport_header include/linux/skbuff.h:2493 [inline]
>  tun_get_user+0x2cfe/0x3ff0 drivers/net/tun.c:1940
>  tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2037
>  call_write_iter include/linux/fs.h:1872 [inline]
>  do_iter_readv_writev+0x5fd/0x900 fs/read_write.c:693
>  do_iter_write fs/read_write.c:970 [inline]
>  do_iter_write+0x184/0x610 fs/read_write.c:951
>  vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
>  do_writev+0x15b/0x330 fs/read_write.c:1058
>  __do_sys_writev fs/read_write.c:1131 [inline]
>  __se_sys_writev fs/read_write.c:1128 [inline]
>  __x64_sys_writev+0x75/0xb0 fs/read_write.c:1128
>  do_syscall_64+0x103/0x670 arch/x86/entry/common.c:298
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 
> Cc: Petar Penkov 
> Cc: Stanislav Fomichev 
> ---
>  net/core/flow_dissector.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
> index 
> 9ca784c592ac8c9c58282289a81889fbe4658a9e..548f39dde30711ac5be9e921993a6d8e53f74161
>  100644
> --- a/net/core/flow_dissector.c
> +++ b/net/core/flow_dissector.c
> @@ -734,7 +734,9 @@ bool bpf_flow_dissect(struct bpf_prog *prog, struct bpf_flow_dissector *ctx,
>   flow_keys->nhoff = nhoff;
>   flow_keys->thoff = flow_keys->nhoff;
>  
> + preempt_disable();
>   result = BPF_PROG_RUN(prog, ctx);
> + preempt_enable();
>  
>   flow_keys->nhoff = clamp_t(u16, flow_keys->nhoff, nhoff, hlen);
>   flow_keys->thoff = clamp_t(u16, flow_keys->thoff,
> -- 
> 2.21.0.1020.gf2820cf01a-goog
> 


Re: [PATCH net] flow_dissector: disable preemption around BPF calls

2019-05-13 Thread Mark Rutland
On Mon, May 13, 2019 at 10:20:19AM -0700, 'Eric Dumazet' via syzkaller wrote:
> On Mon, May 13, 2019 at 10:17 AM Mark Rutland  wrote:
> >
> > On Mon, May 13, 2019 at 09:38:55AM -0700, 'Eric Dumazet' via syzkaller 
> > wrote:
> > > Various things in eBPF really require us to disable preemption
> > > before running an eBPF program.
> >
> > Is that true for all eBPF uses? I note that we don't disable preemption
> > in the lib/test_bpf.c module, for example.
> >
> > If it's a general requirement, perhaps it's worth an assertion within
> > BPF_PROG_RUN()?
> 
> The assertion is already there :)
> 
> This is how syzbot triggered the report.

Ah! :)

I also see I'm wrong about test_bpf.c, so sorry for the noise on both
counts!

Thanks,
Mark.


Re: [RFC PATCH v11 5/9] psci: Add hypercall service for ptp_kvm.

2020-04-30 Thread Mark Rutland
On Tue, Apr 28, 2020 at 07:14:52AM +0100, Jianyong Wu wrote:
> On 2020/4/24 6:39 PM, Mark Rutland wrote:
> > On Fri, Apr 24, 2020 at 03:50:22AM +0100, Jianyong Wu wrote:
> >> On 2020/4/21 5:57 PM, Mark Rutland wrote:
> >>> On Tue, Apr 21, 2020 at 11:23:00AM +0800, Jianyong Wu wrote:
> >>>> diff --git a/virt/kvm/arm/hypercalls.c b/virt/kvm/arm/hypercalls.c
> >>>> index 550dfa3e53cd..a5309c28d4dc 100644
> >>>> --- a/virt/kvm/arm/hypercalls.c
> >>>> +++ b/virt/kvm/arm/hypercalls.c
> >>>> @@ -62,6 +66,44 @@ int kvm_hvc_call_handler(struct kvm_vcpu *vcpu)
> >>>>if (gpa != GPA_INVALID)
> >>>>val = gpa;
> >>>>break;
> >>>> +/*
> >>>> + * This serves virtual kvm_ptp.
> >>>> + * Four values will be passed back.
> >>>> + * reg0 stores high 32-bit host ktime;
> >>>> + * reg1 stores low 32-bit host ktime;
> >>>> + * reg2 stores high 32-bit difference of host cycles and cntvoff;
> >>>> + * reg3 stores low 32-bit difference of host cycles and cntvoff.
> >>>> + */
> >>>> +case ARM_SMCCC_HYP_KVM_PTP_FUNC_ID:
> >>> Shouldn't the host opt in to providing this to the guest, as with other
> >>> features?
> >> er, do you mean "ARM_SMCCC_HV_PV_TIME_XXX" as the "opt-in"? If so, I
> >> think this kvm_ptp doesn't need such a buddy; the driver in the guest
> >> will call this service in a definite way.
> > I mean that when creating the VM, userspace should be able to choose
> > whether the PTP service is provided to the guest. The host shouldn't
> > always provide it as there may be cases where doing so is undesirable.
> >
> I think I have implemented, in patch 9/9, a way for userspace to query
> whether the host offers the kvm_ptp service. But for now, the host
> kernel will always offer the kvm_ptp capability in the current
> implementation. I think x86 follows the same behavior (see [1]), so I
> have not considered when and how to disable this kvm_ptp service on the
> host. Do you think we should offer this opt-in?

I think that should be opt-in, yes.
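
To illustrate the sort of opt-in I mean, a minimal userspace sketch is
below; the capability name is hypothetical (nothing with that name exists
today), but KVM_CHECK_EXTENSION and KVM_ENABLE_CAP are the usual mechanism
for this kind of per-VM opt-in:

#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical capability number, purely for illustration. */
#ifndef KVM_CAP_ARM_PTP_SERVICE
#define KVM_CAP_ARM_PTP_SERVICE 9999
#endif

/* vm_fd is the fd returned by KVM_CREATE_VM. */
static int enable_ptp_service(int vm_fd)
{
        struct kvm_enable_cap cap = { .cap = KVM_CAP_ARM_PTP_SERVICE };

        /* Only enable the service if this host actually offers it. */
        if (ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_ARM_PTP_SERVICE) <= 0)
                return -1;

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}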

[...]

> > It's also not clear to me what notion of host time is being exposed to
> > the guest (and consequently how this would interact with time changes on
> > the host, time namespaces, etc). Having some description of that would
> > be very helpful.
> 
> sorry to have not made it clear.
> 
> Time will not change on the host; only the time in the guest will change
> to sync with the host. The host time is the target that the guest time
> needs to adjust to. The guest gets the host time, computes the difference
> between the guest and host times, and then adjusts its own time based on
> that difference.

I understood that host time wouldn't change here, but what was not clear
is which notion of host time is being exposed to the guest.

e.g. is that a raw monotonic clock, or one subject to periodic adjustment,
or wall time in the host? What is the epoch of the host time?
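
To make the question concrete: a guest-side consumer would presumably end
up doing something like the sketch below, where the register layout is
taken from the comment quoted above and ARM_SMCCC_HYP_KVM_PTP_FUNC_ID is
the ID this patch adds; the function and variable names are mine, and the
two marked questions are exactly what needs pinning down:

#include <linux/types.h>
#include <linux/arm-smccc.h>

/* Illustrative guest-side use of the hypercall; not code from the patch. */
static void ptp_kvm_sample(void)
{
        struct arm_smccc_res res;
        u64 host_ns, cycles;

        arm_smccc_1_1_invoke(ARM_SMCCC_HYP_KVM_PTP_FUNC_ID, &res);

        host_ns = (res.a0 << 32) | res.a1;  /* which host clock? what epoch? */
        cycles  = (res.a2 << 32) | res.a3;  /* host cycles minus CNTVOFF */
}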

> I will add the basic principle of the time sync service in the guest
> using kvm_ptp to the commit message.

That would be great; thanks!

Mark.