Re: [PATCH] [net-next] net/mlx5: include wq.o in non-ethernet build for FPGA

2017-07-03 Thread Arnd Bergmann
On Sun, Jul 2, 2017 at 10:45 AM, Saeed Mahameed
 wrote:
> On Fri, Jun 30, 2017 at 10:25 PM, Arnd Bergmann  wrote:
>> On Fri, Jun 30, 2017 at 8:58 PM, Ilan Tayari  wrote:
>>
 diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
 b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
 index ca367445f864..50fe9e3c5dc2 100644
 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
 +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
 @@ -9,7 +9,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
  mlx5_core-$(CONFIG_MLX5_ACCEL) += accel/ipsec.o

  mlx5_core-$(CONFIG_MLX5_FPGA) += fpga/cmd.o fpga/core.o fpga/conn.o fpga/sdk.o \
 - fpga/ipsec.o
 + fpga/ipsec.o wq.o
>>>
>>> I believe we would prefer to move wq.o to mlx5_core-y.
>>> Otherwise you might build it twice.
>>
>> That's not a problem, Kbuild is smart enough to drop duplicate object files
>> that get built into the same module.
>>
>> If you think it's less confusing to readers of this file if it gets
>> put into core,
>> that's fine though, the only downside would be adding a little bit of
>> code bloat for users that want neither the ethernet nor the fpga code
>> (if that is a realistic use case).
>
> Hi Arnd,
>
> Thanks for the patch, your solution is good enough, but let's avoid
> confusing developers with such duplications, since the Makefile might
> get messy if we keep using this method.
>
> I suggest moving wq.o to core or making MLX5_FPGA depend on MLX5_CORE_EN.
> I will discuss this with Ilan and we will provide the fix ASAP.

Ok, sounds good. Thanks!

Arnd


[PATCH] mwifiex: uninit wakeup info when failed to add card

2017-07-03 Thread Jeffy Chen
We init the wakeup info at the beginning of mwifiex_add_card(), so we need
to uninit it in the error handling path.

It's much the same as what we did in:
36908c4 mwifiex: uninit wakeup info when removing device

Signed-off-by: Jeffy Chen 

---

 drivers/net/wireless/marvell/mwifiex/main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/wireless/marvell/mwifiex/main.c 
b/drivers/net/wireless/marvell/mwifiex/main.c
index f2600b8..17d2cbe 100644
--- a/drivers/net/wireless/marvell/mwifiex/main.c
+++ b/drivers/net/wireless/marvell/mwifiex/main.c
@@ -1655,6 +1655,8 @@ mwifiex_add_card(void *card, struct completion *fw_done,
mwifiex_shutdown_drv(adapter);
}
 err_kmalloc:
+   if (adapter->irq_wakeup >= 0)
+   device_init_wakeup(adapter->dev, false);
mwifiex_free_adapter(adapter);
 
 err_init_sw:
-- 
2.1.4




Re: linux-next: manual merge of the net-next tree with the arm64 tree

2017-07-03 Thread Daniel Borkmann

On 07/03/2017 03:37 AM, Stephen Rothwell wrote:

Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

   arch/arm64/net/bpf_jit_comp.c

between commit:

   425e1ed73e65 ("arm64: fix endianness annotation for 'struct jit_ctx' and 
friends")

from the arm64 tree and commit:

   f1c9eed7f437 ("bpf, arm64: take advantage of stack_depth tracking")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.


Looks good to me, thanks!


Re: linux-next: manual merge of the net-next tree with the net tree

2017-07-03 Thread Saeed Mahameed
On Mon, Jul 3, 2017 at 4:43 AM, Stephen Rothwell  wrote:
> Hi all,
>
> Today's linux-next merge of the net-next tree got conflicts in:
>
>   drivers/net/ethernet/mellanox/mlx5/core/health.c
>   include/linux/mlx5/driver.h
>
> between commit:
>
>   2a0165a034ac ("net/mlx5: Cancel delayed recovery work when unloading the 
> driver")
>
> from the net tree and commit:
>
>   0179720d6be2 ("Introduce new function for entering bad-health state.")
>
> from the net-next tree.
>
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
>
> --
> Cheers,
> Stephen Rothwell
>
> diff --cc drivers/net/ethernet/mellanox/mlx5/core/health.c
> index 8a8b5f0e497c,0648a659b21d..
> --- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
> @@@ -193,8 -193,8 +194,8 @@@ static void health_care(struct work_str
> mlx5_core_warn(dev, "handling bad device here\n");
> mlx5_handle_bad_state(dev);
>
> -   spin_lock(&health->wq_lock);
> +   spin_lock_irqsave(&health->wq_lock, flags);
>  -  if (!test_bit(MLX5_DROP_NEW_HEALTH_WORK, &health->flags))
>  +  if (!test_bit(MLX5_DROP_NEW_RECOVERY_WORK, &health->flags))
> schedule_delayed_work(&health->recover_work, recover_delay);
> else
> dev_err(&dev->pdev->dev,
> @@@ -334,11 -341,11 +343,12 @@@ void mlx5_stop_health_poll(struct mlx5_
>   void mlx5_drain_health_wq(struct mlx5_core_dev *dev)
>   {
> struct mlx5_core_health *health = &dev->priv.health;
> +   unsigned long flags;
>
> -   spin_lock(&health->wq_lock);
> +   spin_lock_irqsave(&health->wq_lock, flags);
> set_bit(MLX5_DROP_NEW_HEALTH_WORK, &health->flags);
>  +  set_bit(MLX5_DROP_NEW_RECOVERY_WORK, &health->flags);
> -   spin_unlock(&health->wq_lock);
> +   spin_unlock_irqrestore(&health->wq_lock, flags);
> cancel_delayed_work_sync(&health->recover_work);
> cancel_work_sync(&health->work);
>   }
> diff --cc include/linux/mlx5/driver.h
> index ba260330ce5e,2ab4ae3e3a1a..
> --- a/include/linux/mlx5/driver.h
> +++ b/include/linux/mlx5/driver.h
> @@@ -925,7 -945,7 +945,8 @@@ int mlx5_health_init(struct mlx5_core_d
>   void mlx5_start_health_poll(struct mlx5_core_dev *dev);
>   void mlx5_stop_health_poll(struct mlx5_core_dev *dev);
>   void mlx5_drain_health_wq(struct mlx5_core_dev *dev);
>  +void mlx5_drain_health_recovery(struct mlx5_core_dev *dev);
> + void mlx5_trigger_health_work(struct mlx5_core_dev *dev);
>   int mlx5_buf_alloc_node(struct mlx5_core_dev *dev, int size,
> struct mlx5_buf *buf, int node);
>   int mlx5_buf_alloc(struct mlx5_core_dev *dev, int size, struct mlx5_buf 
> *buf);

Hi Stephen,

The fix up looks good, I already notified Dave about this on net
submission and he approved.

Thanks,
Saeed.


Re: [PATCH net-next] vxlan: correctly set vxlan->net when creating the device in a netns

2017-07-03 Thread Matthias Schiffer
On 06/30/2017 03:50 PM, Sabrina Dubroca wrote:
> Commit a985343ba906 ("vxlan: refactor verification and application of
> configuration") modified vxlan device creation, and replaced the
> assignment of vxlan->net to src_net with dev_net(netdev) in ->setup().
> 
> But dev_net(netdev) is not the same as src_net. At the time ->setup()
> is called, dev_net hasn't been set yet, so we end up creating the
> socket for the vxlan device in init_net.
> 
> Fix this by bringing back the assignment of vxlan->net during device
> creation.
> 
> Fixes: a985343ba906 ("vxlan: refactor verification and application of 
> configuration")
> Signed-off-by: Sabrina Dubroca 

Thanks for fixing this up, I really didn't expect dev_net() not to return
the correct net in setup().

Reviewed-by: Matthias Schiffer 


> ---
>  drivers/net/vxlan.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index fd0ff97e3d81..47d6e65851aa 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -2656,7 +2656,6 @@ static void vxlan_setup(struct net_device *dev)
>   vxlan->age_timer.data = (unsigned long) vxlan;
>  
>   vxlan->dev = dev;
> - vxlan->net = dev_net(dev);
>  
>   gro_cells_init(&vxlan->gro_cells, dev);
>  
> @@ -3028,7 +3027,9 @@ static int vxlan_config_validate(struct net *src_net, 
> struct vxlan_config *conf,
>  
>  static void vxlan_config_apply(struct net_device *dev,
>  struct vxlan_config *conf,
> -struct net_device *lowerdev, bool changelink)
> +struct net_device *lowerdev,
> +struct net *src_net,
> +bool changelink)
>  {
>   struct vxlan_dev *vxlan = netdev_priv(dev);
>   struct vxlan_rdst *dst = &vxlan->default_dst;
> @@ -3044,6 +3045,8 @@ static void vxlan_config_apply(struct net_device *dev,
>  
>   if (conf->mtu)
>   dev->mtu = conf->mtu;
> +
> + vxlan->net = src_net;
>   }
>  
>   dst->remote_vni = conf->vni;
> @@ -3086,7 +3089,7 @@ static int vxlan_dev_configure(struct net *src_net, 
> struct net_device *dev,
>   if (ret)
>   return ret;
>  
> - vxlan_config_apply(dev, conf, lowerdev, changelink);
> + vxlan_config_apply(dev, conf, lowerdev, src_net, changelink);
>  
>   return 0;
>  }
> 






Re: linux-next: build failure after merge of the akpm tree

2017-07-03 Thread Stephen Rothwell
Hi all,

On Fri, 30 Jun 2017 16:32:41 +1000 Stephen Rothwell  
wrote:
>
> After merging the akpm tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> In file included from include/linux/bitmap.h:8:0,
>  from include/linux/cpumask.h:11,
>  from include/linux/mm_types_task.h:13,
>  from include/linux/mm_types.h:4,
>  from include/linux/kmemcheck.h:4,
>  from include/linux/skbuff.h:18,
>  from include/linux/if_ether.h:23,
>  from include/linux/etherdevice.h:25,
>  from drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c:33:
> In function 'memcpy',
> inlined from 'mlx5_fpga_query_qp' at 
> drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c:194:2:
> include/linux/string.h:315:4: error: call to '__read_overflow2' declared with 
> attribute error: detected read beyond size of object passed as 2nd parameter
> __read_overflow2();
> ^
> 
> Caused by commit
> 
>   c151149cc4db ("include/linux/string.h: add the option of fortified string.h 
> functions")
> 
> interacting with commit
> 
>   6062118d5cd2 ("net/mlx5: FPGA, Add FW commands for FPGA QPs")
> 
> from the net-next tree.
> 
> I took a guess and tried the following patch which seemed to work.
> 
> From: Stephen Rothwell 
> Date: Fri, 30 Jun 2017 16:24:35 +1000
> Subject: [PATCH] net/mlx5: fix memcpy limit?
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c
> index 5cb855fd618f..e37453d838db 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c
> @@ -191,7 +191,7 @@ int mlx5_fpga_query_qp(struct mlx5_core_dev *dev,
>   if (ret)
>   return ret;
>  
> - memcpy(fpga_qpc, MLX5_ADDR_OF(fpga_query_qp_out, in, fpga_qpc),
> + memcpy(fpga_qpc, MLX5_ADDR_OF(fpga_query_qp_out, out, fpga_qpc),
>  MLX5_FLD_SZ_BYTES(fpga_query_qp_out, fpga_qpc));
>   return ret;
>  }
> -- 
> 2.11.0

Again today ... so is the fix correct?  If so, Dave should apply it; if
not, someone should supply a correct fix for Dave.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH net 1/2] vxlan: fix hlist corruption

2017-07-03 Thread Jiri Benc
On Sun, 2 Jul 2017 16:06:10 -0400, Waiman Long wrote:
> I didn't see any init code for hlist4 and hlist6. Is vxlan_dev going to
> be *zalloc'ed so that they are guaranteed to be NULL? If not, you may
> need to add init code, as not both hlists will be hashed and one of
> them may contain invalid data.

Yes, it's zalloced via alloc_netdev. No need to init the fields
explicitly.

 Jiri


Re: [PATCH net-next 00/12] qed: Add iWARP support for QL4xxxx

2017-07-03 Thread David Miller
From: Michal Kalderon 
Date: Sun, 2 Jul 2017 10:29:20 +0300

> This patch series adds iWARP support to our QL4 networking adapters.
> The code changes span across qed and qedr drivers, but this series contains
> changes to qed only. Once the series is accepted, the qedr series will
> be submitted to the rdma tree.
> There is one additional qed patch which enables iWARP; this patch is
> delayed until the qedr series is accepted.
> 
> The patches were previously sent as an RFC, and these are the first 12
> patches in the RFC series:
> https://www.spinics.net/lists/linux-rdma/msg51416.html
> 
> This series was tested and built against net-next.
> 
> MAINTAINERS file is not updated in this PATCH as there is a pending patch
> for qedr driver update https://patchwork.kernel.org/patch/9752761.

Series applied, thanks.


Re: [PATCH V2 1/1] net: cdc_ncm: Reduce memory use when kernel memory low

2017-07-03 Thread David Miller
From: Jim Baxter 
Date: Wed, 28 Jun 2017 21:35:29 +0100

> The CDC-NCM driver can require large amounts of memory to create
> skbs, and this can be a problem when the memory becomes fragmented.
> 
> This especially affects embedded systems that have constrained
> resources but wish to maximise the throughput of CDC-NCM with 16 KiB
> NTBs.
> 
> The issue is that, after running for a while, the kernel memory can become
> fragmented and needs compacting.
> If the NTB allocation is needed before the memory has been compacted
> the atomic allocation can fail which can cause increased latency,
> large re-transmissions or disconnections depending upon the data
> being transmitted at the time.
> This situation occurs for less than a second until the kernel has
> compacted the memory but the failed devices can take a lot longer to
> recover from the failed TX packets.
> 
> To ease this temporary situation I modified the CDC-NCM TX path to
> temporarily switch into a reduced memory mode which allocates an NTB
> that will fit into a USB_CDC_NCM_NTB_MIN_OUT_SIZE (default 2048 bytes)
> sized memory block and only transmit NTBs with a single network frame
> until the memory situation is resolved.
> Each time this issue occurs we wait for an increasing number of
> reduced size allocations before requesting a full size one to not
> put additional pressure on a low memory system.
> 
> Once the memory is compacted the CDC-NCM data can resume transmitting
> at the normal tx_max rate once again.
> 
> Signed-off-by: Jim Baxter 

Patch applied, thanks.
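
As a minimal, self-contained sketch of the fallback pattern the commit
message describes (the struct and helper names below are illustrative,
not the actual cdc_ncm.c symbols):

#include <linux/kernel.h>
#include <linux/skbuff.h>

struct ntb_tx_state {
        u32 tx_max;             /* normal (large) NTB size, e.g. 16 KiB */
        u32 tx_min;             /* minimal NTB size, e.g. 2048 bytes */
        u32 low_mem_skip;       /* full-size attempts to skip after a failure */
        u32 low_mem_left;       /* countdown before retrying a full-size NTB */
};

static struct sk_buff *ntb_alloc(struct ntb_tx_state *st)
{
        struct sk_buff *skb;

        if (!st->low_mem_left) {
                skb = alloc_skb(st->tx_max, GFP_ATOMIC);
                if (skb)
                        return skb;
                /* Full-size allocation failed: wait for a growing number of
                 * reduced-size allocations before trying full size again. */
                st->low_mem_skip = min_t(u32, st->low_mem_skip + 1, 8);
                st->low_mem_left = st->low_mem_skip;
        } else {
                st->low_mem_left--;
        }

        /* Reduced memory mode: a minimal NTB carrying a single frame. */
        return alloc_skb(st->tx_min, GFP_ATOMIC);
}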


Re: [PATCH net] ipv6: dad: don't remove dynamic addresses if link is down

2017-07-03 Thread David Miller
From: Sabrina Dubroca 
Date: Thu, 29 Jun 2017 16:56:54 +0200

> Currently, when the link for $DEV is down, this command succeeds but the
> address is removed immediately by DAD (1):
> 
> ip addr add ::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
> 
> In the same situation, this will succeed and not remove the address (2):
> 
> ip addr add ::12/64 dev $DEV
> ip addr change ::12/64 dev $DEV valid_lft 3600 preferred_lft 1800
> 
> The comment in addrconf_dad_begin() when !IF_READY makes it look like
> this is the intended behavior, but doesn't explain why:
> 
>  * If the device is not ready:
>  * - keep it tentative if it is a permanent address.
>  * - otherwise, kill it.
> 
> We clearly cannot prevent userspace from doing (2), but we can make (1)
> work consistently with (2).
> 
> addrconf_dad_stop() is only called in two cases: if DAD failed, or to
> skip DAD when the link is down. In that second case, the fix is to avoid
> deleting the address, like we already do for permanent addresses.
> 
> Fixes: 3c21edbd1137 ("[IPV6]: Defer IPv6 device initialization until the link 
> becomes ready.")
> Signed-off-by: Sabrina Dubroca 

Applied and queued up for -stable, thanks.


Re: linux-next: build failure after merge of the akpm tree

2017-07-03 Thread David Miller
From: Stephen Rothwell 
Date: Fri, 30 Jun 2017 16:32:41 +1000

> From: Stephen Rothwell 
> Date: Fri, 30 Jun 2017 16:24:35 +1000
> Subject: [PATCH] net/mlx5: fix memcpy limit?
> 
> Signed-off-by: Stephen Rothwell 

Applied, thanks.


Re: [PATCH net-next 00/12] qed: Add iWARP support for QL4xxxx

2017-07-03 Thread David Miller

You really have to compile test your work and do something with
the warnings:

drivers/net/ethernet/qlogic/qed/qed_iwarp.c:1721:5: warning: ‘ll2_syn_handle’ may be used uninitialized in this function

This one is completely legitimate, you can goto "err" and use
the ll2_syn_handle without it being initialized.
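
For readers following along, a distilled, self-contained illustration of
this class of warning; the function and helpers below are hypothetical,
not the actual qed_iwarp.c code:

#include <linux/types.h>

struct device;
int example_early_init(struct device *dev);             /* hypothetical */
int example_acquire(struct device *dev, u8 *handle);    /* hypothetical */
void example_release(struct device *dev, u8 handle);    /* hypothetical */

static int example_setup(struct device *dev)
{
        u8 handle;      /* like ll2_syn_handle: only assigned further down */
        int rc;

        rc = example_early_init(dev);
        if (rc)
                goto err;       /* reaches err before 'handle' is ever set */

        rc = example_acquire(dev, &handle);
        if (rc)
                goto err;

        return 0;

err:
        example_release(dev, handle);   /* may read an uninitialized value */
        return rc;
}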


Re: [PATCH net-next v8 0/2] Add loopback support in phy_driver and hns ethtool fix

2017-07-03 Thread David Miller
From: Lin Yun Sheng 
Date: Fri, 30 Jun 2017 17:44:14 +0800

> This patch set adds set_loopback in phy_driver and uses it to set up loopback
> when doing the ethtool phy self_test.

Series applied, thank you.


Re: [PATCH net-next] vxlan: correctly set vxlan->net when creating the device in a netns

2017-07-03 Thread David Miller
From: Sabrina Dubroca 
Date: Fri, 30 Jun 2017 15:50:00 +0200

> Commit a985343ba906 ("vxlan: refactor verification and application of
> configuration") modified vxlan device creation, and replaced the
> assignment of vxlan->net to src_net with dev_net(netdev) in ->setup().
> 
> But dev_net(netdev) is not the same as src_net. At the time ->setup()
> is called, dev_net hasn't been set yet, so we end up creating the
> socket for the vxlan device in init_net.
> 
> Fix this by bringing back the assignment of vxlan->net during device
> creation.
> 
> Fixes: a985343ba906 ("vxlan: refactor verification and application of 
> configuration")
> Signed-off-by: Sabrina Dubroca 

Applied, thanks.


Re: [PATCH v2] Documentation: fix wrong example command

2017-07-03 Thread David Miller
From: Matteo Croce 
Date: Fri, 30 Jun 2017 18:21:47 +0200

> In the IPVLAN documentation there is an example command line where the
> master and slave interface names are inverted.
> Fix the command line and also add the optional `name' keyword to better
> describe what the command is doing.
> 
> v2: added commit message
> 
> Signed-off-by: Matteo Croce 

Applied, thank you.


Re: [PATCH net-next] net/packet: Fix Tx queue selection for AF_PACKET

2017-07-03 Thread David Miller
From: Iván Briano 
Date: Fri, 30 Jun 2017 14:02:32 -0700

> When PACKET_QDISC_BYPASS is not used, Tx queue selection will be done
> before the packet is enqueued, taking into account any mappings set by
> a queuing discipline such as mqprio without hardware offloading. This
> selection may be affected by a previously saved queue_mapping, either on
> the Rx path, or done before the packet reaches the device, as is
> currently the case for AF_PACKET.
> 
> In order for queue selection to work as expected when using traffic
> control, there can't be another selection done before that point is
> reached, so move the call to packet_pick_tx_queue to
> packet_direct_xmit, leaving the default xmit path as it was before
> PACKET_QDISC_BYPASS was introduced.
> 
> A forward declaration of packet_pick_tx_queue() is introduced to avoid
> the need to reorder the functions within the file.
> 
> Signed-off-by: Iván Briano 

Please resubmit this with a proper "Fixes: " tag which shows what
commit introduced this problem.

Thanks.
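
A simplified sketch of the restructuring described in the patch (not the
actual net/packet/af_packet.c hunks; local_xmit() is a hypothetical
stand-in for the driver hand-off):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

static u16 packet_pick_tx_queue_sketch(struct sk_buff *skb);    /* forward declaration */

int local_xmit(struct sk_buff *skb);                            /* hypothetical */

static int packet_direct_xmit_sketch(struct sk_buff *skb)
{
        /* PACKET_QDISC_BYPASS path: pick the queue right before handing
         * the skb to the driver, since no qdisc will do it for us. */
        skb_set_queue_mapping(skb, packet_pick_tx_queue_sketch(skb));
        return local_xmit(skb);
}

int packet_xmit_sketch(struct sk_buff *skb, bool qdisc_bypass)
{
        if (qdisc_bypass)
                return packet_direct_xmit_sketch(skb);

        /* Default path: leave queue_mapping alone so a qdisc such as
         * mqprio can select the queue when the packet is enqueued. */
        return dev_queue_xmit(skb);
}

static u16 packet_pick_tx_queue_sketch(struct sk_buff *skb)
{
        /* Placeholder body; the real helper consults the device's
         * queue-selection logic. */
        return 0;
}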


Re: [PATCH] net: cdc_mbim: apply "NDP to end" quirk to HP lt4132

2017-07-03 Thread David Miller
From: Tore Anderson 
Date: Sat,  1 Jul 2017 15:20:02 +0200

> The HP lt4132 LTE/HSPA+ 4G Module (03f0:a31d) is a rebranded Huawei
> ME906s-158 device. It, like the ME906s-158, requires the "NDP to end"
> quirk for correct operation.
> 
> Signed-off-by: Tore Anderson 

Applied, thank you.


Re: [PATCH net-next 0/7] Misc BPF helper/verifier improvements

2017-07-03 Thread David Miller
From: Daniel Borkmann 
Date: Sun,  2 Jul 2017 02:13:24 +0200

> Miscellaneous improvements I still had in my queue: it adds a new
> bpf_skb_adjust_room() helper for cls_bpf, exports to fdinfo whether the
> tail call array owner is JITed so that iproute2 error reporting can be
> improved in that regard, a small cleanup and extension to trace
> printk, and two verifier patches, one to make the code around narrower
> ctx access a bit more straightforward and one to allow for imm += x
> operations, which we've seen LLVM generating and the verifier currently
> rejecting. We've included patch 6 given it's rather small and
> we ran into it from the LLVM side; it would be great if it could be
> queued for stable as well after the merge window. Last but not least,
> test cases are also added for the imm alu improvement.

Series applied, thanks Daniel.


Re: [PATCH 00/17] v3 net generic subsystem refcount conversions

2017-07-03 Thread Eric Dumazet
On Fri, 2017-06-30 at 13:07 +0300, Elena Reshetova wrote:
> Changes in v3:
> Rebased on top of the net-next tree.
> 
> Changes in v2:
> No changes in patches apart from rebases, but now by
> default refcount_t = atomic_t (*) and uses all atomic standard operations
> unless CONFIG_REFCOUNT_FULL is enabled. This is a compromise for the
> systems that are critical on performance (such as net) and cannot accept even
> slight delay on the refcounter operations.
> 
> This series, for core network subsystem components, replaces atomic_t 
> reference
> counters with the new refcount_t type and API (see include/linux/refcount.h).
> By doing this we prevent intentional or accidental
> underflows or overflows that can lead to use-after-free vulnerabilities.
> These patches contain only generic net pieces. Other changes will be sent 
> separately.
> 
> The patches are fully independent and can be cherry-picked separately.
> The big patches, such as conversions for sock structure, need a very detailed
> look from maintainers: refcount managing is quite complex in them and while
> it seems that they would benefit from the change, extra checking is needed.
> The biggest corner issue is the fact that refcount_inc() does not increment
> from zero.
> 
> If there are no objections to the patches, please merge them via respective 
> trees.
> 
> * The respective change is currently merged into -next as
>   "locking/refcount: Create unchecked atomic_t implementation".
> 
> Elena Reshetova (17):
>   net: convert inet_peer.refcnt from atomic_t to refcount_t
>   net: convert neighbour.refcnt from atomic_t to refcount_t
>   net: convert neigh_params.refcnt from atomic_t to refcount_t
>   net: convert nf_bridge_info.use from atomic_t to refcount_t
>   net: convert sk_buff.users from atomic_t to refcount_t
>   net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t
>   net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
>   net: convert sock.sk_refcnt from atomic_t to refcount_t
>   net: convert ip_mc_list.refcnt from atomic_t to refcount_t
>   net: convert in_device.refcnt from atomic_t to refcount_t
>   net: convert netpoll_info.refcnt from atomic_t to refcount_t
>   net: convert unix_address.refcnt from atomic_t to refcount_t
>   net: convert fib_rule.refcnt from atomic_t to refcount_t
>   net: convert inet_frag_queue.refcnt from atomic_t to refcount_t
>   net: convert net.passive from atomic_t to refcount_t
>   net: convert netlbl_lsm_cache.refcount from atomic_t to refcount_t
>   net: convert packet_fanout.sk_ref from atomic_t to refcount_t


Can you take a look at this please ?

Thanks.

[   64.601749] [ cut here ]
[   64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 
refcount_sub_and_test+0x75/0xa0
[   64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd 
mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
[   64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: GW   
4.12.0-smp-DEV #274
[   64.601770] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.40.0 06/22/2016
[   64.601771] task: 8837bf482040 task.stack: 8837bdc08000
[   64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
[   64.601774] RSP: 0018:8837bdc0f5c0 EFLAGS: 00010286
[   64.601776] RAX: 0026 RBX: 0001 RCX: 
[   64.601777] RDX: 0026 RSI: 0096 RDI: ed06f7b81eae
[   64.601778] RBP: 8837bdc0f5d0 R08: 0004 R09: fbfff4a54c25
[   64.601779] R10: cbc500e5 R11: a52a6128 R12: 881febcf6f24
[   64.601779] R13: 881fbf4eaf00 R14: 881febcf6f80 R15: 8837d7a4ed00
[   64.601781] FS:  7ff5a2f6b700() GS:881fff80() 
knlGS:
[   64.601782] CS:  0010 DS:  ES:  CR0: 80050033
[   64.601783] CR2: 7ffcdc70d000 CR3: 001f9c91e000 CR4: 001406f0
[   64.601783] Call Trace:
[   64.601786]  refcount_dec_and_test+0x11/0x20
[   64.601790]  fib_nl_delrule+0xc39/0x1630
[   64.601793]  ? is_bpf_text_address+0xe/0x20
[   64.601795]  ? fib_nl_newrule+0x25e0/0x25e0
[   64.601798]  ? depot_save_stack+0x133/0x470
[   64.601801]  ? ns_capable+0x13/0x20
[   64.601803]  ? __netlink_ns_capable+0xcc/0x100
[   64.601806]  rtnetlink_rcv_msg+0x23a/0x6a0
[   64.601808]  ? rtnl_newlink+0x1630/0x1630
[   64.601811]  ? memset+0x31/0x40
[   64.601813]  netlink_rcv_skb+0x2d7/0x440
[   64.601815]  ? rtnl_newlink+0x1630/0x1630
[   64.601816]  ? netlink_ack+0xaf0/0xaf0
[   64.601818]  ? kasan_unpoison_shadow+0x35/0x50
[   64.601820]  ? __kmalloc_node_track_caller+0x4c/0x70
[   64.601821]  rtnetlink_rcv+0x28/0x30
[   64.601823]  netlink_unicast+0x422/0x610
[   64.601824]  ? netlink_attachskb+0x650/0x650
[   64.601826]  netlink_sendmsg+0x7b7/0xb60
[   64.601828]  ? netlink_unicast+0x610/0x610
[   64.601830]  ? netlink_unicast+0x610/0x610
[   64.601832]  sock_sendmsg+0xba/0xf0
[   64.601834]  ___sys_sendmsg+0x6a9/0x8c0
[   64.601835]  ? copy_msghdr_from_user+0x52

Re: [PATCH net-next] net/mlxfw: Properly handle dependency with non-loadable mlx5

2017-07-03 Thread David Miller
From: Or Gerlitz 
Date: Sun,  2 Jul 2017 18:57:28 +0300

> If mlx5 is set to be built-in and mlxfw as a module, we
> get a link error:
> 
> drivers/built-in.o: In function `mlx5_firmware_flash':
> (.text+0x5aed72): undefined reference to `mlxfw_firmware_flash'
> 
> Since we don't want to mandate selecting mlxfw for mlx5 users, we
> use the IS_REACHABLE macro to make sure that a stub is exposed
> to the caller.
> 
> Signed-off-by: Or Gerlitz 
> Reported-by: Jakub Kicinski 
> Reported-by: Arnd Bergmann 

Applied, thank you.
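
For context, this is roughly what the IS_REACHABLE() stub pattern looks
like (a simplified prototype, not the exact mlxfw.h contents):

#include <linux/kconfig.h>
#include <linux/firmware.h>
#include <linux/errno.h>

struct mlxfw_dev;

#if IS_REACHABLE(CONFIG_MLXFW)
int mlxfw_firmware_flash(struct mlxfw_dev *mlxfw_dev,
                         const struct firmware *firmware);
#else
static inline int mlxfw_firmware_flash(struct mlxfw_dev *mlxfw_dev,
                                       const struct firmware *firmware)
{
        return -EOPNOTSUPP;
}
#endif

The stub is used both when CONFIG_MLXFW is disabled and when mlxfw is a
module while the caller is built-in, which is exactly the failing
combination quoted above.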


Re: [PATCH net 0/2] vxlan, geneve: fix hlist corruption

2017-07-03 Thread David Miller
From: Jiri Benc 
Date: Sun,  2 Jul 2017 19:00:56 +0200

> Fix memory corruption introduced with the support of both IPv4 and IPv6
> sockets in a single device. The same bug is present in VXLAN and Geneve.
> 
> Signed-off-by: Jiri Benc 

Series applied and queued up for -stable, thanks.


Re: [PATCH] netxen_nic: Remove unused pointer hdr in netxen_setup_minidump()

2017-07-03 Thread David Miller
From: Christos Gkekas 
Date: Sun,  2 Jul 2017 23:16:11 +0100

> Pointer hdr in netxen_setup_minidump() is set but never used, thus
> should be removed.
> 
> Signed-off-by: Christos Gkekas 

Applied, thanks.


Re: [PATCH] net: core: Fix slab-out-of-bounds in netdev_stats_to_stats64

2017-07-03 Thread David Miller
From: Alban Browaeys 
Date: Mon,  3 Jul 2017 03:20:13 +0200

> commit 9256645af098 ("net/core: relax BUILD_BUG_ON in
> netdev_stats_to_stats64") made reading beyond the size of the source
> a possibility.
> 
> Fix it to only copy src's size to dest, as dest might be bigger than src.
 ...
> Signed-off-by: Alban Browaeys 

Applied and queued up for -stable, thanks.


Re: [PATCH 1/1] mlx4_en: make mlx4_log_num_mgm_entry_size static

2017-07-03 Thread David Miller
From: Zhu Yanjun 
Date: Mon,  3 Jul 2017 01:35:19 -0400

> The variable mlx4_log_num_mgm_entry_size is only called in main.c.
> 
> CC: Joe Jin 
> CC: Junxiao Bi 
> Signed-off-by: Zhu Yanjun 

Applied, thank you.


[PATCH net-next] net: avoid one splat in fib_nl_delrule()

2017-07-03 Thread Eric Dumazet
From: Eric Dumazet 

We need to use refcount_set() on a newly created rule to avoid the
following error:

[   64.601749] [ cut here ]
[   64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184 
refcount_sub_and_test+0x75/0xa0
[   64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd 
mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
[   64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: GW   
4.12.0-smp-DEV #274
[   64.601771] task: 8837bf482040 task.stack: 8837bdc08000
[   64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
[   64.601774] RSP: 0018:8837bdc0f5c0 EFLAGS: 00010286
[   64.601776] RAX: 0026 RBX: 0001 RCX: 
[   64.601777] RDX: 0026 RSI: 0096 RDI: ed06f7b81eae
[   64.601778] RBP: 8837bdc0f5d0 R08: 0004 R09: fbfff4a54c25
[   64.601779] R10: cbc500e5 R11: a52a6128 R12: 881febcf6f24
[   64.601779] R13: 881fbf4eaf00 R14: 881febcf6f80 R15: 8837d7a4ed00
[   64.601781] FS:  7ff5a2f6b700() GS:881fff80() 
knlGS:
[   64.601782] CS:  0010 DS:  ES:  CR0: 80050033
[   64.601783] CR2: 7ffcdc70d000 CR3: 001f9c91e000 CR4: 001406f0
[   64.601783] Call Trace:
[   64.601786]  refcount_dec_and_test+0x11/0x20
[   64.601790]  fib_nl_delrule+0xc39/0x1630
[   64.601793]  ? is_bpf_text_address+0xe/0x20
[   64.601795]  ? fib_nl_newrule+0x25e0/0x25e0
[   64.601798]  ? depot_save_stack+0x133/0x470
[   64.601801]  ? ns_capable+0x13/0x20
[   64.601803]  ? __netlink_ns_capable+0xcc/0x100
[   64.601806]  rtnetlink_rcv_msg+0x23a/0x6a0
[   64.601808]  ? rtnl_newlink+0x1630/0x1630
[   64.601811]  ? memset+0x31/0x40
[   64.601813]  netlink_rcv_skb+0x2d7/0x440
[   64.601815]  ? rtnl_newlink+0x1630/0x1630
[   64.601816]  ? netlink_ack+0xaf0/0xaf0
[   64.601818]  ? kasan_unpoison_shadow+0x35/0x50
[   64.601820]  ? __kmalloc_node_track_caller+0x4c/0x70
[   64.601821]  rtnetlink_rcv+0x28/0x30
[   64.601823]  netlink_unicast+0x422/0x610
[   64.601824]  ? netlink_attachskb+0x650/0x650
[   64.601826]  netlink_sendmsg+0x7b7/0xb60
[   64.601828]  ? netlink_unicast+0x610/0x610
[   64.601830]  ? netlink_unicast+0x610/0x610
[   64.601832]  sock_sendmsg+0xba/0xf0
[   64.601834]  ___sys_sendmsg+0x6a9/0x8c0
[   64.601835]  ? copy_msghdr_from_user+0x520/0x520
[   64.601837]  ? __alloc_pages_nodemask+0x160/0x520
[   64.601839]  ? memcg_write_event_control+0xd60/0xd60
[   64.601841]  ? __alloc_pages_slowpath+0x1d50/0x1d50
[   64.601843]  ? kasan_slab_free+0x71/0xc0
[   64.601845]  ? mem_cgroup_commit_charge+0xb2/0x11d0
[   64.601847]  ? lru_cache_add_active_or_unevictable+0x7d/0x1a0
[   64.601849]  ? __handle_mm_fault+0x1af8/0x2810
[   64.601851]  ? may_open_dev+0xc0/0xc0
[   64.601852]  ? __pmd_alloc+0x2c0/0x2c0
[   64.601853]  ? __fdget+0x13/0x20
[   64.601855]  __sys_sendmsg+0xc6/0x150
[   64.601856]  ? __sys_sendmsg+0xc6/0x150
[   64.601857]  ? SyS_shutdown+0x170/0x170
[   64.601859]  ? handle_mm_fault+0x28a/0x650
[   64.601861]  SyS_sendmsg+0x12/0x20
[   64.601863]  entry_SYSCALL_64_fastpath+0x13/0x94


Fixes: 717d1e993ad8 ("net: convert fib_rule.refcnt from atomic_t to 
refcount_t") 
Signed-off-by: Eric Dumazet 
---
 net/core/fib_rules.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index c4ecd9f75a47f1c861e11b21f55768053609b649..a0093e1b0235355db66b980580243dd6619c9aa6 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -517,7 +517,7 @@ int fib_nl_newrule(struct sk_buff *skb, struct nlmsghdr 
*nlh,
last = r;
}
 
-   fib_rule_get(rule);
+   refcount_set(&rule->refcnt, 1);
 
if (last)
list_add_rcu(&rule->list, &last->list);




RE: [PATCH 00/17] v3 net generic subsystem refcount conversions

2017-07-03 Thread Reshetova, Elena



> On Fri, 2017-06-30 at 13:07 +0300, Elena Reshetova wrote:
> > Changes in v3:
> > Rebased on top of the net-next tree.
> >
> > Changes in v2:
> > No changes in patches apart from rebases, but now by
> > default refcount_t = atomic_t (*) and uses all atomic standard operations
> > unless CONFIG_REFCOUNT_FULL is enabled. This is a compromise for the
> > systems that are critical on performance (such as net) and cannot accept 
> > even
> > slight delay on the refcounter operations.
> >
> > This series, for core network subsystem components, replaces atomic_t 
> > reference
> > counters with the new refcount_t type and API (see 
> > include/linux/refcount.h).
> > By doing this we prevent intentional or accidental
> > underflows or overflows that can lead to use-after-free vulnerabilities.
> > These patches contain only generic net pieces. Other changes will be sent
> separately.
> >
> > The patches are fully independent and can be cherry-picked separately.
> > The big patches, such as conversions for sock structure, need a very 
> > detailed
> > look from maintainers: refcount managing is quite complex in them and while
> > it seems that they would benefit from the change, extra checking is needed.
> > The biggest corner issue is the fact that refcount_inc() does not increment
> > from zero.
> >
> > If there are no objections to the patches, please merge them via respective 
> > trees.
> >
> > * The respective change is currently merged into -next as
> >   "locking/refcount: Create unchecked atomic_t implementation".
> >
> > Elena Reshetova (17):
> >   net: convert inet_peer.refcnt from atomic_t to refcount_t
> >   net: convert neighbour.refcnt from atomic_t to refcount_t
> >   net: convert neigh_params.refcnt from atomic_t to refcount_t
> >   net: convert nf_bridge_info.use from atomic_t to refcount_t
> >   net: convert sk_buff.users from atomic_t to refcount_t
> >   net: convert sk_buff_fclones.fclone_ref from atomic_t to refcount_t
> >   net: convert sock.sk_wmem_alloc from atomic_t to refcount_t
> >   net: convert sock.sk_refcnt from atomic_t to refcount_t
> >   net: convert ip_mc_list.refcnt from atomic_t to refcount_t
> >   net: convert in_device.refcnt from atomic_t to refcount_t
> >   net: convert netpoll_info.refcnt from atomic_t to refcount_t
> >   net: convert unix_address.refcnt from atomic_t to refcount_t
> >   net: convert fib_rule.refcnt from atomic_t to refcount_t
> >   net: convert inet_frag_queue.refcnt from atomic_t to refcount_t
> >   net: convert net.passive from atomic_t to refcount_t
> >   net: convert netlbl_lsm_cache.refcount from atomic_t to refcount_t
> >   net: convert packet_fanout.sk_ref from atomic_t to refcount_t
> 
> 
> Can you take a look at this please ?
> 
> Thanks.

Thank you very much for the report! This is an underflow (dec/sub from zero)
that is reported by the WARNING.
I guess it is unlikely that the actual code underflows, so the most probable cause
is that it attempted to do refcount_inc/add() from zero, but that failed.
However, in that case you should have seen another warning on refcount_inc()
somewhere earlier. That one is actually the one I need to see to track the root
cause.
Could you tell me how you arrive at the output below? Booting with what
config, etc.?
I can try to reproduce it to debug further.
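
For reference, the inc-from-zero behaviour mentioned above, as a minimal
sketch (illustrative only, not code from the series):

#include <linux/refcount.h>

struct obj {
        refcount_t refcnt;
};

static struct obj *obj_get(struct obj *o)
{
        /* With atomic_t, atomic_inc() silently takes the counter from 0 to 1
         * and hands back an object that is already being freed. refcount_t
         * refuses to increment from zero (and warns), so a lookup path has
         * to use the _not_zero variant and handle the failure explicitly. */
        if (!refcount_inc_not_zero(&o->refcnt))
                return NULL;
        return o;
}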

Best Regards,
Elena

> 
> [   64.601749] [ cut here ]
> [   64.601757] WARNING: CPU: 0 PID: 6476 at lib/refcount.c:184
> refcount_sub_and_test+0x75/0xa0
> [   64.601758] Modules linked in: w1_therm wire cdc_acm ehci_pci ehci_hcd
> mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
> [   64.601769] CPU: 0 PID: 6476 Comm: ip Tainted: GW   
> 4.12.0-smp-DEV #274
> [   64.601770] Hardware name: Intel RML,PCH/Iota_QC_19, BIOS 2.40.0 06/22/2016
> [   64.601771] task: 8837bf482040 task.stack: 8837bdc08000
> [   64.601773] RIP: 0010:refcount_sub_and_test+0x75/0xa0
> [   64.601774] RSP: 0018:8837bdc0f5c0 EFLAGS: 00010286
> [   64.601776] RAX: 0026 RBX: 0001 RCX:
> 
> [   64.601777] RDX: 0026 RSI: 0096 RDI:
> ed06f7b81eae
> [   64.601778] RBP: 8837bdc0f5d0 R08: 0004 R09: 
> fbfff4a54c25
> [   64.601779] R10: cbc500e5 R11: a52a6128 R12: 
> 881febcf6f24
> [   64.601779] R13: 881fbf4eaf00 R14: 881febcf6f80 R15: 
> 8837d7a4ed00
> [   64.601781] FS:  7ff5a2f6b700() GS:881fff80()
> knlGS:
> [   64.601782] CS:  0010 DS:  ES:  CR0: 80050033
> [   64.601783] CR2: 7ffcdc70d000 CR3: 001f9c91e000 CR4:
> 001406f0
> [   64.601783] Call Trace:
> [   64.601786]  refcount_dec_and_test+0x11/0x20
> [   64.601790]  fib_nl_delrule+0xc39/0x1630
> [   64.601793]  ? is_bpf_text_address+0xe/0x20
> [   64.601795]  ? fib_nl_newrule+0x25e0/0x25e0
> [   64.601798]  ? depot_save_stack+0x133/0x470
> [   64.601801]  ? ns_capable+0x13/0x20
> [   64.601803]  ? __netlink_ns_capable+

Re: [PATCH NET V5 2/2] net: hns: Use phy_driver to setup Phy loopback

2017-07-03 Thread Yunsheng Lin
Hi, Andrew

On 2017/7/1 23:17, Andrew Lunn wrote:
> On Sat, Jul 01, 2017 at 11:57:32AM +, linyunsheng wrote:
>> Hi, Andrew
>>
>> I agree with you on this.
>> But self test is also a feature of our product, and our
>> customer may choose to diagnose a problem using
>> self test, even if self test does not give a clear
>> reason for the problem.
>> We don't want to remove a feature that we don't
>> know when our customer will be using.
> 
> Fair enough. So please take a close look at the code and try to fix
> it. The corner cases are your problem, a down'ed interface, WOL, etc.
> It is issues like this which can result in phy_resume() being called
> without there first being a phy_suspend.
> 
I looked into how the phy core deals with the down'ed interface and WOL
problems; here is what I found:
1.phydev->state is used to track the state of the phy.
2.phy_start/stop and phy_state_machine work together to make sure the
  phydev->state is consistent with phydev->suspended.

And using phy_start/stop instead of phy_resume/suspend should take care
of down'ed interface problem.
Will using phy_start/stop cause other problems?

As for WOL,
phy_state_machine:
if (needs_aneg)
err = phy_start_aneg_priv(phydev, false);
else if (do_suspend)
phy_suspend(phydev);

I think the phy core also has the same problem, because the above code does
not put the phy into suspend when it is WOL'ed, and it does not check the
return value of phy_suspend().

I hope I am not missing something obvious.
Please let me know if you have any idea about WOL problem, thanks.
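
A rough sketch of what the phy_start/phy_stop approach could look like
around the loopback self-test, using the phy_loopback() helper added by
this series (hypothetical driver code; example_run_test() is a stand-in,
and the WOL corner case above is not handled here):

#include <linux/phy.h>

int example_run_test(struct phy_device *phydev);        /* hypothetical */

static int example_phy_loopback_test(struct phy_device *phydev)
{
        int ret;

        /* Stop the PHY state machine so it does not race with the
         * manual loopback configuration done for the self-test. */
        phy_stop(phydev);

        ret = phy_loopback(phydev, true);
        if (!ret)
                ret = example_run_test(phydev);
        phy_loopback(phydev, false);

        /* Restart the state machine; it will bring the link back up. */
        phy_start(phydev);
        return ret;
}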

Best Regards
Yunsheng Lin




Re: [PATCH 1/1] mlx4_en: make mlx4_log_num_mgm_entry_size static

2017-07-03 Thread Sergei Shtylyov

Hello!

On 7/3/2017 8:35 AM, Zhu Yanjun wrote:


The variable mlx4_log_num_mgm_entry_size is only called in main.c.

   s/called/used/.


CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 

[...]

MBR, Sergei


[PATCH net-next] net: make sk_ehashfn() static

2017-07-03 Thread Eric Dumazet
From: Eric Dumazet 

sk_ehashfn() is only used from a single file.

Signed-off-by: Eric Dumazet 
---
 include/net/inet_hashtables.h |1 -
 net/ipv4/inet_hashtables.c|2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index b9e6e0e1f55ce1acd61ff491c88a12b83086f331..5026b1f08bb87bf7b9be9df84d70fedf2bd8707f 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -359,7 +359,6 @@ static inline struct sock *__inet_lookup_skb(struct 
inet_hashinfo *hashinfo,
 refcounted);
 }
 
-u32 sk_ehashfn(const struct sock *sk);
 u32 inet6_ehashfn(const struct net *net,
  const struct in6_addr *laddr, const u16 lport,
  const struct in6_addr *faddr, const __be16 fport);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index a4be2c1cb6887a9f639ccc1349df60a5befc7929..2e3389d614d1689856c3a8a9929dba8f7e7e1a37 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -43,7 +43,7 @@ static u32 inet_ehashfn(const struct net *net, const __be32 
laddr,
 /* This function handles inet_sock, but also timewait and request sockets
  * for IPv4/IPv6.
  */
-u32 sk_ehashfn(const struct sock *sk)
+static u32 sk_ehashfn(const struct sock *sk)
 {
 #if IS_ENABLED(CONFIG_IPV6)
if (sk->sk_family == AF_INET6 &&




Re: [PATCH net-next] net: make sk_ehashfn() static

2017-07-03 Thread David Miller
From: Eric Dumazet 
Date: Mon, 03 Jul 2017 02:57:54 -0700

> From: Eric Dumazet 
> 
> sk_ehashfn() is only used from a single file.
> 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH net-next] net: avoid one splat in fib_nl_delrule()

2017-07-03 Thread David Miller
From: Eric Dumazet 
Date: Mon, 03 Jul 2017 02:54:33 -0700

> From: Eric Dumazet 
> 
> We need to use refcount_set() on a newly created rule to avoid
> following error :
 ...
> Fixes: 717d1e993ad8 ("net: convert fib_rule.refcnt from atomic_t to 
> refcount_t") 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH 00/17] v3 net generic subsystem refcount conversions

2017-07-03 Thread Eric Dumazet
On Mon, 2017-07-03 at 09:57 +, Reshetova, Elena wrote:

> Thank you very much for the report! This is an underflow (dec/sub from
> zero) that is reported by the WARNING.
> I guess it is unlikely that the actual code underflows, so the most
> probable cause is that it attempted to do refcount_inc/add() from
> zero, but that failed.
> However, in that case you should have seen another warning on
> refcount_inc() somewhere earlier. That one is actually the one I need
> to see to track the root cause.
> Could you tell me how you arrive at the output below? Booting with what
> config, etc.?
> I can try to reproduce it to debug further.

I sent this fix : 

https://patchwork.ozlabs.org/patch/783389/

Thanks.





[PATCH net] ixgbe: fix masking of bits read from IXGBE_VXLANCTRL register

2017-07-03 Thread Sabrina Dubroca
In ixgbe_clear_udp_tunnel_port(), we read the IXGBE_VXLANCTRL register
and then try to mask some bits out of the value, using the logical
AND operator instead of the bitwise one.

Fixes: a21d0822ff69 ("ixgbe: add support for geneve Rx offload")
Signed-off-by: Sabrina Dubroca 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d39cba214320..c3e70ec2da0f 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -4868,7 +4868,7 @@ static void ixgbe_clear_udp_tunnel_port(struct 
ixgbe_adapter *adapter, u32 mask)
IXGBE_FLAG_GENEVE_OFFLOAD_CAPABLE)))
return;
 
-   vxlanctrl = IXGBE_READ_REG(hw, IXGBE_VXLANCTRL) && ~mask;
+   vxlanctrl = IXGBE_READ_REG(hw, IXGBE_VXLANCTRL) & ~mask;
IXGBE_WRITE_REG(hw, IXGBE_VXLANCTRL, vxlanctrl);
 
if (mask & IXGBE_VXLANCTRL_VXLAN_UDPPORT_MASK)
-- 
2.13.2
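
A standalone illustration of why the logical operator is wrong here
(made-up values, not driver code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint32_t reg  = 0x12b512b5;     /* pretend IXGBE_VXLANCTRL readout */
        uint32_t mask = 0x0000ffff;

        /* Logical AND collapses the whole expression to 0 or 1 ... */
        printf("logical: 0x%08x\n", (unsigned int)(reg && ~mask));      /* 0x00000001 */
        /* ... while bitwise AND clears only the masked bits. */
        printf("bitwise: 0x%08x\n", (unsigned int)(reg & ~mask));       /* 0x12b50000 */
        return 0;
}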



pull-request: wireless-drivers-next 2017-07-03

2017-07-03 Thread Kalle Valo
Hi Dave,

here's the late pull request to net-next I mentioned last week to
get some new iwlwifi hw support to 4.13.

If this is too late just drop the request and let me know, I can then
resend it for 4.14 after the merge window. These patches were included
in today's linux-next build and I haven't received any reports about
problems, at least not yet.

Kalle

The following changes since commit fdcbe65d618af080ee23229f0137ffd37f2de36b:

  Merge ath-next from 
git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/ath.git (2017-06-28 
22:10:48 +0300)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git 
tags/wireless-drivers-next-for-davem-2017-07-03

for you to fetch changes up to 17d9aa66b08de445645bd0688fc1635bed77a57b:

  Merge tag 'iwlwifi-next-for-kalle-2017-06-30' of 
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next (2017-06-30 
13:48:19 +0300)


wireless-drivers-next patches for 4.13

Last minute changes to get new hardware and firmware support for
iwlwifi and few other changes I was able to squeeze in. Also two
patches for ieee80211.h and nl80211 as Johannes is away.

Major changes:

iwlwifi

* some important fixes for 9000 HW

* support for version 30 of the FW API for 8000 and 9000 series

* a few new PCI IDs for 9000 series

* reorganization of common files

brcmfmac

* support 4-way handshake offloading for WPA/WPA2-PSK and 802.1X


Andrei Otcheretianski (1):
  nl80211: Don't verify owner_nlportid on NAN commands

Arend van Spriel (3):
  brcmfmac: support 4-way handshake offloading for WPA/WPA2-PSK
  brcmfmac: support 4-way handshake offloading for 802.1X
  brcmfmac: switch to using cfg80211_connect_done()

Emmanuel Grumbach (5):
  iwlwifi: mvm: change when the BT_COEX is sent
  iwlwifi: mvm: don't send fetch the TID from a non-QoS packet in TSO
  iwlwifi: mvm: don't mess the SNAP header in TSO for non-QoS packets
  iwlwifi: pcie: propagate iwl_pcie_apm_init's status
  iwlwifi: pcie: wait longer after device reset

Ganapathi Bhat (1):
  mwifiex: do not update MCS set from hostapd

Haim Dreyfuss (2):
  iwlwifi: mvm: refactor geo init
  iwlwifi: mvm: Add debugfs entry to retrieve SAR geographic profile

Johannes Berg (31):
  iwlwifi: mvm: remove some CamelCase from firmware API
  iwlwifi: mvm: fix various "Excess ... description" kernel-doc warnings
  iwlwifi: mvm: remove various unused command IDs/structs
  iwlwifi: mvm: use __le16 even for reserved fields
  iwlwifi: mvm: add documentation for all command IDs
  iwlwifi: mvm: fix a bunch of kernel-doc warnings
  iwlwifi: dvm: use macros for format strings
  iwlwifi: pcie: only apply retention workaround on 9000-series A-step
  iwlwifi: pcie: fix 9000-series RF-kill interrupt propagation
  iwlwifi: mvm: use proper CDB check in PHY context modify
  iwlwifi: pcie: improve "invalid queue" warning
  iwlwifi: pcie: improve debug in iwl_pcie_rx_handle_rb()
  iwlwifi: unify external & internal modparam names
  iwlwifi: pcie: make ctxt-info free idempotent
  iwlwifi: pcie: warn if paging is already initialized during init
  iwlwifi: mvm: unconditionally stop device after init
  iwlwifi: mvm: fix deduplication start logic
  iwlwifi: mvm: rename iwl_shared_mem_cfg_v1 to the correct _v2
  iwlwifi: create new subdirectory for FW interaction
  iwlwifi: move notification wait into fw/
  iwlwifi: move configuration into sub-directory
  iwlwifi: mvm: remove version 2 of paging command
  iwlwifi: mvm: quietly accept non-sta assoc response frames
  iwlwifi: pcie: add MSI-X interrupt tracing
  iwlwifi: mvm: properly enable IP header checksumming
  iwlwifi: mvm: fix mac80211 queue tracking
  iwlwifi: mvm: map cab_queue to real one earlier
  iwlwifi: mvm: fix mac80211's hw_queue in DQA mode
  iwlwifi: pcie: reconfigure MSI-X HW on resume
  iwlwifi: mvm: remove DQA non-STA client mode special case
  iwlwifi: mvm: quietly accept non-sta disassoc frames

Kalle Valo (1):
  Merge tag 'iwlwifi-next-for-kalle-2017-06-30' of 
git://git.kernel.org/.../iwlwifi/iwlwifi-next

Liad Kaufman (3):
  iwlwifi: mvm: support aggs of 64 frames in A000 family
  iwlwifi: mvm: support multi tid ba notif
  iwlwifi: mvm: update rx statistics cmd api

Luca Coelho (2):
  iwlwifi: mvm: simplify CHECK_MLME_TRIGGER macro
  iwlwifi: bump MAX API for 8000/9000/A000 to 33

Peter Oh (1):
  ieee80211: update public action codes

Sharon Dvir (1):
  iwlwifi: mvm: change sta_id to u8

Tzipi Peres (2):
  iwlwifi: add the new a000_2ax series
  iwlwifi: add twelve new 9560 series PCI IDs

 .../broadcom/brcm80211/brcmfmac/cfg80211.c | 150 +++-
 .../broadcom/brcm80211/brcmfmac

[PATCH] openvswitch: fix mis-ordered comment lines for ovs_skb_cb

2017-07-03 Thread Daniel Axtens
I was trying to wrap my head around meaning of mru, and realised
that the second line of the comment defining it had somehow
ended up after the line defining cutlen, leading to much confusion.

Reorder the lines to make sense.

Signed-off-by: Daniel Axtens 
---
 net/openvswitch/datapath.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index da931bdef8a7..5d8dcd88815f 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -98,8 +98,8 @@ struct datapath {
  * @input_vport: The original vport packet came in on. This value is cached
  * when a packet is received by OVS.
  * @mru: The maximum received fragement size; 0 if the packet is not
- * @cutlen: The number of bytes from the packet end to be removed.
  * fragmented.
+ * @cutlen: The number of bytes from the packet end to be removed.
  */
 struct ovs_skb_cb {
struct vport*input_vport;
-- 
2.11.0



[PATCH 0/2] Fixes for errors reported by 0day on net-next tree

2017-07-03 Thread Elena Reshetova
Despite the fact that we have automatic testing enabled
on all our branches, these s390-config related errors got through.
I will investigate separately why it happened.
Sorry for the inconvenience, and please pull.

Elena Reshetova (2):
  net, iucv: fixing error from refcount conversion
  drivers, s390: fix errors resulting from refcount conversions

 drivers/s390/net/ctcm_fsms.c | 12 ++--
 net/iucv/af_iucv.c   |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

-- 
2.7.4



[PATCH 2/2] drivers, s390: fix errors resulting from refcount conversions

2017-07-03 Thread Elena Reshetova
For some reason it looks like our tree hasn't been tested
with s390-default_defconfig, so this commit fixes errors
reported from it.
---
 drivers/s390/net/ctcm_fsms.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/s390/net/ctcm_fsms.c b/drivers/s390/net/ctcm_fsms.c
index e9847ce..570ae3b 100644
--- a/drivers/s390/net/ctcm_fsms.c
+++ b/drivers/s390/net/ctcm_fsms.c
@@ -217,7 +217,7 @@ void ctcm_purge_skb_queue(struct sk_buff_head *q)
CTCM_DBF_TEXT(TRACE, CTC_DBF_DEBUG, __func__);
 
while ((skb = skb_dequeue(q))) {
-   atomic_dec(&skb->users);
+   refcount_dec(&skb->users);
dev_kfree_skb_any(skb);
}
 }
@@ -271,7 +271,7 @@ static void chx_txdone(fsm_instance *fi, int event, void 
*arg)
priv->stats.tx_bytes += 2;
first = 0;
}
-   atomic_dec(&skb->users);
+   refcount_dec(&skb->users);
dev_kfree_skb_irq(skb);
}
spin_lock(&ch->collect_lock);
@@ -297,7 +297,7 @@ static void chx_txdone(fsm_instance *fi, int event, void 
*arg)
skb_put(ch->trans_skb, skb->len), skb->len);
priv->stats.tx_packets++;
priv->stats.tx_bytes += skb->len - LL_HEADER_LENGTH;
-   atomic_dec(&skb->users);
+   refcount_dec(&skb->users);
dev_kfree_skb_irq(skb);
i++;
}
@@ -1248,7 +1248,7 @@ static void ctcmpc_chx_txdone(fsm_instance *fi, int 
event, void *arg)
priv->stats.tx_bytes += 2;
first = 0;
}
-   atomic_dec(&skb->users);
+   refcount_dec(&skb->users);
dev_kfree_skb_irq(skb);
}
spin_lock(&ch->collect_lock);
@@ -1298,7 +1298,7 @@ static void ctcmpc_chx_txdone(fsm_instance *fi, int 
event, void *arg)
data_space -= skb->len;
priv->stats.tx_packets++;
priv->stats.tx_bytes += skb->len;
-   atomic_dec(&skb->users);
+   refcount_dec(&skb->users);
dev_kfree_skb_any(skb);
peekskb = skb_peek(&ch->collect_queue);
if (peekskb->len > data_space)
@@ -1795,7 +1795,7 @@ static void ctcmpc_chx_send_sweep(fsm_instance *fsm, int 
event, void *arg)
fsm_event(grp->fsm, MPCG_EVENT_INOP, dev);
goto done;
} else {
-   atomic_inc(&skb->users);
+   refcount_inc(&skb->users);
skb_queue_tail(&wch->io_queue, skb);
}
 
-- 
2.7.4



[PATCH 1/2] net, iucv: fixing error from refcount conversion

2017-07-03 Thread Elena Reshetova
Fixing "net/iucv/af_iucv.c:405:22: error: passing
argument 1 of 'atomic_read' from incompatible pointer type"
---
 net/iucv/af_iucv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index ac033e4..1485331 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -402,7 +402,7 @@ static void iucv_sock_destruct(struct sock *sk)
}
 
WARN_ON(atomic_read(&sk->sk_rmem_alloc));
-   WARN_ON(atomic_read(&sk->sk_wmem_alloc));
+   WARN_ON(refcount_read(&sk->sk_wmem_alloc));
WARN_ON(sk->sk_wmem_queued);
WARN_ON(sk->sk_forward_alloc);
 }
-- 
2.7.4



Re: [PATCH 0/2] Fixes for errors reported by 0day on net-next tree

2017-07-03 Thread David Miller
From: Elena Reshetova 
Date: Mon,  3 Jul 2017 14:50:26 +0300

> Despite the fact that we have automatic testing enabled
> on all our branches, these s390-config related errors got through.
> I will investigate separately why it happened.
> Sorry for the inconvenience, and please pull.

Update your net-next tree I already fixed this stuff.

Thanks.


Re: [PATCH] openvswitch: fix mis-ordered comment lines for ovs_skb_cb

2017-07-03 Thread David Miller
From: Daniel Axtens 
Date: Mon,  3 Jul 2017 21:46:43 +1000

> I was trying to wrap my head around meaning of mru, and realised
> that the second line of the comment defining it had somehow
> ended up after the line defining cutlen, leading to much confusion.
> 
> Reorder the lines to make sense.
> 
> Signed-off-by: Daniel Axtens 

Applied, thanks.


Re: pull-request: wireless-drivers-next 2017-07-03

2017-07-03 Thread David Miller
From: Kalle Valo 
Date: Mon, 03 Jul 2017 14:39:07 +0300

> here's the late pull request to net-next I mentioned last week to
> get some new iwlwifi hw support to 4.13.
> 
> If this is too late just drop the request and let me know, I can then
> resend it for 4.14 after the merge window. These patches were included
> in today's linux-next build and I haven't received any reports about
> problems, at least not yet.

Pulled, thanks.


Re: [PATCH RFC 1/2] skbuff: Function to send an skbuf on a socket

2017-07-03 Thread David Miller
From: Tom Herbert 
Date: Thu, 29 Jun 2017 11:27:04 -0700

> +int skb_send_sock(struct sk_buff *skb, struct socket *sock, unsigned int 
> offset)
> +{
> + unsigned int sent = 0;
> + unsigned int ret;
> + unsigned short fragidx;

Please use reverse christmas tree ordering for these local variables.

> + /* Deal with head data */
> + while (offset < skb_headlen(skb)) {
> + size_t len = skb_headlen(skb) - offset;
> + struct kvec kv;
> + struct msghdr msg;

Likewise.
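
"Reverse christmas tree" simply means ordering the local declarations from
the longest line down to the shortest, so the hunk above would look roughly
like this (sketch only, body elided):

#include <linux/errno.h>
#include <linux/net.h>
#include <linux/skbuff.h>

int skb_send_sock_sketch(struct sk_buff *skb, struct socket *sock,
                         unsigned int offset)
{
        unsigned short fragidx = 0;
        unsigned int sent = 0;
        unsigned int ret = 0;

        if (!skb || !sock)
                return -EINVAL;

        /* head-data and frag loops from the RFC patch would go here,
         * updating sent/ret/fragidx while walking the skb from offset. */
        return (int)(sent + ret + fragidx + offset);
}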


Re: [PATCH RFC 2/2] kproxy: Kernel proxy

2017-07-03 Thread David Miller
From: Tom Herbert 
Date: Thu, 29 Jun 2017 11:27:05 -0700

> A proc file (/proc/net/kproxy) is created to list all the running
> kernel proxies and relevant statistics for them.

proc is deprecated for dumping information like this, please use
sock diag instead.


Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Will Deacon
On Fri, Jun 30, 2017 at 03:18:40PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 30, 2017 at 02:13:39PM +0100, Will Deacon wrote:
> > On Fri, Jun 30, 2017 at 05:38:15AM -0700, Paul E. McKenney wrote:
> > > I also need to check all uses of spin_is_locked().  There might no
> > > longer be any that rely on any particular ordering...
> > 
> > Right. I think we're looking for the "insane case" as per 38b850a73034
> > (which was apparently used by ipc/sem.c at the time, but no longer).
> > 
> > There's a usage in kernel/debug/debug_core.c, but it doesn't fill me with
> > joy.
> 
> That is indeed an interesting one...  But my first round will be what
> semantics the implementations seem to provide:
> 
> Acquire courtesy of TSO: s390, sparc, x86.
> Acquire: ia64 (in reality fully ordered).
> Control dependency: alpha, arc, arm, blackfin, hexagon, m32r, mn10300, tile,
>   xtensa.
> Control dependency plus leading full barrier: arm64, powerpc.
> UP-only: c6x, cris, frv, h8300, m68k, microblaze, nios2, openrisc, um, 
> unicore32.
> 
> Special cases:
>   metag: Acquire if !CONFIG_METAG_SMP_WRITE_REORDERING.
>  Otherwise control dependency?
>   mips: Control dependency, acquire if CONFIG_CPU_CAVIUM_OCTEON.
>   parisc: Acquire courtesy of TSO, but why barrier in smp_load_acquire?
>   sh: Acquire if one of SH4A, SH5, or J2, otherwise acquire?  UP-only?
> 
> Are these correct, or am I missing something with any of them?

That looks about right but, at least on ARM, I think we have to consider
the semantics of spin_is_locked with respect to the other spin_* functions,
rather than in isolation.

For example, ARM only has a control dependency, but spin_lock has a trailing
smp_mb() and spin_unlock has both leading and trailing smp_mb().

Will


Re: [PATCH net 1/2] vxlan: fix hlist corruption

2017-07-03 Thread Waiman Long
On 07/03/2017 04:23 AM, Jiri Benc wrote:
> On Sun, 2 Jul 2017 16:06:10 -0400, Waiman Long wrote:
>> I didn't see any init code for hlist4 and hlist6. Is vxlan_dev going to
>> be *zalloc'ed so that they are guaranteed to be NULL? If not, you may
>> need to add init code, as not both hlists will be hashed and one of
>> them may contain invalid data.
> Yes, it's zalloced via alloc_netdev. No need to init the fields
> explicitly.
>
>  Jiri

Thanks for the clarification.

-Longman



[PATCH net v2 1/1] net: reflect mark on tcp syn ack packets

2017-07-03 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

SYN-ACK responses on a server in response to a SYN from a client
did not get the injected skb mark that was tagged on the SYN packet.

Fixes: 84f39b08d786 ("net: support marking accepting TCP sockets")
Reviewed-by: Lorenzo Colitti 
Signed-off-by: Jamal Hadi Salim 
---
 net/ipv4/ip_output.c  | 3 ++-
 net/ipv4/tcp_output.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 532b36e..94b36d9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -173,7 +173,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct 
sock *sk,
}
 
skb->priority = sk->sk_priority;
-   skb->mark = sk->sk_mark;
+   if (!skb->mark)
+   skb->mark = sk->sk_mark;
 
/* Send it out. */
return ip_local_out(net, skb->sk, skb);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4858e19..b1604d0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3134,6 +3134,7 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, 
struct dst_entry *dst,
tcp_ecn_make_synack(req, th);
th->source = htons(ireq->ir_num);
th->dest = ireq->ir_rmt_port;
+   skb->mark = ireq->ir_mark;
/* Setting of flags are superfluous here for callers (and ECE is
 * not even correctly set)
 */
-- 
1.9.1



[PATCH net v2 0/1] reflect mark on tcp syn ack packets

2017-07-03 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 


Changes from v1(Lorenzo):
unconditionally set skb->mark = ireq->ir_mark;

Jamal Hadi Salim (1):
  net: reflect mark on tcp syn ack packets

 net/ipv4/ip_output.c  | 3 ++-
 net/ipv4/tcp_output.c | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

-- 
1.9.1



[PATCH iproute2 V2 0/4] RDMAtool

2017-07-03 Thread Leon Romanovsky
From: Leon Romanovsky 

Hi,

This is the second version of the series implementing the RDMAtool - the tool
to configure RDMA devices. The initial proposal was sent as an RFC [1] and
was based on sysfs entries as a POC.

The current series was rewritten completely to work with RDMA netlinks as
a source of user<->kernel communications. In order to achieve that, the
RDMA netlinks were extensively refactored and modernized [2, 3, 4 and 5].

The following is an example of various runs on my machine with 5 devices
(4 in IB mode and one in Ethernet mode)

### Without parameters
$ rdma
Usage: rdma [ OPTIONS ] OBJECT { COMMAND | help }
where  OBJECT := { dev | link | help }
   OPTIONS := { -V[ersion] | -d[etails]}

### With unspecified device name
$ rdma dev
1: mlx5_0: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:3457 
sys_image_guid 5254:00c0:fe12:3457
2: mlx5_1: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:3458 
sys_image_guid 5254:00c0:fe12:3458
3: mlx5_2: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:3459 
sys_image_guid 5254:00c0:fe12:3459
4: mlx5_3: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:345a 
sys_image_guid 5254:00c0:fe12:345a
5: mlx5_4: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:345b 
sys_image_guid 5254:00c0:fe12:345b

### Detailed mode
$ rdma -d dev
1: mlx5_0: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:3457 
sys_image_guid 5254:00c0:fe12:3457
caps: 
2: mlx5_1: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:3458 
sys_image_guid 5254:00c0:fe12:3458
caps: 
3: mlx5_2: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:3459 
sys_image_guid 5254:00c0:fe12:3459
caps: 
4: mlx5_3: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:345a 
sys_image_guid 5254:00c0:fe12:345a
caps: 
5: mlx5_4: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:345b 
sys_image_guid 5254:00c0:fe12:345b
caps: 

### Specific device
$ rdma dev show mlx5_4
5: mlx5_4: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:345b 
sys_image_guid 5254:00c0:fe12:345b

### Specific device in detailed mode
$ rdma dev show mlx5_4 -d
5: mlx5_4: node_type SWITCH fw 2.8. node_guid 5254:00c0:fe12:345b 
sys_image_guid 5254:00c0:fe12:345b
caps: 

### Unknown command (caps)
$ rdma dev show mlx5_4 caps
Unknown parameter 'caps'.

### Link properties without device name
$ rdma link
1/1: mlx5_0/1: subnet_prefix fe80::: lid 13399 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
2/1: mlx5_1/1: subnet_prefix fe80::: lid 13400 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
3/1: mlx5_2/1: subnet_prefix fe80::: lid 13401 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
4/1: mlx5_3/1: state DOWN physical_state DISABLED
5/1: mlx5_4/1: subnet_prefix fe80::: lid 13403 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP

### Link properties in detailed mode
$ rdma link -d
1/1: mlx5_0/1: subnet_prefix fe80::: lid 13399 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
caps: 
2/1: mlx5_1/1: subnet_prefix fe80::: lid 13400 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
caps: 
3/1: mlx5_2/1: subnet_prefix fe80::: lid 13401 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
caps: 
4/1: mlx5_3/1: state DOWN physical_state DISABLED
caps: 
5/1: mlx5_4/1: subnet_prefix fe80::: lid 13403 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
caps: 

### All links for specific device
$ rdma link show mlx5_3
1/1: mlx5_0/1: subnet_prefix fe80::: lid 13399 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP

### Detailed link properties for specific device
$ rdma link -d show mlx5_3
1/1: mlx5_0/1: subnet_prefix fe80::: lid 13399 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP
caps: 

### Specific port for specific device
$ rdma link show mlx5_4/1
1/1: mlx5_0/1: subnet_prefix fe80::: lid 13399 sm_lid 49151 lmc 0 
state ACTIVE physical_state LINK_UP

### Unknown parameter
$ rdma link show mlx5_4/1 caps
Unknown parameter 'caps'.

Thanks

Changelog
v1->v2:
 * Squashed multiple (and similar) patches to be one patch for dev object
   and one patch for link object.
 * Removed port_map struct
 * Removed the global netlink dump during initialization; this removed the need
   to store intermediate variables and reuses netlink's ability to signal whether
   a variable exists or not.
 * Added "-d" --details option and put all CAPs under it.

v0->v1:
 * Moved hunk with changes in man/Makefile from first patch to the last patch
 * Removed the "unknown command" from the examples in commit messages
 * Removed special "caps" parsing command and put it to be part of general 
"show" command
 * Changed parsed capability format to be similar to iproute2 suite
 * Added FW version as an output of show command.
 * Added forgotten CAP_FLAGS to the nla_policy list
RFC->v0:
 * Removed everything that is not 

[PATCH iproute2 V2 1/4] rdma: Add basic infrastructure for RDMA tool

2017-07-03 Thread Leon Romanovsky
From: Leon Romanovsky 

RDMA devices are cross-functional devices from one side,
but very tailored for the specific markets from another.

Such diversity caused to spread of RDMA related configuration
across various tools, e.g. devlink, ip, ethtool, ib specific and
vendor specific solutions.

This patch adds ability to fill device and port information
by reading RDMA netlink.

Signed-off-by: Leon Romanovsky 
---
 Makefile|   2 +-
 rdma/.gitignore |   1 +
 rdma/Makefile   |  22 ++
 rdma/rdma.c | 116 
 rdma/rdma.h |  71 +
 rdma/utils.c| 232 
 6 files changed, 443 insertions(+), 1 deletion(-)
 create mode 100644 rdma/.gitignore
 create mode 100644 rdma/Makefile
 create mode 100644 rdma/rdma.c
 create mode 100644 rdma/rdma.h
 create mode 100644 rdma/utils.c

diff --git a/Makefile b/Makefile
index 18de7dcb..c255063b 100644
--- a/Makefile
+++ b/Makefile
@@ -52,7 +52,7 @@ WFLAGS += -Wmissing-declarations -Wold-style-definition 
-Wformat=2
 CFLAGS := $(WFLAGS) $(CCOPTS) -I../include $(DEFINES) $(CFLAGS)
 YACCFLAGS = -d -t -v
 
-SUBDIRS=lib ip tc bridge misc netem genl tipc devlink man
+SUBDIRS=lib ip tc bridge misc netem genl tipc devlink rdma man
 
 LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a
 LDLIBS += $(LIBNETLINK)
diff --git a/rdma/.gitignore b/rdma/.gitignore
new file mode 100644
index ..51fb172b
--- /dev/null
+++ b/rdma/.gitignore
@@ -0,0 +1 @@
+rdma
diff --git a/rdma/Makefile b/rdma/Makefile
new file mode 100644
index ..64da2142
--- /dev/null
+++ b/rdma/Makefile
@@ -0,0 +1,22 @@
+include ../Config
+
+ifeq ($(HAVE_MNL),y)
+
+RDMA_OBJ = rdma.o utils.o
+
+TARGETS=rdma
+CFLAGS += $(shell $(PKG_CONFIG) libmnl --cflags)
+LDLIBS += $(shell $(PKG_CONFIG) libmnl --libs)
+
+endif
+
+all:   $(TARGETS) $(LIBS)
+
+rdma:  $(RDMA_OBJ) $(LIBS)
+   $(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
+
+install: all
+   install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
+
+clean:
+   rm -f $(RDMA_OBJ) $(TARGETS)
diff --git a/rdma/rdma.c b/rdma/rdma.c
new file mode 100644
index ..29273839
--- /dev/null
+++ b/rdma/rdma.c
@@ -0,0 +1,116 @@
+/*
+ * rdma.c  RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include 
+#include 
+
+#include "rdma.h"
+#include "SNAPSHOT.h"
+
+static void help(char *name)
+{
+   pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
+  "where  OBJECT := { help }\n"
+  "   OPTIONS := { -V[ersion] | -d[etails]}\n", name);
+}
+
+static int cmd_help(struct rdma *rd)
+{
+   help(rd->filename);
+   return 0;
+}
+
+static int rd_cmd(struct rdma *rd)
+{
+   const struct rdma_cmd cmds[] = {
+   { NULL, cmd_help },
+   { "help",   cmd_help },
+   { 0 }
+   };
+
+   return rdma_exec_cmd(rd, cmds, "object");
+}
+
+static int rd_init(struct rdma *rd, int argc, char **argv, char *filename)
+{
+   uint32_t seq;
+   int ret;
+
+   rd->filename = filename;
+   rd->argc = argc;
+   rd->argv = argv;
+   INIT_LIST_HEAD(&rd->dev_map_list);
+   rd->buff = malloc(MNL_SOCKET_BUFFER_SIZE);
+   if (!rd->buff)
+   return -ENOMEM;
+
+   rdma_prepare_msg(rd, RDMA_NLDEV_CMD_GET, &seq, (NLM_F_REQUEST | 
NLM_F_ACK | NLM_F_DUMP));
+   if ((ret = rdma_send_msg(rd)))
+   return ret;
+
+   return rdma_recv_msg(rd, rd_dev_init_cb, rd, seq);
+}
+
+static void rd_free(struct rdma *rd)
+{
+   free(rd->buff);
+   rdma_free_devmap(rd);
+}
+int main(int argc, char **argv)
+{
+   static const struct option long_options[] = {
+   { "version",no_argument,NULL, 'V' },
+   { "help",   no_argument,NULL, 'h' },
+   { "details",no_argument,NULL, 'd' },
+   { NULL, 0, NULL, 0 }
+   };
+   bool show_details = false;
+   char *filename;
+   struct rdma rd;
+   int opt;
+   int err;
+
+   filename = basename(argv[0]);
+
+   while ((opt = getopt_long(argc, argv, "Vhd",
+ long_options, NULL)) >= 0) {
+
+   switch (opt) {
+   case 'V':
+   printf("%s utility, iproute2-ss%s\n", filename, 
SNAPSHOT);
+   return EXIT_SUCCESS;
+   case 'd':
+   show_details = true;
+   break;
+   case 'h':
+   help(filename);
+   return EXIT_SUCCESS;
+   default:
+ 

[PATCH iproute2 V2 3/4] rdma: Add link object

2017-07-03 Thread Leon Romanovsky
From: Leon Romanovsky 

Link (port) object represent struct ib_port to the user space.

Link properties:
 * Port capabilities
 * IB subnet prefix
 * LID, SM_LID and LMC
 * Port state
 * Physical state

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile |   2 +-
 rdma/link.c   | 280 ++
 rdma/rdma.c   |   3 +-
 rdma/utils.c  |   5 ++
 4 files changed, 288 insertions(+), 2 deletions(-)
 create mode 100644 rdma/link.c

diff --git a/rdma/Makefile b/rdma/Makefile
index 123d7ac5..1a9e4b1a 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -2,7 +2,7 @@ include ../Config
 
 ifeq ($(HAVE_MNL),y)
 
-RDMA_OBJ = rdma.o utils.o dev.o
+RDMA_OBJ = rdma.o utils.o dev.o link.o
 
 TARGETS=rdma
 CFLAGS += $(shell $(PKG_CONFIG) libmnl --cflags)
diff --git a/rdma/link.c b/rdma/link.c
new file mode 100644
index ..e7455cfe
--- /dev/null
+++ b/rdma/link.c
@@ -0,0 +1,280 @@
+/*
+ * link.c  RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static int link_help(struct rdma *rd)
+{
+   pr_out("Usage: %s link show [DEV/PORT_INDEX]\n", rd->filename);
+   return 0;
+}
+
+static void link_print_caps(struct nlattr **tb)
+{
+   uint64_t caps;
+   uint32_t idx;
+
+   /*
+* FIXME: move to indexes when kernel will start exporting them.
+*/
+   static const char *link_caps[64] = {
+   "UNKNOWN",
+   "SM",
+   "NOTICE",
+   "TRAP",
+   "OPT_IPD",
+   "AUTO_MIGR",
+   "SL_MAP",
+   "MKEY_NVRAM",
+   "PKEY_NVRAM",
+   "LED_INFO",
+   "SM_DISABLED",
+   "SYS_IMAGE_GUID",
+   "PKEY_SW_EXT_PORT_TRAP",
+   "UNKNOWN",
+   "EXTENDED_SPEEDS",
+   "UNKNOWN",
+   "CM",
+   "SNMP_TUNNEL",
+   "REINIT",
+   "DEVICE_MGMT",
+   "VENDOR_CLASS",
+   "DR_NOTICE",
+   "CAP_MASK_NOTICE",
+   "BOOT_MGMT",
+   "LINK_LATENCY",
+   "CLIENT_REG",
+   "IP_BASED_GIDS",
+   };
+
+   if (!tb[RDMA_NLDEV_ATTR_CAP_FLAGS])
+   return;
+
+   caps = mnl_attr_get_u64(tb[RDMA_NLDEV_ATTR_CAP_FLAGS]);
+
+   pr_out("\ncaps: <");
+   for (idx = 0; idx < 64; idx++) {
+   if (caps & 0x1) {
+   pr_out("%s", link_caps[idx]?link_caps[idx]:"UNKNONW");
+   if (caps >> 0x1)
+   pr_out(", ");
+   }
+   caps >>= 0x1;
+   }
+
+   pr_out(">");
+}
+
+static void link_print_subnet_prefix(struct nlattr **tb)
+{
+   uint64_t subnet_prefix;
+   uint16_t sp[4];
+
+   if (!tb[RDMA_NLDEV_ATTR_SUBNET_PREFIX])
+   return;
+
+   subnet_prefix = mnl_attr_get_u64(tb[RDMA_NLDEV_ATTR_SUBNET_PREFIX]);
+   memcpy(sp, &subnet_prefix, sizeof(uint64_t));
+   pr_out("subnet_prefix %04x:%04x:%04x:%04x ", sp[3], sp[2], sp[1], 
sp[0]);
+}
+
+static void link_print_lid(struct nlattr **tb)
+{
+   if (!tb[RDMA_NLDEV_ATTR_LID])
+   return;
+
+   pr_out("lid %u ",
+  mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_LID]));
+}
+
+static void link_print_sm_lid(struct nlattr **tb)
+{
+
+   if (!tb[RDMA_NLDEV_ATTR_SM_LID])
+   return;
+
+   pr_out("sm_lid %u ",
+  mnl_attr_get_u32(tb[RDMA_NLDEV_ATTR_SM_LID]));
+}
+
+static void link_print_lmc(struct nlattr **tb)
+{
+   if (!tb[RDMA_NLDEV_ATTR_LMC])
+   return;
+
+   pr_out("lmc %u ", mnl_attr_get_u8(tb[RDMA_NLDEV_ATTR_LMC]));
+}
+
+static void link_print_state(struct nlattr **tb)
+{
+   uint8_t state;
+   /*
+* FIXME: move to index exported by the kernel
+*/
+   static const char *str[] = {
+   "NOP",
+   "DOWN",
+   "INIT",
+   "ARMED",
+   "ACTIVE",
+   "ACTIVE_DEFER",
+   };
+
+   if (!tb[RDMA_NLDEV_ATTR_PORT_STATE])
+   return;
+
+   state = mnl_attr_get_u8(tb[RDMA_NLDEV_ATTR_PORT_STATE]);
+
+   if (state < 6 )
+   pr_out("state %s ", str[state]);
+   else
+   pr_out("state UNKNOWN ");
+}
+
+static void link_print_phys_state(struct nlattr **tb)
+{
+   uint8_t phys_state;
+   /*
+* FIXME: move to index exported by the kernel
+*/
+   static const char *str[] = {
+   "UNKNOWN",
+   "SLEEP",
+   "POLLING",
+   "DISABLED",
+   "PORT_CON

[PATCH iproute2 V2 2/4] rdma: Add dev object

2017-07-03 Thread Leon Romanovsky
From: Leon Romanovsky 

Device (dev) object represents struct ib_device to the user space.

Device properties:
 * Device capabilities
 * FW version to the device output
 * node_guid and sys_image_guid
 * node_type

Signed-off-by: Leon Romanovsky 
---
 rdma/Makefile |   2 +-
 rdma/dev.c| 235 ++
 rdma/rdma.c   |   3 +-
 rdma/rdma.h   |  12 ++-
 rdma/utils.c  |  46 +++-
 5 files changed, 293 insertions(+), 5 deletions(-)
 create mode 100644 rdma/dev.c

diff --git a/rdma/Makefile b/rdma/Makefile
index 64da2142..123d7ac5 100644
--- a/rdma/Makefile
+++ b/rdma/Makefile
@@ -2,7 +2,7 @@ include ../Config
 
 ifeq ($(HAVE_MNL),y)
 
-RDMA_OBJ = rdma.o utils.o
+RDMA_OBJ = rdma.o utils.o dev.o
 
 TARGETS=rdma
 CFLAGS += $(shell $(PKG_CONFIG) libmnl --cflags)
diff --git a/rdma/dev.c b/rdma/dev.c
new file mode 100644
index ..b80e5288
--- /dev/null
+++ b/rdma/dev.c
@@ -0,0 +1,235 @@
+/*
+ * dev.c   RDMA tool
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License
+ *  as published by the Free Software Foundation; either version
+ *  2 of the License, or (at your option) any later version.
+ *
+ * Authors: Leon Romanovsky 
+ */
+
+#include "rdma.h"
+
+static int dev_help(struct rdma *rd)
+{
+   pr_out("Usage: %s dev show [DEV]\n", rd->filename);
+   return 0;
+}
+
+static void dev_print_caps(struct nlattr **tb)
+{
+   uint64_t caps;
+   uint32_t idx;
+
+   /*
+* FIXME: move to indexes when kernel will start exporting them.
+*/
+   static const char *dev_caps[64] = {
+   "RESIZE_MAX_WR",
+   "BAD_PKEY_CNTR",
+   "BAD_QKEY_CNTR",
+   "RAW_MULTI",
+   "AUTO_PATH_MIG",
+   "CHANGE_PHY_PORT",
+   "UD_AV_PORT_ENFORCE",
+   "CURR_QP_STATE_MOD",
+   "SHUTDOWN_PORT",
+   "INIT_TYPE",
+   "PORT_ACTIVE_EVENT",
+   "SYS_IMAGE_GUID",
+   "RC_RNR_NAK_GEN",
+   "SRQ_RESIZE",
+   "N_NOTIFY_CQ",
+   "LOCAL_DMA_LKEY",
+   "RESERVED",
+   "MEM_WINDOW",
+   "UD_IP_CSUM",
+   "UD_TSO",
+   "XRC",
+   "MEM_MGT_EXTENSIONS",
+   "BLOCK_MULTICAST_LOOPBACK",
+   "MEM_WINDOW_TYPE_2A",
+   "MEM_WINDOW_TYPE_2B",
+   "RC_IP_CSUM",
+   "RAW_IP_CSUM",
+   "CROSS_CHANNEL",
+   "MANAGED_FLOW_STEERING",
+   "SIGNATURE_HANDOVER",
+   "ON_DEMAND_PAGING",
+   "SG_GAPS_REG",
+   "VIRTUAL_FUNCTION",
+   "RAW_SCATTER_FCS",
+   "RDMA_NETDEV_OPA_VNIC",
+   };
+
+   if (!tb[RDMA_NLDEV_ATTR_CAP_FLAGS])
+  return;
+
+   caps = mnl_attr_get_u64(tb[RDMA_NLDEV_ATTR_CAP_FLAGS]);
+
+   pr_out("\ncaps: <");
+   for (idx = 0; idx < 64; idx++) {
+   if (caps & 0x1) {
+   pr_out("%s", dev_caps[idx]?dev_caps[idx]:"UNKNONW");
+   if (caps >> 0x1)
+   pr_out(", ");
+   }
+   caps >>= 0x1;
+   }
+
+   pr_out(">");
+}
+
+static void dev_print_fw(struct nlattr **tb)
+{
+   if (!tb[RDMA_NLDEV_ATTR_FW_VERSION])
+   return;
+
+   pr_out("fw %s ",
+  mnl_attr_get_str(tb[RDMA_NLDEV_ATTR_FW_VERSION]));
+}
+
+static void _dev_print_be64(char *name, uint64_t val)
+{
+   uint16_t vp[4];
+
+   memcpy(vp, &val, sizeof(uint64_t));
+   pr_out("%s %04x:%04x:%04x:%04x ", name, vp[3], vp[2], vp[1], vp[0]);
+}
+
+static void dev_print_node_guid(struct nlattr **tb)
+{
+   uint64_t node_guid;
+
+   if (!tb[RDMA_NLDEV_ATTR_NODE_GUID])
+   return;
+
+   node_guid = mnl_attr_get_u64(tb[RDMA_NLDEV_ATTR_NODE_GUID]);
+   _dev_print_be64("node_guid", node_guid);
+}
+
+static void dev_print_sys_image_guid(struct nlattr **tb)
+{
+   uint64_tsys_image_guid;
+
+   if (!tb[RDMA_NLDEV_ATTR_SYS_IMAGE_GUID])
+   return;
+
+   sys_image_guid = mnl_attr_get_u64(tb[RDMA_NLDEV_ATTR_SYS_IMAGE_GUID]);
+   _dev_print_be64("sys_image_guid", sys_image_guid);
+}
+
+static void dev_print_node_type(struct nlattr **tb)
+{
+   uint8_t node_type;
+   /*
+* FIXME: move to index exported by the kernel
+*/
+   static const char *str[] = {
+   "UNKNOWN",
+   "SWITCH",
+   "ROUTER",
+   "RNIC",
+   "USNIC",
+   "USNIC_UDP",
+   };
+
+   if (!tb[RDMA_NLDEV_ATTR_DEV_NODE_TYPE])
+   return;
+
+   node_type = mnl_attr_get_u8(tb[RDMA_NLDEV_ATTR_DEV_NODE_TYPE]);
+
+   if (node_type < 7 )
+   pr_

[PATCH iproute2 V2 4/4] rdma: Add initial manual for the tool

2017-07-03 Thread Leon Romanovsky
From: Leon Romanovsky 

Signed-off-by: Leon Romanovsky 
---
 man/man8/Makefile |  3 +-
 man/man8/rdma.8   | 82 +++
 2 files changed, 84 insertions(+), 1 deletion(-)
 create mode 100644 man/man8/rdma.8

diff --git a/man/man8/Makefile b/man/man8/Makefile
index f3318644..81979a07 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -19,7 +19,8 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 
rtmon.8 rtpr.8 ss.
tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8 tc-skbmod.8 tc-ife.8 \
tc-tunnel_key.8 tc-sample.8 \
devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8 \
-   ifstat.8
+   ifstat.8 \
+   rdma.8
 
 all: $(TARGETS)
 
diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
new file mode 100644
index ..7578c15e
--- /dev/null
+++ b/man/man8/rdma.8
@@ -0,0 +1,82 @@
+.TH RDMA 8 "28 Mar 2017" "iproute2" "Linux"
+.SH NAME
+rdma \- RDMA tool
+.SH SYNOPSIS
+.sp
+.ad l
+.in +8
+.ti -8
+.B rdma
+.RI "[ " OPTIONS " ] " OBJECT " { " COMMAND " | "
+.BR help " }"
+.sp
+
+.ti -8
+.IR OBJECT " := { "
+.BR dev " | " link " }"
+.sp
+
+.ti -8
+.IR OPTIONS " := { "
+\fB\-V\fR[\fIersion\fR] }
+
+.SH OPTIONS
+
+.TP
+.BR "\-V" , " -Version"
+Print the version of the
+.B rdma
+tool and exit.
+
+.SS
+.I OBJECT
+
+.TP
+.B dev
+- RDMA device.
+
+.TP
+.B link
+- RDMA port related.
+
+.PP
+The names of all objects may be written in full or
+abbreviated form, for example
+.B stats
+can be abbreviated as
+.B stat
+or just
+.B s.
+
+.SS
+.I COMMAND
+
+Specifies the action to perform on the object.
+The set of possible actions depends on the object type.
+As a rule, it is possible to
+.B show
+(or
+.B list
+) objects, but some objects do not allow all of these operations
+or have some additional commands. The
+.B help
+command is available for all objects. It prints
+out a list of available commands and argument syntax conventions.
+.sp
+If no command is given, some default command is assumed.
+Usually it is
+.B list
+or, if the objects of this class cannot be listed,
+.BR "help" .
+
+.SH EXIT STATUS
+Exit status is 0 if command was successful or a positive integer upon failure.
+
+.SH REPORTING BUGS
+Report any bugs to the Linux RDMA mailing list
+.B 
+where the development and maintenance is primarily done.
+You do not have to be subscribed to the list to send a message there.
+
+.SH AUTHOR
+Leon Romanovsky 
-- 
2.13.2



locking issues in macvtap (looks like due to tap: Extending tap device create/destroy APIs)

2017-07-03 Thread Christian Borntraeger
Sainath,

with rcu debugging and lock debugging I get the following splats.
I think doing a mutex_lock while in an rcu read-side is not allowed,
since mutex_lock can sleep.

This is in 4.11 and 4.12 and seems to be introduced with commit
d9f1f61c0801a7("tap: Extending tap device create/destroy APIs").
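
The generic shape of what lockdep is flagging in the splat below is a
sleeping lock taken inside an RCU read-side critical section; a minimal,
purely hypothetical sketch (not the tap/macvtap code):

#include <linux/mutex.h>
#include <linux/rcupdate.h>

static DEFINE_MUTEX(example_mutex);

/* mutex_lock() may sleep, so calling it between rcu_read_lock() and
 * rcu_read_unlock() triggers the "Illegal context switch in RCU
 * read-side critical section" warning.
 */
static void sleeping_lock_in_rcu_read_side(void)
{
	rcu_read_lock();
	mutex_lock(&example_mutex);	/* may sleep -> not allowed here */
	mutex_unlock(&example_mutex);
	rcu_read_unlock();
}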


Christian


[  125.678015] ===
[  125.678018] [ ERR: suspicious RCU usage.  ]
[  125.678022] 4.11.0+ #18 Not tainted
[  125.678025] ---
[  125.678028] ./include/linux/rcupdate.h:521 Illegal context switch in RCU 
read-side critical section!
[  125.678031] 
   other info that might help us debug this:

[  125.678035] 
   rcu_scheduler_active = 2, debug_locks = 0
[  125.678038] 2 locks held by libvirtd/3050:
[  125.678041]  #0:  (rtnl_mutex){+.+.+.}, at: [<00772b02>] 
rtnl_newlink+0x2ea/0x880
[  125.678057]  #1:  (rcu_read_lock){..}, at: [<03ff800dad00>] 
tap_get_minor+0x0/0x1d8 [tap]
[  125.678068] 
   stack backtrace:
[  125.678073] CPU: 26 PID: 3050 Comm: libvirtd Not tainted 4.11.0+ #18
[  125.678076] Hardware name: IBM 2964 NC9 704 (LPAR)
[  125.678079] Stack:
[  125.678081]00fa977cb230 00fa977cb2c0 0003 

[  125.678091]00fa977cb360 00fa977cb2d8 00fa977cb2d8 
0020
[  125.678100] 03ff0020 00fa000a 
00fa000a
[  125.678109]000c 00fa977cb328  

[  125.678119]008e2510 001139ac 00fa977cb2c0 
00fa977cb318
[  125.678150] Call Trace:
[  125.678157] ([<00113872>] show_trace+0xea/0xf0)
[  125.678160]  [<00113950>] show_stack+0x68/0xe0 
[  125.678165]  [<0057ef8c>] dump_stack+0x94/0xd8 
[  125.678172]  [<001a4422>] ___might_sleep+0x21a/0x268 
[  125.678177]  [<008ca842>] __mutex_lock+0x52/0x968 
[  125.678180]  [<008cb192>] mutex_lock_nested+0x3a/0x48 
[  125.678184]  [<03ff800dadd6>] tap_get_minor+0xd6/0x1d8 [tap] 
[  125.678188]  [<03ff801773a2>] macvtap_device_event+0x9a/0x1a0 [macvtap] 
[  125.678191]  [<0019bfbe>] notifier_call_chain+0x56/0x98 
[  125.678195]  [<0019c1b2>] raw_notifier_call_chain+0x32/0x40 
[  125.678200]  [<0075d014>] register_netdevice+0x3f4/0x508 
[  125.678204]  [<03ff801718a0>] macvlan_common_newlink+0x360/0x430 
[macvlan] 
[  125.678207]  [<03ff80177564>] macvtap_newlink+0xbc/0xf0 [macvtap] 
[  125.678211]  [<00772e32>] rtnl_newlink+0x61a/0x880 
[  125.678214]  [<0077313c>] rtnetlink_rcv_msg+0xa4/0x248 
[  125.678219]  [<0079cec0>] netlink_rcv_skb+0xd8/0x108 
[  125.678222]  [<0076f538>] rtnetlink_rcv+0x48/0x58 
[  125.678226]  [<0079c750>] netlink_unicast+0x178/0x1f8 
[  125.678229]  [<0079cbd4>] netlink_sendmsg+0x304/0x3b0 
[  125.678233]  [<00730676>] sock_sendmsg+0x6e/0x80 
[  125.678237]  [<007311b0>] ___sys_sendmsg+0x2a0/0x2a8 
[  125.678240]  [<007324d8>] __sys_sendmsg+0x60/0xa8 
[  125.678244]  [<00732ed4>] SyS_socketcall+0x33c/0x390 
[  125.678248]  [<008d08bc>] system_call+0xc4/0x258 
[  125.678251] INFO: lockdep is turned off.
[  125.678255] BUG: sleeping function called from invalid context at 
kernel/locking/mutex.c:747
[  125.678257] in_atomic(): 1, irqs_disabled(): 0, pid: 3050, name: libvirtd
[  125.678261] INFO: lockdep is turned off.
[  125.678264] CPU: 26 PID: 3050 Comm: libvirtd Not tainted 4.11.0+ #18
[  125.678267] Hardware name: IBM 2964 NC9 704 (LPAR)
[  125.678269] Stack:
[  125.678272]00fa977cb230 00fa977cb2c0 0003 

[  125.678281]00fa977cb360 00fa977cb2d8 00fa977cb2d8 
0020
[  125.678290] 00fa0020 00fa000a 
00fa000a
[  125.678298]000c 00fa977cb328  

[  125.678308]008e2510 001139ac 00fa977cb2c0 
00fa977cb318
[  125.678323] Call Trace:
[  125.678326] ([<00113872>] show_trace+0xea/0xf0)
[  125.678330]  [<00113950>] show_stack+0x68/0xe0 
[  125.678334]  [<0057ef8c>] dump_stack+0x94/0xd8 
[  125.678337]  [<001a438e>] ___might_sleep+0x186/0x268 
[  125.678341]  [<008ca842>] __mutex_lock+0x52/0x968 
[  125.678346]  [<008cb192>] mutex_lock_nested+0x3a/0x48 
[  125.678350]  [<03ff800dadd6>] tap_get_minor+0xd6/0x1d8 [tap] 
[  125.678354]  [<03ff801773a2>] macvtap_device_event+0x9a/0x1a0 [macvtap] 
[  125.678357]  [<0019bfbe>] notifier_call_chain+0x56/0x98 
[  125.678360]  [<0019c1b2>] raw_notifier_call_chain+0x32/0x40 
[  125.678364]  [<0075d014>] register_netdevice+0x3f4/0x508 
[  125.678368]  [<03ff801718a0>] macvlan_common_newlink+0x360/0x430 
[macvlan] 
[  125.678371]  [<03ff80177564>] 

Re: [PATCH iproute2 V2 1/4] rdma: Add basic infrastructure for RDMA tool

2017-07-03 Thread Yuval Shaia
On Mon, Jul 03, 2017 at 05:06:55PM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> RDMA devices are cross-functional devices from one side,
> but very tailored for the specific markets from another.
> 
> Such diversity caused to spread of RDMA related configuration
> across various tools, e.g. devlink, ip, ethtool, ib specific and
> vendor specific solutions.
> 
> This patch adds ability to fill device and port information
> by reading RDMA netlink.
> 
> Signed-off-by: Leon Romanovsky 
> ---
>  Makefile|   2 +-
>  rdma/.gitignore |   1 +
>  rdma/Makefile   |  22 ++
>  rdma/rdma.c | 116 
>  rdma/rdma.h |  71 +
>  rdma/utils.c| 232 
> 
>  6 files changed, 443 insertions(+), 1 deletion(-)
>  create mode 100644 rdma/.gitignore
>  create mode 100644 rdma/Makefile
>  create mode 100644 rdma/rdma.c
>  create mode 100644 rdma/rdma.h
>  create mode 100644 rdma/utils.c
> 
> diff --git a/Makefile b/Makefile
> index 18de7dcb..c255063b 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -52,7 +52,7 @@ WFLAGS += -Wmissing-declarations -Wold-style-definition 
> -Wformat=2
>  CFLAGS := $(WFLAGS) $(CCOPTS) -I../include $(DEFINES) $(CFLAGS)
>  YACCFLAGS = -d -t -v
>  
> -SUBDIRS=lib ip tc bridge misc netem genl tipc devlink man
> +SUBDIRS=lib ip tc bridge misc netem genl tipc devlink rdma man
>  
>  LIBNETLINK=../lib/libnetlink.a ../lib/libutil.a
>  LDLIBS += $(LIBNETLINK)
> diff --git a/rdma/.gitignore b/rdma/.gitignore
> new file mode 100644
> index ..51fb172b
> --- /dev/null
> +++ b/rdma/.gitignore
> @@ -0,0 +1 @@
> +rdma
> diff --git a/rdma/Makefile b/rdma/Makefile
> new file mode 100644
> index ..64da2142
> --- /dev/null
> +++ b/rdma/Makefile
> @@ -0,0 +1,22 @@
> +include ../Config
> +
> +ifeq ($(HAVE_MNL),y)
> +
> +RDMA_OBJ = rdma.o utils.o
> +
> +TARGETS=rdma
> +CFLAGS += $(shell $(PKG_CONFIG) libmnl --cflags)
> +LDLIBS += $(shell $(PKG_CONFIG) libmnl --libs)
> +
> +endif
> +
> +all: $(TARGETS) $(LIBS)
> +
> +rdma:$(RDMA_OBJ) $(LIBS)
> + $(QUIET_LINK)$(CC) $^ $(LDFLAGS) $(LDLIBS) -o $@
> +
> +install: all
> + install -m 0755 $(TARGETS) $(DESTDIR)$(SBINDIR)
> +
> +clean:
> + rm -f $(RDMA_OBJ) $(TARGETS)
> diff --git a/rdma/rdma.c b/rdma/rdma.c
> new file mode 100644
> index ..29273839
> --- /dev/null
> +++ b/rdma/rdma.c
> @@ -0,0 +1,116 @@
> +/*
> + * rdma.cRDMA tool
> + *
> + *  This program is free software; you can redistribute it and/or
> + *  modify it under the terms of the GNU General Public License
> + *  as published by the Free Software Foundation; either version
> + *  2 of the License, or (at your option) any later version.
> + *
> + * Authors: Leon Romanovsky 
> + */
> +
> +#include 
> +#include 
> +
> +#include "rdma.h"
> +#include "SNAPSHOT.h"
> +
> +static void help(char *name)
> +{
> + pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
> +"where  OBJECT := { help }\n"
> +"   OPTIONS := { -V[ersion] | -d[etails]}\n", name);
> +}
> +
> +static int cmd_help(struct rdma *rd)
> +{
> + help(rd->filename);
> + return 0;

Can we change it to void?

> +}
> +
> +static int rd_cmd(struct rdma *rd)
> +{
> + const struct rdma_cmd cmds[] = {
> + { NULL, cmd_help },
> + { "help",   cmd_help },
> + { 0 }
> + };
> +
> + return rdma_exec_cmd(rd, cmds, "object");
> +}
> +
> +static int rd_init(struct rdma *rd, int argc, char **argv, char *filename)
> +{
> + uint32_t seq;
> + int ret;
> +
> + rd->filename = filename;
> + rd->argc = argc;
> + rd->argv = argv;
> + INIT_LIST_HEAD(&rd->dev_map_list);
> + rd->buff = malloc(MNL_SOCKET_BUFFER_SIZE);
> + if (!rd->buff)
> + return -ENOMEM;
> +
> + rdma_prepare_msg(rd, RDMA_NLDEV_CMD_GET, &seq, (NLM_F_REQUEST | 
> NLM_F_ACK | NLM_F_DUMP));
> + if ((ret = rdma_send_msg(rd)))
> + return ret;

Maybe it is only my perspective, but as I see it, if an init function
fails at one of its steps it needs to roll back to the starting point before
returning an error. A caller expects that if init fails there is no reason to
call the free routine.

The caller here does call free when init fails, so we are fine; just raising a
point here (a rollback sketch follows below).
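
A rough sketch of that style, with purely hypothetical names (not a
proposal for this patch):

#include <stdlib.h>
#include <errno.h>

struct example {
	void *buff;
};

/* stand-in for the real send/recv steps */
static int example_send_msg(struct example *e) { return 0; }

static int example_init(struct example *e)
{
	int ret;

	e->buff = malloc(4096);
	if (!e->buff)
		return -ENOMEM;

	ret = example_send_msg(e);
	if (ret)
		goto err_free_buff;

	return 0;

err_free_buff:
	/* roll back, so the caller never frees after a failed init */
	free(e->buff);
	e->buff = NULL;
	return ret;
}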

> +
> + return rdma_recv_msg(rd, rd_dev_init_cb, rd, seq);
> +}
> +
> +static void rd_free(struct rdma *rd)
> +{
> + free(rd->buff);
> + rdma_free_devmap(rd);
> +}
> +int main(int argc, char **argv)
> +{
> + static const struct option long_options[] = {
> + { "version",no_argument,NULL, 'V' },
> + { "help",   no_argument,NULL, 'h' },
> + { "details",no_argument,NULL, 'd' },
> + { NULL, 0, NULL, 0 }
> + };
> + bool show_details = false;
> + cha

Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Alan Stern
On Sat, 1 Jul 2017, Manfred Spraul wrote:

> As we want to remove spin_unlock_wait() and replace it with explicit
> spin_lock()/spin_unlock() calls, we can use this to simplify the
> locking.
> 
> In addition:
> - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> - The new code avoids the backwards loop.
> 
> Only slightly tested, I did not manage to trigger calls to
> nf_conntrack_all_lock().
> 
> Fixes: b16c29191dc8
> Signed-off-by: Manfred Spraul 
> Cc: 
> Cc: Sasha Levin 
> Cc: Pablo Neira Ayuso 
> Cc: netfilter-de...@vger.kernel.org
> ---
>  net/netfilter/nf_conntrack_core.c | 44 
> +--
>  1 file changed, 24 insertions(+), 20 deletions(-)
> 
> diff --git a/net/netfilter/nf_conntrack_core.c 
> b/net/netfilter/nf_conntrack_core.c
> index e847dba..1193565 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
>  
>  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
>  {
> + /* 1) Acquire the lock */
>   spin_lock(lock);
> - while (unlikely(nf_conntrack_locks_all)) {
> - spin_unlock(lock);
>  
> - /*
> -  * Order the 'nf_conntrack_locks_all' load vs. the
> -  * spin_unlock_wait() loads below, to ensure
> -  * that 'nf_conntrack_locks_all_lock' is indeed held:
> -  */
> - smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> - spin_unlock_wait(&nf_conntrack_locks_all_lock);
> - spin_lock(lock);
> - }
> + /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> + if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> + return;

As far as I can tell, this read does not need to have ACQUIRE
semantics.

You need to guarantee that two things can never happen:

(1) We read nf_conntrack_locks_all == false, and this routine's
critical section for nf_conntrack_locks[i] runs after the
(empty) critical section for that lock in 
nf_conntrack_all_lock().

(2) We read nf_conntrack_locks_all == true, and this routine's 
critical section for nf_conntrack_locks_all_lock runs before 
the critical section in nf_conntrack_all_lock().

In fact, neither one can happen even if smp_load_acquire() is replaced
with READ_ONCE().  The reason is simple enough, using this property of
spinlocks:

If critical section CS1 runs before critical section CS2 (for 
the same lock) then: (a) every write coming before CS1's
spin_unlock() will be visible to any read coming after CS2's
spin_lock(), and (b) no write coming after CS2's spin_lock()
will be visible to any read coming before CS1's spin_unlock().

Thus for (1), assuming the critical sections run in the order mentioned
above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
by (a) the read will always see the write.

Similarly for (2), since nf_conntrack_all_lock() acquires 
nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
and since nf_conntrack_lock() reads nf_conntrack_locks_all before 
releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the 
write.

Alan Stern

> +
> + /* fast path failed, unlock */
> + spin_unlock(lock);
> +
> + /* Slow path 1) get global lock */
> + spin_lock(&nf_conntrack_locks_all_lock);
> +
> + /* Slow path 2) get the lock we want */
> + spin_lock(lock);
> +
> + /* Slow path 3) release the global lock */
> + spin_unlock(&nf_conntrack_locks_all_lock);
>  }
>  EXPORT_SYMBOL_GPL(nf_conntrack_lock);
>  
> @@ -149,18 +154,17 @@ static void nf_conntrack_all_lock(void)
>   int i;
>  
>   spin_lock(&nf_conntrack_locks_all_lock);
> - nf_conntrack_locks_all = true;
>  
> - /*
> -  * Order the above store of 'nf_conntrack_locks_all' against
> -  * the spin_unlock_wait() loads below, such that if
> -  * nf_conntrack_lock() observes 'nf_conntrack_locks_all'
> -  * we must observe nf_conntrack_locks[] held:
> -  */
> - smp_mb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> + nf_conntrack_locks_all = true;
>  
>   for (i = 0; i < CONNTRACK_LOCKS; i++) {
> - spin_unlock_wait(&nf_conntrack_locks[i]);
> + spin_lock(&nf_conntrack_locks[i]);
> +
> + /* This spin_unlock provides the "release" to ensure that
> +  * nf_conntrack_locks_all==true is visible to everyone that
> +  * acquired spin_lock(&nf_conntrack_locks[]).
> +  */
> + spin_unlock(&nf_conntrack_locks[i]);
>   }
>  }





Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 02:15:14PM +0100, Will Deacon wrote:
> On Fri, Jun 30, 2017 at 03:18:40PM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 30, 2017 at 02:13:39PM +0100, Will Deacon wrote:
> > > On Fri, Jun 30, 2017 at 05:38:15AM -0700, Paul E. McKenney wrote:
> > > > I also need to check all uses of spin_is_locked().  There might no
> > > > longer be any that rely on any particular ordering...
> > > 
> > > Right. I think we're looking for the "insane case" as per 38b850a73034
> > > (which was apparently used by ipc/sem.c at the time, but no longer).
> > > 
> > > There's a usage in kernel/debug/debug_core.c, but it doesn't fill me with
> > > joy.
> > 
> > That is indeed an interesting one...  But my first round will be what
> > semantics the implementations seem to provide:
> > 
> > Acquire courtesy of TSO: s390, sparc, x86.
> > Acquire: ia64 (in reality fully ordered).
> > Control dependency: alpha, arc, arm, blackfin, hexagon, m32r, mn10300, tile,
> > xtensa.
> > Control dependency plus leading full barrier: arm64, powerpc.
> > UP-only: c6x, cris, frv, h8300, m68k, microblaze nios2, openrisc, um, 
> > unicore32.
> > 
> > Special cases:
> > metag: Acquire if !CONFIG_METAG_SMP_WRITE_REORDERING.
> >Otherwise control dependency?
> > mips: Control dependency, acquire if CONFIG_CPU_CAVIUM_OCTEON.
> > parisc: Acquire courtesy of TSO, but why barrier in smp_load_acquire?
> > sh: Acquire if one of SH4A, SH5, or J2, otherwise acquire?  UP-only?
> > 
> > Are these correct, or am I missing something with any of them?
> 
> That looks about right but, at least on ARM, I think we have to consider
> the semantics of spin_is_locked with respect to the other spin_* functions,
> rather than in isolation.
> 
> For example, ARM only has a control dependency, but spin_lock has a trailing
> smp_mb() and spin_unlock has both leading and trailing smp_mb().

Agreed, and my next step is to look at spin_lock() followed by
spin_is_locked(), not necessarily the same lock.

Thanx, Paul



Re: [CRIU] BUG: Dentry ffff9f795a08fe60{i=af565f, n=lo} still in use (1) [unmount of proc proc]

2017-07-03 Thread Andrei Vagin
On Fri, Jun 30, 2017 at 12:11:07PM -0700, Andrei Vagin wrote:
> On Thu, Jun 29, 2017 at 08:42:23PM -0500, Eric W. Biederman wrote:
> > Andrei Vagin  writes:
> > 
> > > On Thu, Jun 29, 2017 at 12:06 PM, Eric W. Biederman
> > >  wrote:
> > >> Andrei Vagin  writes:
> > >>
> > >>> Hello,
> > >>>
> > >>> We run CRIU tests on linus' tree and today we found this issue.
> > >>>
> > >>> CRIU tests are the set of small programs to check checkpoint/restore
> > >>> of different primitives (files, sockets, signals, pipes, etc).
> > >>> https://github.com/xemul/criu/tree/master/test
> > >>>
> > >>> Each test is executed three times: without namespaces, in a set of all
> > >>> namespaces except userns, in a set of all namespaces. When a test
> > >>> passed the preparation tests, it sends a signal to an executer, and
> > >>> then the executer dumps and restores tests processes, and sends a
> > >>> signal to the test back to check that everything are restored
> > >>> correctly.
> > >>
> > >> I am not certain what you are saying, and you seem to have Cc'd
> > >> every list except the netdev and netfilter lists that are needed
> > >> to deal with this.
> > >>
> > >> Are you saing that the change from Liping Zhang is needed? Or are you
> > >> saying that change introduces the problem below?
> > >
> > > Hi Eric,
> > >
> > > Here I tried to explain our usecase. I don't know which changes in the
> > > kernel affect this issue.
> > >
> > > Actually I reported about the similar problem a few month ago on the 
> > > linux-next:
> > > https://lkml.org/lkml/2017/3/10/1586
> > >
> > > So I don't think that the change from Liping Zhang affects this issue
> > > somehow. I mentioned it just to describe what kernel we used.
> > >
> > > And I don't know how to reproduce the issue. You can see from the
> > > kernel log, that the kernel worked for more than 6 hours in out case.
> > > During this perioud we run all our tests a few times, so I think there
> > > is a kind of race.
> > >
> > >>
> > >> I could not find the mentioned commits.  Are the in Linus's tree or
> > >> someone's next tree that feeds into linux-next?
> > >
> > > Here is the patch from Liping Zhang
> > > https://patchwork.ozlabs.org/patch/770887/
> > >
> > > The second mentioned commit is HEAD of the master branch in Linus' tree:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6474924e2b5ddb0030c38966adcbe3b49022
> > 
> > Apologies I somehow thought that g in the kernel version you mentioned
> > was part of the commit id and thus I could not find it.  Sigh.
> > 
> > Ok so with Linus's tree and that one extra patch from Liping Zhang you
> > have kernel problems sometimes.
> > 
> > The warning and the oops combined are quite suggestive of what is going
> > on.  It does sound like while the pid namespace is being unregistered
> > something under /proc/sys/net/ipv4/conf//... is being accessed
> > and keeping the inode busy.
> > 
> > Which then leads to an oops when the network namespace is being cleaned
> > up later, as it tries to purge all of the inodes.
> > 
> > Which raises the question how in the world can the count of the
> > superblock drop to zero with an inode in use.
> > 
> > As devinet is where things go strange this does seem completely
> > independent of what Liping Zhang was looking at.
> > 
> > This does smell like a bug in the generic code.  Hmm.
> > 
> > Is this consistently reproducible when you run your tests...
> 
> I'm not sure about that. I'm going to do some experiments to understand
> how often it is reproduced on our test system, and then will try to
> revert the patch from Konstantin.

I did a few experiments and found that the bug is reproduced within 6-12
hours on our test server. Then I reverted two patches, and the server has
been working normally for more than 24 hours already, so the bug is
probably in one of these patches.

commit e3d0065ab8535cbeee69a4c46a59f4d7360803ae
Author: Andrei Vagin 
Date:   Sun Jul 2 07:41:25 2017 +0200

Revert "proc/sysctl: prune stale dentries during unregistering"

This reverts commit d6cffbbe9a7e51eb705182965a189457c17ba8a3.

commit 2d3c50dac81011c1da4d2f7a63b84bd75287e320
Author: Andrei Vagin 
Date:   Sun Jul 2 07:40:08 2017 +0200

Revert "proc/sysctl: Don't grab i_lock under sysctl_lock."

This reverts commit ace0c791e6c3cf5ef37cad2df69f0d90ccc40ffb.


FYI: This bug has been reproduced on 4.11.7
[192885.875105] BUG: Dentry 895a3dd01240{i=4e7c09a,n=lo}  still in use (1) 
[unmount of proc proc]
[192885.875313] [ cut here ]
[192885.875328] WARNING: CPU: 1 PID: 13588 at fs/dcache.c:1445 
umount_check+0x6e/0x80
[192885.875331] Modules linked in: veth tun macvlan nf_conntrack_netlink 
xt_mark udp_diag tcp_diag inet_diag netlink_diag af_packet_diag unix_diag nfsd 
auth_rpcgss nfs_acl lockd grace binfmt_misc ip6t_rpfilter ip6t_REJECT 
nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc 
ebtable_nat ip6table_nat nf_

Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Linus Torvalds
On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
 wrote:
>
> Agreed, and my next step is to look at spin_lock() followed by
> spin_is_locked(), not necessarily the same lock.

Hmm. Most (all?) "spin_is_locked()" really should be about the same
thread that took the lock (ie it's about asserts and lock debugging).

The optimistic ABBA avoidance pattern for spinlocks *should* be

spin_lock(inner)
...
if (!try_lock(outer)) {
   spin_unlock(inner);
   .. do them in the right order ..

so I don't think spin_is_locked() should have any memory barriers.
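
(For reference, a fleshed-out sketch of the pattern above; lock names are
hypothetical and this is illustrative only, assuming <linux/spinlock.h>:)

static void take_both_locks(spinlock_t *inner, spinlock_t *outer)
{
	spin_lock(inner);
	if (!spin_trylock(outer)) {
		spin_unlock(inner);
		spin_lock(outer);	/* documented order: outer before inner */
		spin_lock(inner);
	}
	/* ... work with both locks held ... */
	spin_unlock(outer);
	spin_unlock(inner);
}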

In fact, the core function for spin_is_locked() is arguably
arch_spin_value_unlocked() which doesn't even do the access itself.

   Linus


Re: [PATCH net] virtio-net: unbreak cusmed packet for small buffer XDP

2017-07-03 Thread Michael S. Tsirkin
On Wed, Jun 28, 2017 at 08:05:06PM +0800, Jason Wang wrote:
> 
> 
> On 2017年06月28日 12:01, Michael S. Tsirkin wrote:
> > On Wed, Jun 28, 2017 at 11:40:30AM +0800, Jason Wang wrote:
> > > 
> > > On 2017年06月28日 11:31, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 28, 2017 at 10:45:18AM +0800, Jason Wang wrote:
> > > > > On 2017年06月28日 10:17, Michael S. Tsirkin wrote:
> > > > > > On Wed, Jun 28, 2017 at 10:14:34AM +0800, Jason Wang wrote:
> > > > > > > On 2017年06月28日 10:02, Michael S. Tsirkin wrote:
> > > > > > > > On Wed, Jun 28, 2017 at 09:54:03AM +0800, Jason Wang wrote:
> > > > > > > > > We should allow csumed packet for small buffer, otherwise 
> > > > > > > > > XDP_PASS
> > > > > > > > > won't work correctly.
> > > > > > > > > 
> > > > > > > > > Fixes commit bb91accf2733 ("virtio-net: XDP support for small 
> > > > > > > > > buffers")
> > > > > > > > > Signed-off-by: Jason Wang
> > > > > > > > The issue would be VIRTIO_NET_HDR_F_DATA_VALID might be set.
> > > > > > > > What do you think?
> > > > > > > I think it's safe. For XDP_PASS, it work like in the past.
> > > > > > That's the part I don't get. With DATA_VALID csum in packet is 
> > > > > > wrong, XDP
> > > > > > tools assume it's value.
> > > > > DATA_VALID is CHECKSUM_UNCESSARY on the host, and according to the 
> > > > > comment
> > > > > in skbuff.h
> > > > > 
> > > > > 
> > > > > "
> > > > >*   The hardware you're dealing with doesn't calculate the full 
> > > > > checksum
> > > > >*   (as in CHECKSUM_COMPLETE), but it does parse headers and verify
> > > > > checksums
> > > > >*   for specific protocols. For such packets it will set
> > > > > CHECKSUM_UNNECESSARY
> > > > >*   if their checksums are okay. skb->csum is still undefined in 
> > > > > this case
> > > > >*   though. A driver or device must never modify the checksum 
> > > > > field in the
> > > > >*   packet even if checksum is verified.
> > > > > "
> > > > > 
> > > > > The csum is correct I believe?
> > > > > 
> > > > > Thanks
> > > > That's on input. But I think for tun it's output, where that is 
> > > > equivalent
> > > > to CHECKSUM_NONE
> > > > 
> > > > 
> > > Yes, but the comment said:
> > > 
> > > "
> > > CKSUM_NONE:
> > >   *
> > >   *   The skb was already checksummed by the protocol, or a checksum is 
> > > not
> > >   *   required.
> > >   *
> > >   * CHECKSUM_UNNECESSARY:
> > >   *
> > >   *   This has the same meaning on as CHECKSUM_NONE for checksum offload 
> > > on
> > >   *   output.
> > >   *
> > > "
> > > 
> > > So still correct I think?
> > > 
> > > Thanks
> > Hmm maybe I mean NEEDS_CHECKSUM actually.
> > 
> > I'll need to re-read the spec.
> > 
> 
> Not sure this is an issue. But if it is, we can probably checksum the packet
> before passing it to XDP. But it would be a little slow.
> 
> Thanks



Right. I confused DATA_VALID with NEEDS_CHECKSUM.

IIUC XDP generally refuses to attach if checksum offload
is enabled.

Could you pls explain how to reproduce the issue you are seeing?

-- 
MST


[PATCH 1/1] bridge: mdb: report complete_info ptr as not a kmemleak

2017-07-03 Thread Eduardo Valentin
We currently get the following kmemleak report:
unreferenced object 0x8800039d9820 (size 32):
  comm "softirq", pid 0, jiffies 4295212383 (age 792.416s)
  hex dump (first 32 bytes):
00 0c e0 03 00 88 ff ff ff 02 00 00 00 00 00 00  
00 00 00 01 ff 11 00 02 86 dd 00 00 ff ff ff ff  
  backtrace:
[] kmemleak_alloc+0x4a/0xa0
[] kmem_cache_alloc_trace+0xb8/0x1c0
[] __br_mdb_notify+0x2a3/0x300 [bridge]
[] br_mdb_notify+0x6e/0x70 [bridge]
[] br_multicast_add_group+0x109/0x150 [bridge]
[] br_ip6_multicast_add_group+0x58/0x60 [bridge]
[] br_multicast_rcv+0x1d5/0xdb0 [bridge]
[] br_handle_frame_finish+0xcf/0x510 [bridge]
[] br_nf_hook_thresh.part.27+0xb/0x10 [br_netfilter]
[] br_nf_hook_thresh+0x48/0xb0 [br_netfilter]
[] br_nf_pre_routing_finish_ipv6+0x109/0x1d0 
[br_netfilter]
[] br_nf_pre_routing_ipv6+0xd0/0x14c [br_netfilter]
[] br_nf_pre_routing+0x197/0x3d0 [br_netfilter]
[] nf_iterate+0x52/0x60
[] nf_hook_slow+0x5c/0xb0
[] br_handle_frame+0x1a4/0x2c0 [bridge]

This patch flags the complete_info ptr object as not a leak, since it will
get freed when the completion callback runs; for the br mdb case, it is
freed in br_mdb_complete().

Cc: stable  # v4.9+
Reviewed-by: Vallish Vaidyeshwara 
Signed-off-by: Eduardo Valentin 
---
 net/bridge/br_mdb.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c
index b084548..1c81546 100644
--- a/net/bridge/br_mdb.c
+++ b/net/bridge/br_mdb.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -319,6 +320,8 @@ static void __br_mdb_notify(struct net_device *dev, struct 
net_bridge_port *p,
if (port_dev && type == RTM_NEWMDB) {
complete_info = kmalloc(sizeof(*complete_info), GFP_ATOMIC);
if (complete_info) {
+   /* This pointer is freed in br_mdb_complete() */
+   kmemleak_not_leak(complete_info);
complete_info->port = p;
__mdb_entry_to_br_ip(entry, &complete_info->ip);
mdb.obj.complete_priv = complete_info;
-- 
2.7.4



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Will Deacon
On Mon, Jul 03, 2017 at 09:40:22AM -0700, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
>  wrote:
> >
> > Agreed, and my next step is to look at spin_lock() followed by
> > spin_is_locked(), not necessarily the same lock.
> 
> Hmm. Most (all?) "spin_is_locked()" really should be about the same
> thread that took the lock (ie it's about asserts and lock debugging).
> 
> The optimistic ABBA avoidance pattern for spinlocks *should* be
> 
> spin_lock(inner)
> ...
> if (!try_lock(outer)) {
>spin_unlock(inner);
>.. do them in the right order ..
> 
> so I don't think spin_is_locked() should have any memory barriers.
> 
> In fact, the core function for spin_is_locked() is arguably
> arch_spin_value_unlocked() which doesn't even do the access itself.

Yeah, but there's some spaced-out stuff going on in kgdb_cpu_enter where
it looks to me like raw_spin_is_locked is used for synchronization. My
eyes are hurting looking at it, though.

Will


Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:
> On Sat, 1 Jul 2017, Manfred Spraul wrote:
> 
> > As we want to remove spin_unlock_wait() and replace it with explicit
> > spin_lock()/spin_unlock() calls, we can use this to simplify the
> > locking.
> > 
> > In addition:
> > - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> > - The new code avoids the backwards loop.
> > 
> > Only slightly tested, I did not manage to trigger calls to
> > nf_conntrack_all_lock().
> > 
> > Fixes: b16c29191dc8
> > Signed-off-by: Manfred Spraul 
> > Cc: 
> > Cc: Sasha Levin 
> > Cc: Pablo Neira Ayuso 
> > Cc: netfilter-de...@vger.kernel.org
> > ---
> >  net/netfilter/nf_conntrack_core.c | 44 
> > +--
> >  1 file changed, 24 insertions(+), 20 deletions(-)
> > 
> > diff --git a/net/netfilter/nf_conntrack_core.c 
> > b/net/netfilter/nf_conntrack_core.c
> > index e847dba..1193565 100644
> > --- a/net/netfilter/nf_conntrack_core.c
> > +++ b/net/netfilter/nf_conntrack_core.c
> > @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> >  
> >  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> >  {
> > +   /* 1) Acquire the lock */
> > spin_lock(lock);
> > -   while (unlikely(nf_conntrack_locks_all)) {
> > -   spin_unlock(lock);
> >  
> > -   /*
> > -* Order the 'nf_conntrack_locks_all' load vs. the
> > -* spin_unlock_wait() loads below, to ensure
> > -* that 'nf_conntrack_locks_all_lock' is indeed held:
> > -*/
> > -   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> > -   spin_unlock_wait(&nf_conntrack_locks_all_lock);
> > -   spin_lock(lock);
> > -   }
> > +   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> > +   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> > +   return;
> 
> As far as I can tell, this read does not need to have ACQUIRE
> semantics.
> 
> You need to guarantee that two things can never happen:
> 
> (1) We read nf_conntrack_locks_all == false, and this routine's
>   critical section for nf_conntrack_locks[i] runs after the
>   (empty) critical section for that lock in 
>   nf_conntrack_all_lock().
> 
> (2) We read nf_conntrack_locks_all == true, and this routine's 
>   critical section for nf_conntrack_locks_all_lock runs before 
>   the critical section in nf_conntrack_all_lock().
> 
> In fact, neither one can happen even if smp_load_acquire() is replaced
> with READ_ONCE().  The reason is simple enough, using this property of
> spinlocks:
> 
>   If critical section CS1 runs before critical section CS2 (for 
>   the same lock) then: (a) every write coming before CS1's
>   spin_unlock() will be visible to any read coming after CS2's
>   spin_lock(), and (b) no write coming after CS2's spin_lock()
>   will be visible to any read coming before CS1's spin_unlock().
> 
> Thus for (1), assuming the critical sections run in the order mentioned
> above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
> before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
> acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
> by (a) the read will always see the write.
> 
> Similarly for (2), since nf_conntrack_all_lock() acquires 
> nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
> and since nf_conntrack_lock() reads nf_conntrack_locks_all before 
> releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the 
> write.

And the Linux kernel memory model (https://lwn.net/Articles/718628/
and https://lwn.net/Articles/720550/) agrees with Alan.  Here is
a litmus test, which emulates spin_lock() with xchg_acquire() and
spin_unlock() with smp_store_release():



C C-ManfredSpraul-L1G1xchgnr.litmus

(* Expected result: Never.  *)

{
}

P0(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
{
/* Acquire local lock. */
r10 = xchg_acquire(lcl, 1);
r1 = READ_ONCE(*nfcla);
if (r1) {
smp_store_release(lcl, 0);
r11 = xchg_acquire(gbl, 1);
r12 = xchg_acquire(lcl, 1);
smp_store_release(gbl, 0);
}
r2 = READ_ONCE(*gbl_held);
WRITE_ONCE(*lcl_held, 1);
WRITE_ONCE(*lcl_held, 0);
smp_store_release(lcl, 0);
}

P1(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
{
/* Acquire global lock. */
r10 = xchg_acquire(gbl, 1);
WRITE_ONCE(*nfcla, 1);
r11 = xchg_acquire(lcl, 1);
smp_store_release(lcl, 0);
r2 = READ_ONCE(*lcl_held);
WRITE_ONCE(*gbl_held, 1);
WRITE_ONCE(*gbl_held, 0);
smp_store_release(gbl, 0);
}

exists
((0:r2=1 \/ 1:r2=1) /\ 0:r10=0 /\ 0:r11=0 /\ 0:r12=0 /\ 1:r10=0 /\ 1:r1

Re: [PATCH net-next 00/12] qed: Add iWARP support for QL4xxxx

2017-07-03 Thread Kalderon, Michal
From: David Miller 
Sent: Monday, July 3, 2017 11:59 AM

> You really have to compile test your work and do something with
> the warnings:

> drivers/net/ethernet/qlogic/qed/qed_iwarp.c:1721:5: warning: ‘ll2_syn_handle’ 
> may be used uninitialized in this funct

> This one is completely legitimate, you can goto "err" and use
> the ll2_syn_handle without it being initialized.

Sorry about that, this warning didn't appear locally (gcc 4.8.5).
Fix is on its way (after verifying with newer gcc).



[PATCH net-next] qed: initialize ll2_syn_handle at start of function

2017-07-03 Thread Michal Kalderon
Fix compilation warning
qed_iwarp.c:1721:5: warning: ll2_syn_handle may be used
uninitialized in this function

Signed-off-by: Michal Kalderon 
Signed-off-by: Ariel Elior 
---
 drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c 
b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
index 5cd20da..b251eba 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
@@ -1724,7 +1724,7 @@ int qed_iwarp_reject(void *rdma_cxt, struct 
qed_iwarp_reject_in *iparams)
int rc;
 
memset(&cm_info, 0, sizeof(cm_info));
-
+   ll2_syn_handle = p_hwfn->p_rdma_info->iwarp.ll2_syn_handle;
if (GET_FIELD(data->parse_flags,
  PARSING_AND_ERR_FLAGS_L4CHKSMWASCALCULATED) &&
GET_FIELD(data->parse_flags, PARSING_AND_ERR_FLAGS_L4CHKSMERROR)) {
@@ -1740,7 +1740,6 @@ int qed_iwarp_reject(void *rdma_cxt, struct 
qed_iwarp_reject_in *iparams)
goto err;
 
/* Check if there is a listener for this 4-tuple+vlan */
-   ll2_syn_handle = p_hwfn->p_rdma_info->iwarp.ll2_syn_handle;
listener = qed_iwarp_get_listener(p_hwfn, &cm_info);
if (!listener) {
DP_VERBOSE(p_hwfn,
-- 
1.8.3.1



Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Manfred Spraul

On 07/03/2017 07:14 PM, Paul E. McKenney wrote:

On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:

On Sat, 1 Jul 2017, Manfred Spraul wrote:


As we want to remove spin_unlock_wait() and replace it with explicit
spin_lock()/spin_unlock() calls, we can use this to simplify the
locking.

In addition:
- Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
- The new code avoids the backwards loop.

Only slightly tested, I did not manage to trigger calls to
nf_conntrack_all_lock().

Fixes: b16c29191dc8
Signed-off-by: Manfred Spraul 
Cc: 
Cc: Sasha Levin 
Cc: Pablo Neira Ayuso 
Cc: netfilter-de...@vger.kernel.org
---
  net/netfilter/nf_conntrack_core.c | 44 +--
  1 file changed, 24 insertions(+), 20 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index e847dba..1193565 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
  
  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)

  {
+   /* 1) Acquire the lock */
spin_lock(lock);
-   while (unlikely(nf_conntrack_locks_all)) {
-   spin_unlock(lock);
  
-		/*

-* Order the 'nf_conntrack_locks_all' load vs. the
-* spin_unlock_wait() loads below, to ensure
-* that 'nf_conntrack_locks_all_lock' is indeed held:
-*/
-   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
-   spin_unlock_wait(&nf_conntrack_locks_all_lock);
-   spin_lock(lock);
-   }
+   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
+   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
+   return;

As far as I can tell, this read does not need to have ACQUIRE
semantics.

You need to guarantee that two things can never happen:

 (1) We read nf_conntrack_locks_all == false, and this routine's
critical section for nf_conntrack_locks[i] runs after the
(empty) critical section for that lock in
nf_conntrack_all_lock().

 (2) We read nf_conntrack_locks_all == true, and this routine's
critical section for nf_conntrack_locks_all_lock runs before
the critical section in nf_conntrack_all_lock().

I was looking at nf_conntrack_all_unlock:
There is a smp_store_release() - which memory barrier does this pair with?

nf_conntrack_all_unlock()
	smp_store_release(a, false);
	spin_unlock(b);

nf_conntrack_lock()
	spin_lock(c);
	xx = read_once(a);
	if (xx == false)
		return;



In fact, neither one can happen even if smp_load_acquire() is replaced
with READ_ONCE().  The reason is simple enough, using this property of
spinlocks:

If critical section CS1 runs before critical section CS2 (for
the same lock) then: (a) every write coming before CS1's
spin_unlock() will be visible to any read coming after CS2's
spin_lock(), and (b) no write coming after CS2's spin_lock()
will be visible to any read coming before CS1's spin_unlock().

Does this apply? The locks are different.

Thus for (1), assuming the critical sections run in the order mentioned
above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
by (a) the read will always see the write.

Similarly for (2), since nf_conntrack_all_lock() acquires
nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all,
and since nf_conntrack_lock() reads nf_conntrack_locks_all before
releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the
write.

And the Linux kernel memory model (https://lwn.net/Articles/718628/
and https://lwn.net/Articles/720550/) agrees with Alan.  Here is
a litmus test, which emulates spin_lock() with xchg_acquire() and
spin_unlock() with smp_store_release():



C C-ManfredSpraul-L1G1xchgnr.litmus

(* Expected result: Never.  *)

{
}

P0(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
{
/* Acquire local lock. */
r10 = xchg_acquire(lcl, 1);
r1 = READ_ONCE(*nfcla);
if (r1) {
smp_store_release(lcl, 0);
r11 = xchg_acquire(gbl, 1);
r12 = xchg_acquire(lcl, 1);
smp_store_release(gbl, 0);
}
r2 = READ_ONCE(*gbl_held);
WRITE_ONCE(*lcl_held, 1);
WRITE_ONCE(*lcl_held, 0);
smp_store_release(lcl, 0);
}

P1(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
{
/* Acquire global lock. */
r10 = xchg_acquire(gbl, 1);
WRITE_ONCE(*nfcla, 1);
r11 = xchg_acquire(lcl, 1);
smp_store_release(lcl, 0);
r2 = READ_O

Re: [RFC/RFT PATCH 2/4] net: ethernat: ti: cpts: enable irq

2017-07-03 Thread Grygorii Strashko


On 06/30/2017 08:31 PM, Ivan Khoronzhuk wrote:
> On Tue, Jun 13, 2017 at 06:16:21PM -0500, Grygorii Strashko wrote:
>> There are two reasons for this change:
>> 1) enabling of HW_TS_PUSH events as suggested by Richard Cochran and
>> discussed in [1]
>> 2) fixing an TX timestamping miss issue which happens with low speed
>> ethernet connections and was reproduced on am57xx and am335x boards.
>> Issue description: With the low Ethernet connection speed CPDMA notification
>> about packet processing can be received before CPTS TX timestamp event,
>> which is sent when packet actually left CPSW while cpdma notification is
>> sent when packet pushed in CPSW fifo.  As result, when connection is slow
>> and CPU is fast enough TX timestamp can be missed and not working properly.
>>
>> This patch converts CPTS driver to use IRQ instead of polling in the
>> following way:
>>
>>   - CPTS_EV_PUSH: CPTS_EV_PUSH is used to get current CPTS counter value and
>> triggered from PTP callbacks and cpts_overflow_check() work. With this
>> change current CPTS counter value will be read in IRQ handler and saved in
>> CPTS context "cur_timestamp" field. The completion event will be signalled 
>> to the
>> requestor. The timecounter->read() will just read saved value. Access to
>> the "cur_timestamp" is protected by mutex "ptp_clk_mutex".
>>
>> cpts_get_time:
>>reinit_completion(&cpts->ts_push_complete);
>>cpts_write32(cpts, TS_PUSH, ts_push);
>>wait_for_completion_interruptible_timeout(&cpts->ts_push_complete, HZ);
>>ns = timecounter_read(&cpts->tc);
>>
>> cpts_irq:
>>case CPTS_EV_PUSH:
>>  cpts->cur_timestamp = lo;
>>  complete(&cpts->ts_push_complete);
>>
>> - CPTS_EV_TX: signals when CPTS timestamp is ready for valid TX PTP
>> packets. The TX timestamp is requested from cpts_tx_timestamp() which is
>> called for each transmitted packet from NAPI cpsw_tx_poll() callback. With
>> this change, CPTS event queue will be checked for existing CPTS_EV_TX
>> event, corresponding to the current TX packet, and if event is not found - 
>> packet
>> will be placed in CPTS TX packet queue for later processing. CPTS TX packet
>> queue will be processed from hi-priority cpts_ts_work() work which is 
>> scheduled
>> as from cpts_tx_timestamp() as from CPTS IRQ handler when CPTS_EV_TX event
>> is received.
>>
>> cpts_tx_timestamp:
>>   check if packet is PTP packet
>>   try to find corresponding CPTS_EV_TX event
>> if found: report timestamp
>> if not found: put packet in TX queue, schedule cpts_ts_work()
> I've not read patch itself yet, but why schedule is needed if timestamp is not
> found? Anyway it is scheduled with irq when timestamp arrives. It's rather 
> should
> be scheduled if timestamp is found,

CPTS IRQ, cpts_ts_work and Net SoftIRQ processing might happen on different
CPUs. As a result, the CPTS IRQ can detect a TX event and schedule cpts_ts_work
on one CPU while that work races with the SKB processing in Net SoftIRQ on
another, so both the SKB and the CPTS TX event may end up queued with no
cpts_ts_work scheduled until the next CPTS event is received (worst case: the
cpts_overflow_check period).

The situation becomes even more complex on an RT kernel, where everything is
executed in kthread context.

> 
>>
>> cpts_irq:
>>   case CPTS_EV_TX:
>>   put event in CPTS event queue
>>   schedule cpts_ts_work()
>>
>> cpts_ts_work:
>> for each packet in  CPTS TX packet queue
>> try to find corresponding CPTS_EV_TX event
>> if found: report timestamp
>> if timeout: drop packet
>>
>> - CPTS_EV_RX: signals when CPTS timestamp is ready for valid RX PTP
>> packets. The RX timestamp is requested from cpts_rx_timestamp() which is
>> called for each received packet from NAPI cpsw_rx_poll() callback. With
>> this change, CPTS event queue will be checked for existing CPTS_EV_RX
>> event, corresponding to the current RX packet, and if event is not found - 
>> packet
>> will be placed in CPTS RX packet queue for later processing. CPTS RX packet
>> queue will be processed from hi-priority cpts_ts_work() work which is 
>> scheduled
>> as from cpts_rx_timestamp() as from CPTS IRQ handler when CPTS_EV_RX event
>> is received. cpts_rx_timestamp() has been updated to return failure in case
>> of RX timestamp processing delaying and, in such cases, caller of
>> cpts_rx_timestamp() should not call netif_receive_skb().
> It's much similar to tx path, but fix is needed for tx only according to 
> targets
> of patch, why rx uses the same approach? Does rx has same isue, then how it 
> happens
> as the delay caused race for tx packet should allow race for rx packet?
> tx : send packet -> tx poll (no ts) -> latency -> hw timstamp (race)
> rx : hw timestamp -> latency -> rx poll (ts) -> rx packet (no race)
> 
> Is to be consistent or race is realy present?

I've hit it on RT and then modeled it using request_threaded_irq().

CPTS timestamping was part of NET RX/TX path when used in polling mode, but 
after
switching CPT

Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Alan Stern
On Mon, 3 Jul 2017, Manfred Spraul wrote:

> >>> + /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> >>> + if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> >>> + return;
> >> As far as I can tell, this read does not need to have ACQUIRE
> >> semantics.
> >>
> >> You need to guarantee that two things can never happen:
> >>
> >>  (1) We read nf_conntrack_locks_all == false, and this routine's
> >>critical section for nf_conntrack_locks[i] runs after the
> >>(empty) critical section for that lock in
> >>nf_conntrack_all_lock().
> >>
> >>  (2) We read nf_conntrack_locks_all == true, and this routine's
> >>critical section for nf_conntrack_locks_all_lock runs before
> >>the critical section in nf_conntrack_all_lock().
> I was looking at nf_conntrack_all_unlock:
> There is a smp_store_release() - which memory barrier does this pair with?
> 
> nf_conntrack_all_unlock()
>  
>  smp_store_release(a, false)
>  spin_unlock(b);
> 
> nf_conntrack_lock()
>  spin_lock(c);
>  xx=read_once(a)
>  if (xx==false)
>  return
>  

Ah, I see your point.  Yes, I did wonder about what would happen when
nf_conntrack_locks_all was set back to false.  But I didn't think about
it any further, because the relevant code wasn't in your patch.

> I tried to pair the memory barriers:
> nf_conntrack_all_unlock() contains a smp_store_release().
> What does that pair with?

You are right, this does need to be smp_load_acquire() after all.  
Perhaps the preceding comment should mention that it pairs with the 
smp_store_release() from an earlier invocation of 
nf_conntrack_all_unlock().

(Alternatively, you could make nf_conntrack_all_unlock() do a
lock+unlock on all the locks in the array, just like
nf_conntrack_all_lock().  But of course, that would be a lot less
efficient.)
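
To make the pairing concrete, a minimal sketch of the two sides (the fast
path follows the patch earlier in this thread; the unlock side and the
comments are only an illustration, not a verbatim copy of
nf_conntrack_core.c):

/* Release side: clear the flag only after everything done while holding
 * the global lock is visible to other CPUs. */
static void nf_conntrack_all_unlock(void)
{
	/* Pairs with the smp_load_acquire() below. */
	smp_store_release(&nf_conntrack_locks_all, false);
	spin_unlock(&nf_conntrack_locks_all_lock);
}

/* Acquire side: the fast path from the patch in this thread. */
void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
{
	spin_lock(lock);
	/* Pairs with the smp_store_release() above. */
	if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
		return;
	/* slow path elided: serialize against nf_conntrack_all_lock() */
}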

Alan Stern



Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Alan Stern
On Mon, 3 Jul 2017, Paul E. McKenney wrote:

> On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:
> > On Sat, 1 Jul 2017, Manfred Spraul wrote:
> > 
> > > As we want to remove spin_unlock_wait() and replace it with explicit
> > > spin_lock()/spin_unlock() calls, we can use this to simplify the
> > > locking.
> > > 
> > > In addition:
> > > - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> > > - The new code avoids the backwards loop.
> > > 
> > > Only slightly tested, I did not manage to trigger calls to
> > > nf_conntrack_all_lock().
> > > 
> > > Fixes: b16c29191dc8
> > > Signed-off-by: Manfred Spraul 
> > > Cc: 
> > > Cc: Sasha Levin 
> > > Cc: Pablo Neira Ayuso 
> > > Cc: netfilter-de...@vger.kernel.org
> > > ---
> > >  net/netfilter/nf_conntrack_core.c | 44 
> > > +--
> > >  1 file changed, 24 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/net/netfilter/nf_conntrack_core.c 
> > > b/net/netfilter/nf_conntrack_core.c
> > > index e847dba..1193565 100644
> > > --- a/net/netfilter/nf_conntrack_core.c
> > > +++ b/net/netfilter/nf_conntrack_core.c
> > > @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> > >  
> > >  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> > >  {
> > > + /* 1) Acquire the lock */
> > >   spin_lock(lock);
> > > - while (unlikely(nf_conntrack_locks_all)) {
> > > - spin_unlock(lock);
> > >  
> > > - /*
> > > -  * Order the 'nf_conntrack_locks_all' load vs. the
> > > -  * spin_unlock_wait() loads below, to ensure
> > > -  * that 'nf_conntrack_locks_all_lock' is indeed held:
> > > -  */
> > > - smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> > > - spin_unlock_wait(&nf_conntrack_locks_all_lock);
> > > - spin_lock(lock);
> > > - }
> > > + /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> > > + if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> > > + return;
> > 
> > As far as I can tell, this read does not need to have ACQUIRE
> > semantics.
> > 
> > You need to guarantee that two things can never happen:
> > 
> > (1) We read nf_conntrack_locks_all == false, and this routine's
> > critical section for nf_conntrack_locks[i] runs after the
> > (empty) critical section for that lock in 
> > nf_conntrack_all_lock().
> > 
> > (2) We read nf_conntrack_locks_all == true, and this routine's 
> > critical section for nf_conntrack_locks_all_lock runs before 
> > the critical section in nf_conntrack_all_lock().
> > 
> > In fact, neither one can happen even if smp_load_acquire() is replaced
> > with READ_ONCE().  The reason is simple enough, using this property of
> > spinlocks:
> > 
> > If critical section CS1 runs before critical section CS2 (for 
> > the same lock) then: (a) every write coming before CS1's
> > spin_unlock() will be visible to any read coming after CS2's
> > spin_lock(), and (b) no write coming after CS2's spin_lock()
> > will be visible to any read coming before CS1's spin_unlock().
> > 
> > Thus for (1), assuming the critical sections run in the order mentioned
> > above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
> > before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
> > acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
> > by (a) the read will always see the write.
> > 
> > Similarly for (2), since nf_conntrack_all_lock() acquires 
> > nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
> > and since nf_conntrack_lock() reads nf_conntrack_locks_all before 
> > releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the 
> > write.
> 
> And the Linux kernel memory model (https://lwn.net/Articles/718628/
> and https://lwn.net/Articles/720550/) agrees with Alan.  Here is
> a litmus test, which emulates spin_lock() with xchg_acquire() and
> spin_unlock() with smp_store_release():
> 
> 
> 
> C C-ManfredSpraul-L1G1xchgnr.litmus
> 
> (* Expected result: Never.  *)
> 
> {
> }
> 
> P0(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
> {
>   /* Acquire local lock. */
>   r10 = xchg_acquire(lcl, 1);
>   r1 = READ_ONCE(*nfcla);
>   if (r1) {
>   smp_store_release(lcl, 0);
>   r11 = xchg_acquire(gbl, 1);
>   r12 = xchg_acquire(lcl, 1);
>   smp_store_release(gbl, 0);
>   }
>   r2 = READ_ONCE(*gbl_held);
>   WRITE_ONCE(*lcl_held, 1);
>   WRITE_ONCE(*lcl_held, 0);
>   smp_store_release(lcl, 0);
> }
> 
> P1(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int *lcl_held)
> {
>   /* Acquire global lock. */
>   r10 = xchg_acquire(gbl, 1);
>   WRITE_ONCE(*nfcla, 1);
>   r11 = xchg_acquire(lcl, 1);
>   smp_store_release(lcl, 0);
>   r2

Re: [PATCH RFC 01/26] netfilter: Replace spin_unlock_wait() with lock/unlock pair

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 04:04:14PM -0400, Alan Stern wrote:
> On Mon, 3 Jul 2017, Paul E. McKenney wrote:
> 
> > On Mon, Jul 03, 2017 at 10:39:49AM -0400, Alan Stern wrote:
> > > On Sat, 1 Jul 2017, Manfred Spraul wrote:
> > > 
> > > > As we want to remove spin_unlock_wait() and replace it with explicit
> > > > spin_lock()/spin_unlock() calls, we can use this to simplify the
> > > > locking.
> > > > 
> > > > In addition:
> > > > - Reading nf_conntrack_locks_all needs ACQUIRE memory ordering.
> > > > - The new code avoids the backwards loop.
> > > > 
> > > > Only slightly tested, I did not manage to trigger calls to
> > > > nf_conntrack_all_lock().
> > > > 
> > > > Fixes: b16c29191dc8
> > > > Signed-off-by: Manfred Spraul 
> > > > Cc: 
> > > > Cc: Sasha Levin 
> > > > Cc: Pablo Neira Ayuso 
> > > > Cc: netfilter-de...@vger.kernel.org
> > > > ---
> > > >  net/netfilter/nf_conntrack_core.c | 44 
> > > > +--
> > > >  1 file changed, 24 insertions(+), 20 deletions(-)
> > > > 
> > > > diff --git a/net/netfilter/nf_conntrack_core.c 
> > > > b/net/netfilter/nf_conntrack_core.c
> > > > index e847dba..1193565 100644
> > > > --- a/net/netfilter/nf_conntrack_core.c
> > > > +++ b/net/netfilter/nf_conntrack_core.c
> > > > @@ -96,19 +96,24 @@ static struct conntrack_gc_work conntrack_gc_work;
> > > >  
> > > >  void nf_conntrack_lock(spinlock_t *lock) __acquires(lock)
> > > >  {
> > > > +   /* 1) Acquire the lock */
> > > > spin_lock(lock);
> > > > -   while (unlikely(nf_conntrack_locks_all)) {
> > > > -   spin_unlock(lock);
> > > >  
> > > > -   /*
> > > > -* Order the 'nf_conntrack_locks_all' load vs. the
> > > > -* spin_unlock_wait() loads below, to ensure
> > > > -* that 'nf_conntrack_locks_all_lock' is indeed held:
> > > > -*/
> > > > -   smp_rmb(); /* spin_lock(&nf_conntrack_locks_all_lock) */
> > > > -   spin_unlock_wait(&nf_conntrack_locks_all_lock);
> > > > -   spin_lock(lock);
> > > > -   }
> > > > +   /* 2) read nf_conntrack_locks_all, with ACQUIRE semantics */
> > > > +   if (likely(smp_load_acquire(&nf_conntrack_locks_all) == false))
> > > > +   return;
> > > 
> > > As far as I can tell, this read does not need to have ACQUIRE
> > > semantics.
> > > 
> > > You need to guarantee that two things can never happen:
> > > 
> > > (1) We read nf_conntrack_locks_all == false, and this routine's
> > >   critical section for nf_conntrack_locks[i] runs after the
> > >   (empty) critical section for that lock in 
> > >   nf_conntrack_all_lock().
> > > 
> > > (2) We read nf_conntrack_locks_all == true, and this routine's 
> > >   critical section for nf_conntrack_locks_all_lock runs before 
> > >   the critical section in nf_conntrack_all_lock().
> > > 
> > > In fact, neither one can happen even if smp_load_acquire() is replaced
> > > with READ_ONCE().  The reason is simple enough, using this property of
> > > spinlocks:
> > > 
> > >   If critical section CS1 runs before critical section CS2 (for 
> > >   the same lock) then: (a) every write coming before CS1's
> > >   spin_unlock() will be visible to any read coming after CS2's
> > >   spin_lock(), and (b) no write coming after CS2's spin_lock()
> > >   will be visible to any read coming before CS1's spin_unlock().
> > > 
> > > Thus for (1), assuming the critical sections run in the order mentioned
> > > above, since nf_conntrack_all_lock() writes to nf_conntrack_locks_all
> > > before releasing nf_conntrack_locks[i], and since nf_conntrack_lock()
> > > acquires nf_conntrack_locks[i] before reading nf_conntrack_locks_all,
> > > by (a) the read will always see the write.
> > > 
> > > Similarly for (2), since nf_conntrack_all_lock() acquires 
> > > nf_conntrack_locks_all_lock before writing to nf_conntrack_locks_all, 
> > > and since nf_conntrack_lock() reads nf_conntrack_locks_all before 
> > > releasing nf_conntrack_locks_all_lock, by (b) the read cannot see the 
> > > write.
> > 
> > And the Linux kernel memory model (https://lwn.net/Articles/718628/
> > and https://lwn.net/Articles/720550/) agrees with Alan.  Here is
> > a litmus test, which emulates spin_lock() with xchg_acquire() and
> > spin_unlock() with smp_store_release():
> > 
> > 
> > 
> > C C-ManfredSpraul-L1G1xchgnr.litmus
> > 
> > (* Expected result: Never.  *)
> > 
> > {
> > }
> > 
> > P0(int *nfcla, spinlock_t *gbl, int *gbl_held, spinlock_t *lcl, int 
> > *lcl_held)
> > {
> > /* Acquire local lock. */
> > r10 = xchg_acquire(lcl, 1);
> > r1 = READ_ONCE(*nfcla);
> > if (r1) {
> > smp_store_release(lcl, 0);
> > r11 = xchg_acquire(gbl, 1);
> > r12 = xchg_acquire(lcl, 1);
> > smp_store_release(gbl, 0);
> > }
> > r2 = READ_ONCE(*gbl_held);
> > WRITE_ONCE(*lcl_h

Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 09:40:22AM -0700, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
>  wrote:
> >
> > Agreed, and my next step is to look at spin_lock() followed by
> > spin_is_locked(), not necessarily the same lock.
> 
> Hmm. Most (all?) "spin_is_locked()" really should be about the same
> thread that took the lock (ie it's about asserts and lock debugging).

Good to know, that does make things easier.  ;-)

I am not certain that it is feasible to automatically recognize
non-assert/non-debugging use cases of spin_is_locked(), but there is
always manual inspection.

> The optimistic ABBA avoidance pattern for spinlocks *should* be
> 
> spin_lock(inner)
> ...
> if (!try_lock(outer)) {
>spin_unlock(inner);
>.. do them in the right order ..
> 
> so I don't think spin_is_locked() should have any memory barriers.
> 
> In fact, the core function for spin_is_locked() is arguably
> arch_spin_value_unlocked() which doesn't even do the access itself.

OK, so we should rework any cases where people are relying on acquisition
of one spin_lock() being ordered with a later spin_is_locked() on some
other lock by that same thread.
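
For reference, the pattern Linus describes, spelled out as a small
self-contained sketch (the lock names are made up, not taken from kgdb or
any other in-tree code):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(inner_lock);
static DEFINE_SPINLOCK(outer_lock);

/* Optimistic ABBA avoidance: try the outer lock while already holding
 * the inner one; on failure, back off and retake both in the canonical
 * outer-then-inner order. */
static void take_both_locks(void)
{
	spin_lock(&inner_lock);
	if (!spin_trylock(&outer_lock)) {
		spin_unlock(&inner_lock);
		spin_lock(&outer_lock);
		spin_lock(&inner_lock);
	}
	/* ... work that needs both locks ... */
	spin_unlock(&outer_lock);
	spin_unlock(&inner_lock);
}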

Thanx, Paul



Re: [PATCH net-next] qed: initialize ll2_syn_handle at start of function

2017-07-03 Thread David Miller
From: Michal Kalderon 
Date: Mon, 3 Jul 2017 21:55:25 +0300

> Fix compilation warning
> qed_iwarp.c:1721:5: warning: ll2_syn_handle may be used
> uninitialized in this function
> 
> Signed-off-by: Michal Kalderon 
> Signed-off-by: Ariel Elior 

Applied, thanks for fixing this so fast.


Re: [RFC/RFT PATCH 2/4] net: ethernat: ti: cpts: enable irq

2017-07-03 Thread Ivan Khoronzhuk
On Mon, Jul 03, 2017 at 02:31:06PM -0500, Grygorii Strashko wrote:
> 
> 
> On 06/30/2017 08:31 PM, Ivan Khoronzhuk wrote:
> > On Tue, Jun 13, 2017 at 06:16:21PM -0500, Grygorii Strashko wrote:
> >> There are two reasons for this change:
> >> 1) enabling of HW_TS_PUSH events as suggested by Richard Cochran and
> >> discussed in [1]
> >> 2) fixing an TX timestamping miss issue which happens with low speed
> >> ethernet connections and was reproduced on am57xx and am335x boards.
> >> Issue description: With the low Ethernet connection speed CPDMA 
> >> notification
> >> about packet processing can be received before CPTS TX timestamp event,
> >> which is sent when packet actually left CPSW while cpdma notification is
> >> sent when packet pushed in CPSW fifo.  As result, when connection is slow
> >> and CPU is fast enough TX timestamp can be missed and not working properly.
> >>
> >> This patch converts CPTS driver to use IRQ instead of polling in the
> >> following way:
> >>
> >>   - CPTS_EV_PUSH: CPTS_EV_PUSH is used to get current CPTS counter value 
> >> and
> >> triggered from PTP callbacks and cpts_overflow_check() work. With this
> >> change current CPTS counter value will be read in IRQ handler and saved in
> >> CPTS context "cur_timestamp" field. The completion event will be signalled 
> >> to the
> >> requestor. The timecounter->read() will just read saved value. Access to
> >> the "cur_timestamp" is protected by mutex "ptp_clk_mutex".
> >>
> >> cpts_get_time:
> >>reinit_completion(&cpts->ts_push_complete);
> >>cpts_write32(cpts, TS_PUSH, ts_push);
> >>wait_for_completion_interruptible_timeout(&cpts->ts_push_complete, HZ);
> >>ns = timecounter_read(&cpts->tc);
> >>
> >> cpts_irq:
> >>case CPTS_EV_PUSH:
> >>cpts->cur_timestamp = lo;
> >>complete(&cpts->ts_push_complete);
> >>
> >> - CPTS_EV_TX: signals when CPTS timestamp is ready for valid TX PTP
> >> packets. The TX timestamp is requested from cpts_tx_timestamp() which is
> >> called for each transmitted packet from NAPI cpsw_tx_poll() callback. With
> >> this change, CPTS event queue will be checked for existing CPTS_EV_TX
> >> event, corresponding to the current TX packet, and if event is not found - 
> >> packet
> >> will be placed in CPTS TX packet queue for later processing. CPTS TX packet
> >> queue will be processed from hi-priority cpts_ts_work() work which is 
> >> scheduled
> >> as from cpts_tx_timestamp() as from CPTS IRQ handler when CPTS_EV_TX event
> >> is received.
> >>
> >> cpts_tx_timestamp:
> >>   check if packet is PTP packet
> >>   try to find corresponding CPTS_EV_TX event
> >> if found: report timestamp
> >> if not found: put packet in TX queue, schedule cpts_ts_work()
> > I've not read patch itself yet, but why schedule is needed if timestamp is 
> > not
> > found? Anyway it is scheduled with irq when timestamp arrives. It's rather 
> > should
> > be scheduled if timestamp is found,
> 
> CPTS IRQ, cpts_ts_work and Net SoftIRQ processing might happen on
> different CPUs, as result - CPTS IRQ will detect TX event and schedule 
> cpts_ts_work on
> one CPU and this work might race with SKB processing in Net SoftIRQ on 
> another, so
> both SKB and CPTS TX event might be queued, but no cpts_ts_work scheduled 
> until
> next CPTS event is received (worst case for cpts_overflow_check period).

Wouldn't it be better to put the packet in the TX/RX queue under cpts->lock?
Then there would probably be no need to schedule the work from rx/tx
timestamping, and cpts_ts_work() would not be scheduled twice. I know it makes
the IRQ handler wait a little, but it waits anyway while the Net SoftIRQ
retrieves the timestamp.

> 
> Situation became even more complex on RT kernel where everything is
> executed in kthread contexts.
> 
> > 
> >>
> >> cpts_irq:
> >>   case CPTS_EV_TX:
> >>   put event in CPTS event queue
> >>   schedule cpts_ts_work()
> >>
> >> cpts_ts_work:
> >> for each packet in  CPTS TX packet queue
> >> try to find corresponding CPTS_EV_TX event
> >> if found: report timestamp
> >> if timeout: drop packet
> >>
> >> - CPTS_EV_RX: signals when CPTS timestamp is ready for valid RX PTP
> >> packets. The RX timestamp is requested from cpts_rx_timestamp() which is
> >> called for each received packet from NAPI cpsw_rx_poll() callback. With
> >> this change, CPTS event queue will be checked for existing CPTS_EV_RX
> >> event, corresponding to the current RX packet, and if event is not found - 
> >> packet
> >> will be placed in CPTS RX packet queue for later processing. CPTS RX packet
> >> queue will be processed from hi-priority cpts_ts_work() work which is 
> >> scheduled
> >> as from cpts_rx_timestamp() as from CPTS IRQ handler when CPTS_EV_RX event
> >> is received. cpts_rx_timestamp() has been updated to return failure in case
> >> of RX timestamp processing delaying and, in such cases, caller of
> >> cpts_rx_timestamp() should not call netif_receive_skb().
> > It's much similar to tx path, but

[PATCH net-next] bridge: allow ext learned entries to change ports

2017-07-03 Thread Roopa Prabhu
From: Nikolay Aleksandrov 

The current code silently ignores a change of port in the request
message. This patch makes sure the port is modified and a
notification is sent to userspace.

Fixes: cf6b8e1eedff ("bridge: add API to notify bridge driver of learned FBD on 
offloaded device")
Signed-off-by: Nikolay Aleksandrov 
Signed-off-by: Roopa Prabhu 
---
 net/bridge/br_fdb.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index fef7872..a5e4a73 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -1079,8 +1079,9 @@ void br_fdb_unsync_static(struct net_bridge *br, struct 
net_bridge_port *p)
 int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
  const unsigned char *addr, u16 vid)
 {
-   struct hlist_head *head;
struct net_bridge_fdb_entry *fdb;
+   struct hlist_head *head;
+   bool modified = false;
int err = 0;
 
spin_lock_bh(&br->hash_lock);
@@ -1095,14 +1096,25 @@ int br_fdb_external_learn_add(struct net_bridge *br, 
struct net_bridge_port *p,
}
fdb->added_by_external_learn = 1;
fdb_notify(br, fdb, RTM_NEWNEIGH);
-   } else if (fdb->added_by_external_learn) {
-   /* Refresh entry */
-   fdb->updated = fdb->used = jiffies;
-   } else if (!fdb->added_by_user) {
-   /* Take over SW learned entry */
-   fdb->added_by_external_learn = 1;
+   } else {
fdb->updated = jiffies;
-   fdb_notify(br, fdb, RTM_NEWNEIGH);
+
+   if (fdb->dst != p) {
+   fdb->dst = p;
+   modified = true;
+   }
+
+   if (fdb->added_by_external_learn) {
+   /* Refresh entry */
+   fdb->used = jiffies;
+   } else if (!fdb->added_by_user) {
+   /* Take over SW learned entry */
+   fdb->added_by_external_learn = 1;
+   modified = true;
+   }
+
+   if (modified)
+   fdb_notify(br, fdb, RTM_NEWNEIGH);
}
 
 err_unlock:
-- 
1.9.1



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 06:13:38PM +0100, Will Deacon wrote:
> On Mon, Jul 03, 2017 at 09:40:22AM -0700, Linus Torvalds wrote:
> > On Mon, Jul 3, 2017 at 9:18 AM, Paul E. McKenney
> >  wrote:
> > >
> > > Agreed, and my next step is to look at spin_lock() followed by
> > > spin_is_locked(), not necessarily the same lock.
> > 
> > Hmm. Most (all?) "spin_is_locked()" really should be about the same
> > thread that took the lock (ie it's about asserts and lock debugging).
> > 
> > The optimistic ABBA avoidance pattern for spinlocks *should* be
> > 
> > spin_lock(inner)
> > ...
> > if (!try_lock(outer)) {
> >spin_unlock(inner);
> >.. do them in the right order ..
> > 
> > so I don't think spin_is_locked() should have any memory barriers.
> > 
> > In fact, the core function for spin_is_locked() is arguably
> > arch_spin_value_unlocked() which doesn't even do the access itself.
> 
> Yeah, but there's some spaced-out stuff going on in kgdb_cpu_enter where
> it looks to me like raw_spin_is_locked is used for synchronization. My
> eyes are hurting looking at it, though.

That certainly is one interesting function, isn't it?  I wonder what
happens if you replace the raw_spin_is_locked() calls with an
unlock under a trylock check?  ;-)

Thanx, Paul



[PATCH net-next] mpls: route get support

2017-07-03 Thread Roopa Prabhu
From: Roopa Prabhu 

This patch adds RTM_GETROUTE doit handler for mpls routes.

Input:
RTA_DST - input label
RTA_NEWDST - labels in packet for multipath selection

By default the getroute handler returns the matched
nexthop label, via and oif.

With the RTM_F_FIB_MATCH flag, the full matched route is
returned.

example (with patched iproute2):
$ip -f mpls route show
101
nexthop as to 102/103 via inet 172.16.2.2 dev virt1-2
nexthop as to 302/303 via inet 172.16.12.2 dev virt1-12
201
nexthop as to 202/203 via inet6 2001:db8:2::2 dev virt1-2
nexthop as to 402/403 via inet6 2001:db8:12::2 dev virt1-12

$ip -f mpls route get 103
RTNETLINK answers: Network is unreachable

$ip -f mpls route get 101
101 as to 102/103 via inet 172.16.2.2 dev virt1-2

$ip -f mpls route get as to 302/303 101
101 as to 302/303 via inet 172.16.12.2 dev virt1-12

$ip -f mpls route get fibmatch 103
RTNETLINK answers: Network is unreachable

$ip -f mpls route get fibmatch 101
101
nexthop as to 102/103 via inet 172.16.2.2 dev virt1-2
nexthop as to 302/303 via inet 172.16.12.2 dev virt1-12

Signed-off-by: Roopa Prabhu 
---
 net/mpls/af_mpls.c | 162 -
 1 file changed, 161 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 554daf3..4b6ff85 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2110,6 +2110,166 @@ static void rtmsg_lfib(int event, u32 label, struct 
mpls_route *rt,
rtnl_set_sk_err(net, RTNLGRP_MPLS_ROUTE, err);
 }
 
+static int mpls_getroute(struct sk_buff *in_skb, struct nlmsghdr *in_nlh,
+struct netlink_ext_ack *extack)
+{
+   struct net *net = sock_net(in_skb->sk);
+   u32 portid = NETLINK_CB(in_skb).portid;
+   struct nlattr *tb[RTA_MAX + 1];
+   u32 labels[MAX_NEW_LABELS];
+   struct mpls_shim_hdr *hdr;
+   unsigned int hdr_size = 0;
+   struct net_device *dev;
+   struct mpls_route *rt;
+   struct rtmsg *rtm, *r;
+   struct nlmsghdr *nlh;
+   struct sk_buff *skb;
+   struct mpls_nh *nh;
+   int err = -EINVAL;
+   u32 in_label;
+   u8 n_labels;
+
+   err = nlmsg_parse(in_nlh, sizeof(*rtm), tb, RTA_MAX,
+ rtm_ipv4_policy, extack);
+   if (err < 0)
+   goto errout;
+
+   rtm = nlmsg_data(in_nlh);
+
+   if (tb[RTA_DST]) {
+   u8 label_count;
+
+   if (nla_get_labels(tb[RTA_DST], 1, &label_count,
+  &in_label, extack))
+   goto errout;
+
+   if (in_label < MPLS_LABEL_FIRST_UNRESERVED)
+   goto errout;
+   }
+
+   rt = mpls_route_input_rcu(net, in_label);
+   if (!rt) {
+   err = -ENETUNREACH;
+   goto errout;
+   }
+
+   if (rtm->rtm_flags & RTM_F_FIB_MATCH) {
+   skb = nlmsg_new(lfib_nlmsg_size(rt), GFP_KERNEL);
+   if (!skb) {
+   err = -ENOBUFS;
+   goto errout;
+   }
+
+   err = mpls_dump_route(skb, portid, in_nlh->nlmsg_seq,
+ RTM_NEWROUTE, in_label, rt, 0);
+   if (err < 0) {
+   /* -EMSGSIZE implies BUG in lfib_nlmsg_size */
+   WARN_ON(err == -EMSGSIZE);
+   goto errout_free;
+   }
+
+   return rtnl_unicast(skb, net, portid);
+   }
+
+   if (tb[RTA_NEWDST]) {
+   if (nla_get_labels(tb[RTA_NEWDST], MAX_NEW_LABELS, &n_labels,
+  labels, extack) != 0) {
+   err = -EINVAL;
+   goto errout;
+   }
+
+   hdr_size = n_labels * sizeof(struct mpls_shim_hdr);
+   }
+
+   skb = alloc_skb(NLMSG_GOODSIZE, GFP_KERNEL);
+   if (!skb) {
+   err = -ENOBUFS;
+   goto errout;
+   }
+
+   skb->protocol = htons(ETH_P_MPLS_UC);
+
+   if (hdr_size) {
+   bool bos;
+   int i;
+
+   if (skb_cow(skb, hdr_size)) {
+   err = -ENOBUFS;
+   goto errout_free;
+   }
+
+   skb_reserve(skb, hdr_size);
+   skb_push(skb, hdr_size);
+   skb_reset_network_header(skb);
+
+   /* Push new labels */
+   hdr = mpls_hdr(skb);
+   bos = true;
+   for (i = n_labels - 1; i >= 0; i--) {
+   hdr[i] = mpls_entry_encode(labels[i],
+  1, 0, bos);
+   bos = false;
+   }
+   }
+
+   nh = mpls_select_multipath(rt, skb);
+   if (!nh) {
+   err = -ENETUNREACH;
+   goto errout_free;
+   }
+
+   if (hdr_size) {
+   skb_pull(skb, hdr_size);
+  

Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Linus Torvalds
On Mon, Jul 3, 2017 at 3:30 PM, Paul E. McKenney
 wrote:
>
> That certainly is one interesting function, isn't it?  I wonder what
> happens if you replace the raw_spin_is_locked() calls with an
> unlock under a trylock check?  ;-)

Deadlock due to interrupts again?

Didn't your spin_unlock_wait() patches teach you anything? Checking
state is fundamentally different from taking the lock. Even a trylock.

I guess you could try with the irqsave versions. But no, we're not doing that.

Linus


Re: [PATCH] vmalloc: respect the GFP_NOIO and GFP_NOFS flags

2017-07-03 Thread Mikulas Patocka


On Mon, 3 Jul 2017, Michal Hocko wrote:

> We can add a warning (or move it from kvmalloc) and hope that the
> respective maintainers will fix those places properly. The reason I
> didn't add the warning to vmalloc and kept it in kvmalloc was to catch
> only new users rather than suddenly splat on existing ones. Note that
> there are users with panic_on_warn enabled.
> 
> Considering how many NOFS users we have in tree I would rather work with
> maintainers to fix them.

So - do you want this patch?

I still believe that the previous patch that pushes 
memalloc_noio/nofs_save into __vmalloc is better than this.

Currently there are 28 __vmalloc callers that use GFP_NOIO or GFP_NOFS, 
three of them already use memalloc_noio_save, 25 don't.
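
For context, a minimal sketch of the alternative being referred to, i.e.
handling GFP_NOIO/GFP_NOFS once around the allocation via the scoped API
from linux/sched/mm.h instead of in every caller.  __vmalloc_guts() below
is a made-up stand-in for the real vmalloc internals; only the
memalloc_*_save()/restore() calls are real:

#include <linux/sched/mm.h>
#include <linux/gfp.h>

static void *vmalloc_scoped_gfp(unsigned long size, gfp_t gfp_mask)
{
	unsigned int flags = 0;
	void *ret;

	if (!(gfp_mask & __GFP_IO))
		flags = memalloc_noio_save();
	else if (!(gfp_mask & __GFP_FS))
		flags = memalloc_nofs_save();

	/* do the allocation with IO/FS allowed in the mask itself;
	 * the scoped flags above keep reclaim from recursing */
	ret = __vmalloc_guts(size, gfp_mask | __GFP_IO | __GFP_FS);

	if (!(gfp_mask & __GFP_IO))
		memalloc_noio_restore(flags);
	else if (!(gfp_mask & __GFP_FS))
		memalloc_nofs_restore(flags);

	return ret;
}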

Mikulas

---
 drivers/block/drbd/drbd_bitmap.c|8 +---
 drivers/infiniband/hw/mlx4/qp.c |   21 +
 drivers/infiniband/sw/rdmavt/qp.c   |   19 +--
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |7 +--
 drivers/md/dm-bufio.c   |2 +-
 drivers/mtd/ubi/io.c|   11 +--
 fs/btrfs/free-space-tree.c  |7 ++-
 fs/ext4/super.c |   21 +
 fs/gfs2/dir.c   |   29 +
 fs/gfs2/quota.c |8 ++--
 fs/nfs/blocklayout/extent_tree.c|7 ++-
 fs/ntfs/malloc.h|   11 +--
 fs/ubifs/debug.c|5 -
 fs/ubifs/lprops.c   |5 -
 fs/ubifs/lpt_commit.c   |   10 --
 fs/ubifs/orphan.c   |5 -
 fs/ubifs/ubifs.h|1 +
 fs/xfs/kmem.c   |2 +-
 mm/page_alloc.c |2 +-
 mm/vmalloc.c|6 ++
 net/ceph/ceph_common.c  |   14 --
 21 files changed, 156 insertions(+), 45 deletions(-)

Index: linux-2.6/drivers/block/drbd/drbd_bitmap.c
===
--- linux-2.6.orig/drivers/block/drbd/drbd_bitmap.c
+++ linux-2.6/drivers/block/drbd/drbd_bitmap.c
@@ -26,6 +26,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -408,9 +409,10 @@ static struct page **bm_realloc_pages(st
bytes = sizeof(struct page *)*want;
new_pages = kzalloc(bytes, GFP_NOIO | __GFP_NOWARN);
if (!new_pages) {
-   new_pages = __vmalloc(bytes,
-   GFP_NOIO | __GFP_ZERO,
-   PAGE_KERNEL);
+   unsigned noio;
+   noio = memalloc_noio_save();
+   new_pages = vmalloc(bytes);
+   memalloc_noio_restore(noio);
if (!new_pages)
return NULL;
}
Index: linux-2.6/drivers/infiniband/hw/mlx4/qp.c
===
--- linux-2.6.orig/drivers/infiniband/hw/mlx4/qp.c
+++ linux-2.6/drivers/infiniband/hw/mlx4/qp.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -814,14 +815,26 @@ static int create_qp_common(struct mlx4_
 
qp->sq.wrid = kmalloc_array(qp->sq.wqe_cnt, sizeof(u64),
gfp | __GFP_NOWARN);
-   if (!qp->sq.wrid)
+   if (!qp->sq.wrid) {
+   unsigned noio;
+   if (!(gfp & __GFP_IO))
+   noio = memalloc_noio_save();
qp->sq.wrid = __vmalloc(qp->sq.wqe_cnt * sizeof(u64),
-   gfp, PAGE_KERNEL);
+   gfp | __GFP_FS | __GFP_IO, 
PAGE_KERNEL);
+   if (!(gfp & __GFP_IO))
+   memalloc_noio_restore(noio);
+   }
qp->rq.wrid = kmalloc_array(qp->rq.wqe_cnt, sizeof(u64),
gfp | __GFP_NOWARN);
-   if (!qp->rq.wrid)
+   if (!qp->rq.wrid) {
+   unsigned noio;
+   if (!(gfp & __GFP_IO))
+   noio = memalloc_noio_save();
qp->rq.wrid = __vmalloc(qp->rq.wqe_cnt * sizeof(u64),
-   gfp, PAGE_KERNEL);
+   gfp | __GFP_FS | __GFP_IO, 
PAGE_KERNEL);
+   if (!(gfp & __GFP_IO))
+   memalloc_noio_restore(noio);
+   }
if (!qp->sq.wrid || !qp->rq.wrid) {
err = -ENOMEM;
goto err_wrid;
Index: linux-2.6/drivers/infiniband/sw/rdmavt/qp.c
===
--- linux-2.6.orig/driver

[PATCH 1/1] net sched: Added the TC_LINKLAYER_CUSTOM linklayer type

2017-07-03 Thread McCabe, Robert J
This is to support user-space modification of the qdisc stab.

Signed-off-by: McCabe, Robert J 
---
 include/uapi/linux/pkt_sched.h | 1 +
 net/sched/sch_api.c| 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index 099bf55..289bb81 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -82,6 +82,7 @@ enum tc_link_layer {
TC_LINKLAYER_UNAWARE, /* Indicate unaware old iproute2 util */
TC_LINKLAYER_ETHERNET,
TC_LINKLAYER_ATM,
+   TC_LINKLAYER_CUSTOM,
 };
 #define TC_LINKLAYER_MASK 0x0F /* limit use to lower 4 bits */
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 43b94c7..174a925 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -533,6 +533,8 @@ static int qdisc_dump_stab(struct sk_buff *skb, struct 
qdisc_size_table *stab)
goto nla_put_failure;
if (nla_put(skb, TCA_STAB_BASE, sizeof(stab->szopts), &stab->szopts))
goto nla_put_failure;
+   if (nla_put(skb, TCA_STAB_DATA, sizeof(stab->szopts)*sizeof(u16), 
&stab->data))
+   goto nla_put_failure;
nla_nest_end(skb, nest);
 
return skb->len;
-- 
2.7.4



[PATCH 1/1] tc: custom qdisc pkt size translation table

2017-07-03 Thread McCabe, Robert J
Added the "custom" linklayer qdisc stab option.
Allows the user to specify the pkt size translation
parameters from stdin.
Example:
   tc qdisc add ... stab tsize 8 linklayer custom htb
  Custom size table:
  InputSizeStart -> IntputSizeEnd: OutputSize
  0  -> 511  : 600
  512-> 1023 : 1200
  1024   -> 1535 : 1800
  1536   -> 2047 : 2400
  2048   -> 2559 : 3000

Signed-off-by: McCabe, Robert J 
---
 include/linux/pkt_sched.h |  1 +
 tc/tc_core.c  | 46 ++
 tc/tc_core.h  |  2 +-
 tc/tc_stab.c  | 20 +---
 tc/tc_util.c  |  5 +
 5 files changed, 62 insertions(+), 12 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 099bf55..289bb81 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -82,6 +82,7 @@ enum tc_link_layer {
TC_LINKLAYER_UNAWARE, /* Indicate unaware old iproute2 util */
TC_LINKLAYER_ETHERNET,
TC_LINKLAYER_ATM,
+   TC_LINKLAYER_CUSTOM,
 };
 #define TC_LINKLAYER_MASK 0x0F /* limit use to lower 4 bits */
 
diff --git a/tc/tc_core.c b/tc/tc_core.c
index 821b741..167f4d7 100644
--- a/tc/tc_core.c
+++ b/tc/tc_core.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "tc_core.h"
 #include 
@@ -28,6 +29,13 @@
 static double tick_in_usec = 1;
 static double clock_factor = 1;
 
+struct size_table_entry {
+   unsigned int input_size_boundary_start;
+   unsigned int output_size_bytes;
+};
+
+static struct size_table_entry* custom_size_table = NULL;
+static int num_size_table_entries = 0;
 int tc_core_time2big(unsigned int time)
 {
__u64 t = time;
@@ -89,6 +97,20 @@ static unsigned int tc_align_to_atm(unsigned int size)
return linksize;
 }
 
+static unsigned int tc_align_to_custom(unsigned int size)
+{
+   int i;
+
+   assert(custom_size_table != NULL);
+   for(i = num_size_table_entries -1; i >= 0 ; --i) {
+   if(custom_size_table[i].input_size_boundary_start < size) {
+   /* found it */
+   return custom_size_table[i].output_size_bytes;
+   }
+   }
+   return 0;
+}
+
 static unsigned int tc_adjust_size(unsigned int sz, unsigned int mpu, enum 
link_layer linklayer)
 {
if (sz < mpu)
@@ -97,6 +119,8 @@ static unsigned int tc_adjust_size(unsigned int sz, unsigned 
int mpu, enum link_
switch (linklayer) {
case LINKLAYER_ATM:
return tc_align_to_atm(sz);
+   case LINKLAYER_CUSTOM:
+   return tc_align_to_custom(sz);
case LINKLAYER_ETHERNET:
default:
/* No size adjustments on Ethernet */
@@ -185,6 +209,24 @@ int tc_calc_size_table(struct tc_sizespec *s, __u16 **stab)
if (!*stab)
return -1;
 
+   if(LINKLAYER_CUSTOM == linklayer) {
+custom_size_table = malloc(sizeof(struct size_table_entry)* 
s->tsize);
+if(!custom_size_table)
+ return -1;
+num_size_table_entries = s->tsize;
+
+printf("Custom size table:\n");
+printf("InputSizeStart -> IntputSizeEnd: OutputSize\n");
+for(i = 0; i <= s->tsize - 1; ++i) {
+ printf("%-14d -> %-13d: ", i << s->cell_log, ((i+1) 
<< s->cell_log) - 1);
+ if(!scanf("%u", 
&custom_size_table[i].output_size_bytes)) {
+   fprintf(stderr, "Invalid custom stab 
table entry!\n");
+   return -1;
+ }
+ custom_size_table[i].input_size_boundary_start = i << 
s->cell_log;
+}
+   }
+
 again:
for (i = s->tsize - 1; i >= 0; i--) {
sz = tc_adjust_size((i + 1) << s->cell_log, s->mpu, linklayer);
@@ -196,6 +238,10 @@ again:
}
 
s->cell_align = -1; /* Due to the sz calc */
+   if(custom_size_table) {
+free(custom_size_table);
+num_size_table_entries = 0;
+   }
return 0;
 }
 
diff --git a/tc/tc_core.h b/tc/tc_core.h
index 8a63b79..8e97222 100644
--- a/tc/tc_core.h
+++ b/tc/tc_core.h
@@ -10,9 +10,9 @@ enum link_layer {
LINKLAYER_UNSPEC,
LINKLAYER_ETHERNET,
LINKLAYER_ATM,
+   LINKLAYER_CUSTOM,
 };
 
-
 int  tc_core_time2big(unsigned time);
 unsigned tc_core_time2tick(unsigned time);
 unsigned tc_core_tick2time(unsigned tick);
diff --git a/tc/tc_stab.c b/tc/tc_stab.c
index 1a0a3e3..a468a70 100644
--- a/tc/tc_stab.c
+++ b/tc/tc_stab.c
@@ -37,7 +37,9 @@ static void stab_help(void)
"   tsize : how many slots should size table have {512}\n"
"   mpu   : minimum packet size used in rate computations\n"
 

Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 03:49:42PM -0700, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 3:30 PM, Paul E. McKenney
>  wrote:
> >
> > That certainly is one interesting function, isn't it?  I wonder what
> > happens if you replace the raw_spin_is_locked() calls with an
> > unlock under a trylock check?  ;-)
> 
> Deadlock due to interrupts again?

Unless I am missing something subtle, the kgdb_cpu_enter() function in
question has a local_irq_save() over the "interesting" portion of its
workings, so interrupt-handler self-deadlock should not happen.

> Didn't your spin_unlock_wait() patches teach you anything? Checking
> state is fundamentally different from taking the lock. Even a trylock.

That was an embarrassing bug, no two ways about it.  :-/

> I guess you could try with the irqsave versions. But no, we're not doing that.

Again, no need in this case.

But I agree with Will's assessment of this function...

The raw_spin_is_locked() looks to be asking if -any- CPU holds the
dbg_slave_lock, and the answer could of course change immediately
on return from raw_spin_is_locked().  Perhaps the theory is that
if another CPU holds the lock, this CPU is supposed to be subjected to
kgdb_roundup_cpus().  Except that the CPU that held dbg_slave_lock might
be just about to release that lock.  Odd.

Seems like there should be a get_online_cpus() somewhere, but maybe
that constraint is to be manually enforced.

Thanx, Paul



Re: [PATCH RFC 08/26] locking: Remove spin_unlock_wait() generic definitions

2017-07-03 Thread Paul E. McKenney
On Mon, Jul 03, 2017 at 05:39:36PM -0700, Paul E. McKenney wrote:
> On Mon, Jul 03, 2017 at 03:49:42PM -0700, Linus Torvalds wrote:
> > On Mon, Jul 3, 2017 at 3:30 PM, Paul E. McKenney
> >  wrote:
> > >
> > > That certainly is one interesting function, isn't it?  I wonder what
> > > happens if you replace the raw_spin_is_locked() calls with an
> > > unlock under a trylock check?  ;-)
> > 
> > Deadlock due to interrupts again?
> 
> Unless I am missing something subtle, the kgdb_cpu_enter() function in
> question has a local_irq_save() over the "interesting" portion of its
> workings, so interrupt-handler self-deadlock should not happen.
> 
> > Didn't your spin_unlock_wait() patches teach you anything? Checking
> > state is fundamentally different from taking the lock. Even a trylock.
> 
> That was an embarrassing bug, no two ways about it.  :-/
> 
> > I guess you could try with the irqsave versions. But no, we're not doing 
> > that.
> 
> Again, no need in this case.
> 
> But I agree with Will's assessment of this function...
> 
> The raw_spin_is_locked() looks to be asking if -any- CPU holds the
> dbg_slave_lock, and the answer could of course change immediately
> on return from raw_spin_is_locked().  Perhaps the theory is that
> if other CPU holds the lock, this CPU is supposed to be subjected to
> kgdb_roundup_cpus().  Except that the CPU that held dbg_slave_lock might
> be just about to release that lock.  Odd.
> 
> Seems like there should be a get_online_cpus() somewhere, but maybe
> that constraint is to be manually enforced.

Except that invoking get_online_cpus() from an exception handler would
of course be a spectacularly bad idea.  I would feel better if the
num_online_cpus() was under the local_irq_save(), but perhaps this code
is relying on the stop_machine().  Except that it appears we could
deadlock with offline waiting for stop_machine() to complete and kgdb
waiting for all CPUs to report, including those in stop_machine().

Looks like the current situation is "Don't use kgdb if there is any
possibility of CPU-hotplug operations."  Not necessarily an unreasonable
restriction.

But I need to let my eyes heal a bit before looking at this more.

Thanx, Paul



[PATCH] net: ethernet: mediatek: fixed deadlock captured by lockdep

2017-07-03 Thread sean.wang
From: Sean Wang 

Lockdep found an inconsistent lock state when mtk_get_stats64 is called
in user context while NAPI updates MAC statistics in softirq.

Use spin_trylock_bh/spin_unlock_bh to fix the following lockdep warning.

[   81.321030] WARNING: inconsistent lock state
[   81.325266] 4.12.0-rc1-00035-gd9dda65 #32 Not tainted
[   81.330273] 
[   81.334505] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[   81.340464] ksoftirqd/0/7 [HC0[0]:SC1[1]:HE1:SE0] takes:
[   81.345731]  (&syncp->seq#2){+.?...}, at: [] 
mtk_handle_status_irq.part.6+0x70/0x84
[   81.354219] {SOFTIRQ-ON-W} state was registered at:
[   81.359062]   lock_acquire+0xfc/0x2b0
[   81.362696]   mtk_stats_update_mac+0x60/0x2c0
[   81.367017]   mtk_get_stats64+0x17c/0x18c
[   81.370995]   dev_get_stats+0x48/0xbc
[   81.374628]   rtnl_fill_stats+0x48/0x128
[   81.378520]   rtnl_fill_ifinfo+0x4ac/0xd1c
[   81.382584]   rtmsg_ifinfo_build_skb+0x7c/0xe0
[   81.386991]   rtmsg_ifinfo.part.5+0x24/0x54
[   81.391139]   rtmsg_ifinfo+0x24/0x28
[   81.394685]   __dev_notify_flags+0xa4/0xac
[   81.398749]   dev_change_flags+0x50/0x58
[   81.402640]   devinet_ioctl+0x768/0x85c
[   81.406444]   inet_ioctl+0x1a4/0x1d0
[   81.409990]   sock_ioctl+0x16c/0x33c
[   81.413538]   do_vfs_ioctl+0xb4/0xa34
[   81.417169]   SyS_ioctl+0x44/0x6c
[   81.420458]   ret_fast_syscall+0x0/0x1c
[   81.424260] irq event stamp: 3354692
[   81.427806] hardirqs last  enabled at (3354692): [] 
net_rx_action+0xc0/0x504
[   81.435660] hardirqs last disabled at (3354691): [] 
net_rx_action+0x8c/0x504
[   81.443515] softirqs last  enabled at (3354106): [] 
__do_softirq+0x4b4/0x614
[   81.451370] softirqs last disabled at (3354109): [] 
run_ksoftirqd+0x44/0x80
[   81.459134]
[   81.459134] other info that might help us debug this:
[   81.465608]  Possible unsafe locking scenario:
[   81.465608]
[   81.471478]CPU0
[   81.473900]
[   81.476321]   lock(&syncp->seq#2);
[   81.479701]   
[   81.482294] lock(&syncp->seq#2);
[   81.485847]
[   81.485847]  *** DEADLOCK ***
[   81.485847]
[   81.491720] 1 lock held by ksoftirqd/0/7:
[   81.495693]  #0:  (&(&mac->hw_stats->stats_lock)->rlock){+.+...}, at: 
[] mtk_handle_status_irq.part.6+0x48/0x84
[   81.506579]
[   81.506579] stack backtrace:
[   81.510904] CPU: 0 PID: 7 Comm: ksoftirqd/0 Not tainted 
4.12.0-rc1-00035-gd9dda65 #32
[   81.518668] Hardware name: Mediatek Cortex-A7 (Device Tree)
[   81.524208] [] (unwind_backtrace) from [] 
(show_stack+0x20/0x24)
[   81.531899] [] (show_stack) from [] 
(dump_stack+0xb4/0xe0)
[   81.539072] [] (dump_stack) from [] 
(print_usage_bug+0x234/0x2e0)
[   81.546846] [] (print_usage_bug) from [] 
(mark_lock+0x63c/0x7bc)
[   81.554532] [] (mark_lock) from [] 
(__lock_acquire+0x654/0x1bfc)
[   81.562217] [] (__lock_acquire) from [] 
(lock_acquire+0xfc/0x2b0)
[   81.569990] [] (lock_acquire) from [] 
(mtk_stats_update_mac+0x60/0x2c0)
[   81.578283] [] (mtk_stats_update_mac) from [] 
(mtk_handle_status_irq.part.6+0x70/0x84)
[   81.587865] [] (mtk_handle_status_irq.part.6) from [] 
(mtk_napi_tx+0x358/0x37c)
[   81.596845] [] (mtk_napi_tx) from [] 
(net_rx_action+0x244/0x504)
[   81.604533] [] (net_rx_action) from [] 
(__do_softirq+0x134/0x614)
[   81.612306] [] (__do_softirq) from [] 
(run_ksoftirqd+0x44/0x80)
[   81.619907] [] (run_ksoftirqd) from [] 
(smpboot_thread_fn+0x14c/0x25c)
[   81.628110] [] (smpboot_thread_fn) from [] 
(kthread+0x150/0x180)
[   81.635798] [] (kthread) from [] 
(ret_from_fork+0x14/0x24)

Signed-off-by: Sean Wang 
---
 drivers/net/ethernet/mediatek/mtk_eth_soc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mediatek/mtk_eth_soc.c 
b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
index 16f9755..8a2acb8 100644
--- a/drivers/net/ethernet/mediatek/mtk_eth_soc.c
+++ b/drivers/net/ethernet/mediatek/mtk_eth_soc.c
@@ -470,9 +470,9 @@ static void mtk_get_stats64(struct net_device *dev,
unsigned int start;
 
if (netif_running(dev) && netif_device_present(dev)) {
-   if (spin_trylock(&hw_stats->stats_lock)) {
+   if (spin_trylock_bh(&hw_stats->stats_lock)) {
mtk_stats_update_mac(mac);
-   spin_unlock(&hw_stats->stats_lock);
+   spin_unlock_bh(&hw_stats->stats_lock);
}
}
 
@@ -2156,9 +2156,9 @@ static void mtk_get_ethtool_stats(struct net_device *dev,
return;
 
if (netif_running(dev) && netif_device_present(dev)) {
-   if (spin_trylock(&hwstats->stats_lock)) {
+   if (spin_trylock_bh(&hwstats->stats_lock)) {
mtk_stats_update_mac(mac);
-   spin_unlock(&hwstats->stats_lock);
+   spin_unlock_bh(&hwstats->stats_lock);
}
}
 
-- 
2.7.4



'skb' buffer address information leakage

2017-07-03 Thread Dison River
Hi all:
I found several address leaks of the "skb" buffer. When I have an
arbitrary address write vulnerability in the kernel (with kASLR enabled),
I can use the skb's address to find sk_destruct's address and overwrite
it. Then invoking close(sock_fd) triggers the shellcode (the sk_destruct
function).

In kernel 4.12-rc7
drivers/net/irda/vlsi_ir.c:326   seq_printf(seq, "skb=%p
data=%p hw=%p\n", rd->skb, rd->buf, rd->hw);
drivers/net/ethernet/netronome/nfp/nfp_net_debugfs.c:167
 seq_printf(file, " frag=%p", skb);
drivers/net/wireless/ath/wil6210/debugfs.c:926   seq_printf(s,
"  SKB = 0x%p\n", skb);

Thanks.


Re: 'skb' buffer address information leakage

2017-07-03 Thread Jakub Kicinski
On Tue, 4 Jul 2017 13:12:18 +0800, Dison River wrote:
> drivers/net/ethernet/netronome/nfp/nfp_net_debugfs.c:167
>  seq_printf(file, " frag=%p", skb);

FWIW that's actually not a skb pointer.  The structure is defined like
this:

struct nfp_net_tx_buf {
union { 
struct sk_buff *skb;
void *frag;
};
dma_addr_t dma_addr;
short int fidx;
u16 pkt_cnt;
u32 real_len;
};

So the line in question is actually reading the frag pointer; I just
reused the skb variable because this has to be read via READ_ONCE()
and NULL-checked, so I thought that doing it separately for skb and
frag would be a waste of LOC, especially in debug code.  I will queue
up a cleanup after the merge window.

Thanks!


Re: [PATCH 1/1] net sched: Added the TC_LINKLAYER_CUSTOM linklayer type

2017-07-03 Thread Jiri Pirko
Tue, Jul 04, 2017 at 02:14:25AM CEST, robert.mcc...@rockwellcollins.com wrote:
>This is to support user-space modification of the qdisc stab.
>
>Signed-off-by: McCabe, Robert J 
>---
> include/uapi/linux/pkt_sched.h | 1 +
> net/sched/sch_api.c| 2 ++
> 2 files changed, 3 insertions(+)
>
>diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
>index 099bf55..289bb81 100644
>--- a/include/uapi/linux/pkt_sched.h
>+++ b/include/uapi/linux/pkt_sched.h
>@@ -82,6 +82,7 @@ enum tc_link_layer {
>   TC_LINKLAYER_UNAWARE, /* Indicate unaware old iproute2 util */
>   TC_LINKLAYER_ETHERNET,
>   TC_LINKLAYER_ATM,
>+  TC_LINKLAYER_CUSTOM,
> };
> #define TC_LINKLAYER_MASK 0x0F /* limit use to lower 4 bits */
> 
>diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>index 43b94c7..174a925 100644
>--- a/net/sched/sch_api.c
>+++ b/net/sched/sch_api.c
>@@ -533,6 +533,8 @@ static int qdisc_dump_stab(struct sk_buff *skb, struct 
>qdisc_size_table *stab)
>   goto nla_put_failure;
>   if (nla_put(skb, TCA_STAB_BASE, sizeof(stab->szopts), &stab->szopts))
>   goto nla_put_failure;
>+  if (nla_put(skb, TCA_STAB_DATA, sizeof(stab->szopts)*sizeof(u16), 
>&stab->data))
>+  goto nla_put_failure;
 
You dump stab->data to userspace. How is this related to TC_LINKLAYER_CUSTOM,
and how come this "is to support user-space modification of the qdisc
stab", as your description says? I'm confused...



>   nla_nest_end(skb, nest);
> 
>   return skb->len;
>-- 
>2.7.4
>


[PATCH 8/9] net, ipv4: convert cipso_v4_doi.refcount from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
refcount_t type and corresponding API should be
used instead of atomic_t when the variable is used as
a reference counter. This allows us to avoid accidental
refcounter overflows that might lead to use-after-free
situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/cipso_ipv4.h |  3 ++-
 net/ipv4/cipso_ipv4.c| 12 ++--
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/net/cipso_ipv4.h b/include/net/cipso_ipv4.h
index a34b141..880adb2 100644
--- a/include/net/cipso_ipv4.h
+++ b/include/net/cipso_ipv4.h
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* known doi values */
@@ -85,7 +86,7 @@ struct cipso_v4_doi {
} map;
u8 tags[CIPSO_V4_TAG_MAXCNT];
 
-   atomic_t refcount;
+   refcount_t refcount;
struct list_head list;
struct rcu_head rcu;
 };
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index c204477..c4c6e19 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -375,7 +375,7 @@ static struct cipso_v4_doi *cipso_v4_doi_search(u32 doi)
struct cipso_v4_doi *iter;
 
list_for_each_entry_rcu(iter, &cipso_v4_doi_list, list)
-   if (iter->doi == doi && atomic_read(&iter->refcount))
+   if (iter->doi == doi && refcount_read(&iter->refcount))
return iter;
return NULL;
 }
@@ -429,7 +429,7 @@ int cipso_v4_doi_add(struct cipso_v4_doi *doi_def,
}
}
 
-   atomic_set(&doi_def->refcount, 1);
+   refcount_set(&doi_def->refcount, 1);
 
spin_lock(&cipso_v4_doi_list_lock);
if (cipso_v4_doi_search(doi_def->doi)) {
@@ -533,7 +533,7 @@ int cipso_v4_doi_remove(u32 doi, struct netlbl_audit 
*audit_info)
ret_val = -ENOENT;
goto doi_remove_return;
}
-   if (!atomic_dec_and_test(&doi_def->refcount)) {
+   if (!refcount_dec_and_test(&doi_def->refcount)) {
spin_unlock(&cipso_v4_doi_list_lock);
ret_val = -EBUSY;
goto doi_remove_return;
@@ -576,7 +576,7 @@ struct cipso_v4_doi *cipso_v4_doi_getdef(u32 doi)
doi_def = cipso_v4_doi_search(doi);
if (!doi_def)
goto doi_getdef_return;
-   if (!atomic_inc_not_zero(&doi_def->refcount))
+   if (!refcount_inc_not_zero(&doi_def->refcount))
doi_def = NULL;
 
 doi_getdef_return:
@@ -597,7 +597,7 @@ void cipso_v4_doi_putdef(struct cipso_v4_doi *doi_def)
if (!doi_def)
return;
 
-   if (!atomic_dec_and_test(&doi_def->refcount))
+   if (!refcount_dec_and_test(&doi_def->refcount))
return;
spin_lock(&cipso_v4_doi_list_lock);
list_del_rcu(&doi_def->list);
@@ -630,7 +630,7 @@ int cipso_v4_doi_walk(u32 *skip_cnt,
 
rcu_read_lock();
list_for_each_entry_rcu(iter_doi, &cipso_v4_doi_list, list)
-   if (atomic_read(&iter_doi->refcount) > 0) {
+   if (refcount_read(&iter_doi->refcount) > 0) {
if (doi_cnt++ < *skip_cnt)
continue;
ret_val = callback(iter_doi, cb_arg);
-- 
2.7.4
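
As a side note on the description above: unlike atomic_t, the refcount_t
API saturates and warns on overflow (and on increment from zero) instead
of silently wrapping, which is what prevents the use-after-free scenario
mentioned.  A minimal sketch of the usual get/put pair (the object below
is hypothetical, not taken from the patch):

#include <linux/refcount.h>
#include <linux/slab.h>

struct example_obj {
	refcount_t ref;
};

static void example_get(struct example_obj *obj)
{
	refcount_inc(&obj->ref);		/* saturates and WARNs instead of wrapping */
}

static void example_put(struct example_obj *obj)
{
	if (refcount_dec_and_test(&obj->ref))	/* true only for the final put */
		kfree(obj);
}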



[PATCH 7/9] net, ipv6: convert ip6addrlbl_entry.refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
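
The one caller that needs care is ip6addrlbl_hold(): label entries are
found by walking an RCU-protected list, so a walker can race with
removal and must take a reference only while the count is still
non-zero; refcount_inc_not_zero() expresses that directly. A rough
sketch of the pattern (hypothetical entry type, not the addrlabel code):

#include <linux/refcount.h>
#include <linux/types.h>

struct entry {
        refcount_t refcnt;
};

/* Called under rcu_read_lock(); returns false if the entry has
 * already dropped its last reference and is merely awaiting
 * RCU-deferred freeing.
 */
static bool entry_hold(struct entry *p)
{
        return refcount_inc_not_zero(&p->refcnt);
}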

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 net/ipv6/addrlabel.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index 07cd7d2..7a428f6 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/refcount.h>
 
 #if 0
 #define ADDRLABEL(x...) printk(x)
@@ -36,7 +37,7 @@ struct ip6addrlbl_entry {
int addrtype;
u32 label;
struct hlist_node list;
-   atomic_t refcnt;
+   refcount_t refcnt;
struct rcu_head rcu;
 };
 
@@ -137,12 +138,12 @@ static void ip6addrlbl_free_rcu(struct rcu_head *h)
 
 static bool ip6addrlbl_hold(struct ip6addrlbl_entry *p)
 {
-   return atomic_inc_not_zero(&p->refcnt);
+   return refcount_inc_not_zero(&p->refcnt);
 }
 
 static inline void ip6addrlbl_put(struct ip6addrlbl_entry *p)
 {
-   if (atomic_dec_and_test(&p->refcnt))
+   if (refcount_dec_and_test(&p->refcnt))
call_rcu(&p->rcu, ip6addrlbl_free_rcu);
 }
 
@@ -236,7 +237,7 @@ static struct ip6addrlbl_entry *ip6addrlbl_alloc(struct net *net,
newp->label = label;
INIT_HLIST_NODE(&newp->list);
write_pnet(&newp->lbl_net, net);
-   atomic_set(&newp->refcnt, 1);
+   refcount_set(&newp->refcnt, 1);
return newp;
 }
 
-- 
2.7.4



[PATCH 9/9] net, ipv4: convert fib_info.fib_clntref from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
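
One hunk is worth calling out: fib_create_info() now takes the
creator's reference with refcount_set(&fi->fib_clntref, 1) instead of
an increment. A refcount_t that is still zero is treated as a dead
object, so (with CONFIG_REFCOUNT_FULL) incrementing it would warn; a
freshly allocated object therefore gets its first reference via
refcount_set(). A small sketch of the idea (hypothetical names, not
the fib code itself):

#include <linux/refcount.h>
#include <linux/slab.h>

struct foo {
        refcount_t ref;
};

static struct foo *foo_create(gfp_t gfp)
{
        struct foo *f = kzalloc(sizeof(*f), gfp);

        if (!f)
                return NULL;
        /*
         * The counter is 0 after kzalloc(). refcount_inc() on a zero
         * counter is flagged as a use-after-free, so the creator's
         * reference is established with refcount_set() instead.
         */
        refcount_set(&f->ref, 1);
        return f;
}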

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/ip_fib.h | 7 ---
 net/ipv4/fib_semantics.c | 2 +-
 net/ipv4/fib_trie.c  | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 3dbfd5e..41d580c 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include <linux/refcount.h>
 
 struct fib_config {
u8  fc_dst_len;
@@ -105,7 +106,7 @@ struct fib_info {
struct hlist_node   fib_lhash;
struct net  *fib_net;
int fib_treeref;
-   atomic_tfib_clntref;
+   refcount_t  fib_clntref;
unsigned intfib_flags;
unsigned char   fib_dead;
unsigned char   fib_protocol;
@@ -430,12 +431,12 @@ void free_fib_info(struct fib_info *fi);
 
 static inline void fib_info_hold(struct fib_info *fi)
 {
-   atomic_inc(&fi->fib_clntref);
+   refcount_inc(&fi->fib_clntref);
 }
 
 static inline void fib_info_put(struct fib_info *fi)
 {
-   if (atomic_dec_and_test(&fi->fib_clntref))
+   if (refcount_dec_and_test(&fi->fib_clntref))
free_fib_info(fi);
 }
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index ff47ea1..22210010 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1253,7 +1253,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
}
 
fi->fib_treeref++;
-   atomic_inc(&fi->fib_clntref);
+   refcount_set(&fi->fib_clntref, 1);
spin_lock_bh(&fib_info_lock);
hlist_add_head(&fi->fib_hash,
   &fib_info_hash[fib_info_hashfn(fi)]);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d56659e..64668c6 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1463,7 +1463,7 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
}
 
if (!(fib_flags & FIB_LOOKUP_NOREF))
-   atomic_inc(&fi->fib_clntref);
+   refcount_inc(&fi->fib_clntref);
 
res->prefix = htonl(n->key);
res->prefixlen = KEYLENGTH - fa->fa_slen;
-- 
2.7.4



[PATCH 6/9] net, ipv6: convert xfrm6_tunnel_spi.refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
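
xfrm6_tunnel_alloc_spi() keeps its existing "reuse an existing SPI
mapping or allocate a new one" structure; only the counter type
changes, and both paths leave the caller holding one reference. A
condensed sketch of that shape (spi_lookup/spi_alloc are stand-ins for
the existing __xfrm6_tunnel_* helpers, everything else is illustrative):

#include <linux/spinlock.h>
#include <linux/refcount.h>
#include <net/xfrm.h>

struct spi_entry {
        u32 spi;
        refcount_t refcnt;
};

static DEFINE_SPINLOCK(spi_lock);

/* Stand-ins for __xfrm6_tunnel_spi_lookup()/__xfrm6_tunnel_alloc_spi(). */
struct spi_entry *spi_lookup(const xfrm_address_t *saddr);
struct spi_entry *spi_alloc(const xfrm_address_t *saddr);

/* Reuse an existing mapping or create a new one; either way the
 * caller ends up holding exactly one reference on the entry.
 */
static u32 spi_get(const xfrm_address_t *saddr)
{
        struct spi_entry *x;
        u32 spi = 0;

        spin_lock_bh(&spi_lock);
        x = spi_lookup(saddr);
        if (x) {
                refcount_inc(&x->refcnt);   /* existing mapping, one more user */
                spi = x->spi;
        } else {
                x = spi_alloc(saddr);       /* allocates with refcount set to 1 */
                if (x)
                        spi = x->spi;
        }
        spin_unlock_bh(&spi_lock);
        return spi;
}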

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 net/ipv6/xfrm6_tunnel.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv6/xfrm6_tunnel.c b/net/ipv6/xfrm6_tunnel.c
index d7b731a..4e438bc 100644
--- a/net/ipv6/xfrm6_tunnel.c
+++ b/net/ipv6/xfrm6_tunnel.c
@@ -59,7 +59,7 @@ struct xfrm6_tunnel_spi {
struct hlist_node   list_byspi;
xfrm_address_t  addr;
u32 spi;
-   atomic_trefcnt;
+   refcount_t  refcnt;
struct rcu_head rcu_head;
 };
 
@@ -160,7 +160,7 @@ static u32 __xfrm6_tunnel_alloc_spi(struct net *net, xfrm_address_t *saddr)
 
memcpy(&x6spi->addr, saddr, sizeof(x6spi->addr));
x6spi->spi = spi;
-   atomic_set(&x6spi->refcnt, 1);
+   refcount_set(&x6spi->refcnt, 1);
 
hlist_add_head_rcu(&x6spi->list_byspi, &xfrm6_tn->spi_byspi[index]);
 
@@ -178,7 +178,7 @@ __be32 xfrm6_tunnel_alloc_spi(struct net *net, xfrm_address_t *saddr)
spin_lock_bh(&xfrm6_tunnel_spi_lock);
x6spi = __xfrm6_tunnel_spi_lookup(net, saddr);
if (x6spi) {
-   atomic_inc(&x6spi->refcnt);
+   refcount_inc(&x6spi->refcnt);
spi = x6spi->spi;
} else
spi = __xfrm6_tunnel_alloc_spi(net, saddr);
@@ -207,7 +207,7 @@ static void xfrm6_tunnel_free_spi(struct net *net, xfrm_address_t *saddr)
  list_byaddr)
{
if (xfrm6_addr_equal(&x6spi->addr, saddr)) {
-   if (atomic_dec_and_test(&x6spi->refcnt)) {
+   if (refcount_dec_and_test(&x6spi->refcnt)) {
hlist_del_rcu(&x6spi->list_byaddr);
hlist_del_rcu(&x6spi->list_byspi);
call_rcu(&x6spi->rcu_head, x6spi_destroy_rcu);
-- 
2.7.4



[PATCH 5/9] net, ipv6: convert ifacaddr6.aca_refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/if_inet6.h | 2 +-
 net/ipv6/anycast.c | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 4bb52ce..d4088d1 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -147,7 +147,7 @@ struct ifacaddr6 {
struct rt6_info *aca_rt;
struct ifacaddr6*aca_next;
int aca_users;
-   atomic_taca_refcnt;
+   refcount_t  aca_refcnt;
unsigned long   aca_cstamp;
unsigned long   aca_tstamp;
 };
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 514ac25..0bbab8a 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -203,12 +203,12 @@ void ipv6_sock_ac_close(struct sock *sk)
 
 static void aca_get(struct ifacaddr6 *aca)
 {
-   atomic_inc(&aca->aca_refcnt);
+   refcount_inc(&aca->aca_refcnt);
 }
 
 static void aca_put(struct ifacaddr6 *ac)
 {
-   if (atomic_dec_and_test(&ac->aca_refcnt)) {
+   if (refcount_dec_and_test(&ac->aca_refcnt)) {
in6_dev_put(ac->aca_idev);
dst_release(&ac->aca_rt->dst);
kfree(ac);
@@ -232,7 +232,7 @@ static struct ifacaddr6 *aca_alloc(struct rt6_info *rt,
aca->aca_users = 1;
/* aca_tstamp should be updated upon changes */
aca->aca_cstamp = aca->aca_tstamp = jiffies;
-   atomic_set(&aca->aca_refcnt, 1);
+   refcount_set(&aca->aca_refcnt, 1);
 
return aca;
 }
-- 
2.7.4



[PATCH 3/9] net, ipv6: convert inet6_ifaddr.refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
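
Note that __in6_ifa_put() maps to a bare refcount_dec(): it is used on
paths where the caller knows it is not dropping the last reference, so
it never frees, and with CONFIG_REFCOUNT_FULL a drop to zero through
it would warn rather than silently leak the free. Roughly (hypothetical
object, not the addrconf code):

#include <linux/refcount.h>

struct item {
        refcount_t ref;
};

void item_destroy(struct item *it);     /* stand-in for the real destructor */

static inline void item_put(struct item *it)
{
        if (refcount_dec_and_test(&it->ref))
                item_destroy(it);
}

/* For callers that are guaranteed not to be dropping the last
 * reference: never frees, and with CONFIG_REFCOUNT_FULL it warns
 * if the count would reach zero here (a missed free).
 */
static inline void __item_put(struct item *it)
{
        refcount_dec(&it->ref);
}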

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/addrconf.h | 6 +++---
 include/net/if_inet6.h | 2 +-
 net/ipv6/addrconf.c| 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 620bd9a..6df79e9 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -350,18 +350,18 @@ void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp);
 
 static inline void in6_ifa_put(struct inet6_ifaddr *ifp)
 {
-   if (atomic_dec_and_test(&ifp->refcnt))
+   if (refcount_dec_and_test(&ifp->refcnt))
inet6_ifa_finish_destroy(ifp);
 }
 
 static inline void __in6_ifa_put(struct inet6_ifaddr *ifp)
 {
-   atomic_dec(&ifp->refcnt);
+   refcount_dec(&ifp->refcnt);
 }
 
 static inline void in6_ifa_hold(struct inet6_ifaddr *ifp)
 {
-   atomic_inc(&ifp->refcnt);
+   refcount_inc(&ifp->refcnt);
 }
 
 
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index e7a17b2..2b41cb8 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -46,7 +46,7 @@ struct inet6_ifaddr {
/* In seconds, relative to tstamp. Expiry is at tstamp + HZ * lft. */
__u32   valid_lft;
__u32   prefered_lft;
-   atomic_trefcnt;
+   refcount_t  refcnt;
spinlock_t  lock;
 
int state;
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2365f12..3c46e95 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1050,7 +1050,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
 
ifa->idev = idev;
/* For caller */
-   in6_ifa_hold(ifa);
+   refcount_set(&ifa->refcnt, 1);
 
/* Add to big hash table */
hash = inet6_addr_hash(addr);
-- 
2.7.4



[PATCH 4/9] net, ipv6: convert ifmcaddr6.mca_refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
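
Most of the churn in mcast.c is the "a pending timer owns a reference"
pattern: arming mca_timer takes a reference on the group, and a
successful del_timer() drops the reference the pending timer was
holding. Sketched with a hypothetical group object (mod_timer()
returns 0 only when the timer was not already pending):

#include <linux/timer.h>
#include <linux/refcount.h>
#include <linux/jiffies.h>

struct group {
        struct timer_list timer;
        refcount_t refcnt;
};

static void group_stop_timer(struct group *g)
{
        /* A pending timer holds one reference; drop it if we killed it. */
        if (del_timer(&g->timer))
                refcount_dec(&g->refcnt);
}

static void group_arm_timer(struct group *g, unsigned long delay)
{
        /*
         * mod_timer() returns 0 when the timer was not pending, i.e.
         * the timer is gaining a reference it did not hold before.
         */
        if (!mod_timer(&g->timer, jiffies + delay))
                refcount_inc(&g->refcnt);
}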

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/if_inet6.h |  2 +-
 net/ipv6/mcast.c   | 18 +-
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index 2b41cb8..4bb52ce 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -127,7 +127,7 @@ struct ifmcaddr6 {
struct timer_list   mca_timer;
unsigned intmca_flags;
int mca_users;
-   atomic_tmca_refcnt;
+   refcount_t  mca_refcnt;
spinlock_t  mca_lock;
unsigned long   mca_cstamp;
unsigned long   mca_tstamp;
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index e222113..12b7c27 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -701,7 +701,7 @@ static void igmp6_group_dropped(struct ifmcaddr6 *mc)
 
spin_lock_bh(&mc->mca_lock);
if (del_timer(&mc->mca_timer))
-   atomic_dec(&mc->mca_refcnt);
+   refcount_dec(&mc->mca_refcnt);
spin_unlock_bh(&mc->mca_lock);
 }
 
@@ -819,12 +819,12 @@ static void mld_clear_delrec(struct inet6_dev *idev)
 
 static void mca_get(struct ifmcaddr6 *mc)
 {
-   atomic_inc(&mc->mca_refcnt);
+   refcount_inc(&mc->mca_refcnt);
 }
 
 static void ma_put(struct ifmcaddr6 *mc)
 {
-   if (atomic_dec_and_test(&mc->mca_refcnt)) {
+   if (refcount_dec_and_test(&mc->mca_refcnt)) {
in6_dev_put(mc->idev);
kfree(mc);
}
@@ -846,7 +846,7 @@ static struct ifmcaddr6 *mca_alloc(struct inet6_dev *idev,
mc->mca_users = 1;
/* mca_stamp should be updated upon changes */
mc->mca_cstamp = mc->mca_tstamp = jiffies;
-   atomic_set(&mc->mca_refcnt, 1);
+   refcount_set(&mc->mca_refcnt, 1);
spin_lock_init(&mc->mca_lock);
 
/* initial mode is (EX, empty) */
@@ -1065,7 +1065,7 @@ static void igmp6_group_queried(struct ifmcaddr6 *ma, unsigned long resptime)
return;
 
if (del_timer(&ma->mca_timer)) {
-   atomic_dec(&ma->mca_refcnt);
+   refcount_dec(&ma->mca_refcnt);
delay = ma->mca_timer.expires - jiffies;
}
 
@@ -1074,7 +1074,7 @@ static void igmp6_group_queried(struct ifmcaddr6 *ma, unsigned long resptime)
 
ma->mca_timer.expires = jiffies + delay;
if (!mod_timer(&ma->mca_timer, jiffies + delay))
-   atomic_inc(&ma->mca_refcnt);
+   refcount_inc(&ma->mca_refcnt);
ma->mca_flags |= MAF_TIMER_RUNNING;
 }
 
@@ -1469,7 +1469,7 @@ int igmp6_event_report(struct sk_buff *skb)
if (ipv6_addr_equal(&ma->mca_addr, &mld->mld_mca)) {
spin_lock(&ma->mca_lock);
if (del_timer(&ma->mca_timer))
-   atomic_dec(&ma->mca_refcnt);
+   refcount_dec(&ma->mca_refcnt);
ma->mca_flags &= ~(MAF_LAST_REPORTER|MAF_TIMER_RUNNING);
spin_unlock(&ma->mca_lock);
break;
@@ -2391,12 +2391,12 @@ static void igmp6_join_group(struct ifmcaddr6 *ma)
 
spin_lock_bh(&ma->mca_lock);
if (del_timer(&ma->mca_timer)) {
-   atomic_dec(&ma->mca_refcnt);
+   refcount_dec(&ma->mca_refcnt);
delay = ma->mca_timer.expires - jiffies;
}
 
if (!mod_timer(&ma->mca_timer, jiffies + delay))
-   atomic_inc(&ma->mca_refcnt);
+   refcount_inc(&ma->mca_refcnt);
ma->mca_flags |= MAF_TIMER_RUNNING | MAF_LAST_REPORTER;
spin_unlock_bh(&ma->mca_lock);
 }
-- 
2.7.4



[PATCH 2/9] net, ipv6: convert inet6_dev.refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
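
in6_dev_get() keeps using a plain refcount_inc() under rcu_read_lock()
rather than refcount_inc_not_zero(); that relies on the published
dev->ip6_ptr itself pinning a reference (i.e. the pointer being
cleared before the device's own reference is dropped), so the count
can never be observed at zero through it. A condensed illustration of
the pattern with that assumption spelled out (illustrative names only):

#include <linux/rcupdate.h>
#include <linux/refcount.h>

struct node {
        refcount_t ref;
};

struct holder {
        struct node __rcu *ptr;
};

/*
 * Publication of 'ptr' is assumed to pin a reference (the pointer is
 * cleared before the final put), so the count can never be observed
 * at zero here and a plain refcount_inc() is sufficient. Without
 * that invariant this would have to use refcount_inc_not_zero() and
 * cope with failure.
 */
static struct node *node_get(struct holder *h)
{
        struct node *n;

        rcu_read_lock();
        n = rcu_dereference(h->ptr);
        if (n)
                refcount_inc(&n->ref);
        rcu_read_unlock();
        return n;
}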

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/addrconf.h | 8 
 include/net/if_inet6.h | 3 ++-
 net/ipv6/addrconf.c| 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index d0889cb..620bd9a 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -316,7 +316,7 @@ static inline struct inet6_dev *in6_dev_get(const struct net_device *dev)
rcu_read_lock();
idev = rcu_dereference(dev->ip6_ptr);
if (idev)
-   atomic_inc(&idev->refcnt);
+   refcount_inc(&idev->refcnt);
rcu_read_unlock();
return idev;
 }
@@ -332,18 +332,18 @@ void in6_dev_finish_destroy(struct inet6_dev *idev);
 
 static inline void in6_dev_put(struct inet6_dev *idev)
 {
-   if (atomic_dec_and_test(&idev->refcnt))
+   if (refcount_dec_and_test(&idev->refcnt))
in6_dev_finish_destroy(idev);
 }
 
 static inline void __in6_dev_put(struct inet6_dev *idev)
 {
-   atomic_dec(&idev->refcnt);
+   refcount_dec(&idev->refcnt);
 }
 
 static inline void in6_dev_hold(struct inet6_dev *idev)
 {
-   atomic_inc(&idev->refcnt);
+   refcount_inc(&idev->refcnt);
 }
 
 void inet6_ifa_finish_destroy(struct inet6_ifaddr *ifp);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index f656f90..e7a17b2 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -17,6 +17,7 @@
 
 #include 
 #include 
+#include <linux/refcount.h>
 
 /* inet6_dev.if_flags */
 
@@ -187,7 +188,7 @@ struct inet6_dev {
 
struct ifacaddr6*ac_list;
rwlock_tlock;
-   atomic_trefcnt;
+   refcount_t  refcnt;
__u32   if_flags;
int dead;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 114fb64..2365f12 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -426,7 +426,7 @@ static struct inet6_dev *ipv6_add_dev(struct net_device *dev)
}
 
/* One reference from device. */
-   in6_dev_hold(ndev);
+   refcount_set(&ndev->refcnt, 1);
 
if (dev->flags & (IFF_NOARP | IFF_LOOPBACK))
ndev->cnf.accept_dad = -1;
-- 
2.7.4



[PATCH 1/9] net, ipv6: convert ipv6_txoptions.refcnt from atomic_t to refcount_t

2017-07-03 Thread Elena Reshetova
The refcount_t type and corresponding API should be used
instead of atomic_t when the variable is used as a reference
counter. This helps avoid accidental refcounter overflows
that might lead to use-after-free situations.
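
txopt_get() combines refcount_inc_not_zero() with
rcu_pointer_handoff(): once the reference is pinned, the handoff
documents that the pointer's lifetime is now guaranteed by the
reference count rather than by the enclosing RCU read-side critical
section, so the pointer may legitimately be used after
rcu_read_unlock(). In condensed form (hypothetical wrapper around a
refcounted, RCU-published object, not the sockglue code):

#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>

struct opts {
        refcount_t refcnt;
        struct rcu_head rcu;
};

struct owner {
        struct opts __rcu *opt;
};

static struct opts *opts_get(struct owner *ow)
{
        struct opts *o;

        rcu_read_lock();
        o = rcu_dereference(ow->opt);
        if (o) {
                if (!refcount_inc_not_zero(&o->refcnt))
                        o = NULL;                   /* already being freed */
                else
                        o = rcu_pointer_handoff(o); /* pinned by refcnt now */
        }
        rcu_read_unlock();
        return o;       /* usable after rcu_read_unlock(); caller must put */
}

static void opts_put(struct opts *o)
{
        if (o && refcount_dec_and_test(&o->refcnt))
                kfree_rcu(o, rcu);
}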

Signed-off-by: Elena Reshetova 
Signed-off-by: Hans Liljestrand 
Signed-off-by: Kees Cook 
Signed-off-by: David Windsor 
---
 include/net/ipv6.h   | 7 ---
 net/ipv6/exthdrs.c   | 4 ++--
 net/ipv6/ipv6_sockglue.c | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 3e505bb..6eac5cf 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/refcount.h>
 #include 
 #include 
 #include 
@@ -203,7 +204,7 @@ extern rwlock_t ip6_ra_lock;
  */
 
 struct ipv6_txoptions {
-   atomic_trefcnt;
+   refcount_t  refcnt;
/* Length of this structure */
int tot_len;
 
@@ -265,7 +266,7 @@ static inline struct ipv6_txoptions *txopt_get(const struct ipv6_pinfo *np)
rcu_read_lock();
opt = rcu_dereference(np->opt);
if (opt) {
-   if (!atomic_inc_not_zero(&opt->refcnt))
+   if (!refcount_inc_not_zero(&opt->refcnt))
opt = NULL;
else
opt = rcu_pointer_handoff(opt);
@@ -276,7 +277,7 @@ static inline struct ipv6_txoptions *txopt_get(const struct ipv6_pinfo *np)
 
 static inline void txopt_put(struct ipv6_txoptions *opt)
 {
-   if (opt && atomic_dec_and_test(&opt->refcnt))
+   if (opt && refcount_dec_and_test(&opt->refcnt))
kfree_rcu(opt, rcu);
 }
 
diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
index 0460af22..4996d73 100644
--- a/net/ipv6/exthdrs.c
+++ b/net/ipv6/exthdrs.c
@@ -971,7 +971,7 @@ ipv6_dup_options(struct sock *sk, struct ipv6_txoptions *opt)
*((char **)&opt2->dst1opt) += dif;
if (opt2->srcrt)
*((char **)&opt2->srcrt) += dif;
-   atomic_set(&opt2->refcnt, 1);
+   refcount_set(&opt2->refcnt, 1);
}
return opt2;
 }
@@ -1056,7 +1056,7 @@ ipv6_renew_options(struct sock *sk, struct ipv6_txoptions *opt,
return ERR_PTR(-ENOBUFS);
 
memset(opt2, 0, tot_len);
-   atomic_set(&opt2->refcnt, 1);
+   refcount_set(&opt2->refcnt, 1);
opt2->tot_len = tot_len;
p = (char *)(opt2 + 1);
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index a531ba0..85404e7 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -505,7 +505,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
break;
 
memset(opt, 0, sizeof(*opt));
-   atomic_set(&opt->refcnt, 1);
+   refcount_set(&opt->refcnt, 1);
opt->tot_len = sizeof(*opt) + optlen;
retv = -EFAULT;
if (copy_from_user(opt+1, optval, optlen))
-- 
2.7.4



[PATCH 0/9] v2 ipv4/ipv6 refcount conversions

2017-07-03 Thread Elena Reshetova
Changes in v2:
 * rebase on top of net-next
 * currently, by default, refcount_t is implemented as atomic_t (*) and
   uses the standard atomic operations unless CONFIG_REFCOUNT_FULL is
   enabled. This is a compromise for systems where performance is
   critical (such as net) and where even a slight delay on refcounter
   operations is unacceptable.

This series, for ipv4/ipv6 network components, replaces atomic_t reference
counters with the new refcount_t type and API (see include/linux/refcount.h).
By doing this we prevent intentional or accidental
underflows or overflows that can lead to use-after-free vulnerabilities.

The patches are fully independent and can be cherry-picked separately.
To exercise the refcount checks at run time, CONFIG_REFCOUNT_FULL
must be enabled.
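
To make the trade-off concrete: with CONFIG_REFCOUNT_FULL=y the checked
implementation detects increment-from-zero and saturates instead of
wrapping on overflow, so a counter bug turns into a WARN and a pinned
(leaked) object rather than a use-after-free; with the option off,
refcount_t behaves like plain atomic_t. A tiny illustration of the
kind of misuse the checked API catches (not code from this series):

#include <linux/refcount.h>

static void refcount_misuse_demo(void)
{
        refcount_t r = REFCOUNT_INIT(1);

        refcount_inc(&r);                       /* 1 -> 2 */
        refcount_dec(&r);                       /* 2 -> 1 */
        if (refcount_dec_and_test(&r)) {        /* 1 -> 0: last reference */
                /* the object would be freed here */
        }

        /*
         * Increment after the count already reached zero: the classic
         * use-after-free pattern. With CONFIG_REFCOUNT_FULL this is
         * detected and warns, and the counter stays at zero; with the
         * default atomic_t-backed variant it silently becomes 1 again.
         */
        refcount_inc(&r);
}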

NOTE: the automatic kernel builder for some reason doesn't like my
network branches and regularly times out the builds on them.
The suggestion to "wait a day for good coverage" doesn't work, as
we have seen with the generic network conversions. So please wait
for the full report from the kernel test robot before merging further
up. This has been compile-tested in 116 configs, but 71 timed out
(including all s390-related configs again). I am trying to see if
they can fix build coverage for me in the meantime.

* The respective change is currently merged into -next as
  "locking/refcount: Create unchecked atomic_t implementation".

Elena Reshetova (9):
  net, ipv6: convert ipv6_txoptions.refcnt from atomic_t to refcount_t
  net, ipv6: convert inet6_dev.refcnt from atomic_t to refcount_t
  net, ipv6: convert inet6_ifaddr.refcnt from atomic_t to refcount_t
  net, ipv6: convert ifmcaddr6.mca_refcnt from atomic_t to refcount_t
  net, ipv6: convert ifacaddr6.aca_refcnt from atomic_t to refcount_t
  net, ipv6: convert xfrm6_tunnel_spi.refcnt from atomic_t to refcount_t
  net, ipv6: convert ip6addrlbl_entry.refcnt from atomic_t to refcount_t
  net, ipv4: convert cipso_v4_doi.refcount from atomic_t to refcount_t
  net, ipv4: convert fib_info.fib_clntref from atomic_t to refcount_t

 include/net/addrconf.h   | 14 +++---
 include/net/cipso_ipv4.h |  3 ++-
 include/net/if_inet6.h   |  9 +
 include/net/ip_fib.h |  7 ---
 include/net/ipv6.h   |  7 ---
 net/ipv4/cipso_ipv4.c| 12 ++--
 net/ipv4/fib_semantics.c |  2 +-
 net/ipv4/fib_trie.c  |  2 +-
 net/ipv6/addrconf.c  |  4 ++--
 net/ipv6/addrlabel.c |  9 +
 net/ipv6/anycast.c   |  6 +++---
 net/ipv6/exthdrs.c   |  4 ++--
 net/ipv6/ipv6_sockglue.c |  2 +-
 net/ipv6/mcast.c | 18 +-
 net/ipv6/xfrm6_tunnel.c  |  8 
 15 files changed, 56 insertions(+), 51 deletions(-)

-- 
2.7.4