[patch net] mlxsw: spectrum_router: Avoid potential packets loss

2017-02-27 Thread Jiri Pirko
From: Ido Schimmel 

When the structure of the LPM tree changes (e.g., due to the addition of
a new prefix), we unbind the old tree and then bind the new one. This
may result in temporary packet loss.

Instead, overwrite the old binding with the new one.

Fixes: 6b75c4807db3 ("mlxsw: spectrum_router: Add virtual router management")
Signed-off-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 30 ++
 1 file changed, 20 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index d7ac22d..bd8de6b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -441,30 +441,40 @@ static int
 mlxsw_sp_vr_lpm_tree_check(struct mlxsw_sp *mlxsw_sp, struct mlxsw_sp_vr *vr,
 			   struct mlxsw_sp_prefix_usage *req_prefix_usage)
 {
-	struct mlxsw_sp_lpm_tree *lpm_tree;
+	struct mlxsw_sp_lpm_tree *lpm_tree = vr->lpm_tree;
+	struct mlxsw_sp_lpm_tree *new_tree;
+	int err;
 
-	if (mlxsw_sp_prefix_usage_eq(req_prefix_usage,
-				     &vr->lpm_tree->prefix_usage))
+	if (mlxsw_sp_prefix_usage_eq(req_prefix_usage, &lpm_tree->prefix_usage))
 		return 0;
 
-	lpm_tree = mlxsw_sp_lpm_tree_get(mlxsw_sp, req_prefix_usage,
+	new_tree = mlxsw_sp_lpm_tree_get(mlxsw_sp, req_prefix_usage,
 					 vr->proto, false);
-	if (IS_ERR(lpm_tree)) {
+	if (IS_ERR(new_tree)) {
 		/* We failed to get a tree according to the required
 		 * prefix usage. However, the current tree might be still good
 		 * for us if our requirement is subset of the prefixes used
 		 * in the tree.
 		 */
 		if (mlxsw_sp_prefix_usage_subset(req_prefix_usage,
-						 &vr->lpm_tree->prefix_usage))
+						 &lpm_tree->prefix_usage))
 			return 0;
-		return PTR_ERR(lpm_tree);
+		return PTR_ERR(new_tree);
 	}
 
-	mlxsw_sp_vr_lpm_tree_unbind(mlxsw_sp, vr);
-	mlxsw_sp_lpm_tree_put(mlxsw_sp, vr->lpm_tree);
+	/* Prevent packet loss by overwriting existing binding */
+	vr->lpm_tree = new_tree;
+	err = mlxsw_sp_vr_lpm_tree_bind(mlxsw_sp, vr);
+	if (err)
+		goto err_tree_bind;
+	mlxsw_sp_lpm_tree_put(mlxsw_sp, lpm_tree);
+
+	return 0;
+
+err_tree_bind:
 	vr->lpm_tree = lpm_tree;
-	return mlxsw_sp_vr_lpm_tree_bind(mlxsw_sp, vr);
+	mlxsw_sp_lpm_tree_put(mlxsw_sp, new_tree);
+	return err;
 }
 
 static struct mlxsw_sp_vr *mlxsw_sp_vr_get(struct mlxsw_sp *mlxsw_sp,
-- 
2.7.4
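
The ordering this patch switches to can be sketched in isolation. What
follows is a minimal userspace sketch, not the driver code: struct tree,
struct vrouter, vrouter_bind() and the fail_next_bind knob are hypothetical
stand-ins, and the hardware binding is modelled as a plain pointer. It only
shows that the new binding overwrites the old one, and that a reference is
dropped on the old tree after a successful switch or on the new tree after
a failed one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are not the mlxsw types. */
struct tree {
	int refcount;
};

struct vrouter {
	struct tree *tree;	/* software view of the bound tree */
	struct tree *hw_tree;	/* what the simulated hardware uses */
	int fail_next_bind;	/* test knob: make the next bind fail */
};

/* Simulated bind: overwrites the hardware binding in a single step. */
static int vrouter_bind(struct vrouter *vr)
{
	if (vr->fail_next_bind) {
		vr->fail_next_bind = 0;
		return -1;
	}
	vr->hw_tree = vr->tree;
	return 0;
}

static void tree_put(struct tree *t)
{
	assert(t->refcount > 0);
	t->refcount--;
}

/*
 * Replace the bound tree without ever leaving the hardware unbound.
 * Takes ownership of one reference on new_tree.
 */
static int vrouter_replace_tree(struct vrouter *vr, struct tree *new_tree)
{
	struct tree *old_tree = vr->tree;
	int err;

	vr->tree = new_tree;
	err = vrouter_bind(vr);
	if (err)
		goto err_bind;
	tree_put(old_tree);	/* released only after the switch succeeded */
	return 0;

err_bind:
	vr->tree = old_tree;	/* hardware still points at the old tree */
	tree_put(new_tree);
	return err;
}
```

In this model hw_tree is never NULL once a tree has been bound, which is
the property the commit message is after.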



[bug report] dp83640: Delay scheduled work.

2017-02-27 Thread Dan Carpenter
Hello Stefan Sørensen,

The patch 4b063258ab93: "dp83640: Delay scheduled work." from Nov 3,
2015, leads to the following static checker warning:

drivers/net/phy/dp83640.c:1442 dp83640_rxtstamp()
warn: 'skb' was already freed.

drivers/net/phy/dp83640.c
  1402  struct dp83640_skb_info *skb_info = (struct dp83640_skb_info *)skb->cb;
  1403  struct list_head *this, *next;
  1404  struct rxts *rxts;
  1405  struct skb_shared_hwtstamps *shhwtstamps = NULL;
 ^^
  1406  unsigned long flags;
  1407  
  1408  if (is_status_frame(skb, type)) {
  1409  decode_status_frame(dp83640, skb);
  1410  kfree_skb(skb);
  1411  return true;
  1412  }
  1413  
  1414  if (!dp83640->hwts_rx_en)
  1415  return false;
  1416  
  1417  if ((type & dp83640->version) == 0 || (type & dp83640->layer) == 0)
  1418  return false;
  1419  
  1420  spin_lock_irqsave(&dp83640->rx_lock, flags);
  1421  prune_rx_ts(dp83640);
  1422  list_for_each_safe(this, next, &dp83640->rxts) {
  1423  rxts = list_entry(this, struct rxts, list);
  1424  if (match(skb, type, rxts)) {
  1425  shhwtstamps = skb_hwtstamps(skb);
  1426  memset(shhwtstamps, 0, sizeof(*shhwtstamps));
  1427  shhwtstamps->hwtstamp = ns_to_ktime(rxts->ns);
  1428  netif_rx_ni(skb);
^^^
If shhwtstamps is non-NULL then we call netif_rx_ni(skb);.  If this
call returns NET_RX_DROP then that means we've done a kfree_skb(skb).

  1429  list_del_init(&rxts->list);
  1430  list_add(&rxts->list, &dp83640->rxpool);
  1431  break;
  1432  }
  1433  }
  1434  spin_unlock_irqrestore(&dp83640->rx_lock, flags);
  1435  
  1436  if (!shhwtstamps) {
  1437  skb_info->ptp_type = type;
  1438  skb_info->tmo = jiffies + SKB_TIMESTAMP_TIMEOUT;
  1439  skb_queue_tail(&dp83640->rx_queue, skb);
  1440  schedule_delayed_work(&dp83640->ts_work, SKB_TIMESTAMP_TIMEOUT);
  1441  } else {
  1442  netif_rx_ni(skb);


And then we call it a second time outside the spinlock.  When I look at
the commit which added this, it feels like something that was added by
mistake.  But I'm not really familiar enough with this code to say whether
I have missed something.

  1443  }
  1444  
  1445  return true;
  1446  }

regards,
dan carpenter
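
For illustration only, here is a minimal userspace sketch of a receive
path with a single hand-off point, so that a buffer is passed to a
consuming function exactly once. It is not a proposed fix for dp83640.c:
struct pkt, deliver(), defer() and rx_timestamp() are hypothetical
stand-ins, and defer() simply frees the buffer where a real driver would
queue it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical packet buffer; stands in for struct sk_buff. */
struct pkt {
	long long hwtstamp_ns;
};

static int delivered;	/* buffers handed to the stack */
static int deferred;	/* buffers parked until a timestamp arrives */

/* Consumes p: after this call the caller must not touch it again. */
static void deliver(struct pkt *p)
{
	delivered++;
	free(p);
}

/* Also consumes p (a real driver would queue it instead of freeing). */
static void defer(struct pkt *p)
{
	deferred++;
	free(p);
}

/*
 * The matching step only records the timestamp; ownership of the buffer
 * moves exactly once, after the matching step is finished.
 */
static void rx_timestamp(struct pkt *p, const long long *matched_ns)
{
	bool stamped = false;

	assert(p != NULL);
	if (matched_ns) {
		p->hwtstamp_ns = *matched_ns;
		stamped = true;
	}

	if (stamped)
		deliver(p);	/* the only delivery call on this path */
	else
		defer(p);
}
```

With this shape there is no second delivery call after the matching step,
which is the pattern the checker flags at line 1442 above.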


Re: [Patch net v2] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread kbuild test robot
Hi Cong,

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Cong-Wang/ipv6-check-for-ip6_null_entry-in-__ip6_del_rt_siblings/20170228-135206
config: x86_64-randconfig-x017-201709 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   Cyclomatic Complexity 3 include/net/sock.h:lockdep_sock_is_held
   Cyclomatic Complexity 5 include/net/sock.h:__sk_dst_get
   Cyclomatic Complexity 6 include/net/sock.h:sock_owned_by_me
   Cyclomatic Complexity 1 include/net/sock.h:sock_owned_by_user
   Cyclomatic Complexity 5 include/net/addrconf.h:__in6_dev_get
   Cyclomatic Complexity 11 net/ipv6/route.c:rt6_mtu_change_route
   Cyclomatic Complexity 4 net/ipv6/route.c:ip6_mtu
   Cyclomatic Complexity 3 include/net/neighbour.h:__neigh_lookup
   Cyclomatic Complexity 2 include/net/neighbour.h:neigh_release
   Cyclomatic Complexity 2 net/ipv6/route.c:ip6_print_replace_route_err
   Cyclomatic Complexity 6 net/ipv6/route.c:ip6_pkt_drop
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_pkt_discard
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_pkt_discard_out
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_pkt_prohibit
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_pkt_prohibit_out
   Cyclomatic Complexity 12 net/ipv6/route.c:ip6_convert_metrics
   Cyclomatic Complexity 7 net/ipv6/route.c:ip6_route_info_append
   Cyclomatic Complexity 3 net/ipv6/route.c:__ip6_del_rt
   Cyclomatic Complexity 2 arch/x86/include/asm/uaccess.h:copy_user_overflow
   Cyclomatic Complexity 2 net/ipv6/route.c:rt6_nlmsg_size
   Cyclomatic Complexity 1 include/linux/skbuff.h:alloc_skb
   Cyclomatic Complexity 1 include/net/netlink.h:nlmsg_new
   Cyclomatic Complexity 2 include/net/netlink.h:nlmsg_put
   Cyclomatic Complexity 5 include/net/ip6_route.h:ip6_route_get_saddr
   Cyclomatic Complexity 1 include/net/netlink.h:nla_put_in6_addr
   Cyclomatic Complexity 1 include/net/netlink.h:nla_put_u32
   Cyclomatic Complexity 2 include/net/netlink.h:nla_nest_start
   Cyclomatic Complexity 1 include/net/netlink.h:nla_put_u8
   Cyclomatic Complexity 10 net/ipv6/route.c:rt6_nexthop_info
   Cyclomatic Complexity 4 net/ipv6/route.c:rt6_add_nexthop
   Cyclomatic Complexity 3 include/net/netlink.h:nlmsg_trim
   Cyclomatic Complexity 1 include/net/netlink.h:nlmsg_cancel
   Cyclomatic Complexity 38 net/ipv6/route.c:rt6_fill_node
   Cyclomatic Complexity 10 net/ipv6/route.c:__ip6_del_rt_siblings
   Cyclomatic Complexity 16 net/ipv6/route.c:ip6_route_del
   Cyclomatic Complexity 6 net/ipv6/route.c:ip6_dst_gc
   Cyclomatic Complexity 2 net/ipv6/route.c:ipv6_sysctl_rtcache_flush
   Cyclomatic Complexity 5 net/ipv6/route.c:ip6_route_net_init
   Cyclomatic Complexity 2 include/net/netlink.h:nlmsg_parse
   Cyclomatic Complexity 1 include/net/netlink.h:nla_get_in6_addr
   Cyclomatic Complexity 24 net/ipv6/route.c:rtm_to_fib6_config
   Cyclomatic Complexity 6 net/ipv6/route.c:ip6_route_multipath_del
   Cyclomatic Complexity 3 net/ipv6/route.c:inet6_rtm_delroute
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_route_net_exit_late
   Cyclomatic Complexity 1 net/ipv6/route.c:rt6_stats_seq_open
   Cyclomatic Complexity 1 net/ipv6/route.c:rt6_stats_seq_show
   Cyclomatic Complexity 1 include/linux/proc_fs.h:proc_create
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_route_net_init_late
   Cyclomatic Complexity 1 net/ipv6/route.c:ipv6_inetpeer_exit
   Cyclomatic Complexity 2 net/ipv6/route.c:ipv6_inetpeer_init
   Cyclomatic Complexity 4 net/ipv6/route.c:ip6_dst_alloc
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_route_lookup
   Cyclomatic Complexity 3 net/ipv6/route.c:rt6_lookup
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_ins_rt
   Cyclomatic Complexity 9 net/ipv6/route.c:__ip6_rt_update_pmtu
   Cyclomatic Complexity 2 net/ipv6/route.c:ip6_rt_update_pmtu
   Cyclomatic Complexity 13 net/ipv6/route.c:ip6_pol_route
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_pol_route_input
   Cyclomatic Complexity 1 net/ipv6/route.c:ip6_pol_route_output
   Cyclomatic Complexity 4 net/ipv6/route.c:ip6_nh_lookup_table
   Cyclomatic Complexity 59 net/ipv6/route.c:ip6_route_info_create
   Cyclomatic Complexity 3 net/ipv6/route.c:ip6_route_input_lookup
   Cyclomatic Complexity 3 net/ipv6/route.c:ip6_route_input
   Cyclomatic Complexity 10 net/ipv6/route.c:ip6_route_output_flags
   Cyclomatic Complexity 1 include/net/ip6_route.h:ip6_route_output
   Cyclomatic Complexity 18 net/ipv6/route.c:inet6_rtm_getroute
   Cyclomatic Complexity 4 net/ipv6/route.c:ip6_blackhole_route
   Cyclomatic Complexity 4 net/ipv6/route.c:ip6_update_pmtu
   Cyclomatic Complexity 6 net/ipv6/route.c:ip6_sk_update_pmtu
   Cyclomatic Complexity 3 net/ipv6/route.c:icmp6_dst_alloc
   Cyclomatic Complexity 3 

[PATCH net] sctp: call rcu_read_lock before checking for duplicate transport nodes

2017-02-27 Thread Xin Long
Commit cd2b70875058 ("sctp: check duplicate node before inserting a
new transport") called rhltable_lookup() to check for the duplicate
transport node in transport rhashtable.

But rhltable_lookup() doesn't take rcu_read_lock() internally, so it could
cause a use-after-free if it dereferences a node that another CPU has
already freed. Note that the sock lock cannot prevent this, as it is per
sock.

This patch fixes it by calling rcu_read_lock() before checking for
duplicate transport nodes.

Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport")
Reported-by: Andrey Konovalov 
Signed-off-by: Xin Long 
---
 net/sctp/input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sctp/input.c b/net/sctp/input.c
index fc45896..2a28ab2 100644
--- a/net/sctp/input.c
+++ b/net/sctp/input.c
@@ -884,14 +884,17 @@ int sctp_hash_transport(struct sctp_transport *t)
 	arg.paddr = &t->ipaddr;
 	arg.lport = htons(t->asoc->base.bind_addr.port);
 
+	rcu_read_lock();
 	list = rhltable_lookup(&sctp_transport_hashtable, &arg,
 			       sctp_hash_params);
 
 	rhl_for_each_entry_rcu(transport, tmp, list, node)
 		if (transport->asoc->ep == t->asoc->ep) {
+			rcu_read_unlock();
 			err = -EEXIST;
 			goto out;
 		}
+	rcu_read_unlock();
 
 	err = rhltable_insert_key(&sctp_transport_hashtable, &arg,
 				  &t->node, sctp_hash_params);
-- 
2.1.0



Re: [PATCH V2] vhost: introduce O(1) vq metadata cache

2017-02-27 Thread Jason Wang



On 2017-02-28 02:35, Michael S. Tsirkin wrote:

On Wed, Feb 15, 2017 at 01:37:17PM +0800, Jason Wang wrote:


On 2016-12-14 17:53, Jason Wang wrote:

When device IOTLB is enabled, all address translations are stored in an
interval tree. The O(lgN) search time can be slow for virtqueue
metadata (avail, used and descriptors), since it is accessed much more
often than other addresses. So this patch introduces an O(1) array
which points to the interval tree nodes that store the translations of
vq metadata. The array is updated during vq IOTLB prefetching and
reset on each invalidation and TLB update. Each time we want
to access vq metadata, this small array is queried before the interval
tree. This should be sufficient for static mappings but not for dynamic
mappings; we could do optimizations on top.

Tests were done with l2fwd in guest (2M hugepage):

     noiommu  | before        | after
tx   1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
rx   2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)

We can almost reach the same performance as noiommu mode.

Signed-off-by: Jason Wang
---
Changes from V1:
- silence 32bit build warning

ping

Could you rebase pls?
I pushed my tree into linux next.



Ok, will do.

Thanks
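
The idea in the quoted patch description, a small per-virtqueue array
consulted before the slower translation lookup, can be sketched roughly as
below. This is a userspace sketch under assumptions, not the vhost code:
struct mapping, struct vq_cache, slow_lookup() and translate() are
hypothetical names, the interval tree is replaced by a linear scan, and the
cache is filled on a miss rather than during prefetching as the patch
describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical translation entry; stands in for an interval-tree node. */
struct mapping {
	uint64_t start;	/* first guest address covered */
	uint64_t last;	/* last guest address covered (inclusive) */
	uint64_t host;	/* host address corresponding to start */
};

enum meta_slot { META_DESC, META_AVAIL, META_USED, META_NUM };

struct vq_cache {
	const struct mapping *slot[META_NUM];	/* NULL means "not cached" */
};

static int slow_lookups;	/* counts fallbacks to the slow path */

/* Slow path: a linear scan standing in for the interval-tree search. */
static const struct mapping *slow_lookup(const struct mapping *maps,
					 size_t n, uint64_t addr)
{
	slow_lookups++;
	for (size_t i = 0; i < n; i++)
		if (addr >= maps[i].start && addr <= maps[i].last)
			return &maps[i];
	return NULL;
}

/* Drop all cached pointers, e.g. on invalidation or TLB update. */
static void cache_reset(struct vq_cache *c)
{
	for (int i = 0; i < META_NUM; i++)
		c->slot[i] = NULL;
}

/* Translate addr for one metadata slot, consulting the cache first. */
static int translate(struct vq_cache *c, enum meta_slot s,
		     const struct mapping *maps, size_t n,
		     uint64_t addr, uint64_t *host)
{
	const struct mapping *m = c->slot[s];

	if (!m || addr < m->start || addr > m->last) {
		m = slow_lookup(maps, n, addr);
		if (!m)
			return -1;
		c->slot[s] = m;	/* remember for the next access */
	}
	*host = m->host + (addr - m->start);
	return 0;
}
```

Resetting the whole array on invalidation keeps the sketch simple; cached
pointers never outlive the mappings they refer to as long as every update
goes through cache_reset().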


[GIT] Networking

2017-02-27 Thread David Miller

1) Don't save TIPC header values before the header has been validated,
   from Jon Paul Maloy.

2) Fix memory leak in RDS, from Zhu Yanjun.

3) We fail to initialize the UID in the flow key in some paths, from
   Julian Anastasov.

4) Fix latent TOS masking bug in the routing cache removal from years
   ago, also from Julian.

5) We forget to set the sockaddr port in sctp_copy_local_addr_list(),
   fix from Xin Long.

6) Missing module ref count drop in packet scheduler actions, from
   Roman Mashak.

7) Fix RCU annotations in rht_bucket_nested, from Herbert Xu.

8) Fix use after free which happens because L2TP's ipv4 support
   returns non-zero values from its backlog_rcv function which ipv4
   interprets as protocol values.  Fix from Paul Hüber.

Please pull, thanks a lot!

The following changes since commit f1ef09fde17f9b77ca1435a5b53a28b203afb81c:

  Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace (2017-02-23 20:33:51 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to 2f44f75257d57f0d5668dba3a6ada0f4872132c9:

  Merge branch 'qed-fixes' (2017-02-27 09:22:10 -0500)


Brian Russell (1):
  vxlan: don't allow overwrite of config src addr

Colin Ian King (1):
  lib: fix spelling mistake: "actualy" -> "actually"

David Forster (1):
  vti6: return GRE_KEY for vti6

David Howells (1):
  rxrpc: Kernel calls get stuck in recvmsg

David S. Miller (2):
  Merge git://git.kernel.org/.../pablo/nf
  Merge branch 'qed-fixes'

Dmitry V. Levin (2):
  uapi: stop including linux/sysctl.h in uapi/linux/netfilter.h
  uapi: fix linux/netfilter/xt_hashlimit.h userspace compilation error

Eric Dumazet (1):
  net/mlx4_en: fix overflow in mlx4_en_init_timestamp()

Florian Fainelli (1):
  net: phy: Add missing driver check in phy_aneg_done()

Florian Westphal (1):
  netfilter: nft_ct: fix random validation errors for zone set support

Geert Uytterhoeven (2):
  drivers: net: xgene: Simplify xgene_enet_setup_mss() to kill warning
  lib: Allow compile-testing of parman

Herbert Xu (2):
  rhashtable: Fix use before NULL check in bucket_table_free
  rhashtable: Fix RCU dereference annotation in rht_bucket_nested

Jarno Rajahalme (2):
  netfilter: nf_ct_expect: nf_ct_expect_related_report(): Return zero on success.
  netfilter: nf_ct_expect: Change __nf_ct_expect_check() return value.

Jon Paul Maloy (1):
  tipc: move premature initilalization of stack variables

Julian Anastasov (3):
  ipv4: add missing initialization for flowi4_uid
  ipv4: mask tos for input route
  xfrm: provide correct dst in xfrm_neigh_lookup

LABBE Corentin (3):
  net: stmmac: unify registers dumps methods
  net: vxge: fix typo argumnet argument
  net: s2io: fix typo argumnet argument

Marc Dionne (1):
  rxrpc: Fix an assertion in rxrpc_read()

Marcelo Ricardo Leitner (1):
  sctp: deny peeloff operation on asocs with threads sleeping on it

Matthias Schiffer (1):
  vxlan: correctly validate VXLAN ID against VXLAN_N_VID

Mintz, Yuval (2):
  qed: Fix race with multiple VFs
  qed: Don't use attention PTT for configuring BW

Pablo Neira Ayuso (1):
  netfilter: nft_set_bitmap: incorrect bitmap size

Paul Hüber (1):
  l2tp: avoid use-after-free caused by l2tp_ip_backlog_recv

Roman Mashak (2):
  net sched actions: decrement module reference count after table flush.
  net sched actions: do not overwrite status of action creation.

Wu Fengguang (1):
  RDS: IB: fix ifnullfree.cocci warnings

Xin Long (2):
  sctp: set sin_port for addr param when checking duplicate address
  ipv6: check sk sk_type and protocol early in ip_mroute_set/getsockopt

Zhu Yanjun (1):
  rds: fix memory leak error

 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 13 +++--
 drivers/net/ethernet/mellanox/mlx4/en_clock.c| 18 --
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  1 -
 drivers/net/ethernet/neterion/s2io.c |  2 +-
 drivers/net/ethernet/neterion/vxge/vxge-ethtool.c|  2 +-
 drivers/net/ethernet/qlogic/qed/qed.h|  4 +++-
 drivers/net/ethernet/qlogic/qed/qed_dev.c|  6 +++---
 drivers/net/ethernet/qlogic/qed/qed_mcp.c|  3 ++-
 drivers/net/ethernet/qlogic/qed/qed_sriov.c  | 39 +++
 drivers/net/ethernet/qlogic/qed/qed_sriov.h  |  4 +++-
 drivers/net/ethernet/stmicro/stmmac/common.h |  4 ++--
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_core.c | 10 +++---
 drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c  | 16 ++--
 drivers/net/ethernet/stmicro/stmmac/dwmac100_core.c  | 30 ++
 drivers/net/ethernet/stmicro/stmmac/dwmac100_dma.c   | 

Re: [PATCH v2 net] net: solve a NAPI race

2017-02-27 Thread David Miller
From: Eric Dumazet 
Date: Mon, 27 Feb 2017 08:44:14 -0800

> Any point doing a napi_schedule() not from device hard irq handler
> is subject to the race for NIC using some kind of edge trigger
> interrupts.
> 
> Since we do not provide a ndo to disable device interrupts, the
> following can happen.

Ok, now I understand.

I think even without considering the race you are trying to solve,
this situation is really dangerous.

I am sure that every ->poll() handler out there was written by an
author who completely assumed that if they are executing then the
device's interrupts for that NAPI instance are disabled.  And this is
with very few, if any, exceptions.

So if we saw a driver doing something like:

reg->irq_enable ^= value;

after napi_complete_done(), it would be quite understandable.

We really made a mistake taking the napi_schedule() call out of
the domain of the driver so that it could manage the interrupt
state properly.

I'm not against your missed bit fix as a short-term cure for now, it's
just that somewhere down the road we need to manage the interrupt
properly.


Re: [Patch net v3] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread David Ahern
On 2/27/17 4:07 PM, Cong Wang wrote:
> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
> because ip6_null_entry is returned in this path since ip6_null_entry
> is kinda default for a ipv6 route table root node. Quote from


Missed this earlier. The issue here is an attempt to delete the NULL
route, not that the null_entry is being returned as happens during a
route lookup. This will also hit the bug:
ip -6 ro del ::/0


[RFC PATCH] uapi: fix linux/packet_diag.h userspace compilation error

2017-02-27 Thread Dmitry V. Levin
Replace MAX_ADDR_LEN with its numeric value to fix the following
linux/packet_diag.h userspace compilation error:

/usr/include/linux/packet_diag.h:67:17: error: 'MAX_ADDR_LEN' undeclared here (not in a function)
  __u8 pdmc_addr[MAX_ADDR_LEN];

This is not the first case in the UAPI where the numeric value
of MAX_ADDR_LEN is used, uapi/linux/if_link.h already does the same,
and there are no UAPI headers besides these two that use MAX_ADDR_LEN.

The alternative fix would be to include  which
pulls in other headers and a lot of definitions with them.

Signed-off-by: Dmitry V. Levin 
---
I'm not quite comfortable with this approach but the alternative
has its own drawbacks.

 include/uapi/linux/packet_diag.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/packet_diag.h b/include/uapi/linux/packet_diag.h
index d08c63f..0c5d5dd 100644
--- a/include/uapi/linux/packet_diag.h
+++ b/include/uapi/linux/packet_diag.h
@@ -64,7 +64,7 @@ struct packet_diag_mclist {
 	__u32	pdmc_count;
 	__u16	pdmc_type;
 	__u16	pdmc_alen;
-	__u8	pdmc_addr[MAX_ADDR_LEN];
+	__u8	pdmc_addr[32]; /* MAX_ADDR_LEN */
 };
 
 struct packet_diag_ring {
-- 
ldv
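
One way to keep a hard-coded array size from silently drifting away from
the constant it mirrors is a compile-time check kept somewhere that can see
both. The sketch below is a userspace illustration only: the example_ and
EXAMPLE_ names are hypothetical, the struct is a hand-written mirror rather
than the UAPI header, and EXAMPLE_MAX_ADDR_LEN stands in for the kernel's
MAX_ADDR_LEN (32, as in the patch above).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's MAX_ADDR_LEN; hypothetical name. */
#define EXAMPLE_MAX_ADDR_LEN 32

/* Hand-written mirror of the patched layout, for illustration only. */
struct example_packet_diag_mclist {
	uint32_t pdmc_index;
	uint32_t pdmc_count;
	uint16_t pdmc_type;
	uint16_t pdmc_alen;
	uint8_t  pdmc_addr[32];	/* MAX_ADDR_LEN */
};

/* Breaks the build if the literal and the constant ever disagree. */
_Static_assert(sizeof(((struct example_packet_diag_mclist *)0)->pdmc_addr) ==
	       EXAMPLE_MAX_ADDR_LEN,
	       "pdmc_addr must hold MAX_ADDR_LEN bytes");

/* Number of address bytes the structure can carry. */
static size_t example_addr_capacity(void)
{
	return sizeof(((struct example_packet_diag_mclist *)0)->pdmc_addr);
}
```

Such a check costs nothing at run time and does not require the UAPI
header itself to include anything extra.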


Re: [Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
On Mon, Feb 27, 2017 at 1:06 PM, David Ahern  wrote:
> On 2/27/17 1:04 PM, Cong Wang wrote:
>>>
>>> for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
>>> +   /* do not allow deletion of the null route */
>>> +   if (rt == net->ipv6.ip6_null_entry)
>>> +   continue;
>>>
>>> Fixes: 0ae8133586ad net: ipv6: Allow shorthand delete of all nexthops in
>>> multipath route
>>
>> Note, I moved the check into __ip6_del_rt_siblings() because __ip6_del_rt()
>> has a same check.
>>
>
> that's b/c __ip6_del_rt has a second call path. __ip6_del_rt_siblings is
> new and is not expecting to see the null entry. Catching it before the
> dst_hold would be better.

Yeah, but it also depends on whether we want to continue after the null entry.
At least __ip6_del_rt() returns an error for the null entry, which looks more
correct than continuing past it.
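
To make the two options concrete, here is a minimal userspace sketch of
the "return an error at the null entry" variant. It is not the kernel
code: struct route, null_entry and route_del() are hypothetical stand-ins,
and the error codes are chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical route node; stands in for struct rt6_info. */
struct route {
	struct route *next;
	int metric;
	int deleted;
};

/* Sentinel that plays the role of the table's null entry. */
static struct route null_entry;

/*
 * Mark the first route with a matching metric as deleted.
 * Returns 0 on success, -ENOENT when the walk reaches the sentinel,
 * and -ESRCH when no route matches.
 */
static int route_del(struct route *leaf, int metric)
{
	for (struct route *rt = leaf; rt; rt = rt->next) {
		if (rt == &null_entry)
			return -ENOENT;	/* stop rather than continue past it */
		if (rt->metric != metric)
			continue;
		rt->deleted = 1;
		return 0;
	}
	return -ESRCH;
}
```

The alternative discussed above would replace the early return with a
continue, which keeps walking the list but gives the caller no error for
an attempt to delete the null route.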


Re: [PATCH v3 net] net: solve a NAPI race

2017-02-27 Thread Stephen Hemminger
On Mon, 27 Feb 2017 17:48:54 -0500 (EST)
David Miller  wrote:

> From: Stephen Hemminger 
> Date: Mon, 27 Feb 2017 14:44:55 -0800
> 
> > On Mon, 27 Feb 2017 14:35:17 -0800
> > Eric Dumazet  wrote:
> >   
> >> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> >>   
> >> > The original design (as Davem mentioned) was that IRQ's must be disabled
> >> > during device polling. If that was true, then the race above
> >> > would be impossible.
> >> 
> >> I would love to see an alternative patch.  
> > 
> > Turn off busy poll? 
> > The poll stuff runs risk of breaking more things.  
> 
> Eric is exactly trying to make busy poll even more prominent in
> the stack, not less prominent.
> 
> It's an important component of some performance improvements he is
> working on.

Maybe making IRQ controlled as part of the network device model
(instead of a side effect left to device driver to handle) would
be less problematic.

Really just shooting in the dark because I don't have any of the problematic
hardware to play with.


[PATCH v1.2] dt: emac: document device-tree based phy discovery and setup

2017-02-27 Thread Christian Lamparter
This patch adds documentation for a new "phy-handle" property,
"fixed-link" and "mdio" sub-node. These allow the enumeration
of PHYs which are supported by the phy library under drivers/net/phy.

The EMAC ethernet controller in IBM and AMCC 4xx chips is
currently stuck with a few privately defined phy
implementations. It has no support for PHYs which
are supported by the generic phylib.

Acked-by: Rob Herring 
Signed-off-by: Christian Lamparter 
---
Fixed: phy node - so it conforms to phy.txt.
---
---
 .../devicetree/bindings/powerpc/4xx/emac.txt   | 61 +-
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
index 712baf6c3e24..2fa861378294 100644
--- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
+++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
@@ -71,6 +71,9 @@
  For Axon it can be absent, though my current driver
  doesn't handle phy-address yet so for now, keep
  0x00ff in it.
+- phy-handle   : Used to describe configurations where a external PHY
+ is used. Please refer to:
+ Documentation/devicetree/bindings/net/ethernet.txt
 - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
  operations (if absent the value is the same as
  rx-fifo-size).  For Axon, either absent or 2048.
@@ -81,8 +84,22 @@
  offload, phandle of the TAH device node.
 - tah-channel   : 1 cell, optional. If appropriate, channel used on the
  TAH engine.
+- fixed-link   : Fixed-link subnode describing a link to a non-MDIO
+ managed entity. See
+ Documentation/devicetree/bindings/net/fixed-link.txt
+ for details.
+- mdio subnode : When the EMAC has a phy connected to its local
+ mdio, which us supported by the kernel's network
+ PHY library in drivers/net/phy, there must be device
+ tree subnode with the following required properties:
+   - #address-cells: Must be <1>.
+   - #size-cells: Must be <0>.
 
-Example:
+ For PHY definitions: Please refer to
+ Documentation/devicetree/bindings/net/phy.txt and
+ Documentation/devicetree/bindings/net/ethernet.txt
+
+Examples:
 
EMAC0: ethernet@4800 {
device_type = "network";
@@ -104,6 +121,47 @@
zmii-channel = <0>;
};
 
+   EMAC1: ethernet@ef600c00 {
+   device_type = "network";
+   compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
+   interrupt-parent = <>;
+   interrupts = <0 1>;
+   #interrupt-cells = <1>;
+   #address-cells = <0>;
+   #size-cells = <0>;
+   interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
+1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
+   reg = <0xef600c00 0x00c4>;
+   local-mac-address = []; /* Filled in by U-Boot */
+   mal-device = <>;
+   mal-tx-channel = <0>;
+   mal-rx-channel = <0>;
+   cell-index = <0>;
+   max-frame-size = <9000>;
+   rx-fifo-size = <16384>;
+   tx-fifo-size = <2048>;
+   fifo-entry-size = <10>;
+   phy-mode = "rgmii";
+   phy-handle = <>;
+   phy-map = <0x>;
+   rgmii-device = <>;
+   rgmii-channel = <0>;
+   tah-device = <>;
+   tah-channel = <0>;
+   has-inverted-stacr-oc;
+   has-new-stacr-staopc;
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   phy0: ethernet-phy@0 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <0>;
+   };
+   };
+
+
   ii) McMAL node
 
 Required properties:
@@ -145,4 +203,3 @@
 - revision   : as provided by the RGMII new version register if
   available.
   For Axon: 0x012a
-
-- 
2.11.0



Re: [PATCH v2 net] net: phy: Fix LED mode in DT single property.

2017-02-27 Thread Rob Herring
On Fri, Feb 24, 2017 at 4:47 AM, Raju Lakkaraju
 wrote:
> From: Raju Lakkaraju 
>
> Fix the LED mode DT parameters combine to a single property
> and change the vendor prefix i.e. mscc.
>
> Signed-off-by: Raju Lakkaraju 
> ---
> Change set:
> v0: Fix the LED mode DT parameters combine to a single property
> v1: Fix the build test ERROR
> v2: Add default LED mode "vsc85xx_dt_led_mode_get" function.

See my comments on v1.

>  .../devicetree/bindings/net/mscc-phy-vsc8531.txt   | 20 +++
>  drivers/net/phy/mscc.c | 65 
> --
>  2 files changed, 45 insertions(+), 40 deletions(-)
>
> diff --git a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt 
> b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
> index 0eedabe..2253de5 100644
> --- a/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
> +++ b/Documentation/devicetree/bindings/net/mscc-phy-vsc8531.txt
> @@ -6,12 +6,12 @@ Required properties:
>   Documentation/devicetree/bindings/net/phy.txt
>
>  Optional properties:
> -- vsc8531,vddmac   : The vddmac in mV. Allowed values is listed
> +- mscc,vddmac  : The vddmac in mV. Allowed values is listed
>   in the first row of Table 1 (below).
>   This property is only used in combination
>   with the 'edge-slowdown' property.
>   Default value is 3300.
> -- vsc8531,edge-slowdown: % the edge should be slowed down relative to
> +- mscc,edge-slowdown   : % the edge should be slowed down relative to
>   the fastest possible edge time.
>   Edge rate sets the drive strength of the MAC
>   interface output signals.  Changing the
> @@ -27,14 +27,11 @@ Optional properties:
>   'vddmac'.
>   Default value is 0%.
>   Ref: Table:1 - Edge rate change (below).
> -- vsc8531,led-0-mode   : LED mode. Specify how the LED[0] should behave.
> +- mscc,led-mode: LED mode. Specify how the LED[0] and LED[1] 
> should behave.
>   Allowed values are define in
>   "include/dt-bindings/net/mscc-phy-vsc8531.h".
> - Default value is VSC8531_LINK_1000_ACTIVITY (1).
> -- vsc8531,led-1-mode   : LED mode. Specify how the LED[1] should behave.
> - Allowed values are define in
> - "include/dt-bindings/net/mscc-phy-vsc8531.h".
> - Default value is VSC8531_LINK_100_ACTIVITY (2).
> + Default LED[0] value is VSC8531_LINK_1000_ACTIVITY 
> (1).
> + Default LED[1] value is VSC8531_LINK_100_ACTIVITY 
> (2).
>
>  Table: 1 - Edge rate change
>  |
> @@ -66,8 +63,7 @@ Example:
>
>  vsc8531_0: ethernet-phy@0 {
>  compatible = "ethernet-phy-id0007.0570";
> -vsc8531,vddmac = <3300>;
> -vsc8531,edge-slowdown  = <7>;
> -vsc8531,led-0-mode = ;
> -vsc8531,led-1-mode = ;
> +mscc,vddmac= /bits/ 16 <3300>;
> +mscc,edge-slowdown = /bits/ 8  <7>;
> +mscc,led-mode  = ;
>  };
> diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
> index 650c266..5cd705b 100644
> --- a/drivers/net/phy/mscc.c
> +++ b/drivers/net/phy/mscc.c
> @@ -385,11 +385,11 @@ static int vsc85xx_edge_rate_magic_get(struct 
> phy_device *phydev)
> if (!of_node)
> return -ENODEV;
>
> -   rc = of_property_read_u16(of_node, "vsc8531,vddmac", );
> +   rc = of_property_read_u16(of_node, "mscc,vddmac", );
> if (rc != 0)
> vdd = MSCC_VDDMAC_3300;
>
> -   rc = of_property_read_u8(of_node, "vsc8531,edge-slowdown", );
> +   rc = of_property_read_u8(of_node, "mscc,edge-slowdown", );
> if (rc != 0)
> sd = 0;
>
> @@ -402,26 +402,43 @@ static int vsc85xx_edge_rate_magic_get(struct 
> phy_device *phydev)
> return -EINVAL;
>  }
>
> -static int vsc85xx_dt_led_mode_get(struct phy_device *phydev,
> -  char *led,
> -  u8 default_mode)
> +static int vsc85xx_dt_led_mode_get(struct phy_device *phydev, char *led)
>  {
> struct device *dev = >mdio.dev;
> struct device_node *of_node = dev->of_node;
> -   u8 led_mode;
> -   int err;
> +   struct vsc8531_private *vsc8531 = phydev->priv;
> +   u8 led_0_mode = VSC8531_LINK_1000_ACTIVITY;
> +   u8 led_1_mode = VSC8531_LINK_100_ACTIVITY;
> +   const __be32 *paddr_end;
> +   const __be32 *paddr;
> +   int 

Re: [PATCH v1.1] dt: emac: document device-tree based phy discovery and setup

2017-02-27 Thread Florian Fainelli
On 02/27/2017 12:11 PM, Christian Lamparter wrote:
> This patch adds documentation for a new "phy-handle" property,
> "fixed-link" and "mdio" sub-node. These allows the enumeration
> of PHYs which are supported by the phy library under drivers/net/phy.
> 
> The EMAC ethernet controller in IBM and AMCC 4xx chips is
> currently stuck with a few privately defined phy
> implementations. It has no support for PHYs which
> are supported by the generic phylib.
> 
> Acked-by: Rob Herring 
> Signed-off-by: Christian Lamparter 
> ---
> Resent - no changes.
> ---
>  .../devicetree/bindings/powerpc/4xx/emac.txt   | 64 
> +-
>  1 file changed, 62 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt 
> b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> index 712baf6c3e24..1893b4c4d93b 100644
> --- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> +++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> @@ -71,6 +71,9 @@
> For Axon it can be absent, though my current driver
> doesn't handle phy-address yet so for now, keep
> 0x00ff in it.
> +- phy-handle : Used to describe configurations where a external PHY
> +   is used. Please refer to:
> +   Documentation/devicetree/bindings/net/ethernet.txt
>  - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
> operations (if absent the value is the same as
> rx-fifo-size).  For Axon, either absent or 2048.
> @@ -81,8 +84,22 @@
> offload, phandle of the TAH device node.
>  - tah-channel   : 1 cell, optional. If appropriate, channel used on 
> the
> TAH engine.
> +- fixed-link : Fixed-link subnode describing a link to a non-MDIO
> +   managed entity. See
> +   Documentation/devicetree/bindings/net/fixed-link.txt
> +   for details.
> +- mdio subnode   : When the EMAC has a phy connected to its local
> +   mdio, which us supported by the kernel's network
> +   PHY library in drivers/net/phy, there must be device
> +   tree subnode with the following required properties:
> + - #address-cells: Must be <1>.
> + - #size-cells: Must be <0>.
>  
> -Example:
> +   For PHY definitions: Please refer to
> +   Documentation/devicetree/bindings/net/phy.txt and
> +   Documentation/devicetree/bindings/net/ethernet.txt
> +
> +Examples:
>  
>   EMAC0: ethernet@4800 {
>   device_type = "network";
> @@ -104,6 +121,50 @@
>   zmii-channel = <0>;
>   };
>  
> + EMAC1: ethernet@ef600c00 {
> + device_type = "network";
> + compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
> + interrupt-parent = <>;
> + interrupts = <0 1>;
> + #interrupt-cells = <1>;
> + #address-cells = <0>;
> + #size-cells = <0>;
> + interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
> +  1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
> + reg = <0xef600c00 0x00c4>;
> + local-mac-address = []; /* Filled in by U-Boot */
> + mal-device = <>;
> + mal-tx-channel = <0>;
> + mal-rx-channel = <0>;
> + cell-index = <0>;
> + max-frame-size = <9000>;
> + rx-fifo-size = <16384>;
> + tx-fifo-size = <2048>;
> + fifo-entry-size = <10>;
> + phy-mode = "rgmii";
> + phy-handle = <&phy0>;
> + phy-map = <0x>;
> + rgmii-device = <>;
> + rgmii-channel = <0>;
> + tah-device = <>;
> + tah-channel = <0>;
> + has-inverted-stacr-oc;
> + has-new-stacr-staopc;
> +
> + mdio {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + phy0: ethernet-phy@0 {
> + device_type = "ethernet-phy";


Christian, sorry for noticing this so late, but this does not quite
conform to the phy.txt binding document here.

Please put something standard, e.g.:

phy0: ethernet-phy@0 {
compatible = "ethernet-phy-ieee802.3-c22";
reg = <0>;
};

examples tend to be copy/pasted more often than they should...
-- 
Florian


Re: [Patch net v3] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
On Mon, Feb 27, 2017 at 4:14 PM, David Ahern  wrote:
> On 2/27/17 4:07 PM, Cong Wang wrote:
>> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
>> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
>> because ip6_null_entry is returned in this path since ip6_null_entry
>> is kinda default for an ipv6 route table root node. Quote from
>
>
> Missed this earlier. The issue here is an attempt to delete the NULL
> route, not that the null_entry is being returned as happens during a
> route lookup. This will also hit the bug:
> ip -6 ro del ::/0

By "returned" I mean it is in the fn->leaf list.


Re: Extending socket timestamping API for NTP

2017-02-27 Thread Willem de Bruijn
On Mon, Feb 27, 2017 at 10:23 AM, Miroslav Lichvar  wrote:
> On Tue, Feb 07, 2017 at 02:32:04PM -0800, Willem de Bruijn wrote:
>> >> 4) allow sockets to use both SW and HW TX timestamping at the same time
>> >>
>> >>When using a socket which is not bound to a specific interface, it
>> >>would be nice to get transmit SW timestamps when HW timestamps are
>> >>missing. I suspect it's difficult to predict if a HW timestamp will
>> >>be available. Maybe it would be acceptable to get from the error
>> >>queue two messages per transmission if the interface supports both
>> >>SW and HW timestamping?
>> >
>> >
>> > This seems useful,
>>
>> Agreed, as long as it is optional so that it does not change the
>> behavior for existing applications.
>
> Do you think it is safe to assume that no application enabled both SW
> and HW TX timestamping?

We cannot rule out that a process set both flags.

> Do we need a new option for this?

Similar to OPT_TSONLY or OPT_ID, but to signal the intent of
receiving both timestamps. Yes, agreed.

>> > but not sure how best to implement it.
>>
>> It might be sufficient to just remove the second line in sw_tx_timestamp
>>
>> static inline void sw_tx_timestamp(struct sk_buff *skb)
>> {
>> if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP &&
>> !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
>> skb_tstamp_tx(skb, NULL);
>> }
>
> With this change I'm getting two error messages per transmission, but
> it looks like it may need some additional changes.
>
> If the first error message is received after the HW timestamp was
> captured,

When does this happen? The first timestamp is generated from
skb_tx_timestamp in the device driver's ndo_start_xmit before
passing the packet to the NIC, the second when the device
driver cleans the tx descriptor on completion.

Is this for drivers that do not have skb_tx_timestamp, as you
mention below? Then the solution is to add that call.

> it contains both timestamps as the HW timestamp is in the
> shared info of the skb. Is it possible it could contain a partially
> updated HW timestamp? I'm not sure how locking works here. Is
> scm_timestamping actually allowed to contain more than one timestamp?
> The timestamping.txt document says "Only one field is non-zero at any
> time.", but that wasn't true even before if both SW and HW RX
> timestamping was enabled.
>
> If SO_TIMESTAMP{,NS} is enabled, ts[0] in the second error message
> will contain a bogus SW timestamp added by __sock_recv_timestamp() for
> a "Race occurred between timestamp enabling and packet receiving".  Is

Good point. That should not be set on transmit timestamps.

> there a guarantee applications will get a timestamp for all messages
> after enabling SO_TIMESTAMP? The original code is older than the git
> repo, so I'm not sure what was the reason for this. To me it would
> make more sense to not add any SCM_TIMESTAMP (and SW timestamp in
> SCM_TIMESTAMPING) when the timestamp is missing. If that's not
> always acceptable, maybe it could be restricted to sockets that have
> HW timestamping enabled?

I would limit scope to tx timestamping and leave rx semantics as is.

> Some drivers don't call skb_tx_timestamp() when HW timestamp was
> requested. From a cursory look it is e1000e, xgbe, sxgbe, and stmmac.
> This should hopefully be an easy fix.

Indeed. That should be added, then.
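
For readers following the sw_tx_timestamp() discussion above, here is a small userspace sketch (not kernel code) of what dropping the SKBTX_IN_PROGRESS test would change. The MODEL_SKBTX_* constants and both helper functions are hypothetical stand-ins introduced only for this illustration; they mirror the flag names quoted in the thread but are not taken from any header.

```c
#include <stdbool.h>

/* Local stand-ins for the tx_flags bits named in the thread (assumed values). */
enum {
	MODEL_SKBTX_HW_TSTAMP   = 1 << 0,
	MODEL_SKBTX_SW_TSTAMP   = 1 << 1,
	MODEL_SKBTX_IN_PROGRESS = 1 << 2,
};

/*
 * Behaviour of the quoted sw_tx_timestamp(): a software timestamp is
 * generated only if it was requested and the driver has not marked a
 * hardware timestamp as in progress.
 */
static bool model_sw_tstamp_current(unsigned int tx_flags)
{
	return (tx_flags & MODEL_SKBTX_SW_TSTAMP) &&
	       !(tx_flags & MODEL_SKBTX_IN_PROGRESS);
}

/*
 * Behaviour with the second condition removed, as suggested above:
 * the software timestamp is generated whenever it was requested.
 */
static bool model_sw_tstamp_proposed(unsigned int tx_flags)
{
	return (tx_flags & MODEL_SKBTX_SW_TSTAMP) != 0;
}
```

With both SW and HW flags set and a HW timestamp in progress, the first helper reports no software timestamp while the second reports one, which is the "two error messages per transmission" case described in the thread.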


[PATCH] wireless: ipw2200: remove redundant check of rc < 0

2017-02-27 Thread Colin King
From: Colin Ian King 

The check for rc < 0 is always false so the check is redundant
and can be removed.

Detected with CoverityScan, CID#101143 ("Logically dead code")

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/intel/ipw2x00/ipw2200.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/wireless/intel/ipw2x00/ipw2200.c b/drivers/net/wireless/intel/ipw2x00/ipw2200.c
index 5ef3c5c..bbc579b 100644
--- a/drivers/net/wireless/intel/ipw2x00/ipw2200.c
+++ b/drivers/net/wireless/intel/ipw2x00/ipw2200.c
@@ -3539,9 +3539,6 @@ static int ipw_load(struct ipw_priv *priv)
fw_img = &raw->data[le32_to_cpu(fw->boot_size) +
   le32_to_cpu(fw->ucode_size)];
 
-   if (rc < 0)
-   goto error;
-
if (!priv->rxq)
priv->rxq = ipw_rx_queue_alloc(priv);
else
-- 
2.10.2



[Patch net v3] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
-> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
because ip6_null_entry is returned in this path since ip6_null_entry
is kinda default for an ipv6 route table root node. Quote from
David Ahern:

 ip6_null_entry is the root of all ipv6 fib tables making it integrated
 into the table ...

We should ignore any attempt to delete it, like we do in
__ip6_del_rt() path and several others.

Reported-by: Andrey Konovalov 
Fixes: 0ae8133586ad ("net: ipv6: Allow shorthand delete of all nexthops in multipath route")
Cc: David Ahern 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/ipv6/route.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f54f426..77c7ce7 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
 static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
 {
struct nl_info *info = &cfg->fc_nlinfo;
+   struct net *net = info->nl_net;
struct sk_buff *skb = NULL;
struct fib6_table *table;
-   int err;
+   int err = -ENOENT;
 
+   if (rt == net->ipv6.ip6_null_entry)
+   goto out_put;
table = rt->rt6i_table;
write_lock_bh(&table->tb6_lock);
 
@@ -2184,7 +2187,7 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
if (skb) {
u32 seq = info->nlh ? info->nlh->nlmsg_seq : 0;
 
-   if (rt6_fill_node(info->nl_net, skb, rt,
+   if (rt6_fill_node(net, skb, rt,
  NULL, NULL, 0, RTM_DELROUTE,
  info->portid, seq, 0) < 0) {
kfree_skb(skb);
@@ -2198,17 +2201,18 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
 rt6i_siblings) {
err = fib6_del(sibling, info);
if (err)
-   goto out;
+   goto out_unlock;
}
}
 
err = fib6_del(rt, info);
-out:
+out_unlock:
write_unlock_bh(&table->tb6_lock);
+out_put:
ip6_rt_put(rt);
 
if (skb) {
-   rtnl_notify(skb, info->nl_net, info->portid, RTNLGRP_IPV6_ROUTE,
+   rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
info->nlh, gfp_any());
}
return err;
-- 
2.5.5



Re: [PATCH v3 net] net: solve a NAPI race

2017-02-27 Thread David Miller
From: Stephen Hemminger 
Date: Mon, 27 Feb 2017 14:44:55 -0800

> On Mon, 27 Feb 2017 14:35:17 -0800
> Eric Dumazet  wrote:
> 
>> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
>> 
>> > The original design (as Davem mentioned) was that IRQ's must be disabled
>> > during device polling. If that was true, then the race above
>> > would be impossible.  
>> 
>> I would love to see an alternative patch.
> 
> Turn off busy poll? 
> The poll stuff runs risk of breaking more things.

Eric is exactly trying to make busy poll even more prominent in
the stack, not less prominent.

It's an important component of some performance improvements he is
working on.


[PATCH v1.1] dt: emac: document device-tree based phy discovery and setup

2017-02-27 Thread Christian Lamparter
This patch adds documentation for a new "phy-handle" property,
"fixed-link" and "mdio" sub-node. These allow the enumeration
of PHYs which are supported by the phy library under drivers/net/phy.

The EMAC ethernet controller in IBM and AMCC 4xx chips is
currently stuck with a few privately defined phy
implementations. It has no support for PHYs which
are supported by the generic phylib.

Acked-by: Rob Herring 
Signed-off-by: Christian Lamparter 
---
Resent - no changes.
---
 .../devicetree/bindings/powerpc/4xx/emac.txt   | 64 +-
 1 file changed, 62 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
index 712baf6c3e24..1893b4c4d93b 100644
--- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
+++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
@@ -71,6 +71,9 @@
  For Axon it can be absent, though my current driver
  doesn't handle phy-address yet so for now, keep
  0x00ff in it.
+- phy-handle   : Used to describe configurations where an external PHY
+ is used. Please refer to:
+ Documentation/devicetree/bindings/net/ethernet.txt
 - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
  operations (if absent the value is the same as
  rx-fifo-size).  For Axon, either absent or 2048.
@@ -81,8 +84,22 @@
  offload, phandle of the TAH device node.
 - tah-channel   : 1 cell, optional. If appropriate, channel used on the
  TAH engine.
+- fixed-link   : Fixed-link subnode describing a link to a non-MDIO
+ managed entity. See
+ Documentation/devicetree/bindings/net/fixed-link.txt
+ for details.
+- mdio subnode : When the EMAC has a phy connected to its local
+ mdio, which is supported by the kernel's network
+ PHY library in drivers/net/phy, there must be a device
+ tree subnode with the following required properties:
+   - #address-cells: Must be <1>.
+   - #size-cells: Must be <0>.
 
-Example:
+ For PHY definitions: Please refer to
+ Documentation/devicetree/bindings/net/phy.txt and
+ Documentation/devicetree/bindings/net/ethernet.txt
+
+Examples:
 
EMAC0: ethernet@4800 {
device_type = "network";
@@ -104,6 +121,50 @@
zmii-channel = <0>;
};
 
+   EMAC1: ethernet@ef600c00 {
+   device_type = "network";
+   compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
+   interrupt-parent = <>;
+   interrupts = <0 1>;
+   #interrupt-cells = <1>;
+   #address-cells = <0>;
+   #size-cells = <0>;
+   interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
+1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
+   reg = <0xef600c00 0x00c4>;
+   local-mac-address = []; /* Filled in by U-Boot */
+   mal-device = <>;
+   mal-tx-channel = <0>;
+   mal-rx-channel = <0>;
+   cell-index = <0>;
+   max-frame-size = <9000>;
+   rx-fifo-size = <16384>;
+   tx-fifo-size = <2048>;
+   fifo-entry-size = <10>;
+   phy-mode = "rgmii";
+   phy-handle = <&phy0>;
+   phy-map = <0x>;
+   rgmii-device = <>;
+   rgmii-channel = <0>;
+   tah-device = <>;
+   tah-channel = <0>;
+   has-inverted-stacr-oc;
+   has-new-stacr-staopc;
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   phy0: ethernet-phy@0 {
+   device_type = "ethernet-phy";
+   reg = <0>;
+
+   qca,ar8327-initvals = <
+   0x0010 0x4000>;
+   };
+   };
+
+
   ii) McMAL node
 
 Required properties:
@@ -145,4 +206,3 @@
 - revision   : as provided by the RGMII new version register if
   available.
   For Axon: 0x012a
-
-- 
2.11.0



Re: [PATCH v3 net] net: solve a NAPI race

2017-02-27 Thread Stephen Hemminger
On Mon, 27 Feb 2017 14:35:17 -0800
Eric Dumazet  wrote:

> On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:
> 
> > The original design (as Davem mentioned) was that IRQ's must be disabled
> > during device polling. If that was true, then the race above
> > would be impossible.  
> 
> I would love to see an alternative patch.

Turn off busy poll? 
The poll stuff runs risk of breaking more things.


[PATCH] net: smsc: epic100: use new api ethtool_{get|set}_link_ksettings

2017-02-27 Thread Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/smsc/epic100.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/smsc/epic100.c b/drivers/net/ethernet/smsc/epic100.c
index 5f27371..db6dcb0 100644
--- a/drivers/net/ethernet/smsc/epic100.c
+++ b/drivers/net/ethernet/smsc/epic100.c
@@ -1387,25 +1387,27 @@ static void netdev_get_drvinfo (struct net_device *dev, struct ethtool_drvinfo *
strlcpy(info->bus_info, pci_name(np->pci_dev), sizeof(info->bus_info));
 }
 
-static int netdev_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct epic_private *np = netdev_priv(dev);
int rc;
 
spin_lock_irq(&np->lock);
-   rc = mii_ethtool_gset(&np->mii, cmd);
+   rc = mii_ethtool_get_link_ksettings(&np->mii, cmd);
spin_unlock_irq(&np->lock);
 
return rc;
 }
 
-static int netdev_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int netdev_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct epic_private *np = netdev_priv(dev);
int rc;
 
spin_lock_irq(&np->lock);
-   rc = mii_ethtool_sset(&np->mii, cmd);
+   rc = mii_ethtool_set_link_ksettings(&np->mii, cmd);
spin_unlock_irq(&np->lock);
 
return rc;
@@ -1460,14 +1462,14 @@ static void ethtool_complete(struct net_device *dev)
 
 static const struct ethtool_ops netdev_ethtool_ops = {
.get_drvinfo= netdev_get_drvinfo,
-   .get_settings   = netdev_get_settings,
-   .set_settings   = netdev_set_settings,
.nway_reset = netdev_nway_reset,
.get_link   = netdev_get_link,
.get_msglevel   = netdev_get_msglevel,
.set_msglevel   = netdev_set_msglevel,
.begin  = ethtool_begin,
-   .complete   = ethtool_complete
+   .complete   = ethtool_complete,
+   .get_link_ksettings = netdev_get_link_ksettings,
+   .set_link_ksettings = netdev_set_link_ksettings,
 };
 
 static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
-- 
1.7.4.4



Re: [PATCH v3 net] net: solve a NAPI race

2017-02-27 Thread Eric Dumazet
On Mon, 2017-02-27 at 14:14 -0800, Stephen Hemminger wrote:

> The original design (as Davem mentioned) was that IRQ's must be disabled
> during device polling. If that was true, then the race above
> would be impossible.

I would love to see an alternative patch.





[PATCH] net: sis: sis900: use new api ethtool_{get|set}_link_ksettings

2017-02-27 Thread Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/sis/sis900.c |   18 +-
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/sis/sis900.c b/drivers/net/ethernet/sis/sis900.c
index 19a4587..f317034 100644
--- a/drivers/net/ethernet/sis/sis900.c
+++ b/drivers/net/ethernet/sis/sis900.c
@@ -2035,23 +2035,23 @@ static u32 sis900_get_link(struct net_device *net_dev)
return mii_link_ok(&sis_priv->mii_info);
 }
 
-static int sis900_get_settings(struct net_device *net_dev,
-   struct ethtool_cmd *cmd)
+static int sis900_get_link_ksettings(struct net_device *net_dev,
+struct ethtool_link_ksettings *cmd)
 {
struct sis900_private *sis_priv = netdev_priv(net_dev);
spin_lock_irq(&sis_priv->lock);
-   mii_ethtool_gset(&sis_priv->mii_info, cmd);
+   mii_ethtool_get_link_ksettings(&sis_priv->mii_info, cmd);
spin_unlock_irq(&sis_priv->lock);
return 0;
 }
 
-static int sis900_set_settings(struct net_device *net_dev,
-   struct ethtool_cmd *cmd)
+static int sis900_set_link_ksettings(struct net_device *net_dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct sis900_private *sis_priv = netdev_priv(net_dev);
int rt;
spin_lock_irq(&sis_priv->lock);
-   rt = mii_ethtool_sset(&sis_priv->mii_info, cmd);
+   rt = mii_ethtool_set_link_ksettings(&sis_priv->mii_info, cmd);
spin_unlock_irq(&sis_priv->lock);
return rt;
 }
@@ -2129,11 +2129,11 @@ static void sis900_get_wol(struct net_device *net_dev, struct ethtool_wolinfo *w
.get_msglevel   = sis900_get_msglevel,
.set_msglevel   = sis900_set_msglevel,
.get_link   = sis900_get_link,
-   .get_settings   = sis900_get_settings,
-   .set_settings   = sis900_set_settings,
.nway_reset = sis900_nway_reset,
.get_wol= sis900_get_wol,
-   .set_wol= sis900_set_wol
+   .set_wol= sis900_set_wol,
+   .get_link_ksettings = sis900_get_link_ksettings,
+   .set_link_ksettings = sis900_set_link_ksettings,
 };
 
 /**
-- 
1.7.4.4



Re: net/ipv6: null-ptr-deref in ip6_route_del/lock_acquire

2017-02-27 Thread Andrey Konovalov
On Mon, Feb 27, 2017 at 9:34 PM, Cong Wang  wrote:
> On Mon, Feb 27, 2017 at 12:05 PM, Andrey Konovalov wrote:
>> On Mon, Feb 27, 2017 at 8:59 PM, David Ahern wrote:
>>> On 2/27/17 10:11 AM, Cong Wang wrote:
>>>> The attached patch fixes this crash, but I am not sure if it is the
>>>> best way to fix this bug yet...
>>>
>>> I'll take a look. I can not reproduce this using route or ip, so the
>>> fuzzer is doing something interesting.
>>
>> Hi David,
>>
>> I've attached a simple reproducer to the report, it doesn't work for you?
>
> It works for me and I have verified the formal patch I sent.

Hi Cong,

That's what I thought when I read your message, thanks!

I was just confused by David saying that the fuzzer is doing something
interesting, when the reproducer is just an ioctl call on a socket.



[PATCH] net: silan: sc92031: use new api ethtool_{get|set}_link_ksettings

2017-02-27 Thread Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/silan/sc92031.c |   83 +++---
 1 files changed, 47 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/silan/sc92031.c b/drivers/net/ethernet/silan/sc92031.c
index 6c2e2b3..751c818 100644
--- a/drivers/net/ethernet/silan/sc92031.c
+++ b/drivers/net/ethernet/silan/sc92031.c
@@ -1122,14 +1122,16 @@ static void sc92031_poll_controller(struct net_device *dev)
 }
 #endif
 
-static int sc92031_ethtool_get_settings(struct net_device *dev,
-   struct ethtool_cmd *cmd)
+static int
+sc92031_ethtool_get_link_ksettings(struct net_device *dev,
+  struct ethtool_link_ksettings *cmd)
 {
struct sc92031_priv *priv = netdev_priv(dev);
void __iomem *port_base = priv->port_base;
u8 phy_address;
u32 phy_ctrl;
u16 output_status;
+   u32 supported, advertising;
 
spin_lock_bh(&priv->lock);
 
@@ -1142,68 +1144,77 @@ static int sc92031_ethtool_get_settings(struct net_device *dev,
 
spin_unlock_bh(&priv->lock);
 
-   cmd->supported = SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full
+   supported = SUPPORTED_10baseT_Half | SUPPORTED_10baseT_Full
| SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full
| SUPPORTED_Autoneg | SUPPORTED_TP | SUPPORTED_MII;
 
-   cmd->advertising = ADVERTISED_TP | ADVERTISED_MII;
+   advertising = ADVERTISED_TP | ADVERTISED_MII;
 
if ((phy_ctrl & (PhyCtrlDux | PhyCtrlSpd100 | PhyCtrlSpd10))
== (PhyCtrlDux | PhyCtrlSpd100 | PhyCtrlSpd10))
-   cmd->advertising |= ADVERTISED_Autoneg;
+   advertising |= ADVERTISED_Autoneg;
 
if ((phy_ctrl & PhyCtrlSpd10) == PhyCtrlSpd10)
-   cmd->advertising |= ADVERTISED_10baseT_Half;
+   advertising |= ADVERTISED_10baseT_Half;
 
if ((phy_ctrl & (PhyCtrlSpd10 | PhyCtrlDux))
== (PhyCtrlSpd10 | PhyCtrlDux))
-   cmd->advertising |= ADVERTISED_10baseT_Full;
+   advertising |= ADVERTISED_10baseT_Full;
 
if ((phy_ctrl & PhyCtrlSpd100) == PhyCtrlSpd100)
-   cmd->advertising |= ADVERTISED_100baseT_Half;
+   advertising |= ADVERTISED_100baseT_Half;
 
if ((phy_ctrl & (PhyCtrlSpd100 | PhyCtrlDux))
== (PhyCtrlSpd100 | PhyCtrlDux))
-   cmd->advertising |= ADVERTISED_100baseT_Full;
+   advertising |= ADVERTISED_100baseT_Full;
 
if (phy_ctrl & PhyCtrlAne)
-   cmd->advertising |= ADVERTISED_Autoneg;
+   advertising |= ADVERTISED_Autoneg;
 
-   ethtool_cmd_speed_set(cmd,
- (output_status & 0x2) ? SPEED_100 : SPEED_10);
-   cmd->duplex = (output_status & 0x4) ? DUPLEX_FULL : DUPLEX_HALF;
-   cmd->port = PORT_MII;
-   cmd->phy_address = phy_address;
-   cmd->transceiver = XCVR_INTERNAL;
-   cmd->autoneg = (phy_ctrl & PhyCtrlAne) ? AUTONEG_ENABLE : AUTONEG_DISABLE;
+   cmd->base.speed = (output_status & 0x2) ? SPEED_100 : SPEED_10;
+   cmd->base.duplex = (output_status & 0x4) ? DUPLEX_FULL : DUPLEX_HALF;
+   cmd->base.port = PORT_MII;
+   cmd->base.phy_address = phy_address;
+   cmd->base.autoneg = (phy_ctrl & PhyCtrlAne) ?
+   AUTONEG_ENABLE : AUTONEG_DISABLE;
+
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.supported,
+   supported);
+   ethtool_convert_legacy_u32_to_link_mode(cmd->link_modes.advertising,
+   advertising);
 
return 0;
 }
 
-static int sc92031_ethtool_set_settings(struct net_device *dev,
-   struct ethtool_cmd *cmd)
+static int
+sc92031_ethtool_set_link_ksettings(struct net_device *dev,
+  const struct ethtool_link_ksettings *cmd)
 {
struct sc92031_priv *priv = netdev_priv(dev);
void __iomem *port_base = priv->port_base;
-   u32 speed = ethtool_cmd_speed(cmd);
+   u32 speed = cmd->base.speed;
u32 phy_ctrl;
u32 old_phy_ctrl;
+   u32 advertising;
+
+   ethtool_convert_link_mode_to_legacy_u32(&advertising,
+   cmd->link_modes.advertising);
 
if (!(speed == SPEED_10 || speed == SPEED_100))
return -EINVAL;
-   if (!(cmd->duplex == DUPLEX_HALF || cmd->duplex == DUPLEX_FULL))
-   return -EINVAL;
-   if (!(cmd->port == PORT_MII))
+   if (!(cmd->base.duplex == DUPLEX_HALF ||
+ cmd->base.duplex == DUPLEX_FULL))
return -EINVAL;
-   if (!(cmd->phy_address == 0x1f))
+   if 

Re: [PATCH v3 net] net: solve a NAPI race

2017-02-27 Thread Stephen Hemminger
On Mon, 27 Feb 2017 12:18:31 -0800
Eric Dumazet  wrote:

> thread 1                                thread 2 (could be on same cpu)
> 
> // busy polling or napi_watchdog()
> napi_schedule();
> ...
> napi->poll()
> 
> device polling:
> read 2 packets from ring buffer
>                                         Additional 3rd packet is available.
>                                         device hard irq
> 
>                                         // does nothing because
>                                         // NAPI_STATE_SCHED bit is owned by thread 1
>                                         napi_schedule();
> 
> napi_complete_done(napi, 2);
> rearm_irq();

The original design (as Davem mentioned) was that IRQ's must be disabled
during device polling. If that was true, then the race above
would be impossible. Also NAPI assumes interrupts are level triggered.
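
The interleaving quoted above can be replayed in a few lines of single-threaded userspace C. This is only a sketch under stated assumptions: struct napi_model and every model_* function are hypothetical names invented for the illustration, the "SCHED bit" is a plain bool rather than an atomic bit, and the hard irq is assumed not to fire again for a packet that is already in the ring (which is exactly the level-triggered assumption mentioned above being violated).

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct napi_model {
	bool sched;	/* stand-in for the NAPI_STATE_SCHED bit */
	int ring;	/* packets waiting in the RX ring */
};

/* Try to take ownership of polling; fails if someone already owns it. */
static bool model_napi_schedule(struct napi_model *n)
{
	if (n->sched)
		return false;
	n->sched = true;
	return true;
}

/* Drain up to budget packets from the ring, return how many were read. */
static int model_poll(struct napi_model *n, int budget)
{
	int done = 0;

	while (n->ring > 0 && done < budget) {
		n->ring--;
		done++;
	}
	return done;
}

/* Give up ownership; in the real flow the irq would be re-armed next. */
static void model_napi_complete(struct napi_model *n)
{
	n->sched = false;
}

/*
 * Replay the two-thread interleaving from the quoted diagram and return
 * the number of packets left in the ring with nobody scheduled to poll.
 */
static int model_replay_race(void)
{
	struct napi_model n = { .sched = false, .ring = 2 };
	bool irq_took_ownership;

	model_napi_schedule(&n);	/* thread 1: busy poll owns SCHED */
	model_poll(&n, 64);		/* thread 1: reads 2 packets */
	n.ring++;			/* 3rd packet arrives */
	irq_took_ownership = model_napi_schedule(&n); /* thread 2: no-op */
	model_napi_complete(&n);	/* thread 1: complete, re-arm irq */

	return irq_took_ownership ? 0 : n.ring;
}
```

In this model one packet stays in the ring after the poll completes, because the hard irq's schedule attempt was a no-op while thread 1 owned the bit. Whether a real device leaves that packet stranded depends on its interrupt semantics.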



[PATCH] net: sis: sis190: use new api ethtool_{get|set}_link_ksettings

2017-02-27 Thread Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/sis/sis190.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/sis/sis190.c b/drivers/net/ethernet/sis/sis190.c
index 210e35d..02da106 100644
--- a/drivers/net/ethernet/sis/sis190.c
+++ b/drivers/net/ethernet/sis/sis190.c
@@ -1734,18 +1734,20 @@ static void sis190_set_speed_auto(struct net_device *dev)
   BMCR_ANENABLE | BMCR_ANRESTART | BMCR_RESET);
 }
 
-static int sis190_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int sis190_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct sis190_private *tp = netdev_priv(dev);
 
-   return mii_ethtool_gset(&tp->mii_if, cmd);
+   return mii_ethtool_get_link_ksettings(&tp->mii_if, cmd);
 }
 
-static int sis190_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int sis190_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct sis190_private *tp = netdev_priv(dev);
 
-   return mii_ethtool_sset(&tp->mii_if, cmd);
+   return mii_ethtool_set_link_ksettings(&tp->mii_if, cmd);
 }
 
 static void sis190_get_drvinfo(struct net_device *dev,
@@ -1797,8 +1799,6 @@ static void sis190_set_msglevel(struct net_device *dev, u32 value)
 }
 
 static const struct ethtool_ops sis190_ethtool_ops = {
-   .get_settings   = sis190_get_settings,
-   .set_settings   = sis190_set_settings,
.get_drvinfo= sis190_get_drvinfo,
.get_regs_len   = sis190_get_regs_len,
.get_regs   = sis190_get_regs,
@@ -1806,6 +1806,8 @@ static void sis190_set_msglevel(struct net_device *dev, u32 value)
.get_msglevel   = sis190_get_msglevel,
.set_msglevel   = sis190_set_msglevel,
.nway_reset = sis190_nway_reset,
+   .get_link_ksettings = sis190_get_link_ksettings,
+   .set_link_ksettings = sis190_set_link_ksettings,
 };
 
 static int sis190_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
-- 
1.7.4.4



Re: [Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread David Ahern
On 2/27/17 1:04 PM, Cong Wang wrote:
>>
>> for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
>> +   /* do not allow deletion of the null route */
>> +   if (rt == net->ipv6.ip6_null_entry)
>> +   continue;
>>
>> Fixes: 0ae8133586ad net: ipv6: Allow shorthand delete of all nexthops in
>> multipath route
> 
> Note, I moved the check into __ip6_del_rt_siblings() because __ip6_del_rt()
> has a same check.
> 

that's b/c __ip6_del_rt has a second call path. __ip6_del_rt_siblings is
new and is not expecting to see the null entry. Catching it before the
dst_hold would be better.


Re: [Patch net v2] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Eric Dumazet
On Mon, 2017-02-27 at 13:34 -0800, Cong Wang wrote:
> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
> because ip6_null_entry is returned in this path since ip6_null_entry
> is kinda default for an ipv6 route table root node. Quote from
> David Ahern:
> 
>  ip6_null_entry is the root of all ipv6 fib tables making it integrated
>  into the table ...
> 
> We should ignore any attempt to delete it, like we do in
> __ip6_del_rt() path and several others.
> 
> Reported-by: Andrey Konovalov 
> Fixes: 0ae8133586ad ("net: ipv6: Allow shorthand delete of all nexthops in multipath route")
> Cc: David Ahern 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 
> ---
>  net/ipv6/route.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f54f426..78be2cb 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
>  static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
>  {
>   struct nl_info *info = &cfg->fc_nlinfo;
> + struct net *net = info->nl_net;
>   struct sk_buff *skb = NULL;
>   struct fib6_table *table;
>   int err;
>  
> + if (rt == net->ipv6.ip6_null_entry)
> + goto out_put;

err is not initialized at this point.


>   table = rt->rt6i_table;
>   write_lock_bh(&table->tb6_lock);
>  
> @@ -2184,7 +2187,7 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
>   if (skb) {
>   u32 seq = info->nlh ? info->nlh->nlmsg_seq : 0;
>  
> - if (rt6_fill_node(info->nl_net, skb, rt,
> + if (rt6_fill_node(net, skb, rt,
> NULL, NULL, 0, RTM_DELROUTE,
> info->portid, seq, 0) < 0) {
>   kfree_skb(skb);
> @@ -2205,10 +2208,11 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
>   err = fib6_del(rt, info);
>  out:
>   write_unlock_bh(&table->tb6_lock);
> +out_put:
>   ip6_rt_put(rt);
>  
>   if (skb) {
> - rtnl_notify(skb, info->nl_net, info->portid, RTNLGRP_IPV6_ROUTE,
> + rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
>   info->nlh, gfp_any());
>   }
>   return err;

This returns garbage here.
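
The two review comments above come down to one rule for goto-based cleanup: every path that reaches the shared exit label must have set the return value first. A minimal userspace sketch of that shape follows. struct route_model and model_del_route() are hypothetical names made up for this illustration, and the refcount decrement merely stands in for the put on the shared exit path; this is not the code from the patch.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted route entry. */
struct route_model {
	int refcnt;
	bool is_null_entry;
};

/*
 * Delete a route, always dropping the caller's reference.
 * err is initialised up front so that the early "goto out_put"
 * returns a defined error instead of an uninitialised value.
 */
static int model_del_route(struct route_model *rt)
{
	int err = -ENOENT;

	if (rt->is_null_entry)
		goto out_put;	/* refuse to delete the null entry */

	err = 0;		/* stand-in for the real deletion result */

out_put:
	rt->refcnt--;		/* stand-in for dropping the reference */
	return err;
}
```

Routing the early exit through the label that drops the reference also avoids the refcount leak that a bare early return would cause when the caller has already taken a hold.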





[Patch net v2] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
-> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
because ip6_null_entry is returned in this path since ip6_null_entry
is kinda default for an ipv6 route table root node. Quote from
David Ahern:

 ip6_null_entry is the root of all ipv6 fib tables making it integrated
 into the table ...

We should ignore any attempt to delete it, like we do in
__ip6_del_rt() path and several others.

Reported-by: Andrey Konovalov 
Fixes: 0ae8133586ad ("net: ipv6: Allow shorthand delete of all nexthops in multipath route")
Cc: David Ahern 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/ipv6/route.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f54f426..78be2cb 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
 static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
 {
struct nl_info *info = &cfg->fc_nlinfo;
+   struct net *net = info->nl_net;
struct sk_buff *skb = NULL;
struct fib6_table *table;
int err;
 
+   if (rt == net->ipv6.ip6_null_entry)
+   goto out_put;
table = rt->rt6i_table;
write_lock_bh(&table->tb6_lock);
 
@@ -2184,7 +2187,7 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
if (skb) {
u32 seq = info->nlh ? info->nlh->nlmsg_seq : 0;
 
-   if (rt6_fill_node(info->nl_net, skb, rt,
+   if (rt6_fill_node(net, skb, rt,
  NULL, NULL, 0, RTM_DELROUTE,
  info->portid, seq, 0) < 0) {
kfree_skb(skb);
@@ -2205,10 +2208,11 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
err = fib6_del(rt, info);
 out:
write_unlock_bh(&table->tb6_lock);
+out_put:
ip6_rt_put(rt);
 
if (skb) {
-   rtnl_notify(skb, info->nl_net, info->portid, RTNLGRP_IPV6_ROUTE,
+   rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
info->nlh, gfp_any());
}
return err;
-- 
2.5.5



Re: [Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
On Mon, Feb 27, 2017 at 1:00 PM, David Ahern  wrote:
> On 2/27/17 12:34 PM, Eric Dumazet wrote:
>> On Mon, 2017-02-27 at 11:07 -0800, Cong Wang wrote:
>>> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
>>> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
>>> because ip6_null_entry is returned in this path since ip6_null_entry
>>> is kinda default for a ipv6 route table root node. Quote from
>>> David Ahern:
>>>
>>>  ip6_null_entry is the root of all ipv6 fib tables making it integrated
>>>  into the table ...
>>>
>>> We should ignore any attempt of trying to delete it, like we do in
>>> __ip6_del_rt() path and several others.
>>>
>>> Reported-by: Andrey Konovalov 
>>> Signed-off-by: Cong Wang 
>>> ---
>>>  net/ipv6/route.c | 7 +--
>>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index f54f426..9da77e9 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
>>>  static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config 
>>> *cfg)
>>>  {
>>>  struct nl_info *info = &cfg->fc_nlinfo;
>>> +struct net *net = info->nl_net;
>>>  struct sk_buff *skb = NULL;
>>>  struct fib6_table *table;
>>>  int err;
>>>
>>> +if (rt == net->ipv6.ip6_null_entry)
>>> +return -ENOENT;
>>
>> It looks the caller did a dst_hold(&rt->dst);
>>
>> So this new error path would leave a refcount leak.
>>
>> Note that I was not able to trigger the crash on old kernels, so it
>> would be nice to get a precise idea of bug origin.
>>
>
> Cong: do you want to send a v2 catching the null entry in ip6_route_del
> before the refcnt?

Yeah, actually it is introduced by my patch because there is already
an ip6_rt_put() in __ip6_del_rt_siblings(). So v2 is coming...

>
> for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
> +   /* do not allow deletion of the null route */
> +   if (rt == net->ipv6.ip6_null_entry)
> +   continue;
>
> Fixes: 0ae8133586ad net: ipv6: Allow shorthand delete of all nexthops in
> multipath route

Note, I moved the check into __ip6_del_rt_siblings() because __ip6_del_rt()
has a same check.
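The refcount concern discussed above can be illustrated with a small user-space sketch. This is not kernel code: `struct route`, `route_hold()`, `route_put()`, `del_siblings_leaky()`, `del_siblings_fixed()` and `route_del()` are hypothetical stand-ins for rt6_info, dst_hold(), ip6_rt_put(), the two shapes of __ip6_del_rt_siblings() and its caller. The -ENOENT return in the fixed variant is an assumption made for the sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rt6_info: only the refcount matters. */
struct route {
	int refcnt;
};

/* Stand-in for net->ipv6.ip6_null_entry. */
struct route null_entry = { .refcnt = 1 };

void route_hold(struct route *rt) { rt->refcnt++; }
void route_put(struct route *rt)  { rt->refcnt--; }

/* First shape: the early return skips the put, so the caller's hold leaks. */
int del_siblings_leaky(struct route *rt)
{
	if (rt == &null_entry)
		return -ENOENT;
	/* ... the real function would delete the route here ... */
	route_put(rt);
	return 0;
}

/* Second shape: every exit goes through the label that drops the reference. */
int del_siblings_fixed(struct route *rt)
{
	int err = -ENOENT;	/* assumed error code for refusing the delete */

	if (rt == &null_entry)
		goto out_put;
	/* ... the real function would delete the route here ... */
	err = 0;
out_put:
	route_put(rt);
	return err;
}

/* Caller pattern: take a reference, then hand it over to the callee. */
int route_del(struct route *rt, int (*del)(struct route *))
{
	route_hold(rt);
	return del(rt);
}
```

With the leaky variant, each attempt to delete the null entry leaves its count one higher; with the goto variant the count returns to its previous value.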


Re: net/ipv6: null-ptr-deref in ip6_route_del/lock_acquire

2017-02-27 Thread David Ahern
On 2/27/17 12:37 PM, Andrey Konovalov wrote:
> That's what I thought when I read your message, thanks!
> 
> I was just confused by David saying that the fuzzer is doing something
> interesting, when the reproducer is just an ioctl call on a socket.

It means I have a cold, recently off a plane and not processing what I
was reading.

The interesting part was intent to delete the null route, but then Cong
mentioned that in his commit message.


[PATCH v3 net] net: solve a NAPI race

2017-02-27 Thread Eric Dumazet
From: Eric Dumazet 

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This would happen with a very low probability, but hurting RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
while IRQ are not disabled, we might have later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq

                                         // does nothing because
                                         // NAPI_STATE_SCHED bit is owned by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();


Note that rearm_irq() will not force the device to send an additional
IRQ for the packet it already signaled (3rd packet in my example)



This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.
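The idea described above can be modelled in user space with C11 atomics. This is only a sketch under assumptions, not the code from this patch: `struct napi_model`, `model_schedule_prep()`, `model_complete_done()` and the `STATE_*` constants are hypothetical names, and atomic_compare_exchange_weak() stands in for the kernel's cmpxchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel uses NAPIF_STATE_* masks. */
enum {
	STATE_SCHED   = 1 << 0,	/* poll is scheduled / owned */
	STATE_MISSED  = 1 << 1,	/* someone tried to schedule while owned */
	STATE_DISABLE = 1 << 2,	/* disable pending */
};

struct napi_model {
	_Atomic unsigned long state;
};

/*
 * Try to take ownership of the poll. If SCHED is already held by someone
 * else, record the attempt in MISSED instead of silently dropping it.
 * Returns true only when the caller acquired SCHED.
 */
bool model_schedule_prep(struct napi_model *n)
{
	unsigned long val = atomic_load(&n->state);
	unsigned long next;

	do {
		if (val & STATE_DISABLE)
			return false;
		next = val | STATE_SCHED;
		if (val & STATE_SCHED)
			next |= STATE_MISSED;
		/* on failure, val is reloaded with the current state */
	} while (!atomic_compare_exchange_weak(&n->state, &val, next));

	return !(val & STATE_SCHED);
}

/*
 * Finish a poll. MISSED is always cleared; SCHED is released unless MISSED
 * was set, in which case ownership is kept and the caller must poll again.
 * Returns true when a reschedule is needed.
 */
bool model_complete_done(struct napi_model *n)
{
	unsigned long val = atomic_load(&n->state);
	unsigned long next;

	do {
		next = val & ~(unsigned long)(STATE_MISSED | STATE_SCHED);
		if (val & STATE_MISSED)
			next |= STATE_SCHED;
	} while (!atomic_compare_exchange_weak(&n->state, &val, next));

	return (val & STATE_MISSED) != 0;
}
```

In the scenario from the diagram, thread 1's model_schedule_prep() succeeds, the hard irq's call fails but leaves MISSED set, and model_complete_done() then reports that another poll is needed, so the 3rd packet is not left waiting.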

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

In v3, I added more details in the changelog and cleared
NAPI_STATE_MISSED in busy_poll_stop()

Signed-off-by: Eric Dumazet 
---
 include/linux/netdevice.h |   29 +---
 net/core/dev.c|   63 +---
 2 files changed, 68 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
NAPI_STATE_SCHED,   /* Poll is scheduled */
+   NAPI_STATE_MISSED,  /* reschedule a napi */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC,   /* Netpoll - don't dequeue from poll_list */
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-   NAPIF_STATE_SCHED= (1UL << NAPI_STATE_SCHED),
-   NAPIF_STATE_DISABLE  = (1UL << NAPI_STATE_DISABLE),
-   NAPIF_STATE_NPSVC= (1UL << NAPI_STATE_NPSVC),
-   NAPIF_STATE_HASHED   = (1UL << NAPI_STATE_HASHED),
-   NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-   NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_SCHED= BIT(NAPI_STATE_SCHED),
+   NAPIF_STATE_MISSED   = BIT(NAPI_STATE_MISSED),
+   NAPIF_STATE_DISABLE  = BIT(NAPI_STATE_DISABLE),
+   NAPIF_STATE_NPSVC= BIT(NAPI_STATE_NPSVC),
+   NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
+   NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+   NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct 
*n)
return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- * napi_schedule_prep - check if NAPI can be scheduled
- * @n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-   return !napi_disable_pending(n) &&
-   !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  * napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..afcab3670aac18a9c193bf5a09e36e3dc9d0d63c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ * napi_schedule_prep - check if napi can be 

Re: [Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread David Ahern
On 2/27/17 12:34 PM, Eric Dumazet wrote:
> On Mon, 2017-02-27 at 11:07 -0800, Cong Wang wrote:
>> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
>> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
>> because ip6_null_entry is returned in this path since ip6_null_entry
>> is kinda default for a ipv6 route table root node. Quote from
>> David Ahern:
>>
>>  ip6_null_entry is the root of all ipv6 fib tables making it integrated
>>  into the table ...
>>
>> We should ignore any attempt of trying to delete it, like we do in
>> __ip6_del_rt() path and several others.
>>
>> Reported-by: Andrey Konovalov 
>> Signed-off-by: Cong Wang 
>> ---
>>  net/ipv6/route.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index f54f426..9da77e9 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
>>  static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config 
>> *cfg)
>>  {
>>  struct nl_info *info = &cfg->fc_nlinfo;
>> +struct net *net = info->nl_net;
>>  struct sk_buff *skb = NULL;
>>  struct fib6_table *table;
>>  int err;
>>  
>> +if (rt == net->ipv6.ip6_null_entry)
>> +return -ENOENT;
> 
> It looks the caller did a dst_hold(&rt->dst);
> 
> So this new error path would leave a refcount leak.
> 
> Note that I was not able to trigger the crash on old kernels, so it
> would be nice to get a precise idea of bug origin.
> 

Cong: do you want to send a v2 catching the null entry in ip6_route_del
before the refcnt?

for (rt = fn->leaf; rt; rt = rt->dst.rt6_next) {
+   /* do not allow deletion of the null route */
+   if (rt == net->ipv6.ip6_null_entry)
+   continue;

Fixes: 0ae8133586ad net: ipv6: Allow shorthand delete of all nexthops in
multipath route


Re: [PATCH v2 net] net: solve a NAPI race

2017-02-27 Thread Alexander Duyck
On Mon, Feb 27, 2017 at 8:19 AM, David Miller  wrote:
> From: Eric Dumazet 
> Date: Mon, 27 Feb 2017 06:21:38 -0800
>
>> A NAPI driver normally arms the IRQ after the napi_complete_done(),
>> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
>> it.
>>
>> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
>> while IRQ are not disabled, we might have later an IRQ firing and
>> finding this bit set, right before napi_complete_done() clears it.
>>
>> This can happen with busy polling users, or if gro_flush_timeout is
>> used. But some other uses of napi_schedule() in drivers can cause this
>> as well.
>>
>> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
>> can set if it could not grab NAPI_STATE_SCHED
>
> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
>
> Can you show the specific sequence that fails?
>
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
>
> napi_complete_done();
> - sets NAPI_STATE_SCHED
> enable device IRQ
>
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
>
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.

So there are some drivers that will need to have the interrupts
enabled when busy polling and I assume that can cause this kind of
issue. Specifically in the case of i40e the part will not flush
completed descriptors until either 4 completed descriptors are ready
to be written back, or an interrupt fires.

Our other drivers have code in them that will force the interrupt to
unmask and fire once every 2 seconds in the unlikely event that an
interrupt was lost which can occur on some platforms.

- Alex


Re: [Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
On Mon, Feb 27, 2017 at 12:34 PM, Eric Dumazet  wrote:
> On Mon, 2017-02-27 at 11:07 -0800, Cong Wang wrote:
>> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
>> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
>> because ip6_null_entry is returned in this path since ip6_null_entry
>> is kinda default for a ipv6 route table root node. Quote from
>> David Ahern:
>>
>>  ip6_null_entry is the root of all ipv6 fib tables making it integrated
>>  into the table ...
>>
>> We should ignore any attempt of trying to delete it, like we do in
>> __ip6_del_rt() path and several others.
>>
>> Reported-by: Andrey Konovalov 
>> Signed-off-by: Cong Wang 
>> ---
>>  net/ipv6/route.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index f54f426..9da77e9 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
>>  static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config 
>> *cfg)
>>  {
>>   struct nl_info *info = &cfg->fc_nlinfo;
>> + struct net *net = info->nl_net;
>>   struct sk_buff *skb = NULL;
>>   struct fib6_table *table;
>>   int err;
>>
>> + if (rt == net->ipv6.ip6_null_entry)
>> + return -ENOENT;
>
> It looks the caller did a dst_hold(&rt->dst);
>
> So this new error path would leave a refcount leak.

Interesting, this error path is not new for __ip6_del_rt_siblings()
so the leak was already there before mine, but you are probably
right we have a leak here.

I will send a separate patch to address this leak.

>
> Note that I was not able to trigger the crash on old kernels, so it
> would be nice to get a precise idea of bug origin.

Right, I miss:
Fixes: 0ae8133586ad ("net: ipv6: Allow shorthand delete of all
nexthops in multipath route")

Thanks!


[PATCH v2] dt: emac: document device-tree based phy discovery and setup

2017-02-27 Thread Christian Lamparter
This patch adds documentation for a new "phy-handle" property,
"fixed-link" and "mdio" sub-node. These allow the enumeration
of PHYs which are supported by the phy library under drivers/net/phy.

The EMAC ethernet controller in IBM and AMCC 4xx chips is
currently stuck with a few privately defined phy
implementations. It has no support for PHYs which
are supported by the generic phylib.

Acked-by: Rob Herring 
Reviewed-by: Florian Fainelli 
Signed-off-by: Christian Lamparter 
---
---
 .../devicetree/bindings/powerpc/4xx/emac.txt   | 62 +-
 1 file changed, 60 insertions(+), 2 deletions(-)

diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt 
b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
index 712baf6c3e24..44b842b6ca15 100644
--- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
+++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
@@ -71,6 +71,9 @@
  For Axon it can be absent, though my current driver
  doesn't handle phy-address yet so for now, keep
  0x00ff in it.
+- phy-handle   : Used to describe configurations where a external PHY
+ is used. Please refer to:
+ Documentation/devicetree/bindings/net/ethernet.txt
 - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
  operations (if absent the value is the same as
  rx-fifo-size).  For Axon, either absent or 2048.
@@ -81,8 +84,22 @@
  offload, phandle of the TAH device node.
 - tah-channel   : 1 cell, optional. If appropriate, channel used on the
  TAH engine.
+- fixed-link   : Fixed-link subnode describing a link to a non-MDIO
+ managed entity. See
+ Documentation/devicetree/bindings/net/fixed-link.txt
+ for details.
+- mdio subnode : When the EMAC has a phy connected to its local
+ mdio, which is supported by the kernel's network
+ PHY library in drivers/net/phy, there must be device
+ tree subnode with the following required properties:
+   - #address-cells: Must be <1>.
+   - #size-cells: Must be <0>.
 
-Example:
+ For PHY definitions: Please refer to
+ Documentation/devicetree/bindings/net/phy.txt and
+ Documentation/devicetree/bindings/net/ethernet.txt
+
+Examples:
 
EMAC0: ethernet@4800 {
device_type = "network";
@@ -104,6 +121,48 @@
zmii-channel = <0>;
};
 
+   EMAC1: ethernet@ef600c00 {
+   device_type = "network";
+   compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
+   interrupt-parent = <>;
+   interrupts = <0 1>;
+   #interrupt-cells = <1>;
+   #address-cells = <0>;
+   #size-cells = <0>;
+   interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
+1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
+   reg = <0xef600c00 0x00c4>;
+   local-mac-address = []; /* Filled in by U-Boot */
+   mal-device = <>;
+   mal-tx-channel = <0>;
+   mal-rx-channel = <0>;
+   cell-index = <0>;
+   max-frame-size = <9000>;
+   rx-fifo-size = <16384>;
+   tx-fifo-size = <2048>;
+   fifo-entry-size = <10>;
+   phy-mode = "rgmii";
+   phy-handle = <&phy0>;
+   phy-map = <0x>;
+   rgmii-device = <>;
+   rgmii-channel = <0>;
+   tah-device = <>;
+   tah-channel = <0>;
+   has-inverted-stacr-oc;
+   has-new-stacr-staopc;
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   phy0: ethernet-phy@0 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <0>;
+   };
+   };
+   };
+
+
   ii) McMAL node
 
 Required properties:
@@ -145,4 +204,3 @@
 - revision   : as provided by the RGMII new version register if
   available.
   For Axon: 0x012a
-
-- 
2.11.0



Re: net/ipv6: null-ptr-deref in ip6_route_del/lock_acquire

2017-02-27 Thread Cong Wang
On Mon, Feb 27, 2017 at 12:05 PM, Andrey Konovalov
 wrote:
> On Mon, Feb 27, 2017 at 8:59 PM, David Ahern  wrote:
>> On 2/27/17 10:11 AM, Cong Wang wrote:
>>> The attached patch fixes this crash, but I am not sure if it is the
>>> best way to fix this bug yet...
>>
>> I'll take a look. I can not reproduce this using route or ip, so the
>> fuzzer is doing something interesting.
>
> Hi David,
>
> I've attached a simple reproducer to the report, it doesn't work for you?

It works for me and I have verified the formal patch I sent.


Re: [PATCH v1.2] dt: emac: document device-tree based phy discovery and setup

2017-02-27 Thread Florian Fainelli
On 02/27/2017 12:41 PM, Christian Lamparter wrote:
> This patch adds documentation for a new "phy-handle" property,
> "fixed-link" and "mdio" sub-node. These allows the enumeration
> of PHYs which are supported by the phy library under drivers/net/phy.
> 
> The EMAC ethernet controller in IBM and AMCC 4xx chips is
> currently stuck with a few privately defined phy
> implementations. It has no support for PHYs which
> are supported by the generic phylib.
> 
> Acked-by: Rob Herring 
> Signed-off-by: Christian Lamparter 
> ---
> Fixed: phy node - so it conforms to phy.txt.
> ---
> ---
>  .../devicetree/bindings/powerpc/4xx/emac.txt   | 61 
> +-
>  1 file changed, 59 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt 
> b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> index 712baf6c3e24..2fa861378294 100644
> --- a/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> +++ b/Documentation/devicetree/bindings/powerpc/4xx/emac.txt
> @@ -71,6 +71,9 @@
> For Axon it can be absent, though my current driver
> doesn't handle phy-address yet so for now, keep
> 0x00ff in it.
> +- phy-handle : Used to describe configurations where a external PHY
> +   is used. Please refer to:
> +   Documentation/devicetree/bindings/net/ethernet.txt
>  - rx-fifo-size-gige : 1 cell, Rx fifo size in bytes for 1000 Mb/sec
> operations (if absent the value is the same as
> rx-fifo-size).  For Axon, either absent or 2048.
> @@ -81,8 +84,22 @@
> offload, phandle of the TAH device node.
>  - tah-channel   : 1 cell, optional. If appropriate, channel used on 
> the
> TAH engine.
> +- fixed-link : Fixed-link subnode describing a link to a non-MDIO
> +   managed entity. See
> +   Documentation/devicetree/bindings/net/fixed-link.txt
> +   for details.
> +- mdio subnode   : When the EMAC has a phy connected to its local
> +   mdio, which us supported by the kernel's network
> +   PHY library in drivers/net/phy, there must be device
> +   tree subnode with the following required properties:
> + - #address-cells: Must be <1>.
> + - #size-cells: Must be <0>.
>  
> -Example:
> +   For PHY definitions: Please refer to
> +   Documentation/devicetree/bindings/net/phy.txt and
> +   Documentation/devicetree/bindings/net/ethernet.txt
> +
> +Examples:
>  
>   EMAC0: ethernet@4800 {
>   device_type = "network";
> @@ -104,6 +121,47 @@
>   zmii-channel = <0>;
>   };
>  
> + EMAC1: ethernet@ef600c00 {
> + device_type = "network";
> + compatible = "ibm,emac-apm821xx", "ibm,emac4sync";
> + interrupt-parent = <>;
> + interrupts = <0 1>;
> + #interrupt-cells = <1>;
> + #address-cells = <0>;
> + #size-cells = <0>;
> + interrupt-map = <0  0x10 IRQ_TYPE_LEVEL_HIGH /* Status */
> +  1  0x14 IRQ_TYPE_LEVEL_HIGH /* Wake */>;
> + reg = <0xef600c00 0x00c4>;
> + local-mac-address = []; /* Filled in by U-Boot */
> + mal-device = <>;
> + mal-tx-channel = <0>;
> + mal-rx-channel = <0>;
> + cell-index = <0>;
> + max-frame-size = <9000>;
> + rx-fifo-size = <16384>;
> + tx-fifo-size = <2048>;
> + fifo-entry-size = <10>;
> + phy-mode = "rgmii";
> + phy-handle = <>;
> + phy-map = <0x>;
> + rgmii-device = <>;
> + rgmii-channel = <0>;
> + tah-device = <>;
> + tah-channel = <0>;
> + has-inverted-stacr-oc;
> + has-new-stacr-staopc;
> +
> + mdio {
> + #address-cells = <1>;
> + #size-cells = <0>;
> +
> + phy0: ethernet-phy@0 {
> + compatible = "ethernet-phy-ieee802.3-c22";
> + reg = <0>;

Missing closing curly brace here :) with that fixed:

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Eric Dumazet
On Mon, 2017-02-27 at 11:07 -0800, Cong Wang wrote:
> Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
> -> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
> because ip6_null_entry is returned in this path since ip6_null_entry
> is kinda default for a ipv6 route table root node. Quote from
> David Ahern:
> 
>  ip6_null_entry is the root of all ipv6 fib tables making it integrated
>  into the table ...
> 
> We should ignore any attempt of trying to delete it, like we do in
> __ip6_del_rt() path and several others.
> 
> Reported-by: Andrey Konovalov 
> Signed-off-by: Cong Wang 
> ---
>  net/ipv6/route.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f54f426..9da77e9 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
>  static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config 
> *cfg)
>  {
>   struct nl_info *info = &cfg->fc_nlinfo;
> + struct net *net = info->nl_net;
>   struct sk_buff *skb = NULL;
>   struct fib6_table *table;
>   int err;
>  
> + if (rt == net->ipv6.ip6_null_entry)
> + return -ENOENT;

It looks the caller did a dst_hold(&rt->dst);

So this new error path would leave a refcount leak.

Note that I was not able to trigger the crash on old kernels, so it
would be nice to get a precise idea of bug origin.

Thanks.




Re: [PATCH v1.1] net: emac: add support for device-tree based PHY discovery and setup

2017-02-27 Thread David Miller
From: Christian Lamparter 
Date: Mon, 27 Feb 2017 20:42:15 +0100

> On Wednesday, February 22, 2017 3:37:35 PM CET David Miller wrote:
>> From: Christian Lamparter 
>> Date: Mon, 20 Feb 2017 20:10:58 +0100
>> 
>> > This patch adds glue-code that allows the EMAC driver to interface
>> > with the existing dt-supported PHYs in drivers/net/phy.
>> > 
>> > Because currently, the emac driver maintains a small library of
>> > supported phys for in a private phy.c file located in the drivers
>> > directory.
>> > 
>> > The support is limited to mostly single ethernet transceiver like the:
>> > CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.
>> > 
>> > However, routers like the Netgear WNDR4700 and Cisco Meraki MX60(W)
>> > have a 5-port switch (AR8327N) attached to the EMAC. The switch chip
>> > is supported by the qca8k mdio driver, which uses the generic phy
>> > library. Another reason is that PHYLIB also supports the BCM54610,
>> > which was used for the Western Digital My Book Live.
>> > 
>> > This will now also make EMAC select PHYLIB.
>> > 
>> > Signed-off-by: Christian Lamparter 
>> 
>> Applied, thanks.
>> 
> Thanks David.
> 
> I noticed that the DT Documentation patch:
> "[v1,1/2] dt: emac: document device-tree based phy discovery and setup"
> is still pending with "Changes Requested":
> 
> 
> I think this is because of Florian's comment on patch:
> "[v1,2/2] net: emac: add support for device-tree based PHY discovery and 
> setup"
> If so, can you please queue this documentation update patch for -next?
> (I haven't received any comments or complaints. If necessary, I can also
> resend it.)

Please resend.


Re: net/ipv6: null-ptr-deref in ip6_route_del/lock_acquire

2017-02-27 Thread David Ahern
On 2/27/17 10:11 AM, Cong Wang wrote:
> The attached patch fixes this crash, but I am not sure if it is the
> best way to fix this bug yet...

I'll take a look. I can not reproduce this using route or ip, so the
fuzzer is doing something interesting.


Re: net/ipv6: null-ptr-deref in ip6_route_del/lock_acquire

2017-02-27 Thread Andrey Konovalov
On Mon, Feb 27, 2017 at 8:59 PM, David Ahern  wrote:
> On 2/27/17 10:11 AM, Cong Wang wrote:
>> The attached patch fixes this crash, but I am not sure if it is the
>> best way to fix this bug yet...
>
> I'll take a look. I can not reproduce this using route or ip, so the
> fuzzer is doing something interesting.

Hi David,

I've attached a simple reproducer to the report, it doesn't work for you?

Thanks!


Re: [PATCH v1.1] net: emac: add support for device-tree based PHY discovery and setup

2017-02-27 Thread Christian Lamparter
On Wednesday, February 22, 2017 3:37:35 PM CET David Miller wrote:
> From: Christian Lamparter 
> Date: Mon, 20 Feb 2017 20:10:58 +0100
> 
> > This patch adds glue-code that allows the EMAC driver to interface
> > with the existing dt-supported PHYs in drivers/net/phy.
> > 
> > Because currently, the emac driver maintains a small library of
> > supported phys for in a private phy.c file located in the drivers
> > directory.
> > 
> > The support is limited to mostly single ethernet transceiver like the:
> > CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.
> > 
> > However, routers like the Netgear WNDR4700 and Cisco Meraki MX60(W)
> > have a 5-port switch (AR8327N) attached to the EMAC. The switch chip
> > is supported by the qca8k mdio driver, which uses the generic phy
> > library. Another reason is that PHYLIB also supports the BCM54610,
> > which was used for the Western Digital My Book Live.
> > 
> > This will now also make EMAC select PHYLIB.
> > 
> > Signed-off-by: Christian Lamparter 
> 
> Applied, thanks.
> 
Thanks David.

I noticed that the DT Documentation patch:
"[v1,1/2] dt: emac: document device-tree based phy discovery and setup"
is still pending with "Changes Requested":


I think this is because of Florian's comment on patch:
"[v1,2/2] net: emac: add support for device-tree based PHY discovery and setup"
If so, can you please queue this documentation update patch for -next?
(I haven't received any comments or complaints. If necessary, I can also
resend it.)

Thanks,
Christian


Re: [PATCH V2] vhost: introduce O(1) vq metadata cache

2017-02-27 Thread Michael S. Tsirkin
On Wed, Feb 15, 2017 at 01:37:17PM +0800, Jason Wang wrote:
> 
> 
> On 2016-12-14 17:53, Jason Wang wrote:
> > When device IOTLB is enabled, all address translations are stored in
> > an interval tree. The O(lgN) search time can be slow for virtqueue
> > metadata (avail, used and descriptors), since they are accessed much
> > more often than other addresses. So this patch introduces an O(1) array
> > which points to the interval tree nodes that store the translations of
> > vq metadata. The array is updated during vq IOTLB prefetching and
> > is reset during each invalidation and TLB update. Each time we want
> > to access vq metadata, this small array is queried before the interval
> > tree. This would be sufficient for static mappings but not dynamic
> > mappings; we could do optimizations on top.
> > 
> > Tests were done with l2fwd in guest (2M hugepage):
> > 
> >    | noiommu  | before        | after
> > tx | 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
> > rx | 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)
> > 
> > We can almost reach the same performance as noiommu mode.
> > 
> > Signed-off-by: Jason Wang
> > ---
> > Changes from V1:
> > - silent 32bit build warning
> 
> ping

Could you rebase pls?
I pushed my tree into linux next.

-- 
MST
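The metadata cache described in the quoted changelog can be sketched in user space as follows. This is a rough model under assumptions, not the vhost code: `struct translation`, `struct vq_model`, `slow_lookup()`, `meta_lookup()` and `meta_cache_reset()` are hypothetical names, a linear scan stands in for the interval tree search, and the cache is filled on a miss rather than during IOTLB prefetching.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an interval tree node holding one translation. */
struct translation {
	uint64_t start;		/* first guest address covered */
	uint64_t last;		/* last guest address covered */
	uint64_t host_addr;	/* where 'start' maps to */
};

/* One cache slot per kind of virtqueue metadata. */
enum meta_type { META_DESC, META_AVAIL, META_USED, META_NUM };

struct vq_model {
	const struct translation *meta_cache[META_NUM];	/* O(1) slots */
	const struct translation *table;	/* stand-in for the tree */
	size_t table_len;
	unsigned int slow_lookups;		/* counts slow-path searches */
};

/* Slow path: a linear scan standing in for the O(lgN) tree search. */
const struct translation *slow_lookup(struct vq_model *vq, uint64_t addr)
{
	vq->slow_lookups++;
	for (size_t i = 0; i < vq->table_len; i++)
		if (addr >= vq->table[i].start && addr <= vq->table[i].last)
			return &vq->table[i];
	return NULL;
}

/* Query the small array first; fall back to the slow path on a miss. */
const struct translation *meta_lookup(struct vq_model *vq,
				      enum meta_type type, uint64_t addr)
{
	const struct translation *node = vq->meta_cache[type];

	if (node && addr >= node->start && addr <= node->last)
		return node;			/* O(1) hit */

	node = slow_lookup(vq, addr);
	vq->meta_cache[type] = node;		/* remember for next time */
	return node;
}

/* Called on invalidation or TLB update: cached pointers may be stale. */
void meta_cache_reset(struct vq_model *vq)
{
	for (int i = 0; i < META_NUM; i++)
		vq->meta_cache[i] = NULL;
}
```

Repeated accesses to the same metadata region take the slow path once, until the next reset drops the cached pointer.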


Re: [PATCH] iproute2: show network device dependency tree

2017-02-27 Thread Stephen Hemminger
On Sat, 25 Feb 2017 16:59:00 +
Zaboj Campula  wrote:

> Add the argument '-tree' to ip-link to show the network device dependency tree.
> 
> Example:
> 
> $ ip -tree link
> eth0
> bond0
> eth1
> bond0
> eth2
> bond1
> eth3
> bond1

Maybe use format similar to other utilities (lspci, lsusb, etc)?


[Patch net] ipv6: check for ip6_null_entry in __ip6_del_rt_siblings()

2017-02-27 Thread Cong Wang
Andrey reported a NULL pointer deref bug in ipv6_route_ioctl()
-> ip6_route_del() -> __ip6_del_rt_siblings() code path. This is
because ip6_null_entry is returned in this path since ip6_null_entry
is kinda default for a ipv6 route table root node. Quote from
David Ahern:

 ip6_null_entry is the root of all ipv6 fib tables making it integrated
 into the table ...

We should ignore any attempt of trying to delete it, like we do in
__ip6_del_rt() path and several others.

Reported-by: Andrey Konovalov 
Signed-off-by: Cong Wang 
---
 net/ipv6/route.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f54f426..9da77e9 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2169,10 +2169,13 @@ int ip6_del_rt(struct rt6_info *rt)
 static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
 {
struct nl_info *info = &cfg->fc_nlinfo;
+   struct net *net = info->nl_net;
struct sk_buff *skb = NULL;
struct fib6_table *table;
int err;
 
+   if (rt == net->ipv6.ip6_null_entry)
+   return -ENOENT;
table = rt->rt6i_table;
write_lock_bh(&table->tb6_lock);
 
@@ -2184,7 +2187,7 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, 
struct fib6_config *cfg)
if (skb) {
u32 seq = info->nlh ? info->nlh->nlmsg_seq : 0;
 
-   if (rt6_fill_node(info->nl_net, skb, rt,
+   if (rt6_fill_node(net, skb, rt,
  NULL, NULL, 0, RTM_DELROUTE,
  info->portid, seq, 0) < 0) {
kfree_skb(skb);
@@ -2208,7 +2211,7 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, 
struct fib6_config *cfg)
ip6_rt_put(rt);
 
if (skb) {
-   rtnl_notify(skb, info->nl_net, info->portid, RTNLGRP_IPV6_ROUTE,
+   rtnl_notify(skb, net, info->portid, RTNLGRP_IPV6_ROUTE,
info->nlh, gfp_any());
}
return err;
-- 
2.5.5



Re: [PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-27 Thread Michael Kerrisk
[CC += linux-...@vger.kernel.org]

Hi Willem

This is a change to the kernel-user-space API. Please CC
linux-...@vger.kernel.org on any future iterations of this patch.

Thanks,

Michael



On Wed, Feb 22, 2017 at 5:38 PM, Willem de Bruijn
 wrote:
> From: Willem de Bruijn 
>
> RFCv2:
>
> I have received a few requests for status and rebased code of this
> feature. We have been running this code internally, discovering and
> fixing various bugs. With net-next closed, now seems like a good time
> to share an updated patchset with fixes. The rebase from RFCv1/v4.2
> was mostly straightforward: mainly iov_iter changes. Full changelog:
>
>   RFC -> RFCv2:
> - review comment: do not loop skb with zerocopy frags onto rx:
>   add skb_orphan_frags_rx to orphan even refcounted frags
>   call this in __netif_receive_skb_core, deliver_skb and tun:
>   the same as 1080e512d44d ("net: orphan frags on receive")
> - fix: hold an explicit sk reference on each notification skb.
>   previously relied on the reference (or wmem) held by the
>   data skb that would trigger notification, but this breaks
>   on skb_orphan.
> - fix: when aborting a send, do not inc the zerocopy counter
>   this caused gaps in the notification chain
> - fix: in packet with SOCK_DGRAM, pull ll headers before calling
>   zerocopy_sg_from_iter
> - fix: if sock_zerocopy_realloc does not allow coalescing,
>   do not fail, just allocate a new ubuf
> - fix: in tcp, check return value of second allocation attempt
> - chg: allocate notification skbs from optmem
>   to avoid affecting tcp write queue accounting (TSQ)
> - chg: limit #locked pages (ulimit) per user instead of per process
> - chg: grow notification ids from 16 to 32 bit
>   - pass range [lo, hi] through 32 bit fields ee_info and ee_data
> - chg: rebased to davem-net-next on top of v4.10-rc7
> - add: limit notification coalescing
>   sharing ubufs limits overhead, but delays notification until
>   the last packet is released, possibly unbounded. Add a cap.
> - tests: add snd_zerocopy_lo pf_packet test
> - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)
>
> The change to allocate notification skbuffs from optmem requires
> ensuring that net.core.optmem_max is at least a few hundred KB. To
> experiment, run
>
>   sysctl -w net.core.optmem_max=1048576
>
> The snd_zerocopy_lo benchmarks reported in the individual patches were
> rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
> replaced with skb_orphan_frags to allow looping to local sockets. The
> netperf results below are also rerun with v2.
>
> In application load, copy avoidance shows a roughly 5% systemwide
> reduction in cycles when streaming large flows and a 4-8% reduction in
> wall clock time on early tensorflow test workloads.
>
>
> Overview (from original RFC):
>
> Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
> Implement the feature for TCP, UDP, RAW and packet sockets. This is
> a generalization of a previous packet socket RFC patch
>
>   http://patchwork.ozlabs.org/patch/413184/
>
> On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
> creates skbuff fragments directly from these pages. On tx completion,
> it notifies the socket owner that it is safe to modify memory by
> queuing a completion notification onto the socket error queue.
>
> The kernel already implements such copy avoidance with vmsplice plus
> splice and with ubuf_info for tun and virtio. Extend the second
> with features required by TCP and others: reference counting to
> support cloning (retransmit queue) and shared fragments (GSO) and
> notification coalescing to handle corking.
>
> Notifications are queued onto the socket error queue as a range
> [N, N+m], where N is a per-socket counter incremented on each
> successful zerocopy send call.
>
> * Performance
>
> The below table shows cycles reported by perf for a netperf process
> sending a single 10 Gbps TCP_STREAM. The first three columns show
> Mcycles spent in the netperf process context. The second three columns
> show time spent systemwide (-a -C A,B) on the two cpus that run the
> process and interrupt handler. Reported is the median of at least 3
> runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
> Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
> are disabled and the kernel is booted with idle=halt.
>
> NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size
>
> perf stat -e cycles $NETPERF
> perf stat -C 2,3 -a -e cycles $NETPERF
>
>        --process cycles--    ----cpu cycles----
>          std      zc    %       std      zc    %
> 4K    27,609  11,217   41    49,217  39,175   79
> 16K   21,370   3,823   18    43,540  29,213   67
> 64K   20,557   2,312   11    42,189  26,910   64
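
A note on the notification ranges described in the overview above: a
completion covers the closed range [lo, hi] of per-socket send counters,
and the RFCv2 changelog says the two ends travel in the 32 bit fields
ee_info and ee_data. The sketch below shows only the arithmetic a
receiver would need once it has those two values. The struct and
function names are hypothetical, and the mapping lo = ee_info,
hi = ee_data is an assumption read off the order in the changelog, not
something the quoted text spells out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for one completion notification. */
struct zc_range {
	uint32_t lo;	/* first completed send id (assumed: ee_info) */
	uint32_t hi;	/* last completed send id  (assumed: ee_data) */
};

/* Number of zerocopy sends covered by the closed range [lo, hi]. */
static uint64_t zc_range_count(struct zc_range r)
{
	/* unsigned 32 bit subtraction stays correct across counter wrap */
	return (uint64_t)(uint32_t)(r.hi - r.lo) + 1;
}

/* True if send id falls inside [lo, hi], also across counter wrap. */
static bool zc_range_contains(struct zc_range r, uint32_t id)
{
	return (uint32_t)(id - r.lo) <= (uint32_t)(r.hi - r.lo);
}
```

A caller would release every buffer whose send id satisfies
zc_range_contains() for the received range. Doing the comparison in
unsigned 32 bit arithmetic avoids special-casing the moment the
counter wraps.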

[PATCH 1/5] netvsc: don't overload variable in same function

2017-02-27 Thread Stephen Hemminger
There are two variables named packet in the same function. One is the
metadata descriptor from host (vmpacket_descriptor) and the other is
the control block in the skb used to hold metadata from send.
Change name to avoid possible confusion and bugs.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index d35ebd993b38..5dedbc36c326 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -600,9 +600,9 @@ static inline void netvsc_free_send_slot(struct netvsc_device *net_device,
 static void netvsc_send_tx_complete(struct netvsc_device *net_device,
struct vmbus_channel *incoming_channel,
struct hv_device *device,
-   struct vmpacket_descriptor *packet)
+   const struct vmpacket_descriptor *desc)
 {
-   struct sk_buff *skb = (struct sk_buff *)(unsigned long)packet->trans_id;
+   struct sk_buff *skb = (struct sk_buff *)(unsigned long)desc->trans_id;
struct net_device *ndev = hv_get_drvdata(device);
struct net_device_context *net_device_ctx = netdev_priv(ndev);
struct vmbus_channel *channel = device->channel;
-- 
2.11.0



[PATCH 4/5] netvsc: enable GRO

2017-02-27 Thread Stephen Hemminger
Use GRO when receiving packets.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 54b6cab5af23..6dcd1f08b834 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -642,11 +642,11 @@ int netvsc_recv_callback(struct net_device *net,
 {
struct net_device_context *net_device_ctx = netdev_priv(net);
struct netvsc_device *net_device = net_device_ctx->nvdev;
+   u16 q_idx = channel->offermsg.offer.sub_channel_index;
+   struct netvsc_channel *nvchan = &net_device->chan_table[q_idx];
struct net_device *vf_netdev;
struct sk_buff *skb;
struct netvsc_stats *rx_stats;
-   u16 q_idx = channel->offermsg.offer.sub_channel_index;
-
 
if (net->reg_state != NETREG_REGISTERED)
return NVSP_STAT_FAIL;
@@ -679,7 +679,7 @@ int netvsc_recv_callback(struct net_device *net,
 * on the synthetic device because modifying the VF device
 * statistics will not work correctly.
 */
-   rx_stats = &net_device->chan_table[q_idx].rx_stats;
+   rx_stats = &nvchan->rx_stats;
u64_stats_update_begin(&rx_stats->syncp);
rx_stats->packets++;
rx_stats->bytes += len;
@@ -690,7 +690,7 @@ int netvsc_recv_callback(struct net_device *net,
++rx_stats->multicast;
u64_stats_update_end(&rx_stats->syncp);
 
-   netif_receive_skb(skb);
+   napi_gro_receive(&nvchan->napi, skb);
rcu_read_unlock();
 
return 0;
-- 
2.11.0



[PATCH v1 4/4] iscsi-target: use generic inet_pton_with_scope

2017-02-27 Thread Sagi Grimberg
Acked-by: Nicholas Bellinger 
Signed-off-by: Sagi Grimberg 
---
 drivers/target/iscsi/iscsi_target_configfs.c | 46 
 1 file changed, 12 insertions(+), 34 deletions(-)

diff --git a/drivers/target/iscsi/iscsi_target_configfs.c b/drivers/target/iscsi/iscsi_target_configfs.c
index bf40f03755dd..f30c27b83c5e 100644
--- a/drivers/target/iscsi/iscsi_target_configfs.c
+++ b/drivers/target/iscsi/iscsi_target_configfs.c
@@ -167,10 +167,7 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
struct iscsi_portal_group *tpg;
struct iscsi_tpg_np *tpg_np;
char *str, *str2, *ip_str, *port_str;
-   struct sockaddr_storage sockaddr;
-   struct sockaddr_in *sock_in;
-   struct sockaddr_in6 *sock_in6;
-   unsigned long port;
+   struct sockaddr_storage sockaddr = { };
int ret;
char buf[MAX_PORTAL_LEN + 1];
 
@@ -182,21 +179,19 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
memset(buf, 0, MAX_PORTAL_LEN + 1);
snprintf(buf, MAX_PORTAL_LEN + 1, "%s", name);
 
-   memset(&sockaddr, 0, sizeof(struct sockaddr_storage));
-
str = strstr(buf, "[");
if (str) {
-   const char *end;
-
str2 = strstr(str, "]");
if (!str2) {
pr_err("Unable to locate trailing \"]\""
" in IPv6 iSCSI network portal address\n");
return ERR_PTR(-EINVAL);
}
-   str++; /* Skip over leading "[" */
+
+   ip_str = str + 1; /* Skip over leading "[" */
*str2 = '\0'; /* Terminate the unbracketed IPv6 address */
str2++; /* Skip over the \0 */
+
port_str = strstr(str2, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -205,23 +200,8 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
-
-   ret = kstrtoul(port_str, 0, &port);
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in6 = (struct sockaddr_in6 *)&sockaddr;
-   sock_in6->sin6_family = AF_INET6;
-   sock_in6->sin6_port = htons((unsigned short)port);
-   ret = in6_pton(str, -1,
-   (void *)&sock_in6->sin6_addr.in6_u, -1, &end);
-   if (ret <= 0) {
-   pr_err("in6_pton returned: %d\n", ret);
-   return ERR_PTR(-EINVAL);
-   }
} else {
-   str = ip_str = &buf[0];
+   ip_str = &buf[0];
port_str = strstr(ip_str, ":");
if (!port_str) {
pr_err("Unable to locate \":port\""
@@ -230,17 +210,15 @@ static struct se_tpg_np *lio_target_call_addnptotpg(
}
*port_str = '\0'; /* Terminate string for IP */
port_str++; /* Skip over ":" */
+   }
 
-   ret = kstrtoul(port_str, 0, &port);
-   if (ret < 0) {
-   pr_err("kstrtoul() failed for port_str: %d\n", ret);
-   return ERR_PTR(ret);
-   }
-   sock_in = (struct sockaddr_in *)&sockaddr;
-   sock_in->sin_family = AF_INET;
-   sock_in->sin_port = htons((unsigned short)port);
-   sock_in->sin_addr.s_addr = in_aton(ip_str);
+   ret = inet_pton_with_scope(&init_net, AF_UNSPEC, ip_str,
+   port_str, &sockaddr);
+   if (ret) {
+   pr_err("malformed ip/port passed: %s\n", name);
+   return ERR_PTR(ret);
}
+
tpg = container_of(se_tpg, struct iscsi_portal_group, tpg_se_tpg);
ret = iscsit_get_tpg(tpg);
if (ret < 0)
-- 
2.7.4
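
For readers following the string handling in the hunk above, here is a
minimal userspace sketch of the same in-place split of "addr:port" and
"[v6addr]:port" into two strings, which is what the patch hands to
inet_pton_with_scope() as ip_str and port_str. The function name
split_portal() is hypothetical, and it returns -1 where the kernel code
returns ERR_PTR(-EINVAL). It is an illustration of the parsing steps
only, not the driver code.

```c
#include <string.h>

/*
 * Split "addr:port" or "[v6addr]:port" in place.
 * Returns 0 and sets *ip and *port on success, -1 on malformed input.
 */
static int split_portal(char *buf, char **ip, char **port)
{
	char *p = strchr(buf, '[');

	if (p) {
		char *end = strchr(p, ']');

		if (!end)
			return -1;	/* no trailing "]" */
		*ip = p + 1;		/* skip over the leading "[" */
		*end = '\0';		/* terminate the unbracketed IPv6 address */
		p = end + 1;		/* continue after the terminator */
	} else {
		*ip = buf;
		p = buf;
	}

	p = strchr(p, ':');
	if (!p)
		return -1;		/* no ":port" */
	*p = '\0';			/* terminate the IP string (IPv4 case) */
	*port = p + 1;			/* skip over ":" */
	return 0;
}
```

Because the split writes terminators into the buffer, the caller has to
pass a writable copy, which is why the driver first copies name into
buf with snprintf().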



Re: [PATCH] iproute2: show network device dependency tree

2017-02-27 Thread Stephen Hemminger
On Sat, 25 Feb 2017 16:59:00 +
Zaboj Campula  wrote:

> Add the argument '-tree' to ip-link to show network devices dependency tree.
> 
> Example:
> 
> $ ip -tree link
> eth0
> bond0
> eth1
> bond0
> eth2
> bond1
> eth3
> bond1

Another alternative format would be to make -tree an output modifier and indent
(like ps tree options).

$ ip -t link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode 
DEFAULT group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
8: bond0  mtu 1500 qdisc pfifo_fast state DOWN mode 
DEFAULT group default qlen 1000
link/ether 52:54:00:66:24:cd brd ff:ff:ff:ff:ff:ff 
2: eth1:  mtu 1500 qdisc pfifo_fast master bond0 state 
DOWN mode DEFAULT group default qlen 1000
link/ether 52:54:00:66:24:cd brd ff:ff:ff:ff:ff:ff


[PATCH v2 net] net: solve a NAPI race

2017-02-27 Thread Eric Dumazet
From: Eric Dumazet 

While playing with mlx4 hardware timestamping of RX packets, I found
that some packets were received by TCP stack with a ~200 ms delay...

Since the timestamp was provided by the NIC, and my probe was added
in tcp_v4_rcv() while in BH handler, I was confident it was not
a sender issue, or a drop in the network.

This happens with very low probability, but it hurts RPC
workloads.

A NAPI driver normally arms the IRQ after the napi_complete_done(),
after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
it.

The problem is that if another point in the stack grabs the NAPI_STATE_SCHED
bit while IRQs are not disabled, we might later have an IRQ firing and
finding this bit set, right before napi_complete_done() clears it.

This can happen with busy polling users, or if gro_flush_timeout is
used. But some other uses of napi_schedule() in drivers can cause this
as well.

This patch adds a new NAPI_STATE_MISSED bit that napi_schedule_prep()
can set if it could not grab NAPI_STATE_SCHED.

Then napi_complete_done() properly reschedules the napi to make sure
we do not miss something.

Since we manipulate multiple bits at once, use cmpxchg() like in
sk_busy_loop() to provide proper transactions.

In v2, I changed napi_watchdog() to use a relaxed variant of
napi_schedule_prep() : No need to set NAPI_STATE_MISSED from this point.

Signed-off-by: Eric Dumazet 
---
 include/linux/netdevice.h |   29 ++-
 net/core/dev.c|   53 +---
 2 files changed, 58 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f40f0ab3847a8caaf46bd4d5f224c65014f501cc..97456b2539e46d6232dda804f6a434db6fd7134f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -330,6 +330,7 @@ struct napi_struct {
 
 enum {
NAPI_STATE_SCHED,   /* Poll is scheduled */
+   NAPI_STATE_MISSED,  /* reschedule a napi */
NAPI_STATE_DISABLE, /* Disable pending */
NAPI_STATE_NPSVC,   /* Netpoll - don't dequeue from poll_list */
NAPI_STATE_HASHED,  /* In NAPI hash (busy polling possible) */
@@ -338,12 +339,13 @@ enum {
 };
 
 enum {
-   NAPIF_STATE_SCHED= (1UL << NAPI_STATE_SCHED),
-   NAPIF_STATE_DISABLE  = (1UL << NAPI_STATE_DISABLE),
-   NAPIF_STATE_NPSVC= (1UL << NAPI_STATE_NPSVC),
-   NAPIF_STATE_HASHED   = (1UL << NAPI_STATE_HASHED),
-   NAPIF_STATE_NO_BUSY_POLL = (1UL << NAPI_STATE_NO_BUSY_POLL),
-   NAPIF_STATE_IN_BUSY_POLL = (1UL << NAPI_STATE_IN_BUSY_POLL),
+   NAPIF_STATE_SCHED= BIT(NAPI_STATE_SCHED),
+   NAPIF_STATE_MISSED   = BIT(NAPI_STATE_MISSED),
+   NAPIF_STATE_DISABLE  = BIT(NAPI_STATE_DISABLE),
+   NAPIF_STATE_NPSVC= BIT(NAPI_STATE_NPSVC),
+   NAPIF_STATE_HASHED   = BIT(NAPI_STATE_HASHED),
+   NAPIF_STATE_NO_BUSY_POLL = BIT(NAPI_STATE_NO_BUSY_POLL),
+   NAPIF_STATE_IN_BUSY_POLL = BIT(NAPI_STATE_IN_BUSY_POLL),
 };
 
 enum gro_result {
@@ -414,20 +416,7 @@ static inline bool napi_disable_pending(struct napi_struct *n)
return test_bit(NAPI_STATE_DISABLE, &n->state);
 }
 
-/**
- * napi_schedule_prep - check if NAPI can be scheduled
- * @n: NAPI context
- *
- * Test if NAPI routine is already running, and if not mark
- * it as running.  This is used as a condition variable to
- * insure only one NAPI poll instance runs.  We also make
- * sure there is no pending NAPI disable.
- */
-static inline bool napi_schedule_prep(struct napi_struct *n)
-{
-   return !napi_disable_pending(n) &&
-   !test_and_set_bit(NAPI_STATE_SCHED, &n->state);
-}
+bool napi_schedule_prep(struct napi_struct *n);
 
 /**
  * napi_schedule - schedule NAPI poll
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9897e60a79ed8b69d6ef208295ded..edeb916487015f279036ecf7ff5d9096dff365d3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4884,6 +4884,32 @@ void __napi_schedule(struct napi_struct *n)
 EXPORT_SYMBOL(__napi_schedule);
 
 /**
+ * napi_schedule_prep - check if napi can be scheduled
+ * @n: napi context
+ *
+ * Test if NAPI routine is already running, and if not mark
+ * it as running.  This is used as a condition variable to
+ * insure only one NAPI poll instance runs.  We also make
+ * sure there is no pending NAPI disable.
+ */
+bool napi_schedule_prep(struct napi_struct *n)
+{
+   unsigned long val, new;
+
+   do {
+   val = READ_ONCE(n->state);
+   if (unlikely(val & NAPIF_STATE_DISABLE))
+   return false;
+   new = val | NAPIF_STATE_SCHED;
+   if (unlikely(val & NAPIF_STATE_SCHED))
+   new |= NAPIF_STATE_MISSED;
+   } while (cmpxchg(&n->state, val, new) != val);
+
+   return !(val & NAPIF_STATE_SCHED);
+}
+EXPORT_SYMBOL(napi_schedule_prep);
+
+/**
  * 
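
To make the multi-bit update in the commit message concrete, here is a
userspace sketch of the same compare-and-swap pattern using C11 atomics
in place of the kernel's cmpxchg(). sched_prep() follows the
napi_schedule_prep() hunk shown above. complete_done() is only an
assumption about what "properly reschedules" implies, since the
net/core/dev.c hunk is cut off in this archive. The ST_* masks and both
function names are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-ins for the NAPIF_STATE_* masks; bit positions are arbitrary. */
#define ST_SCHED	(1UL << 0)
#define ST_MISSED	(1UL << 1)
#define ST_DISABLE	(1UL << 2)

/* Follows the napi_schedule_prep() hunk: true if the caller now owns SCHED. */
static bool sched_prep(atomic_ulong *state)
{
	unsigned long val = atomic_load(state);
	unsigned long next;

	do {
		if (val & ST_DISABLE)
			return false;
		next = val | ST_SCHED;
		if (val & ST_SCHED)
			next |= ST_MISSED;	/* a poll is in flight: leave a note */
		/* on failure, val is refreshed with the current state */
	} while (!atomic_compare_exchange_weak(state, &val, next));

	return !(val & ST_SCHED);
}

/* Assumed completion side: true if the poll has to run once more. */
static bool complete_done(atomic_ulong *state)
{
	unsigned long val = atomic_load(state);
	unsigned long next;

	do {
		assert(val & ST_SCHED);		/* only the owner may complete */
		next = val & ~(ST_SCHED | ST_MISSED);
		if (val & ST_MISSED)
			next |= ST_SCHED;	/* keep ownership for another poll */
	} while (!atomic_compare_exchange_weak(state, &val, next));

	return (val & ST_MISSED) != 0;
}
```

The point of the single compare-and-swap is that SCHED and MISSED
change together, so a schedule attempt that loses the race is recorded
rather than silently dropped.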

Fw: [Bug 194723] New: connect() to localhost stalls after 4.9 -> 4.10 upgrade

2017-02-27 Thread Stephen Hemminger


Begin forwarded message:

Date: Mon, 27 Feb 2017 11:28:51 +
From: bugzilla-dae...@bugzilla.kernel.org
To: step...@networkplumber.org
Subject: [Bug 194723] New: connect() to localhost stalls after 4.9 -> 4.10 
upgrade


https://bugzilla.kernel.org/show_bug.cgi?id=194723

Bug ID: 194723
   Summary: connect() to localhost stalls after 4.9 -> 4.10
upgrade
   Product: Networking
   Version: 2.5
Kernel Version: 4.10
  Hardware: All
OS: Linux
  Tree: Mainline
Status: NEW
  Severity: high
  Priority: P1
 Component: IPV4
  Assignee: step...@networkplumber.org
  Reporter: l...@5t9.de
Regression: No

After upgrading a machine running the latest CentOS from using mainline kernel
linux-4.9 to linux-4.10, attempts to connect() via IPv4 to localhost fail in
about half of the cases, leaving the process trying to connect() stalled.

Reproduction:

> ncat -k -l 1 &
> C=1 ; while true ; do echo -n "$C " ; echo ping | ncat localhost 1 ;
> C=`expr $C + 1` ; sleep 1 ; done  

Using linux-4.10, the output looks like this:

> 1 ping
> 2 Ncat: Connection timed out.
> 3 ping
> 4 Ncat: Connection timed out.
> 5 ping
> 6 ping
> 7 ping
> 8 Ncat: Connection timed out.
> 9 ping
> 10 ping
> 11 Ncat: Connection timed out.
> 12 ping
> 13 Ncat: Connection timed out.
> 14 ping
> 15 Ncat: Connection timed out.
> 16 ping
> 17 Ncat: Connection timed out.
> 18 ping
> 19 ping
> 20 Ncat: Connection timed out.
> 21 ping
> 22 ping
> 23 Ncat: Connection timed out.
> 24 ping
> 25 Ncat: Connection timed out.
> 26 Ncat: Connection timed out.
> 27 ping
> 28 Ncat: Connection timed out.
> 29 ping  


Using linux-4.9, the output looks like this:

> 1 ping
> 2 ping
> 3 ping
> 4 ping
> 5 ping
> 6 ping
> 7 ping
> 8 ping
> 9 ping
> 10 ping
> 11 ping
> 12 ping
> 13 ping
> 14 ping
> 15 ping
> 16 ping
> 17 ping
> 18 ping
> 19 ping  

(The same behaviour was later also confirmed by a colleague on an Ubuntu
running machine after upgrading to Ubuntu's 4.10 kernel.)

-- 
You are receiving this mail because:
You are the assignee for the bug.


[PATCH 3/5] netvsc: implement NAPI

2017-02-27 Thread Stephen Hemminger
Use NAPI (softirq) to handle received packets and send completions.
Previously this was handled by a tasklet.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/hyperv_net.h   |   2 +
 drivers/net/hyperv/netvsc.c   | 140 ++
 drivers/net/hyperv/netvsc_drv.c   |   5 --
 drivers/net/hyperv/rndis_filter.c |   2 +
 4 files changed, 102 insertions(+), 47 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index d3e73ac158ae..7433b164e513 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -196,6 +196,7 @@ int netvsc_recv_callback(struct net_device *net,
 const struct ndis_tcp_ip_checksum_info *csum_info,
 const struct ndis_pkt_8021q_info *vlan);
 void netvsc_channel_cb(void *context);
+int netvsc_poll(struct napi_struct *napi, int budget);
 int rndis_filter_open(struct netvsc_device *nvdev);
 int rndis_filter_close(struct netvsc_device *nvdev);
 int rndis_filter_device_add(struct hv_device *dev,
@@ -720,6 +721,7 @@ struct net_device_context {
 /* Per channel data */
 struct netvsc_channel {
struct vmbus_channel *channel;
+   struct napi_struct napi;
struct multi_send_data msd;
struct multi_recv_comp mrc;
atomic_t queue_sends;
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 3681fb59bdbe..b1328cef9d5a 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -556,6 +556,7 @@ void netvsc_device_remove(struct hv_device *device)
struct net_device *ndev = hv_get_drvdata(device);
struct net_device_context *net_device_ctx = netdev_priv(ndev);
struct netvsc_device *net_device = net_device_ctx->nvdev;
+   int i;
 
netvsc_disconnect_vsp(device);
 
@@ -570,6 +571,9 @@ void netvsc_device_remove(struct hv_device *device)
/* Now, we can close the channel safely */
vmbus_close(device->channel);
 
+   for (i = 0; i < VRSS_CHANNEL_MAX; i++)
+   napi_disable(&net_device->chan_table[0].napi);
+
/* Release all resources */
free_netvsc_device(net_device);
 }
@@ -1063,7 +1067,7 @@ static inline struct recv_comp_data *get_recv_comp_slot(
return rcd;
 }
 
-static void netvsc_receive(struct net_device *ndev,
+static int netvsc_receive(struct net_device *ndev,
   struct netvsc_device *net_device,
   struct net_device_context *net_device_ctx,
   struct hv_device *device,
@@ -1073,20 +1077,19 @@ static void netvsc_receive(struct net_device *ndev,
 {
const struct vmtransfer_page_packet_header *vmxferpage_packet
= container_of(desc, const struct vmtransfer_page_packet_header, d);
+   u16 q_idx = channel->offermsg.offer.sub_channel_index;
char *recv_buf = net_device->recv_buf;
u32 status = NVSP_STAT_SUCCESS;
int i;
int count = 0;
int ret;
-   struct recv_comp_data *rcd;
-   u16 q_idx = channel->offermsg.offer.sub_channel_index;
 
/* Make sure this is a valid nvsp packet */
if (unlikely(nvsp->hdr.msg_type != NVSP_MSG1_TYPE_SEND_RNDIS_PKT)) {
netif_err(net_device_ctx, rx_err, ndev,
  "Unknown nvsp packet type received %u\n",
  nvsp->hdr.msg_type);
-   return;
+   return 0;
}
 
if (unlikely(vmxferpage_packet->xfer_pageset_id != NETVSC_RECEIVE_BUFFER_ID)) {
@@ -1094,7 +1097,7 @@ static void netvsc_receive(struct net_device *ndev,
  "Invalid xfer page set id - expecting %x got %x\n",
  NETVSC_RECEIVE_BUFFER_ID,
  vmxferpage_packet->xfer_pageset_id);
-   return;
+   return 0;
}
 
count = vmxferpage_packet->range_cnt;
@@ -1110,26 +1113,26 @@ static void netvsc_receive(struct net_device *ndev,
  channel, data, buflen);
}
 
-   if (!net_device->chan_table[q_idx].mrc.buf) {
+   if (net_device->chan_table[q_idx].mrc.buf) {
+   struct recv_comp_data *rcd;
+
+   rcd = get_recv_comp_slot(net_device, channel, q_idx);
+   if (rcd) {
+   rcd->tid = vmxferpage_packet->d.trans_id;
+   rcd->status = status;
+   } else {
+   netdev_err(ndev, "Recv_comp full buf q:%hd, tid:%llx\n",
+  q_idx, vmxferpage_packet->d.trans_id);
+   }
+   } else {
ret = netvsc_send_recv_completion(channel,
  vmxferpage_packet->d.trans_id,
  status);
if (ret)
netdev_err(ndev, "Recv_comp q:%hd, tid:%llx, err:%d\n",
   

[PATCH 5/5] netvsc: replace netdev_alloc_skb_ip_align with napi_alloc_skb

2017-02-27 Thread Stephen Hemminger
Gives potential performance gain.

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 6dcd1f08b834..f95d686b7957 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -589,13 +589,14 @@ void netvsc_linkstatus_callback(struct hv_device *device_obj,
 }
 
 static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net,
+struct napi_struct *napi,
 const struct ndis_tcp_ip_checksum_info *csum_info,
 const struct ndis_pkt_8021q_info *vlan,
 void *data, u32 buflen)
 {
struct sk_buff *skb;
 
-   skb = netdev_alloc_skb_ip_align(net, buflen);
+   skb = napi_alloc_skb(napi, buflen);
if (!skb)
return skb;
 
@@ -664,7 +665,8 @@ int netvsc_recv_callback(struct net_device *net,
net = vf_netdev;
 
/* Allocate a skb - TODO direct I/O to pages? */
-   skb = netvsc_alloc_recv_skb(net, csum_info, vlan, data, len);
+   skb = netvsc_alloc_recv_skb(net, &nvchan->napi,
+   csum_info, vlan, data, len);
if (unlikely(!skb)) {
++net->stats.rx_dropped;
rcu_read_unlock();
-- 
2.11.0



[PATCH net 0/5] NAPI support for Hyper-V

2017-02-27 Thread Stephen Hemminger
These patches enable NAPI, GRO and napi_alloc_skb for Hyper-V netvsc
driver.

Stephen Hemminger (5):
  netvsc: don't overload variable in same function
  vmbus: introduce in-place packet iterator
  netvsc: implement NAPI
  netvsc: enable GRO
  netvsc: replace netdev_alloc_skb_ip_align with napi_alloc_skb

 drivers/hv/ring_buffer.c  |  94 -
 drivers/net/hyperv/hyperv_net.h   |   2 +
 drivers/net/hyperv/netvsc.c   | 172 --
 drivers/net/hyperv/netvsc_drv.c   |  19 ++---
 drivers/net/hyperv/rndis_filter.c |   2 +
 include/linux/hyperv.h|  96 +++--
 6 files changed, 242 insertions(+), 143 deletions(-)

-- 
2.11.0



[PATCH 2/5] vmbus: introduce in-place packet iterator

2017-02-27 Thread Stephen Hemminger
This is mostly just a refactoring of previous functions
(get_pkt_next_raw, put_pkt_raw and commit_rd_index) to make it easier
to use for other drivers and NAPI.

Signed-off-by: Stephen Hemminger 
---
 drivers/hv/ring_buffer.c| 94 +++-
 drivers/net/hyperv/netvsc.c | 34 +---
 include/linux/hyperv.h  | 96 ++---
 3 files changed, 133 insertions(+), 91 deletions(-)

diff --git a/drivers/hv/ring_buffer.c b/drivers/hv/ring_buffer.c
index 87799e81af97..c3f1a9e33cef 100644
--- a/drivers/hv/ring_buffer.c
+++ b/drivers/hv/ring_buffer.c
@@ -32,6 +32,8 @@
 
 #include "hyperv_vmbus.h"
 
+#define VMBUS_PKT_TRAILER  8
+
 /*
  * When we write to the ring buffer, check if the host needs to
  * be signaled. Here is the details of this protocol:
@@ -336,6 +338,12 @@ int hv_ringbuffer_write(struct vmbus_channel *channel,
return 0;
 }
 
+static inline void
+init_cached_read_index(struct hv_ring_buffer_info *rbi)
+{
+   rbi->cached_read_index = rbi->ring_buffer->read_index;
+}
+
 int hv_ringbuffer_read(struct vmbus_channel *channel,
   void *buffer, u32 buflen, u32 *buffer_actual_len,
   u64 *requestid, bool raw)
@@ -366,7 +374,8 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
return ret;
}
 
-   init_cached_read_index(channel);
+   init_cached_read_index(inring_info);
+
next_read_location = hv_get_next_read_location(inring_info);
next_read_location = hv_copyfrom_ringbuffer(inring_info, &desc,
sizeof(desc),
@@ -410,3 +419,86 @@ int hv_ringbuffer_read(struct vmbus_channel *channel,
 
return ret;
 }
+
+/*
+ * Determine number of bytes available in ring buffer after
+ * the current iterator (priv_read_index) location.
+ *
+ * This is similar to hv_get_bytes_to_read but with private
+ * read index instead.
+ */
+static u32 hv_pkt_iter_avail(const struct hv_ring_buffer_info *rbi)
+{
+   u32 priv_read_loc = rbi->priv_read_index;
+   u32 write_loc = READ_ONCE(rbi->ring_buffer->write_index);
+
+   if (write_loc >= priv_read_loc)
+   return write_loc - priv_read_loc;
+   else
+   return (rbi->ring_datasize - priv_read_loc) + write_loc;
+}
+
+/*
+ * Get first vmbus packet from ring buffer after read_index
+ *
+ * If ring buffer is empty, returns NULL and no other action needed.
+ */
+struct vmpacket_descriptor *hv_pkt_iter_first(struct vmbus_channel *channel)
+{
+   struct hv_ring_buffer_info *rbi = &channel->inbound;
+
+   /* set state for later hv_signal_on_read() */
+   init_cached_read_index(rbi);
+
+   if (hv_pkt_iter_avail(rbi) < sizeof(struct vmpacket_descriptor))
+   return NULL;
+
+   return hv_get_ring_buffer(rbi) + rbi->priv_read_index;
+}
+EXPORT_SYMBOL_GPL(hv_pkt_iter_first);
+
+/*
+ * Get next vmbus packet from ring buffer.
+ *
+ * Advances the current location (priv_read_index) and checks for more
+ * data. If the end of the ring buffer is reached, then return NULL.
+ */
+struct vmpacket_descriptor *
+__hv_pkt_iter_next(struct vmbus_channel *channel,
+  const struct vmpacket_descriptor *desc)
+{
+   struct hv_ring_buffer_info *rbi = &channel->inbound;
+   u32 packetlen = desc->len8 << 3;
+   u32 dsize = rbi->ring_datasize;
+
+   /* bump offset to next potential packet */
+   rbi->priv_read_index += packetlen + VMBUS_PKT_TRAILER;
+   if (rbi->priv_read_index >= dsize)
+   rbi->priv_read_index -= dsize;
+
+   /* more data? */
+   if (hv_pkt_iter_avail(rbi) < sizeof(struct vmpacket_descriptor))
+   return NULL;
+   else
+   return hv_get_ring_buffer(rbi) + rbi->priv_read_index;
+}
+EXPORT_SYMBOL_GPL(__hv_pkt_iter_next);
+
+/*
+ * Update host ring buffer after iterating over packets.
+ */
+void hv_pkt_iter_close(struct vmbus_channel *channel)
+{
+   struct hv_ring_buffer_info *rbi = &channel->inbound;
+
+   /*
+* Make sure all reads are done before we update the read index since
+* the writer may start writing to the read area once the read index
+* is updated.
+*/
+   virt_rmb();
+   rbi->ring_buffer->read_index = rbi->priv_read_index;
+
+   hv_signal_on_read(channel);
+}
+EXPORT_SYMBOL_GPL(hv_pkt_iter_close);
diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 5dedbc36c326..3681fb59bdbe 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -647,14 +647,11 @@ static void netvsc_send_tx_complete(struct netvsc_device *net_device,
 static void netvsc_send_completion(struct netvsc_device *net_device,
   struct vmbus_channel *incoming_channel,
   struct hv_device *device,
-  struct vmpacket_descriptor *packet)
+ 
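
The index arithmetic behind hv_pkt_iter_avail() and
__hv_pkt_iter_next() in the hunks above can be tried out in isolation.
The sketch below restates it as two pure functions over plain integers.
The names ring_avail() and ring_advance() are hypothetical, and the
ring size is passed in instead of being read from
hv_ring_buffer_info, so this illustrates the wrap-around handling only
and is not the vmbus code itself.

```c
#include <stdint.h>

#define PKT_TRAILER 8	/* same value as VMBUS_PKT_TRAILER in the patch */

/* Bytes readable between a private read index and the write index. */
static uint32_t ring_avail(uint32_t write_loc, uint32_t read_loc, uint32_t size)
{
	if (write_loc >= read_loc)
		return write_loc - read_loc;
	/* writer has wrapped: tail of the ring plus the part from offset 0 */
	return (size - read_loc) + write_loc;
}

/* Step over one packet whose length is stored in 8-byte units (len8). */
static uint32_t ring_advance(uint32_t read_loc, uint16_t len8, uint32_t size)
{
	uint32_t next = read_loc + ((uint32_t)len8 << 3) + PKT_TRAILER;

	if (next >= size)
		next -= size;	/* wrap back to the start of the ring */
	return next;
}
```

With a 64 byte ring, ring_advance(56, 1, 64) steps over an 8 byte
packet plus the 8 byte trailer and wraps to offset 8.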

Re: net/ipv6: null-ptr-deref in ip6_route_del/lock_acquire

2017-02-27 Thread Cong Wang
On Mon, Feb 27, 2017 at 7:28 AM, Andrey Konovalov  wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit e5d56efc97f8240d0b5d66c03949382b6d7e5570 (Feb 26).
>
> A reproducer and .config are attached.
>
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] SMP KASAN
> Modules linked in:
> CPU: 0 PID: 4045 Comm: a.out Not tainted 4.10.0+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: 88006b6bac00 task.stack: 88006a688000
> RIP: 0010:__lock_acquire+0xac4/0x3270 kernel/locking/lockdep.c:3224
> RSP: 0018:88006a68f250 EFLAGS: 00010006
> RAX: dc00 RBX: dc00 RCX: 
> RDX: 0006 RSI:  RDI: 11000d4d1ea4
> RBP: 88006a68f788 R08: 0001 R09: 
> R10: 0030 R11:  R12: 88006b6bac00
> R13:  R14: 86e64ec0 R15: 0001
> FS:  7fda492ff700() GS:88006ca0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 208c4000 CR3: 6a7e9000 CR4: 06f0
> Call Trace:
>  lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
>  __raw_write_lock_bh ./include/linux/rwlock_api_smp.h:203
>  _raw_write_lock_bh+0x3a/0x50 kernel/locking/spinlock.c:319
>  __ip6_del_rt_siblings net/ipv6/route.c:2177
>  ip6_route_del+0x4dd/0xa70 net/ipv6/route.c:2257
>  ipv6_route_ioctl+0x62d/0x790 net/ipv6/route.c:2620
>  inet6_ioctl+0xef/0x1e0 net/ipv6/af_inet6.c:520
>  sock_do_ioctl+0x65/0xb0 net/socket.c:895
>  sock_ioctl+0x28f/0x440 net/socket.c:993
>  vfs_ioctl fs/ioctl.c:43
>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:683
>  SYSC_ioctl fs/ioctl.c:698
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
>  entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:204

The attached patch fixes this crash, but I am not sure if it is the
best way to fix this bug yet...

Thanks.
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f54f426..3d1b260 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2216,12 +2216,13 @@ static int __ip6_del_rt_siblings(struct rt6_info *rt, struct fib6_config *cfg)
 
 static int ip6_route_del(struct fib6_config *cfg)
 {
+   struct net *net = cfg->fc_nlinfo.nl_net;
struct fib6_table *table;
struct fib6_node *fn;
struct rt6_info *rt;
int err = -ESRCH;
 
-   table = fib6_get_table(cfg->fc_nlinfo.nl_net, cfg->fc_table);
+   table = fib6_get_table(net, cfg->fc_table);
if (!table)
return err;
 
@@ -2247,6 +2248,8 @@ static int ip6_route_del(struct fib6_config *cfg)
continue;
if (cfg->fc_protocol && cfg->fc_protocol != rt->rt6i_protocol)
continue;
+   if (rt == net->ipv6.ip6_null_entry)
+   continue;
dst_hold(&rt->dst);
read_unlock_bh(&table->tb6_lock);
 


[PATCH] net: bridge: allow IPv6 when multicast flood is disabled

2017-02-27 Thread Mike Manning
Even with multicast flooding turned off, IPv6 ND should still work so
that IPv6 connectivity is provided. Allow this by continuing to flood
multicast traffic originated by us. And similar to the unicast case,
set auto-mask if the multicast flood flag is set.

Fixes: b6cb5ac8331b ("net: bridge: add per-port multicast flood flag")
Cc: Nikolay Aleksandrov 
Signed-off-by: Mike Manning 
---
 include/linux/if_bridge.h | 2 +-
 net/bridge/br_forward.c   | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index c5847dc..7731808 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -40,12 +40,12 @@ struct br_ip_list {
 #define BR_ADMIN_COST  BIT(4)
 #define BR_LEARNINGBIT(5)
 #define BR_FLOOD   BIT(6)
-#define BR_AUTO_MASK   (BR_FLOOD | BR_LEARNING)
 #define BR_PROMISC BIT(7)
 #define BR_PROXYARPBIT(8)
 #define BR_LEARNING_SYNC   BIT(9)
 #define BR_PROXYARP_WIFI   BIT(10)
 #define BR_MCAST_FLOOD BIT(11)
+#define BR_AUTO_MASK   (BR_FLOOD | BR_LEARNING | BR_MCAST_FLOOD)
 #define BR_MULTICAST_TO_UNICASTBIT(12)
 #define BR_VLAN_TUNNEL BIT(13)
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 6bfac29..7fe7d58 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -186,8 +186,9 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
/* Do not flood unicast traffic to ports that turn it off */
if (pkt_type == BR_PKT_UNICAST && !(p->flags & BR_FLOOD))
continue;
+   /* Do not flood if mc off, except for traffic we originate */
if (pkt_type == BR_PKT_MULTICAST &&
-   !(p->flags & BR_MCAST_FLOOD))
+   !(p->flags & BR_MCAST_FLOOD) && (skb->dev != br->dev))
continue;
 
/* Do not flood to ports that enable proxy ARP */
-- 
2.1.4
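
The per-port decision that the br_flood() hunk above changes can be
written out as a small predicate. The sketch below is a userspace
restatement under stated assumptions: skip_port() and the enum are
hypothetical names, and the locally_originated argument stands in for
the patch's skb->dev == br->dev test. Only the two flag values are
taken from the if_bridge.h hunk.

```c
#include <stdbool.h>

#define BR_FLOOD	(1UL << 6)	/* BIT(6) in if_bridge.h */
#define BR_MCAST_FLOOD	(1UL << 11)	/* BIT(11) in if_bridge.h */

enum pkt_type { PKT_UNICAST, PKT_MULTICAST, PKT_BROADCAST };

/* True if the flood loop should skip a port with the given flags. */
static bool skip_port(enum pkt_type type, unsigned long port_flags,
		      bool locally_originated)
{
	/* Do not flood unicast traffic to ports that turn it off */
	if (type == PKT_UNICAST && !(port_flags & BR_FLOOD))
		return true;
	/* Do not flood multicast if off, except for traffic we originate */
	if (type == PKT_MULTICAST && !(port_flags & BR_MCAST_FLOOD) &&
	    !locally_originated)
		return true;
	return false;
}
```

With multicast flooding disabled on a port, skip_port() still lets
through multicast that the bridge itself originates, which is the case
IPv6 neighbour discovery depends on.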



[PATCH v1 1/4] net/utils: generic inet_pton_with_scope helper

2017-02-27 Thread Sagi Grimberg
Several locations in the stack need to handle ipv4/ipv6
(with scope) and port strings conversion to sockaddr.
Add a helper that takes either AF_INET, AF_INET6 or
AF_UNSPEC (for wildcard) to centralize this handling.

Suggested-by: Christoph Hellwig 
Signed-off-by: Sagi Grimberg 
---
 include/linux/inet.h |   6 +++
 net/core/utils.c | 103 +++
 2 files changed, 109 insertions(+)

diff --git a/include/linux/inet.h b/include/linux/inet.h
index 4cca05c9678e..636ebe87e6f8 100644
--- a/include/linux/inet.h
+++ b/include/linux/inet.h
@@ -43,6 +43,8 @@
 #define _LINUX_INET_H
 
 #include 
+#include 
+#include 
 
 /*
  * These mimic similar macros defined in user-space for inet_ntop(3).
@@ -54,4 +56,8 @@
 extern __be32 in_aton(const char *str);
 extern int in4_pton(const char *src, int srclen, u8 *dst, int delim, const char **end);
 extern int in6_pton(const char *src, int srclen, u8 *dst, int delim, const char **end);
+
+extern int inet_pton_with_scope(struct net *net, unsigned short af,
+   const char *src, const char *port, struct sockaddr_storage *addr);
+
 #endif /* _LINUX_INET_H */
diff --git a/net/core/utils.c b/net/core/utils.c
index 6592d7bbed39..f96cf527bb8f 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -26,9 +26,11 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -300,6 +302,107 @@ int in6_pton(const char *src, int srclen,
 }
 EXPORT_SYMBOL(in6_pton);
 
+static int inet4_pton(const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in *addr4 = (struct sockaddr_in *)addr;
+   int srclen = strlen(src);
+
+   if (srclen > INET_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in4_pton(src, srclen, (u8 *)&addr4->sin_addr.s_addr,
+'\n', NULL) == 0)
+   return -EINVAL;
+
+   addr4->sin_family = AF_INET;
+   addr4->sin_port = htons(port_num);
+
+   return 0;
+}
+
+static int inet6_pton(struct net *net, const char *src, u16 port_num,
+   struct sockaddr_storage *addr)
+{
+   struct sockaddr_in6 *addr6 = (struct sockaddr_in6 *)addr;
+   const char *scope_delim;
+   int srclen = strlen(src);
+
+   if (srclen > INET6_ADDRSTRLEN)
+   return -EINVAL;
+
+   if (in6_pton(src, srclen, (u8 *)&addr6->sin6_addr.s6_addr,
+'%', &scope_delim) == 0)
+   return -EINVAL;
+
+   if (ipv6_addr_type(&addr6->sin6_addr) & IPV6_ADDR_LINKLOCAL &&
+   src + srclen != scope_delim && *scope_delim == '%') {
+   struct net_device *dev;
+   char scope_id[16];
+   size_t scope_len = min_t(size_t, sizeof(scope_id),
+src + srclen - scope_delim - 1);
+
+   memcpy(scope_id, scope_delim + 1, scope_len);
+   scope_id[scope_len] = '\0';
+
+   dev = dev_get_by_name(net, scope_id);
+   if (dev) {
+   addr6->sin6_scope_id = dev->ifindex;
+   dev_put(dev);
+   } else if (kstrtouint(scope_id, 0, &addr6->sin6_scope_id)) {
+   return -EINVAL;
+   }
+   }
+
+   addr6->sin6_family = AF_INET6;
+   addr6->sin6_port = htons(port_num);
+
+   return 0;
+}
+
+/**
+ * inet_pton_with_scope - convert an IPv4/IPv6 and port to socket address
+ * @net: net namespace (used for scope handling)
+ * @af: address family, AF_INET, AF_INET6 or AF_UNSPEC for either
+ * @src: the start of the address string
+ * @port: the start of the port string (or NULL for none)
+ * @addr: output socket address
+ *
+ * Return zero on success, return errno when any error occurs.
+ */
+int inet_pton_with_scope(struct net *net, __kernel_sa_family_t af,
+   const char *src, const char *port, struct sockaddr_storage *addr)
+{
+   u16 port_num;
+   int ret = -EINVAL;
+
+   if (port) {
+   if (kstrtou16(port, 0, &port_num))
+   return -EINVAL;
+   } else {
+   port_num = 0;
+   }
+
+   switch (af) {
+   case AF_INET:
+   ret = inet4_pton(src, port_num, addr);
+   break;
+   case AF_INET6:
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   case AF_UNSPEC:
+   ret = inet4_pton(src, port_num, addr);
+   if (ret)
+   ret = inet6_pton(net, src, port_num, addr);
+   break;
+   default:
+   pr_err("unexpected address family %d\n", af);
+   };
+
+   return ret;
+}
+EXPORT_SYMBOL(inet_pton_with_scope);
+
 void inet_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
  __be32 from, __be32 to, bool pseudohdr)
 {
-- 
2.7.4
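
The AF_UNSPEC branch of the helper (try IPv4 first, fall back to IPv6) can be tried out in userspace. The sketch below is an approximation under stated assumptions: parse_ip_any() is a hypothetical name, it uses the POSIX inet_pton() rather than the kernel's in4_pton()/in6_pton(), it takes the port as a number instead of a string, and it leaves out the link-local scope handling entirely.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Hypothetical userspace analogue of the AF_UNSPEC case:
 * try IPv4 first, then IPv6. Returns 0 on success, -1 otherwise.
 * Scope ids ("%eth0") are NOT handled here.
 */
static int parse_ip_any(const char *src, uint16_t port,
                        struct sockaddr_storage *out)
{
    struct in_addr v4;
    struct in6_addr v6;

    memset(out, 0, sizeof(*out));

    if (inet_pton(AF_INET, src, &v4) == 1) {
        struct sockaddr_in *a4 = (struct sockaddr_in *)out;

        a4->sin_family = AF_INET;
        a4->sin_port = htons(port);
        a4->sin_addr = v4;
        return 0;
    }

    if (inet_pton(AF_INET6, src, &v6) == 1) {
        struct sockaddr_in6 *a6 = (struct sockaddr_in6 *)out;

        a6->sin6_family = AF_INET6;
        a6->sin6_port = htons(port);
        a6->sin6_addr = v6;
        return 0;
    }

    return -1;
}
```

The temporaries keep a failed IPv4 attempt from leaving partial data in the output structure before the IPv6 attempt.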



[PATCH v1 2/4] nvmet-rdma: use generic inet_pton_with_scope

2017-02-27 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg 
---
 drivers/nvme/target/rdma.c | 42 +-
 1 file changed, 29 insertions(+), 13 deletions(-)

diff --git a/drivers/nvme/target/rdma.c b/drivers/nvme/target/rdma.c
index 9aa1da3778b3..973b674ab55b 100644
--- a/drivers/nvme/target/rdma.c
+++ b/drivers/nvme/target/rdma.c
@@ -1429,12 +1429,16 @@ static void nvmet_rdma_delete_ctrl(struct nvmet_ctrl *ctrl)
 static int nvmet_rdma_add_port(struct nvmet_port *port)
 {
struct rdma_cm_id *cm_id;
-   struct sockaddr_in addr_in;
-   u16 port_in;
+   struct sockaddr_storage addr = { };
+   __kernel_sa_family_t af;
int ret;
 
switch (port->disc_addr.adrfam) {
case NVMF_ADDR_FAMILY_IP4:
+   af = AF_INET;
+   break;
+   case NVMF_ADDR_FAMILY_IP6:
+   af = AF_INET6;
break;
default:
pr_err("address family %d not supported\n",
@@ -1442,13 +1446,13 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return -EINVAL;
}
 
-   ret = kstrtou16(port->disc_addr.trsvcid, 0, &port_in);
-   if (ret)
+   ret = inet_pton_with_scope(&init_net, af, port->disc_addr.traddr,
+   port->disc_addr.trsvcid, &addr);
+   if (ret) {
+   pr_err("malformed ip/port passed: %s:%s\n",
+   port->disc_addr.traddr, port->disc_addr.trsvcid);
return ret;
-
-   addr_in.sin_family = AF_INET;
-   addr_in.sin_addr.s_addr = in_aton(port->disc_addr.traddr);
-   addr_in.sin_port = htons(port_in);
+   }
 
cm_id = rdma_create_id(&init_net, nvmet_rdma_cm_handler, port,
RDMA_PS_TCP, IB_QPT_RC);
@@ -1457,20 +1461,32 @@ static int nvmet_rdma_add_port(struct nvmet_port *port)
return PTR_ERR(cm_id);
}
 
-   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&addr_in);
+   /*
+* Allow both IPv4 and IPv6 sockets to bind a single port
+* at the same time.
+*/
+   ret = rdma_set_afonly(cm_id, 1);
+   if (ret) {
+   pr_err("rdma_set_afonly failed (%d)\n", ret);
+   goto out_destroy_id;
+   }
+
+   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&addr);
if (ret) {
-   pr_err("binding CM ID to %pISpc failed (%d)\n", &addr_in, ret);
+   pr_err("binding CM ID to %pISpcs failed (%d)\n",
+   (struct sockaddr *)&addr, ret);
goto out_destroy_id;
}
 
ret = rdma_listen(cm_id, 128);
if (ret) {
-   pr_err("listening to %pISpc failed (%d)\n", &addr_in, ret);
+   pr_err("listening to %pISpcs failed (%d)\n",
+   (struct sockaddr *)&addr, ret);
goto out_destroy_id;
}
 
-   pr_info("enabling port %d (%pISpc)\n",
-   le16_to_cpu(port->disc_addr.portid), &addr_in);
+   pr_info("enabling port %d (%pISpcs)\n",
+   le16_to_cpu(port->disc_addr.portid), (struct sockaddr *)&addr);
port->priv = cm_id;
return 0;
 
-- 
2.7.4



[PATCH v1 3/4] nvme-rdma: use inet_pton_with_scope helper

2017-02-27 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg 
---
 drivers/nvme/host/rdma.c | 63 +++-
 1 file changed, 19 insertions(+), 44 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 49b2121af689..3f4c49969f55 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -129,14 +129,8 @@ struct nvme_rdma_ctrl {
u64 cap;
u32 max_fr_pages;
 
-   union {
-   struct sockaddr addr;
-   struct sockaddr_in addr_in;
-   };
-   union {
-   struct sockaddr src_addr;
-   struct sockaddr_in src_addr_in;
-   };
+   struct sockaddr_storage addr;
+   struct sockaddr_storage src_addr;
 
struct nvme_ctrl ctrl;
 };
@@ -571,11 +565,12 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
return PTR_ERR(queue->cm_id);
}
 
-   queue->cm_error = -ETIMEDOUT;
if (ctrl->ctrl.opts->mask & NVMF_OPT_HOST_TRADDR)
-   src_addr = &ctrl->src_addr;
+   src_addr = (struct sockaddr *)&ctrl->src_addr;
 
-   ret = rdma_resolve_addr(queue->cm_id, src_addr, &ctrl->addr,
+   queue->cm_error = -ETIMEDOUT;
+   ret = rdma_resolve_addr(queue->cm_id, src_addr,
+   (struct sockaddr *)&ctrl->addr,
NVME_RDMA_CONNECT_TIMEOUT_MS);
if (ret) {
dev_info(ctrl->ctrl.device,
@@ -1857,27 +1852,13 @@ static int nvme_rdma_create_io_queues(struct nvme_rdma_ctrl *ctrl)
return ret;
 }
 
-static int nvme_rdma_parse_ipaddr(struct sockaddr_in *in_addr, char *p)
-{
-   u8 *addr = (u8 *)&in_addr->sin_addr.s_addr;
-   size_t buflen = strlen(p);
-
-   /* XXX: handle IPv6 addresses */
-
-   if (buflen > INET_ADDRSTRLEN)
-   return -EINVAL;
-   if (in4_pton(p, buflen, addr, '\0', NULL) == 0)
-   return -EINVAL;
-   in_addr->sin_family = AF_INET;
-   return 0;
-}
-
 static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
struct nvmf_ctrl_options *opts)
 {
struct nvme_rdma_ctrl *ctrl;
int ret;
bool changed;
+   char *port;
 
ctrl = kzalloc(sizeof(*ctrl), GFP_KERNEL);
if (!ctrl)
@@ -1885,34 +1866,28 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
ctrl->ctrl.opts = opts;
INIT_LIST_HEAD(&ctrl->list);
 
-   ret = nvme_rdma_parse_ipaddr(&ctrl->addr_in, opts->traddr);
+   if (opts->mask & NVMF_OPT_TRSVCID)
+   port = opts->trsvcid;
+   else
+   port = __stringify(NVME_RDMA_IP_PORT);
+
+   ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+   opts->traddr, port, &ctrl->addr);
if (ret) {
-   pr_err("malformed IP address passed: %s\n", opts->traddr);
+   pr_err("malformed address passed: %s:%s\n", opts->traddr, port);
goto out_free_ctrl;
}
 
if (opts->mask & NVMF_OPT_HOST_TRADDR) {
-   ret = nvme_rdma_parse_ipaddr(&ctrl->src_addr_in,
-   opts->host_traddr);
+   ret = inet_pton_with_scope(&init_net, AF_UNSPEC,
+   opts->host_traddr, NULL, &ctrl->src_addr);
if (ret) {
-   pr_err("malformed src IP address passed: %s\n",
+   pr_err("malformed src address passed: %s\n",
   opts->host_traddr);
goto out_free_ctrl;
}
}
 
-   if (opts->mask & NVMF_OPT_TRSVCID) {
-   u16 port;
-
-   ret = kstrtou16(opts->trsvcid, 0, &port);
-   if (ret)
-   goto out_free_ctrl;
-
-   ctrl->addr_in.sin_port = cpu_to_be16(port);
-   } else {
-   ctrl->addr_in.sin_port = cpu_to_be16(NVME_RDMA_IP_PORT);
-   }
-
ret = nvme_init_ctrl(&ctrl->ctrl, dev, &nvme_rdma_ctrl_ops,
0 /* no quirks, we're perfect! */);
if (ret)
@@ -1977,7 +1952,7 @@ static struct nvme_ctrl *nvme_rdma_create_ctrl(struct device *dev,
changed = nvme_change_ctrl_state(&ctrl->ctrl, NVME_CTRL_LIVE);
WARN_ON_ONCE(!changed);
 
-   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISp\n",
+   dev_info(ctrl->ctrl.device, "new ctrl: NQN \"%s\", addr %pISpcs\n",
ctrl->ctrl.opts->subsysnqn, &ctrl->addr);
 
kref_get(&ctrl->ctrl.kref);
-- 
2.7.4



[PATCH v1 0/4] Introduce a new helper for parsing ipv[4|6]:port to socket address

2017-02-27 Thread Sagi Grimberg
Changes from v0:
- rebased on 4.10
- split inet_pton_with_scope to be a bit saner (from Christoph)
- converted nvme-rdma host_traddr to use a generic helper

We have some places in the stack that support ipv4 and ipv6. In
some cases the user configuration does not reveal which
address family is given and needs to be parsed from the input string.

Given that the user-input varies between subsystems, some processing
is required from the call-site to separate address and port strings.

As a side-effect, this set adds ipv6 support for nvme over fabrics.

Sagi Grimberg (4):
  net/utils: generic inet_pton_with_scope helper
  nvmet-rdma: use generic inet_pton_with_scope
  nvme-rdma: use inet_pton_with_scope helper
  iscsi-target: use generic inet_pton_with_scope

 drivers/nvme/host/rdma.c |  63 +---
 drivers/nvme/target/rdma.c   |  42 +++
 drivers/target/iscsi/iscsi_target_configfs.c |  46 
 include/linux/inet.h |   6 ++
 net/core/utils.c | 103 +++
 5 files changed, 169 insertions(+), 91 deletions(-)

-- 
2.7.4
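
The cover letter notes that call sites still have to separate the address and port strings before calling the helper. A minimal sketch of such a split follows. It assumes two input shapes that the cover letter does not itself specify, "addr:port" and "[addr]:port" (brackets for IPv6), and split_addr_port() is a hypothetical name, not something from this series.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical call-site helper: split "addr:port" or "[addr]:port"
 * in place. *port is set to NULL when no port is present.
 * Returns 0 on success, -1 on a malformed bracketed address.
 * A bare (unbracketed) IPv6 address is not supported by this sketch.
 */
static int split_addr_port(char *input, char **addr, char **port)
{
    char *p;

    *port = NULL;

    if (input[0] == '[') {
        p = strchr(input, ']');
        if (!p)
            return -1;
        *p = '\0';
        *addr = input + 1;
        if (p[1] == ':')
            *port = p + 2;
        else if (p[1] != '\0')
            return -1;
        return 0;
    }

    *addr = input;
    p = strrchr(input, ':');
    if (p) {
        *p = '\0';
        *port = p + 1;
    }
    return 0;
}
```

The two resulting strings could then be handed to a parser as separate src and port arguments, with a NULL port meaning "none".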



Re: [PATCH v2 net] net: solve a NAPI race

2017-02-27 Thread Eric Dumazet
On Mon, 2017-02-27 at 08:44 -0800, Eric Dumazet wrote:

> // busy polling or napi_watchdog()

BTW, we also can add to the beginning of busy_poll_stop() :

clear_bit(NAPI_STATE_MISSED, &napi->state);





Re: [PATCH] iproute2: show network device dependency tree

2017-02-27 Thread Stephen Hemminger
On Sun, 26 Feb 2017 08:56:33 +0100
Jiri Pirko  wrote:

> Did you see https://github.com/jbenc/plotnetcfg ?


Cool, thanks.


Re: phy deadlock -stable backport request

2017-02-27 Thread David Miller
From: Niklas Cassel 
Date: Mon, 27 Feb 2017 14:56:31 +0100

> I would like to request that
> 
> commit eab127717a6af54401ba534790c793ec143cd1fc
> Author: Florian Fainelli 
> Date:   Fri Jan 20 15:31:52 2017 -0800
> 
> net: phy: Avoid deadlock during phy_error()
 ...
> would be backported to stable branch v4.9.
> 
> I've seen this deadlock happen on v4.9.x

Ok, queued up.


Re: [PATCH] iproute2: show network device dependency tree

2017-02-27 Thread Jiri Benc
On Sun, 26 Feb 2017 15:46:10 +0100, Jiri Pirko wrote:
> You can also run it remotely. Also I believe that you can catch the
> state into some dump file and process it later on. Not 100% sure though.
> Ccing Jiri Benc who is the original author of plotnetcfg.

It produces dot (graphviz) output or json and has no dependencies on
anything GUI related. Just run it on the remote machine and display the
output locally.

ssh root@remote plotnetcfg | dot -Tpdf | whatever_pdf_viewer

Note that some pdf viewers can't read stdin or require dash as the
parameter to use stdin.

I don't think it's possible to enhance iproute2 to display the network
interface dependencies in a useful way. It's just too complex. It's
not even an (undirected) tree.

 Jiri


Re: [PATCH v2 net] net: solve a NAPI race

2017-02-27 Thread Eric Dumazet
On Mon, 2017-02-27 at 11:19 -0500, David Miller wrote:

> Various rules were meant to protect these sequences, and make sure
> nothing like this race could happen.
> 
> Can you show the specific sequence that fails?
> 
> One of the basic protections is that the device IRQ is not re-enabled
> until napi_complete_done() is finished, most drivers do something like
> this:
> 
>   napi_complete_done();
>   - clears NAPI_STATE_SCHED
>   enable device IRQ
> 
> So I don't understand how it is possible that "later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it".
> 
> While napi_complete_done() is running, the device's IRQ is still
> disabled, so there cannot be an IRQ firing before napi_complete_done()
> is finished.


Any place that calls napi_schedule() from outside the device hard irq handler
is subject to the race on NICs using some kind of edge-triggered interrupts.

Since we do not provide a ndo to disable device interrupts,
the following can happen.

thread 1                                 thread 2 (could be on same cpu)

// busy polling or napi_watchdog()
napi_schedule();
...
napi->poll()

device polling:
read 2 packets from ring buffer
                                         Additional 3rd packet is available.
                                         device hard irq

                                         // does nothing because
                                         // NAPI_STATE_SCHED bit is owned by thread 1
                                         napi_schedule();

napi_complete_done(napi, 2);
rearm_irq();

Note that rearm_irq() will not force the device to send an additional IRQ
for the packet it already signaled (3rd packet in my example)

At least for mlx4, only 4th packet will trigger the IRQ again.

In the old days, the race would not happen since napi->poll() was called
in direct response to a prior device IRQ :
Edge triggered hard irqs from the device for this queue were already disabled.
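
The sequence above can be replayed with a tiny single-threaded model. This is only a sketch under stated assumptions, not kernel code: struct napi_model and the model_*() functions are hypothetical, and the way model_complete_done() reacts to the MISSED bit (keep ownership and ask the caller to poll again) is my assumption about how such a bit would be consumed. The only part taken from the patch description is that napi_schedule_prep() sets NAPI_STATE_MISSED when it cannot grab NAPI_STATE_SCHED.

```c
#include <stdbool.h>

struct napi_model {
    bool sched;   /* models NAPI_STATE_SCHED */
    bool missed;  /* models NAPI_STATE_MISSED */
    int ring;     /* packets waiting in the RX ring */
};

/* Returns true if the caller obtained ownership of the SCHED bit. */
static bool model_schedule_prep(struct napi_model *n, bool use_missed)
{
    if (n->sched) {
        if (use_missed)
            n->missed = true; /* remember that someone wanted a poll */
        return false;
    }
    n->sched = true;
    return true;
}

/* Returns true if polling is finished, false if the caller must repoll. */
static bool model_complete_done(struct napi_model *n)
{
    if (n->missed) {
        n->missed = false;
        return false; /* assumption: keep SCHED and poll again */
    }
    n->sched = false;
    return true;
}

/* Replays the two-thread sequence; returns packets left stranded. */
static int run_scenario(bool use_missed)
{
    struct napi_model n = { .sched = false, .missed = false, .ring = 2 };

    model_schedule_prep(&n, use_missed); /* thread 1: busy poll */
    n.ring = 0;                          /* poll reads 2 packets */
    n.ring += 1;                         /* 3rd packet arrives */
    model_schedule_prep(&n, use_missed); /* hard irq: SCHED is owned */

    while (!model_complete_done(&n))
        n.ring = 0;                      /* repoll drains the ring */

    return n.ring;
}
```

Without the extra bit the model ends with one packet stranded in the ring and no interrupt pending; with it, the completion path notices the missed schedule and drains the ring.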






Re: net/sctp: use-after-free in sctp_hash_transport

2017-02-27 Thread Xin Long
On Mon, Feb 27, 2017 at 11:45 PM, Andrey Konovalov
 wrote:
> Hi,
>
> I've got the following error report while fuzzing the kernel with syzkaller.
>
> On commit e5d56efc97f8240d0b5d66c03949382b6d7e5570 (Feb 26).
>
> A reproducer and .config are attached.
>
> ===
> [ ERR: suspicious RCU usage.  ]
> 4.10.0+ #54 Not tainted
> ---
> ./include/linux/rhashtable.h:602 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 2, debug_locks = 0
> 1 lock held by a.out/4189:
>  #0:  (sk_lock-AF_INET6){+.+.+.}, at: []
> sctp_setsockopt+0x318/0x5f10
>
> stack backtrace:
> CPU: 1 PID: 4189 Comm: a.out Not tainted 4.10.0+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:15
>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>  lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4452
>  __rhashtable_lookup ./include/linux/rhashtable.h:602
>  rhltable_lookup ./include/linux/rhashtable.h:690
>  sctp_hash_transport+0x826/0xcc0 net/sctp/input.c:887
>  sctp_assoc_add_peer+0xd0b/0x1470 net/sctp/associola.c:716
>  __sctp_connect+0x26d/0xdb0 net/sctp/socket.c:1184
>  __sctp_setsockopt_connectx+0x197/0x200 net/sctp/socket.c:1338
>  sctp_setsockopt_connectx net/sctp/socket.c:1370
>  sctp_setsockopt+0x15fa/0x5f10 net/sctp/socket.c:3936
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2725
>  SYSC_setsockopt net/socket.c:1786
>  SyS_setsockopt+0x270/0x3a0 net/socket.c:1765
>  entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:204
> RIP: 0033:0x7f3e27a55b79
> RSP: 002b:7f3e2296fd98 EFLAGS: 0206 ORIG_RAX: 0036
> RAX: ffda RBX: 7f3e229709c0 RCX: 7f3e27a55b79
> RDX: 006e RSI: 0084 RDI: 0003
> RBP: 7f3e27f21220 R08: 0010 R09: 
> R10: 20004000 R11: 0206 R12: 
> R13: 7f3e229709c0 R14: 7f3e2834c040 R15: 0003
> ==
> BUG: KASAN: use-after-free in sctp_hash_transport+0x855/0xcc0 at addr
> 8800671e1f8c
> Read of size 4 by task a.out/4189
> CPU: 1 PID: 4189 Comm: a.out Not tainted 4.10.0+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:15
>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:162
>  print_address_description mm/kasan/report.c:200
>  kasan_report_error mm/kasan/report.c:289
>  kasan_report.part.1+0x20e/0x4e0 mm/kasan/report.c:311
>  kasan_report mm/kasan/report.c:331
>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:331
>  rht_key_hashfn ./include/linux/rhashtable.h:254
>  __rhashtable_lookup ./include/linux/rhashtable.h:604
>  rhltable_lookup ./include/linux/rhashtable.h:690
>  sctp_hash_transport+0x855/0xcc0 net/sctp/input.c:887
>  sctp_assoc_add_peer+0xd0b/0x1470 net/sctp/associola.c:716
>  __sctp_connect+0x26d/0xdb0 net/sctp/socket.c:1184
>  __sctp_setsockopt_connectx+0x197/0x200 net/sctp/socket.c:1338
>  sctp_setsockopt_connectx net/sctp/socket.c:1370
>  sctp_setsockopt+0x15fa/0x5f10 net/sctp/socket.c:3936
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2725
>  SYSC_setsockopt net/socket.c:1786
>  SyS_setsockopt+0x270/0x3a0 net/socket.c:1765
>  entry_SYSCALL_64_fastpath+0x1f/0xc2 arch/x86/entry/entry_64.S:204
> RIP: 0033:0x7f3e27a55b79
> RSP: 002b:7f3e2296fd98 EFLAGS: 0206 ORIG_RAX: 0036
> RAX: ffda RBX: 7f3e229709c0 RCX: 7f3e27a55b79
> RDX: 006e RSI: 0084 RDI: 0003
> RBP: 7f3e27f21220 R08: 0010 R09: 
> R10: 20004000 R11: 0206 R12: 
> R13: 7f3e229709c0 R14: 7f3e2834c040 R15: 0003
> Object at 8800671e1f80, in cache kmalloc-1024 size: 1024
> Allocated:
> PID = 1
> save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:502
>  set_track mm/kasan/kasan.c:514
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:605
>  __kmalloc+0xa0/0x2d0 mm/slub.c:3745
>  kmalloc ./include/linux/slab.h:495
>  kzalloc ./include/linux/slab.h:663
>  bucket_table_alloc+0x618/0x930 lib/rhashtable.c:224
>  rhashtable_init+0x5f8/0xc60 lib/rhashtable.c:1006
>  rhltable_init+0x53/0xa0 lib/rhashtable.c:1037
>  sctp_transport_hashtable_init+0x1c/0x20 net/sctp/input.c:865
>  sctp_init+0x62c/0x88f net/sctp/protocol.c:1486
>  do_one_initcall+0xf3/0x390 init/main.c:788
>  do_initcall_level init/main.c:854
>  do_initcalls init/main.c:862
>  do_basic_setup init/main.c:880
>  kernel_init_freeable+0x5cc/0x6a6 init/main.c:1031
>  kernel_init+0x13/0x180 init/main.c:955
>  ret_from_fork+0x31/0x40 arch/x86/entry/entry_64.S:430
> Freed:
> PID = 0
>  save_stack_trace+0x16/0x20 

Re: [PATCH v2 net] net: solve a NAPI race

2017-02-27 Thread David Miller
From: Eric Dumazet 
Date: Mon, 27 Feb 2017 06:21:38 -0800

> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
> 
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED

Various rules were meant to protect these sequences, and make sure
nothing like this race could happen.

Can you show the specific sequence that fails?

One of the basic protections is that the device IRQ is not re-enabled
until napi_complete_done() is finished, most drivers do something like
this:

napi_complete_done();
- clears NAPI_STATE_SCHED
enable device IRQ

So I don't understand how it is possible that "later an IRQ firing and
finding this bit set, right before napi_complete_done() clears it".

While napi_complete_done() is running, the device's IRQ is still
disabled, so there cannot be an IRQ firing before napi_complete_done()
is finished.


[PATCH net] rxrpc: Fix deadlock between call creation and sendmsg/recvmsg

2017-02-27 Thread David Howells
All the routines by which rxrpc is accessed from the outside are serialised
by means of the socket lock (sendmsg, recvmsg, bind,
rxrpc_kernel_begin_call(), ...) and this presents a problem:

 (1) If a number of calls on the same socket are in the process of
 connection to the same peer, a maximum of four concurrent live calls
 are permitted before further calls need to wait for a slot.

 (2) If a call is waiting for a slot, it is deep inside sendmsg() or
 rxrpc_kernel_begin_call() and the entry function is holding the socket
 lock.

 (3) sendmsg() and recvmsg() or the in-kernel equivalents are prevented
 from servicing the other calls as they need to take the socket lock to
 do so.

 (4) The socket is stuck until a call is aborted and makes its slot
 available to the waiter.

Fix this by:

 (1) Provide each call with a mutex ('user_mutex') that arbitrates access
 by the users of rxrpc separately for each specific call.

 (2) Make rxrpc_sendmsg() and rxrpc_recvmsg() unlock the socket as soon as
 they've got a call and taken its mutex.

 Note that I'm returning EWOULDBLOCK from recvmsg() if MSG_DONTWAIT is
 set but someone else has the lock.  Should I instead only return
 EWOULDBLOCK if there's nothing currently to be done on a socket, and
 sleep in this particular instance because there is something to be
 done, but we appear to be blocked by the interrupt handler doing its
 ping?

 (3) Make rxrpc_new_client_call() unlock the socket after allocating a new
 call, locking its user mutex and adding it to the socket's call tree.
 The call is returned locked so that sendmsg() can add data to it
 immediately.

 From the moment the call is in the socket tree, it is subject to
 access by sendmsg() and recvmsg() - even if it isn't connected yet.

 (4) Lock new service calls in the UDP data_ready handler (in
 rxrpc_new_incoming_call()) because they may already be in the socket's
 tree and the data_ready handler makes them live immediately if a user
 ID has already been preassigned.

 Note that the new call is locked before any notifications are sent
 that it is live, so doing mutex_trylock() *ought* to always succeed.
 Userspace is prevented from doing sendmsg() on calls that are in a
 too-early state in rxrpc_do_sendmsg().

 (5) Make rxrpc_new_incoming_call() return the call with the user mutex
 held so that a ping can be scheduled immediately under it.

 Note that it might be worth moving the ping call into
 rxrpc_new_incoming_call() and then we can drop the mutex there.

 (6) Make rxrpc_accept_call() take the lock on the call it is accepting and
 release the socket after adding the call to the socket's tree.  This
 is slightly tricky as we've dequeued the call by that point and have
 to requeue it.

 Note that requeuing emits a trace event.

 (7) Make rxrpc_kernel_send_data() and rxrpc_kernel_recv_data() take the
 new mutex immediately and don't bother with the socket mutex at all.

This patch has the nice bonus that calls on the same socket are now to some
extent parallelisable.


Note that we might want to move rxrpc_service_prealloc() calls out from the
socket lock and give it its own lock, so that we don't hang progress in
other calls because we're waiting for the allocator.

We probably also want to avoid calling rxrpc_notify_socket() from within
the socket lock (rxrpc_accept_call()).

Signed-off-by: David Howells 
Tested-by: Marc Dionne 
---

 include/trace/events/rxrpc.h |2 +
 net/rxrpc/af_rxrpc.c |   12 +++--
 net/rxrpc/ar-internal.h  |1 +
 net/rxrpc/call_accept.c  |   48 +++
 net/rxrpc/call_object.c  |   18 -
 net/rxrpc/input.c|1 +
 net/rxrpc/recvmsg.c  |   39 -
 net/rxrpc/sendmsg.c  |   57 ++
 8 files changed, 156 insertions(+), 22 deletions(-)

diff --git a/include/trace/events/rxrpc.h b/include/trace/events/rxrpc.h
index 593f586545eb..39123c06a566 100644
--- a/include/trace/events/rxrpc.h
+++ b/include/trace/events/rxrpc.h
@@ -119,6 +119,7 @@ enum rxrpc_recvmsg_trace {
rxrpc_recvmsg_full,
rxrpc_recvmsg_hole,
rxrpc_recvmsg_next,
+   rxrpc_recvmsg_requeue,
rxrpc_recvmsg_return,
rxrpc_recvmsg_terminal,
rxrpc_recvmsg_to_be_accepted,
@@ -277,6 +278,7 @@ enum rxrpc_congest_change {
EM(rxrpc_recvmsg_full,  "FULL") \
EM(rxrpc_recvmsg_hole,  "HOLE") \
EM(rxrpc_recvmsg_next,  "NEXT") \
+   EM(rxrpc_recvmsg_requeue,   "REQU") \
EM(rxrpc_recvmsg_return,"RETN") \
EM(rxrpc_recvmsg_terminal,  "TERM") \
EM(rxrpc_recvmsg_to_be_accepted,"TBAC") \
diff --git 


Re: [PATCH v3 20/20] checkpatch: warn for use of old PCI pool API

2017-02-27 Thread Joe Perches
On Mon, 2017-02-27 at 13:52 +0100, Romain Perier wrote:

> > I also wonder if you've in fact converted all of the
> > pci_pool struct and function uses why a new checkpatch
> > test is needed at all.
> 
> That's just to avoid future mistakes/uses.

When all instances and macro definitions are removed
the check is pointless as any newly submitted patch
will not compile.



Re: [bpf] 9d876e79df: BUG: unable to handle kernel paging request at 653a8346

2017-02-27 Thread Daniel Borkmann

On 02/27/2017 03:14 AM, kernel test robot wrote:

> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master


I'll take a look, thanks for the report!


Re: [RFC PATCH net-next 2/5] net: split skb_checksum_help

2017-02-27 Thread Davide Caratti
On Mon, 2017-01-23 at 12:59 -0800, Tom Herbert wrote:
> > > > It might make sense to create some CRC helper functions, but last time
> > > > I checked there are so few users of CRC in skbufs I'm not even sure
> > > > that would make sense.

hello Tom and David,

after some (thinking + testing) time, I'm going to re-post this RFC as v2 with
some feedback. Thank you in advance for looking at it!

On Thu, 2017-02-02 at 10:08 -0800, Tom Herbert wrote:
> On Thu, 2017-02-02 at 16:07 +0100, Davide Caratti wrote:
> > This is exactly the cause of issues I see with SCTP. These packets can be
> > wrongly checksummed using skb_checksum_help, or simply not checksummed at
> > all; and in both cases, the packet goes out from the NIC with wrong L4
> > checksum.
> > 
> Okay, makes sense. Please consider doing the following:
> 
> - Add a bit to skbuf called something like "csum_not_inet". When
> ip_summed == CHECKSUM_PARTIAL and this bit is set that means we are
> dealing with something other than an Internet checksum.

Ok, done. Another solution would be to extend possible values of
skb->ip_summed, and define a new value suitable for identifying
not-yet-checksummed SCTP packets (something like CRC32C_PARTIAL). Since
skb->ip_summed is 2-bit wide, the overall effect on skb metadata is the
same as adding skb->csum_not_inet [1].
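
To make concrete why a helper built for the Internet checksum produces a wrong value for SCTP: SCTP uses CRC32c, which is a different function from the 16-bit one's-complement sum. The sketch below is a plain bitwise userspace illustration, not the kernel implementation; crc32c() and inet_csum() are names chosen here for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* 16-bit one's-complement Internet checksum over big-endian words. */
static uint16_t inet_csum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)data[i] << 8) | data[i + 1];
    if (i < len)
        sum += (uint32_t)data[i] << 8; /* pad odd trailing byte */

    while (sum >> 16)
        sum = (sum & 0xFFFFu) + (sum >> 16); /* fold the carries */

    return (uint16_t)~sum;
}
```

The two functions differ in width and in algorithm, so a single "partial checksum" indication cannot tell a software fallback which one to finish.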

> - At the top of skb_checksum_help (or maybe before the point where the
> inet-specific checksum starts) do something like:
> 
>    if (unlikely(skb->csum_not_inet))
>        return skb_checksum_help_not_inet(...);
> 
>    The rest of skb_checksum_help should remain unchanged.

According to documentation [2], validate_xmit_skb() is a good place where
the if() statement above can be done, to preserve the possibility of having
the CRC32c computation offloaded by the NIC hardware:

if (unlikely(skb->csum_not_inet && !(features & NETIF_F_SCTP_CRC)))
   return skb_checksum_help_not_inet(...);

On Thu, 2017-02-02 at 16:55 +, David Laight wrote:
> 
> I'd put the onus on any such interface to perform the checksum (and
> set CHECKSUM_COMPLETE (or is it UNNECESSARY?) before passing the 
> message onto an interface that doesn't advertise CRC32 support.
> 
> You certainly don't want to have to go through all the ethernet drivers!

Ideally, a driver not able to offload checksum computation should call
skb_checksum_help() or skb_sctp_csum_help() to resolve CHECKSUM_PARTIAL
and turn it to CHECKSUM_NONE.
But this wouldn't solve all possible setups: there can be scenarios
where the NIC is configured with NETIF_F_SCTP_CRC set and NETIF_F_HW_CSUM
cleared (it's evil, but possible). In this situation, non-GSO SCTP packets
having CHECKSUM_PARTIAL will be systematically corrupted when they are
processed by validate_xmit_skb().

On Thu, 2017-02-02 at 10:08 -0800, Tom Herbert wrote:

> 
> - Add a description of the new bit and how skb_checksum_help can work
> to the comments for CHECKSUM_PARTIAL in skbuff.h

Done.

> 
> - Add FCOE to the list of protocol that can set CHECKSUM_UNNECESSARY
> for a CRC/csum

Done.

> 
> - Add a note to CHECKSUM_COMPLETE section that it can only refer to an
> Internet checksum

Done.

/* references + notes */

[1] ... this recalls the latest comment from David Laight:
On Thu, 2017-02-02 at 16:55 +, David Laight wrote:
> 
> I have to admit to not knowing exactly what the CHECKSUM_xxx flags
> actually mean. I have a good idea about what the intention is though.

According to documentation, CHECKSUM_COMPLETE and CHECKSUM_UNNECESSARY are
not used for SCTP (nor in the TX path at all); nevertheless, IPVS snat/dnat
actually set CHECKSUM_UNNECESSARY on SCTP packets after the checksum is
updated (see 97203abe6bc4 "net: ipvs: sctp: do not recalc...).

I'm not sure if setting CHECKSUM_UNNECESSARY fits my case, because this would
implicitly skip RX validation when using devices like veth or loopback.

[2] Documentation/networking/checksum_offloads.txt

regards,
--
davide


Re: net/bridge: warning in br_fdb_find

2017-02-27 Thread Andrey Konovalov
On Mon, Feb 27, 2017 at 4:33 PM, Andrey Konovalov  wrote:
> Hi,
>
> I'm getting a huge number of warning reports while fuzzing the kernel
> with syzkaller.
> Unfortunately they are not reproducible.
>
> On commit e5d56efc97f8240d0b5d66c03949382b6d7e5570 (Feb 26).
>
> My .config is attached.
>
> WARNING: CPU: 1 PID: 6203 at net/bridge/br_fdb.c:109
> br_fdb_find+0x20c/0x220 net/bridge/br_fdb.c:109
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 6203 Comm: syz-executor0 Not tainted 4.10.0+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:15 [inline]
>  dump_stack+0x292/0x398 lib/dump_stack.c:51
>  panic+0x1cb/0x3a9 kernel/panic.c:179
>  __warn+0x1c4/0x1e0 kernel/panic.c:540
>  warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
>  br_fdb_find+0x20c/0x220 net/bridge/br_fdb.c:109
>  fdb_insert+0xf7/0x300 net/bridge/br_fdb.c:529
>  br_fdb_insert+0x3a/0x60 net/bridge/br_fdb.c:557
>  __vlan_add+0x1621/0x3670 net/bridge/br_vlan.c:250
>  br_vlan_add+0x9e1/0xf30 net/bridge/br_vlan.c:599
>  br_vlan_init+0x241/0x320 net/bridge/br_vlan.c:940
>  br_dev_init+0xe2/0x220 net/bridge/br_device.c:106
>  register_netdevice+0x2f1/0xee0 net/core/dev.c:7137
>  register_netdev+0x1a/0x30 net/core/dev.c:7313
>  br_add_bridge+0x97/0xd0 net/bridge/br_if.c:393
>  br_ioctl_deviceless_stub+0x800/0xa40 net/bridge/br_ioctl.c:378
>  sock_ioctl+0x256/0x440 net/socket.c:960
>  vfs_ioctl fs/ioctl.c:43 [inline]
>  do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:683
>  SYSC_ioctl fs/ioctl.c:698 [inline]
>  SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
>  entry_SYSCALL_64_fastpath+0x1f/0xc2
> RIP: 0033:0x4458b9
> RSP: 002b:7f4afad8fb58 EFLAGS: 0286 ORIG_RAX: 0010
> RAX: ffda RBX: 0005 RCX: 004458b9
> RDX: 2000 RSI: 014089a0 RDI: 0005
> RBP: 006e0be0 R08:  R09: 
> R10:  R11: 0286 R12: 00708000
> R13: 0010 R14: 00080003 R15: 
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..

+syzkal...@googlegroups.com


Re: Extending socket timestamping API for NTP

2017-02-27 Thread Miroslav Lichvar
On Tue, Feb 07, 2017 at 02:32:04PM -0800, Willem de Bruijn wrote:
> >> 4) allow sockets to use both SW and HW TX timestamping at the same time
> >>
> >>When using a socket which is not bound to a specific interface, it
> >>would be nice to get transmit SW timestamps when HW timestamps are
> >>missing. I suspect it's difficult to predict if a HW timestamp will
> >>be available. Maybe it would be acceptable to get from the error
> >>queue two messages per transmission if the interface supports both
> >>SW and HW timestamping?
> >
> >
> > This seems useful,
> 
> Agreed, as long as it is optional so that it does not change the
> behavior for existing applications.

Do you think it is safe to assume that no application enabled both SW
and HW TX timestamping? Do we need a new option for this?

> > but not sure how best to implement it.
> 
> It might be sufficient to just remove the second line in sw_tx_timestamp
> 
> static inline void sw_tx_timestamp(struct sk_buff *skb)
> {
> if (skb_shinfo(skb)->tx_flags & SKBTX_SW_TSTAMP &&
> !(skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS))
> skb_tstamp_tx(skb, NULL);
> }

With this change I'm getting two error messages per transmission, but
it looks like it may need some additional changes.
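
To make the effect of dropping that second line concrete, here is a small
userspace model of the condition, not kernel code. The MODEL_* flag values and
struct model_skb are stand-ins invented for this sketch; the real SKBTX_*
definitions live in include/linux/skbuff.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag bits only; not copied from the kernel headers. */
#define MODEL_SKBTX_SW_TSTAMP   (1u << 1)
#define MODEL_SKBTX_IN_PROGRESS (1u << 2)

/* Hypothetical stand-in for the part of skb_shinfo() that matters here. */
struct model_skb {
	uint8_t tx_flags;
};

/* Quoted condition: no SW timestamp once a HW timestamp is in progress. */
static bool sw_tstamp_original(const struct model_skb *skb)
{
	assert(skb != NULL);
	return (skb->tx_flags & MODEL_SKBTX_SW_TSTAMP) &&
	       !(skb->tx_flags & MODEL_SKBTX_IN_PROGRESS);
}

/* Same check with the second line removed: SW timestamp whenever requested. */
static bool sw_tstamp_relaxed(const struct model_skb *skb)
{
	assert(skb != NULL);
	return (skb->tx_flags & MODEL_SKBTX_SW_TSTAMP) != 0;
}
```

With both flags set, the original check suppresses the SW timestamp while the
relaxed one still generates it, which is where the second error-queue message
comes from.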

If the first error message is received after the HW timestamp was
captured, it contains both timestamps as the HW timestamp is in the
shared info of the skb. Is it possible it could contain a partially
updated HW timestamp? I'm not sure how locking works here. Is
scm_timestamping actually allowed to contain more than one timestamp?
The timestamping.txt document says "Only one field is non-zero at any
time.", but that wasn't true even before if both SW and HW RX
timestamping was enabled.
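
As a sketch of what this means for an application: a receiver that assumes
exactly one field is set would mishandle such a message, so it may be safer to
inspect every slot. The struct below is a local model of the three-entry layout
described in timestamping.txt (software, deprecated, raw hardware); the names
are hypothetical and chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Local model of the three-slot layout; not the uapi definition itself. */
struct model_scm_timestamping {
	struct timespec ts[3]; /* [0] software, [1] deprecated, [2] raw hardware */
};

/* Count how many slots carry a timestamp instead of assuming exactly one. */
static int count_nonzero_timestamps(const struct model_scm_timestamping *t)
{
	int n = 0;

	assert(t != NULL);
	for (size_t i = 0; i < 3; i++)
		if (t->ts[i].tv_sec != 0 || t->ts[i].tv_nsec != 0)
			n++;
	return n;
}
```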

If SO_TIMESTAMP{,NS} is enabled, ts[0] in the second error message
will contain a bogus SW timestamp added by __sock_recv_timestamp() for
a "Race occurred between timestamp enabling and packet receiving". Is
there a guarantee applications will get a timestamp for all messages
after enabling SO_TIMESTAMP? The original code is older than the git
repo, so I'm not sure what was the reason for this. To me it would
make more sense to not add any SCM_TIMESTAMP (and SW timestamp in
SCM_TIMESTAMPING) when the timestamp is missing. If that's not
always acceptable, maybe it could be restricted to sockets that have
HW timestamping enabled?

Some drivers don't call skb_tx_timestamp() when HW timestamp was
requested. From a cursory look it is e1000e, xgbe, sxgbe, and stmmac.
This should hopefully be an easy fix.

Thoughts?

-- 
Miroslav Lichvar


Re: [RFC PATCH net-next 2/5] net: split skb_checksum_help

2017-02-27 Thread Tom Herbert
On Mon, Feb 27, 2017 at 5:39 AM, Davide Caratti  wrote:
> On Mon, 2017-01-23 at 12:59 -0800, Tom Herbert wrote:
>> > > > It might make sense to create some CRC helper functions, but last time
>> > > > I checked there are so few users of CRC in skbufs I'm not even sure
>> > > > that would make sense.
>
> hello Tom and David,
>
> after some (thinking + testing) time, I'm going to re-post this RFC as v2 with
> some feedbacks. Thank you in advance for looking at it!
>
> On Thu, 2017-02-02 at 10:08 -0800, Tom Herbert wrote:
>> On Thu, 2017-02-02 at 16:07 +0100, Davide Caratti wrote:
>> > This is exactly the cause of issues I see with SCTP. These packets can be
>> > wrongly checksummed using skb_checksum_help, or simply not checksummed at
>> > all; and in both cases, the packet goes out from the NIC with wrong L4
>> > checksum.
>> >
>> Okay, makes sense. Please consider doing the following:
>>
>> - Add a bit to skbuf called something like "csum_not_inet". When
>> ip_summed == CHECKSUM_PARTIAL and this bit is set that means we are
>> dealing with something other than an Internet checksum.
>
> Ok, done. Another solution would be to extend possible values of
> skb->ip_summed, and define a new value suitable for identifying
> not-yet-checksummed SCTP packets (something like CRC32C_PARTIAL). Since
> skb->ip_summed is 2-bit wide, the overall effect on skb metadata is the
> same as adding skb->csum_not_inet [1].
>
>> - At the top of skb_checksum_help (or maybe before the point where the
>> inet specific checksum start begins) do something like:
>>
>>if (unlikely(skb->csum_not_inet))
>>return skb_checksum_help_not_inet(...);
>>
>>The rest of skb_checksum_help should remained unchanged.
>
> According to documentation [2], validate_xmit_skb() is a good place where
> the if() statement above can be done, to preserve the possibility of having
> the CRC32c computation offloaded by the NIC hardware:
>
> if (unlikely(skb->csum_not_inet && !(features & NETIF_F_SCTP_CRC)))
>return skb_checksum_help_not_inet(...);
>
> On Thu, 2017-02-02 at 16:55 +, David Laight wrote:
>>
>> I'd put the onus on any such interface to perform the checksum (and
>> set CHECKSUM_COMPLETE (or is it UNNECESSARY?) before passing the
>> message onto an interface that doesn't advertise CRC32 support.
>>
>> You certainly don't want to have to go through all the ethernet drivers!
>
> Ideally, a driver not able to offload checksum computation should call
> skb_checksum_help() or skb_sctp_csum_help() to resolve CHECKSUM_PARTIAL
> and turn it to CHECKSUM_NONE.
> But this wouldn't solve all possible setups: there can be scenarios
> where the NIC is configured with NETIF_F_SCTP_CRC set and NETIF_F_CSUM_HW
> cleared (it's evil, but possible). In this situation, non-GSO SCTP packets
> having CHECKSUM_PARTIAL will be systematically corrupted when they are
> processed by validate_xmit_skb().
>
> On Thu, 2017-02-02 at 10:08 -0800, Tom Herbert wrote:
>
>>
>> - Add a description of the new bit and how skb_checksum_help can work
>> to the comments for CHECKSUM_PARTIAL in skbuff.h
>
> Done.
>
>>
>> - Add FCOE to the list of protocol that can set CHECKSUM_UNNECESSARY
>> for a CRC/csum
>
> Done.
>
>>
>> - Add a note to CHECKSUM_COMPLETE section that it can only refer to an
>> Internet checksum
>
> Done.
>
> /* references + notes */
>
> [1] ... this recalls to latest comment from David Laight:
> On Thu, 2017-02-02 at 16:55 +, David Laight wrote:
>>
>> I have to admit to not knowing exactly what the CHECKSUM_xxx flags
>> actually mean. I have a good idea about what the intention is though.
>
> According to documentation, CHECKSUM_COMPLETE and CHECKSUM_UNNECESSARY are
> not used for SCTP (nor in the TX path at all); nevertheless, IPVS snat/dnat
> actually set CHECKSUM_UNNECESSARY on SCTP packets after the checksum is
> updated (see 97203abe6bc4 "net: ipvs: sctp: do not recalc...).
>
CHECKSUM_PARTIAL is the preferred mechanism on the transmit path; this
defers the checksum computation as long as possible.
Unfortunately, if SCTP is encapsulated in UDP we will probably need to
run the SCTP CRC on the host which will be done with your changes to
skb_checksum_help.
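
For reference, a rough userspace sketch of what "running the SCTP CRC on the
host" amounts to. model_crc32c() is a plain bitwise CRC32c (Castagnoli), and
model_resolve_partial() models the decision discussed in this thread; both
names, the enum, and the boolean parameters are invented for illustration and
are not the kernel's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t model_crc32c(const uint8_t *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	assert(buf != NULL || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

enum model_csum_action {
	MODEL_CSUM_LEAVE_TO_NIC, /* keep CHECKSUM_PARTIAL, device finishes it */
	MODEL_CSUM_SW_INET,      /* software Internet checksum */
	MODEL_CSUM_SW_CRC32C,    /* software CRC32c */
};

/*
 * Who resolves a CHECKSUM_PARTIAL packet: the "not inet" bit selects the
 * CRC32c path, and the matching device feature decides offload vs. software.
 */
static enum model_csum_action
model_resolve_partial(bool csum_not_inet, bool dev_sctp_crc, bool dev_hw_csum)
{
	if (csum_not_inet)
		return dev_sctp_crc ? MODEL_CSUM_LEAVE_TO_NIC
				    : MODEL_CSUM_SW_CRC32C;
	return dev_hw_csum ? MODEL_CSUM_LEAVE_TO_NIC : MODEL_CSUM_SW_INET;
}
```

The point of keeping the two feature bits separate in the model is the corner
case mentioned earlier in the thread: a device with SCTP CRC offload but no
generic checksum offload must not push an SCTP packet through the Internet
checksum helper.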

> I'm not sure if setting CHECKSUM_UNNECESSARY fits my case, because this would
> implicitly skip RX validation when using devices like veth or loopback.
>
CHECKSUM_UNNECESSARY can be used in the transmit path (really the
forwarding path), however I think this must imply that the
checksum in the packet is correct. Please see my post about
drivers that are mistakenly using CHECKSUM_UNNECESSARY with LRO since
the checksum in the packet sent into the stack is not correct.

Tom

> [2] Documentation/networking/checksum_offloads.txt
>
> regards,
> --
> davide


Re: [PATCH v3 2/6] ARM: dts: armada-xp-98dx3236: combine dfx server nodes

2017-02-27 Thread Rob Herring
On Thu, Feb 16, 2017 at 09:50:36PM +1300, Chris Packham wrote:
> Rather than having a separate node for the dfx server add a reg property
> to the parent node. This give some compatibility with the Marvell
> supplied SDK.
> 
> As no upstream driver currently exists for this block and support for
> this SoC is still quite fresh in the kernel it should not be necessary
> to retain a backwards compatible binding.
> 
> Signed-off-by: Chris Packham 
> ---
> 
> Notes:
> Changes in v2:
> - none
> Changes in v3:
> - update commit message to indicate backwards incompatible change and
>   why it's OK
> - retain dfx-server compatible string
> 
>  Documentation/devicetree/bindings/net/marvell,prestera.txt | 13 +
>  arch/arm/boot/dts/armada-xp-98dx3236.dtsi  | 10 +++---
>  2 files changed, 8 insertions(+), 15 deletions(-)

Acked-by: Rob Herring 


ANNOUNCE: Netdev 2.1 update Feb 27

2017-02-27 Thread Jamal Hadi Salim

A few announcements:

1) The CFP is now officially closed. Thanks to everyone who submitted.

2) We are extending the early registration to March 5.

Register early so we can plan better (and so you can save some $$).
https://onlineregistrations.ca/netdev21/

- hotel (if you can get the hotel cheaper online than the conference
rates, please send us email; don't book):
https://www.netdevconf.org/2.1/hotel.html

3) Tech committee would like to make two announcements:

First, a talk by Hajime Tazaki titled "Playing BBR with a userspace 
network stack"


-
Linux kernel library (LKL) aims to run Linux kernel code in
different environments such as Linux userspace, Windows userspace,
hypervisors, etc.  With userspace deployments, an application can
benefit from new features such as TCP extensions without
updating the host kernel.

This characteristic of network stack personality is useful since we
don't have to instantiate a virtual machine instance to use a new
feature of network stack (e.g., a TCP extension).  Instead, we just
need a single (userspace) process to introduce new features.

One concern with userspace network stacks in general, raised by
David Miller at the last netdev conference (in Tokyo), is the timing
accuracy achievable in userspace, which important network features such
as packet pacing and transport protocols require.

In this talk, we're going to present our performance studies on this
timing accuracy concern for a userspace network stack. We present the
results of netperf benchmarks with a couple of TCP congestion control
algorithms, BBR and cubic, comparing the LKL-ed netperf with ordinary
netperf on the Linux kernel.
We're trying to reveal what the obstacles for LKL (userspace
network stack) are and what can be fixed to reach the performance goal of
LKL (i.e., x1 performance of the original kernel network stack).

Second, our first workshop announcement on "IoT related MAC layers, header
compression and routing protocols" chaired by Stefan Schmidt.
---
This workshop aims to identify generic requirements for the networking
subsystem for IoT and starting the process of addressing the gaps
found. The workshop will encompass related MAC layers, networking
protocols, adaptation layers, header compression, routing protocols
and application layers.

As a starting point we will look at existing subsystems (Bluetooth,
802.15.4, 6LoWPAN, etc) and discuss a way forward to address the gaps
posed. An overview of MAC layers and open IoT related specifications
will help to identify things we should probably support in the future
(LPWAN, SCHC, RPL, Thread, etc). Note: We focus only on open
protocols and specifications, as well as IPv6, instead of company-grown
networking layers.
Additional user-space interfaces might be needed to cater for the
requirements of application layer stacks (ZigBee, IoTivity, etc.).
What kind of interfaces besides the normal socket API and existing
netlink interfaces do they need?
(header compression configuration, DTLS support in AF_KTLS, etc.)

A primary goal of this workshop is initially to target Linux as a
gateway or border router at the edge of IoT networks, be it industrial
or home automation.

A future scope, starting with discussions at the workshop, could be for
Linux to replace RTOSes on leaf nodes which operate on very constrained
hardware and power limits; such an effort would need a serious amount
of work towards tinyfication in all parts of the kernel.

We expect the workshop to ignite efforts on these topics and followup
discussions on mailing lists and future netdevs to produce results that
make it a possibility.


cheers,
jamal


Re: [4.9.10] ip_route_me_harder() reading off-slab

2017-02-27 Thread Daniel J Blueman
On 17 February 2017 at 15:39, Florian Westphal  wrote:
> Daniel J Blueman  wrote:
>
> [ CC nf-devel, pablo ]
>
>> When booting a VM in libvirt/KVM attached to a local bridge and KASAN
>> enabled on 4.9.10, we see a stream of KASAN warnings about off-slab
>> access [1].
>>
>> Let me know if you'd like more debug.
>
> Does this patch help?
>
> Subject: [PATCH nf] netfilter: use skb_to_full_sk in ip_route_me_harder
>
> inet_sk(skb->sk) is illegal in case skb is attached to request socket.
>
> Fixes: ca6fb0651883 ("tcp: attach SYNACK messages to request sockets instead 
> of listener")
> Reported-by: Daniel J Blueman 
> Signed-off-by: Florian Westphal 
> ---
>  net/ipv4/netfilter.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/netfilter.c b/net/ipv4/netfilter.c
> index b3cc1335adbc..c0cc6aa8cfaa 100644
> --- a/net/ipv4/netfilter.c
> +++ b/net/ipv4/netfilter.c
> @@ -23,7 +23,8 @@ int ip_route_me_harder(struct net *net, struct sk_buff 
> *skb, unsigned int addr_t
> struct rtable *rt;
> struct flowi4 fl4 = {};
> __be32 saddr = iph->saddr;
> -   __u8 flags = skb->sk ? inet_sk_flowi_flags(skb->sk) : 0;
> +   const struct sock *sk = skb_to_full_sk(skb);
> +   __u8 flags = sk ? inet_sk_flowi_flags(sk) : 0;
> struct net_device *dev = skb_dst(skb)->dev;
> unsigned int hh_len;
>
> @@ -40,7 +41,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff 
> *skb, unsigned int addr_t
> fl4.daddr = iph->daddr;
> fl4.saddr = saddr;
> fl4.flowi4_tos = RT_TOS(iph->tos);
> -   fl4.flowi4_oif = skb->sk ? skb->sk->sk_bound_dev_if : 0;
> +   fl4.flowi4_oif = sk ? sk->sk_bound_dev_if : 0;
> if (!fl4.flowi4_oif)
> fl4.flowi4_oif = l3mdev_master_ifindex(dev);
> fl4.flowi4_mark = skb->mark;
> @@ -61,7 +62,7 @@ int ip_route_me_harder(struct net *net, struct sk_buff 
> *skb, unsigned int addr_t
> xfrm_decode_session(skb, flowi4_to_flowi(&fl4), AF_INET) == 0) {
> struct dst_entry *dst = skb_dst(skb);
> skb_dst_set(skb, NULL);
> -   dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), skb->sk, 0);
> +   dst = xfrm_lookup(net, dst, flowi4_to_flowi(&fl4), sk, 0);
> if (IS_ERR(dst))
> return PTR_ERR(dst);
> skb_dst_set(skb, dst);

Apologies for the delays; this also addresses the issue just fine.
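
For readers following along, the idea behind the fix can be modelled in
userspace roughly as follows. This is a sketch with made-up types
(struct model_sock, the MODEL_* states), not the kernel's helper: a socket in
the request state is mapped to its listener before any full-socket field is
read.

```c
#include <assert.h>
#include <stddef.h>

enum model_sk_state { MODEL_ESTABLISHED, MODEL_LISTEN, MODEL_NEW_SYN_RECV };

/* Hypothetical, simplified socket; real request sockets are smaller objects. */
struct model_sock {
	enum model_sk_state state;
	struct model_sock *listener; /* only meaningful for request sockets */
	int bound_dev_if;
};

/* Map a request socket to its full (listener) socket; pass others through. */
static struct model_sock *model_to_full_sk(struct model_sock *sk)
{
	if (sk && sk->state == MODEL_NEW_SYN_RECV) {
		assert(sk->listener != NULL);
		return sk->listener;
	}
	return sk;
}

/* Mirrors the patched pattern: resolve the socket once, then read fields. */
static int model_oif(struct model_sock *skb_sk)
{
	struct model_sock *sk = model_to_full_sk(skb_sk);

	return sk ? sk->bound_dev_if : 0;
}
```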

Tested-by: Daniel J Blueman 

Dan
-- 
Daniel J Blueman


Re: [PATCH 0/2] qed: Bug fixes

2017-02-27 Thread David Miller
From: Yuval Mintz 
Date: Mon, 27 Feb 2017 11:06:31 +0200

> Hi Dave,
> 
> Patch #1 addresses a day-one race which is dependent on the number of VFs
> [I.e., more child VFs from a single PF make it more probable].
> Patch #2 corrects a race that got introduced in the last set of fixes for
> qed, one that would happen each time PF transitions to UP state.
> 
> I've built & tested those against current net-next.
> Please consider applying the series there.

The net-next tree is closed, and bug fixes are supposed to target
the net tree so that's where I've applied these changes.

Please target the correct tree in the future, thanks.


phy deadlock -stable backport request

2017-02-27 Thread Niklas Cassel
Hello

I would like to request that

commit eab127717a6af54401ba534790c793ec143cd1fc
Author: Florian Fainelli 
Date:   Fri Jan 20 15:31:52 2017 -0800

net: phy: Avoid deadlock during phy_error()
   
phy_error() is called in the PHY state machine workqueue context, and
calls phy_trigger_machine() which does a cancel_delayed_work_sync() of
the workqueue we execute from, causing a deadlock situation.
   
Augment phy_trigger_machine() machine with a sync boolean indicating
whether we should use cancel_*_sync() or just cancel_*_work().
   
Fixes: 3c293f4e08b5 ("net: phy: Trigger state machine on state change and 
not polling.")
Reported-by: Russell King 
Signed-off-by: Florian Fainelli 
Signed-off-by: David S. Miller 

 drivers/net/phy/phy.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)


would be backported to stable branch v4.9.

I've seen this deadlock happen on v4.9.x


Regards,
Niklas


Re: [PATCH 0/6] Netfilter fixes for net

2017-02-27 Thread David Miller
From: Pablo Neira Ayuso 
Date: Mon, 27 Feb 2017 12:35:36 +0100

> The following patchset contains netfilter fixes for your net tree,
> they are:
> 
> 1) Missing ct zone size in the nft_ct initialization path, patch
>from Florian Westphal.
> 
> 2) Two patches for netfilter uapi headers, one to remove unnecessary
>sysctl.h inclusion and another to fix compilation of xt_hashlimit.h
>in userspace, from Dmitry V. Levin.
> 
> 3) Patch to fix a sloppy change in nf_ct_expect that incorrectly
>simplified nf_ct_expect_related_report() in the previous nf-next
>batch. This also includes another patch for __nf_ct_expect_check()
>to report success by returning 0 to keep it consistent with other
>existing functions. From Jarno Rajahalme.
> 
> 4) The ->walk() iterator of the new bitmap set type goes over the real
>bitmap size, this results in incorrect dumps when NFTA_SET_USERDATA
>is used.
> 
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.


Re: [PATCH net] net: solve a NAPI race

2017-02-27 Thread Eric Dumazet
On Sun, 2017-02-26 at 19:31 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> While playing with mlx4 hardware timestamping of RX packets, I found
> that some packets were received by TCP stack with a ~200 ms delay...
> 
> Since the timestamp was provided by the NIC, and my probe was added
> in tcp_v4_rcv() while in BH handler, I was confident it was not
> a sender issue, or a drop in the network.
> 
> This would happen with a very low probability, but hurting RPC
> workloads.
> 
> A NAPI driver normally arms the IRQ after the napi_complete_done(),
> after NAPI_STATE_SCHED is cleared, so that the hard irq handler can grab
> it.
> 
> Problem is that if another point in the stack grabs NAPI_STATE_SCHED bit
> while IRQ are not disabled, we might have later an IRQ firing and
> finding this bit set, right before napi_complete_done() clears it.
> 
> This can happen with busy polling users, or if gro_flush_timeout is
> used. But some other uses of napi_schedule() in drivers can cause this
> as well.
> 
> This patch adds a new NAPI_STATE_MISSED bit, that napi_schedule_prep()
> can set if it could not grab NAPI_STATE_SCHED
> 
> Then napi_complete_done() properly reschedules the napi to make sure
> we do not miss something.
> 
> Since we manipulate multiple bits at once, use cmpxchg() like in
> sk_busy_loop() to provide proper transactions.
> 
> Signed-off-by: Eric Dumazet 
> ---
>  include/linux/netdevice.h |   29 +++
>  net/core/dev.c|   44 ++--
>  2 files changed, 51 insertions(+), 22 deletions(-)

I will send a v2 of this patch.

Callers that try to grab NAPI_STATE_SCHED from outside the device driver
IRQ handler should not set NAPI_STATE_MISSED if they fail; otherwise this
adds extra work for no purpose.

One such point is napi_watchdog(), which can fire pretty often.
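
A userspace model of the scheme, using C11 atomics in place of the kernel's
cmpxchg(), may help picture it. The MODEL_* bits, the function names, and the
mark_missed parameter are assumptions made for this sketch; it is not the
patch itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_STATE_SCHED  (1u << 0)
#define MODEL_STATE_MISSED (1u << 1)

/*
 * Try to take ownership. If someone else already owns it, optionally record
 * that an event was missed (only callers such as an IRQ handler would pass
 * mark_missed = true). Returns true if ownership was acquired.
 */
static bool model_schedule_prep(atomic_uint *state, bool mark_missed)
{
	unsigned int old, next;

	assert(state != NULL);
	old = atomic_load(state);
	do {
		next = old | MODEL_STATE_SCHED;
		if ((old & MODEL_STATE_SCHED) && mark_missed)
			next |= MODEL_STATE_MISSED;
	} while (!atomic_compare_exchange_weak(state, &old, next));

	return !(old & MODEL_STATE_SCHED);
}

/*
 * Finish polling. If an event was missed meanwhile, keep ownership so the
 * caller polls again. Returns true only if polling really completed.
 */
static bool model_complete_done(atomic_uint *state)
{
	unsigned int old, next;

	assert(state != NULL);
	old = atomic_load(state);
	do {
		next = old & ~(MODEL_STATE_SCHED | MODEL_STATE_MISSED);
		if (old & MODEL_STATE_MISSED)
			next |= MODEL_STATE_SCHED;
	} while (!atomic_compare_exchange_weak(state, &old, next));

	return !(old & MODEL_STATE_MISSED);
}
```

Both bits change in a single compare-and-swap, so there is no window in which
the owner clears SCHED without noticing a concurrent MISSED.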






[PATCH net] net: route: add missing nla_policy entry for RTA_MARK attribute

2017-02-27 Thread Liping Zhang
From: Liping Zhang 

This adds stricter validation for the RTA_MARK attribute.

Signed-off-by: Liping Zhang 
---
 net/ipv4/fib_frontend.c | 1 +
 net/ipv6/route.c| 1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index b39a791..42bfd08 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -622,6 +622,7 @@ const struct nla_policy rtm_ipv4_policy[RTA_MAX + 1] = {
[RTA_ENCAP_TYPE]= { .type = NLA_U16 },
[RTA_ENCAP] = { .type = NLA_NESTED },
[RTA_UID]   = { .type = NLA_U32 },
+   [RTA_MARK]  = { .type = NLA_U32 },
 };
 
 static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f54f426..d94f1df 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2891,6 +2891,7 @@ static const struct nla_policy rtm_ipv6_policy[RTA_MAX+1] 
= {
[RTA_ENCAP] = { .type = NLA_NESTED },
[RTA_EXPIRES]   = { .type = NLA_U32 },
[RTA_UID]   = { .type = NLA_U32 },
+   [RTA_MARK]  = { .type = NLA_U32 },
 };
 
 static int rtm_to_fib6_config(struct sk_buff *skb, struct nlmsghdr *nlh,
-- 
2.5.5
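
As a rough illustration of why a policy entry matters, the following userspace
model mimics a minimum-length check driven by a policy table. All names and
attribute numbers here (model_nla_policy, MODEL_RTA_*) are hypothetical and do
not match the real netlink definitions; the assumption being modelled is that
an attribute without a policy entry is accepted at any length, while a u32
entry requires at least 4 bytes of payload.

```c
#include <assert.h>
#include <stddef.h>

enum model_nla_type { MODEL_NLA_UNSPEC = 0, MODEL_NLA_U16, MODEL_NLA_U32 };

struct model_nla_policy {
	enum model_nla_type type;
};

/* Hypothetical attribute numbers for the model, not the real RTA_* values. */
enum { MODEL_RTA_UID = 1, MODEL_RTA_MARK = 2, MODEL_RTA_MAX = 2 };

/* Returns 0 if the payload length is acceptable under the policy, else -1. */
static int model_validate_attr(const struct model_nla_policy *policy,
			       int attr, size_t payload_len)
{
	static const size_t min_len[] = {
		[MODEL_NLA_UNSPEC] = 0,
		[MODEL_NLA_U16]    = 2,
		[MODEL_NLA_U32]    = 4,
	};

	assert(policy != NULL);
	if (attr <= 0 || attr > MODEL_RTA_MAX)
		return 0; /* unknown attributes are not checked in this model */

	return payload_len >= min_len[policy[attr].type] ? 0 : -1;
}
```

Without an entry for the mark attribute, a truncated 2-byte payload passes the
model's check and a later 32-bit read would run past it; with the entry, the
same payload is rejected up front.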




Re: [PATCH v3 20/20] checkpatch: warn for use of old PCI pool API

2017-02-27 Thread Romain Perier
Hello,


Le 27/02/2017 à 13:38, Joe Perches a écrit :
> On Mon, 2017-02-27 at 13:26 +0100, Romain Perier wrote:
>> Hello,
>>
>>
>> Le 27/02/2017 à 12:22, Peter Senna Tschudin a écrit :
>>> On Sun, Feb 26, 2017 at 08:24:25PM +0100, Romain Perier wrote:
>>>> pci_pool_*() functions should be replaced by the corresponding functions
>>>> in the DMA pool API. This adds support to check for use of these pci
>>>> functions and display a warning when it is the case.
>>>>
>>> I guess Joe Perches did sent some comments for this one, did you address
>>> them?
>> See the changelog of 00/20 (for v2). I have already integrated his
>> comments :)
> Not quite.  You need to add blank lines before and after
> the new test you added.

Ok

>
> I also wonder if you've in fact converted all of the
> pci_pool struct and function uses why a new checkpatch
> test is needed at all.

That's just to avoid future mistakes/uses.

>
> Also, it seems none of these patches have reached lkml.
> Are you sending the patch series with MIME/html parts?

Normally no. I use git send-email for all my patches.

Regards,
Romain

>
>>> Reviewed-by: Peter Senna Tschudin 
>>>> Signed-off-by: Romain Perier 
>>>> ---
>>>>  scripts/checkpatch.pl | 9 -
>>>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>>>> index baa3c7b..f2c775c 100755
>>>> --- a/scripts/checkpatch.pl
>>>> +++ b/scripts/checkpatch.pl
>>>> @@ -6064,7 +6064,14 @@ sub process {
>>>>   WARN("USE_DEVICE_INITCALL",
>>>>    "please use device_initcall() or more appropriate 
>>>> function instead of __initcall() (see include/linux/init.h)\n" . 
>>>> $herecurr);
>>>>   }
>>>> -
>>>> +# check for old PCI api pci_pool_*(), use dma_pool_*() instead
>>>> +  if ($line =~ 
>>>> /\bpci_pool(?:_(?:create|destroy|alloc|zalloc|free)|)\b/) {
>>>> +  if (WARN("USE_DMA_POOL",
>>>> +   "please use the dma pool api or more 
>>>> appropriate function instead of the old pci pool api\n" . $herecurr) &&
>>>> +  $fix) {
>>>> +  while ($fixed[$fixlinenr] =~ 
>>>> s/\bpci_pool(_(?:create|destroy|alloc|zalloc|free)|)\b/dma_pool$1/) {}
>>>> +  }
>>>> +  }
>>>>  # check for various structs that are normally const (ops, kgdb, 
>>>> device_tree)
>>>>   if ($line !~ /\bconst\b/ &&
>>>>   $line =~ /\bstruct\s+($const_structs)\b/) {
>>>> -- 
>>>> 2.9.3
>>>>



Re: [PATCH v3 20/20] checkpatch: warn for use of old PCI pool API

2017-02-27 Thread Joe Perches
On Mon, 2017-02-27 at 13:26 +0100, Romain Perier wrote:
> Hello,
> 
> 
> Le 27/02/2017 à 12:22, Peter Senna Tschudin a écrit :
> > On Sun, Feb 26, 2017 at 08:24:25PM +0100, Romain Perier wrote:
> > > pci_pool_*() functions should be replaced by the corresponding functions
> > > in the DMA pool API. This adds support to check for use of these pci
> > > functions and display a warning when it is the case.
> > > 
> > 
> > I guess Joe Perches did sent some comments for this one, did you address
> > them?
> 
> See the changelog of 00/20 (for v2). I have already integrated his
> comments :)

Not quite.  You need to add blank lines before and after
the new test you added.

I also wonder if you've in fact converted all of the
pci_pool struct and function uses why a new checkpatch
test is needed at all.

Also, it seems none of these patches have reached lkml.
Are you sending the patch series with MIME/html parts?  

> > Reviewed-by: Peter Senna Tschudin 
> > > Signed-off-by: Romain Perier 
> > > ---
> > >  scripts/checkpatch.pl | 9 -
> > >  1 file changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > > index baa3c7b..f2c775c 100755
> > > --- a/scripts/checkpatch.pl
> > > +++ b/scripts/checkpatch.pl
> > > @@ -6064,7 +6064,14 @@ sub process {
> > >   WARN("USE_DEVICE_INITCALL",
> > >"please use device_initcall() or more appropriate 
> > > function instead of __initcall() (see include/linux/init.h)\n" . 
> > > $herecurr);
> > >   }
> > > -
> > > +# check for old PCI api pci_pool_*(), use dma_pool_*() instead
> > > + if ($line =~ 
> > > /\bpci_pool(?:_(?:create|destroy|alloc|zalloc|free)|)\b/) {
> > > + if (WARN("USE_DMA_POOL",
> > > +  "please use the dma pool api or more 
> > > appropriate function instead of the old pci pool api\n" . $herecurr) &&
> > > + $fix) {
> > > + while ($fixed[$fixlinenr] =~ 
> > > s/\bpci_pool(_(?:create|destroy|alloc|zalloc|free)|)\b/dma_pool$1/) {}
> > > + }
> > > + }
> > >  # check for various structs that are normally const (ops, kgdb, 
> > > device_tree)
> > >   if ($line !~ /\bconst\b/ &&
> > >   $line =~ /\bstruct\s+($const_structs)\b/) {
> > > -- 
> > > 2.9.3


Re: [PATCH 7/7] net: stmmac: dwc-qos: Add Tegra186 support

2017-02-27 Thread Mikko Perttunen

On 23.02.2017 19:24, Thierry Reding wrote:

From: Thierry Reding 

The NVIDIA Tegra186 SoC contains an instance of the Synopsys DWC
ethernet QOS IP core. The binding that it uses is slightly different
from existing ones because of the integration (clocks, resets, ...).

Signed-off-by: Thierry Reding 
---
 .../ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c| 252 +
 1 file changed, 252 insertions(+)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
index 5071d3c15adc..54dfbdc48f6d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -22,10 +23,24 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 #include "stmmac_platform.h"

+struct tegra_eqos {
+   struct device *dev;
+   void __iomem *regs;
+
+   struct reset_control *rst;
+   struct clk *clk_master;
+   struct clk *clk_slave;
+   struct clk *clk_tx;
+   struct clk *clk_rx;
+
+   struct gpio_desc *reset;
+};
+
 static int dwc_eth_dwmac_config_dt(struct platform_device *pdev,
   struct plat_stmmacenet_data *plat_dat)
 {
@@ -148,6 +163,237 @@ static int dwc_qos_remove(struct platform_device *pdev)
return 0;
 }

+#define SDMEMCOMPPADCTRL 0x8800
+#define  SDMEMCOMPPADCTRL_PAD_E_INPUT_OR_E_PWRD BIT(31)
+
+#define AUTO_CAL_CONFIG 0x8804
+#define  AUTO_CAL_CONFIG_START BIT(31)
+#define  AUTO_CAL_CONFIG_ENABLE BIT(29)
+
+#define AUTO_CAL_STATUS 0x880c
+#define  AUTO_CAL_STATUS_ACTIVE BIT(31)
+
+static void tegra_eqos_fix_speed(void *priv, unsigned int speed)
+{
+   struct tegra_eqos *eqos = priv;
+   unsigned long rate = 12500;
+   bool needs_calibration = false;
+   unsigned int i;
+   u32 value;
+
+   switch (speed) {
+   case SPEED_1000:
+   needs_calibration = true;
+   rate = 12500;
+   break;
+
+   case SPEED_100:
+   needs_calibration = true;
+   rate = 2500;
+   break;
+
+   case SPEED_10:
+   rate = 250;
+   break;
+
+   default:
+   dev_err(eqos->dev, "invalid speed %u\n", speed);
+   break;
+   }
+
+   if (needs_calibration) {
+   /* calibrate */
+   value = readl(eqos->regs + SDMEMCOMPPADCTRL);
+   value |= SDMEMCOMPPADCTRL_PAD_E_INPUT_OR_E_PWRD;
+   writel(value, eqos->regs + SDMEMCOMPPADCTRL);
+
+   udelay(1);
+
+   value = readl(eqos->regs + AUTO_CAL_CONFIG);
+   value |= AUTO_CAL_CONFIG_START | AUTO_CAL_CONFIG_ENABLE;
+   writel(value, eqos->regs + AUTO_CAL_CONFIG);
+
+   for (i = 0; i <= 10; i++) {
+   value = readl(eqos->regs + AUTO_CAL_STATUS);
+   if (value & AUTO_CAL_STATUS_ACTIVE)
+   break;
+
+   udelay(1);
+   }
+
+   if ((value & AUTO_CAL_STATUS_ACTIVE) == 0) {
+   dev_err(eqos->dev, "calibration did not start\n");
+   goto failed;
+   }
+
+   for (i = 0; i <= 10; i++) {
+   value = readl(eqos->regs + AUTO_CAL_STATUS);
+   if ((value & AUTO_CAL_STATUS_ACTIVE) == 0)
+   break;
+
+   udelay(20);
+   }
+
+   if (value & AUTO_CAL_STATUS_ACTIVE) {
+   dev_err(eqos->dev, "calibration didn't finish\n");
+   goto failed;
+   }


Could use readl_poll_timeout/readl_poll_timeout_atomic for these loops 
instead.
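
To sketch the shape of that suggestion without kernel headers, here is a
userspace model of a bounded poll on a status bit. struct fake_reg,
fake_readl() and model_poll_bit() are invented for this example; in the driver
the loop body would be replaced by the iopoll helper with a sleep interval and
timeout instead of a poll count.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_STATUS_ACTIVE (1u << 31)

/* Fake register whose value flips after a given number of reads. */
struct fake_reg {
	uint32_t value;
	unsigned int reads_until_flip; /* 0 means "never flips" */
	uint32_t flip_to;
};

static uint32_t fake_readl(struct fake_reg *r)
{
	if (r->reads_until_flip && --r->reads_until_flip == 0)
		r->value = r->flip_to;
	return r->value;
}

/*
 * Poll until (value & mask) is set or clear, as requested, for at most
 * max_polls reads. Returns 0 on success and -ETIMEDOUT otherwise; the last
 * value read is stored in *last either way.
 */
static int model_poll_bit(struct fake_reg *reg, uint32_t mask, bool want_set,
			  unsigned int max_polls, uint32_t *last)
{
	assert(reg != NULL && last != NULL);
	for (unsigned int i = 0; i < max_polls; i++) {
		*last = fake_readl(reg);
		if (((*last & mask) != 0) == want_set)
			return 0;
		/* a real driver would delay here between reads */
	}
	return -ETIMEDOUT;
}
```

One helper then covers both loops in the patch: wait for the bit to become set
(calibration started), then wait for it to clear (calibration finished), with
the timeout reported through the return value rather than re-testing the bit.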



+
+   failed:
+   value = readl(eqos->regs + SDMEMCOMPPADCTRL);
+   value &= ~SDMEMCOMPPADCTRL_PAD_E_INPUT_OR_E_PWRD;
+   writel(value, eqos->regs + SDMEMCOMPPADCTRL);
+   } else {
+   value = readl(eqos->regs + AUTO_CAL_CONFIG);
+   value &= ~AUTO_CAL_CONFIG_ENABLE;
+   writel(value, eqos->regs + AUTO_CAL_CONFIG);
+   }
+
+   clk_set_rate(eqos->clk_tx, rate);


Could check error code here, and for other clock ops too.


+}
+
+static int tegra_eqos_init(struct platform_device *pdev, void *priv)
+{
+   struct tegra_eqos *eqos = priv;
+   unsigned long rate;
+   u32 value;
+
+   rate = clk_get_rate(eqos->clk_slave);
+
+   value = readl(eqos->regs + 0xdc);


No point in reading the value when it is fully overwritten.


+   value = (rate / 100) - 1;
+   writel(value, eqos->regs + 0xdc);


Please add a define for 0xdc.


+
+   return 0;
+}
+
+static void *tegra_eqos_probe(struct platform_device *pdev,
+   

Re: [PATCH v3 20/20] checkpatch: warn for use of old PCI pool API

2017-02-27 Thread Romain Perier
Hello,


Le 27/02/2017 à 12:22, Peter Senna Tschudin a écrit :
> On Sun, Feb 26, 2017 at 08:24:25PM +0100, Romain Perier wrote:
>> pci_pool_*() functions should be replaced by the corresponding functions
>> in the DMA pool API. This adds support to check for use of these pci
>> functions and display a warning when it is the case.
>>
> I guess Joe Perches did sent some comments for this one, did you address
> them?

See the changelog of 00/20 (for v2). I have already integrated his
comments :)

>
> Reviewed-by: Peter Senna Tschudin 
>> Signed-off-by: Romain Perier 
>> ---
>>  scripts/checkpatch.pl | 9 -
>>  1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
>> index baa3c7b..f2c775c 100755
>> --- a/scripts/checkpatch.pl
>> +++ b/scripts/checkpatch.pl
>> @@ -6064,7 +6064,14 @@ sub process {
>>  WARN("USE_DEVICE_INITCALL",
>>   "please use device_initcall() or more appropriate 
>> function instead of __initcall() (see include/linux/init.h)\n" . $herecurr);
>>  }
>> -
>> +# check for old PCI api pci_pool_*(), use dma_pool_*() instead
>> +if ($line =~ 
>> /\bpci_pool(?:_(?:create|destroy|alloc|zalloc|free)|)\b/) {
>> +if (WARN("USE_DMA_POOL",
>> + "please use the dma pool api or more 
>> appropriate function instead of the old pci pool api\n" . $herecurr) &&
>> +$fix) {
>> +while ($fixed[$fixlinenr] =~ 
>> s/\bpci_pool(_(?:create|destroy|alloc|zalloc|free)|)\b/dma_pool$1/) {}
>> +}
>> +}
>>  # check for various structs that are normally const (ops, kgdb, device_tree)
>>  if ($line !~ /\bconst\b/ &&
>>  $line =~ /\bstruct\s+($const_structs)\b/) {
>> -- 
>> 2.9.3
>>



Re: [PATCH v2 net-next 5/6] drivers: net: xgene-v2: Add transmit and receive

2017-02-27 Thread kbuild test robot
Hi Iyappan,

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Iyappan-Subramanian/drivers-net-xgene-v2-Add-RGMII-based-1G-driver/20170227-182414
config: x86_64-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/apm/xgene-v2/main.c: In function 'is_tx_slot_available':
>> drivers/net/ethernet/apm/xgene-v2/main.c:182:2: warning: this 'if' clause 
>> does not guard... [-Wmisleading-indentation]
 if (GET_BITS(E, le64_to_cpu(raw_desc->m0)) &&
 ^~
   drivers/net/ethernet/apm/xgene-v2/main.c:186:3: note: ...this statement, but 
the latter is misleadingly indented as if it is guarded by the 'if'
  return false;
  ^~
   drivers/net/ethernet/apm/xgene-v2/main.c: In function 'is_tx_hw_done':
   drivers/net/ethernet/apm/xgene-v2/main.c:246:2: warning: this 'if' clause 
does not guard... [-Wmisleading-indentation]
 if (GET_BITS(E, le64_to_cpu(raw_desc->m0)) &&
 ^~
   drivers/net/ethernet/apm/xgene-v2/main.c:250:3: note: ...this statement, but 
the latter is misleadingly indented as if it is guarded by the 'if'
  return false;
  ^~

vim +/if +182 drivers/net/ethernet/apm/xgene-v2/main.c

   166  if (ret)
   167  netdev_err(ndev, "Failed to request irq %s\n", 
pdata->irq_name);
   168  
   169  return ret;
   170  }
   171  
   172  static void xge_free_irq(struct net_device *ndev)
   173  {
   174  struct xge_pdata *pdata = netdev_priv(ndev);
   175  struct device *dev = &pdata->pdev->dev;
   176  
   177  devm_free_irq(dev, pdata->resources.irq, pdata);
   178  }
   179  
   180  static bool is_tx_slot_available(struct xge_raw_desc *raw_desc)
   181  {
 > 182  if (GET_BITS(E, le64_to_cpu(raw_desc->m0)) &&
   183  (GET_BITS(PKT_SIZE, le64_to_cpu(raw_desc->m0)) == 
SLOT_EMPTY))
   184  return true;
   185  
   186  return false;
   187  }
   188  
   189  static netdev_tx_t xge_start_xmit(struct sk_buff *skb, struct 
net_device *ndev)
   190  {
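
One way to avoid both -Wmisleading-indentation warnings is to return the
condition directly, so that no 'if' body is left whose indentation can drift.
This is only a minimal sketch: E_BIT, PKT_SIZE_MASK and the stand-in struct
below are hypothetical, since the driver's real GET_BITS() macro and
descriptor layout are not shown in this report.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. */
#define E_BIT         (UINT64_C(1) << 63)
#define PKT_SIZE_MASK UINT64_C(0xfff)
#define SLOT_EMPTY    UINT64_C(0xfff)

/* Stand-in for struct xge_raw_desc; m0 is assumed to be in CPU byte order. */
struct raw_desc_sketch {
	uint64_t m0;
};

/* Same test as lines 182-186, expressed as a single return statement. */
static bool slot_available_sketch(const struct raw_desc_sketch *desc)
{
	return (desc->m0 & E_BIT) &&
	       (desc->m0 & PKT_SIZE_MASK) == SLOT_EMPTY;
}
```

The same shape would apply to is_tx_hw_done(), which triggers the identical
warning at line 246.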

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



Re: [PATCH net] xfrm: provide correct dst in xfrm_neigh_lookup

2017-02-27 Thread Steffen Klassert
On Sun, Feb 26, 2017 at 09:35:48PM -0500, David Miller wrote:
> From: Julian Anastasov 
> Date: Sat, 25 Feb 2017 17:57:43 +0200
> 
> > Fix xfrm_neigh_lookup to provide dst->path to the
> > neigh_lookup dst_ops method.
> > 
> > When skb is provided, the IP address in packet should already
> > match the dst->path address family. But for the non-skb case,
> > we should consider the last tunnel address as nexthop address.
> > 
> > Fixes: f894cbf847c9 ("net: Add optional SKB arg to 
> > dst_ops->neigh_lookup().")
> > Signed-off-by: Julian Anastasov 
> 
> This looks good to me.
> 
> Steffen, I applied this directly to my tree, I hope you don't mind.

That's ok, no problem.


Re: [PATCH 2/2] iproute2: add support for invisible qdisc dumping

2017-02-27 Thread Phil Sutter
On Sat, Feb 25, 2017 at 10:29:17PM +0100, Jiri Kosina wrote:
> From: Jiri Kosina 
> 
> Support the new TCA_DUMP_INVISIBLE netlink attribute that allows asking 
> kernel to perform 'full qdisc dump', as for historical reasons some of the 
> default qdiscs are being hidden by the kernel.
> 
> The command syntax is extended by an optional 'invisible' argument to
> 'tc qdisc show'.
> 
> Signed-off-by: Jiri Kosina 
> ---
>  tc/tc_qdisc.c | 25 +++--
>  1 file changed, 23 insertions(+), 2 deletions(-)

Would you mind adding a description of the new keyword to the tc man page
as well?

> diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
> index 3a3701c2..29da9269 100644
> --- a/tc/tc_qdisc.c
> +++ b/tc/tc_qdisc.c
> @@ -34,7 +34,7 @@ static int usage(void)
>   fprintf(stderr, "   [ stab [ help | STAB_OPTIONS] ]\n");
>   fprintf(stderr, "   [ [ QDISC_KIND ] [ help | OPTIONS ] ]\n");
>   fprintf(stderr, "\n");
> - fprintf(stderr, "   tc qdisc show [ dev STRING ] [ ingress | clsact 
> ]\n");
> + fprintf(stderr, "   tc qdisc show [ dev STRING ] [ ingress | clsact 
> | invisible ]\n");

Doesn't look like these are mutually exclusive. Therefore I would
suggest fixing the syntax to:

| + fprintf(stderr, "   tc qdisc show [ dev STRING ] [ ingress | clsact 
] [ invisible ]\n");
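
To illustrate what the suggested usage line implies for parsing, here is a
rough sketch in which 'ingress' and 'clsact' exclude each other while
'invisible' is an independent flag. The struct and function names are
hypothetical and this is not the actual tc_qdisc.c code; the mutual
exclusion of 'ingress' and 'clsact' is assumed from the usage string.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option holder; not the real tc state. */
struct show_opts {
	const char *dev;     /* NULL when no device was given */
	const char *filter;  /* "ingress", "clsact" or NULL */
	bool invisible;      /* may be combined with either filter */
};

/* Returns 0 on success, -1 on malformed input. */
static int parse_show_args(int argc, const char **argv, struct show_opts *o)
{
	*o = (struct show_opts){0};

	for (int i = 0; i < argc; i++) {
		if (strcmp(argv[i], "dev") == 0) {
			if (++i >= argc)
				return -1;   /* "dev" needs a name */
			o->dev = argv[i];
		} else if (strcmp(argv[i], "ingress") == 0 ||
			   strcmp(argv[i], "clsact") == 0) {
			if (o->filter)
				return -1;   /* assumed mutually exclusive */
			o->filter = argv[i];
		} else if (strcmp(argv[i], "invisible") == 0) {
			o->invisible = true;
		} else {
			return -1;
		}
	}
	return 0;
}
```

With this shape, "dev eth0 ingress invisible" is accepted, whereas
"ingress clsact" is rejected.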

Cheers, Phil


Re: [PATCH v3 20/20] checkpatch: warn for use of old PCI pool API

2017-02-27 Thread Joe Perches
On Mon, 2017-02-27 at 12:22 +0100, Peter Senna Tschudin wrote:
> On Sun, Feb 26, 2017 at 08:24:25PM +0100, Romain Perier wrote:
> > pci_pool_*() functions should be replaced by the corresponding functions
> > in the DMA pool API. This adds support to check for use of these pci
> > functions and display a warning when it is the case.
> > 
> 
> I guess Joe Perches did send some comments for this one; did you address
> them?


> Reviewed-by: Peter Senna Tschudin 
> > Signed-off-by: Romain Perier 
> > ---
> >  scripts/checkpatch.pl | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl
> > index baa3c7b..f2c775c 100755
> > --- a/scripts/checkpatch.pl
> > +++ b/scripts/checkpatch.pl
> > @@ -6064,7 +6064,14 @@ sub process {
> > WARN("USE_DEVICE_INITCALL",
> >  "please use device_initcall() or more appropriate 
> > function instead of __initcall() (see include/linux/init.h)\n" . $herecurr);
> > }
> > -
> > +# check for old PCI api pci_pool_*(), use dma_pool_*() instead
> > +   if ($line =~ 
> > /\bpci_pool(?:_(?:create|destroy|alloc|zalloc|free)|)\b/) {
> > +   if (WARN("USE_DMA_POOL",
> > +"please use the dma pool api or more 
> > appropriate function instead of the old pci pool api\n" . $herecurr) &&
> > +   $fix) {
> > +   while ($fixed[$fixlinenr] =~ 
> > s/\bpci_pool(_(?:create|destroy|alloc|zalloc|free)|)\b/dma_pool$1/) {}
> > +   }
> > +   }
> >  # check for various structs that are normally const (ops, kgdb, 
> > device_tree)
> > if ($line !~ /\bconst\b/ &&
> > $line =~ /\bstruct\s+($const_structs)\b/) {
> > 

This is nearly identical to the suggestion that I
sent, but this is slightly misformatted as it has
neither a leading nor a trailing blank line to
separate the test blocks.

Also, I think none of the patches have reached lkml.

Romain, are you using git-send-email to send these
patches?  Perhaps the patches you send also contain
HTML, which is rejected by the mailing list.



Re: [PATCH 6/7] net: stmmac: dwc-qos: Split out ->probe() and ->remove()

2017-02-27 Thread Mikko Perttunen

On 23.02.2017 19:24, Thierry Reding wrote:

From: Thierry Reding 

Split out the binding specific parts of ->probe() and ->remove() to
enable the driver to support variants of the binding. This is useful in
order to keep backwards-compatibility while making it easy for a sub-
driver to deal only with the updated bindings rather than having to add
compatibility quirks all over the place.
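
The dispatch pattern described above can be sketched in plain userspace C
as follows. Everything here is an illustrative assumption: err_ptr(),
is_err() and ptr_err() are stand-ins for the kernel's ERR_PTR(), IS_ERR()
and PTR_ERR(), and variant_ops, qos_probe() and common_probe() are
hypothetical names, not the driver's.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095
static inline void *err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Per-variant hooks, selected once (e.g. from match data). */
struct variant_ops {
	void *(*probe)(int have_clock); /* NULL or private data on success */
	int (*remove)(void);
};

static int removed; /* counts remove() calls, for demonstration */

static void *qos_probe(int have_clock)
{
	if (!have_clock)
		return err_ptr(-ENOENT);
	return NULL; /* this variant keeps no private data */
}

static int qos_remove(void)
{
	removed++;
	return 0;
}

static const struct variant_ops qos_ops = {
	.probe = qos_probe,
	.remove = qos_remove,
};

/* Common path: run the variant hook, then the shared step, and undo the
 * variant part if the shared step fails. */
static int common_probe(const struct variant_ops *ops, int have_clock,
			int core_ok)
{
	void *priv = ops->probe(have_clock);

	if (is_err(priv))
		return (int)ptr_err(priv);
	if (!core_ok) {
		ops->remove();
		return -EIO;
	}
	return 0;
}
```

A second variant would only need its own ops structure; the common path
stays unchanged.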

Signed-off-by: Thierry Reding 
---
 .../ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c| 114 -
 1 file changed, 88 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c 
b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
index 1a3fa3d9f855..5071d3c15adc 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-dwc-qos-eth.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -106,13 +107,70 @@ static int dwc_eth_dwmac_config_dt(struct platform_device 
*pdev,
return 0;
 }

+static void *dwc_qos_probe(struct platform_device *pdev,
+  struct plat_stmmacenet_data *plat_dat,
+  struct stmmac_resources *stmmac_res)
+{
+   int err;
+
+   plat_dat->stmmac_clk = devm_clk_get(&pdev->dev, "apb_pclk");
+   if (IS_ERR(plat_dat->stmmac_clk)) {
+   dev_err(&pdev->dev, "apb_pclk clock not found.\n");
+   return ERR_CAST(plat_dat->stmmac_clk);
+   }
+
+   clk_prepare_enable(plat_dat->stmmac_clk);
+
+   plat_dat->pclk = devm_clk_get(&pdev->dev, "phy_ref_clk");
+   if (IS_ERR(plat_dat->pclk)) {
+   dev_err(&pdev->dev, "phy_ref_clk clock not found.\n");
+   err = PTR_ERR(plat_dat->pclk);
+   goto disable;
+   }
+
+   clk_prepare_enable(plat_dat->pclk);
+
+   return NULL;
+
+disable:
+   clk_disable_unprepare(plat_dat->stmmac_clk);
+   return ERR_PTR(err);
+}
+
+static int dwc_qos_remove(struct platform_device *pdev)
+{
+   struct net_device *ndev = platform_get_drvdata(pdev);
+   struct stmmac_priv *priv = netdev_priv(ndev);
+
+   clk_disable_unprepare(priv->plat->pclk);
+   clk_disable_unprepare(priv->plat->stmmac_clk);
+
+   return 0;
+}
+
+struct dwc_eth_dwmac_data {
+   void *(*probe)(struct platform_device *pdev,
+  struct plat_stmmacenet_data *data,
+  struct stmmac_resources *res);
+   int (*remove)(struct platform_device *pdev);
+};
+
+static const struct dwc_eth_dwmac_data dwc_qos_data = {
+   .probe = dwc_qos_probe,
+   .remove = dwc_qos_remove,
+};
+
 static int dwc_eth_dwmac_probe(struct platform_device *pdev)
 {
+   const struct dwc_eth_dwmac_data *data;
struct plat_stmmacenet_data *plat_dat;
struct stmmac_resources stmmac_res;
struct resource *res;
+   void *priv;
int ret;

+   data = of_device_get_match_data(&pdev->dev);
+
memset(&stmmac_res, 0, sizeof(struct stmmac_resources));

/**
@@ -138,39 +196,26 @@ static int dwc_eth_dwmac_probe(struct platform_device 
*pdev)
if (IS_ERR(plat_dat))
return PTR_ERR(plat_dat);

-   plat_dat->stmmac_clk = devm_clk_get(&pdev->dev, "apb_pclk");
-   if (IS_ERR(plat_dat->stmmac_clk)) {
-   dev_err(&pdev->dev, "apb_pclk clock not found.\n");
-   ret = PTR_ERR(plat_dat->stmmac_clk);
-   plat_dat->stmmac_clk = NULL;
-   goto err_remove_config_dt;
-   }
-   clk_prepare_enable(plat_dat->stmmac_clk);
-
-   plat_dat->pclk = devm_clk_get(&pdev->dev, "phy_ref_clk");
-   if (IS_ERR(plat_dat->pclk)) {
-   dev_err(&pdev->dev, "phy_ref_clk clock not found.\n");
-   ret = PTR_ERR(plat_dat->pclk);
-   plat_dat->pclk = NULL;
-   goto err_out_clk_dis_phy;
+   priv = data->probe(pdev, plat_dat, &stmmac_res);
+   if (IS_ERR(priv)) {
+   ret = PTR_ERR(priv);
+   dev_err(&pdev->dev, "failed to probe subdriver: %d\n", ret);
+   goto remove_config;
}
-   clk_prepare_enable(plat_dat->pclk);

ret = dwc_eth_dwmac_config_dt(pdev, plat_dat);
if (ret)
-   goto err_out_clk_dis_aper;
+   goto remove;

ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
if (ret)
-   goto err_out_clk_dis_aper;
+   goto remove;

-   return 0;
+   return ret;

-err_out_clk_dis_aper:
-   clk_disable_unprepare(plat_dat->pclk);
-err_out_clk_dis_phy:
-   clk_disable_unprepare(plat_dat->stmmac_clk);
-err_remove_config_dt:
+remove:
+   data->remove(pdev);
+remove_config:
stmmac_remove_config_dt(pdev, plat_dat);

return ret;
@@ -178,11 +223,28 @@ static int dwc_eth_dwmac_probe(struct platform_device 
*pdev)

 static int dwc_eth_dwmac_remove(struct platform_device *pdev)
 {
-   return 

[PATCH 2/6] uapi: stop including linux/sysctl.h in uapi/linux/netfilter.h

2017-02-27 Thread Pablo Neira Ayuso
From: "Dmitry V. Levin" 

linux/netfilter.h is the last uapi header file that includes
linux/sysctl.h but it does not depend on definitions provided
by this essentially dead header file.

Suggested-by: Eric W. Biederman 
Signed-off-by: Dmitry V. Levin 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/uapi/linux/netfilter.h b/include/uapi/linux/netfilter.h
index 7550e9176a54..c111a91adcc0 100644
--- a/include/uapi/linux/netfilter.h
+++ b/include/uapi/linux/netfilter.h
@@ -3,7 +3,6 @@
 
 #include 
 #include 
-#include <linux/sysctl.h>
 #include 
 #include 
 
-- 
2.1.4



[PATCH 3/6] uapi: fix linux/netfilter/xt_hashlimit.h userspace compilation error

2017-02-27 Thread Pablo Neira Ayuso
From: "Dmitry V. Levin" 

Include  like some of uapi/linux/netfilter/xt_*.h
headers do to fix the following linux/netfilter/xt_hashlimit.h
userspace compilation error:

/usr/include/linux/netfilter/xt_hashlimit.h:90:12: error: 'NAME_MAX' undeclared 
here (not in a function)
  char name[NAME_MAX];

Signed-off-by: Dmitry V. Levin 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/xt_hashlimit.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/netfilter/xt_hashlimit.h 
b/include/uapi/linux/netfilter/xt_hashlimit.h
index 3efc0ca18345..79da349f1060 100644
--- a/include/uapi/linux/netfilter/xt_hashlimit.h
+++ b/include/uapi/linux/netfilter/xt_hashlimit.h
@@ -2,6 +2,7 @@
 #define _UAPI_XT_HASHLIMIT_H
 
 #include 
+#include 
 #include 
 
 /* timings are in milliseconds. */
-- 
2.1.4



[PATCH 4/6] netfilter: nf_ct_expect: nf_ct_expect_related_report(): Return zero on success.

2017-02-27 Thread Pablo Neira Ayuso
From: Jarno Rajahalme 

Commit 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert()
returns void") inadvertently changed the successful return value of
nf_ct_expect_related_report() from 0 to 1, which caused openvswitch
conntrack integration to fail in FTP test cases.

Fix this by always returning zero on the success code path.
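
As a rough illustration of how a success path can end up returning 1, the
sketch below assumes a check helper that reports "ok to proceed" as a
positive value, and an insert step that returns void and so never resets
'ret'. All names are hypothetical and this is not the nf_conntrack code.

```c
#include <errno.h>

/* Hypothetical check: negative on error, 1 when the caller may proceed. */
static int check_sketch(int conflict)
{
	return conflict ? -EBUSY : 1;
}

/* Before the fix: the leftover positive 'ret' leaks to the caller. */
static int related_report_buggy(int conflict)
{
	int ret = check_sketch(conflict);

	if (ret <= 0)
		goto out;
	/* insert step returns void, so 'ret' is still 1 here */
	return ret;
out:
	return ret;
}

/* After the fix: success is always reported as zero. */
static int related_report_fixed(int conflict)
{
	int ret = check_sketch(conflict);

	if (ret <= 0)
		goto out;
	return 0;
out:
	return ret;
}
```

Callers that treat any non-zero return as failure would reject the buggy
variant's success value, which matches the symptom described above.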

Fixes: 4dee62b1b9b4 ("netfilter: nf_ct_expect: nf_ct_expect_insert() returns 
void")
Signed-off-by: Jarno Rajahalme 
Acked-by: Joe Stringer 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_expect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/nf_conntrack_expect.c 
b/net/netfilter/nf_conntrack_expect.c
index e19a69787d99..d6ace69d57dc 100644
--- a/net/netfilter/nf_conntrack_expect.c
+++ b/net/netfilter/nf_conntrack_expect.c
@@ -467,7 +467,7 @@ int nf_ct_expect_related_report(struct nf_conntrack_expect 
*expect,
 
spin_unlock_bh(&nf_conntrack_expect_lock);
nf_ct_expect_event_report(IPEXP_NEW, expect, portid, report);
-   return ret;
+   return 0;
 out:
spin_unlock_bh(&nf_conntrack_expect_lock);
return ret;
-- 
2.1.4


