[PATCH 2/2] vhost: forbid IOTLB invalidation when not enabled

2016-11-17 Thread Jason Wang
When IOTLB is not enabled, we should forbid IOTLB invalidation to
avoid a NULL pointer dereference.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c6f2d89..7d338d5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -959,6 +959,10 @@ int vhost_process_iotlb_msg(struct vhost_dev *dev,
vhost_iotlb_notify_vq(dev, msg);
break;
case VHOST_IOTLB_INVALIDATE:
+   if (!dev->iotlb) {
+   ret = -EFAULT;
+   break;
+   }
vhost_del_umem_range(dev->iotlb, msg->iova,
 msg->iova + msg->size - 1);
break;
-- 
2.7.4



RE: [PATCH net 1/2] r8152: fix the sw rx checksum is unavailable

2016-11-17 Thread Hayes Wang
Mark Lord [mailto:ml...@pobox.com]
> Sent: Thursday, November 17, 2016 9:42 PM
[...]
> What the above sample shows, is the URB transfer buffer ran out of space in
> the middle of a packet, and the hardware then tried to just continue that
> same packet in the next URB, without an rx_desc header inserted.  The
> r8152.c driver always assumes the URB buffer begins with an rx_desc, so of
> course this behaviour produces really weird effects, and system crashes,
> etc..

The USB device wouldn't know the address and size of the buffer; only
the USB host controller knows. The device just sends the data to the
host, and the host fills the memory. According to your description,
it seems the host splits the data from the device into two different
buffers (or URB transfers). I wonder whether that could occur. As far
as I know, the host wouldn't accept a buffer smaller than the data
length.

Our hw engineers need a log from a USB analyzer to confirm what the
device sends to the host. However, I don't think you have a USB
analyzer to do this. I will try to reproduce the issue, but I am
busy, so I may not respond quickly.

Besides, the maximum data length which the RTL8152 would send to
the host is 16KB. That is, if the agg_buf_sz is 16KB, the host
wouldn't split it. However, you still see problems in that case.

[...]
> It is not clear to me how the chip decides when to forward an rx URB to the 
> host.
> If you could describe how that part works for us, then it would help in 
> further
> understanding why fast systems (eg. a PC) don't generally notice the issue,
> while much slower embedded systems do see the issue regularly.

The driver expects the rx buffer to be laid out as

rx_desc + a packet + padding to 8-byte alignment +
rx_desc + a packet + padding to 8-byte alignment + ...

Therefore, when a urb transfer is completed, the driver parses
the buffer this way. After the buffer is handled, the urb is
submitted to the host again, until the next transfer completes.
If the submission fails, the driver retries later. urb->actual_length
tells how much data the host filled in; the driver uses it to find
the end of the data. urb->status tells whether the transfer was
successful; if it is not, the driver resubmits the urb to the host
directly.
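
As a rough sketch of that parsing (simplified, not the exact r8152.c
code; the struct layout and the 0x7fff length mask are illustrative):

	struct rx_desc {
		__le32 opts1;	/* low bits carry the packet length */
		/* ... */
	};

	u8 *p = urb->transfer_buffer;
	u8 *end = p + urb->actual_length;

	while (p + sizeof(struct rx_desc) <= end) {
		struct rx_desc *desc = (struct rx_desc *)p;
		unsigned int pkt_len = le32_to_cpu(desc->opts1) & 0x7fff;

		if (p + sizeof(*desc) + pkt_len > end)
			break;	/* buffer ended mid-packet, as in Mark's trace */

		/* ... hand rx_desc + packet to the stack ... */

		p += ALIGN(sizeof(*desc) + pkt_len, 8);	/* pad to 8 bytes */
	}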

Best Regards,
Hayes



pull-request: mac80211 2016-11-18

2016-11-17 Thread Johannes Berg
Hi Dave,

Due to travel/vacation, this is a bit late, but there aren't
that many fixes either. Most interesting/important are the
fixes from Felix and perhaps the scan entry limit.

Please pull and let me know if there's any problem.

Thanks,
johannes



The following changes since commit 269ebce4531b8edc4224259a02143181a1c1d77c:

  xen-netfront: cast grant table reference first to type int (2016-11-02 15:33:36 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211.git tags/mac80211-for-davem-2016-11-18

for you to fetch changes up to 9853a55ef1bb66d7411136046060bbfb69c714fa:

  cfg80211: limit scan results cache size (2016-11-18 08:44:44 +0100)


A few more bugfixes:
 * limit # of scan results stored in memory - this is a long-standing bug
   Jouni and I only noticed while discussing other things in Santa Fe
 * revert AP_LINK_PS patch that was causing issues (Felix)
 * various A-MSDU/A-MPDU fixes for TXQ code (Felix)
 * interoperability workaround for peers with broken VHT capabilities
   (Filip Matusiak)
 * add bitrate definition for a VHT MCS that's supposed to be invalid
   but gets used by some hardware anyway (Thomas Pedersen)
 * beacon timer fix in hwsim (Benjamin Beichler)


Benjamin Beichler (1):
  mac80211_hwsim: fix beacon delta calculation

Felix Fietkau (4):
  Revert "mac80211: allow using AP_LINK_PS with mac80211-generated TIM IE"
  mac80211: update A-MPDU flag on tx dequeue
  mac80211: remove bogus skb vif assignment
  mac80211: fix A-MSDU aggregation with fast-xmit + txq

Filip Matusiak (1):
  mac80211: Ignore VHT IE from peer with wrong rx_mcs_map

Johannes Berg (1):
  cfg80211: limit scan results cache size

Pedersen, Thomas (1):
  cfg80211: add bitrate for 20MHz MCS 9

 drivers/net/wireless/mac80211_hwsim.c |  2 +-
 net/mac80211/sta_info.c   |  2 +-
 net/mac80211/tx.c | 14 +--
 net/mac80211/vht.c| 16 
 net/wireless/core.h   |  1 +
 net/wireless/scan.c   | 69 +++
 net/wireless/util.c   |  3 +-
 7 files changed, 100 insertions(+), 7 deletions(-)


[PATCH net-next] bridge: add igmpv3 and mldv2 query support

2016-11-17 Thread Hangbin Liu
Add bridge IGMPv3 and MLDv2 query support. Until we think it is stable
enough, only enable it when it is declared via force_igmp_version /
force_mld_version.

Signed-off-by: Hangbin Liu 
---
 net/bridge/br_multicast.c | 203 --
 1 file changed, 194 insertions(+), 9 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 2136e45..9fb47f3 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -35,6 +35,10 @@
 
 #include "br_private.h"
 
+#define IGMP_V3_SEEN(in_dev) \
+   (IPV4_DEVCONF_ALL(dev_net(in_dev->dev), FORCE_IGMP_VERSION) == 3 || \
+IN_DEV_CONF_GET((in_dev), FORCE_IGMP_VERSION) == 3)
+
 static void br_multicast_start_querier(struct net_bridge *br,
   struct bridge_mcast_own_query *query);
 static void br_multicast_add_router(struct net_bridge *br,
@@ -360,9 +364,8 @@ static int br_mdb_rehash(struct net_bridge_mdb_htable __rcu **mdbp, int max,
return 0;
 }
 
-static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
-   __be32 group,
-   u8 *igmp_type)
+static struct sk_buff *br_ip4_alloc_query_v2(struct net_bridge *br,
+__be32 group, u8 *igmp_type)
 {
struct sk_buff *skb;
struct igmphdr *ih;
@@ -428,10 +431,82 @@ static struct sk_buff *br_ip4_multicast_alloc_query(struct net_bridge *br,
return skb;
 }
 
+static struct sk_buff *br_ip4_alloc_query_v3(struct net_bridge *br,
+__be32 group, u8 *igmp_type)
+{
+   struct sk_buff *skb;
+   struct igmpv3_query *ih3;
+   struct ethhdr *eth;
+   struct iphdr *iph;
+
+   skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*iph) +
+sizeof(*ih3) + 4);
+   if (!skb)
+   goto out;
+
+   skb->protocol = htons(ETH_P_IP);
+
+   skb_reset_mac_header(skb);
+   eth = eth_hdr(skb);
+
+   ether_addr_copy(eth->h_source, br->dev->dev_addr);
+   eth->h_dest[0] = 1;
+   eth->h_dest[1] = 0;
+   eth->h_dest[2] = 0x5e;
+   eth->h_dest[3] = 0;
+   eth->h_dest[4] = 0;
+   eth->h_dest[5] = 1;
+   eth->h_proto = htons(ETH_P_IP);
+   skb_put(skb, sizeof(*eth));
+
+   skb_set_network_header(skb, skb->len);
+   iph = ip_hdr(skb);
+
+   iph->version = 4;
+   iph->ihl = 6;
+   iph->tos = 0xc0;
+   iph->tot_len = htons(sizeof(*iph) + sizeof(*ih3) + 4);
+   iph->id = 0;
+   iph->frag_off = htons(IP_DF);
+   iph->ttl = 1;
+   iph->protocol = IPPROTO_IGMP;
+   iph->saddr = br->multicast_query_use_ifaddr ?
+inet_select_addr(br->dev, 0, RT_SCOPE_LINK) : 0;
+   iph->daddr = htonl(INADDR_ALLHOSTS_GROUP);
+   ((u8 *)&iph[1])[0] = IPOPT_RA;
+   ((u8 *)&iph[1])[1] = 4;
+   ((u8 *)&iph[1])[2] = 0;
+   ((u8 *)&iph[1])[3] = 0;
+   ip_send_check(iph);
+   skb_put(skb, 24);
+
+   skb_set_transport_header(skb, skb->len);
+   ih3 = igmpv3_query_hdr(skb);
+
+   *igmp_type = IGMP_HOST_MEMBERSHIP_QUERY;
+   ih3->type = IGMP_HOST_MEMBERSHIP_QUERY;
+   ih3->code = (group ? br->multicast_last_member_interval :
+   br->multicast_query_response_interval) /
+  (HZ / IGMP_TIMER_SCALE);
+   ih3->csum = 0;
+   ih3->group = group;
+   ih3->resv = 0;
+   ih3->suppress = 0;
+   ih3->qrv= 2;
+   ih3->qqic = br->multicast_query_interval / HZ;
+   ih3->nsrcs = 0;
+   ih3->csum = ip_compute_csum((void *)ih3, sizeof(struct igmpv3_query ));
+   skb_put(skb, sizeof(*ih3));
+
+   __skb_pull(skb, sizeof(*eth));
+
+out:
+   return skb;
+}
 #if IS_ENABLED(CONFIG_IPV6)
-static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
-   const struct in6_addr *grp,
-   u8 *igmp_type)
+static struct sk_buff *br_ip6_alloc_query_v1(struct net_bridge *br,
+const struct in6_addr *grp,
+u8 *igmp_type)
 {
struct sk_buff *skb;
struct ipv6hdr *ip6h;
@@ -514,19 +589,129 @@ static struct sk_buff *br_ip6_multicast_alloc_query(struct net_bridge *br,
 out:
return skb;
 }
+
+static struct sk_buff *br_ip6_alloc_query_v2(struct net_bridge *br,
+const struct in6_addr *grp,
+u8 *igmp_type)
+{
+   struct sk_buff *skb;
+   struct ipv6hdr *ip6h;
+   struct mld2_query *mld2q;
+   struct ethhdr *eth;
+   u8 *hopopt;
+   unsigned long interval;
+
+   skb = netdev_alloc_skb_ip_align(br->dev, sizeof(*eth) + sizeof(*ip6h) +
+

[PATCH net-next 2/4] geneve: Merge ipv4 and ipv6 geneve_build_skb()

2016-11-17 Thread Pravin B Shelar
There are minimal differences in building the Geneve header
between ipv4 and ipv6 geneve tunnels. The following patch
refactors the code to unify them.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 100 ++-
 1 file changed, 26 insertions(+), 74 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 622cf3b..9a4351c 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -630,67 +630,34 @@ static int geneve_stop(struct net_device *dev)
 }
 
 static void geneve_build_header(struct genevehdr *geneveh,
-   __be16 tun_flags, u8 vni[3],
-   u8 options_len, u8 *options)
+   const struct ip_tunnel_info *info)
 {
geneveh->ver = GENEVE_VER;
-   geneveh->opt_len = options_len / 4;
-   geneveh->oam = !!(tun_flags & TUNNEL_OAM);
-   geneveh->critical = !!(tun_flags & TUNNEL_CRIT_OPT);
+   geneveh->opt_len = info->options_len / 4;
+   geneveh->oam = !!(info->key.tun_flags & TUNNEL_OAM);
+   geneveh->critical = !!(info->key.tun_flags & TUNNEL_CRIT_OPT);
geneveh->rsvd1 = 0;
-   memcpy(geneveh->vni, vni, 3);
+   tunnel_id_to_vni(info->key.tun_id, geneveh->vni);
geneveh->proto_type = htons(ETH_P_TEB);
geneveh->rsvd2 = 0;
 
-   memcpy(geneveh->options, options, options_len);
+   ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
-static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
-   __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-   bool xnet)
-{
-   bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
-   struct genevehdr *gnvh;
-   int min_headroom;
-   int err;
-
-   skb_scrub_packet(skb, xnet);
-
-   min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
-   + GENEVE_BASE_HLEN + opt_len + sizeof(struct iphdr);
-   err = skb_cow_head(skb, min_headroom);
-   if (unlikely(err))
-   goto free_rt;
-
-   err = udp_tunnel_handle_offloads(skb, udp_sum);
-   if (err)
-   goto free_rt;
-
-   gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len);
-   geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
-
-   skb_set_inner_protocol(skb, htons(ETH_P_TEB));
-   return 0;
-
-free_rt:
-   ip_rt_put(rt);
-   return err;
-}
-
-#if IS_ENABLED(CONFIG_IPV6)
-static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
-__be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-bool xnet)
+static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
+   const struct ip_tunnel_info *info,
+   bool xnet, int ip_hdr_len)
 {
-   bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
+   bool udp_sum = !!(info->key.tun_flags & TUNNEL_CSUM);
struct genevehdr *gnvh;
int min_headroom;
int err;
 
+   skb_reset_mac_header(skb);
skb_scrub_packet(skb, xnet);
 
-   min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
-   + GENEVE_BASE_HLEN + opt_len + sizeof(struct ipv6hdr);
+   min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len +
+  GENEVE_BASE_HLEN + info->options_len + ip_hdr_len;
err = skb_cow_head(skb, min_headroom);
if (unlikely(err))
goto free_dst;
@@ -699,9 +666,9 @@ static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
if (err)
goto free_dst;
 
-   gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len);
-   geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
-
+   gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) +
+  info->options_len);
+   geneve_build_header(gnvh, info);
skb_set_inner_protocol(skb, htons(ETH_P_TEB));
return 0;
 
@@ -709,12 +676,11 @@ static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
dst_release(dst);
return err;
 }
-#endif
 
 static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
   struct net_device *dev,
   struct flowi4 *fl4,
-  struct ip_tunnel_info *info)
+  const struct ip_tunnel_info *info)
 {
bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
struct geneve_dev *geneve = netdev_priv(dev);
@@ -738,7 +704,7 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
}
fl4->flowi4_tos = RT_TOS(tos);
 
-   dst_cache = &info->dst_cache;
+   dst_cache = (struct dst_cache *)&info->dst_cache;
if (use_cache) {
rt = dst_cache_get_ip4(dst_cache, &fl4->saddr)

[PATCH net-next 0/4] geneve: Use LWT more effectively.

2016-11-17 Thread Pravin B Shelar
The following patch series makes use of the geneve LWT code path
for the geneve netdev type of device.
This allows us to simplify the geneve module.

Pravin B Shelar (4):
  geneve: Unify LWT and netdev handling.
  geneve: Merge ipv4 and ipv6 geneve_build_skb()
  geneve: Remove redundant socket checks.
  geneve: Optimize geneve device lookup.

 drivers/net/geneve.c | 678 +--
 1 file changed, 273 insertions(+), 405 deletions(-)

-- 
1.8.3.1



[PATCH net-next 1/4] geneve: Unify LWT and netdev handling.

2016-11-17 Thread Pravin B Shelar
The current geneve implementation has two separate cases to handle:
1. netdev xmit
2. LWT xmit.

In the netdev case, the geneve configuration is stored in various
struct geneve_dev members, for example geneve_addr, ttl, tos,
label, flags, dst_cache, etc. In the LWT case the configuration is
passed to the device in struct ip_tunnel_info.

The following patch uses the ip_tunnel_info struct to store almost
all of the configuration of a geneve netdevice. This allows us to
unify most of the geneve driver code around the ip_tunnel_info
struct. It dramatically simplifies the geneve code, since the
driver no longer needs to handle two different configuration
cases: duplicate code is removed, and a single code path can
handle either type of geneve device.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 611 ++-
 1 file changed, 262 insertions(+), 349 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 85a423a..622cf3b 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -45,41 +45,22 @@ struct geneve_net {
 
 static int geneve_net_id;
 
-union geneve_addr {
-   struct sockaddr_in sin;
-   struct sockaddr_in6 sin6;
-   struct sockaddr sa;
-};
-
-static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, };
-
 /* Pseudo network device */
 struct geneve_dev {
struct hlist_node  hlist;   /* vni hash table */
struct net *net;/* netns for packet i/o */
struct net_device  *dev;/* netdev for geneve tunnel */
+   struct ip_tunnel_info info;
	struct geneve_sock __rcu *sock4;/* IPv4 socket used for geneve tunnel */
 #if IS_ENABLED(CONFIG_IPV6)
	struct geneve_sock __rcu *sock6;/* IPv6 socket used for geneve tunnel */
 #endif
-   u8 vni[3];  /* virtual network ID for tunnel */
-   u8 ttl; /* TTL override */
-   u8 tos; /* TOS override */
-   union geneve_addr  remote;  /* IP address for link partner */
struct list_head   next;/* geneve's per namespace list */
-   __be32 label;   /* IPv6 flowlabel override */
-   __be16 dst_port;
-   bool   collect_md;
struct gro_cells   gro_cells;
-   u32flags;
-   struct dst_cache   dst_cache;
+   bool   collect_md;
+   bool   use_udp6_rx_checksums;
 };
 
-/* Geneve device flags */
-#define GENEVE_F_UDP_ZERO_CSUM_TX  BIT(0)
-#define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1)
-#define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2)
-
 struct geneve_sock {
boolcollect_md;
struct list_headlist;
@@ -87,7 +68,6 @@ struct geneve_sock {
struct rcu_head rcu;
int refcnt;
struct hlist_head   vni_list[VNI_HASH_SIZE];
-   u32 flags;
 };
 
 static inline __u32 geneve_net_vni_hash(u8 vni[3])
@@ -109,6 +89,20 @@ static __be64 vni_to_tunnel_id(const __u8 *vni)
 #endif
 }
 
+/* Convert 64 bit tunnel ID to 24 bit VNI. */
+static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+   vni[0] = (__force __u8)(tun_id >> 16);
+   vni[1] = (__force __u8)(tun_id >> 8);
+   vni[2] = (__force __u8)tun_id;
+#else
+   vni[0] = (__force __u8)((__force u64)tun_id >> 40);
+   vni[1] = (__force __u8)((__force u64)tun_id >> 48);
+   vni[2] = (__force __u8)((__force u64)tun_id >> 56);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
return gs->sock->sk->sk_family;
@@ -117,6 +111,7 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
__be32 addr, u8 vni[])
 {
+   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -125,8 +120,8 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
hash = geneve_net_vni_hash(vni);
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-   if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
-   addr == geneve->remote.sin.sin_addr.s_addr)
+   if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+   addr == geneve->info.key.u.ipv4.dst)
return geneve;
}
return NULL;
@@ -136,6 +131,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 struct in6_addr addr6, u8 vni[])
 {
+   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -144,8 +140,8 @@ static struct geneve_d

[PATCH net-next 4/4] geneve: Optimize geneve device lookup.

2016-11-17 Thread Pravin B Shelar
Rather than comparing the 64-bit tunnel id, compare the tunnel vni,
which is a 24-bit id. This also saves the conversion from vni to
tunnel id on each tunnel packet received.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 8e6c659..21f3270 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -103,6 +103,17 @@ static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
 #endif
 }
 
+static bool cmp_tunnel_id_and_vni(u8 *tun_id, u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+   return (vni[0] == tun_id[2]) &&
+  (vni[1] == tun_id[1]) &&
+  (vni[2] == tun_id[0]);
+#else
+   return !memcmp(vni, &tun_id[5], 3);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
return gs->sock->sk->sk_family;
@@ -111,7 +122,6 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
__be32 addr, u8 vni[])
 {
-   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -120,7 +130,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
hash = geneve_net_vni_hash(vni);
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-   if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+   if (cmp_tunnel_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
addr == geneve->info.key.u.ipv4.dst)
return geneve;
}
@@ -131,7 +141,6 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 struct in6_addr addr6, u8 vni[])
 {
-   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -140,7 +149,7 @@ static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
hash = geneve_net_vni_hash(vni);
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-   if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+   if (cmp_tunnel_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
ipv6_addr_equal(&addr6, &geneve->info.key.u.ipv6.dst))
return geneve;
}
-- 
1.8.3.1



[PATCH net-next 3/4] geneve: Remove redundant socket checks.

2016-11-17 Thread Pravin B Shelar
Geneve already has a check for the device socket in the route
lookup function, so there is no need to check it again in the
xmit functions.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 9a4351c..8e6c659 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -785,14 +785,11 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct net_device *dev,
struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
const struct ip_tunnel_key *key = &info->key;
struct rtable *rt;
-   int err = -EINVAL;
struct flowi4 fl4;
__u8 tos, ttl;
__be16 sport;
__be16 df;
-
-   if (!gs4)
-   return err;
+   int err;
 
rt = geneve_get_v4_rt(skb, dev, &fl4, info);
if (IS_ERR(rt))
@@ -828,13 +825,10 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct net_device *dev,
struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
const struct ip_tunnel_key *key = &info->key;
struct dst_entry *dst = NULL;
-   int err = -EINVAL;
struct flowi6 fl6;
__u8 prio, ttl;
__be16 sport;
-
-   if (!gs6)
-   return err;
+   int err;
 
dst = geneve_get_v6_dst(skb, dev, &fl6, info);
if (IS_ERR(dst))
-- 
1.8.3.1



Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.

2016-11-17 Thread Pravin Shelar
On Thu, Nov 17, 2016 at 2:17 AM, David Laight  wrote:
> From: Of Jiri Benc
>> Sent: 15 November 2016 14:40
>> On Sun, 13 Nov 2016 20:43:55 -0800, Pravin B Shelar wrote:
>> > @@ -1929,8 +1951,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
>> > union vxlan_addr *src;
>> > struct vxlan_metadata _md;
>> > struct vxlan_metadata *md = &_md;
>> > -   struct dst_entry *ndst = NULL;
>> > __be16 src_port = 0, dst_port;
>> > +   struct dst_entry *ndst = NULL;
>> > __be32 vni, label;
>> > __be16 df = 0;
>> > __u8 tos, ttl;
>>
>> This looks kind of arbitrary. You might want to remove this hunk or
>> merge it to patch 3.
>
> Worse than arbitrary, it adds 4 bytes of pad on 64bit systems.
>
OK. I will send out a patch.
But this is not a real issue in the vxlan module today ;)


Proposal

2016-11-17 Thread Teresa Au


Business Partnership Proposal For You,contact me via my personal E-mail for 
further 
detail's: ms_teresa_a...@outlook.com



Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

2016-11-17 Thread Harini Katakam
Hi Rafal,

On Thu, Nov 17, 2016 at 7:05 PM, Rafal Ozieblo  wrote:
> -Original Message-
> From: Nicolas Ferre [mailto:nicolas.fe...@atmel.com]
> Sent: 17 listopada 2016 14:29
> To: Harini Katakam; Rafal Ozieblo
> Cc: harini.kata...@xilinx.com; netdev@vger.kernel.org; 
> linux-ker...@vger.kernel.org
> Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM
>
>> Le 17/11/2016 à 13:21, Harini Katakam a écrit :
>> > Hi Rafal,
>> >
>> > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo  wrote:
>> >> Hello,
>> >> I think, there could a bug in your patch.
>> >>
>> >>> +
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> + dmacfg |= GEM_BIT(ADDR64);
>> >>> +#endif
>> >>
>> >> You enable 64 bit addressing (64b dma bus width) always when appropriate 
>> >> architecture config option is enabled.
>> >> But there are some legacy controllers which do not support that feature. 
>> >> According Cadence hardware team:
>> >> "64 bit addressing was added in July 2013. Earlier version do not have it.
>> >> This feature was enhanced in release August 2014 to have separate upper 
>> >> address values for transmit and receive."
>> >>
>> >>> /* Bitfields in NSR */
>> >>> @@ -474,6 +479,10 @@
>> >>>  struct macb_dma_desc {
>> >>>  u32 addr;
>> >>>   u32 ctrl;
>> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> >>> + u32 addrh;
>> >>> + u32 resvd;
>> >>> +#endif
>> >>>  };
>> >>
>> >> It will not work for legacy hardware. Old descriptor is 2 words wide, the 
>> >> new one is 4 words wide.
>> >> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but hardware doesn't
>> >> support it at all, you will miss every second descriptor.
>> >>
>> >
>> > True, this feature is not available in all of Cadence IP versions.
>> > In fact, the IP version Zynq does not support this. But the one in ZynqMP 
>> > does.
>> > So, we enable kernel config for 64 bit DMA addressing for this SoC and
>> > hence the driver picks it up. My assumption was that if the legacy IP
>> > does not support
>> > 64 bit addressing, then this DMA option wouldn't be enabled.
>> >
>> > There is a design config register in Cadence IP which is being read to
>> > check for 64 bit address support - DMA mask is set based on that.
>> > But the addition of two descriptor words cannot be based on this runtime 
>> > check.
>> > For this reason, all the static changes were placed under this check.
>>
>> We have quite a bunch of options in this driver to determinate what is the 
>> real capacity of the underlying hardware.
>> If HW configuration registers are not appropriate, and it seems they are 
>> not, I would advice to simply use the DT compatibility string.
>>
>> Best regards,
>> --
>> Nicolas Ferre
>
> HW configuration registers are appropriate. The issue is that this code 
> doesn’t use the capability bit to switch between different dma descriptors (2 
> words vs. 4 words).
> DMA descriptor size is chosen based on kernel configuration, not based on 
> hardware capabilities.

The HW configuration register does give the appropriate information.
But the addition of two address words to the macb descriptor structure
is a static change.

+static inline void macb_set_addr(struct macb_dma_desc *desc, dma_addr_t addr)
+{
+   desc->addr = (u32)addr;
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+   desc->addrh = (u32)(addr >> 32);
+#endif
+}

Even if the #ifdef condition here is changed to a HW config check, addr
and addrh are different fields, and the "addrh" entry has to be present
for the 64 bit descriptor case to be handled separately.
Can you please tell me how you propose changing the DMA descriptor
structure from 4 to 2 or 2 to 4 words *after* reading the DCFG register?
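
For concreteness, the kind of runtime choice in question might look
like the sketch below. This is purely hypothetical; none of these names
come from the patch, and desc_words is an invented field set after
reading DCFG:

	struct macb {
		unsigned int desc_words;	/* 2 or 4, from DCFG */
		/* ... */
	};

	/* Keep the 2-word struct; the optional addrh/resvd words live
	 * only in the ring, whose stride is chosen at probe time.
	 */
	struct macb_dma_desc {
		u32 addr;
		u32 ctrl;
	};

	static struct macb_dma_desc *macb_desc(struct macb *bp,
					       void *ring, unsigned int i)
	{
		return (struct macb_dma_desc *)
			((u8 *)ring + i * bp->desc_words * sizeof(u32));
	}

	static void macb_set_dma_addr(struct macb *bp,
				      struct macb_dma_desc *desc,
				      dma_addr_t addr)
	{
		desc->addr = (u32)addr;
		if (bp->desc_words == 4)
			((u32 *)desc)[2] = upper_32_bits(addr); /* addrh slot */
	}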

Regards,
Harini


[PATCH] net: fec: Detect and recover receive queue hangs

2016-11-17 Thread Chris Lesiak
This corrects a problem that appears to be similar to ERR006358.  But
while ERR006358 is a race when the tx queue transitions from empty to
not empty, this problem is a race when the rx queue transitions from
full to not full.

The symptom is a receive queue that is stuck.  The ENET_RDAR register
will read 0, indicating that there are no empty receive descriptors in
the receive ring.  Since no additional frames can be queued, no RXF
interrupts occur.

This problem can be triggered with a 1 Gb link and about 400 Mbps of
traffic.

This patch detects this condition, sets the work_rx bit, and
reschedules the poll method.

Signed-off-by: Chris Lesiak 
---
 drivers/net/ethernet/freescale/fec_main.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index fea0f33..8a87037 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
return ret;
 }
 
+static inline bool
+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id)
+{
+   int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
+
+   if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
+   return false;
+
+   dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
+
+   fep->work_rx |= 1 << work_bit;
+
+   return true;
+}
+
+static inline bool fec_enet_recover_rxqs(struct fec_enet_private *fep)
+{
+   unsigned int q;
+   bool ret = false;
+
+   for (q = 0; q < fep->num_rx_queues; q++) {
+   if (fec_enet_recover_rxq(fep, q))
+   ret = true;
+   }
+
+   return ret;
+}
+
 static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
 {
struct net_device *ndev = napi->dev;
@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct *napi, int budget)
if (pkts < budget) {
napi_complete(napi);
writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
+
+   if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
+   writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
}
return pkts;
 }
-- 
2.5.5



[iproute2] iproute2: fix the link group name getting error

2016-11-17 Thread Zhang Shengju
In the situation where more than one entry lives in the same hash
bucket, walk the chain to find the correct one.

Before:
$ cat /etc/iproute2/group
0   default
256 test

$ sudo ip link set group test dummy1

$ ip link show type dummy
11: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group 0 qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

After:
$ ip link show type dummy
11: dummy0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 4e:3b:d3:6c:f0:e6 brd ff:ff:ff:ff:ff:ff
12: dummy1: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN mode DEFAULT group test qlen 1000
    link/ether d6:9c:a4:1f:e7:e5 brd ff:ff:ff:ff:ff:ff

Signed-off-by: Zhang Shengju 
---
 lib/rt_names.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/rt_names.c b/lib/rt_names.c
index b665d3e..c66cb1e 100644
--- a/lib/rt_names.c
+++ b/lib/rt_names.c
@@ -559,8 +559,12 @@ const char *rtnl_group_n2a(int id, char *buf, int len)
 
for (i = 0; i < 256; i++) {
entry = rtnl_group_hash[i];
-   if (entry && entry->id == id)
-   return entry->name;
+
+   while (entry) {
+   if (entry->id == id)
+   return entry->name;
+   entry = entry->next;
+   }
}
 
snprintf(buf, len, "%d", id);
-- 
1.8.3.1





Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 04:37 PM, Julian Anastasov wrote:

On Thu, 17 Nov 2016, Rick Jones wrote:


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472

...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
16) = 0
setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0


A connected socket can benefit from the dst cached in the socket,
but not if SO_DONTROUTE is set. If we do not want to send packets
via a gateway, this -l 1 should help, but I don't see an IP_TTL
setsockopt in your first example with connect() to 127.0.0.1.

Also, maybe there can be another default: if -l is used to
specify the TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.


The global -l option specifies the duration of the test.  It doesn't 
specify the TTL of the IP datagrams being generated by the actions of 
the test.


I resisted setting SO_DONTROUTE for a number of years after the first 
instance of UDP_STREAM being used in link up/down testing took-out a 
company's network (including security camera feeds to galactic HQ) but 
at this point I'm likely to keep it in there because there ended-up 
being a second such incident.  It is set only for UDP_STREAM.  It isn't 
set for UDP_RR or TCP_*.  And for UDP_STREAM it can be overridden by the 
test-specific -R option.


happy benchmarking,

rick jones


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Julian Anastasov

Hello,

On Thu, 17 Nov 2016, Rick Jones wrote:

> raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F
> src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472
> 
> ...
> 
> socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
> getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")},
> 16) = 0
> setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0

A connected socket can benefit from the dst cached in the socket,
but not if SO_DONTROUTE is set. If we do not want to send packets
via a gateway, this -l 1 should help, but I don't see an IP_TTL
setsockopt in your first example with connect() to 127.0.0.1.

Also, maybe there can be another default: if -l is used to
specify the TTL then SO_DONTROUTE should not be set. I.e. we should
avoid SO_DONTROUTE, if possible.
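
For reference, the TTL-limited alternative is just the standard
sockets API, e.g.:

	int ttl = 1;	/* keep datagrams on the local segment */

	if (setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl)) < 0)
		perror("setsockopt IP_TTL");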

Regards

--
Julian Anastasov 


Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-17 Thread Jarno Rajahalme

> On Nov 14, 2016, at 3:09 PM, Eric Dumazet  wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expidited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise add rcu grace period combining by Eric Dumazet.
>>> 
>>> Eric since people are hitting noticable stalls because of the rcu grace
>>> period taking a long time do you think you could look at this code path
>>> a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet 
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a look asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, struct net_device *de
>   struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
>   __skb_queue_head_init(&cell->napi_skbs);
> +
> + set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>   netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>   napi_enable(&cell->napi);
>   }
> 

This solved a ~20 second slowdown between OVS datapath unit tests for me.

  Jarno



[Patch net v2] af_unix: conditionally use freezable blocking calls in read

2016-11-17 Thread Cong Wang
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.

The pipe_lock is still held at that point.

So use freezable version only for the recvmsg call path, avoid impact for
Android.

Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets")
Reported-by: Dmitry Vyukov 
Cc: Tejun Heo 
Cc: Colin Cross 
Cc: Rafael J. Wysocki 
Cc: Hannes Frederic Sowa 
Signed-off-by: Cong Wang 
---
 net/unix/af_unix.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5d1c14a..2358f26 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2199,7 +2199,8 @@ static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
  * Sleep until more data has arrived. But check for races..
  */
 static long unix_stream_data_wait(struct sock *sk, long timeo,
- struct sk_buff *last, unsigned int last_len)
+ struct sk_buff *last, unsigned int last_len,
+ bool freezable)
 {
struct sk_buff *tail;
DEFINE_WAIT(wait);
@@ -2220,7 +2221,10 @@ static long unix_stream_data_wait(struct sock *sk, long timeo,
 
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
unix_state_unlock(sk);
-   timeo = freezable_schedule_timeout(timeo);
+   if (freezable)
+   timeo = freezable_schedule_timeout(timeo);
+   else
+   timeo = schedule_timeout(timeo);
unix_state_lock(sk);
 
if (sock_flag(sk, SOCK_DEAD))
@@ -2250,7 +2254,8 @@ struct unix_stream_read_state {
unsigned int splice_flags;
 };
 
-static int unix_stream_read_generic(struct unix_stream_read_state *state)
+static int unix_stream_read_generic(struct unix_stream_read_state *state,
+   bool freezable)
 {
struct scm_cookie scm;
struct socket *sock = state->socket;
@@ -2330,7 +2335,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state)
mutex_unlock(&u->iolock);
 
timeo = unix_stream_data_wait(sk, timeo, last,
- last_len);
+ last_len, freezable);
 
if (signal_pending(current)) {
err = sock_intr_errno(timeo);
@@ -2472,7 +2477,7 @@ static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
.flags = flags
};
 
-   return unix_stream_read_generic(&state);
+   return unix_stream_read_generic(&state, true);
 }
 
 static int unix_stream_splice_actor(struct sk_buff *skb,
@@ -2503,7 +2508,7 @@ static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
flags & SPLICE_F_NONBLOCK)
state.flags = MSG_DONTWAIT;
 
-   return unix_stream_read_generic(&state);
+   return unix_stream_read_generic(&state, false);
 }
 
 static int unix_shutdown(struct socket *sock, int mode)
-- 
2.1.0



Re: Virtio_net support vxlan encapsulation package TSO offload discuss

2016-11-17 Thread Jarno Rajahalme
I worked on the same issue a few months back. I rebased my proof-of-concept 
code to the current net-next and posted an RFC patch a moment ago.

I have zero experience on QEMU feature negotiation or extending the virtio_net 
spec. Since the virtio_net handling code is now all done using shared code, 
this should work for macvtap as well, not sure if macvtap needs some control 
plane changes.

I posted a separate patch to make af_packet also use the shared infra for 
virtio_net handling yesterday. My RFC patch assumes that af_packet need not be 
touched, i.e., assumes the af_packet patch is applied, even though the patches 
apply to net-next in either order.

  Jarno

> On Nov 16, 2016, at 11:27 PM, Jason Wang  wrote:
> 
> 
> 
> On 2016年11月17日 09:31, Zhangming (James, Euler) wrote:
>> On 2016年11月15日 11:28, Jason Wang wrote:
>>> On 2016年11月10日 14:19, Zhangming (James, Euler) wrote:
 On 2016年11月09日 15:14, Jason Wang wrote:
> On 2016年11月08日 19:58, Zhangming (James, Euler) wrote:
>> On 2016年11月08日 19:17, Jason Wang wrote:
>> 
>>> On 2016年11月08日 19:13, Jason Wang wrote:
 Cc Michael
 
 On 2016年11月08日 16:34, Zhangming (James, Euler) wrote:
> In the container scenario, OVS is installed in the Virtual machine,
> and all the containers connected to the OVS will communicate
> through VXLAN encapsulation.
> 
> By now, virtio_net does not support TSO offload for VXLAN
> encapsulated TSO packets. In this condition, the performance is
> not good; the sender is the bottleneck
> 
> I googled this scenario, but I didn’t find any information. Will
> virtio_net support VXLAN encapsulation package TSO offload later?
> 
 Yes and for both sender and receiver.
 
> My idea is for virtio_net to enable encapsulated TSO offload and
> pass the encapsulation info to TUN; TUN will parse the info and
> build the skb with the encapsulation info.
> 
> OVS or the kernel on the host should be modified to support this.
> Using this method, the TCP performance is more than 2x as before.
> 
> Any advice and suggestions for this idea or new idea will be
> greatly appreciated!
> 
> Best regards,
> 
>  James zhang
> 
 Sounds very good. And we may also need features bits
 (VIRTIO_NET_F_GUEST|HOST_GSO_X) for this.
 
 This is in fact one of items in networking todo list. (See
 http://www.linux-kvm.org/page/NetworkingTodo). While at it, we'd
 better support not only VXLAN but also other tunnels.
>>> Cc Vlad who is working on extending virtio-net headers.
>>> 
 We can start with the spec work, or if you've already had some
 bits you can post them as RFC for early review.
 
 Thanks
>> Below is my demo code
>> Virtio_net.c
>> static int virtnet_probe(struct virtio_device *vdev), add belows codes:
>>   if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF) ||
>> // avoid gso segment, it should be negotiation 
>> later, because in the demo I reuse num_buffers.
>>   virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
>>   dev->hw_enc_features |= NETIF_F_TSO;
>>   dev->hw_enc_features |= NETIF_F_ALL_CSUM;
>>   dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL;
>>   dev->hw_enc_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>   dev->hw_enc_features |=
>> NETIF_F_GSO_TUNNEL_REMCSUM;
>> 
>>   dev->features |= NETIF_F_GSO_UDP_TUNNEL;
>>   dev->features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
>>   dev->features |= NETIF_F_GSO_TUNNEL_REMCSUM;
>>   }
>> 
>> static int xmit_skb(struct send_queue *sq, struct sk_buff *skb), add
>> below to pieces of codes
>> 
>>   if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL)
>>   hdr->hdr.gso_type |= VIRTIO_NET_HDR_GSO_TUNNEL;
>>   if (skb_shinfo(skb)->gso_type & 
>> SKB_GSO_UDP_TUNNEL_CSUM)
>>   hdr->hdr.gso_type |= 
>> VIRTIO_NET_HDR_GSO_TUNNEL_CSUM;
>>   if (skb_shinfo(skb)->gso_type & SKB_GSO_TUNNEL_REMCSUM)
>>   hdr->hdr.gso_type |=
>> VIRTIO_NET_HDR_GSO_TUNNEL_REMCSUM;
>> 
>>   if (skb->encapsulation && skb_is_gso(skb)) {
>>   inner_mac_len = skb_inner_network_header(skb) - 
>> skb_inner_mac_header(skb);
>>   tnl_len = skb_inner_mac_header(skb) - 
>> skb_mac_header(skb);
>>   if ( !(inner_mac_len >> DATA_LEN_SHIFT) && !(tnl_len 
>> >> DATA_LEN_SHIFT) ) {
>>   hdr->hdr.flags |= 
>> VIRTIO

Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 01:44 PM, Eric Dumazet wrote:

because netperf sends the same message
over and over...


Well, sort of, by default.  That can be altered to a degree.

The global -F option should cause netperf to fill the buffers in its 
send ring with data from the specified file.  The number of buffers in 
the send ring can be controlled via the global -W option.  The number of 
elements in the ring will default to one more than the initial SO_SNDBUF 
size divided by the send size.
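
With the values in the strace below (the default SO_SNDBUF of 212992
bytes and 1472-byte sends), that default works out to 212992/1472 + 1 =
145 buffers in the ring.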


raj@tardy:~/netperf2_trunk$ strace -v -o /tmp/netperf.strace src/netperf -F src/nettest_omni.c -t UDP_STREAM -l 1 -- -m 1472


...

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 4
getsockopt(4, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
getsockopt(4, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0

setsockopt(4, SOL_SOCKET, SO_DONTROUTE, [1], 4) = 0
setsockopt(4, SOL_IP, IP_RECVERR, [1], 4) = 0
open("src/nettest_omni.c", O_RDONLY)= 5
fstat(5, {st_dev=makedev(8, 2), st_ino=82075297, st_mode=S_IFREG|0664, st_nlink=1, st_uid=1000, st_gid=1000, st_blksize=4096, st_blocks=456, st_size=230027, st_atime=2016/11/16-09:49:29, st_mtime=2016/11/16-09:49:24, st_ctime=2016/11/16-09:49:24}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f3099f62000

read(5, "#ifdef HAVE_CONFIG_H\n#include rt_sigaction(SIGALRM, {0x402ea6, [ALRM], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0
rt_sigaction(SIGINT, {0x402ea6, [INT], SA_RESTORER|SA_INTERRUPT, 
0x7f30994a7cb0}, NULL, 8) = 0

alarm(1)= 0
sendto(4, "#ifdef HAVE_CONFIG_H\n#include {sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " used\\n\\\n-m local,remote   S"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, " do here but clear the legacy fl"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "e before we scan the test-specif"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472
sendto(4, "\n\n\tfprintf(where,\n\t\ttput_fmt_1_l"..., 1472, 0, 
{sa_family=AF_INET, sin_port=htons(58088), 
sin_addr=inet_addr("127.0.0.1")}, 16) = 1472


Of course, it will continue to send the same messages from the send_ring 
over and over instead of putting different data into the buffers each 
time, but if one has a sufficiently large -W option specified...

happy benchmarking,

rick jones


[RFC PATCH net-next] virtio_net: Support UDP Tunnel offloads.

2016-11-17 Thread Jarno Rajahalme
This patch is a proof-of-concept I did a few months ago for UDP tunnel
offload support in the virtio_net interface, rebased onto the current
net-next.

Real implementation needs to extend the virtio_net header rather than
piggy-backing on existing fields.  Inner MAC length (or inner network
offset) also needs to be passed as a new field.  Control plane (QEMU)
also needs to be updated.

All testing was done using Geneve, but this should work for all UDP
tunnels the same.

Signed-off-by: Jarno Rajahalme 
---
 drivers/net/tun.c   |  7 -
 drivers/net/virtio_net.c| 16 +++---
 include/linux/skbuff.h  |  5 
 include/linux/virtio_net.h  | 66 ++---
 include/uapi/linux/virtio_net.h |  7 +
 5 files changed, 78 insertions(+), 23 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1588469..36f3219 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -198,7 +198,9 @@ struct tun_struct {
struct net_device   *dev;
netdev_features_t   set_features;
 #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
- NETIF_F_TSO6|NETIF_F_UFO)
+  NETIF_F_TSO6|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL| \
+  NETIF_F_GSO_UDP_TUNNEL_CSUM| \
+  NETIF_F_GSO_TUNNEL_REMCSUM)
 
int align;
int vnet_hdr_sz;
@@ -1877,6 +1879,9 @@ static int set_offload(struct tun_struct *tun, unsigned long arg)
 
if (arg & TUN_F_UFO) {
features |= NETIF_F_UFO;
+#if 1
+   features |= NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM;
+#endif
arg &= ~TUN_F_UFO;
}
}
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ca5239a..eb8d887 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1789,7 +1789,10 @@ static int virtnet_probe(struct virtio_device *vdev)
 
if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {
dev->hw_features |= NETIF_F_TSO | NETIF_F_UFO
-   | NETIF_F_TSO_ECN | NETIF_F_TSO6;
+   | NETIF_F_TSO_ECN | NETIF_F_TSO6
+   | NETIF_F_GSO_UDP_TUNNEL
+   | NETIF_F_GSO_UDP_TUNNEL_CSUM
+   | NETIF_F_GSO_TUNNEL_REMCSUM;
}
/* Individual feature bits: what can host handle? */
if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_TSO4))
@@ -1798,13 +1801,18 @@ static int virtnet_probe(struct virtio_device *vdev)
dev->hw_features |= NETIF_F_TSO6;
if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_ECN))
dev->hw_features |= NETIF_F_TSO_ECN;
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_UFO)) {
dev->hw_features |= NETIF_F_UFO;
-
+#if 1
+   dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL;
+   dev->hw_features |= NETIF_F_GSO_UDP_TUNNEL_CSUM;
+   dev->hw_features |= NETIF_F_GSO_TUNNEL_REMCSUM;
+#endif
+   }
dev->features |= NETIF_F_GSO_ROBUST;
 
if (gso)
-   dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO);
+   dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO|NETIF_F_GSO_UDP_TUNNEL|NETIF_F_GSO_UDP_TUNNEL_CSUM|NETIF_F_GSO_TUNNEL_REMCSUM);
/* (!csum && gso) case will be fixed by register_netdev() */
}
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a4aeeca..992ad30 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2115,6 +2115,11 @@ static inline unsigned char *skb_inner_mac_header(const struct sk_buff *skb)
return skb->head + skb->inner_mac_header;
 }
 
+static inline int skb_inner_mac_offset(const struct sk_buff *skb)
+{
+   return skb_inner_mac_header(skb) - skb->data;
+}
+
 static inline void skb_reset_inner_mac_header(struct sk_buff *skb)
 {
skb->inner_mac_header = skb->data - skb->head;
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c912f8..17384d1 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -8,10 +8,19 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
const struct virtio_net_hdr *hdr,
bool little_endian)
 {
-   unsigned short gso_type = 0;
+   u16 start = __virtio16_to_cpu(little_endian, hdr->csum_start);
+
+   if (hdr->flags & VIRTIO_NET_HDR_F_

Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"

2016-11-17 Thread Cong Wang
On Thu, Nov 17, 2016 at 2:30 PM, Colin Cross  wrote:
> On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang  wrote:
>> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
>> converts schedule_timeout() to its freezable version, it was probably
>> correct at that time, but later, commit 2b514574f7e8
>> ("net: af_unix: implement splice for stream af_unix sockets") breaks
>> the strong requirement for a freezable sleep, according to
>> commit 0f9548ca1091:
>>
>> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
>> deadlock if the lock is later acquired in the suspend or hibernate path
>> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
>> cgroup_freezer if a lock is held inside a frozen cgroup that is later
>> acquired by a process outside that group.
>>
>> The pipe_lock is still held at that point. So just revert commit 2b15af6f95.
>
> On my phone 77 threads are blocked in unix_stream_recvmsg.  A simple
> revert of this patch will cause every one of those threads to wake up
> twice per suspend cycle, which can be multiple times a second.  How
> about adding a freezable flag to unix_stream_read_state so
> unix_stream_recvmsg can stay freezable, and unix_stream_splice_read
> can be unfreezable?

Fair enough, I didn't know it could have such an impact.
I will send v2.

Thanks.


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Alexander Duyck
On Thu, Nov 17, 2016 at 9:34 AM, David Laight  wrote:
> From: Jesper Dangaard Brouer
>> Sent: 17 November 2016 14:58
>> On Thu, 17 Nov 2016 06:17:38 -0800
>> Eric Dumazet  wrote:
>>
>> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
>> >
>> > > I can see that qdisc layer does not activate xmit_more in this case.
>> > >
>> >
>> > Sure. Not enough pressure from the sender(s).
>> >
>> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
>> > limit is kept at a small value.
>> >
>> > (BTW not all NIC have expensive doorbells)
>>
>> I believe this NIC mlx5 (50G edition) does.
>>
>> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
>> 2414 cycles(tsc) 603.86 ns
>>
>> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
>>
>>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>>Overhead  CommandShared ObjectSymbol
>>  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
>>- _raw_spin_lock
>>   + 90.78% __dev_queue_xmit
>>   + 7.83% dev_queue_xmit
>>   + 1.30% ___slab_alloc
>>  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w
>>  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
>>  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup
>>  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
>>  +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash
>>  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
>>
>> In this setup the spinlock in __dev_queue_xmit should be uncongested.
>> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
>>
>> But 8.92% of the time is spend on it, which corresponds to a cost of 215
>> cycles (2414*0.0892).  This cost is too high, thus something else is
>> going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Try adding code to ring the doorbell twice.
> If this doesn't slow things down then it isn't (likely to be) responsible
> for the delay you are seeing.
>
> David
>

The problem isn't only the doorbell.  It is doorbell plus a locked
transaction on x86 results in a long wait until the doorbell write has
been completed.

You could batch a bunch of doorbell writes together and it isn't an
issue unless you do something like writel(), wmb(), writel(), wmb(),
then you will see the effect double since the write memory barrier is
what is forcing the delays.
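
As a sketch of the batching idea (hypothetical helpers, not any
specific driver):

	/* Fill all descriptors first; no MMIO inside the loop. */
	for (i = 0; i < n; i++)
		fill_tx_desc(ring, skbs[i]);

	/* One barrier and one doorbell for the whole burst. */
	wmb();
	writel(ring->tail, ring->doorbell);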

- Alex


Re: [Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"

2016-11-17 Thread Colin Cross
On Thu, Nov 17, 2016 at 2:09 PM, Cong Wang  wrote:
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably
> correct at that time, but later, commit 2b514574f7e8
> ("net: af_unix: implement splice for stream af_unix sockets") breaks
> the strong requirement for a freezable sleep, according to
> commit 0f9548ca1091:
>
> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
> deadlock if the lock is later acquired in the suspend or hibernate path
> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
> cgroup_freezer if a lock is held inside a frozen cgroup that is later
> acquired by a process outside that group.
>
> The pipe_lock is still held at that point. So just revert commit 2b15af6f95.

On my phone 77 threads are blocked in unix_stream_recvmsg.  A simple
revert of this patch will cause every one of those threads to wake up
twice per suspend cycle, which can be multiple times a second.  How
about adding a freezable flag to unix_stream_read_state so
unix_stream_recvmsg can stay freezable, and unix_stream_splice_read
can be unfreezable?


Re: net: BUG still has locks held in unix_stream_splice_read

2016-11-17 Thread Hannes Frederic Sowa
On 17.11.2016 22:44, Cong Wang wrote:
> On Sun, Oct 9, 2016 at 8:14 PM, Al Viro  wrote:
>> E.g what will happen if some code does a read on AF_UNIX socket with
>> some local mutex held?  AFAICS, there are exactly two callers of
>> freezable_schedule_timeout() - this one and one in XFS; the latter is
>> in a kernel thread where we do have good warranties about the locking
>> environment, but here it's in the bleeding ->recvmsg/->splice_read and
>> for those assumption that caller doesn't hold any locks is pretty
>> strong, especially since it's not documented anywhere.
>>
>> What's going on there?
> 
> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably correct
> at that time, but later, commit 2b514574f7e88c8498027ee366
> ("net: af_unix: implement splice for stream af_unix sockets") breaks its
> requirement for a freezable sleep:
> 
> commit 0f9548ca10916dec166eaf74c816bded7d8e611d
> 
> lockdep: check that no locks held at freeze time
> 
> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
> deadlock if the lock is later acquired in the suspend or hibernate path
> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
> cgroup_freezer if a lock is held inside a frozen cgroup that is later
> acquired by a process outside that group.
> 
> So probably we just need to revert commit 2b15af6f95 now.
> 
> I am going to send a revert for at least -net and -stable, since Dmitry
> saw this warning again.

I am not an expert on freezing but this looks about right from the
freezer code's point of view. Awesome, thanks a lot for spotting this one!





Re: Debugging Ethernet issues

2016-11-17 Thread Måns Rullgård
Florian Fainelli  writes:

> On 11/14/2016 11:00 AM, Måns Rullgård wrote:
>> Florian Fainelli  writes:
>> 
>>> On 11/14/2016 10:20 AM, Florian Fainelli wrote:
 On 11/14/2016 09:59 AM, Sebastian Frias wrote:
> On 11/14/2016 06:32 PM, Florian Fainelli wrote:
>> On 11/14/2016 07:33 AM, Mason wrote:
>>> On 14/11/2016 15:58, Mason wrote:
>>>
 nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control 
 rx/tx
 vs
 nb8800 26000.ethernet eth0: Link is Up - 100Mbps/Full - flow control 
 off

 I'm not sure whether "flow control" is relevant...
>>>
>>> Based on phy_print_status()
>>> phydev->pause ? "rx/tx" : "off"
>>> I added the following patch.
>>>
>>> diff --git a/drivers/net/ethernet/aurora/nb8800.c 
>>> b/drivers/net/ethernet/aurora/nb8800.c
>>> index defc22a15f67..4e758c1cfa4e 100644
>>> --- a/drivers/net/ethernet/aurora/nb8800.c
>>> +++ b/drivers/net/ethernet/aurora/nb8800.c
>>> @@ -667,6 +667,8 @@ static void nb8800_link_reconfigure(struct 
>>> net_device *dev)
>>> struct phy_device *phydev = priv->phydev;
>>> int change = 0;
>>>  
>>> +   printk("%s from %pf\n", __func__, __builtin_return_address(0));
>>> +
>>> if (phydev->link) {
>>> if (phydev->speed != priv->speed) {
>>> priv->speed = phydev->speed;
>>> @@ -1274,9 +1276,9 @@ static int nb8800_hw_init(struct net_device *dev)
>>> nb8800_writeb(priv, NB8800_PQ2, val & 0xff);
>>>  
>>> /* Auto-negotiate by default */
>>> -   priv->pause_aneg = true;
>>> -   priv->pause_rx = true;
>>> -   priv->pause_tx = true;
>>> +   priv->pause_aneg = false;
>>> +   priv->pause_rx = false;
>>> +   priv->pause_tx = false;
>>>  
>>> nb8800_mc_init(dev, 0);
>>>  
>>>
>> 
>> [...]
>> 
>> And the time difference is clearly accounted for by auto-negotiation
>> time here; as you can see, it takes about 3 seconds for Gigabit
>> Ethernet to auto-negotiate, and that seems completely acceptable and
>> normal to me since it is a more involved process than at lower speeds.
>>
>>>
>>>
>>> OK, so now it works (by accident?) even on a 100 Mbps switch, but it still
>>> prints "flow control rx/tx"...
>>
>> Because your link partner advertises flow control, and that's what
>> phydev->pause and phydev->asym_pause report (I know it's confusing, but
>> that's what it is at the moment).
>
> Thanks.
> Could you confirm that Mason's patch is correct and/or that it does not
> have negative side-effects?

 The patch is neither correct nor incorrect per se; it changes the default
 policy from having pause frames advertised by default to not having them
 advertised by default.
>> 
>> I was advised to advertise flow control by default back when I was
>> working on the driver, and I think it makes sense to do so.
>> 
 This influences both your Ethernet MAC and the link partner in that
 the result is either flow control is enabled (before) or it is not
 (with the patch). There must be something amiss if you see packet
 loss or some kind of problem like that with an early exchange such as
 DHCP. Flow control tends to kick in under higher packet rates (at
 least, that's what you expect).

>
> Right now we know that Mason's patch makes this work, but we do not
> understand why nor its implications.

 You need to understand why. Right now, the way this problem is
 presented, you came up with a workaround, not the root cause or the
 solution. What does your link partner (switch?) report, that is, what
 is the ethtool output when you have a link up from your nb8800 adapter?
>>>
>>> Actually, nb8800_pause_config() seems to be doing a complete MAC/DMA
>>> reconfiguration when pause frames get auto-negotiated while the link is
>>> UP,
>> 
>> This is due to a silly hardware limitation.  The register containing the
>> flow control bits can't be written while rx is enabled.
>
> You do a DMA stop, but you don't disable the MAC receiver, unlike what
> nb8800_stop() does. Why is calling nb8800_mac_rx() not necessary here?

Oh, right.  That's because the RXC_CR register (where the flow control
bits are) can't be modified when the RCR_EN bit (rx dma enable) is set.
The MAC core register controlled by nb8800_mac_rx() doesn't matter here.
There is no way of changing the flow control setting without briefly
stopping rx dma.

None of this should be relevant here though since everything should be
all set up before dma is enabled the first time.
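
For the archive, the constraint boils down to a sequence like the sketch
below (not the driver's exact code; nb8800_modify(), nb8800_dma_start()
and the RCR_FL bit are stand-in names):

/* RXC_CR only accepts writes while rx DMA (RCR_EN) is clear, so a
 * flow-control change has to bracket the register write with a DMA
 * stop/start. The MAC receiver (nb8800_mac_rx()) can stay enabled.
 */
static void nb8800_set_pause(struct net_device *dev, bool pause_tx)
{
	struct nb8800_priv *priv = netdev_priv(dev);

	nb8800_dma_stop(dev);				/* clears RCR_EN */
	nb8800_modify(priv, NB8800_RXC_CR, RCR_FL,	/* hypothetical */
		      pause_tx ? RCR_FL : 0);
	nb8800_dma_start(dev);				/* hypothetical */
}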

>>> and it does not differentiate being called from
>>> ethtool::set_pauseparam or the PHYLIB adjust_link callback (which it
>>> probably should),
>> 
>> Differentiate how?
>
> Differentiate in that when you are called from

Re: [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver

2016-11-17 Thread Florian Fainelli
On 11/17/2016 02:10 PM, woojung@microchip.com wrote:
> From: Woojung Huh 
> 
> Relocate mdix code to phy driver to be called at config_init().
> 
> Signed-off-by: Woojung Huh 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver

2016-11-17 Thread Woojung.Huh
From: Woojung Huh 

Relocate mdix code to phy driver to be called at config_init().

Signed-off-by: Woojung Huh 
---
 drivers/net/phy/microchip.c | 36 +-
 drivers/net/usb/lan78xx.c   | 73 ++---
 2 files changed, 38 insertions(+), 71 deletions(-)

diff --git a/drivers/net/phy/microchip.c b/drivers/net/phy/microchip.c
index 7c00e50..eb4db22 100644
--- a/drivers/net/phy/microchip.c
+++ b/drivers/net/phy/microchip.c
@@ -106,6 +106,40 @@ static int lan88xx_set_wol(struct phy_device *phydev,
return 0;
 }
 
+static void lan88xx_set_mdix(struct phy_device *phydev)
+{
+   int buf;
+   int val;
+
+   switch (phydev->mdix) {
+   case ETH_TP_MDI:
+   val = LAN88XX_EXT_MODE_CTRL_MDI_;
+   break;
+   case ETH_TP_MDI_X:
+   val = LAN88XX_EXT_MODE_CTRL_MDI_X_;
+   break;
+   case ETH_TP_MDI_AUTO:
+   val = LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_;
+   break;
+   default:
+   return;
+   }
+
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1);
+   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
+   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
+   buf |= val;
+   phy_write(phydev, LAN88XX_EXT_MODE_CTRL, buf);
+   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
+}
+
+static int lan88xx_config_aneg(struct phy_device *phydev)
+{
+   lan88xx_set_mdix(phydev);
+
+   return genphy_config_aneg(phydev);
+}
+
 static struct phy_driver microchip_phy_driver[] = {
 {
.phy_id = 0x0007c130,
@@ -120,7 +154,7 @@ static struct phy_driver microchip_phy_driver[] = {
.remove = lan88xx_remove,
 
.config_init= genphy_config_init,
-   .config_aneg= genphy_config_aneg,
+   .config_aneg= lan88xx_config_aneg,
.read_status= genphy_read_status,
 
.ack_interrupt  = lan88xx_phy_ack_interrupt,
diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index cf2857f..0c459e9 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -1471,62 +1471,12 @@ static void lan78xx_set_msglevel(struct net_device 
*net, u32 level)
dev->msg_enable = level;
 }
 
-static int lan78xx_get_mdix_status(struct net_device *net)
-{
-   struct phy_device *phydev = net->phydev;
-   int buf;
-
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_1);
-   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS, LAN88XX_EXT_PAGE_SPACE_0);
-
-   return buf;
-}
-
-static void lan78xx_set_mdix_status(struct net_device *net, __u8 mdix_ctrl)
-{
-   struct lan78xx_net *dev = netdev_priv(net);
-   struct phy_device *phydev = net->phydev;
-   int buf;
-
-   if (mdix_ctrl == ETH_TP_MDI) {
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_1);
-   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
- buf | LAN88XX_EXT_MODE_CTRL_MDI_);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_0);
-   } else if (mdix_ctrl == ETH_TP_MDI_X) {
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_1);
-   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
- buf | LAN88XX_EXT_MODE_CTRL_MDI_X_);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_0);
-   } else if (mdix_ctrl == ETH_TP_MDI_AUTO) {
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_1);
-   buf = phy_read(phydev, LAN88XX_EXT_MODE_CTRL);
-   buf &= ~LAN88XX_EXT_MODE_CTRL_MDIX_MASK_;
-   phy_write(phydev, LAN88XX_EXT_MODE_CTRL,
- buf | LAN88XX_EXT_MODE_CTRL_AUTO_MDIX_);
-   phy_write(phydev, LAN88XX_EXT_PAGE_ACCESS,
- LAN88XX_EXT_PAGE_SPACE_0);
-   }
-   dev->mdix_ctrl = mdix_ctrl;
-}
-
 static int lan78xx_get_link_ksettings(struct net_device *net,
  struct ethtool_link_ksettings *cmd)
 {
struct lan78xx_net *dev = netdev_priv(net);
struct phy_device *phydev = net->phydev;
int ret;
-   int buf;
 
ret = usb_autopm_get_interface(dev->intf);
if (ret < 0)
@@ -1534,20 +1484,6 @@ static int lan78xx_get_link_ksettings(struct net_device 
*net,
 
ret = phy_ethtool_ksettings_get(phydev, cmd);
 
-   buf = lan78xx_get_mdix_status(net);
-
-   buf 

[Patch net] af_unix: revert "af_unix: use freezable blocking calls in read"

2016-11-17 Thread Cong Wang
Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably
correct at that time, but later, commit 2b514574f7e8
("net: af_unix: implement splice for stream af_unix sockets") breaks
the strong requirement for a freezable sleep, according to
commit 0f9548ca1091:

We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.

The pipe_lock is still held at that point. So just revert commit 2b15af6f95.

Reported-by: Dmitry Vyukov 
Cc: Tejun Heo 
Cc: Colin Cross 
Cc: Rafael J. Wysocki 
Signed-off-by: Cong Wang 
---
 net/unix/af_unix.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5d1c14a..6f37f9e 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -116,7 +116,6 @@
 #include <linux/mount.h>
 #include <net/checksum.h>
 #include <linux/security.h>
-#include <linux/freezer.h>
 
 struct hlist_head unix_socket_table[2 * UNIX_HASH_SIZE];
 EXPORT_SYMBOL_GPL(unix_socket_table);
@@ -2220,7 +2219,7 @@ static long unix_stream_data_wait(struct sock *sk, long 
timeo,
 
sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
unix_state_unlock(sk);
-   timeo = freezable_schedule_timeout(timeo);
+   timeo = schedule_timeout(timeo);
unix_state_lock(sk);
 
if (sock_flag(sk, SOCK_DEAD))
-- 
2.1.0



Re: [net-next PATCH v2] net: dummy: Introduce dummy virtual functions

2016-11-17 Thread Or Gerlitz
On Mon, Nov 14, 2016 at 3:02 PM, Phil Sutter  wrote:

> Due to the assumption that all PFs are PCI devices, this implementation
> is not completely straightforward: In order to allow for
> rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is
> attached to the dummy netdev. This has to happen at the right spot so
> register_netdevice() does not get confused. This patch abuses
> ndo_fix_features callback for that. In ndo_uninit callback, the fake
> parent is removed again for the same purpose.

So you did some mimicking of the PCI interface; how do you let the user
configure the number of VFs? Through a module param? Why? If the module
param only serves to say how many VFs the device supports, maybe
support the maximum possible by the PCI spec and skip the module param?


> +module_param(num_vfs, int, 0);
> +MODULE_PARM_DESC(num_vfs, "Number of dummy VFs per dummy device");
> +

> @@ -190,6 +382,7 @@ static int __init dummy_init_one(void)
> err = register_netdevice(dev_dummy);
> if (err < 0)
> goto err;
> +
> return 0;

nit, remove this added blank line..


Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options

2016-11-17 Thread Jerome Brunet
On Thu, 2016-11-17 at 23:30 +0530, Anand Moon wrote:
> Hi Jerone,
> 
> > > How about adding callback functionality for .soft_reset to handle BMCR
> > > where we update the Auto-Negotiation for the phy,
> > > as per the datasheet of the rtl8211f.

I think BMCR is already pretty well handled by the genphy, don't you
think ?

> > 
> > I'm not sure I understand how this would help with our issue (and
> > EEE).
> > Am I missing something or is it something unrelated that you would
> > like
> > to see happening on this driver ?
> > 
> [snip]
> 
> I was just trying other phy modules to understand the feature.
> But in order to improve the throughput I tried to integrate the below
> u-boot commit.
> 
> commit 3d6af748ebd831524cb22a29433e9092af469ec7
> Author: Shengzhou Liu 
> Date:   Thu Mar 12 18:54:59 2015 +0800
> 
> net/phy: Add support for realtek RTL8211F
> 
> RTL8211F has different registers from RTL8211E.
> This patch adds support for RTL8211F PHY which
> can be found on Freescale's T1023 RDB board.
> 
> And added the similar functionality to  .config_aneg=
> &rtl8211f_config_aneg,
> 

I assume this is the commit you are referring to : 
http://git.denx.de/?p=u-boot.git;a=commit;h=3d6af748ebd831524cb22a29433
e9092af469ec7

I tried looking at this particular commit and the other ones in the
realtek.c history of u-boot. I don't really see anything it does that
linux is not already doing.

> And I seem to have better results in throughput, with periodic drops,
> but it recovers.
> -
> odroid@odroid64:~$ iperf3 -c 10.0.0.102 -p 2006 -i 1 -t 100 -V
> iperf 3.0.11
> Linux odroid64 4.9.0-rc5-xc2ml #18 SMP PREEMPT Thu Nov 17 22:56:00 

[...]

> 
> Test Complete. Summary Results:
> [ ID] Interval   Transfer Bandwidth   Retr
> [  4]   0.00-100.00 sec  10.5 GBytes   902 Mbits/sec    4   sender
> [  4]   0.00-100.00 sec  10.5 GBytes   902 Mbits/sec        receiver
> CPU Utilization: local/sender 5.6% (0.2%u/5.4%s), remote/receiver
> 17.1% (1.2%u/15.9%s)
> 

That's the kind of throughput we have on the C2 once the link is
reliable (with EEE switched off for GbE).

> Can you confirm this at your end.
> Once confirm I will try to send this as a fix for this issue.
> 

I'm testing the code to disable EEE in a generic way. I'll post the RFC
for it asap.
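
For reference while that RFC is pending, the generic approach reduces to
clearing the EEE advertisement in the AN MMD. A sketch (error handling
elided; the helper signatures assume the 4.9-era phylib indirect-MMD
accessors, and the function name is hypothetical):

#include <linux/mdio.h>
#include <linux/phy.h>

/* Stop the PHY from advertising EEE so the link partner can never
 * negotiate it. MDIO_MMD_AN and MDIO_AN_EEE_ADV are the standard
 * Clause 45 identifiers from linux/mdio.h.
 */
static void phy_disable_eee_adv(struct phy_device *phydev)
{
	int adv = phy_read_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
					MDIO_MMD_AN);

	if (adv > 0)
		phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
				       MDIO_MMD_AN, 0);
}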

> -Best Regards
> Anand Moon


Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems

2016-11-17 Thread Jerome Brunet
On Thu, 2016-11-17 at 19:44 +0100, André Roth wrote:
> Hi all,
> 
> > 
> > I checked again the kernel
> > at https://github.com/hardkernel/linux/tree/ odroidc2-3.14.y. The
> > version you mention (3.14.65-73) seems to be:
> > sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb
> > s905
> > kernel-73
> 
> I downloaded the prebuilt image from hardkernel, I did not build the
> kernel myself. but hardkernel has an earlier release of the same
> kernel
> version, which works fine too. I assume they would have committed the
> change in the newer version..
>  
> > 
> > In this particular version, both realtek drivers:
> > - drivers/net/phy/realtek.c
> > - drivers/amlogic/ethernet/phy/am_realtek.c
> > 
> > have the hack to disable 1000M advertisement. I don't understand
> > how
> > it possible for you to have 1000Base-T Full Duplex with this, maybe
> > I'm missing something here ?
> 
> that's what I don't understand as well...
> 
> the patched kernel shows the following:
> 
> $ uname -a
> Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016
> 
> $ sudo ethtool eth0
> Settings for eth0:
> Supported ports: [ TP MII ]
> Supported link modes:   10baseT/Half 10baseT/Full 
> 100baseT/Half 100baseT/Full 
> 1000baseT/Full 
> Supported pause frame use: No
> Supports auto-negotiation: Yes
> Advertised link modes:  10baseT/Half 10baseT/Full 
> 100baseT/Half 100baseT/Full 
> 1000baseT/Full 
> Advertised pause frame use: No
> Advertised auto-negotiation: Yes
> Link partner advertised link modes:  10baseT/Half
> 10baseT/Full 
>  100baseT/Half
> 100baseT/Full 1000baseT/Half 1000baseT/Full 
> Link partner advertised pause frame use: Symmetric Receive-
> only
> Link partner advertised auto-negotiation: Yes
> Speed: 1000Mb/s
> Duplex: Full
> Port: MII
> PHYAD: 0
> Transceiver: external
> Auto-negotiation: on
> Supports Wake-on: ug
> Wake-on: d
> Current message level: 0x003f (63)
>    drv probe link timer ifdown ifup
> Link detected: yes
> 
> $ sudo ethtool --show-eee eth0
> EEE Settings for eth0:
> EEE status: disabled
> Tx LPI: disabled
> Supported EEE link modes:  100baseT/Full 
>    1000baseT/Full 
> Advertised EEE link modes:  100baseT/Full 
> Link partner advertised EEE link modes:  100baseT/Full 
>  1000baseT/Full 
> 
> can it be that "EEE link modes" and the "normal" link modes are two
> different things ? 

Exactly, they are.
Hardkernel's code disables both.

With hardkernel's kernel, you should not have 1000baseT/Full in
"Advertised link modes" and you would have nothing reported in
"Advertised EEE link modes"

> 
> > 
> > If you did compile the kernel yourself, could you check the 2 file
> > mentioned above ? Just to be sure there was no patch applied at the
> > last minute, which would not show up in the git history of
> > hardkernel ?
> 
> I cannot check this easily at the moment..
> 
> Regards,
> 
>  André
> 


Re: net: BUG still has locks held in unix_stream_splice_read

2016-11-17 Thread Cong Wang
On Sun, Oct 9, 2016 at 8:14 PM, Al Viro  wrote:
> E.g what will happen if some code does a read on AF_UNIX socket with
> some local mutex held?  AFAICS, there are exactly two callers of
> freezable_schedule_timeout() - this one and one in XFS; the latter is
> in a kernel thread where we do have good warranties about the locking
> environment, but here it's in the bleeding ->recvmsg/->splice_read and
> for those assumption that caller doesn't hold any locks is pretty
> strong, especially since it's not documented anywhere.
>
> What's going on there?

Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
converts schedule_timeout() to its freezable version, it was probably correct
at that time, but later, commit 2b514574f7e88c8498027ee366
("net: af_unix: implement splice for stream af_unix sockets") breaks its
requirement for a freezable sleep:

commit 0f9548ca10916dec166eaf74c816bded7d8e611d

lockdep: check that no locks held at freeze time

We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
deadlock if the lock is later acquired in the suspend or hibernate path
(e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
cgroup_freezer if a lock is held inside a frozen cgroup that is later
acquired by a process outside that group.

So probably we just need to revert commit 2b15af6f95 now.

I am going to send a revert for at least -net and -stable, since Dmitry
saw this warning again.


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:

> 
> Maybe you can share your udp flood "udpsnd" program source?

Very ugly. This is based on what I wrote when tracking the UDP v6
checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
checksum in skb_checksum_help()), because netperf sends the same message
over and over...


Use -d 2 to remove the ip_idents_reserve() overhead.


#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#include <error.h>
#include <getopt.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <arpa/inet.h>

char buffer[1400];

int main(int argc, char** argv) {
  int fd, i;
  struct sockaddr_in6 addr;
  char *host = "2002:af6:798::1";
  int family = AF_INET6;
  int discover = -1;

  while ((i = getopt(argc, argv, "4H:d:")) != -1) {
switch (i) {
case 'H': host = optarg; break;
case '4': family = AF_INET; break;
case 'd': discover = atoi(optarg); break;
}
  }
  fd = socket(family, SOCK_DGRAM, 0);
  if (fd < 0)
error(1, errno, "failed to create socket");
  if (discover != -1)
setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
   &discover, sizeof(discover));

  memset(&addr, 0, sizeof(addr));
  if (family == AF_INET6) {
  addr.sin6_family = AF_INET6;
  addr.sin6_port = htons(9);
  inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
  } else {
struct sockaddr_in *in = (struct sockaddr_in *)&addr;
in->sin_family = family;
in->sin_port = htons(9);
  inet_pton(family, host, &in->sin_addr);
  }
  connect(fd, (struct sockaddr *)&addr,
  (family == AF_INET6) ? sizeof(addr) :
 sizeof(struct sockaddr_in));
  memset(buffer, 1, 1400);
  for (i = 0; i < 65536; i++) {
memcpy(buffer, &i, sizeof(i));
send(fd, buffer, 100 + rand() % 200, 0);
  }
  return 0;
}
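
Assuming the program is saved as udpsnd.c (file name and target address
here are hypothetical), building and running it against an IPv4 target
with "-d 2" to skip the ip_idents_reserve() overhead looks like:

  gcc -O2 -Wall -o udpsnd udpsnd.c
  ./udpsnd -4 -H 10.246.7.152 -d 2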




Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 10:51:23 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:
> 
> > The point is I can see a socket Send-Q forming, thus we do know the
> > application have something to send. Thus, and possibility for
> > non-opportunistic bulking. Allowing/implementing bulk enqueue from
> > socket layer into qdisc layer, should be fairly simple (and rest of
> > xmit_more is already in place).
> 
> 
> As I said, you are fooled by TX completions.
> 
> Please make sure to increase the sndbuf limits !
> 
> echo 2129920 >/proc/sys/net/core/wmem_default
> 
> lpaa23:~# sar -n DEV 1 10|grep eth1
> 10:49:25 eth1  7.00 9273283.00  0.61 2187214.90  0.00  0.00  0.00
> 10:49:26 eth1  1.00 9230795.00  0.06 2176787.57  0.00  0.00  1.00
> 10:49:27 eth1  2.00 9247906.00  0.17 2180915.45  0.00  0.00  0.00
> 10:49:28 eth1  3.00 9246542.00  0.23 2180790.38  0.00  0.00  1.00
> 10:49:29 eth1  1.00 9239218.00  0.06 2179044.83  0.00  0.00  0.00
> 10:49:30 eth1  3.00 9248775.00  0.23 2181257.84  0.00  0.00  1.00
> 10:49:31 eth1  4.00 9225471.00  0.65 2175772.75  0.00  0.00  0.00
> 10:49:32 eth1  2.00 9253536.00  0.33 2182666.44  0.00  0.00  1.00
> 10:49:33 eth1  1.00 9265900.00  0.06 2185598.40  0.00  0.00  0.00
> 10:49:34 eth1  1.00 6949031.00  0.06 1638889.63  0.00  0.00  1.00
> Average: eth1  2.50 9018045.70  0.25 2126893.82  0.00  0.00  0.50
> 
> 
> lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
>  xmit_more: 2251366909
>  xmit_more: 2256011392
> 
> lpaa23:~# echo 2256011392-2251366909 | bc
> 4644483

xmit_more does not happen that frequently for my setup; it does happen
sometimes. And I do monitor with "ethtool -S".

~/git/network-testing/bin/ethtool_stats.pl --sec 2 --dev mlx5p2
Show adapter(s) (mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p2  ) stat: 92900913 ( 92,900,913) <= tx0_bytes /sec
Ethtool(mlx5p2  ) stat:36073 ( 36,073) <= tx0_nop /sec
Ethtool(mlx5p2  ) stat:  1548349 (  1,548,349) <= tx0_packets /sec
Ethtool(mlx5p2  ) stat:1 (  1) <= tx0_xmit_more /sec
Ethtool(mlx5p2  ) stat: 92884899 ( 92,884,899) <= tx_bytes /sec
Ethtool(mlx5p2  ) stat: 99297696 ( 99,297,696) <= tx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:  1548082 (  1,548,082) <= tx_csum_partial /sec
Ethtool(mlx5p2  ) stat:  1548082 (  1,548,082) <= tx_packets /sec
Ethtool(mlx5p2  ) stat:  1551527 (  1,551,527) <= tx_packets_phy /sec
Ethtool(mlx5p2  ) stat: 99076658 ( 99,076,658) <= tx_prio1_bytes /sec
Ethtool(mlx5p2  ) stat:  1548073 (  1,548,073) <= tx_prio1_packets /sec
Ethtool(mlx5p2  ) stat: 92936078 ( 92,936,078) <= tx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:  1548934 (  1,548,934) <= tx_vport_unicast_packets /sec
Ethtool(mlx5p2  ) stat:1 (  1) <= tx_xmit_more /sec

(after several attempts I got:)
$ ethtool -S mlx5p2|grep more; sleep 1;ethtool -S mlx5p2|grep more
 tx_xmit_more: 14048
 tx0_xmit_more: 14048
 tx_xmit_more: 14049
 tx0_xmit_more: 14049

This was with:
 $ grep -H . /proc/sys/net/core/wmem_default
 /proc/sys/net/core/wmem_default:2129920

>    PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
> -
> 
> 11.64%  [kernel]  [k] skb_set_owner_w   
>  6.21%  [kernel]  [k] queued_spin_lock_slowpath 
>  4.76%  [kernel]  [k] _raw_spin_lock
>  4.40%  [kernel]  [k] __ip_make_skb 
>  3.10%  [kernel]  [k] sock_wfree
>  2.87%  [kernel]  [k] ipt_do_table  
>  2.76%  [kernel]  [k] fq_dequeue
>  2.71%  [kernel]  [k] mlx4_en_xmit  
>  2.50%  [kernel]  [k] __dev_queue_xmit  
>  2.29%  [kernel]  [k] __ip_append_data.isra.40  
>  2.28%  [kernel]  [k] udp_sendmsg   
>  2.01%  [kernel]  [k] __alloc_skb   
>  1.90%  [kernel]  [k] napi_consume_skb  
>  1.63%  [kernel]  [k] udp_send_skb  
>  1.62%  [kernel]  [k] skb_release_data  
>  1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath 
>  1.56%  [kernel]  [k] dev_hard_start_xmit   
>  1.55%  udpsnd[.] __libc_send   
>  1.48%  [kernel]  [k] netif_skb_features
>  1.42%  [kernel]  [k] __qdisc_run   
>  1.35%  [kernel]  [k] sk_dst_check  
> 

Re: [PATCH net 1/1] net sched filters: pass netlink message flags in event notification

2016-11-17 Thread Cong Wang
On Wed, Nov 16, 2016 at 2:16 PM, Roman Mashak  wrote:
> Userland client should be able to read an event, and reflect it back to
> the kernel, therefore it needs to extract complete set of netlink flags.
>
> For example, this will allow "tc monitor" to distinguish Add and Replace
> operations.
>
> Signed-off-by: Roman Mashak 
> Signed-off-by: Jamal Hadi Salim 
> ---
>  net/sched/cls_api.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 2b2a797..8e93d4a 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -112,7 +112,7 @@ static void tfilter_notify_chain(struct net *net, struct 
> sk_buff *oskb,
>
> for (it_chain = chain; (tp = rtnl_dereference(*it_chain)) != NULL;
>  it_chain = &tp->next)
> -   tfilter_notify(net, oskb, n, tp, 0, event, false);
> +   tfilter_notify(net, oskb, n, tp, n->nlmsg_flags, event, 
> false);


I must be missing something: why does it make sense to pass
n->nlmsg_flags as 'fh' to tfilter_notify()?


>  }
>
>  /* Select new prio value from the range, managed by kernel. */
> @@ -430,7 +430,8 @@ static int tfilter_notify(struct net *net, struct sk_buff 
> *oskb,
> if (!skb)
> return -ENOBUFS;
>
> -   if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq, 0, event) 
> <= 0) {
> +   if (tcf_fill_node(net, skb, tp, fh, portid, n->nlmsg_seq,
> + n->nlmsg_flags, event) <= 0) {


This part makes sense.


Re: [PATCH net-next v2 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set

2016-11-17 Thread Daniel Borkmann

On 11/17/2016 10:48 AM, Saeed Mahameed wrote:

On Wed, Nov 16, 2016 at 5:26 PM, Daniel Borkmann  wrote:

On 11/16/2016 03:30 PM, Saeed Mahameed wrote:

On Wed, Nov 16, 2016 at 3:54 PM, Daniel Borkmann 
wrote:

On 11/16/2016 01:25 PM, Saeed Mahameed wrote:

On Wed, Nov 16, 2016 at 2:04 AM, Daniel Borkmann 
wrote:


There are multiple issues in mlx5e_xdp_set():

1) The batched bpf_prog_add() is currently not checked for errors! When
   doing so, it should be done at an earlier point in time to make sure
   that we cannot fail anymore at the time we want to set the program
   for each channel. This only means that we have to undo the
   bpf_prog_add() in case we return due to a required reset.

2) When swapping the priv->xdp_prog, then no extra reference count must
   be taken since we got that from the call path via dev_change_xdp_fd()
   already. Otherwise, we'd never be able to free the program. Also,
   bpf_prog_add() without checking the return code could fail.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann 
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 25 ++-
1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ab0c336..cf26672 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3142,6 +3142,17 @@ static int mlx5e_xdp_set(struct net_device
*netdev, struct bpf_prog *prog)
   goto unlock;
   }

+   if (prog) {
+   /* num_channels is invariant here, so we can take the
+* batched reference right upfront.
+*/
+   prog = bpf_prog_add(prog, priv->params.num_channels);
+   if (IS_ERR(prog)) {
+   err = PTR_ERR(prog);
+   goto unlock;
+   }
+   }
+


With this approach you will end up taking a ref count twice per RQ on
the first time xdp_set is called, i.e. (old_prog = NULL, prog != NULL).
One ref will be taken per RQ/channel from the above code, and since
reset will be TRUE, mlx5e_open_locked will be called and another ref
count will be taken in mlx5e_create_rq.

The issue here is that we have two places for ref count accounting,
xdp_set and  mlx5e_create_rq. Having ref-count updates in
mlx5e_create_rq is essential for num_channels configuration changes
(mlx5e_set_ringparam).

Our previous approach made sure that only one path will do the ref
counting (mlx5e_open_locked vs. mlx5e_xdp_set batch ref update when
reset == false).


That is correct, for a short time bpf_prog_add() was charged also when
we reset. When we reset, we will then jump to unlock_put and safely undo
this since we took ref from mlx5e_create_rq() already in that case, and
return successfully. That was intentional, so that eventually we end up
just taking a /single/ ref per RQ when we exit mlx5e_xdp_set(), but more
below ...

[...]


2. Keep the current approach and make sure not to update the ref count
twice: you can batch update only if (!reset && was_open); otherwise you
can rely on mlx5e_open_locked to take the num_channels ref count for
you.

Personally I prefer option 2, since we will keep the current logic
which will allow configuration updates such as (mlx5e_set_ringparam)
to not worry about ref counting since it will be done in the reset
flow.


... agree on keeping current approach. I actually like the idea, so we'd
end up with this simpler version for the batched ref then.


Great :).

So let's do the batched update only if we are not going to reset (we
already know that in advance), instead of the current patch where you
batch update unconditionally and then unlock_put in case reset was
performed (which is just redundant and confusing).

Please note that if open_locked fails you need to goto unlock_put.


Sorry, I'm not quite sure I follow you here; are you /now/ commenting on
the original patch or on the already updated diff I did from my very last
email, that is, http://patchwork.ozlabs.org/patch/695564/ ?


I am commenting on this patch, "V2 2/4".
This patch takes the batched ref count unconditionally and releases the
extra refs ("unlock_put") if a reset was performed.
I am asking to take the batched ref only if we know in advance that we
are not going to reset, to save the extra "unlock_put".


Okay, sure, it's clear and we discussed about that already earlier, and
my response back then was the diff in the /end/ of this email here:

  https://www.spinics.net/lists/netdev/msg404709.html

Hence my confusion why we were back on the original patch after that. ;)
Anyway, I'll get the updated set with the suggestion out tomorrow, thanks
for the review!
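
For readers of the archive, the agreed direction reduces to taking the
batched reference only when no reset will follow. A sketch (condition
and variable names simplified from mlx5e_xdp_set(); this is not the
final patch):

	/* mlx5e_open_locked() takes per-RQ refs itself on the reset
	 * path, so batch the refs up front only when we keep the
	 * channels as they are.
	 */
	if (was_opened && !reset && prog) {
		/* num_channels is invariant when we skip the reset */
		prog = bpf_prog_add(prog, priv->params.num_channels);
		if (IS_ERR(prog)) {
			err = PTR_ERR(prog);
			goto unlock;
		}
	}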


Note, your "bpf_prog_add(prog, 1) // one ref for the device." is not
needed
since this we already got that one through dev_change_xdp_fd() as
mentioned.


If

[PATCH net-next v4 2/5] net: fsl: Allow most drivers to be built with COMPILE_TEST

2016-11-17 Thread Florian Fainelli
There are only a handful of Freescale Ethernet drivers that don't
actually build with COMPILE_TEST:

* FEC, for which we would need to define a default register layout if no
  supported architecture is defined

* UCC_GETH which depends on PowerPC cpm.h header (which could be moved
  to a generic location)

* GIANFAR needs to depend on HAS_DMA to fix linking failures on some
  architectures (like m32r)

We need to fix an unmet dependency to get there though:
warning: (FSL_XGMAC_MDIO) selects OF_MDIO which has unmet direct
dependencies (OF && PHYLIB)

which would result in CONFIG_OF_MDIO=[ym] without CONFIG_OF being set.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/freescale/Kconfig | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/Kconfig 
b/drivers/net/ethernet/freescale/Kconfig
index aa3f615886b4..0d415516b577 100644
--- a/drivers/net/ethernet/freescale/Kconfig
+++ b/drivers/net/ethernet/freescale/Kconfig
@@ -8,7 +8,7 @@ config NET_VENDOR_FREESCALE
depends on FSL_SOC || QUICC_ENGINE || CPM1 || CPM2 || PPC_MPC512x || \
   M523x || M527x || M5272 || M528x || M520x || M532x || \
   ARCH_MXC || ARCH_MXS || (PPC_MPC52xx && PPC_BESTCOMM) || \
-  ARCH_LAYERSCAPE
+  ARCH_LAYERSCAPE || COMPILE_TEST
---help---
  If you have a network (Ethernet) card belonging to this class, say Y.
 
@@ -65,6 +65,7 @@ config FSL_PQ_MDIO
 config FSL_XGMAC_MDIO
tristate "Freescale XGMAC MDIO"
select PHYLIB
+   depends on OF
select OF_MDIO
---help---
  This driver supports the MDIO bus on the Fman 10G Ethernet MACs, and
@@ -85,6 +86,7 @@ config UGETH_TX_ON_DEMAND
 
 config GIANFAR
tristate "Gianfar Ethernet"
+   depends on HAS_DMA
select FSL_PQ_MDIO
select PHYLIB
select CRC32
-- 
2.9.3



[PATCH net-next v4 3/5] bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info

2016-11-17 Thread Florian Fainelli
In preparation for allowing CONFIG_MVNETA_BM to build with COMPILE_TEST,
provide an inline stub for mvebu_mbus_get_dram_win_info().

Signed-off-by: Florian Fainelli 
---
 include/linux/mbus.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/include/linux/mbus.h b/include/linux/mbus.h
index 2931aa43dab1..0d3f14fd2621 100644
--- a/include/linux/mbus.h
+++ b/include/linux/mbus.h
@@ -82,6 +82,7 @@ static inline int mvebu_mbus_get_io_win_info(phys_addr_t 
phyaddr, u32 *size,
 }
 #endif
 
+#ifdef CONFIG_MVEBU_MBUS
 int mvebu_mbus_save_cpu_target(u32 __iomem *store_addr);
 void mvebu_mbus_get_pcie_mem_aperture(struct resource *res);
 void mvebu_mbus_get_pcie_io_aperture(struct resource *res);
@@ -97,5 +98,12 @@ int mvebu_mbus_init(const char *soc, phys_addr_t 
mbus_phys_base,
size_t mbus_size, phys_addr_t sdram_phys_base,
size_t sdram_size);
 int mvebu_mbus_dt_init(bool is_coherent);
+#else
+static inline int mvebu_mbus_get_dram_win_info(phys_addr_t phyaddr, u8 *target,
+  u8 *attr)
+{
+   return -EINVAL;
+}
+#endif /* CONFIG_MVEBU_MBUS */
 
 #endif /* __LINUX_MBUS_H */
-- 
2.9.3



[PATCH net-next v4 0/5] net: Enable COMPILE_TEST for Marvell & Freescale drivers

2016-11-17 Thread Florian Fainelli
Hi all,

This patch series allows building the Freescale and Marvell Ethernet network
drivers with COMPILE_TEST.

Changes in v4:

- add proper HAS_DMA to fix build errors on m32r
- provide an inline stub for mvebu_mbus_get_dram_win_info
- added an additional patch to fix build errors with mv88e6xxx on m32r

Changes in v3:

- reorder patches to avoid introducing a build warning between commits

Changes in v2:

- rename register define clash when building for i386 (spotted by LKP)


Florian Fainelli (5):
  net: gianfar_ptp: Rename FS bit to FIPERST
  net: fsl: Allow most drivers to be built with COMPILE_TEST
  bus: mvebu-bus: Provide inline stub for mvebu_mbus_get_dram_win_info
  net: marvell: Allow drivers to be built with COMPILE_TEST
  net: dsa: mv88e6xxx: Select IRQ_DOMAIN

 drivers/net/dsa/mv88e6xxx/Kconfig|  1 +
 drivers/net/ethernet/freescale/Kconfig   |  4 +++-
 drivers/net/ethernet/freescale/gianfar_ptp.c |  4 ++--
 drivers/net/ethernet/marvell/Kconfig | 11 +++
 include/linux/mbus.h |  8 
 5 files changed, 21 insertions(+), 7 deletions(-)

-- 
2.9.3



[PATCH net-next v4 4/5] net: marvell: Allow drivers to be built with COMPILE_TEST

2016-11-17 Thread Florian Fainelli
All Marvell Ethernet drivers actually build fine with COMPILE_TEST with
a few warnings. We need to add a few HAS_DMA dependencies to fix linking
failures on problematic architectures like m32r.

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/marvell/Kconfig | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/Kconfig 
b/drivers/net/ethernet/marvell/Kconfig
index 2664827ddecd..d74d4e6f0b34 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -5,7 +5,7 @@
 config NET_VENDOR_MARVELL
bool "Marvell devices"
default y
-   depends on PCI || CPU_PXA168 || MV64X60 || PPC32 || PLAT_ORION || INET
+   depends on PCI || CPU_PXA168 || MV64X60 || PPC32 || PLAT_ORION || INET || COMPILE_TEST
---help---
  If you have a network (Ethernet) card belonging to this class, say Y.
 
@@ -18,7 +18,8 @@ if NET_VENDOR_MARVELL
 
 config MV643XX_ETH
tristate "Marvell Discovery (643XX) and Orion ethernet support"
-   depends on (MV64X60 || PPC32 || PLAT_ORION) && INET
+   depends on (MV64X60 || PPC32 || PLAT_ORION || COMPILE_TEST) && INET
+   depends on HAS_DMA
select PHYLIB
select MVMDIO
---help---
@@ -55,7 +56,8 @@ config MVNETA_BM_ENABLE
 
 config MVNETA
tristate "Marvell Armada 370/38x/XP network interface support"
-   depends on PLAT_ORION
+   depends on PLAT_ORION || COMPILE_TEST
+   depends on HAS_DMA
select MVMDIO
select FIXED_PHY
---help---
@@ -77,7 +79,8 @@ config MVNETA_BM
 
 config MVPP2
tristate "Marvell Armada 375 network interface support"
-   depends on MACH_ARMADA_375
+   depends on MACH_ARMADA_375 || COMPILE_TEST
+   depends on HAS_DMA
select MVMDIO
---help---
  This driver supports the network interface units in the
-- 
2.9.3



[PATCH net-next v4 5/5] net: dsa: mv88e6xxx: Select IRQ_DOMAIN

2016-11-17 Thread Florian Fainelli
Some architectures may not enable IRQ_DOMAIN (like m32r); selecting it
fixes undefined references to IRQ_DOMAIN functions.

Fixes: dc30c35be720 ("net: dsa: mv88e6xxx: Implement interrupt support.")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/mv88e6xxx/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig 
b/drivers/net/dsa/mv88e6xxx/Kconfig
index 486668813e15..1aaa7a95ebc4 100644
--- a/drivers/net/dsa/mv88e6xxx/Kconfig
+++ b/drivers/net/dsa/mv88e6xxx/Kconfig
@@ -1,6 +1,7 @@
 config NET_DSA_MV88E6XXX
tristate "Marvell 88E6xxx Ethernet switch fabric support"
depends on NET_DSA
+   select IRQ_DOMAIN
select NET_DSA_TAG_EDSA
select NET_DSA_TAG_DSA
help
-- 
2.9.3



[PATCH net-next v4 1/5] net: gianfar_ptp: Rename FS bit to FIPERST

2016-11-17 Thread Florian Fainelli
FS is a global symbol used by the x86 32-bit architecture; this fixes
build re-definitions:

>> drivers/net/ethernet/freescale/gianfar_ptp.c:75:0: warning: "FS"
>> redefined
#define FS (1<<28) /* FIPER start indication */

   In file included from arch/x86/include/uapi/asm/ptrace.h:5:0,
from arch/x86/include/asm/ptrace.h:6,
from arch/x86/include/asm/math_emu.h:4,
from arch/x86/include/asm/processor.h:11,
from include/linux/mutex.h:19,
from include/linux/kernfs.h:13,
from include/linux/sysfs.h:15,
from include/linux/kobject.h:21,
from include/linux/device.h:17,
from
drivers/net/ethernet/freescale/gianfar_ptp.c:23:
   arch/x86/include/uapi/asm/ptrace-abi.h:15:0: note: this is the
location of the previous definition
#define FS 9

Reported-by: kbuild test robot 
Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/freescale/gianfar_ptp.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ptp.c 
b/drivers/net/ethernet/freescale/gianfar_ptp.c
index 57798814160d..3e8d1fffe34e 100644
--- a/drivers/net/ethernet/freescale/gianfar_ptp.c
+++ b/drivers/net/ethernet/freescale/gianfar_ptp.c
@@ -72,7 +72,7 @@ struct gianfar_ptp_registers {
 /* Bit definitions for the TMR_CTRL register */
 #define ALM1P (1<<31) /* Alarm1 output polarity */
 #define ALM2P (1<<30) /* Alarm2 output polarity */
-#define FS (1<<28) /* FIPER start indication */
+#define FIPERST   (1<<28) /* FIPER start indication */
 #define PP1L  (1<<27) /* Fiper1 pulse loopback mode enabled. */
 #define PP2L  (1<<26) /* Fiper2 pulse loopback mode enabled. */
 #define TCLK_PERIOD_SHIFT (16) /* 1588 timer reference clock period. */
@@ -502,7 +502,7 @@ static int gianfar_ptp_probe(struct platform_device *dev)
gfar_write(&etsects->regs->tmr_fiper1, etsects->tmr_fiper1);
gfar_write(&etsects->regs->tmr_fiper2, etsects->tmr_fiper2);
set_alarm(etsects);
-   gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|FS|RTPE|TE|FRD);
+   gfar_write(&etsects->regs->tmr_ctrl,   tmr_ctrl|FIPERST|RTPE|TE|FRD);
 
spin_unlock_irqrestore(&etsects->lock, flags);
 
-- 
2.9.3



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Hannes Frederic Sowa
On 17.11.2016 19:32, Ido Schimmel wrote:
> On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa 
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
 The other way is the journal idea I had, which uses an rb-tree with
 timestamps as keys (can be lamport timestamps). You insert into the tree
 until the dump is finished and use it as queue later to shuffle stuff
 into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
> 
> The delayed work queue only postpones the insert into the listener's
> table, but the entries will be present in the kernel's table as usual.
> Therefore, other lookup made on the kernel's table will see the pending
> entries.
> 
> Note that both listeners that currently call the dump API do that during
> their init, before any lookups can be made on their tables. If you think
> it's critical, then we can make sure the workqueues are flushed before
> the listeners register their netdevs and effectively expose their tables
> to lookups.

I do see routing anyway as a best-effort. ;)

That means that before too long the hardware needs to be fully
synchronized with the software path and either have all correct entries,
or abort must have been called.

> I'm looking into the reordering issues you mentioned. I believe it's a
> valid point.

Thanks!

>> For user space queries we would still query the in-kernel table.
> 
> Right. No changes here.
> 
>>> If, once inserted into the pending queue, you return success to the
>>> inserting entity, then you must make those pending entries visible
>>> to lookups.
>>
>> I think this same problem is in this patch set already. The way I
>> understood it is that if a problem emerges during insert, the driver
>> calls abort and every packet will be sent to user space, as the routing
>> table cannot be offloaded and it won't try it again, Ido?
> 
> First of all, this abort mechanism is already in place and isn't part of
> this patchset. Secondly, why is this a problem? The all-or-nothing
> approach is a hard requirement and the current implementation is infinitely
> better than previous approach in which the kernel's tables were flushed
> upon route insertion failure. It rendered the system unusable. Current
> implementation of abort mechanism keeps the system usable, but with
> reduced performance.

Yes, I argued below that I am toally fine with this approach.

>> Probably this is an artifact of the mellanox implementation and we can
>> implement this differently for other cards, but the schema to abort all
>> if the modification doesn't work, seems to be fundamental (I think we
>> require the all-or-nothing approach for now).
> 
> Yes, it's a hard requirement for all implementations. mlxsw implements
> it by evicting all routes from its tables and inserting a default route
> that traps packets to CPU.

Correct, I don't see how fib offloading can do anything else, besides
semantically looking at the routing table and figuring out where to
insert "exceptions" for subtrees while keeping most packets flowing over
the hw directly. But this is for another time... ;)

>>> If you block the inserting entity, well that doesn't make a lot of
>>> sense.  If blocking is a workable solution, then we can just block the
>>> entire insert while this FIB dump to the device driver is happening.
>>
>> I don't think we should really block (as in kernel-block) at any time.
>>
>> I was suggesting something like:
>>
>> rtnl_lock();
>> synchronize_rcu_expedited(); // barrier, all rounting modifications are
>> stable and no new can be added due to rtnl_lock
>> register notifier(); // notifier adds entries also into journal();
>> rtnl_unlock();
>> walk_fib_tree_rcu_into_journal();
>> // walk finished
>> start_syncing_journal_to_hw(); // if new entries show up we sync them
>> asap after this point
>>
>> The journal would need a spin lock to protect its internal state and
>> order events correctly.
>>
>>> But I am pretty sure the idea of blocking modifications for so long
>>> was considered undesirable.
>>
>> Yes, this is also still my opinion.
> 
> It was indeed rejected :)
> https://marc.info/?l=linux-netdev&m=147794848224884&w=2

Bye,
Hannes



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Hannes Frederic Sowa
On 17.11.2016 19:16, David Miller wrote:
> From: Hannes Frederic Sowa 
> Date: Thu, 17 Nov 2016 18:20:39 +0100
> 
>> Hi,
>>
>> On 17.11.2016 17:45, David Miller wrote:
>>> From: Hannes Frederic Sowa 
>>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>>>
 The other way is the journal idea I had, which uses an rb-tree with
 timestamps as keys (can be lamport timestamps). You insert into the tree
 until the dump is finished and use it as queue later to shuffle stuff
 into the hardware.
>>>
>>> If you have this "place" where pending inserts are stored, you have
>>> a policy decision to make.
>>>
>>> First of all what do other lookups see when there are pending entires?
>>
>> I think this is a problem with the current approach already, as the
>> delayed work queue already postpones the insert for an undecidable
>> amount of time (and reorders depending on which CPU the entry was
>> inserted and the fib notifier was called).
>>
>> For user space queries we would still query the in-kernel table.
> 
> Ok, I think I might misunderstand something.
> 
> What is going into this journal exactly?  The actual full software and
> hardware insert operation, or just the notification to the hardware
> device driver notifiers?

The journal is only used as a time-ordered queue for updating the
hardware in the correct order.

The enqueue side is the fib notifier only. If no fib notifier is
registered we don't use this code at all (and also don't hit the lock
protecting this journal in fib insertion/deletion path - fast in-kernel
path is untouched - otherwise just a spin_lock already under rtnl_lock
in slow path).

The fib notifier enqueues the event with a timestamp into this journal
and can also merge entries while they are in the queue, e.g. we got a
delete from the fib notifier but the rcu walk indicated an addition of
the entry, so we can merge that at this point and depending on the
timestamp remove the entry or drop the deletion event.

We start dequeueing the fib entries into the hardware as soon as the rcu
dump is finished; at this point we are up-to-date in the queue with all
events. New events can be added to the journal (with appropriate
locking) during this time; as the queue was once in a properly synced
state, we stay properly synchronized. We keep up with the queue in
steady state after the dump, so syncing happens ASAP. Maybe we can also
drop the journal then.

Something like the described queue is implemented here (I haven't
checked whether it exactly matches the specification; it certainly
provides more features):
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.h
https://github.com/bus1/bus1/blob/master/ipc/bus1/util/queue.c

For this to work the config path needs to add timestamps to the
fib_infos or fib_aliases.
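
To make the shape concrete, a sketch of such a journal (type and field
names are hypothetical, not from an existing API), keyed by the
timestamp so that dequeue order matches configuration order:

struct fib_journal_entry {
	struct rb_node node;
	u64 stamp;	/* lamport-style timestamp, the tree key */
	bool add;	/* addition or deletion event */
	/* plus a copy of the route data to replay into the hardware */
};

/* Insert in timestamp order; the merging described above (a DEL from
 * the notifier cancelling an ADD seen by the RCU walk) would also
 * happen here and is elided.
 */
static void fib_journal_enqueue(struct rb_root *journal,
				struct fib_journal_entry *e)
{
	struct rb_node **p = &journal->rb_node, *parent = NULL;

	while (*p) {
		struct fib_journal_entry *cur;

		parent = *p;
		cur = rb_entry(parent, struct fib_journal_entry, node);
		if (e->stamp < cur->stamp)
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	rb_link_node(&e->node, parent, p);
	rb_insert_color(&e->node, journal);
}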

> The "lookup" I'm mostly concerned with is the fast path where the
> packets being processed actually look up a route.

This doesn't change at all. All code will be hidden in a library that
gets attached to the fib notifier, which is the configuration code path.

> I do not think we can return success on the insert to the user yet
> have the route lookup dataplace not return that route on a lookup.

We don't change kernel fast path at all.

If we add/delete a route in software and hardware, the kernel indicates
success as soon as the software has the entry added. It also gets queued
up in the journal. The journal will be lazily processed; if an error
happens during that (e.g. hardware signals table full), abort is called
and all packets go to user space ASAP. User space will always show the
route as it was added in the first place and, after the driver called
abort, will also process the packets accordingly.

I can imagine this can get very complicated. David's approach with a
counter to check for modifications and a limited number of retries
probably works too, especially because the hardware will probably be
initialized before routing daemons start up and will be synced up
hopefully all the time.

So maybe this is over-engineered, but I have no idea how long the
hardware needs to sync up, e.g., a full IPv4 routing table (if that is
actually the current goal of this).

Bye,
Hannes



Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

2016-11-17 Thread Harini Katakam
Hi Rafal,

On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo  wrote:
> Hello,
> I think there could be a bug in your patch.
>
>> +
>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> + dmacfg |= GEM_BIT(ADDR64);
>> +#endif
>
> You enable 64 bit addressing (64b dma bus width) whenever the
> appropriate architecture config option is enabled.
> But there are some legacy controllers which do not support that feature.
> According to the Cadence hardware team:
> "64 bit addressing was added in July 2013. Earlier version do not have it.
> This feature was enhanced in release August 2014 to have separate upper 
> address values for transmit and receive."
>
>> /* Bitfields in NSR */
>> @@ -474,6 +479,10 @@
>>  struct macb_dma_desc {
>  >  u32 addr;
>>   u32 ctrl;
>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>> + u32 addrh;
>> + u32 resvd;
>> +#endif
>>  };
>
> It will not work for legacy hardware. The old descriptor is 2 words
> wide; the new one is 4 words wide.
> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but the hardware doesn't
> support it at all, you will miss every second descriptor.
>

True, this feature is not available in all Cadence IP versions.
In fact, the IP version in Zynq does not support this. But the one in
ZynqMP does.
So, we enable kernel config for 64 bit DMA addressing for this SoC and hence
the driver picks it up. My assumption was that if the legacy IP does not support
64 bit addressing, then this DMA option wouldn't be enabled.

There is a design config register in the Cadence IP which is read to
check for 64 bit address support - the DMA mask is set based on that.
But the addition of the two descriptor words cannot be based on this
runtime check. For this reason, all the static changes were placed
under the config check, as sketched below.
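
For the curious, the runtime part reads roughly like this (a sketch; the
DAW64 field name is hypothetical and the 44-bit mask width is
illustrative, using the driver's gem_readl()/GEM_BFEXT() accessors):

#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
	/* Probe the design-config register for 64-bit DMA capability and
	 * widen the DMA mask only when the IP actually supports it.
	 */
	if (GEM_BFEXT(DAW64, gem_readl(bp, DCFG6)))
		dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(44));
#endif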

Regards,
Harini


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Eric Dumazet
On Thu, 2016-11-17 at 19:30 +0100, Jesper Dangaard Brouer wrote:

> The point is I can see a socket Send-Q forming, thus we do know the
> application have something to send. Thus, and possibility for
> non-opportunistic bulking. Allowing/implementing bulk enqueue from
> socket layer into qdisc layer, should be fairly simple (and rest of
> xmit_more is already in place).  


As I said, you are fooled by TX completions.

Please make sure to increase the sndbuf limits !

echo 2129920 >/proc/sys/net/core/wmem_default

lpaa23:~# sar -n DEV 1 10|grep eth1
10:49:25 eth1  7.00 9273283.00  0.61 2187214.90  0.00  0.00  0.00
10:49:26 eth1  1.00 9230795.00  0.06 2176787.57  0.00  0.00  1.00
10:49:27 eth1  2.00 9247906.00  0.17 2180915.45  0.00  0.00  0.00
10:49:28 eth1  3.00 9246542.00  0.23 2180790.38  0.00  0.00  1.00
10:49:29 eth1  1.00 9239218.00  0.06 2179044.83  0.00  0.00  0.00
10:49:30 eth1  3.00 9248775.00  0.23 2181257.84  0.00  0.00  1.00
10:49:31 eth1  4.00 9225471.00  0.65 2175772.75  0.00  0.00  0.00
10:49:32 eth1  2.00 9253536.00  0.33 2182666.44  0.00  0.00  1.00
10:49:33 eth1  1.00 9265900.00  0.06 2185598.40  0.00  0.00  0.00
10:49:34 eth1  1.00 6949031.00  0.06 1638889.63  0.00  0.00  1.00
Average: eth1  2.50 9018045.70  0.25 2126893.82  0.00  0.00  0.50


lpaa23:~# ethtool -S eth1|grep more; sleep 1;ethtool -S eth1|grep more
 xmit_more: 2251366909
 xmit_more: 2256011392

lpaa23:~# echo 2256011392-2251366909 | bc
4644483

   PerfTop:   76969 irqs/sec  kernel:96.6%  exact: 100.0% [4000Hz cycles:pp],  (all, 48 CPUs)
---

11.64%  [kernel]  [k] skb_set_owner_w   
 6.21%  [kernel]  [k] queued_spin_lock_slowpath 
 4.76%  [kernel]  [k] _raw_spin_lock
 4.40%  [kernel]  [k] __ip_make_skb 
 3.10%  [kernel]  [k] sock_wfree
 2.87%  [kernel]  [k] ipt_do_table  
 2.76%  [kernel]  [k] fq_dequeue
 2.71%  [kernel]  [k] mlx4_en_xmit  
 2.50%  [kernel]  [k] __dev_queue_xmit  
 2.29%  [kernel]  [k] __ip_append_data.isra.40  
 2.28%  [kernel]  [k] udp_sendmsg   
 2.01%  [kernel]  [k] __alloc_skb   
 1.90%  [kernel]  [k] napi_consume_skb  
 1.63%  [kernel]  [k] udp_send_skb  
 1.62%  [kernel]  [k] skb_release_data  
 1.62%  [kernel]  [k] entry_SYSCALL_64_fastpath 
 1.56%  [kernel]  [k] dev_hard_start_xmit   
 1.55%  udpsnd[.] __libc_send   
 1.48%  [kernel]  [k] netif_skb_features
 1.42%  [kernel]  [k] __qdisc_run   
 1.35%  [kernel]  [k] sk_dst_check  
 1.33%  [kernel]  [k] sock_def_write_space  
 1.30%  [kernel]  [k] kmem_cache_alloc_node_trace   
 1.29%  [kernel]  [k] __local_bh_enable_ip  
 1.21%  [kernel]  [k] copy_user_enhanced_fast_string
 1.08%  [kernel]  [k] __kmalloc_reserve.isra.40 
 1.08%  [kernel]  [k] SYSC_sendto   
 1.07%  [kernel]  [k] kmem_cache_alloc_node 
 0.95%  [kernel]  [k] ip_finish_output2 
 0.95%  [kernel]  [k] ktime_get 
 0.91%  [kernel]  [k] validate_xmit_skb 
 0.88%  [kernel]  [k] sock_alloc_send_pskb  
 0.82%  [kernel]  [k] sock_sendmsg  





Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems

2016-11-17 Thread André Roth

Hi all,

> I checked again the kernel
> at https://github.com/hardkernel/linux/tree/ odroidc2-3.14.y. The
> version you mention (3.14.65-73) seems to be:
> sha1: c75d5f4d1516cdd86d90a9d1c565bb0ed9251036 tag: jenkins-deb s905
> kernel-73

I downloaded the prebuilt image from hardkernel, I did not build the
kernel myself. but hardkernel has an earlier release of the same kernel
version, which works fine too. I assume they would have committed the
change in the newer version..
 
> In this particular version, both realtek drivers:
> - drivers/net/phy/realtek.c
> - drivers/amlogic/ethernet/phy/am_realtek.c
> 
> have the hack to disable 1000M advertisement. I don't understand how
> it is possible for you to have 1000Base-T Full Duplex with this, maybe
> I'm missing something here ?

that's what I don't understand as well...

the patched kernel shows the following:

$ uname -a
Linux T-06 4.9.0-rc4+ #21 SMP PREEMPT Sun Nov 13 12:07:19 UTC 2016

$ sudo ethtool eth0
Settings for eth0:
Supported ports: [ TP MII ]
Supported link modes:   10baseT/Half 10baseT/Full 
100baseT/Half 100baseT/Full 
1000baseT/Full 
Supported pause frame use: No
Supports auto-negotiation: Yes
Advertised link modes:  10baseT/Half 10baseT/Full 
100baseT/Half 100baseT/Full 
1000baseT/Full 
Advertised pause frame use: No
Advertised auto-negotiation: Yes
Link partner advertised link modes:  10baseT/Half 10baseT/Full 
 100baseT/Half
100baseT/Full 1000baseT/Half 1000baseT/Full 
Link partner advertised pause frame use: Symmetric Receive-only
Link partner advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: MII
PHYAD: 0
Transceiver: external
Auto-negotiation: on
Supports Wake-on: ug
Wake-on: d
Current message level: 0x003f (63)
   drv probe link timer ifdown ifup
Link detected: yes

$ sudo ethtool --show-eee eth0
EEE Settings for eth0:
EEE status: disabled
Tx LPI: disabled
Supported EEE link modes:  100baseT/Full 
   1000baseT/Full 
Advertised EEE link modes:  100baseT/Full 
Link partner advertised EEE link modes:  100baseT/Full 
 1000baseT/Full 

can it be that "EEE link modes" and the "normal" link modes are two
different things ? 

> If you did compile the kernel yourself, could you check the 2 file
> mentioned above ? Just to be sure there was no patch applied at the
> last minute, which would not show up in the git history of
> hardkernel ?

I cannot check this easily at the moment..

Regards,

 André



Re: [PATCH net 1/1] net sched filters: pass netlink message flags in event notification

2016-11-17 Thread David Miller
From: Roman Mashak 
Date: Wed, 16 Nov 2016 17:16:10 -0500

> Userland client should be able to read an event, and reflect it back to
> the kernel, therefore it needs to extract complete set of netlink flags.
> 
> For example, this will allow "tc monitor" to distinguish Add and Replace
> operations.
> 
> Signed-off-by: Roman Mashak 
> Signed-off-by: Jamal Hadi Salim 

Looks good, applied.


[PATCH 12/20] net/iucv: Convert to hotplug state machine

2016-11-17 Thread Sebastian Andrzej Siewior
Install the callbacks via the state machine and let the core invoke the
callbacks on the already online CPUs. The smp function calls in the
online/downprep callbacks are not required as the callback is guaranteed to
be invoked on the upcoming/outgoing cpu.

Cc: Ursula Braun 
Cc: "David S. Miller" 
Cc: linux-s...@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior 
---
 include/linux/cpuhotplug.h |   1 +
 net/iucv/iucv.c| 118 +
 2 files changed, 45 insertions(+), 74 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index fd5598b8353a..69abf2c09f6c 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -63,6 +63,7 @@ enum cpuhp_state {
CPUHP_X86_THERM_PREPARE,
CPUHP_X86_CPUID_PREPARE,
CPUHP_X86_MSR_PREPARE,
+   CPUHP_NET_IUCV_PREPARE,
CPUHP_TIMERS_DEAD,
CPUHP_NOTF_ERR_INJ_PREPARE,
CPUHP_MIPS_SOC_PREPARE,
diff --git a/net/iucv/iucv.c b/net/iucv/iucv.c
index 88a2a3ba4212..f0d6afc5d4a9 100644
--- a/net/iucv/iucv.c
+++ b/net/iucv/iucv.c
@@ -639,7 +639,7 @@ static void iucv_disable(void)
put_online_cpus();
 }
 
-static void free_iucv_data(int cpu)
+static int iucv_cpu_dead(unsigned int cpu)
 {
kfree(iucv_param_irq[cpu]);
iucv_param_irq[cpu] = NULL;
@@ -647,9 +647,10 @@ static void free_iucv_data(int cpu)
iucv_param[cpu] = NULL;
kfree(iucv_irq_data[cpu]);
iucv_irq_data[cpu] = NULL;
+   return 0;
 }
 
-static int alloc_iucv_data(int cpu)
+static int iucv_cpu_prepare(unsigned int cpu)
 {
/* Note: GFP_DMA used to get memory below 2G */
iucv_irq_data[cpu] = kmalloc_node(sizeof(struct iucv_irq_data),
@@ -671,58 +672,38 @@ static int alloc_iucv_data(int cpu)
return 0;
 
 out_free:
-   free_iucv_data(cpu);
+   iucv_cpu_dead(cpu);
return -ENOMEM;
 }
 
-static int iucv_cpu_notify(struct notifier_block *self,
-unsigned long action, void *hcpu)
+static int iucv_cpu_online(unsigned int cpu)
 {
-   cpumask_t cpumask;
-   long cpu = (long) hcpu;
-
-   switch (action) {
-   case CPU_UP_PREPARE:
-   case CPU_UP_PREPARE_FROZEN:
-   if (alloc_iucv_data(cpu))
-   return notifier_from_errno(-ENOMEM);
-   break;
-   case CPU_UP_CANCELED:
-   case CPU_UP_CANCELED_FROZEN:
-   case CPU_DEAD:
-   case CPU_DEAD_FROZEN:
-   free_iucv_data(cpu);
-   break;
-   case CPU_ONLINE:
-   case CPU_ONLINE_FROZEN:
-   case CPU_DOWN_FAILED:
-   case CPU_DOWN_FAILED_FROZEN:
-   if (!iucv_path_table)
-   break;
-   smp_call_function_single(cpu, iucv_declare_cpu, NULL, 1);
-   break;
-   case CPU_DOWN_PREPARE:
-   case CPU_DOWN_PREPARE_FROZEN:
-   if (!iucv_path_table)
-   break;
-   cpumask_copy(&cpumask, &iucv_buffer_cpumask);
-   cpumask_clear_cpu(cpu, &cpumask);
-   if (cpumask_empty(&cpumask))
-   /* Can't offline last IUCV enabled cpu. */
-   return notifier_from_errno(-EINVAL);
-   smp_call_function_single(cpu, iucv_retrieve_cpu, NULL, 1);
-   if (cpumask_empty(&iucv_irq_cpumask))
-   smp_call_function_single(
-   cpumask_first(&iucv_buffer_cpumask),
-   iucv_allow_cpu, NULL, 1);
-   break;
-   }
-   return NOTIFY_OK;
+   if (!iucv_path_table)
+   return 0;
+   iucv_declare_cpu(NULL);
+   return 0;
 }
 
-static struct notifier_block __refdata iucv_cpu_notifier = {
-   .notifier_call = iucv_cpu_notify,
-};
+static int iucv_cpu_down_prep(unsigned int cpu)
+{
+   cpumask_t cpumask;
+
+   if (!iucv_path_table)
+   return 0;
+
+   cpumask_copy(&cpumask, &iucv_buffer_cpumask);
+   cpumask_clear_cpu(cpu, &cpumask);
+   if (cpumask_empty(&cpumask))
+   /* Can't offline last IUCV enabled cpu. */
+   return -EINVAL;
+
+   iucv_retrieve_cpu(NULL);
+   if (!cpumask_empty(&iucv_irq_cpumask))
+   return 0;
+   smp_call_function_single(cpumask_first(&iucv_buffer_cpumask),
+iucv_allow_cpu, NULL, 1);
+   return 0;
+}
 
 /**
  * iucv_sever_pathid
@@ -2027,6 +2008,7 @@ struct iucv_interface iucv_if = {
 };
 EXPORT_SYMBOL(iucv_if);
 
+static enum cpuhp_state iucv_online;
 /**
  * iucv_init
  *
@@ -2035,7 +2017,6 @@ EXPORT_SYMBOL(iucv_if);
 static int __init iucv_init(void)
 {
int rc;
-   int cpu;
 
if (!MACHINE_IS_VM) {
rc = -EPROTONOSUPPORT;
@@ -2054,23 +2035,19 @@ static int __init iucv_init(void)
goto out_int;
}
 
-   cpu_notifier_register_b
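
For reference, the registration half of such a conversion follows the usual
cpuhp pattern. A minimal sketch reusing the callbacks above (the helper name
is ours, not taken from this patch):

static int __init iucv_cpuhp_register(void)
{
        int ret;

        /* Called once for every CPU already online at registration
         * time, and again on each CPU that comes up later.
         */
        ret = cpuhp_setup_state(CPUHP_NET_IUCV_PREPARE, "net/iucv:prepare",
                                iucv_cpu_prepare, iucv_cpu_dead);
        if (ret)
                return ret;

        ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "net/iucv:online",
                                iucv_cpu_online, iucv_cpu_down_prep);
        if (ret < 0) {
                cpuhp_remove_state(CPUHP_NET_IUCV_PREPARE);
                return ret;
        }
        iucv_online = ret;      /* dynamic state id, kept for removal */
        return 0;
}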

Re: [PATCH net-next 0/3] RDS: TCP: HA/Failover fixes

2016-11-17 Thread David Miller
From: Sowmini Varadhan 
Date: Wed, 16 Nov 2016 13:29:47 -0800

> This series contains a set of fixes for bugs exposed when
> we ran the following in a loop between a test machine pair:
> 
>  while (1); do
># modprobe rds-tcp on test nodes
># run rds-stress in bi-dir mode between test machine pair 
># modprobe -r rds-tcp on test nodes
>  done
> 
> rds-stress in bi-dir mode will cause both nodes to initiate
> RDS-TCP connections at almost the same instant, exposing the 
> bugs fixed in this series. 
> 
> Without the fixes, rds-stress reports sporadic packet drops,
> and packets arriving out of sequence. After the fixes, we have
> been able to run the test overnight, without any issues.
> 
> Each patch has a detailed description of the root-cause fixed
> by the patch.

Series applied, thank you.


Re: [PATCH 2/3] net: stmmac: replace hardcoded function name by __func__

2016-11-17 Thread David Miller
From: Corentin Labbe 
Date: Wed, 16 Nov 2016 20:09:40 +0100

> From: LABBE Corentin 
> 
> Some printing have the function name hardcoded.
> It is better to use __func__ instead.
> 
> Signed-off-by: Corentin Labbe 

Applied to net-next.


Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread Ido Schimmel
On Thu, Nov 17, 2016 at 06:20:39PM +0100, Hannes Frederic Sowa wrote:
> On 17.11.2016 17:45, David Miller wrote:
> > From: Hannes Frederic Sowa 
> > Date: Thu, 17 Nov 2016 15:36:48 +0100
> > 
> >> The other way is the journal idea I had, which uses an rb-tree with
> >> timestamps as keys (can be lamport timestamps). You insert into the tree
> >> until the dump is finished and use it as queue later to shuffle stuff
> >> into the hardware.
> > 
> > If you have this "place" where pending inserts are stored, you have
> > a policy decision to make.
> > 
> > First of all what do other lookups see when there are pending entires?
> 
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).

The delayed work queue only postpones the insert into the listener's
table, but the entries will be present in the kernel's table as usual.
Therefore, other lookups made on the kernel's table will see the pending
entries.

Note that both listeners that currently call the dump API do that during
their init, before any lookups can be made on their tables. If you think
it's critical, then we can make sure the workqueues are flushed before
the listeners register their netdevs and effectively expose their tables
to lookups.

I'm looking into the reordering issues you mentioned. I believe it's a
valid point.

> For user space queries we would still query the in-kernel table.

Right. No changes here.

> > If, once inserted into the pending queue, you return success to the
> > inserting entity, then you must make those pending entries visible
> > to lookups.
> 
> I think this same problem is in this patch set already. The way I
> understood it is that if a problem during insert emerges, the driver
> calls abort and every packet will be sent to user space, as the routing
> table cannot be offloaded and it won't try it again, Ido?

First of all, this abort mechanism is already in place and isn't part of
this patchset. Secondly, why is this a problem? The all-or-nothing
approach is an hard requirement and current implementation is infinitely
better than previous approach in which the kernel's tables were flushed
upon route insertion failure. It rendered the system unusable. Current
implementation of abort mechanism keeps the system usable, but with
reduced performance.

> Probably this is an artifact of the mellanox implementation and we can
> implement this differently for other cards, but the schema to abort all
> if the modification doesn't work, seems to be fundamental (I think we
> require the all-or-nothing approach for now).

Yes, it's a hard requirement for all implementations. mlxsw implements
it by evicting all routes from its tables and inserting a default route
that traps packets to the CPU.

> > If you block the inserting entity, well that doesn't make a lot of
> > sense.  If blocking is a workable solution, then we can just block the
> > entire insert while this FIB dump to the device driver is happening.
> 
> I don't think we should really block (as in kernel-block) at any time.
> 
> I was suggesting something like:
> 
> rtnl_lock();
> synchronize_rcu_expedited(); // barrier, all routing modifications are
> stable and no new ones can be added due to rtnl_lock
> register notifier(); // notifier adds entries also into journal();
> rtnl_unlock();
> walk_fib_tree_rcu_into_journal();
> // walk finished
> start_syncing_journal_to_hw(); // if new entries show up we sync them
> asap after this point
> 
> The journal would need a spin lock to protect its internal state and
> order events correctly.
> 
> > But I am pretty sure the idea of blocking modifications for so long
> > was considered undesirable.
> 
> Yes, this is also still my opinion.

It was indeed rejected :)
https://marc.info/?l=linux-netdev&m=147794848224884&w=2
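
For concreteness, the journal idea could be shaped roughly like this, using
a plain ordered list where the mail above talks about an rb-tree, to keep
it short. Purely illustrative: the fib_journal_* and hw_program_route()
names are invented for the sketch, not an existing or proposed kernel API.

struct fib_journal_entry {
        struct list_head list;
        u64 seq;                /* lamport-style ordering key */
        unsigned long event;    /* FIB_EVENT_ENTRY_ADD/DEL, ... */
        /* plus a copy of whatever route data the hardware needs */
};

static LIST_HEAD(fib_journal);
static DEFINE_SPINLOCK(fib_journal_lock);
static u64 fib_journal_seq;

/* called from the FIB notifier while the dump is still running */
static void fib_journal_record(unsigned long event)
{
        struct fib_journal_entry *e = kzalloc(sizeof(*e), GFP_ATOMIC);

        if (!e)
                return;         /* a real version must hit the abort path */
        e->event = event;
        spin_lock(&fib_journal_lock);
        e->seq = ++fib_journal_seq;
        list_add_tail(&e->list, &fib_journal);
        spin_unlock(&fib_journal_lock);
}

/* once the dump has finished, drain the queue in order */
static void fib_journal_drain(void)
{
        struct fib_journal_entry *e;

        for (;;) {
                spin_lock(&fib_journal_lock);
                e = list_first_entry_or_null(&fib_journal,
                                             struct fib_journal_entry, list);
                if (e)
                        list_del(&e->list);
                spin_unlock(&fib_journal_lock);
                if (!e)
                        break;
                hw_program_route(e);    /* placeholder for the driver op */
                kfree(e);
        }
}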


Re: [PATCH 1/3] net: stmmac: replace all pr_xxx by their netdev_xxx counterpart

2016-11-17 Thread David Miller
From: Corentin Labbe 
Date: Wed, 16 Nov 2016 20:09:39 +0100

> From: LABBE Corentin 
> 
> The stmmac driver use lots of pr_xxx functions to print information.
> This is bad since we cannot know which device logs the information.
> (moreover if two stmmac devices are present)
> 
> Furthermore, it seems that it assumes wrongly that all logs will always
> be subsequent by using a dev_xxx then some indented pr_xxx like this:
> kernel: sun7i-dwmac 1c5.ethernet: no reset control found
> kernel:  Ring mode enabled
> kernel:  No HW DMA feature register supported
> kernel:  Normal descriptors
> kernel:  TX Checksum insertion supported
> 
> So this patch replaces all pr_xxx by their netdev_xxx counterparts.
> Except for some printing where netdev causes unpretty output like:
> sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): no reset 
> control found
> In those cases, I keep dev_xxx.
> 
> At the same time I remove some "stmmac:" prints since
> they would duplicate what dev_xxx displays.
> 
> Signed-off-by: Corentin Labbe 

Applied to net-next.


Re: [PATCH 3/3] net: stmmac: replace if (netif_msg_type) by their netif_xxx counterpart

2016-11-17 Thread David Miller
From: Corentin Labbe 
Date: Wed, 16 Nov 2016 20:09:41 +0100

> From: LABBE Corentin 
> 
> As suggested by Joe Perches, we could replace all
> if (netif_msg_type(priv)) dev_xxx(priv->devices, ...)
> by the simpler macro netif_xxx(priv, hw, priv->dev, ...)
> 
> Signed-off-by: Corentin Labbe 

Applied to net-next.


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 08:21:19 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet  wrote:
> >   
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >   
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > > > 
> > > 
> > > Sure. Not enough pressure from the sender(s).
> > > 
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > > 
> > > (BTW not all NIC have expensive doorbells)  
> > 
> > I believe this NIC mlx5 (50G edition) does.
> > 
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> > 2414 cycles(tsc) 603.86 ns
> > 
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> > 
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >Overhead  CommandShared ObjectSymbol
> >  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
> >- _raw_spin_lock
> >   + 90.78% __dev_queue_xmit
> >   + 7.83% dev_queue_xmit
> >   + 1.30% ___slab_alloc
> >  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w  
> 
> Does TX completion happens on same cpu ?
> 
> >  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
> >  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup  
> 
> Why fib_table_lookup() is used with connected UDP flow ???

I don't know. I would be interested in hints on how to avoid this!

I also see it with netperf, and my udp_flood code is here:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c
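
For reference, the flood pattern in question uses a connected UDP socket,
roughly like this (a simplified sketch; udp_connect_flood() is our name for
it, see udp_flood.c at the link above for the real code):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int udp_connect_flood(const char *ip, const void *buf, size_t len,
                             unsigned long count)
{
        struct sockaddr_in dst = {
                .sin_family = AF_INET,
                .sin_port   = htons(9),         /* discard service */
        };
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0 || inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
                return -1;
        /* connect() pins the destination, which lets the kernel cache
         * the route on the socket instead of resolving it per packet;
         * that is why fib_table_lookup() in the profile is surprising.
         */
        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                return -1;
        while (count--)
                send(fd, buf, len, 0);  /* send(), not sendto() */
        close(fd);
        return 0;
}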


> >  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
> >  +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash
> 
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by 
ip_route_output_flow().
 
> >  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
> > 
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> > 
> > But 8.92% of the time is spend on it, which corresponds to a cost of 215
> > cycles (2414*0.0892).  This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.  
> 
> Well, with no pressure, doorbell is triggered for each packet.
> 
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is that I can see a socket Send-Q forming, so we do know the
application has something more to send. Thus there is a possibility for
non-opportunistic bulking. Allowing/implementing bulk enqueue from the
socket layer into the qdisc layer should be fairly simple (and the rest of
xmit_more is already in place).
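
For readers following along, the doorbell batching discussed here hinges on
skb->xmit_more at the driver level. A simplified sketch, with foo_* and
struct foo_tx_ring as placeholder names rather than a real driver API:

struct foo_tx_ring {                    /* hypothetical driver state */
        unsigned int index;
        /* descriptor ring, producer pointer, ... */
};

static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct foo_tx_ring *ring = foo_select_ring(dev, skb);
        struct netdev_queue *txq = netdev_get_tx_queue(dev, ring->index);

        foo_post_descriptor(ring, skb);         /* queue it, no MMIO yet */

        /* Ring the (expensive) doorbell only when the stack signals
         * that no further packet is imminent, or the queue is stopped.
         * With qdisc bulk dequeue, a whole batch costs one doorbell.
         */
        if (!skb->xmit_more || netif_xmit_stopped(txq))
                foo_ring_doorbell(ring);

        return NETDEV_TX_OK;
}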


> Note that with the patches I am working on (busypolling extentions),
> we could decide to let the busypoller doing the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting! But that is an opportunistic approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy.

It is a possibility that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment making sure to pin everything
to CPU-0 and the results are the same.

sudo ethtool -L mlx5p2 combined 1

sudo sh -c '\
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
'

$ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating CPU in use:

$ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

# Overhead  CPU  Command Shared ObjectSymbol
   
#   ...  ..  ...  
.
#
 9.97%  000  udp_flood   [kernel.vmlinux] [k] _raw_spin_lock
   
 4.37%  000  udp_flood   [kernel.vmlinux] [k] fib_table_lookup  
   
 3.97%  000  udp_flood   [mlx5_core]  [k] mlx5e_sq_xmit 
   
 3.06%  000  udp_flood   [kernel.vmlinux] [k] __ip_route_output_key_hash
 2.51%  000  udp_flood   [mlx5_core]  [k] mlx5e_poll_tx_cq  
   
 2.48%  000  udp_flood   [kernel.vmlinux] [k] copy_user_enhanced_fast_string
 2.47%  000  udp_flood   [kernel.vmlinux] [k] entry_SYSCALL_64  
   
 2.42%  000  udp_flood   [kernel.vmlinux] [k] udp_sendmsg   
   
 2.39%  000  udp_flood   [kernel.vmlinux] [k] __ip_append_data.isra.47
 2.29%  000  udp_flood   [k

Re: [PATCH net-next] net_sched: sch_fq: use hash_ptr()

2016-11-17 Thread David Miller
From: Eric Dumazet 
Date: Thu, 17 Nov 2016 09:48:30 -0800

> From: Eric Dumazet 
> 
> When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
> and I chose hash_32().
> 
> Linus Torvalds and George Spelvin fixed this issue, so we can
> use hash_ptr() to get more entropy on 64bit arches with Terabytes
> of memory, and avoid the cast games.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Hugh Dickins 

Applied, thanks Eric.


[PATCH v8 0/6] Add eBPF hooks for cgroups

2016-11-17 Thread Daniel Mack
This is v8 of the patch set to allow eBPF programs for network
filtering and accounting to be attached to cgroups, so that they apply
to all sockets of all tasks placed in that cgroup. The logic can also
be extended for other cgroup-based eBPF logic.

Again, only minor details are updated in this version.


Thanks,
Daniel


Changes from v7:

* Replace the static inline function cgroup_bpf_run_filter() with
  two specific macros for ingress and egress.  This addresses David
  Miller's concern regarding skb->sk vs. sk in the egress path.
  Thanks a lot to Daniel Borkmann and Alexei Starovoitov for the
  suggestions.


Changes from v6:

* Rebased to 4.9-rc2

* Add EXPORT_SYMBOL(__cgroup_bpf_run_filter). The kbuild test robot
  now succeeds in building this version of the patch set.

* Switch from bpf_prog_run_save_cb() to bpf_prog_run_clear_cb() to not
  tamper with the contents of skb->cb[]. Pointed out by Daniel
  Borkmann.

* Use sk_to_full_sk() in the egress path, as suggested by Daniel
  Borkmann.

* Renamed BPF_PROG_TYPE_CGROUP_SOCKET to BPF_PROG_TYPE_CGROUP_SKB, as
  requested by David Ahern.

* Added Alexei's Acked-by tags.


Changes from v5:

* The eBPF programs now operate on L3 rather than on L2 of the packets,
  and the egress hooks were moved from __dev_queue_xmit() to
  ip*_output().

* For BPF_PROG_TYPE_CGROUP_SOCKET, disallow direct access to the skb
  through BPF_LD_[ABS|IND] instructions, but hook up the
  bpf_skb_load_bytes() access helper instead. Thanks to Daniel Borkmann
  for the help.


Changes from v4:

* Plug an skb leak when dropping packets due to eBPF verdicts in
  __dev_queue_xmit(). Spotted by Daniel Borkmann.

* Check for sk_fullsock(sk) in __cgroup_bpf_run_filter() so we don't
  operate on timewait or request sockets. Suggested by Daniel Borkmann.

* Add missing @parent parameter in kerneldoc of __cgroup_bpf_update().
  Spotted by Rami Rosen.

* Include linux/jump_label.h from bpf-cgroup.h to fix a kbuild error.


Changes from v3:

* Dropped the _FILTER suffix from BPF_PROG_TYPE_CGROUP_SOCKET_FILTER,
  renamed BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS to
  BPF_CGROUP_INET_{IN,E}GRESS and alias BPF_MAX_ATTACH_TYPE to
  __BPF_MAX_ATTACH_TYPE, as suggested by Daniel Borkmann.

* Dropped the attach_flags member from the anonymous struct for BPF
  attach operations in union bpf_attr. They can be added later on via
  CHECK_ATTR. Requested by Daniel Borkmann and Alexei.

* Release old_prog at the end of __cgroup_bpf_update rather than at
  the beginning to fix a race gap between program updates and their
  users. Spotted by Daniel Borkmann.

* Plugged an skb leak when dropping packets on the egress path.
  Spotted by Daniel Borkmann.

* Add cgro...@vger.kernel.org to the loop, as suggested by Rami Rosen.

* Some minor coding style adoptions not worth mentioning in particular.


Changes from v2:

* Fixed the RCU locking details Tejun pointed out.

* Assert bpf_attr.flags == 0 in BPF_PROG_DETACH syscall handler.


Changes from v1:

* Moved all bpf specific cgroup code into its own file, and stub
  out related functions for !CONFIG_CGROUP_BPF as static inline nops.
  This way, the call sites are not cluttered with #ifdef guards while
  the feature remains compile-time configurable.

* Implemented the new scheme proposed by Tejun. Per cgroup, store one
  set of pointers that are pinned to the cgroup, and one for the
  programs that are effective. When a program is attached or detached,
  the change is propagated to all the cgroup's descendants. If a
  subcgroup has its own pinned program, skip the whole subbranch in
  order to allow delegation models.

* The hookup for egress packets is now done from __dev_queue_xmit().

* A static key is now used in both the ingress and egress fast paths
  to keep performance penalties close to zero if the feature is
  not in use.

* Overall cleanup to make the accessors use the program arrays.
  This should make it much easier to add new program types, which
  will then automatically follow the pinned vs. effective logic.

* Fixed locking issues, as pointed out by Eric Dumazet and Alexei
  Starovoitov. Changes to the program array are now done with
  xchg() and are protected by cgroup_mutex.

* eBPF programs are now expected to return 1 to let the packet pass,
  not >= 0. Pointed out by Alexei.

* Operation is now limited to INET sockets, so local AF_UNIX sockets
  are not affected. The enum members are renamed accordingly. In case
  other socket families should be supported, this can be extended in
  the future.

* The sample program learned to support both ingress and egress, and
  can now optionally make the eBPF program drop packets by making it
  return 0.


Daniel Mack (6):
  bpf: add new prog type for cgroup socket filtering
  cgroup: add support for eBPF programs
  bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands
  net: filter: run cgroup eBPF ingress programs
  net: ipv4, ipv6: run cgroup eBPF egress programs
  samples: bpf: add use

[PATCH v8 2/6] cgroup: add support for eBPF programs

2016-11-17 Thread Daniel Mack
This patch adds two sets of eBPF program pointers to struct cgroup.
One for programs that are directly pinned to the cgroup, and one for
those that are effective for it.

To illustrate the logic behind that, assume the following example
cgroup hierarchy.

  A - B - C
      \ D - E

If only B has a program attached, it will be effective for B, C, D
and E. If D then attaches a program itself, that will be effective for
both D and E, and the program in B will only affect B and C. Only one
program of a given type is effective for a cgroup.
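
In other words, a cgroup's effective program is its own pinned one if
present, else the parent's effective one. Condensed to a sketch (the real
resolution and propagation logic lives in kernel/bpf/cgroup.c, added by
this patch):

static void cgroup_bpf_resolve(struct cgroup *cgrp, struct cgroup *parent,
                               enum bpf_attach_type type)
{
        if (cgrp->bpf.prog[type])
                cgrp->bpf.effective[type] = cgrp->bpf.prog[type];
        else if (parent)
                cgrp->bpf.effective[type] = parent->bpf.effective[type];
        else
                cgrp->bpf.effective[type] = NULL;
}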

Attaching and detaching programs will be done through the bpf(2)
syscall. For now, ingress and egress inet socket filtering are the
only supported use-cases.

Signed-off-by: Daniel Mack 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf-cgroup.h  |  79 +
 include/linux/cgroup-defs.h |   4 ++
 init/Kconfig|  12 
 kernel/bpf/Makefile |   1 +
 kernel/bpf/cgroup.c | 167 
 kernel/cgroup.c |  18 +
 6 files changed, 281 insertions(+)
 create mode 100644 include/linux/bpf-cgroup.h
 create mode 100644 kernel/bpf/cgroup.c

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
new file mode 100644
index 000..ec80d0c
--- /dev/null
+++ b/include/linux/bpf-cgroup.h
@@ -0,0 +1,79 @@
+#ifndef _BPF_CGROUP_H
+#define _BPF_CGROUP_H
+
+#include 
+#include 
+#include 
+
+struct sock;
+struct cgroup;
+struct sk_buff;
+
+#ifdef CONFIG_CGROUP_BPF
+
+extern struct static_key_false cgroup_bpf_enabled_key;
+#define cgroup_bpf_enabled static_branch_unlikely(&cgroup_bpf_enabled_key)
+
+struct cgroup_bpf {
+   /*
+* Store two sets of bpf_prog pointers, one for programs that are
+* pinned directly to this cgroup, and one for those that are effective
+* when this cgroup is accessed.
+*/
+   struct bpf_prog *prog[MAX_BPF_ATTACH_TYPE];
+   struct bpf_prog *effective[MAX_BPF_ATTACH_TYPE];
+};
+
+void cgroup_bpf_put(struct cgroup *cgrp);
+void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent);
+
+void __cgroup_bpf_update(struct cgroup *cgrp,
+struct cgroup *parent,
+struct bpf_prog *prog,
+enum bpf_attach_type type);
+
+/* Wrapper for __cgroup_bpf_update() protected by cgroup_mutex */
+void cgroup_bpf_update(struct cgroup *cgrp,
+  struct bpf_prog *prog,
+  enum bpf_attach_type type);
+
+int __cgroup_bpf_run_filter(struct sock *sk,
+   struct sk_buff *skb,
+   enum bpf_attach_type type);
+
+/* Wrappers for __cgroup_bpf_run_filter() guarded by cgroup_bpf_enabled. */
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb)   \
+({ \
+   int __ret = 0;  \
+   if (cgroup_bpf_enabled) \
+   __ret = __cgroup_bpf_run_filter(sk, skb,\
+   BPF_CGROUP_INET_INGRESS); \
+   \
+   __ret;  \
+})
+
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb)
\
+({ \
+   int __ret = 0;  \
+   if (cgroup_bpf_enabled && sk && sk == skb->sk) {\
+   typeof(sk) __sk = sk_to_full_sk(sk);\
+   if (sk_fullsock(__sk))  \
+   __ret = __cgroup_bpf_run_filter(__sk, skb,  \
+   BPF_CGROUP_INET_EGRESS); \
+   }   \
+   __ret;  \
+})
+
+#else
+
+struct cgroup_bpf {};
+static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
+static inline void cgroup_bpf_inherit(struct cgroup *cgrp,
+ struct cgroup *parent) {}
+
+#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk,skb) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_INET_EGRESS(sk,skb) ({ 0; })
+
+#endif /* CONFIG_CGROUP_BPF */
+
+#endif /* _BPF_CGROUP_H */
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 5b17de6..861b467 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CGROUPS
 
@@ -300,6 +301,9 @@ struct cgroup {
/* used to schedule release agent */
struct work_struct release_agent_work;
 
+   /* used to store eBPF programs */
+   struct cgroup_bpf bpf;
+
/* ids of the ances

[PATCH v8 4/6] net: filter: run cgroup eBPF ingress programs

2016-11-17 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from sk_filter_trim_cap().

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack 
Acked-by: Alexei Starovoitov 
---
 net/core/filter.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index e3813d6..474b486 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -78,6 +78,10 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, 
unsigned int cap)
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
 
+   err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
+   if (err)
+   return err;
+
err = security_sock_rcv_skb(sk, skb);
if (err)
return err;
-- 
2.7.4



[PATCH v8 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-11-17 Thread Daniel Mack
Extend the bpf(2) syscall by two new commands, BPF_PROG_ATTACH and
BPF_PROG_DETACH which allow attaching and detaching eBPF programs
to a target.

On the API level, the target could be anything that has an fd in
userspace, hence the name of the field in union bpf_attr is called
'target_fd'.

When called with BPF_ATTACH_TYPE_CGROUP_INET_{E,IN}GRESS, the target is
expected to be a valid file descriptor of a cgroup v2 directory which
has the bpf controller enabled. These are the only use-cases
implemented by this patch at this point, but more can be added.

If a program of the given type already exists in the given cgroup,
the program is swapped atomically, so userspace does not have to drop
an existing program first before installing a new one, which would
otherwise leave a gap in which no program is attached.

For more information on the propagation logic to subcgroups, please
refer to the bpf cgroup controller implementation.

The API is guarded by CAP_NET_ADMIN.
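
At the syscall level, an attach boils down to filling the new anonymous
struct in union bpf_attr, roughly like this (cgroup_fd and prog_fd obtained
elsewhere; the wrappers added to samples/bpf/libbpf.c in patch 6/6 do
exactly this):

        union bpf_attr attr = {};

        attr.target_fd     = cgroup_fd; /* fd of a cgroup v2 directory */
        attr.attach_bpf_fd = prog_fd;   /* BPF_PROG_TYPE_CGROUP_SKB program */
        attr.attach_type   = BPF_CGROUP_INET_INGRESS;

        if (syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr)))
                perror("BPF_PROG_ATTACH"); /* -EPERM without CAP_NET_ADMIN */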

Signed-off-by: Daniel Mack 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h |  8 +
 kernel/bpf/syscall.c | 81 
 2 files changed, 89 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1f3e6f1..f31b655 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -73,6 +73,8 @@ enum bpf_cmd {
BPF_PROG_LOAD,
BPF_OBJ_PIN,
BPF_OBJ_GET,
+   BPF_PROG_ATTACH,
+   BPF_PROG_DETACH,
 };
 
 enum bpf_map_type {
@@ -150,6 +152,12 @@ union bpf_attr {
__aligned_u64   pathname;
__u32   bpf_fd;
};
+
+   struct { /* anonymous struct used by BPF_PROG_ATTACH/DETACH commands */
+   __u32   target_fd;  /* container object to attach 
to */
+   __u32   attach_bpf_fd;  /* eBPF program to attach */
+   __u32   attach_type;
+   };
 } __attribute__((aligned(8)));
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 228f962..1814c01 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -822,6 +822,77 @@ static int bpf_obj_get(const union bpf_attr *attr)
return bpf_obj_get_user(u64_to_ptr(attr->pathname));
 }
 
+#ifdef CONFIG_CGROUP_BPF
+
+#define BPF_PROG_ATTACH_LAST_FIELD attach_type
+
+static int bpf_prog_attach(const union bpf_attr *attr)
+{
+   struct bpf_prog *prog;
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_ATTACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   prog = bpf_prog_get_type(attr->attach_bpf_fd,
+BPF_PROG_TYPE_CGROUP_SKB);
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp)) {
+   bpf_prog_put(prog);
+   return PTR_ERR(cgrp);
+   }
+
+   cgroup_bpf_update(cgrp, prog, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+#define BPF_PROG_DETACH_LAST_FIELD attach_type
+
+static int bpf_prog_detach(const union bpf_attr *attr)
+{
+   struct cgroup *cgrp;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   if (CHECK_ATTR(BPF_PROG_DETACH))
+   return -EINVAL;
+
+   switch (attr->attach_type) {
+   case BPF_CGROUP_INET_INGRESS:
+   case BPF_CGROUP_INET_EGRESS:
+   cgrp = cgroup_get_from_fd(attr->target_fd);
+   if (IS_ERR(cgrp))
+   return PTR_ERR(cgrp);
+
+   cgroup_bpf_update(cgrp, NULL, attr->attach_type);
+   cgroup_put(cgrp);
+   break;
+
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+#endif /* CONFIG_CGROUP_BPF */
+
 SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
size)
 {
union bpf_attr attr = {};
@@ -888,6 +959,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_OBJ_GET:
err = bpf_obj_get(&attr);
break;
+
+#ifdef CONFIG_CGROUP_BPF
+   case BPF_PROG_ATTACH:
+   err = bpf_prog_attach(&attr);
+   break;
+   case BPF_PROG_DETACH:
+   err = bpf_prog_detach(&attr);
+   break;
+#endif
+
default:
err = -EINVAL;
break;
-- 
2.7.4



[PATCH v8 6/6] samples: bpf: add userspace example for attaching eBPF programs to cgroups

2016-11-17 Thread Daniel Mack
Add a simple userspace program to demonstrate the new API to attach eBPF
programs to cgroups. This is what it does:

 * Create arraymap in kernel with 4 byte keys and 8 byte values

 * Load eBPF program

   The eBPF program accesses the map passed in to store two pieces of
   information. The number of invocations of the program, which maps
   to the number of packets received, is stored to key 0. Key 1 is
   incremented on each iteration by the number of bytes stored in
   the skb.

 * Detach any eBPF program previously attached to the cgroup

 * Attach the new program to the cgroup using BPF_PROG_ATTACH

 * Once a second, read map[0] and map[1] to see how many bytes and
   packets were seen on any socket of tasks in the given cgroup.

The program takes a cgroup path as 1st argument, and either "ingress"
or "egress" as 2nd. Optionally, "drop" can be passed as 3rd argument,
which will make the generated eBPF program return 0 instead of 1, so
the kernel will drop the packet.

libbpf gained two new wrappers for the new syscall commands.
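
Condensed, that flow looks like this (a sketch with error handling trimmed;
the full version is in test_cgrp2_attach.c below):

        int cg_fd = open(cgroup_path, O_DIRECTORY | O_RDONLY);
        int prog_fd = prog_load(map_fd, verdict); /* builds the program */

        /* swap out whatever was attached before, then install ours */
        bpf_prog_detach(cg_fd, BPF_CGROUP_INET_INGRESS);
        if (bpf_prog_attach(prog_fd, cg_fd, BPF_CGROUP_INET_INGRESS))
                perror("bpf_prog_attach");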

Signed-off-by: Daniel Mack 
Acked-by: Alexei Starovoitov 
---
 samples/bpf/Makefile|   2 +
 samples/bpf/libbpf.c|  21 ++
 samples/bpf/libbpf.h|   3 +
 samples/bpf/test_cgrp2_attach.c | 147 
 4 files changed, 173 insertions(+)
 create mode 100644 samples/bpf/test_cgrp2_attach.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 12b7304..e4cdc74 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -22,6 +22,7 @@ hostprogs-y += spintest
 hostprogs-y += map_perf_test
 hostprogs-y += test_overhead
 hostprogs-y += test_cgrp2_array_pin
+hostprogs-y += test_cgrp2_attach
 hostprogs-y += xdp1
 hostprogs-y += xdp2
 hostprogs-y += test_current_task_under_cgroup
@@ -49,6 +50,7 @@ spintest-objs := bpf_load.o libbpf.o spintest_user.o
 map_perf_test-objs := bpf_load.o libbpf.o map_perf_test_user.o
 test_overhead-objs := bpf_load.o libbpf.o test_overhead_user.o
 test_cgrp2_array_pin-objs := libbpf.o test_cgrp2_array_pin.o
+test_cgrp2_attach-objs := libbpf.o test_cgrp2_attach.o
 xdp1-objs := bpf_load.o libbpf.o xdp1_user.o
 # reuse xdp1 source intentionally
 xdp2-objs := bpf_load.o libbpf.o xdp1_user.o
diff --git a/samples/bpf/libbpf.c b/samples/bpf/libbpf.c
index 9969e35..9ce707b 100644
--- a/samples/bpf/libbpf.c
+++ b/samples/bpf/libbpf.c
@@ -104,6 +104,27 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
 }
 
+int bpf_prog_attach(int prog_fd, int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_bpf_fd = prog_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
+}
+
+int bpf_prog_detach(int target_fd, enum bpf_attach_type type)
+{
+   union bpf_attr attr = {
+   .target_fd = target_fd,
+   .attach_type = type,
+   };
+
+   return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
+}
+
 int bpf_obj_pin(int fd, const char *pathname)
 {
union bpf_attr attr = {
diff --git a/samples/bpf/libbpf.h b/samples/bpf/libbpf.h
index ac6edb6..d0a799a 100644
--- a/samples/bpf/libbpf.h
+++ b/samples/bpf/libbpf.h
@@ -15,6 +15,9 @@ int bpf_prog_load(enum bpf_prog_type prog_type,
  const struct bpf_insn *insns, int insn_len,
  const char *license, int kern_version);
 
+int bpf_prog_attach(int prog_fd, int attachable_fd, enum bpf_attach_type type);
+int bpf_prog_detach(int attachable_fd, enum bpf_attach_type type);
+
 int bpf_obj_pin(int fd, const char *pathname);
 int bpf_obj_get(const char *pathname);
 
diff --git a/samples/bpf/test_cgrp2_attach.c b/samples/bpf/test_cgrp2_attach.c
new file mode 100644
index 000..63ef208
--- /dev/null
+++ b/samples/bpf/test_cgrp2_attach.c
@@ -0,0 +1,147 @@
+/* eBPF example program:
+ *
+ * - Creates arraymap in kernel with 4 bytes keys and 8 byte values
+ *
+ * - Loads eBPF program
+ *
+ *   The eBPF program accesses the map passed in to store two pieces of
+ *   information. The number of invocations of the program, which maps
+ *   to the number of packets received, is stored to key 0. Key 1 is
+ *   incremented on each iteration by the number of bytes stored in
+ *   the skb.
+ *
+ * - Detaches any eBPF program previously attached to the cgroup
+ *
+ * - Attaches the new program to a cgroup using BPF_PROG_ATTACH
+ *
+ * - Every second, reads map[0] and map[1] to see how many bytes and
+ *   packets were seen on any socket of tasks in the given cgroup.
+ */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include "libbpf.h"
+
+enum {
+   MAP_KEY_PACKETS,
+   MAP_KEY_BYTES,
+};
+
+static int prog_load(int map_fd, int verdict)
+{
+   struct bpf_insn prog[] = {
+   

[PATCH v8 1/6] bpf: add new prog type for cgroup socket filtering

2016-11-17 Thread Daniel Mack
This program type is similar to BPF_PROG_TYPE_SOCKET_FILTER, except that
it does not allow BPF_LD_[ABS|IND] instructions and hooks up the
bpf_skb_load_bytes() helper.

Programs of this type will be attached to cgroups for network filtering
and accounting.

Signed-off-by: Daniel Mack 
Acked-by: Alexei Starovoitov 
---
 include/uapi/linux/bpf.h |  9 +
 net/core/filter.c| 23 +++
 2 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f09c70b..1f3e6f1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -96,8 +96,17 @@ enum bpf_prog_type {
BPF_PROG_TYPE_TRACEPOINT,
BPF_PROG_TYPE_XDP,
BPF_PROG_TYPE_PERF_EVENT,
+   BPF_PROG_TYPE_CGROUP_SKB,
 };
 
+enum bpf_attach_type {
+   BPF_CGROUP_INET_INGRESS,
+   BPF_CGROUP_INET_EGRESS,
+   __MAX_BPF_ATTACH_TYPE
+};
+
+#define MAX_BPF_ATTACH_TYPE __MAX_BPF_ATTACH_TYPE
+
 #define BPF_PSEUDO_MAP_FD  1
 
 /* flags for BPF_MAP_UPDATE_ELEM command */
diff --git a/net/core/filter.c b/net/core/filter.c
index 00351cd..e3813d6 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2576,6 +2576,17 @@ xdp_func_proto(enum bpf_func_id func_id)
}
 }
 
+static const struct bpf_func_proto *
+cg_skb_func_proto(enum bpf_func_id func_id)
+{
+   switch (func_id) {
+   case BPF_FUNC_skb_load_bytes:
+   return &bpf_skb_load_bytes_proto;
+   default:
+   return sk_filter_func_proto(func_id);
+   }
+}
+
 static bool __is_valid_access(int off, int size, enum bpf_access_type type)
 {
if (off < 0 || off >= sizeof(struct __sk_buff))
@@ -2938,6 +2949,12 @@ static const struct bpf_verifier_ops xdp_ops = {
.convert_ctx_access = xdp_convert_ctx_access,
 };
 
+static const struct bpf_verifier_ops cg_skb_ops = {
+   .get_func_proto = cg_skb_func_proto,
+   .is_valid_access= sk_filter_is_valid_access,
+   .convert_ctx_access = sk_filter_convert_ctx_access,
+};
+
 static struct bpf_prog_type_list sk_filter_type __read_mostly = {
.ops= &sk_filter_ops,
.type   = BPF_PROG_TYPE_SOCKET_FILTER,
@@ -2958,12 +2975,18 @@ static struct bpf_prog_type_list xdp_type __read_mostly 
= {
.type   = BPF_PROG_TYPE_XDP,
 };
 
+static struct bpf_prog_type_list cg_skb_type __read_mostly = {
+   .ops= &cg_skb_ops,
+   .type   = BPF_PROG_TYPE_CGROUP_SKB,
+};
+
 static int __init register_sk_filter_ops(void)
 {
bpf_register_prog_type(&sk_filter_type);
bpf_register_prog_type(&sched_cls_type);
bpf_register_prog_type(&sched_act_type);
bpf_register_prog_type(&xdp_type);
+   bpf_register_prog_type(&cg_skb_type);
 
return 0;
 }
-- 
2.7.4



[PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-11-17 Thread Daniel Mack
If the cgroup associated with the receiving socket has eBPF
programs installed, run them from ip_output(), ip6_output() and
ip_mc_output(). From mentioned functions we have two socket contexts
as per 7026b1ddb6b8 ("netfilter: Pass socket pointer down through
okfn()."). We explicitly need to use sk instead of skb->sk here,
since otherwise the same program would run multiple times on egress
when encap devices are involved, which is not desired in our case.

eBPF programs used in this context are expected to either return 1 to
let the packet pass, or != 1 to drop them. The programs have access to
the skb through bpf_skb_load_bytes(), and the payload starts at the
network headers (L3).

Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
the feature is unused.

Signed-off-by: Daniel Mack 
Acked-by: Alexei Starovoitov 
---
 net/ipv4/ip_output.c  | 15 +++
 net/ipv6/ip6_output.c |  8 
 2 files changed, 23 insertions(+)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 03e7f73..5914006 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -74,6 +74,7 @@
 #include 
 #include 
 #include 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 #include 
@@ -303,6 +304,7 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct rtable *rt = skb_rtable(skb);
struct net_device *dev = rt->dst.dev;
+   int ret;
 
/*
 *  If the indicated interface is up and running, send the packet.
@@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
/*
 *  Multicasts are looped back for other local users
 */
@@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
+   int ret;
 
IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
 
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
 
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip_finish_output,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 6001e78..483f91b 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -39,6 +39,7 @@
 #include 
 #include 
 
+#include <linux/bpf-cgroup.h>
 #include 
 #include 
 
@@ -143,6 +144,7 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
 {
struct net_device *dev = skb_dst(skb)->dev;
struct inet6_dev *idev = ip6_dst_idev(skb_dst(skb));
+   int ret;
 
if (unlikely(idev->cnf.disable_ipv6)) {
IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
@@ -150,6 +152,12 @@ int ip6_output(struct net *net, struct sock *sk, struct 
sk_buff *skb)
return 0;
}
 
+   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
+   if (ret) {
+   kfree_skb(skb);
+   return ret;
+   }
+
return NF_HOOK_COND(NFPROTO_IPV6, NF_INET_POST_ROUTING,
net, sk, skb, NULL, dev,
ip6_finish_output,
-- 
2.7.4



Re: [patch net-next 6/8] ipv4: fib: Add an API to request a FIB dump

2016-11-17 Thread David Miller
From: Hannes Frederic Sowa 
Date: Thu, 17 Nov 2016 18:20:39 +0100

> Hi,
> 
> On 17.11.2016 17:45, David Miller wrote:
>> From: Hannes Frederic Sowa 
>> Date: Thu, 17 Nov 2016 15:36:48 +0100
>> 
>>> The other way is the journal idea I had, which uses an rb-tree with
>>> timestamps as keys (can be lamport timestamps). You insert into the tree
>>> until the dump is finished and use it as queue later to shuffle stuff
>>> into the hardware.
>> 
>> If you have this "place" where pending inserts are stored, you have
>> a policy decision to make.
>> 
>> First of all what do other lookups see when there are pending entires?
> 
> I think this is a problem with the current approach already, as the
> delayed work queue already postpones the insert for an undecidable
> amount of time (and reorders depending on which CPU the entry was
> inserted and the fib notifier was called).
> 
> For user space queries we would still query the in-kernel table.

Ok, I think I might misunderstand something.

What is going into this journal exactly?  The actual full software and
hardware insert operation, or just the notification to the hardware
device driver notifiers?

The "lookup" I'm mostly concerned with is the fast path where the
packets being processed actually look up a route.

I do not think we can return success on the insert to the user, yet
have the route lookup dataplane not return that route on a lookup.


[PATCH ipsec] xfrm: unbreak xfrm_sk_policy_lookup

2016-11-17 Thread Florian Westphal
If security_xfrm_policy_lookup() returns 0 and we succeed in grabbing
the refcount, then
  if (!err && !xfrm_pol_hold_rcu(pol))
evaluates to false, so we fall through to the last else branch, which
sets pol to ERR_PTR(0), i.e. NULL, losing the policy we just took a
reference on.

Fixes: ae33786f73a7ce ("xfrm: policy: only use rcu in xfrm_sk_policy_lookup")
Reported-by: Nicolas Dichtel 
Tested-by: Nicolas Dichtel 
Signed-off-by: Florian Westphal 
---
 net/xfrm/xfrm_policy.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index fd6986634e6f..5bf7e1bfeac7 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -1268,12 +1268,14 @@ static struct xfrm_policy *xfrm_sk_policy_lookup(const 
struct sock *sk, int dir,
err = security_xfrm_policy_lookup(pol->security,
  fl->flowi_secid,
  policy_to_flow_dir(dir));
-   if (!err && !xfrm_pol_hold_rcu(pol))
-   goto again;
-   else if (err == -ESRCH)
+   if (!err) {
+   if (!xfrm_pol_hold_rcu(pol))
+   goto again;
+   } else if (err == -ESRCH) {
pol = NULL;
-   else
+   } else {
pol = ERR_PTR(err);
+   }
} else
pol = NULL;
}
-- 
2.7.3



[PATCH -next] tcp: make undo_cwnd mandatory for congestion modules

2016-11-17 Thread Florian Westphal
The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
which un-does reno halving behaviour.

It seems more appropriate to let congctl algorithms pair .ssthresh
and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
up for all congestion algorithms that used to rely on the fallback.

highspeed, illinois, scalable, veno and yeah use 'reno undo' while their
.ssthresh implementations don't halve the slow-start threshold; this
might point to a similar issue as the one fixed for dctcp in
ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss").
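
For example, if a module's .ssthresh leaves the threshold at, say,
0.8 * cwnd instead of reno's 0.5 * cwnd, the fallback's
max(cwnd, 2 * ssthresh) would "undo" to 1.6 * cwnd, i.e. above the
pre-loss window rather than merely restoring it.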

Cc: Eric Dumazet 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Signed-off-by: Florian Westphal 
---
 include/net/tcp.h|  1 +
 net/ipv4/tcp_cong.c  | 14 --
 net/ipv4/tcp_dctcp.c |  1 +
 net/ipv4/tcp_highspeed.c |  2 +-
 net/ipv4/tcp_hybla.c |  1 +
 net/ipv4/tcp_illinois.c  |  1 +
 net/ipv4/tcp_input.c |  5 +
 net/ipv4/tcp_lp.c|  1 +
 net/ipv4/tcp_scalable.c  |  1 +
 net/ipv4/tcp_vegas.c |  1 +
 net/ipv4/tcp_veno.c  |  1 +
 net/ipv4/tcp_westwood.c  |  1 +
 net/ipv4/tcp_yeah.c  |  1 +
 13 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 123979fe12bf..7de80739adab 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -958,6 +958,7 @@ u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
 u32 tcp_reno_ssthresh(struct sock *sk);
+u32 tcp_reno_undo_cwnd(struct sock *sk);
 void tcp_reno_cong_avoid(struct sock *sk, u32 ack, u32 acked);
 extern struct tcp_congestion_ops tcp_reno;
 
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 1294af4e0127..38905ec5f508 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -68,8 +68,9 @@ int tcp_register_congestion_control(struct tcp_congestion_ops 
*ca)
 {
int ret = 0;
 
-   /* all algorithms must implement ssthresh and cong_avoid ops */
-   if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) {
+   /* all algorithms must implement these */
+   if (!ca->ssthresh || !ca->undo_cwnd ||
+   !(ca->cong_avoid || ca->cong_control)) {
pr_err("%s does not implement required ops\n", ca->name);
return -EINVAL;
}
@@ -441,10 +442,19 @@ u32 tcp_reno_ssthresh(struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(tcp_reno_ssthresh);
 
+u32 tcp_reno_undo_cwnd(struct sock *sk)
+{
+   const struct tcp_sock *tp = tcp_sk(sk);
+
+   return max(tp->snd_cwnd, tp->snd_ssthresh << 1);
+}
+EXPORT_SYMBOL_GPL(tcp_reno_undo_cwnd);
+
 struct tcp_congestion_ops tcp_reno = {
.flags  = TCP_CONG_NON_RESTRICTED,
.name   = "reno",
.owner  = THIS_MODULE,
.ssthresh   = tcp_reno_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
+   .undo_cwnd  = tcp_reno_undo_cwnd,
 };
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 51139175bf61..bde22ebb92a8 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -342,6 +342,7 @@ static struct tcp_congestion_ops dctcp __read_mostly = {
 static struct tcp_congestion_ops dctcp_reno __read_mostly = {
.ssthresh   = tcp_reno_ssthresh,
.cong_avoid = tcp_reno_cong_avoid,
+   .undo_cwnd  = tcp_reno_undo_cwnd,
.get_info   = dctcp_get_info,
.owner  = THIS_MODULE,
.name   = "dctcp-reno",
diff --git a/net/ipv4/tcp_highspeed.c b/net/ipv4/tcp_highspeed.c
index db7842495a64..1eb8fefd9bd0 100644
--- a/net/ipv4/tcp_highspeed.c
+++ b/net/ipv4/tcp_highspeed.c
@@ -161,7 +161,7 @@ static struct tcp_congestion_ops tcp_highspeed 
__read_mostly = {
.init   = hstcp_init,
.ssthresh   = hstcp_ssthresh,
.cong_avoid = hstcp_cong_avoid,
-
+   .undo_cwnd  = tcp_reno_undo_cwnd,
.owner  = THIS_MODULE,
.name   = "highspeed"
 };
diff --git a/net/ipv4/tcp_hybla.c b/net/ipv4/tcp_hybla.c
index 083831e359df..0f7175c3338e 100644
--- a/net/ipv4/tcp_hybla.c
+++ b/net/ipv4/tcp_hybla.c
@@ -166,6 +166,7 @@ static void hybla_cong_avoid(struct sock *sk, u32 ack, u32 
acked)
 static struct tcp_congestion_ops tcp_hybla __read_mostly = {
.init   = hybla_init,
.ssthresh   = tcp_reno_ssthresh,
+   .undo_cwnd  = tcp_reno_undo_cwnd,
.cong_avoid = hybla_cong_avoid,
.set_state  = hybla_state,
 
diff --git a/net/ipv4/tcp_illinois.c b/net/ipv4/tcp_illinois.c
index c8e6d86be114..7c843578f233 100644
--- a/net/ipv4/tcp_illinois.c
+++ b/net/ipv4/tcp_illinois.c
@@ -327,6 +327,7 @@ static size_t tcp_illinois_info(struct sock *sk, u32 ext, 
int *attr,
 static struct tcp_congestion_ops tcp_illinois __read_mostly = {
.init   = tcp_illinois_init,
.ssthresh   = tcp_illinois_ssthresh,
+   .undo_cwnd  = tcp_reno_undo_cwnd,
.cong_avoid = tcp_illinois_cong_avoid,

[PATCH net-next v3 1/5] ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

Defines a generic API to get/set phy tunables. The API uses the
existing ethtool_tunable/tunable_type_id types, which are already being
used for MAC-level tunables.

Signed-off-by: Raju Lakkaraju 
Reviewed-by: Andrew Lunn 
Signed-off-by: Allan W. Nielsen 
---
 include/uapi/linux/ethtool.h | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 8e54723..42f696f 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,6 +248,16 @@ struct ethtool_tunable {
void*data[0];
 };
 
+enum phy_tunable_id {
+   ETHTOOL_PHY_ID_UNSPEC,
+
+   /*
+* Add your fresh new phy tunable attribute above and remember to update
+* phy_tunable_strings[] in net/core/ethtool.c
+*/
+   __ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
 /**
  * struct ethtool_regs - hardware register dump
  * @cmd: Command number = %ETHTOOL_GREGS
@@ -548,6 +558,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_FEATURES: Device feature names
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
  */
 enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -558,6 +569,7 @@ enum ethtool_stringset {
ETH_SS_RSS_HASH_FUNCS,
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
+   ETH_SS_PHY_TUNABLES,
 };
 
 /**
@@ -1313,7 +1325,8 @@ struct ethtool_per_queue_op {
 
 #define ETHTOOL_GLINKSETTINGS  0x004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS  0x004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE   0x004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
-- 
2.7.3
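
For context, userspace is expected to drive this the same way as the
existing MAC tunables, via SIOCETHTOOL. A hypothetical sketch -- this patch
defines no real PHY tunable IDs yet, so the id below is a placeholder:

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>

struct phy_tunable_req {
        struct ethtool_tunable hdr;
        __u8 value[4];          /* sized to match hdr.type_id/hdr.len */
};

/* fd is any AF_INET socket; result lands in req.value on success */
static int phy_tunable_get(int fd, const char *ifname)
{
        struct phy_tunable_req req = {
                .hdr = {
                        .cmd     = ETHTOOL_PHY_GTUNABLE,
                        .id      = ETHTOOL_PHY_ID_UNSPEC + 1, /* placeholder */
                        .type_id = ETHTOOL_TUNABLE_U8,
                        .len     = 1,
                },
        };
        struct ifreq ifr = {};

        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
        ifr.ifr_data = (void *)&req;
        return ioctl(fd, SIOCETHTOOL, &ifr);
}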



RE: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

2016-11-17 Thread Rafal Ozieblo
-Original Message-
From: Nicolas Ferre [mailto:nicolas.fe...@atmel.com] 
Sent: 17 listopada 2016 14:29
To: Harini Katakam; Rafal Ozieblo
Cc: harini.kata...@xilinx.com; netdev@vger.kernel.org; 
linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

> Le 17/11/2016 à 13:21, Harini Katakam a écrit :
> > Hi Rafal,
> > 
> > On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo  wrote:
> >> Hello,
> >> I think there could be a bug in your patch.
> >>
> >>> +
> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >>> + dmacfg |= GEM_BIT(ADDR64);
> >>> +#endif
> >>
> >> You enable 64 bit addressing (64-bit DMA bus width) whenever the
> >> appropriate architecture config option is enabled.
> >> But there are some legacy controllers which do not support that feature.
> >> According to the Cadence hardware team:
> >> "64 bit addressing was added in July 2013. Earlier version do not have it.
> >> This feature was enhanced in release August 2014 to have separate upper 
> >> address values for transmit and receive."
> >>
> >>> /* Bitfields in NSR */
> >>> @@ -474,6 +479,10 @@
> >>>  struct macb_dma_desc {
> >>  >  u32 addr;
> >>>   u32 ctrl;
> >>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >>> + u32 addrh;
> >>> + u32 resvd;
> >>> +#endif
> >>>  };
> >>
> >> It will not work for legacy hardware. The old descriptor is 2 words wide;
> >> the new one is 4 words wide.
> >> If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but the hardware doesn't
> >> support it at all, you will miss every second descriptor.
> >>
> > 
> > True, this feature is not available in all Cadence IP versions.
> > In fact, the IP version in Zynq does not support this. But the one in
> > ZynqMP does.
> > So, we enable the kernel config for 64 bit DMA addressing for this SoC and
> > hence the driver picks it up. My assumption was that if the legacy IP
> > does not support 64 bit addressing, then this DMA option wouldn't be
> > enabled.
> > 
> > There is a design config register in Cadence IP which is being read to 
> > check for 64 bit address support - DMA mask is set based on that.
> > But the addition of two descriptor words cannot be based on this runtime 
> > check.
> > For this reason, all the static changes were placed under this check.
>
> We have quite a bunch of options in this driver to determine the real
> capabilities of the underlying hardware.
> If HW configuration registers are not appropriate, and it seems they are
> not, I would advise simply using the DT compatible string.
>
> Best regards,
> --
> Nicolas Ferre

HW configuration registers are appropriate. The issue is that this code doesn’t 
use the capability bit to switch between different dma descriptors (2 words vs. 
4 words).
DMA descriptor size is chosen based on kernel configuration, not based on 
hardware capabilities.

Regards, 
Rafal Ozieblo
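
To make the alternative concrete, the descriptor size could be chosen at
probe time from the design configuration register instead of the kernel
config. Illustrative only: hw_dma_cap and HW_DMA_CAP_64B are invented for
this sketch, and the DAW64 bit position should be checked against the GEM
databook:

#define GEM_DAW64_OFFSET        23      /* verify against the databook */
#define GEM_DAW64_SIZE          1
#define HW_DMA_CAP_64B          BIT(0)  /* hypothetical capability flag */

struct macb_dma_desc_64 {               /* extension words, new IP only */
        u32 addrh;
        u32 resvd;
};

static unsigned int macb_dma_desc_size(struct macb *bp)
{
        if (bp->hw_dma_cap & HW_DMA_CAP_64B)    /* hw_dma_cap: new field */
                return sizeof(struct macb_dma_desc) +
                       sizeof(struct macb_dma_desc_64);
        return sizeof(struct macb_dma_desc);
}

static void macb_probe_dma_caps(struct macb *bp)
{
        if (GEM_BFEXT(DAW64, gem_readl(bp, DCFG6)))
                bp->hw_dma_cap |= HW_DMA_CAP_64B;
}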


Re: [PATCH net-next v3 4/7] vxlan: improve vxlan route lookup checks.

2016-11-17 Thread David Miller
From: Jiri Benc 
Date: Thu, 17 Nov 2016 16:59:49 +0100

> On Thu, 17 Nov 2016 10:17:01 +, David Laight wrote:
>> Worse than arbitrary, it adds 4 bytes of pad on 64bit systems.
> 
> It does not, this is not a struct.

He is talking about on the function stack.


Re: net: BUG still has locks held in unix_stream_splice_read

2016-11-17 Thread Dmitry Vyukov
On Mon, Oct 10, 2016 at 10:01 AM, Dmitry Vyukov  wrote:
> On Mon, Oct 10, 2016 at 5:14 AM, Al Viro  wrote:
>> On Mon, Oct 10, 2016 at 03:46:07AM +0100, Al Viro wrote:
>>> On Sun, Oct 09, 2016 at 12:06:14PM +0200, Dmitry Vyukov wrote:
>>> > I suspect this is:
>>> >
>>> > commit 25869262ef7af24ccde988867ac3eb1c3d4b88d4
>>> > Author: Al Viro 
>>> > Date:   Sat Sep 17 21:02:10 2016 -0400
>>> > skb_splice_bits(): get rid of callback
>>> > since pipe_lock is the outermost now, we don't need to drop/regain
>>> > socket locks around the call of splice_to_pipe() from 
>>> > skb_splice_bits(),
>>> > which kills the need to have a socket-specific callback; we can just
>>> > call splice_to_pipe() and be done with that.
>>>
>>> Unlikely, since that particular commit removes unlocking/relocking ->iolock
>>> around the call of splice_to_pipe().  Original would've retaken the same
>>> lock on the way out; it's not as if we could leave the syscall there.
>>>
>>> It might be splice-related, but I don't believe that you've got the right
>>> commit here.
>>
>> It's not that commit
>
> It's highly likely. Sorry for falsely pointing to your commit.
>
>
>> , all right - it's "can't call unix_stream_read_generic()
>> with any locks held" stepped onto a couple of commits prior by
>> "splice: lift pipe_lock out of splice_to_pipe()".  Could somebody explain
>> what is that about?
>>
>> E.g what will happen if some code does a read on AF_UNIX socket with
>> some local mutex held?  AFAICS, there are exactly two callers of
>> freezable_schedule_timeout() - this one and one in XFS; the latter is
>> in a kernel thread where we do have good warranties about the locking
>> environment, but here it's in the bleeding ->recvmsg/->splice_read and
>> for those assumption that caller doesn't hold any locks is pretty
>> strong, especially since it's not documented anywhere.
>>
>> What's going on there?
>
> I never saw that warning before. There is some possibility that the fuzzer
> has discovered some new paths, but it's much more likely that
> something has changed recently (the stack looks quite simple -- just a
> splice from a unix socket). And my previous pull was like a week ago.

Ping. Just hit it again on 4.9-rc5


[ BUG: syz-executor/15922 still has locks held! ]
4.9.0-rc5+ #43 Not tainted
-
1 lock held by syz-executor/15922:
 #0: [ 1441.143288]  (
[< inline >] pipe_lock_nested fs/pipe.c:66
[] pipe_lock+0x5b/0x70 fs/pipe.c:74

stack backtrace:
CPU: 3 PID: 15922 Comm: syz-executor Not tainted 4.9.0-rc5+ #43
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
 88004a98f568 834c2a19 0003 110009531e40
 ed0009531e38 41b58ab3 895758b0 834c272b
 0003 880035256640 0003 88006d122cd8
Call Trace:
 [< inline >] __dump_stack lib/dump_stack.c:15
 [] dump_stack+0x2ee/0x3f5 lib/dump_stack.c:51
 [< inline >] print_held_locks_bug kernel/locking/lockdep.c:4296
 [] debug_check_no_locks_held+0x125/0x140
kernel/locking/lockdep.c:4302
 [< inline >] try_to_freeze include/linux/freezer.h:65
 [< inline >] freezer_count include/linux/freezer.h:127
 [< inline >] freezable_schedule_timeout include/linux/freezer.h:192
 [] unix_stream_data_wait+0x4fd/0x910 net/unix/af_unix.c:2223
 [] unix_stream_read_generic+0x11e2/0x2240
net/unix/af_unix.c:2332
 [] unix_stream_splice_read+0x27f/0x400
net/unix/af_unix.c:2506
 [] sock_splice_read+0xbe/0x100 net/socket.c:772
 [] do_splice_to+0x10f/0x170 fs/splice.c:897
 [< inline >] do_splice fs/splice.c:1185
 [< inline >] SYSC_splice fs/splice.c:1409
 [] SyS_splice+0xfaa/0x16a0 fs/splice.c:1392
 [] entry_SYSCALL_64_fastpath+0x23/0xc6
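
For readers who want to poke at this without syzkaller, the path in the
backtrace can be reached with an ordinary blocking splice from an AF_UNIX
stream socket into a pipe. A minimal reproducer sketch (assumption: any
such blocking splice parks in unix_stream_data_wait(); this is not the
literal syzkaller program):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int sv[2], pfd[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) || pipe(pfd)) {
		perror("setup");
		return 1;
	}
	/* No data is queued on sv[1], so this splice() sleeps in
	 * unix_stream_data_wait() -> freezable_schedule_timeout()
	 * while do_splice() still holds the pipe mutex, which is
	 * exactly the combination the report above complains about. */
	if (splice(sv[0], NULL, pfd[1], NULL, 4096, 0) < 0)
		perror("splice");
	return 0;
}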


[PATCH net-next] net_sched: sch_fq: use hash_ptr()

2016-11-17 Thread Eric Dumazet
From: Eric Dumazet 

When I wrote sch_fq.c, hash_ptr() on 64bit arches was awful,
and I chose hash_32().

Linus Torvalds and George Spelvin fixed this issue, so we can
use hash_ptr() to get more entropy on 64bit arches with Terabytes
of memory, and avoid the cast games.

Signed-off-by: Eric Dumazet 
Cc: Hugh Dickins 
---
 net/sched/sch_fq.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 18e752439f6f..86309a3156a5 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -245,7 +245,7 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, 
struct fq_sched_data *q)
skb_orphan(skb);
}
 
-   root = &q->fq_root[hash_32((u32)(long)sk, q->fq_trees_log)];
+   root = &q->fq_root[hash_ptr(sk, q->fq_trees_log)];
 
if (q->flows >= (2U << q->fq_trees_log) &&
q->inactive_flows > q->flows/2)
@@ -599,7 +599,7 @@ static void fq_rehash(struct fq_sched_data *q,
kmem_cache_free(fq_flow_cachep, of);
continue;
}
-   nroot = &new_array[hash_32((u32)(long)of->sk, new_log)];
+   nroot = &new_array[hash_ptr(of->sk, new_log)];
 
np = &nroot->rb_node;
parent = NULL;
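
For readers not following the hash function history: on a 64-bit kernel the
old expression truncated the socket pointer to its low 32 bits before
hashing, while hash_ptr() mixes the full pointer. An illustrative sketch of
the two bucket computations (function names invented, mirroring the lines
touched above):

#include <linux/hash.h>

/* Old scheme: the cast throws away the upper 32 bits of the pointer,
 * so machines with lots of memory lose entropy in the bucket index. */
static u32 old_bucket(const void *sk, unsigned int log)
{
	return hash_32((u32)(long)sk, log);
}

/* New scheme: hash_ptr() hashes the whole pointer value. */
static u32 new_bucket(const void *sk, unsigned int log)
{
	return hash_ptr(sk, log);
}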




Fwd: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN

2016-11-17 Thread Neftin, Sasha
From: Baicar, Tyler [mailto:tbai...@codeaurora.org] Sent: Tuesday,
November 15, 2016 11:50 PM
To: Neftin, Sasha ; Kirsher, Jeffrey T
; intel-wired-...@lists.osuosl.org;
netdev@vger.kernel.org; linux-ker...@vger.kernel.org;
ok...@codeaurora.org; ti...@codeaurora.org
Subject: Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of
__E1000_DOWN

On 11/13/2016 2:25 AM, Neftin, Sasha wrote:
> On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
>> On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
>>> Hello Sasha,
>>>
>>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
 On 11/9/2016 11:41 PM, Tyler Baicar wrote:
> Move IRQ free code so that it will happen regardless of the 
> __E1000_DOWN bit. Currently the e1000e driver only releases its 
> IRQ if the __E1000_DOWN bit is cleared. This is not sufficient 
> because it is possible for __E1000_DOWN to be set without releasing the 
> IRQ.
> In such a situation, we will hit a kernel bug later in 
> e1000_remove because the IRQ still has action since it was never 
> freed. A secondary bus reset can cause this case to happen.
>
> Signed-off-by: Tyler Baicar 
> ---
>drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 7017281..36cfcb0 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
>  if (!test_bit(__E1000_DOWN, &adapter->state)) {
>e1000e_down(adapter, true);
> -e1000_free_irq(adapter);
>  /* Link status message must follow this format */
>pr_info("%s NIC Link is Down\n", adapter->netdev->name);
>}
>+e1000_free_irq(adapter);
> +
>napi_disable(&adapter->napi);
>  e1000e_free_tx_resources(adapter->tx_ring);
>
 I would not recommend inserting this change. This change touches the 
 driver state machine, and we are afraid of a lot of synchronization 
 problems and issues.
 We need to keep e1000_free_irq inside the loop and check that 'test_bit'
 is ready.
>>> What do you mean here? There is no loop. If __E1000_DOWN is set then 
>>> we will never free the IRQ.
>>>
 Another point: before executing the secondary bus reset, does your SW 
 back up the PCIe configuration space properly?
>>> After a secondary bus reset, the link needs to recover and go back 
>>> to a working state after 1 second.
>>>
>>>  From the callstack, the issue is happening while removing the 
>>> endpoint from the system, before applying the secondary bus reset.
>>>
>>> The order of events is
>>> 1. remove the drivers
>>> 2. cause a secondary bus reset
>>> 3. wait 1 second
>> Actually, this is too much; the link usually comes up in less than 100ms. You 
>> can check Data Link Layer indication.
>>> 4. recover the link
>>>
>>> callstack:
>>> free_msi_irqs+0x6c/0x1a8
>>> pci_disable_msi+0xb0/0x148
>>> e1000e_reset_interrupt_capability+0x60/0x78
>>> e1000_remove+0xc8/0x180
>>> pci_device_remove+0x48/0x118
>>> __device_release_driver+0x80/0x108
>>> device_release_driver+0x2c/0x40
>>> pci_stop_bus_device+0xa0/0xb0
>>> pci_stop_bus_device+0x3c/0xb0
>>> pci_stop_root_bus+0x54/0x80
>>> acpi_pci_root_remove+0x28/0x64
>>> acpi_bus_trim+0x6c/0xa4
>>> acpi_device_hotplug+0x19c/0x3f4
>>> acpi_hotplug_work_fn+0x28/0x3c
>>> process_one_work+0x150/0x460
>>> worker_thread+0x50/0x4b8
>>> kthread+0xd4/0xe8
>>> ret_from_fork+0x10/0x50
>>>
>>> Thanks,
>>> Tyler
>>>
>> Hello Tyler,
>> Okay, we need to consult more about this suggestion.
>> May I ask what setup you run? Is it a NIC or on-board LAN? I would 
>> like to try to reproduce this issue in our labs too.
>> Also, is the same issue observed with the same scenario on other NICs too?
>> Sasha
>> ___
>> Intel-wired-lan mailing list
>> intel-wired-...@lists.osuosl.org
>> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
>>
> Please specify what device is used.
Hello Sasha,
This was on a QDF2432 using an Intel PRO/1000 PT Dual Port server
adapter. I have not tried other e1000e PCIe cards, but have not seen any
similar issues with Mellanox cards. I'm able to reproduce it with just
pulling the card out. Here is the lspci -vvv output for this card:

0004:00:00.0 PCI bridge: Airgo Networks, Inc. Device 0400 (prog-if 00
[Normal decode])
 Physical Slot: 5
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
SERR- TAbort-
Reset- FastB2B-
 PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
 Capabilities: [40] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D

Re: [PATCH 1/3] net: stmmac: replace all pr_xxx by their netdev_xxx counterpart

2016-11-17 Thread Giuseppe CAVALLARO

On 11/16/2016 8:09 PM, Corentin Labbe wrote:

From: LABBE Corentin 

The stmmac driver uses lots of pr_xxx functions to print information.
This is bad since we cannot know which device logs the information
(moreover if two stmmac devices are present).

Furthermore, it wrongly assumes that all logs will always be
consecutive, by using a dev_xxx followed by some indented pr_xxx like this:
kernel: sun7i-dwmac 1c5.ethernet: no reset control found
kernel:  Ring mode enabled
kernel:  No HW DMA feature register supported
kernel:  Normal descriptors
kernel:  TX Checksum insertion supported

So this patch replaces all pr_xxx with their netdev_xxx counterparts,
except for some printing where netdev_xxx causes unpretty output like:
sun7i-dwmac 1c5.ethernet (unnamed net_device) (uninitialized): no reset 
control found
In those cases, I keep dev_xxx.

At the same time I remove some "stmmac:" prefixes since they would
duplicate what dev_xxx displays.

Signed-off-by: Corentin Labbe 


Thanks for these changes.

Acked-by: Giuseppe Cavallaro 


---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 204 --
 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c |  14 +-
 2 files changed, 123 insertions(+), 95 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 8eb12353..791daf4 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -305,7 +305,7 @@ bool stmmac_eee_init(struct stmmac_priv *priv)
 */
spin_lock_irqsave(&priv->lock, flags);
if (priv->eee_active) {
-   pr_debug("stmmac: disable EEE\n");
+   netdev_dbg(priv->dev, "disable EEE\n");
del_timer_sync(&priv->eee_ctrl_timer);
priv->hw->mac->set_eee_timer(priv->hw, 0,
 tx_lpi_timer);
@@ -334,7 +334,7 @@ bool stmmac_eee_init(struct stmmac_priv *priv)
ret = true;
spin_unlock_irqrestore(&priv->lock, flags);

-   pr_debug("stmmac: Energy-Efficient Ethernet initialized\n");
+   netdev_dbg(priv->dev, "Energy-Efficient Ethernet 
initialized\n");
}
 out:
return ret;
@@ -456,8 +456,8 @@ static int stmmac_hwtstamp_ioctl(struct net_device *dev, 
struct ifreq *ifr)
   sizeof(struct hwtstamp_config)))
return -EFAULT;

-   pr_debug("%s config flags:0x%x, tx_type:0x%x, rx_filter:0x%x\n",
-__func__, config.flags, config.tx_type, config.rx_filter);
+   netdev_dbg(priv->dev, "%s config flags:0x%x, tx_type:0x%x, 
rx_filter:0x%x\n",
+  __func__, config.flags, config.tx_type, config.rx_filter);

/* reserved for future extensions */
if (config.flags)
@@ -756,8 +756,9 @@ static void stmmac_adjust_link(struct net_device *dev)
break;
default:
if (netif_msg_link(priv))
-   pr_warn("%s: Speed (%d) not 10/100\n",
-   dev->name, phydev->speed);
+   netdev_warn(priv->dev,
+   "Speed (%d) not 10/100\n",
+   phydev->speed);
break;
}

@@ -810,10 +811,10 @@ static void stmmac_check_pcs_mode(struct stmmac_priv 
*priv)
(interface == PHY_INTERFACE_MODE_RGMII_ID) ||
(interface == PHY_INTERFACE_MODE_RGMII_RXID) ||
(interface == PHY_INTERFACE_MODE_RGMII_TXID)) {
-   pr_debug("STMMAC: PCS RGMII support enable\n");
+   netdev_dbg(priv->dev, "PCS RGMII support enabled\n");
priv->hw->pcs = STMMAC_PCS_RGMII;
} else if (interface == PHY_INTERFACE_MODE_SGMII) {
-   pr_debug("STMMAC: PCS SGMII support enable\n");
+   netdev_dbg(priv->dev, "PCS SGMII support enabled\n");
priv->hw->pcs = STMMAC_PCS_SGMII;
}
}
@@ -848,15 +849,15 @@ static int stmmac_init_phy(struct net_device *dev)

snprintf(phy_id_fmt, MII_BUS_ID_SIZE + 3, PHY_ID_FMT, bus_id,
 priv->plat->phy_addr);
-   pr_debug("stmmac_init_phy:  trying to attach to %s\n",
-phy_id_fmt);
+   netdev_dbg(priv->dev, "stmmac_init_phy: trying to attach to 
%s\n",
+  phy_id_fmt);

phydev = phy_connect(dev, phy_id_fmt, &stmmac_adjust_link,
 interface);
   

[PATCH net v2 3/7] net: ethernet: ti: cpsw: fix deferred probe

2016-11-17 Thread Johan Hovold
Make sure to deregister all child devices also on probe errors to avoid
leaks and to fix probe deferral:

cpsw 4a10.ethernet: omap_device: omap_device_enable() called from invalid 
state 1
cpsw 4a10.ethernet: use pm_runtime_put_sync_suspend() in driver?
cpsw: probe of 4a10.ethernet failed with error -22

Add a generic helper to undo the effects of cpsw_probe_dt(), which will
also be used in a follow-on patch to fix further leaks that have been
introduced more recently.

Note that the platform device is now runtime-resumed before registering
any child devices in order to make sure that it is synchronously
suspended after having deregistered the children in the error path.

Fixes: 1fb19aa730e4 ("net: cpsw: Add parent<->child relation support
between cpsw and mdio")
Signed-off-by: Johan Hovold 
---
 drivers/net/ethernet/ti/cpsw.c | 41 -
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 84c5d214557e..5d14abb06486 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2441,6 +2441,11 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
return 0;
 }
 
+static void cpsw_remove_dt(struct platform_device *pdev)
+{
+   of_platform_depopulate(&pdev->dev);
+}
+
 static int cpsw_probe_dual_emac(struct cpsw_priv *priv)
 {
struct cpsw_common  *cpsw = priv->cpsw;
@@ -2585,10 +2590,19 @@ static int cpsw_probe(struct platform_device *pdev)
/* Select default pin state */
pinctrl_pm_select_default_state(&pdev->dev);
 
+   /* Need to enable clocks with runtime PM api to access module
+* registers
+*/
+   ret = pm_runtime_get_sync(&pdev->dev);
+   if (ret < 0) {
+   pm_runtime_put_noidle(&pdev->dev);
+   goto clean_runtime_disable_ret;
+   }
+
if (cpsw_probe_dt(&cpsw->data, pdev)) {
dev_err(&pdev->dev, "cpsw: platform data missing\n");
ret = -ENODEV;
-   goto clean_runtime_disable_ret;
+   goto clean_dt_ret;
}
data = &cpsw->data;
cpsw->rx_ch_num = 1;
@@ -2609,7 +2623,7 @@ static int cpsw_probe(struct platform_device *pdev)
GFP_KERNEL);
if (!cpsw->slaves) {
ret = -ENOMEM;
-   goto clean_runtime_disable_ret;
+   goto clean_dt_ret;
}
for (i = 0; i < data->slaves; i++)
cpsw->slaves[i].slave_num = i;
@@ -2621,7 +2635,7 @@ static int cpsw_probe(struct platform_device *pdev)
if (IS_ERR(clk)) {
dev_err(priv->dev, "fck is not found\n");
ret = -ENODEV;
-   goto clean_runtime_disable_ret;
+   goto clean_dt_ret;
}
cpsw->bus_freq_mhz = clk_get_rate(clk) / 100;
 
@@ -2629,25 +2643,17 @@ static int cpsw_probe(struct platform_device *pdev)
ss_regs = devm_ioremap_resource(&pdev->dev, ss_res);
if (IS_ERR(ss_regs)) {
ret = PTR_ERR(ss_regs);
-   goto clean_runtime_disable_ret;
+   goto clean_dt_ret;
}
cpsw->regs = ss_regs;
 
-   /* Need to enable clocks with runtime PM api to access module
-* registers
-*/
-   ret = pm_runtime_get_sync(&pdev->dev);
-   if (ret < 0) {
-   pm_runtime_put_noidle(&pdev->dev);
-   goto clean_runtime_disable_ret;
-   }
cpsw->version = readl(&cpsw->regs->id_ver);
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
cpsw->wr_regs = devm_ioremap_resource(&pdev->dev, res);
if (IS_ERR(cpsw->wr_regs)) {
ret = PTR_ERR(cpsw->wr_regs);
-   goto clean_pm_runtime_put_ret;
+   goto clean_dt_ret;
}
 
memset(&dma_params, 0, sizeof(dma_params));
@@ -2684,7 +2690,7 @@ static int cpsw_probe(struct platform_device *pdev)
default:
dev_err(priv->dev, "unknown version 0x%08x\n", cpsw->version);
ret = -ENODEV;
-   goto clean_pm_runtime_put_ret;
+   goto clean_dt_ret;
}
for (i = 0; i < cpsw->data.slaves; i++) {
struct cpsw_slave *slave = &cpsw->slaves[i];
@@ -2713,7 +2719,7 @@ static int cpsw_probe(struct platform_device *pdev)
if (!cpsw->dma) {
dev_err(priv->dev, "error initializing dma\n");
ret = -ENOMEM;
-   goto clean_pm_runtime_put_ret;
+   goto clean_dt_ret;
}
 
cpsw->txch[0] = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0);
@@ -2823,7 +2829,8 @@ static int cpsw_probe(struct platform_device *pdev)
cpsw_ale_destroy(cpsw->ale);
 clean_dma_ret:
cpdma_ctlr_destroy(cpsw->dma);
-clean_pm_runtime_put_ret:
+clean_dt_ret:
+   cpsw_remove_dt(pdev);
pm_runtime_put_sync(&pde

Re: [PATCH net 1/3] net: phy: realtek: add eee advertisement disable options

2016-11-17 Thread Anand Moon
Hi Jerome,

On 17 November 2016 at 15:50, Jerome Brunet  wrote:
> On Wed, 2016-11-16 at 22:36 +0530, Anand Moon wrote:
>>  Hi Jerome.
>>
>> On 15 November 2016 at 19:59, Jerome Brunet 
>> wrote:
>> >
>> > On some platforms, energy efficient ethernet with rtl8211 devices is
>> > causing issues, like throughput drops or broken links.
>> >
>> > This was reported on the OdroidC2 (DWMAC + RTL8211F). While the
>> > issue root
>> > cause is not fully understood yet, disabling EEE advertisement
>> > prevents auto-negotiation from enabling EEE.
>> >
>> > This patch provides options to disable 1000T and 100TX EEE
>> > advertisement
>> > individually for the realtek phys supporting this feature.
>> >
>> > Reported-by: Martin Blumenstingl > > m>
>> > Cc: Giuseppe Cavallaro 
>> > Cc: Alexandre TORGUE 
>> > Signed-off-by: Jerome Brunet 
>> > Signed-off-by: Neil Armstrong 
>> > Tested-by: Andre Roth 
>> > ---
>> >  drivers/net/phy/realtek.c | 65
>> > ++-
>> >  1 file changed, 64 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
>> > index aadd6e9f54ad..77235fd5faaf 100644
>> > --- a/drivers/net/phy/realtek.c
>> > +++ b/drivers/net/phy/realtek.c
>> > @@ -15,6 +15,12 @@
>> >   */
>> >  #include 
>> >  #include 
>> > +#include 
>> > +
>> > +struct rtl8211x_phy_priv {
>> > +   bool eee_1000t_disable;
>> > +   bool eee_100tx_disable;
>> > +};
>> >
>> >  #define RTL821x_PHYSR  0x11
>> >  #define RTL821x_PHYSR_DUPLEX   0x2000
>> > @@ -93,12 +99,44 @@ static int rtl8211f_config_intr(struct
>> > phy_device *phydev)
>> > return err;
>> >  }
>> >
>> > +static void rtl8211x_clear_eee_adv(struct phy_device *phydev)
>> > +{
>> > +   struct rtl8211x_phy_priv *priv = phydev->priv;
>> > +   u16 val;
>> > +
>> > +   if (priv->eee_1000t_disable || priv->eee_100tx_disable) {
>> > +   val = phy_read_mmd_indirect(phydev,
>> > MDIO_AN_EEE_ADV,
>> > +   MDIO_MMD_AN);
>> > +
>> > +   if (priv->eee_1000t_disable)
>> > +   val &= ~MDIO_AN_EEE_ADV_1000T;
>> > +   if (priv->eee_100tx_disable)
>> > +   val &= ~MDIO_AN_EEE_ADV_100TX;
>> > +
>> > +   phy_write_mmd_indirect(phydev, MDIO_AN_EEE_ADV,
>> > +  MDIO_MMD_AN, val);
>> > +   }
>> > +}
>> > +
>> > +static int rtl8211x_config_init(struct phy_device *phydev)
>> > +{
>> > +   int ret;
>> > +
>> > +   ret = genphy_config_init(phydev);
>> > +   if (ret < 0)
>> > +   return ret;
>> > +
>> > +   rtl8211x_clear_eee_adv(phydev);
>> > +
>> > +   return 0;
>> > +}
>> > +
>> >  static int rtl8211f_config_init(struct phy_device *phydev)
>> >  {
>> > int ret;
>> > u16 reg;
>> >
>> > -   ret = genphy_config_init(phydev);
>> > +   ret = rtl8211x_config_init(phydev);
>> > if (ret < 0)
>> > return ret;
>> >
>> > @@ -115,6 +153,26 @@ static int rtl8211f_config_init(struct
>> > phy_device *phydev)
>> > return 0;
>> >  }
>> >
>> > +static int rtl8211x_phy_probe(struct phy_device *phydev)
>> > +{
>> > +   struct device *dev = &phydev->mdio.dev;
>> > +   struct device_node *of_node = dev->of_node;
>> > +   struct rtl8211x_phy_priv *priv;
>> > +
>> > +   priv = devm_kzalloc(dev, sizeof(*priv), GFP_KERNEL);
>> > +   if (!priv)
>> > +   return -ENOMEM;
>> > +
>> > +   priv->eee_1000t_disable =
>> > +   of_property_read_bool(of_node, "realtek,disable-
>> > eee-1000t");
>> > +   priv->eee_100tx_disable =
>> > +   of_property_read_bool(of_node, "realtek,disable-
>> > eee-100tx");
>> > +
>> > +   phydev->priv = priv;
>> > +
>> > +   return 0;
>> > +}
>> > +
>> >  static struct phy_driver realtek_drvs[] = {
>> > {
>> > .phy_id = 0x8201,
>> > @@ -140,7 +198,9 @@ static struct phy_driver realtek_drvs[] = {
>> > .phy_id_mask= 0x001f,
>> > .features   = PHY_GBIT_FEATURES,
>> > .flags  = PHY_HAS_INTERRUPT,
>> > +   .probe  = &rtl8211x_phy_probe,
>> > .config_aneg= genphy_config_aneg,
>> > +   .config_init= &rtl8211x_config_init,
>> > .read_status= genphy_read_status,
>> > .ack_interrupt  = rtl821x_ack_interrupt,
>> > .config_intr= rtl8211e_config_intr,
>> > @@ -152,7 +212,9 @@ static struct phy_driver realtek_drvs[] = {
>> > .phy_id_mask= 0x001f,
>> > .features   = PHY_GBIT_FEATURES,
>> > .flags  = PHY_HAS_INTERRUPT,
>> > +   .probe  = &rtl8211x_phy_probe,
>> > .config_aneg= &genphy_config_aneg,
>> > +   .config_init= &rtl8211x_config_i
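
For context, with the patch under discussion the workaround would be
selected from the board's device tree. A hypothetical fragment (node name,
parent label and PHY address invented for illustration; only the two
realtek,* properties come from the patch):

&mdio {
	ethernet-phy@0 {
		reg = <0>;
		realtek,disable-eee-1000t;
		realtek,disable-eee-100tx;
	};
};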

[PATCH net v2 6/7] net: ethernet: ti: cpsw: add missing sanity check

2016-11-17 Thread Johan Hovold
Make sure to check for allocation failures before dereferencing a
NULL-pointer during probe.

Fixes: 649a1688c960 ("net: ethernet: ti: cpsw: create common struct to
hold shared driver data")
Signed-off-by: Johan Hovold 
---
 drivers/net/ethernet/ti/cpsw.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 11b2daef3158..1387299030e4 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2588,6 +2588,9 @@ static int cpsw_probe(struct platform_device *pdev)
int irq;
 
cpsw = devm_kzalloc(&pdev->dev, sizeof(struct cpsw_common), GFP_KERNEL);
+   if (!cpsw)
+   return -ENOMEM;
+
cpsw->dev = &pdev->dev;
 
ndev = alloc_etherdev_mq(sizeof(struct cpsw_priv), CPSW_MAX_QUEUES);
-- 
2.7.3



[PATCH ethtool v3 2/2] Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY downshift

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

Add ethtool get and set tunable support to access PHY drivers.

Ethtool Help: ethtool -h for PHY tunables
ethtool --set-phy-tunable DEVNAME  Set PHY tunable
[ downshift on|off [count N] ]
ethtool --get-phy-tunable DEVNAME  Get PHY tunable
[ downshift ]

Ethtool ex:
  ethtool --set-phy-tunable eth0 downshift on
  ethtool --set-phy-tunable eth0 downshift off
  ethtool --set-phy-tunable eth0 downshift on count 2

  ethtool --get-phy-tunable eth0 downshift

Signed-off-by: Raju Lakkaraju 
Signed-off-by: Allan W. Nielsen 
---
 ethtool.8.in |  39 
 ethtool.c| 144 +++
 2 files changed, 183 insertions(+)

diff --git a/ethtool.8.in b/ethtool.8.in
index 9631847..337d0cf 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -340,6 +340,18 @@ ethtool \- query or control network driver and hardware 
settings
 .B2 tx-lpi on off
 .BN tx-timer
 .BN advertise
+.HP
+.B ethtool \-\-set\-phy\-tunable
+.I devname
+.RB [
+.B downshift
+.A1 on off
+.BN count
+.RB ]
+.HP
+.B ethtool \-\-get\-phy\-tunable
+.I devname
+.RB [ downshift ]
 .
 .\" Adjust lines (i.e. full justification) and hyphenate.
 .ad
@@ -947,6 +959,33 @@ Values are as for
 Sets the amount of time the device should stay in idle mode prior to asserting
 its Tx LPI (in microseconds). This has meaning only when Tx LPI is enabled.
 .RE
+.TP
+.B \-\-set\-phy\-tunable
+Sets the PHY tunable parameters.
+.RS 4
+.TP
+.A2 downshift on off
+Specifies whether downshift should be enabled
+.TS
+nokeep;
+lB l.
+.BI count \ N
+Sets the PHY downshift re-tries count.
+.TE
+.PD
+.RE
+.TP
+.B \-\-get\-phy\-tunable
+Gets the PHY tunable parameters.
+.RS 4
+.TP
+.B downshift
For operation in cabling environments that are incompatible with 1000BASE-T,
the PHY device provides an automatic link speed downshift operation:
the link speed is downshifted after N failed 1000BASE-T auto-negotiation attempts.
+
+Gets the PHY downshift count/status.
+.RE
 .SH BUGS
 Not supported (in part or whole) on all network drivers.
 .SH AUTHOR
diff --git a/ethtool.c b/ethtool.c
index 49ac94e..7dcd005 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -4520,6 +4520,146 @@ static int do_seee(struct cmd_context *ctx)
return 0;
 }
 
+static int do_get_phy_tunable(struct cmd_context *ctx)
+{
+   int argc = ctx->argc;
+   char **argp = ctx->argp;
+   int err, i;
+   u8 downshift_changed = 0;
+
+   if (argc < 1)
+   exit_bad_args();
+   for (i = 0; i < argc; i++) {
+   if (!strcmp(argp[i], "downshift")) {
+   downshift_changed = 1;
+   i += 1;
+   if (i < argc)
+   exit_bad_args();
+   } else  {
+   exit_bad_args();
+   }
+   }
+
+   if (downshift_changed) {
+   struct ethtool_tunable ds;
+   u8 count = 0;
+
+   ds.cmd = ETHTOOL_PHY_GTUNABLE;
+   ds.id = ETHTOOL_PHY_DOWNSHIFT;
+   ds.type_id = ETHTOOL_TUNABLE_U8;
+   ds.len = 1;
+   ds.data[0] = &count;
+   err = send_ioctl(ctx, &ds);
+   if (err < 0) {
+   perror("Cannot Get PHY downshift count");
+   return 87;
+   }
+   count = *((u8 *)&ds.data[0]);
+   if (count)
+   fprintf(stdout, "Downshift count: %d\n", count);
+   else
+   fprintf(stdout, "Downshift disabled\n");
+   }
+
+   return err;
+}
+
+static int parse_named_bool(struct cmd_context *ctx, const char *name, u8 *on)
+{
+   if (ctx->argc < 2)
+   return 0;
+
+   if (strcmp(*ctx->argp, name))
+   return 0;
+
+   if (!strcmp(*(ctx->argp + 1), "on")) {
+   *on = 1;
+   } else if (!strcmp(*(ctx->argp + 1), "off")) {
+   *on = 0;
+   } else {
+   fprintf(stderr, "Invalid boolean\n");
+   exit_bad_args();
+   }
+
+   ctx->argc -= 2;
+   ctx->argp += 2;
+
+   return 1;
+}
+
+static int parse_named_u8(struct cmd_context *ctx, const char *name, u8 *val)
+{
+   if (ctx->argc < 2)
+   return 0;
+
+   if (strcmp(*ctx->argp, name))
+   return 0;
+
+   *val = get_uint_range(*(ctx->argp + 1), 0, 0xff);
+
+   ctx->argc -= 2;
+   ctx->argp += 2;
+
+   return 1;
+}
+
+static int do_set_phy_tunable(struct cmd_context *ctx)
+{
+   int err = 0;
+   u8 ds_cnt = DOWNSHIFT_DEV_DEFAULT_COUNT;
+   u8 ds_changed = 0, ds_has_cnt = 0, ds_enable = 0;
+
+   if (ctx->argc == 0)
+   exit_bad_args();
+
+   /* Parse arguments */
+   while (ctx->argc) {
+   if (parse_named_bool(ctx, "downshift", &ds_enable)) {
+   ds_changed = 1;
+   ds_has_cnt = parse

[PATCH net-next v3 4/5] ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

Add validation support for the ETHTOOL_PHY_DOWNSHIFT tunable. The functional
implementation needs to be done in the individual PHY drivers.

Signed-off-by: Raju Lakkaraju 
Reviewed-by: Andrew Lunn 
Signed-off-by: Allan W. Nielsen 
---
 net/core/ethtool.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 61aebdf..e9b45567 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -122,6 +122,7 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] = 
{
 static const char
 phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
[ETHTOOL_ID_UNSPEC] = "Unspec",
+   [ETHTOOL_PHY_DOWNSHIFT] = "phy-downshift",
 };
 
 static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
@@ -2435,6 +2436,11 @@ static int ethtool_set_per_queue(struct net_device *dev, 
void __user *useraddr)
 static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
 {
switch (tuna->id) {
+   case ETHTOOL_PHY_DOWNSHIFT:
+   if (tuna->len != sizeof(u8) ||
+   tuna->type_id != ETHTOOL_TUNABLE_U8)
+   return -EINVAL;
+   break;
default:
return -EINVAL;
}
-- 
2.7.3



[PATCH net-next v3 2/5] ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

Add get_tunable/set_tunable function pointers to the phy_driver
structure, and use these function pointers to implement the
ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE ioctls.

Signed-off-by: Raju Lakkaraju 
Reviewed-by: Andrew Lunn 
Signed-off-by: Allan W. Nielsen 
---
 include/linux/phy.h |  7 +
 net/core/ethtool.c  | 87 +
 2 files changed, 94 insertions(+)

diff --git a/include/linux/phy.h b/include/linux/phy.h
index 9880d73..3d35c36 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -611,6 +611,13 @@ struct phy_driver {
void (*get_strings)(struct phy_device *dev, u8 *data);
void (*get_stats)(struct phy_device *dev,
  struct ethtool_stats *stats, u64 *data);
+
+   /* Get and Set PHY tunables */
+   int (*get_tunable)(struct phy_device *dev,
+  struct ethtool_tunable *tuna, void *data);
+   int (*set_tunable)(struct phy_device *dev,
+   struct ethtool_tunable *tuna,
+   const void *data);
 };
 #define to_phy_driver(d) container_of(to_mdio_common_driver(d),
\
  struct phy_driver, mdiodrv)
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 9774898..61aebdf 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -119,6 +119,11 @@ tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN] 
= {
[ETHTOOL_TX_COPYBREAK]  = "tx-copybreak",
 };
 
+static const char
+phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN] = {
+   [ETHTOOL_ID_UNSPEC] = "Unspec",
+};
+
 static int ethtool_get_features(struct net_device *dev, void __user *useraddr)
 {
struct ethtool_gfeatures cmd = {
@@ -227,6 +232,9 @@ static int __ethtool_get_sset_count(struct net_device *dev, 
int sset)
if (sset == ETH_SS_TUNABLES)
return ARRAY_SIZE(tunable_strings);
 
+   if (sset == ETH_SS_PHY_TUNABLES)
+   return ARRAY_SIZE(phy_tunable_strings);
+
if (sset == ETH_SS_PHY_STATS) {
if (dev->phydev)
return phy_get_sset_count(dev->phydev);
@@ -253,6 +261,8 @@ static void __ethtool_get_strings(struct net_device *dev,
   sizeof(rss_hash_func_strings));
else if (stringset == ETH_SS_TUNABLES)
memcpy(data, tunable_strings, sizeof(tunable_strings));
+   else if (stringset == ETH_SS_PHY_TUNABLES)
+   memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
else if (stringset == ETH_SS_PHY_STATS) {
struct phy_device *phydev = dev->phydev;
 
@@ -2422,6 +2432,76 @@ static int ethtool_set_per_queue(struct net_device *dev, 
void __user *useraddr)
};
 }
 
+static int ethtool_phy_tunable_valid(const struct ethtool_tunable *tuna)
+{
+   switch (tuna->id) {
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static int get_phy_tunable(struct net_device *dev, void __user *useraddr)
+{
+   int ret;
+   struct ethtool_tunable tuna;
+   struct phy_device *phydev = dev->phydev;
+   void *data;
+
+   if (!(phydev && phydev->drv && phydev->drv->get_tunable))
+   return -EOPNOTSUPP;
+
+   if (copy_from_user(&tuna, useraddr, sizeof(tuna)))
+   return -EFAULT;
+   ret = ethtool_phy_tunable_valid(&tuna);
+   if (ret)
+   return ret;
+   data = kmalloc(tuna.len, GFP_USER);
+   if (!data)
+   return -ENOMEM;
+   ret = phydev->drv->get_tunable(phydev, &tuna, data);
+   if (ret)
+   goto out;
+   useraddr += sizeof(tuna);
+   ret = -EFAULT;
+   if (copy_to_user(useraddr, data, tuna.len))
+   goto out;
+   ret = 0;
+
+out:
+   kfree(data);
+   return ret;
+}
+
+static int set_phy_tunable(struct net_device *dev, void __user *useraddr)
+{
+   int ret;
+   struct ethtool_tunable tuna;
+   struct phy_device *phydev = dev->phydev;
+   void *data;
+
+   if (!(phydev && phydev->drv && phydev->drv->set_tunable))
+   return -EOPNOTSUPP;
+   if (copy_from_user(&tuna, useraddr, sizeof(tuna)))
+   return -EFAULT;
+   ret = ethtool_phy_tunable_valid(&tuna);
+   if (ret)
+   return ret;
+   data = kmalloc(tuna.len, GFP_USER);
+   if (!data)
+   return -ENOMEM;
+   useraddr += sizeof(tuna);
+   ret = -EFAULT;
+   if (copy_from_user(data, useraddr, tuna.len))
+   goto out;
+   ret = phydev->drv->set_tunable(phydev, &tuna, data);
+
+out:
+   kfree(data);
+   return ret;
+}
+
 /* The main entry point in this file.  Called from net/core/dev_ioctl.c */
 
 int dev_ethtool(struct net *net, struct ifreq *ifr)
@@ -2479,6 +2559,7 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
case ETHTOOL_GET_TS_INFO:
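
Since the payload placement is easy to get wrong, note that both handlers
above read and write the tunable data at useraddr + sizeof(tuna), i.e.
directly after the header in user memory. A hedged sketch of a matching
userspace caller (helper name and buffer handling invented for
illustration; ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_DOWNSHIFT only exist
with this series applied):

#include <string.h>
#include <sys/ioctl.h>
#include <net/if.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Header and payload must sit back to back in one buffer. */
static int get_downshift_count(int fd, const char *ifname, __u8 *count)
{
	unsigned char buf[sizeof(struct ethtool_tunable) + 1]
		__attribute__((aligned(sizeof(void *)))) = { 0 };
	struct ethtool_tunable *tuna = (struct ethtool_tunable *)buf;
	struct ifreq ifr;

	tuna->cmd = ETHTOOL_PHY_GTUNABLE;
	tuna->id = ETHTOOL_PHY_DOWNSHIFT;
	tuna->type_id = ETHTOOL_TUNABLE_U8;
	tuna->len = 1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_data = (char *)buf;
	if (ioctl(fd, SIOCETHTOOL, &ifr) < 0)
		return -1;
	/* get_phy_tunable() copied the u8 result just past the header. */
	*count = buf[sizeof(struct ethtool_tunable)];
	return 0;
}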

Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Rick Jones

On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:

time to try IP_MTU_DISCOVER ;)


To Rick, maybe you can find a good solution or option with Eric's hint,
to send appropriate sized UDP packets with Don't Fragment (DF).


Well, I suppose adding another setsockopt() to the data socket creation 
wouldn't be too difficult, along with another command-line option to 
cause it to happen.


Could we leave things as "make sure you don't need fragmentation when 
you use this" or would netperf have to start processing ICMP messages?


happy benchmarking,

rick jones
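
For the record, the setsockopt() being mused about is small. A sketch of
what would go into netperf's data-socket setup (function name invented;
Linux-specific IP_MTU_DISCOVER / IP_PMTUDISC_DO semantics assumed):

#include <sys/socket.h>
#include <netinet/in.h>

/* Ask for DF on outgoing datagrams. With IP_PMTUDISC_DO the kernel
 * refuses to fragment: an oversized send() fails with EMSGSIZE instead
 * of silently producing fragmented test traffic. */
static int enable_dont_fragment(int data_socket)
{
	int val = IP_PMTUDISC_DO;

	return setsockopt(data_socket, IPPROTO_IP, IP_MTU_DISCOVER,
			  &val, sizeof(val));
}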



[PATCH net-next v3 5/5] net: phy: Add downshift get/set support in Microsemi PHYs driver

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

Implement the PHY tunable function pointers and the downshift
functionality for MSCC PHYs.

Signed-off-by: Raju Lakkaraju 
Reviewed-by: Andrew Lunn 
Signed-off-by: Allan W. Nielsen 
---
 drivers/net/phy/mscc.c | 100 +
 1 file changed, 100 insertions(+)

diff --git a/drivers/net/phy/mscc.c b/drivers/net/phy/mscc.c
index d0026ab..92018ba 100644
--- a/drivers/net/phy/mscc.c
+++ b/drivers/net/phy/mscc.c
@@ -46,8 +46,15 @@ enum rgmii_rx_clock_delay {
 
 #define MSCC_EXT_PAGE_ACCESS 31
 #define MSCC_PHY_PAGE_STANDARD   0x /* Standard registers */
+#define MSCC_PHY_PAGE_EXTENDED   0x0001 /* Extended registers */
 #define MSCC_PHY_PAGE_EXTENDED_2 0x0002 /* Extended reg - page 2 */
 
+/* Extended Page 1 Registers */
+#define MSCC_PHY_ACTIPHY_CNTL20
+#define DOWNSHIFT_CNTL_MASK  0x001C
+#define DOWNSHIFT_EN 0x0010
+#define DOWNSHIFT_CNTL_POS   2
+
 /* Extended Page 2 Registers */
 #define MSCC_PHY_RGMII_CNTL  20
 #define RGMII_RX_CLK_DELAY_MASK  0x0070
@@ -75,6 +82,8 @@ enum rgmii_rx_clock_delay {
 #define MSCC_VDDMAC_2500 2500
 #define MSCC_VDDMAC_3300 3300
 
+#define DOWNSHIFT_COUNT_MAX  5
+
 struct vsc8531_private {
int rate_magic;
 };
@@ -101,6 +110,66 @@ static int vsc85xx_phy_page_set(struct phy_device *phydev, 
u8 page)
return rc;
 }
 
+static int vsc85xx_downshift_get(struct phy_device *phydev, u8 *count)
+{
+   int rc;
+   u16 reg_val;
+
+   mutex_lock(&phydev->lock);
+   rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
+   if (rc != 0)
+   goto out_unlock;
+
+   reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
+   reg_val &= DOWNSHIFT_CNTL_MASK;
+   if (!(reg_val & DOWNSHIFT_EN))
+   *count = DOWNSHIFT_DEV_DISABLE;
+   else
+   *count = ((reg_val & ~DOWNSHIFT_EN) >> DOWNSHIFT_CNTL_POS) + 2;
+   rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
+
+out_unlock:
+   mutex_unlock(&phydev->lock);
+
+   return rc;
+}
+
+static int vsc85xx_downshift_set(struct phy_device *phydev, u8 count)
+{
+   int rc;
+   u16 reg_val;
+
+   if (count == DOWNSHIFT_DEV_DEFAULT_COUNT) {
+   /* Default downshift count 3 (i.e. Bit3:2 = 0b01) */
+   count = ((1 << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
+   } else if (count > DOWNSHIFT_COUNT_MAX || count == 1) {
+   phydev_err(phydev, "Downshift count should be 2,3,4 or 5\n");
+   return -ERANGE;
+   } else if (count) {
+   /* Downshift count is either 2,3,4 or 5 */
+   count = (((count - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN);
+   }
+
+   mutex_lock(&phydev->lock);
+   rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_EXTENDED);
+   if (rc != 0)
+   goto out_unlock;
+
+   reg_val = phy_read(phydev, MSCC_PHY_ACTIPHY_CNTL);
+   reg_val &= ~(DOWNSHIFT_CNTL_MASK);
+   reg_val |= count;
+   rc = phy_write(phydev, MSCC_PHY_ACTIPHY_CNTL, reg_val);
+   if (rc != 0)
+   goto out_unlock;
+
+   rc = vsc85xx_phy_page_set(phydev, MSCC_PHY_PAGE_STANDARD);
+
+out_unlock:
+   mutex_unlock(&phydev->lock);
+
+   return rc;
+}
+
 static int vsc85xx_wol_set(struct phy_device *phydev,
   struct ethtool_wolinfo *wol)
 {
@@ -329,6 +398,29 @@ static int vsc85xx_default_config(struct phy_device 
*phydev)
return rc;
 }
 
+static int vsc85xx_get_tunable(struct phy_device *phydev,
+  struct ethtool_tunable *tuna, void *data)
+{
+   switch (tuna->id) {
+   case ETHTOOL_PHY_DOWNSHIFT:
+   return vsc85xx_downshift_get(phydev, (u8 *)data);
+   default:
+   return -EINVAL;
+   }
+}
+
+static int vsc85xx_set_tunable(struct phy_device *phydev,
+  struct ethtool_tunable *tuna,
+  const void *data)
+{
+   switch (tuna->id) {
+   case ETHTOOL_PHY_DOWNSHIFT:
+   return vsc85xx_downshift_set(phydev, *(u8 *)data);
+   default:
+   return -EINVAL;
+   }
+}
+
 static int vsc85xx_config_init(struct phy_device *phydev)
 {
int rc;
@@ -418,6 +510,8 @@ static struct phy_driver vsc85xx_driver[] = {
.probe  = &vsc85xx_probe,
.set_wol= &vsc85xx_wol_set,
.get_wol= &vsc85xx_wol_get,
+   .get_tunable= &vsc85xx_get_tunable,
+   .set_tunable= &vsc85xx_set_tunable,
 },
 {
.phy_id = PHY_ID_VSC8531,
@@ -437,6 +531,8 @@ static struct phy_driver vsc85xx_driver[] = {
.probe  = &vsc85xx_probe,
.set_wol= &vsc85xx_wol_set,
.get_wol= &vsc85xx_wol_get,
+   .get_tunable= &vsc85xx_get_tunable,
+   
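
As a worked example of the register encoding in vsc85xx_downshift_set() and
vsc85xx_downshift_get() above: requesting count = 4 stores
((4 - 2) << DOWNSHIFT_CNTL_POS) | DOWNSHIFT_EN = 0x18 in the ACTIPHY control
field, and the get path recovers ((0x18 & ~DOWNSHIFT_EN) >>
DOWNSHIFT_CNTL_POS) + 2 = 4; with DOWNSHIFT_EN clear, the get path reports
DOWNSHIFT_DEV_DISABLE.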

[PATCH] net: sky2: Fix shutdown crash

2016-11-17 Thread Jeremy Linton
The sky2 driver frequently crashes during machine shutdown with:

sky2_get_stats+0x60/0x3d8 [sky2]
dev_get_stats+0x68/0xd8
rtnl_fill_stats+0x54/0x140
rtnl_fill_ifinfo+0x46c/0xc68
rtmsg_ifinfo_build_skb+0x7c/0xf0
rtmsg_ifinfo.part.22+0x3c/0x70
rtmsg_ifinfo+0x50/0x5c
netdev_state_change+0x4c/0x58
linkwatch_do_dev+0x50/0x88
__linkwatch_run_queue+0x104/0x1a4
linkwatch_event+0x30/0x3c
process_one_work+0x140/0x3e0
worker_thread+0x60/0x44c
kthread+0xdc/0xf0
ret_from_fork+0x10/0x50

This is caused by the sky2 driver being called after it has been shut down.
A previous thread about this can be found here:

https://lkml.org/lkml/2016/4/12/410

An alternative fix is to ensure that IFF_UP gets cleared by
calling dev_close() during shutdown. This is similar to what the
bnx2/tg3/xgene and maybe others are doing to ensure that the driver
isn't being called following _shutdown().

Signed-off-by: Jeremy Linton 
---
 drivers/net/ethernet/marvell/sky2.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/marvell/sky2.c 
b/drivers/net/ethernet/marvell/sky2.c
index f05ea56..941c8e2 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -5220,6 +5220,19 @@ static SIMPLE_DEV_PM_OPS(sky2_pm_ops, sky2_suspend, 
sky2_resume);
 
 static void sky2_shutdown(struct pci_dev *pdev)
 {
+   struct sky2_hw *hw = pci_get_drvdata(pdev);
+   int port;
+
+   for (port = 0; port < hw->ports; port++) {
+   struct net_device *ndev = hw->dev[port];
+
+   rtnl_lock();
+   if (netif_running(ndev)) {
+   dev_close(ndev);
+   netif_device_detach(ndev);
+   }
+   rtnl_unlock();
+   }
sky2_suspend(&pdev->dev);
pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
pci_set_power_state(pdev, PCI_D3hot);
-- 
2.5.5



[PATCH net v2 1/7] net: ethernet: ti: cpsw: fix bad register access in probe error path

2016-11-17 Thread Johan Hovold
Make sure to keep the platform device runtime-resumed throughout probe
to avoid accessing the CPSW registers in the error path (e.g. for
deferred probe) with clocks disabled:

Unhandled fault: external abort on non-linefetch (0x1008) at 0xd0872d08
...
[] (cpsw_ale_control_set) from [] 
(cpsw_ale_destroy+0x2c/0x44)
[] (cpsw_ale_destroy) from [] (cpsw_probe+0xbd0/0x10c4)
[] (cpsw_probe) from [] (platform_drv_probe+0x5c/0xc0)

Fixes: df828598a755 ("netdev: driver: ethernet: Add TI CPSW driver")
Signed-off-by: Johan Hovold 
---
 drivers/net/ethernet/ti/cpsw.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c6cff3d2ff05..f60f8ab7c1e3 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2641,13 +2641,12 @@ static int cpsw_probe(struct platform_device *pdev)
goto clean_runtime_disable_ret;
}
cpsw->version = readl(&cpsw->regs->id_ver);
-   pm_runtime_put_sync(&pdev->dev);
 
res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
cpsw->wr_regs = devm_ioremap_resource(&pdev->dev, res);
if (IS_ERR(cpsw->wr_regs)) {
ret = PTR_ERR(cpsw->wr_regs);
-   goto clean_runtime_disable_ret;
+   goto clean_pm_runtime_put_ret;
}
 
memset(&dma_params, 0, sizeof(dma_params));
@@ -2684,7 +2683,7 @@ static int cpsw_probe(struct platform_device *pdev)
default:
dev_err(priv->dev, "unknown version 0x%08x\n", cpsw->version);
ret = -ENODEV;
-   goto clean_runtime_disable_ret;
+   goto clean_pm_runtime_put_ret;
}
for (i = 0; i < cpsw->data.slaves; i++) {
struct cpsw_slave *slave = &cpsw->slaves[i];
@@ -2713,7 +2712,7 @@ static int cpsw_probe(struct platform_device *pdev)
if (!cpsw->dma) {
dev_err(priv->dev, "error initializing dma\n");
ret = -ENOMEM;
-   goto clean_runtime_disable_ret;
+   goto clean_pm_runtime_put_ret;
}
 
cpsw->txch[0] = cpdma_chan_create(cpsw->dma, 0, cpsw_tx_handler, 0);
@@ -2815,12 +2814,16 @@ static int cpsw_probe(struct platform_device *pdev)
}
}
 
+   pm_runtime_put(&pdev->dev);
+
return 0;
 
 clean_ale_ret:
cpsw_ale_destroy(cpsw->ale);
 clean_dma_ret:
cpdma_ctlr_destroy(cpsw->dma);
+clean_pm_runtime_put_ret:
+   pm_runtime_put_sync(&pdev->dev);
 clean_runtime_disable_ret:
pm_runtime_disable(&pdev->dev);
 clean_ndev_ret:
-- 
2.7.3



RE: Netperf UDP issue with connected sockets

2016-11-17 Thread David Laight
From: Jesper Dangaard Brouer
> Sent: 17 November 2016 14:58
> On Thu, 17 Nov 2016 06:17:38 -0800
> Eric Dumazet  wrote:
> 
> > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> >
> > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> >
> > Sure. Not enough pressure from the sender(s).
> >
> > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > limit is kept at a small value.
> >
> > (BTW not all NIC have expensive doorbells)
> 
> I believe this NIC mlx5 (50G edition) does.
> 
> I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> 2414 cycles(tsc) 603.86 ns
> 
> Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> 
>  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
>Overhead  CommandShared ObjectSymbol
>  +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
>- _raw_spin_lock
>   + 90.78% __dev_queue_xmit
>   + 7.83% dev_queue_xmit
>   + 1.30% ___slab_alloc
>  +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w
>  +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
>  +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup
>  +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
>  +3.11%  udp_flood  [kernel.vmlinux] [k] 
> __ip_route_output_key_hash
>  +2.49%  swapper[kernel.vmlinux] [k] __slab_free
> 
> In this setup the spinlock in __dev_queue_xmit should be uncongested.
> An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> 
> But 8.92% of the time is spend on it, which corresponds to a cost of 215
> cycles (2414*0.0892).  This cost is too high, thus something else is
> going on... I claim this mysterious extra cost is the tailptr/doorbell.

Try adding code to ring the doorbell twice.
If this doesn't slow things down then it isn't (likely to be) responsible
for the delay you are seeing.

David



[PATCH net v2 4/7] net: ethernet: ti: cpsw: fix of_node and phydev leaks

2016-11-17 Thread Johan Hovold
Make sure to drop references taken and deregister devices registered
during probe on probe errors (including deferred probe) and driver
unbind.

Specifically, PHY of-node references were never released and fixed-link
PHY devices were never deregistered.

Fixes: 9e42f715264f ("drivers: net: cpsw: add phy-handle parsing")
Fixes: 1f71e8c96fc6 ("drivers: net: cpsw: Add support for fixed-link
PHY")
Signed-off-by: Johan Hovold 
---
 drivers/net/ethernet/ti/cpsw.c | 35 +++
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 5d14abb06486..c3b78bc4fe58 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2443,6 +2443,41 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
 
 static void cpsw_remove_dt(struct platform_device *pdev)
 {
+   struct net_device *ndev = platform_get_drvdata(pdev);
+   struct cpsw_common *cpsw = ndev_to_cpsw(ndev);
+   struct cpsw_platform_data *data = &cpsw->data;
+   struct device_node *node = pdev->dev.of_node;
+   struct device_node *slave_node;
+   int i = 0;
+
+   for_each_available_child_of_node(node, slave_node) {
+   struct cpsw_slave_data *slave_data = &data->slave_data[i];
+
+   if (strcmp(slave_node->name, "slave"))
+   continue;
+
+   if (of_phy_is_fixed_link(slave_node)) {
+   struct phy_device *phydev;
+
+   phydev = of_phy_find_device(slave_node);
+   if (phydev) {
+   fixed_phy_unregister(phydev);
+   /* Put references taken by
+* of_phy_find_device() and
+* of_phy_register_fixed_link().
+*/
+   phy_device_free(phydev);
+   phy_device_free(phydev);
+   }
+   }
+
+   of_node_put(slave_data->phy_node);
+
+   i++;
+   if (i == data->slaves)
+   break;
+   }
+
of_platform_depopulate(&pdev->dev);
 }
 
-- 
2.7.3



[PATCH net-next v3 3/5] ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

For operation in cabling environments that are incompatible with
1000BASE-T, the PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T.  This feature is useful when setting up
networks using older cable installations that include only pairs A and B,
and not pairs C and D.

Signed-off-by: Raju Lakkaraju 
Signed-off-by: Allan W. Nielsen 
Reviewed-by: Andrew Lunn 
---
 include/uapi/linux/ethtool.h | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 42f696f..f0db778 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -248,9 +248,12 @@ struct ethtool_tunable {
void*data[0];
 };
 
+#define DOWNSHIFT_DEV_DEFAULT_COUNT0xff
+#define DOWNSHIFT_DEV_DISABLE  0
+
 enum phy_tunable_id {
ETHTOOL_PHY_ID_UNSPEC,
-
+   ETHTOOL_PHY_DOWNSHIFT,
/*
 * Add your fresh new phy tunable attribute above and remember to update
 * phy_tunable_strings[] in net/core/ethtool.c
-- 
2.7.3



[PATCH ethtool v3 1/2] ethtool-copy.h:sync with net

2016-11-17 Thread Allan W. Nielsen
From: Raju Lakkaraju 

This covers kernel changes up to:

commit f5a4732f85613b3fb43f8bc33a017e3db3b3605a
Author: Raju Lakkaraju 
Date:   Wed Nov 9 16:33:09 2016 +0530

ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables

For operation in cabling environments that are incompatible with
1000BASE-T, the PHY device may provide an automatic link speed downshift
operation. When enabled, the device automatically changes its 1000BASE-T
auto-negotiation to the next slower speed after a configured number of
failed attempts at 1000BASE-T.  This feature is useful when setting up
networks using older cable installations that include only pairs A and B,
and not pairs C and D.

Signed-off-by: Raju Lakkaraju 
Signed-off-by: Allan W. Nielsen 
---
 ethtool-copy.h | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 70748f5..2e2448f 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -247,6 +247,19 @@ struct ethtool_tunable {
void*data[0];
 };
 
+#define DOWNSHIFT_DEV_DEFAULT_COUNT0xff
+#define DOWNSHIFT_DEV_DISABLE  0
+
+enum phy_tunable_id {
+   ETHTOOL_PHY_ID_UNSPEC,
+   ETHTOOL_PHY_DOWNSHIFT,
+   /*
+* Add your fresh new phy tunable attribute above and remember to update
+* phy_tunable_strings[] in net/core/ethtool.c
+*/
+   __ETHTOOL_PHY_TUNABLE_COUNT,
+};
+
 /**
  * struct ethtool_regs - hardware register dump
  * @cmd: Command number = %ETHTOOL_GREGS
@@ -547,6 +560,7 @@ struct ethtool_pauseparam {
  * @ETH_SS_FEATURES: Device feature names
  * @ETH_SS_RSS_HASH_FUNCS: RSS hush function names
  * @ETH_SS_PHY_STATS: Statistic names, for use with %ETHTOOL_GPHYSTATS
+ * @ETH_SS_PHY_TUNABLES: PHY tunable names
  */
 enum ethtool_stringset {
ETH_SS_TEST = 0,
@@ -557,6 +571,7 @@ enum ethtool_stringset {
ETH_SS_RSS_HASH_FUNCS,
ETH_SS_TUNABLES,
ETH_SS_PHY_STATS,
+   ETH_SS_PHY_TUNABLES,
 };
 
 /**
@@ -1312,7 +1327,8 @@ struct ethtool_per_queue_op {
 
 #define ETHTOOL_GLINKSETTINGS  0x004c /* Get ethtool_link_settings */
 #define ETHTOOL_SLINKSETTINGS  0x004d /* Set ethtool_link_settings */
-
+#define ETHTOOL_PHY_GTUNABLE   0x004e /* Get PHY tunable configuration */
+#define ETHTOOL_PHY_STUNABLE   0x004f /* Set PHY tunable configuration */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET ETHTOOL_GSET
-- 
2.7.3



[PATCH net-next v3 0/5] Adding PHY-Tunables and downshift support

2016-11-17 Thread Allan W. Nielsen
Hi All,

(This is a re-post of the v3 patch set with a new cover letter - I was not
aware that the cover letters were used as commit comments in merge commits).

This series adds support for PHY tunables, and uses this facility to
configure downshifting. The downshifting mechanism is implemented for MSCC
PHYs.

This series tries to address the comments provided back in mid-October when
this feature was posted along with fast-link-failure. Fast-link-failure has
been separated out, but we would like to continue on that if/when we agree
on how the phy-tunables and downshifting should be done.

The proposed generic interface is similar to
ETHTOOL_GTUNABLE/ETHTOOL_STUNABLE: it uses the same type
(ethtool_tunable/tunable_type_id), but a new enum (phy_tunable_id) is added
to reflect the PHY tunable.

The implementation just calls the newly added get_tunable/set_tunable
function pointers in the phy_driver structure.

To configure downshifting, the ethtool_tunable structure is used. 'id' must
be set to 'ETHTOOL_PHY_DOWNSHIFT', 'type_id' must be set to
'ETHTOOL_TUNABLE_U8', and the 'data' value configures the number of
downshift retries.

If configured to DOWNSHIFT_DEV_DISABLE, downshift is disabled. If
configured to DOWNSHIFT_DEV_DEFAULT_COUNT, it is up to the device to
choose a device-specific retry count.

Tested on Beaglebone Black with VSC 8531 PHY.

Change set:
v0:

- Link speed downshift and fast link failure-2 features implemented using
  the device tree.
v1: 
- Split the downshift and FLF2 features into different sets of patches.
- Removed DT access and implemented the IOCTL access suggested by Andrew.
- Added get_tunable/set_tunable function pointers to the phy_driver structure
v2:
- Added a trace message with a hint that is printed when downshifting could
  not be enabled with the requested count
- (ethtool) Syntax is changed from "--set-phy-tunable downshift on|off|%d"
  to "--set-phy-tunable [downshift on|off [count N]]" - as requested by
  Andrew.
v3:
- Fixed Spelling in "net: phy: Add downshift get/set support in Microsemi
  PHYs driver"

Raju Lakkaraju (5):
  ethtool: (uapi) Add ETHTOOL_PHY_GTUNABLE and ETHTOOL_PHY_STUNABLE
  ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE
  ethtool: (uapi) Add ETHTOOL_PHY_DOWNSHIFT to PHY tunables
  ethtool: Core impl for ETHTOOL_PHY_DOWNSHIFT tunable
  net: phy: Add downshift get/set support in Microsemi PHYs driver

 drivers/net/phy/mscc.c   | 100 +++
 include/linux/phy.h  |   7 +++
 include/uapi/linux/ethtool.h |  18 +++-
 net/core/ethtool.c   |  93 
 4 files changed, 217 insertions(+), 1 deletion(-)

-- 
2.7.4



[PATCH net v2 0/7] net: cpsw: fix leaks and probe deferral

2016-11-17 Thread Johan Hovold
This series fixes a number of leaks and issues in the cpsw probe-error
and driver-unbind paths, some of which specifically prevented deferred
probing.

Johan


v2
 - Keep platform device runtime-resumed throughout probe instead of
   resuming in the probe error path as suggested by Grygorii (patch
   1/7).

 - Runtime-resume platform device before registering any children in
   order to make sure it is synchronously suspended after deregistering
   children in the error path (patch 3/7).


Johan Hovold (7):
  net: ethernet: ti: cpsw: fix bad register access in probe error path
  net: ethernet: ti: cpsw: fix mdio device reference leak
  net: ethernet: ti: cpsw: fix deferred probe
  net: ethernet: ti: cpsw: fix of_node and phydev leaks
  net: ethernet: ti: cpsw: fix secondary-emac probe error path
  net: ethernet: ti: cpsw: add missing sanity check
  net: ethernet: ti: cpsw: fix fixed-link phy probe deferral

 drivers/net/ethernet/ti/cpsw.c | 95 --
 1 file changed, 74 insertions(+), 21 deletions(-)

-- 
2.7.3



[PATCH ethtool v3 0/2] Adding downshift support to ethtool

2016-11-17 Thread Allan W. Nielsen
Hi All,

(This is a re-post of the v3 patch set with a new cover letter - I was not
aware that the cover letters were used as commit comments in merge commits).

This patch set implements set/get support for downshifting.

Downshifting can either be turned on/off, or it can be configured to a
specific count.

"count" is optional.

Tested on Beaglebone Black with VSC 8531 PHY.

Change set:
v1:
- Initial version of set/get phy tunable with downshift feature.
v2:
- (ethtool) Syntax is changed from "--set-phy-tunable downshift on|off|%d"
  to "--set-phy-tunable [downshift on|off [count N]]" - as requested by
  Andrew.
v3:
- Fixed Spelling in "ethtool-copy.h:sync with net" 
- Fixed "if send_ioctl() returns an error, print the error message and then
  still print th value of count".

Raju Lakkaraju (2):
  ethtool-copy.h:sync with net
  Ethtool: Implements ETHTOOL_PHY_GTUNABLE/ETHTOOL_PHY_STUNABLE and PHY
downshift

 ethtool-copy.h |  18 +++-
 ethtool.8.in   |  39 
 ethtool.c  | 144 +
 3 files changed, 200 insertions(+), 1 deletion(-)

-- 
2.7.4



[PATCH net v2 2/7] net: ethernet: ti: cpsw: fix mdio device reference leak

2016-11-17 Thread Johan Hovold
Make sure to drop the reference taken by of_find_device_by_node() when
looking up an mdio device from a phy_id property during probe.

Fixes: 549985ee9c72 ("cpsw: simplify the setup of the register
pointers")
Signed-off-by: Johan Hovold 
---
 drivers/net/ethernet/ti/cpsw.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index f60f8ab7c1e3..84c5d214557e 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2397,6 +2397,7 @@ static int cpsw_probe_dt(struct cpsw_platform_data *data,
}
snprintf(slave_data->phy_id, sizeof(slave_data->phy_id),
 PHY_ID_FMT, mdio->name, phyid);
+   put_device(&mdio->dev);
} else {
dev_err(&pdev->dev,
"No slave[%d] phy_id, phy-handle, or fixed-link 
property\n",
-- 
2.7.3



Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN

2016-11-17 Thread Neftin, Sasha
On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
> On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
>> Hello Sasha,
>>
>> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:
 Move IRQ free code so that it will happen regardless of the
 __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
 if the __E1000_DOWN bit is cleared. This is not sufficient because
 it is possible for __E1000_DOWN to be set without releasing the IRQ.
 In such a situation, we will hit a kernel bug later in e1000_remove
 because the IRQ still has action since it was never freed. A
 secondary bus reset can cause this case to happen.

 Signed-off-by: Tyler Baicar 
 ---
   drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

 diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c
 b/drivers/net/ethernet/intel/e1000e/netdev.c
 index 7017281..36cfcb0 100644
 --- a/drivers/net/ethernet/intel/e1000e/netdev.c
 +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
 @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
 if (!test_bit(__E1000_DOWN, &adapter->state)) {
   e1000e_down(adapter, true);
 -e1000_free_irq(adapter);
 /* Link status message must follow this format */
   pr_info("%s NIC Link is Down\n", adapter->netdev->name);
   }   
   +e1000_free_irq(adapter);
 +
   napi_disable(&adapter->napi);
 e1000e_free_tx_resources(adapter->tx_ring);

>>> I would not recommend inserting this change. This change touches the
>>> driver state machine, and we are afraid of a lot of synchronization
>>> problems and issues.
>>> We need to keep e1000_free_irq inside the loop and check that 'test_bit'
>>> is ready.
>>
>> What do you mean here? There is no loop. If __E1000_DOWN is set then we
>> will never free the IRQ.
>>
>>> Another point: before executing the secondary bus reset, does your SW
>>> back up the PCIe configuration space properly?
>>
>> After a secondary bus reset, the link needs to recover and go back to a
>> working state after 1 second.
>>
>> From the callstack, the issue is happening while removing the endpoint
>> from the system, before applying the secondary bus reset.
>>
>> The order of events is
>> 1. remove the drivers
>> 2. cause a secondary bus reset
>> 3. wait 1 second
> Actually, this is too much; the link usually comes up in less than 100ms. You can
> check Data Link Layer indication.
>> 4. recover the link
>>
>> callstack:
>> free_msi_irqs+0x6c/0x1a8
>> pci_disable_msi+0xb0/0x148
>> e1000e_reset_interrupt_capability+0x60/0x78
>> e1000_remove+0xc8/0x180
>> pci_device_remove+0x48/0x118
>> __device_release_driver+0x80/0x108
>> device_release_driver+0x2c/0x40
>> pci_stop_bus_device+0xa0/0xb0
>> pci_stop_bus_device+0x3c/0xb0
>> pci_stop_root_bus+0x54/0x80
>> acpi_pci_root_remove+0x28/0x64
>> acpi_bus_trim+0x6c/0xa4
>> acpi_device_hotplug+0x19c/0x3f4
>> acpi_hotplug_work_fn+0x28/0x3c
>> process_one_work+0x150/0x460
>> worker_thread+0x50/0x4b8
>> kthread+0xd4/0xe8
>> ret_from_fork+0x10/0x50
>>
>> Thanks,
>> Tyler
>>
> Hello Tyler,
> Okay, we need to consult more about this suggestion.
> May I ask what setup you run? Is it a NIC or on-board LAN? I would
> like to try to reproduce this issue in our labs too.
> Also, is the same issue observed with the same scenario on other NICs too?
> Sasha
> ___
> Intel-wired-lan mailing list
> intel-wired-...@lists.osuosl.org
> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> 
Hello Tyler,
I see some inconsistent implementations of the __*_close methods in our
drivers. Do you have any igb NIC to check if the same problem persists there?
Thanks,
Sasha


Re: Netperf UDP issue with connected sockets

2016-11-17 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 06:17:38 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> 
> > I can see that qdisc layer does not activate xmit_more in this case.
> >   
> 
> Sure. Not enough pressure from the sender(s).
> 
> The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> limit is kept at a small value.
> 
> (BTW not all NIC have expensive doorbells)

I believe this NIC mlx5 (50G edition) does.

I'm seeing UDP TX of 1656017.55 pps, which is per packet:
2414 cycles(tsc) 603.86 ns

Perf top shows (with my own udp_flood, that avoids __ip_select_ident):

 Samples: 56K of event 'cycles', Event count (approx.): 51613832267
   Overhead  CommandShared ObjectSymbol
 +8.92%  udp_flood  [kernel.vmlinux] [k] _raw_spin_lock
   - _raw_spin_lock
  + 90.78% __dev_queue_xmit
  + 7.83% dev_queue_xmit
  + 1.30% ___slab_alloc
 +5.59%  udp_flood  [kernel.vmlinux] [k] skb_set_owner_w
 +4.77%  udp_flood  [mlx5_core]  [k] mlx5e_sq_xmit
 +4.09%  udp_flood  [kernel.vmlinux] [k] fib_table_lookup
 +4.00%  swapper[mlx5_core]  [k] mlx5e_poll_tx_cq
 +3.11%  udp_flood  [kernel.vmlinux] [k] __ip_route_output_key_hash
 +2.49%  swapper[kernel.vmlinux] [k] __slab_free

In this setup the spinlock in __dev_queue_xmit should be uncongested.
An uncongested spin_lock+unlock costs 32 cycles (tsc), 8.198 ns, on this system.

But 8.92% of the time is spent on it, which corresponds to a cost of 215
cycles (2414 * 0.0892).  This cost is too high, so something else is
going on... I claim this mysterious extra cost is the tailptr/doorbell.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
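
For reference, the xmit_more/doorbell batching under discussion looks
roughly like this in a driver's transmit path -- a minimal sketch using
the 4.9-era skb->xmit_more flag, with the foo_* names invented for
illustration (this is not mlx5 code):

	static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
					  struct net_device *dev)
	{
		struct foo_tx_ring *ring = foo_pick_ring(dev, skb);
		struct netdev_queue *txq =
			netdev_get_tx_queue(dev, ring->index);

		foo_post_tx_descriptor(ring, skb);

		/* The tailptr/doorbell is an uncached MMIO write, far
		 * more expensive than the descriptor writes themselves.
		 * Ring it only when the stack signals that no more
		 * packets are queued behind this one. */
		if (!skb->xmit_more || netif_xmit_stopped(txq))
			writel(ring->tail, ring->doorbell);

		return NETDEV_TX_OK;
	}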


Re: [PATCH] net: sky2: Fix shutdown crash

2016-11-17 Thread Sudeep Holla



On 17/11/16 15:14, Jeremy Linton wrote:

The sky2 driver frequently crashes during machine shutdown with:

sky2_get_stats+0x60/0x3d8 [sky2]
dev_get_stats+0x68/0xd8
rtnl_fill_stats+0x54/0x140
rtnl_fill_ifinfo+0x46c/0xc68
rtmsg_ifinfo_build_skb+0x7c/0xf0
rtmsg_ifinfo.part.22+0x3c/0x70
rtmsg_ifinfo+0x50/0x5c
netdev_state_change+0x4c/0x58
linkwatch_do_dev+0x50/0x88
__linkwatch_run_queue+0x104/0x1a4
linkwatch_event+0x30/0x3c
process_one_work+0x140/0x3e0
worker_thread+0x60/0x44c
kthread+0xdc/0xf0
ret_from_fork+0x10/0x50

This is caused by the sky2 driver being called after it has been shut down.
A previous thread about this can be found here:

https://lkml.org/lkml/2016/4/12/410

An alternative fix is to ensure that IFF_UP gets cleared by
calling dev_close() during shutdown. This is similar to what
bnx2/tg3/xgene and maybe others do to ensure that the driver
isn't called after _shutdown().

Signed-off-by: Jeremy Linton 


Since this issue has been very random and on/off recently, it's quite
hard to test and confirm. However, I did around 20 reboot/shutdown
cycles and could not reproduce the issue after applying this patch. So,

Tested-by: Sudeep Holla 

--
Regards,
Sudeep
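
For context, the dev_close()-from-shutdown pattern the commit message
refers to looks roughly like this -- a sketch assuming sky2's per-port
netdev array, not necessarily the exact patch:

	static void sky2_shutdown(struct pci_dev *pdev)
	{
		struct sky2_hw *hw = pci_get_drvdata(pdev);
		int port;

		/* dev_close() clears IFF_UP under rtnl_lock(), so
		 * linkwatch and the stats path stop calling into the
		 * driver before the hardware is powered down. */
		rtnl_lock();
		for (port = 0; port < hw->ports; port++)
			dev_close(hw->dev[port]);
		rtnl_unlock();

		sky2_suspend(&pdev->dev);
		pci_wake_from_d3(pdev, device_may_wakeup(&pdev->dev));
		pci_set_power_state(pdev, PCI_D3hot);
	}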


RE: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

2016-11-17 Thread Rafal Ozieblo
-Original Message-
From: Harini Katakam [mailto:harinikatakamli...@gmail.com] 
Sent: 17 listopada 2016 13:22
To: Rafal Ozieblo
Cc: harini.kata...@xilinx.com; nicolas.fe...@atmel.com; netdev@vger.kernel.org; 
linux-ker...@vger.kernel.org
Subject: Re: [RFC PATCH 2/2] net: macb: Add 64 bit addressing support for GEM

> Hi Rafal,
>
> On Thu, Nov 17, 2016 at 5:20 PM, Rafal Ozieblo  wrote:
> > Hello,
> > I think there could be a bug in your patch.
> >
> >> +
> >> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >> + dmacfg |= GEM_BIT(ADDR64);
> >> +#endif
> >
> > You always enable 64-bit addressing (64-bit DMA bus width) when the
> > appropriate architecture config option is enabled.
> > But there are some legacy controllers which do not support that feature.
> > According to the Cadence hardware team:
> > "64 bit addressing was added in July 2013. Earlier version do not have it.
> > This feature was enhanced in release August 2014 to have separate upper 
> > address values for transmit and receive."
> >
> >> /* Bitfields in NSR */
> >> @@ -474,6 +479,10 @@
> >>  struct macb_dma_desc {
> >>   u32 addr;
> >>   u32 ctrl;
> >> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >> + u32 addrh;
> >> + u32 resvd;
> >> +#endif
> >>  };
> >
> > It will not work for legacy hardware. The old descriptor is 2 words
> > wide; the new one is 4 words wide.
> > If you enable CONFIG_ARCH_DMA_ADDR_T_64BIT but the hardware doesn't
> > support it at all, you will miss every second descriptor.
> >
>
> True, this feature is not available in all Cadence IP versions.
> In fact, the IP version in Zynq does not support this, but the one in
> ZynqMP does.
> So we enable the kernel config for 64-bit DMA addressing for this SoC,
> and hence the driver picks it up. My assumption was that if the legacy
> IP does not support 64-bit addressing, then this DMA option wouldn't be
> enabled.

What about, for example, arm64 (which enables CONFIG_ARCH_DMA_ADDR_T_64BIT
by default) with a legacy IP controller? Or systems with multiple IPs,
both with and without 64-bit DMA capability?
It might result in unpredictable behavior (explained below).

> There is a design config register in the Cadence IP which is read to
> check for 64-bit address support - the DMA mask is set based on that.
> But the addition of the two descriptor words cannot be based on this
> runtime check.
> For this reason, all the static changes were placed under the
> compile-time check.

Are you talking about this?

+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+   if (GEM_BFEXT(DBWDEF, gem_readl(bp, DCFG1)) > GEM_DBW32)
+   dma_set_mask(&pdev->dev, DMA_BIT_MASK(44));
+#endif
+

It only sets the maximum address which can be seen on the address bus
(its mask, exactly).

+static inline void macb_set_addr(struct macb_dma_desc *desc, dma_addr_t addr)
+{
+   desc->addr = (u32)addr;
+#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
+   desc->addrh = (u32)(addr >> 32);
+#endif
+}
+

Because addr is 32 bits wide, (u32)(addr >> 32) equals 0. The IP
controller uses 2 words per DMA descriptor, so desc->addr of the second
hardware descriptor will be overwritten by desc->addrh of the first
software descriptor (and likewise desc->ctrl of the second hardware
descriptor will be overwritten by desc->resvd).

> Regards,
> Harini
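
One way to reconcile both concerns above is to size and index the
descriptors at runtime, reusing the same DCFG1 design-config read that
already gates the DMA mask. A hypothetical sketch -- the hw_dma_cap_64b
flag and helper names are invented for illustration; this is not the
final macb patch:

	/* At probe time, cache the capability from DCFG1 once: */
	bp->hw_dma_cap_64b =
		GEM_BFEXT(DBWDEF, gem_readl(bp, DCFG1)) > GEM_DBW32;

	static unsigned int macb_dma_desc_size(struct macb *bp)
	{
		/* Legacy controllers use 2-word descriptors; 64-bit
		 * capable ones use 4 (addr, ctrl, addrh, resvd). */
		return bp->hw_dma_cap_64b ? 4 * sizeof(u32)
					  : 2 * sizeof(u32);
	}

	static struct macb_dma_desc *macb_desc(struct macb *bp,
					       void *ring, unsigned int i)
	{
		/* Index by the runtime size so a legacy 2-word ring is
		 * never walked with the 4-word layout. */
		return ring + i * macb_dma_desc_size(bp);
	}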


[PATCH net v2 5/7] net: ethernet: ti: cpsw: fix secondary-emac probe error path

2016-11-17 Thread Johan Hovold
Make sure to deregister the primary device in case the secondary emac
fails to probe.

kernel BUG at /home/johan/work/omicron/src/linux/net/core/dev.c:7743!
...
[] (free_netdev) from [] (cpsw_probe+0x9cc/0xe50)
[] (cpsw_probe) from [] (platform_drv_probe+0x5c/0xc0)

Fixes: d9ba8f9e6298 ("driver: net: ethernet: cpsw: dual emac interface
implementation")
Signed-off-by: Johan Hovold 
---
 drivers/net/ethernet/ti/cpsw.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index c3b78bc4fe58..11b2daef3158 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -2852,7 +2852,7 @@ static int cpsw_probe(struct platform_device *pdev)
 		ret = cpsw_probe_dual_emac(priv);
 		if (ret) {
 			cpsw_err(priv, probe, "error probe slave 2 emac interface\n");
-			goto clean_ale_ret;
+			goto clean_unregister_netdev_ret;
 		}
 	}

@@ -2860,6 +2860,8 @@ static int cpsw_probe(struct platform_device *pdev)

 	return 0;

+clean_unregister_netdev_ret:
+	unregister_netdev(ndev);
 clean_ale_ret:
 	cpsw_ale_destroy(cpsw->ale);
 clean_dma_ret:
-- 
2.7.3
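
The BUG cited in the commit message is the reg_state assertion in
free_netdev(): a registered netdev must be unregistered before it may
be freed, which is why the new error label has to call
unregister_netdev() before falling through to the old cleanup path.
From 4.9-era net/core/dev.c, roughly (quoted from memory, not verbatim):

	void free_netdev(struct net_device *dev)
	{
		...
		/* Compatibility with error handling in drivers */
		if (dev->reg_state == NETREG_UNINITIALIZED) {
			netdev_freemem(dev);
			return;
		}

		BUG_ON(dev->reg_state != NETREG_UNREGISTERED);
		...
	}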



Re: [Patch net] net: check dead netns for peernet2id_alloc()

2016-11-17 Thread David Miller
From: Cong Wang 
Date: Wed, 16 Nov 2016 10:27:02 -0800

> Andrei reports we still allocate netns ID from idr after we destroy
> it in cleanup_net().
> 
> cleanup_net():
>   ...
>   idr_destroy(&net->netns_ids);
>   ...
>   list_for_each_entry_reverse(ops, &pernet_list, list)
> ops_exit_list(ops, &net_exit_list);
>   -> rollback_registered_many()
> -> rtmsg_ifinfo_build_skb()
>  -> rtnl_fill_ifinfo()
>-> peernet2id_alloc()
> 
> After that point we should not even access net->netns_ids; we
> should check for the death of the current netns as early as we can
> in peernet2id_alloc().
> 
> For net-next we can consider avoiding the rtmsg entirely; that
> would be a good optimization for the netns teardown path.
> 
> Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
> Reported-by: Andrei Vagin 
> Cc: Nicolas Dichtel 
> Signed-off-by: Cong Wang 

Applied and queued up for -stable, thanks.
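
For reference, the early-exit check described above would sit at the
top of peernet2id_alloc(), roughly like this -- a sketch assuming the
4.9-era netns refcounting, not necessarily the exact committed hunk:

	int peernet2id_alloc(struct net *net, struct net *peer)
	{
		int id;

		/* cleanup_net() has already run idr_destroy() on a dead
		 * netns; bail out before touching net->netns_ids at all. */
		if (atomic_read(&net->count) == 0)
			return NETNSA_NSID_NOT_ASSIGNED;
		...
	}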


Re: [PATCH net-next] net/mlx4_en: remove napi_hash_del() call

2016-11-17 Thread David Miller
From: Eric Dumazet 
Date: Wed, 16 Nov 2016 05:49:22 -0800

> From: Eric Dumazet 
> 
> There is no need to call napi_hash_del()+synchronize_rcu() before
> calling netif_napi_del()
> 
> netif_napi_del() does this already.
> 
> Using napi_hash_del() in a driver is useful only when dealing with
> a batch of NAPI structures, so that a single synchronize_rcu() can
> be used. mlx4_en_deactivate_cq() is deactivating a single NAPI.
> 
> Signed-off-by: Eric Dumazet 

Applied, thanks Eric.
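
The batch pattern Eric refers to, where an open-coded napi_hash_del()
does pay off, looks roughly like this -- a sketch with invented names,
not mlx4_en code:

	int i;

	/* Unhash every NAPI instance first... */
	for (i = 0; i < priv->num_rings; i++)
		napi_hash_del(&priv->ring[i]->napi);

	/* ...so that one RCU grace period covers the whole batch,
	 * instead of each netif_napi_del() paying for its own. */
	synchronize_rcu();

	for (i = 0; i < priv->num_rings; i++)
		netif_napi_del(&priv->ring[i]->napi);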


Re: [PATCH net 0/3] net: phy: fix of_node and device leaks

2016-11-17 Thread David Miller
From: Johan Hovold 
Date: Wed, 16 Nov 2016 15:20:35 +0100

> These patches fix a couple of of_node leaks in the fixed-link code and a
> device reference leak in a phy helper.

Series applied, thanks.


  1   2   >