Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread Jakub Kicinski
On Fri, 18 Nov 2016 19:20:58 -0800, Eric Dumazet wrote:
> On Fri, 2016-11-18 at 18:57 -0800, Jakub Kicinski wrote:
> > On Fri, 18 Nov 2016 18:43:55 -0800, John Fastabend wrote:  
> > > On 16-11-18 06:10 PM, Jakub Kicinski wrote:  
>  [...]  
> > > 
> > > Seems like a valid concern to me. How about num_possible_cpus() instead?
> > 
> > That would solve problem 1, but could cpu_possible_mask still be sparse
> > on strange setups?  Let me try to dig into this, I recall someone
> > (Eric?) was fixing similar problems some time ago.  
> 
> nr_cpu_ids is probably what you want ;)

Thank you :)


Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread Eric Dumazet
On Fri, 2016-11-18 at 18:57 -0800, Jakub Kicinski wrote:
> On Fri, 18 Nov 2016 18:43:55 -0800, John Fastabend wrote:
> > On 16-11-18 06:10 PM, Jakub Kicinski wrote:
> > > On Fri, 18 Nov 2016 13:09:53 -0800, Jakub Kicinski wrote:  
> > >> Looks very cool! :)
> > >>
> > >> On Fri, 18 Nov 2016 11:00:41 -0800, John Fastabend wrote:  
> >  [...]  
> > >>
> > >> Is num_online_cpus() correct here?  
> > > 
> > > > Sorry, I don't know the virtio_net code, so I'm probably wrong.  I was
> > > concerned whether the number of cpus can change but also that the cpu
> > > mask may be sparse and therefore offsetting by smp_processor_id()
> > > into the queue table below could bring trouble.
> > >   
> > 
> > Seems like a valid concern to me. How about num_possible_cpus() instead?
> 
> That would solve problem 1, but could cpu_possible_mask still be sparse
> on strange setups?  Let me try to dig into this, I recall someone
> (Eric?) was fixing similar problems some time ago.

nr_cpu_ids is probably what you want ;)
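
To make the concern in this sub-thread concrete, here is a minimal userspace sketch (not driver code) of why sizing a per-CPU queue table by a CPU count can be too small when the mask is sparse. The type cpu_mask_t and the helpers mask_weight(), mask_nr_ids(), ids_fit_table() and xdp_queue_index() are hypothetical stand-ins: counting set bits mirrors what num_possible_cpus() does on cpu_possible_mask, and "highest set bit plus one" mirrors nr_cpu_ids.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a CPU mask: bit i set means CPU id i exists. */
typedef uint64_t cpu_mask_t;

/* Number of CPUs in the mask (population count). */
static unsigned int mask_weight(cpu_mask_t mask)
{
	unsigned int n = 0;

	for (; mask; mask &= mask - 1)	/* clear the lowest set bit */
		n++;
	return n;
}

/* Highest CPU id in the mask plus one, i.e. the smallest safe table size. */
static unsigned int mask_nr_ids(cpu_mask_t mask)
{
	unsigned int n = 0;

	while (mask) {
		n++;
		mask >>= 1;
	}
	return n;
}

/* True if every CPU id in the mask is a valid index into a table_len table. */
static int ids_fit_table(cpu_mask_t mask, unsigned int table_len)
{
	return mask_nr_ids(mask) <= table_len;
}

/* Offset into the XDP queues by CPU id, as the XDP_TX hunk does. */
static unsigned int xdp_queue_index(unsigned int first_xdp_queue,
				    unsigned int cpu,
				    unsigned int xdp_queues)
{
	assert(cpu < xdp_queues);	/* would trip with a too-small table */
	return first_xdp_queue + cpu;
}
```

With CPUs 0, 2 and 5 present, the count is 3 but the largest id is 5, so a table sized by the count cannot be indexed by CPU id, while one sized by "highest id plus one" can.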





Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread Jakub Kicinski
On Fri, 18 Nov 2016 18:43:55 -0800, John Fastabend wrote:
> On 16-11-18 06:10 PM, Jakub Kicinski wrote:
> > On Fri, 18 Nov 2016 13:09:53 -0800, Jakub Kicinski wrote:  
> >> Looks very cool! :)
> >>
> >> On Fri, 18 Nov 2016 11:00:41 -0800, John Fastabend wrote:  
>  [...]  
> >>
> >> Is num_online_cpus() correct here?  
> > 
> > Sorry, I don't know the virtio_net code, so I'm probably wrong.  I was
> > concerned whether the number of cpus can change but also that the cpu
> > mask may be sparse and therefore offsetting by smp_processor_id()
> > into the queue table below could bring trouble.
> >   
> 
> Seems like a valid concern to me. How about num_possible_cpus() instead?

That would solve problem 1, but could cpu_possible_mask still be sparse
on strange setups?  Let me try to dig into this, I recall someone
(Eric?) was fixing similar problems some time ago.

> > @@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
> > switch (act) {
> > case XDP_PASS:
> > return XDP_PASS;
> > +   case XDP_TX:
> > +   qp = vi->curr_queue_pairs -
> > +   vi->xdp_queue_pairs +
> > +   smp_processor_id();
> > +   xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
> > +   virtnet_xdp_xmit(vi, qp, );
> > +   return XDP_TX;
> > default:
> > bpf_warn_invalid_xdp_action(act);
> > -   case XDP_TX:
> > case XDP_ABORTED:
> > case XDP_DROP:
> > return XDP_DROP;
> >   


Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread John Fastabend
On 16-11-18 06:10 PM, Jakub Kicinski wrote:
> On Fri, 18 Nov 2016 13:09:53 -0800, Jakub Kicinski wrote:
>> Looks very cool! :)
>>
>> On Fri, 18 Nov 2016 11:00:41 -0800, John Fastabend wrote:
>>> @@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, 
>>> struct bpf_prog *prog)
>>> return -EINVAL;
>>> }
>>>  
>>> +   curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
>>> +   if (prog)
>>> +   xdp_qp = num_online_cpus();  
>>
>> Is num_online_cpus() correct here?
> 
> Sorry, I don't know the virtio_net code, so I'm probably wrong.  I was
> concerned whether the number of cpus can change but also that the cpu
> mask may be sparse and therefore offsetting by smp_processor_id()
> into the queue table below could bring trouble.
> 

Seems like a valid concern to me. How about num_possible_cpus() instead?

> @@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
>   switch (act) {
>   case XDP_PASS:
>   return XDP_PASS;
> + case XDP_TX:
> + qp = vi->curr_queue_pairs -
> + vi->xdp_queue_pairs +
> + smp_processor_id();
> + xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
> + virtnet_xdp_xmit(vi, qp, &xdp);
> + return XDP_TX;
>   default:
>   bpf_warn_invalid_xdp_action(act);
> - case XDP_TX:
>   case XDP_ABORTED:
>   case XDP_DROP:
>   return XDP_DROP;
> 



Re: [PATCH 3/5] virtio_net: Add XDP support

2016-11-18 Thread John Fastabend
On 16-11-18 03:21 PM, Eric Dumazet wrote:
> On Fri, 2016-11-18 at 11:00 -0800, John Fastabend wrote:
> 
> 
>>  static void free_receive_bufs(struct virtnet_info *vi)
>>  {
>> +struct bpf_prog *old_prog;
>>  int i;
>>  
>>  for (i = 0; i < vi->max_queue_pairs; i++) {
>>  while (vi->rq[i].pages)
>>  __free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
>> +
>> +old_prog = rcu_dereference(vi->rq[i].xdp_prog);
> 
> Seems wrong to me.
> 

Yep, it is wrong. It should be rtnl_dereference() here, and the
rcu_dereference() calls earlier in the patch need to be rcu_dereference_bh().

> Are you sure lockdep (with CONFIG_PROVE_RCU=y) was happy with this ?

Oops, you are right, it was missing.

> 
>> +RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
>> +if (old_prog)
>> +bpf_prog_put(old_prog);

bpf_prog_put() waits a grace period if the ref count is zero. That said, on
driver unload we need to protect the bpf_prog_put() with rtnl_lock() as
well.

I'll send out a v2 in a bit.

Thanks a lot.

>>  }
>>  }
>>  
>>
> 
> 



[PATCH net-next v2 1/4] geneve: Unify LWT and netdev handling.

2016-11-18 Thread Pravin B Shelar
The current geneve implementation has two separate cases to handle:
1. netdev xmit
2. LWT xmit

In the netdev case, geneve configuration is stored in various
struct geneve_dev members, for example geneve_addr, ttl, tos,
label, flags, dst_cache, etc. For LWT, the configuration is passed
to the device in ip_tunnel_info.

The following patch uses the ip_tunnel_info struct to store almost all
of the configuration of a geneve netdevice. This allows us to unify
most of the geneve driver code around the ip_tunnel_info struct.
This dramatically simplifies the geneve code, since it does not
need to handle two different configuration cases. It also removes
duplicate code: a single code path can handle either type
of geneve device.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 612 ++-
 1 file changed, 263 insertions(+), 349 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 85a423a..b5e65cd 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -45,41 +45,22 @@ struct geneve_net {
 
 static int geneve_net_id;
 
-union geneve_addr {
-   struct sockaddr_in sin;
-   struct sockaddr_in6 sin6;
-   struct sockaddr sa;
-};
-
-static union geneve_addr geneve_remote_unspec = { .sa.sa_family = AF_UNSPEC, };
-
 /* Pseudo network device */
 struct geneve_dev {
struct hlist_node  hlist;   /* vni hash table */
struct net *net;/* netns for packet i/o */
struct net_device  *dev;/* netdev for geneve tunnel */
+   struct ip_tunnel_info info;
struct geneve_sock __rcu *sock4;/* IPv4 socket used for geneve 
tunnel */
 #if IS_ENABLED(CONFIG_IPV6)
struct geneve_sock __rcu *sock6;/* IPv6 socket used for geneve 
tunnel */
 #endif
-   u8 vni[3];  /* virtual network ID for tunnel */
-   u8 ttl; /* TTL override */
-   u8 tos; /* TOS override */
-   union geneve_addr  remote;  /* IP address for link partner */
struct list_head   next;/* geneve's per namespace list */
-   __be32 label;   /* IPv6 flowlabel override */
-   __be16 dst_port;
-   bool   collect_md;
struct gro_cells   gro_cells;
-   u32 flags;
-   struct dst_cache   dst_cache;
+   bool   collect_md;
+   bool   use_udp6_rx_checksums;
 };
 
-/* Geneve device flags */
-#define GENEVE_F_UDP_ZERO_CSUM_TX  BIT(0)
-#define GENEVE_F_UDP_ZERO_CSUM6_TX BIT(1)
-#define GENEVE_F_UDP_ZERO_CSUM6_RX BIT(2)
-
 struct geneve_sock {
bool collect_md;
struct list_head list;
@@ -87,7 +68,6 @@ struct geneve_sock {
struct rcu_head rcu;
int refcnt;
struct hlist_head   vni_list[VNI_HASH_SIZE];
-   u32 flags;
 };
 
 static inline __u32 geneve_net_vni_hash(u8 vni[3])
@@ -109,6 +89,20 @@ static __be64 vni_to_tunnel_id(const __u8 *vni)
 #endif
 }
 
+/* Convert 64 bit tunnel ID to 24 bit VNI. */
+static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+   vni[0] = (__force __u8)(tun_id >> 16);
+   vni[1] = (__force __u8)(tun_id >> 8);
+   vni[2] = (__force __u8)tun_id;
+#else
+   vni[0] = (__force __u8)((__force u64)tun_id >> 40);
+   vni[1] = (__force __u8)((__force u64)tun_id >> 48);
+   vni[2] = (__force __u8)((__force u64)tun_id >> 56);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
return gs->sock->sk->sk_family;
@@ -117,6 +111,7 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock 
*gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
__be32 addr, u8 vni[])
 {
+   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -125,8 +120,8 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock 
*gs,
hash = geneve_net_vni_hash(vni);
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-   if (!memcmp(vni, geneve->vni, sizeof(geneve->vni)) &&
-   addr == geneve->remote.sin.sin_addr.s_addr)
+   if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+   addr == geneve->info.key.u.ipv4.dst)
return geneve;
}
return NULL;
@@ -136,6 +131,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock 
*gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 struct in6_addr addr6, u8 vni[])
 {
+   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -144,8 +140,8 @@ static struct 

[PATCH net-next v2 0/4] geneve: Use LWT more effectively.

2016-11-18 Thread Pravin B Shelar
The following patch series makes use of the geneve LWT code path for
geneve netdev type of device.
This allows us to simplify the geneve module.

v1-v2:
Fix warning reported by kbuild test robot.

Pravin B Shelar (4):
  geneve: Unify LWT and netdev handling.
  geneve: Merge ipv4 and ipv6 geneve_build_skb()
  geneve: Remove redundant socket checks.
  geneve: Optimize geneve device lookup.

 drivers/net/geneve.c | 679 +--
 1 file changed, 274 insertions(+), 405 deletions(-)

-- 
1.8.3.1



[PATCH net-next v2 2/4] geneve: Merge ipv4 and ipv6 geneve_build_skb()

2016-11-18 Thread Pravin B Shelar
There is minimal difference in building the Geneve header
between ipv4 and ipv6 geneve tunnels. The following patch
refactors the code to unify it.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 100 ++-
 1 file changed, 26 insertions(+), 74 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index b5e65cd..d1759aa 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -630,67 +630,34 @@ static int geneve_stop(struct net_device *dev)
 }
 
 static void geneve_build_header(struct genevehdr *geneveh,
-   __be16 tun_flags, u8 vni[3],
-   u8 options_len, u8 *options)
+   const struct ip_tunnel_info *info)
 {
geneveh->ver = GENEVE_VER;
-   geneveh->opt_len = options_len / 4;
-   geneveh->oam = !!(tun_flags & TUNNEL_OAM);
-   geneveh->critical = !!(tun_flags & TUNNEL_CRIT_OPT);
+   geneveh->opt_len = info->options_len / 4;
+   geneveh->oam = !!(info->key.tun_flags & TUNNEL_OAM);
+   geneveh->critical = !!(info->key.tun_flags & TUNNEL_CRIT_OPT);
geneveh->rsvd1 = 0;
-   memcpy(geneveh->vni, vni, 3);
+   tunnel_id_to_vni(info->key.tun_id, geneveh->vni);
geneveh->proto_type = htons(ETH_P_TEB);
geneveh->rsvd2 = 0;
 
-   memcpy(geneveh->options, options, options_len);
+   ip_tunnel_info_opts_get(geneveh->options, info);
 }
 
-static int geneve_build_skb(struct rtable *rt, struct sk_buff *skb,
-   __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-   bool xnet)
-{
-   bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
-   struct genevehdr *gnvh;
-   int min_headroom;
-   int err;
-
-   skb_scrub_packet(skb, xnet);
-
-   min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
-   + GENEVE_BASE_HLEN + opt_len + sizeof(struct iphdr);
-   err = skb_cow_head(skb, min_headroom);
-   if (unlikely(err))
-   goto free_rt;
-
-   err = udp_tunnel_handle_offloads(skb, udp_sum);
-   if (err)
-   goto free_rt;
-
-   gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len);
-   geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
-
-   skb_set_inner_protocol(skb, htons(ETH_P_TEB));
-   return 0;
-
-free_rt:
-   ip_rt_put(rt);
-   return err;
-}
-
-#if IS_ENABLED(CONFIG_IPV6)
-static int geneve6_build_skb(struct dst_entry *dst, struct sk_buff *skb,
-__be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
-bool xnet)
+static int geneve_build_skb(struct dst_entry *dst, struct sk_buff *skb,
+   const struct ip_tunnel_info *info,
+   bool xnet, int ip_hdr_len)
 {
-   bool udp_sum = !!(tun_flags & TUNNEL_CSUM);
+   bool udp_sum = !!(info->key.tun_flags & TUNNEL_CSUM);
struct genevehdr *gnvh;
int min_headroom;
int err;
 
+   skb_reset_mac_header(skb);
skb_scrub_packet(skb, xnet);
 
-   min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len
-   + GENEVE_BASE_HLEN + opt_len + sizeof(struct ipv6hdr);
+   min_headroom = LL_RESERVED_SPACE(dst->dev) + dst->header_len +
+  GENEVE_BASE_HLEN + info->options_len + ip_hdr_len;
err = skb_cow_head(skb, min_headroom);
if (unlikely(err))
goto free_dst;
@@ -699,9 +666,9 @@ static int geneve6_build_skb(struct dst_entry *dst, struct 
sk_buff *skb,
if (err)
goto free_dst;
 
-   gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) + opt_len);
-   geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
-
+   gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh) +
+  info->options_len);
+   geneve_build_header(gnvh, info);
skb_set_inner_protocol(skb, htons(ETH_P_TEB));
return 0;
 
@@ -709,12 +676,11 @@ static int geneve6_build_skb(struct dst_entry *dst, 
struct sk_buff *skb,
dst_release(dst);
return err;
 }
-#endif
 
 static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
   struct net_device *dev,
   struct flowi4 *fl4,
-  struct ip_tunnel_info *info)
+  const struct ip_tunnel_info *info)
 {
bool use_cache = ip_tunnel_dst_cache_usable(skb, info);
struct geneve_dev *geneve = netdev_priv(dev);
@@ -738,7 +704,7 @@ static struct rtable *geneve_get_v4_rt(struct sk_buff *skb,
}
fl4->flowi4_tos = RT_TOS(tos);
 
-   dst_cache = &info->dst_cache;
+   dst_cache = (struct dst_cache *)&info->dst_cache;
if (use_cache) {
rt = dst_cache_get_ip4(dst_cache, 

Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread Jakub Kicinski
On Fri, 18 Nov 2016 13:09:53 -0800, Jakub Kicinski wrote:
> Looks very cool! :)
> 
> On Fri, 18 Nov 2016 11:00:41 -0800, John Fastabend wrote:
> > @@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, 
> > struct bpf_prog *prog)
> > return -EINVAL;
> > }
> >  
> > +   curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
> > +   if (prog)
> > +   xdp_qp = num_online_cpus();  
> 
> Is num_online_cpus() correct here?

Sorry, I don't know the virto_net code, so I'm probably wrong.  I was
concerned whether the number of cpus can change but also that the cpu
mask may be sparse and therefore offsetting by smp_processor_id()
into the queue table below could bring trouble.

@@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
switch (act) {
case XDP_PASS:
return XDP_PASS;
+   case XDP_TX:
+   qp = vi->curr_queue_pairs -
+   vi->xdp_queue_pairs +
+   smp_processor_id();
+   xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
+   virtnet_xdp_xmit(vi, qp, &xdp);
+   return XDP_TX;
default:
bpf_warn_invalid_xdp_action(act);
-   case XDP_TX:
case XDP_ABORTED:
case XDP_DROP:
return XDP_DROP;


[PATCH net-next v2 3/4] geneve: Remove redundant socket checks.

2016-11-18 Thread Pravin B Shelar
Geneve already checks for the device socket in the route
lookup function, so there is no need to check it in the xmit
function.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 10 ++
 1 file changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index d1759aa..f2912ca 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -785,14 +785,11 @@ static int geneve_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
struct geneve_sock *gs4 = rcu_dereference(geneve->sock4);
const struct ip_tunnel_key *key = &info->key;
struct rtable *rt;
-   int err = -EINVAL;
struct flowi4 fl4;
__u8 tos, ttl;
__be16 sport;
__be16 df;
-
-   if (!gs4)
-   return err;
+   int err;
 
rt = geneve_get_v4_rt(skb, dev, &fl4, info);
if (IS_ERR(rt))
@@ -828,13 +825,10 @@ static int geneve6_xmit_skb(struct sk_buff *skb, struct 
net_device *dev,
struct geneve_sock *gs6 = rcu_dereference(geneve->sock6);
const struct ip_tunnel_key *key = &info->key;
struct dst_entry *dst = NULL;
-   int err = -EINVAL;
struct flowi6 fl6;
__u8 prio, ttl;
__be16 sport;
-
-   if (!gs6)
-   return err;
+   int err;
 
dst = geneve_get_v6_dst(skb, dev, &fl6, info);
if (IS_ERR(dst))
-- 
1.8.3.1



[PATCH net-next v2 4/4] geneve: Optimize geneve device lookup.

2016-11-18 Thread Pravin B Shelar
Rather than comparing the 64-bit tunnel ID, compare the tunnel VNI,
which is a 24-bit ID. This also saves the conversion from VNI
to tunnel ID on each received tunnel packet.

Signed-off-by: Pravin B Shelar 
---
 drivers/net/geneve.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index f2912ca..930b1b0 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -103,6 +103,17 @@ static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
 #endif
 }
 
+static bool eq_tun_id_and_vni(u8 *tun_id, u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+   return (vni[0] == tun_id[2]) &&
+  (vni[1] == tun_id[1]) &&
+  (vni[2] == tun_id[0]);
+#else
+   return !memcmp(vni, &tun_id[5], 3);
+#endif
+}
+
 static sa_family_t geneve_get_sk_family(struct geneve_sock *gs)
 {
return gs->sock->sk->sk_family;
@@ -111,7 +122,6 @@ static sa_family_t geneve_get_sk_family(struct geneve_sock 
*gs)
 static struct geneve_dev *geneve_lookup(struct geneve_sock *gs,
__be32 addr, u8 vni[])
 {
-   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -120,7 +130,7 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock 
*gs,
hash = geneve_net_vni_hash(vni);
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-   if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+   if (eq_tun_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
addr == geneve->info.key.u.ipv4.dst)
return geneve;
}
@@ -131,7 +141,6 @@ static struct geneve_dev *geneve_lookup(struct geneve_sock 
*gs,
 static struct geneve_dev *geneve6_lookup(struct geneve_sock *gs,
 struct in6_addr addr6, u8 vni[])
 {
-   __be64 id = vni_to_tunnel_id(vni);
struct hlist_head *vni_list_head;
struct geneve_dev *geneve;
__u32 hash;
@@ -140,7 +149,7 @@ static struct geneve_dev *geneve6_lookup(struct geneve_sock 
*gs,
hash = geneve_net_vni_hash(vni);
vni_list_head = &gs->vni_list[hash];
hlist_for_each_entry_rcu(geneve, vni_list_head, hlist) {
-   if (!memcmp(&id, &geneve->info.key.tun_id, sizeof(id)) &&
+   if (eq_tun_id_and_vni((u8 *)&geneve->info.key.tun_id, vni) &&
ipv6_addr_equal(&addr6, &geneve->info.key.u.ipv6.dst))
return geneve;
}
-- 
1.8.3.1
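
The idea in this commit message can be illustrated outside the kernel. The sketch below treats the tunnel ID as eight bytes in network byte order, with the 24-bit VNI in the last three bytes, which is the layout the memcmp() branch of the patch relies on. The helper names vni_to_tun_id_bytes() and tun_id_matches_vni() are hypothetical and exist only for this example; it is a simplified model, not the driver's implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build an 8-byte, network-byte-order tunnel ID from a 3-byte VNI.
 * The VNI occupies the low-order 24 bits, i.e. bytes 5..7 in memory.
 */
static void vni_to_tun_id_bytes(const uint8_t vni[3], uint8_t tun_id[8])
{
	assert(vni && tun_id);
	memset(tun_id, 0, 8);
	memcpy(&tun_id[5], vni, 3);
}

/*
 * Compare a received VNI against a stored tunnel ID without first
 * widening the VNI to 64 bits: only the three VNI bytes are compared.
 */
static int tun_id_matches_vni(const uint8_t tun_id[8], const uint8_t vni[3])
{
	assert(vni && tun_id);
	return memcmp(&tun_id[5], vni, 3) == 0;
}
```

The per-packet saving is that the receive path compares three bytes in place instead of converting the VNI and then comparing eight.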



[PATCH net-next] udp: avoid one cache line miss in recvmsg()

2016-11-18 Thread Eric Dumazet
From: Eric Dumazet 

UDP_SKB_CB(skb)->partial_cov is located at offset 66 in skb,
requiring a cold cache line to be read into the CPU cache.

We can avoid this cache line miss for UDP sockets,
as partial_cov has a meaning only for UDPLite.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/udp.c |3 ++-
 net/ipv6/udp.c |3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index e1fc0116e8d59d8185670c6e55d1219bde55610d..b949770fdc08398a10f3974505a50b2b4f4b2cf3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1389,7 +1389,8 @@ int udp_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len, int noblock,
 * coverage checksum (UDP-Lite), do it before the copy.
 */
 
-   if (copied < ulen || UDP_SKB_CB(skb)->partial_cov || peeking) {
+   if (copied < ulen || peeking ||
+   (is_udplite && UDP_SKB_CB(skb)->partial_cov)) {
checksum_valid = !udp_lib_checksum_complete(skb);
if (!checksum_valid)
goto csum_copy_err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 4f99417d9b401f2a65c7828e7d6b86d1d6161794..8fd4d89380b86c8630f7fd27ce4e9958497a2b89 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -363,7 +363,8 @@ int udpv6_recvmsg(struct sock *sk, struct msghdr *msg, 
size_t len,
 * coverage checksum (UDP-Lite), do it before the copy.
 */
 
-   if (copied < ulen || UDP_SKB_CB(skb)->partial_cov || peeking) {
+   if (copied < ulen || peeking ||
+   (is_udplite && UDP_SKB_CB(skb)->partial_cov)) {
checksum_valid = !udp_lib_checksum_complete(skb);
if (!checksum_valid)
goto csum_copy_err;
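
The reordering in this patch relies on C's short-circuit evaluation: the field that lives in the cold cache line is only read once is_udplite is known to be true. A minimal userspace sketch of that effect follows. The struct pkt, its cold_reads counter, and the helpers read_partial_cov() and needs_full_csum() are hypothetical and only model the condition; they are not the kernel's data structures.

```c
#include <assert.h>

/* Hypothetical packet model; cold_reads counts accesses to partial_cov. */
struct pkt {
	int partial_cov;
	int cold_reads;
};

/* Stands in for touching the cold cache line that holds partial_cov. */
static int read_partial_cov(struct pkt *p)
{
	assert(p);
	p->cold_reads++;
	return p->partial_cov;
}

/*
 * Same shape as the patched condition: the cheap tests come first and
 * partial_cov is consulted only for UDP-Lite sockets.
 */
static int needs_full_csum(int copied, int ulen, int peeking,
			   int is_udplite, struct pkt *p)
{
	return copied < ulen || peeking ||
	       (is_udplite && read_partial_cov(p));
}
```

For a plain UDP socket that copies the whole datagram without peeking, the condition evaluates to false and the counter stays at zero, which is the cache line miss being avoided.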




Re: [PATCH 3/5] virtio_net: Add XDP support

2016-11-18 Thread John Fastabend
On 16-11-18 03:23 PM, Eric Dumazet wrote:
> On Fri, 2016-11-18 at 11:00 -0800, John Fastabend wrote:
>> From: Shrijeet Mukherjee 
> 
> 
>>  #include 
>> @@ -81,6 +82,8 @@ struct receive_queue {
>>  
>>  struct napi_struct napi;
>>  
>> +struct bpf_prog *xdp_prog;
> 
> Please add proper sparse annotation, as in 
> 
>   struct bpf_prog __rcu *xdp_prog;
> 
> And run sparse ;)
> 
> CONFIG_SPARSE_RCU_POINTER=y
> 
> make C=2 drivers/net/virtio_net.o
> 
> 
> 
> 

Yep will do thanks! And I will fix the other comment as well.


[PATCH net-next v3 3/4] bpf, mlx5: drop priv->xdp_prog reference on netdev cleanup

2016-11-18 Thread Daniel Borkmann
mlx5e_xdp_set() is currently the only place where we drop reference on the
prog sitting in priv->xdp_prog when it's exchanged by a new one. We also
need to make sure that we eventually release that reference, for example,
in case the netdev is dismantled, otherwise we leak the program.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 491cff9..6957608 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3705,6 +3705,9 @@ static void mlx5e_nic_cleanup(struct mlx5e_priv *priv)
 
if (MLX5_CAP_GEN(mdev, vport_group_manager))
mlx5_eswitch_unregister_vport_rep(esw, 0);
+
+   if (priv->xdp_prog)
+   bpf_prog_put(priv->xdp_prog);
 }
 
 static int mlx5e_init_nic_rx(struct mlx5e_priv *priv)
-- 
1.9.3



[PATCH net-next v3 4/4] bpf: add __must_check attributes to refcount manipulating helpers

2016-11-18 Thread Daniel Borkmann
Helpers like bpf_prog_add(), bpf_prog_inc(), bpf_map_inc() can fail
with an error, so make sure the caller properly checks their return
value and not just ignores it, which could worst-case lead to use
after free.

Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 include/linux/bpf.h | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 01c1487..69d0a7f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -233,14 +233,14 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void 
*meta, u64 meta_size,
 
 struct bpf_prog *bpf_prog_get(u32 ufd);
 struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type);
-struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i);
+struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog *prog, int i);
 void bpf_prog_sub(struct bpf_prog *prog, int i);
-struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog);
+struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog *prog);
 void bpf_prog_put(struct bpf_prog *prog);
 
 struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
-struct bpf_map *bpf_map_inc(struct bpf_map *map, bool uref);
+struct bpf_map * __must_check bpf_map_inc(struct bpf_map *map, bool uref);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
 int bpf_map_precharge_memlock(u32 pages);
@@ -299,7 +299,8 @@ static inline struct bpf_prog *bpf_prog_get_type(u32 ufd,
 {
return ERR_PTR(-EOPNOTSUPP);
 }
-static inline struct bpf_prog *bpf_prog_add(struct bpf_prog *prog, int i)
+static inline struct bpf_prog * __must_check bpf_prog_add(struct bpf_prog 
*prog,
+ int i)
 {
return ERR_PTR(-EOPNOTSUPP);
 }
@@ -311,7 +312,8 @@ static inline void bpf_prog_sub(struct bpf_prog *prog, int 
i)
 static inline void bpf_prog_put(struct bpf_prog *prog)
 {
 }
-static inline struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
+
+static inline struct bpf_prog * __must_check bpf_prog_inc(struct bpf_prog 
*prog)
 {
return ERR_PTR(-EOPNOTSUPP);
 }
-- 
1.9.3
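
As a rough userspace illustration of what this patch enforces: once a refcount helper can fail, a caller that ignores the result may go on to use an object it holds no reference to. The sketch below uses the GCC/Clang attribute warn_unused_result, which is what the kernel's __must_check is built on. Everything else (struct obj, REF_MAX, obj_get(), obj_put(), and returning -EBUSY as a plain int rather than an error pointer) is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define REF_MAX 4	/* hypothetical, deliberately small cap for the demo */

struct obj {
	int refcnt;
};

/*
 * Take n references. Fails with -EBUSY instead of overflowing the cap.
 * The attribute makes the compiler warn when a caller drops the result.
 */
__attribute__((warn_unused_result))
static int obj_get(struct obj *o, int n)
{
	assert(o != NULL && n > 0);
	if (o->refcnt + n > REF_MAX)
		return -EBUSY;	/* count is left untouched on failure */
	o->refcnt += n;
	return 0;
}

/* Drop one reference. */
static void obj_put(struct obj *o)
{
	assert(o != NULL && o->refcnt > 0);
	o->refcnt--;
}
```

A bare call such as obj_get(&o, 1); with the result discarded would draw a compiler warning, which is the behaviour the patch wants for bpf_prog_add(), bpf_prog_inc() and bpf_map_inc().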



[PATCH net-next v3 2/4] bpf, mlx5: fix various refcount issues in mlx5e_xdp_set

2016-11-18 Thread Daniel Borkmann
There are multiple issues in mlx5e_xdp_set():

1) The batched bpf_prog_add() is currently not checked for errors. When
   doing so, it should be done at an earlier point in time to make sure
   that we cannot fail anymore at the time we want to set the program for
   each channel. The batched refs short-cut can only be performed when we
   don't need to perform a reset for changing the rq type and the device
   was in opened state. In case the device was not in opened state, then
   the next mlx5e_open_locked() will acquire the refs from the control prog
   via mlx5e_create_rq(), same when we need to perform a reset.

2) When swapping the priv->xdp_prog, then no extra reference count must be
   taken since we got that from call path via dev_change_xdp_fd() already.
   Otherwise, we'd never be able to release the program. Also, bpf_prog_add()
   without checking the return code could fail.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 54bae79..491cff9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3144,11 +3144,21 @@ static int mlx5e_xdp_set(struct net_device *netdev, 
struct bpf_prog *prog)
 
if (was_opened && reset)
mlx5e_close_locked(netdev);
+   if (was_opened && !reset) {
+   /* num_channels is invariant here, so we can take the
+* batched reference right upfront.
+*/
+   prog = bpf_prog_add(prog, priv->params.num_channels);
+   if (IS_ERR(prog)) {
+   err = PTR_ERR(prog);
+   goto unlock;
+   }
+   }
 
-   /* exchange programs */
+   /* exchange programs, extra prog reference we got from caller
+* as long as we don't fail from this point onwards.
+*/
old_prog = xchg(&priv->xdp_prog, prog);
-   if (prog)
-   bpf_prog_add(prog, 1);
if (old_prog)
bpf_prog_put(old_prog);
 
@@ -3164,7 +3174,6 @@ static int mlx5e_xdp_set(struct net_device *netdev, 
struct bpf_prog *prog)
/* exchanging programs w/o reset, we update ref counts on behalf
 * of the channels RQs here.
 */
-   bpf_prog_add(prog, priv->params.num_channels);
for (i = 0; i < priv->params.num_channels; i++) {
struct mlx5e_channel *c = priv->channel[i];
 
-- 
1.9.3
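
The ordering argument in point 1 of this commit message (take the batched references before anything is exchanged, so that nothing can fail afterwards) can be sketched in plain C. This is a simplified userspace model, not the mlx5 code: struct prog, struct dev, REF_MAX, MAX_CHANNELS, prog_add(), prog_put() and dev_set_prog() are all hypothetical names, and the caller is assumed to hand in one reference that the device takes over, as described in point 2.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define REF_MAX		8	/* hypothetical refcount cap */
#define MAX_CHANNELS	4

struct prog {
	int refcnt;
};

struct dev {
	struct prog *prog;			/* control reference */
	struct prog *chan[MAX_CHANNELS];	/* one reference per channel */
	int num_channels;
};

/* Take n references, or fail with -EBUSY without changing the count. */
static int prog_add(struct prog *p, int n)
{
	if (p->refcnt + n > REF_MAX)
		return -EBUSY;
	p->refcnt += n;
	return 0;
}

static void prog_put(struct prog *p)
{
	assert(p->refcnt > 0);
	p->refcnt--;
}

/* Install new_prog; the caller's single reference becomes dev->prog. */
static int dev_set_prog(struct dev *d, struct prog *new_prog)
{
	struct prog *old;
	int i, err;

	/* Step 1: the only step that can fail runs before any state changes. */
	if (new_prog) {
		err = prog_add(new_prog, d->num_channels);
		if (err)
			return err;
	}

	/* Step 2: exchange the control program; no extra reference is taken. */
	old = d->prog;
	d->prog = new_prog;
	if (old)
		prog_put(old);

	/* Step 3: hand the batched references to the channels. */
	for (i = 0; i < d->num_channels; i++) {
		old = d->chan[i];
		d->chan[i] = new_prog;
		if (old)
			prog_put(old);
	}
	return 0;
}
```

If the batched acquisition fails, the device still points at the old program and all counts are unchanged, so the caller can simply drop its own reference.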



[PATCH net-next v3 0/4] Couple of BPF refcount fixes for mlx5

2016-11-18 Thread Daniel Borkmann
Various mlx5 bugs on eBPF refcount handling found during review.
Last patch in series adds a __must_check to BPF helpers to make
sure we won't run into it again w/o compiler complaining first.

v2 -> v3:

 - Just reworked patch 2/4 so we don't need bpf_prog_sub().
 - Rebased, rest as is.

v1 -> v2:

 - After discussion with Alexei, we agreed upon rebasing the
   patches against net-next.
 - Since net-next, I've also added the __must_check to enforce
   future users to check for errors.
 - Fixed up commit message #2.
 - Simplify assignment from patch #1 based on Saeed's feedback
   on previous set.

Thanks a lot!

Daniel Borkmann (4):
  bpf, mlx5: fix mlx5e_create_rq taking reference on prog
  bpf, mlx5: fix various refcount issues in mlx5e_xdp_set
  bpf, mlx5: drop priv->xdp_prog reference on netdev cleanup
  bpf: add __must_check attributes to refcount manipulating helpers

 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 33 +--
 include/linux/bpf.h   | 12 +
 kernel/bpf/syscall.c  |  1 +
 3 files changed, 33 insertions(+), 13 deletions(-)

-- 
1.9.3



[PATCH net-next v3 1/4] bpf, mlx5: fix mlx5e_create_rq taking reference on prog

2016-11-18 Thread Daniel Borkmann
In mlx5e_create_rq(), when creating a new queue, we call bpf_prog_add() but
without checking the return value. bpf_prog_add() can fail since 92117d8443bc
("bpf: fix refcnt overflow"), so we really must check it. Take the reference
right when we assign it to the rq from priv->xdp_prog, and just drop the
reference on error path. Destruction in mlx5e_destroy_rq() looks good, though.

Fixes: 86994156c736 ("net/mlx5e: XDP fast RX drop bpf programs support")
Signed-off-by: Daniel Borkmann 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 13 +
 kernel/bpf/syscall.c  |  1 +
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index bd0732d..54bae79 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -513,7 +513,13 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->channel = c;
rq->ix  = c->ix;
rq->priv= c->priv;
-   rq->xdp_prog = priv->xdp_prog;
+
+   rq->xdp_prog = priv->xdp_prog ? bpf_prog_inc(priv->xdp_prog) : NULL;
+   if (IS_ERR(rq->xdp_prog)) {
+   err = PTR_ERR(rq->xdp_prog);
+   rq->xdp_prog = NULL;
+   goto err_rq_wq_destroy;
+   }
 
rq->buff.map_dir = DMA_FROM_DEVICE;
if (rq->xdp_prog)
@@ -590,12 +596,11 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->page_cache.head = 0;
rq->page_cache.tail = 0;
 
-   if (rq->xdp_prog)
-   bpf_prog_add(rq->xdp_prog, 1);
-
return 0;
 
 err_rq_wq_destroy:
+   if (rq->xdp_prog)
+   bpf_prog_put(rq->xdp_prog);
mlx5_wq_destroy(&rq->wq_ctrl);
 
return err;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ce1b7de..eb15498 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -696,6 +696,7 @@ struct bpf_prog *bpf_prog_inc(struct bpf_prog *prog)
 {
return bpf_prog_add(prog, 1);
 }
+EXPORT_SYMBOL_GPL(bpf_prog_inc);
 
 static struct bpf_prog *__bpf_prog_get(u32 ufd, enum bpf_prog_type *type)
 {
-- 
1.9.3



Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-18 Thread Eric Dumazet
On Fri, 2016-11-18 at 16:38 -0800, Jarno Rajahalme wrote:

> This fixes the problem for me, so for whatever it’s worth:
> 
> Tested-by: Jarno Rajahalme 
> 

Thanks for testing !

https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=e88a2766143a27bfe6704b4493b214de4094cf29





Re: Long delays creating a netns after deleting one (possibly RCU related)

2016-11-18 Thread Jarno Rajahalme

> On Nov 14, 2016, at 3:09 PM, Eric Dumazet  wrote:
> 
> On Mon, 2016-11-14 at 14:46 -0800, Eric Dumazet wrote:
>> On Mon, 2016-11-14 at 16:12 -0600, Eric W. Biederman wrote:
>> 
>>> synchronize_rcu_expedited is not enough if you have multiple network
>>> devices in play.
>>> 
>>> Looking at the code it comes down to this commit, and it appears there
>>> is a promise to add rcu grace period combining by Eric Dumazet.
>>> 
>>> Eric since people are hitting noticeable stalls because of the rcu grace
>>> period taking a long time do you think you could look at this code path
>>> a bit more?
>>> 
>>> commit 93d05d4a320cb16712bb3d57a9658f395d8cecb9
>>> Author: Eric Dumazet 
>>> Date:   Wed Nov 18 06:31:03 2015 -0800
>> 
>> Absolutely, I will take a look asap.
> 
> The worst offender should be fixed by the following patch.
> 
> busy poll needs to poll the physical device, not a virtual one...
> 
> diff --git a/include/net/gro_cells.h b/include/net/gro_cells.h
> index d15214d673b2e8e08fd6437b572278fb1359f10d..2a1abbf8da74368cd01adc40cef6c0644e059ef2 100644
> --- a/include/net/gro_cells.h
> +++ b/include/net/gro_cells.h
> @@ -68,6 +68,9 @@ static inline int gro_cells_init(struct gro_cells *gcells, 
> struct net_device *de
>   struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
> 
>   __skb_queue_head_init(&cell->napi_skbs);
> +
> + set_bit(NAPI_STATE_NO_BUSY_POLL, &cell->napi.state);
> +
>   netif_napi_add(dev, &cell->napi, gro_cell_poll, 64);
>   napi_enable(&cell->napi);
>   }
> 
> 
> 
> 
> 

This fixes the problem for me, so for whatever it’s worth:

Tested-by: Jarno Rajahalme 



[PATCH net-next v2 3/5] virtio_net: Do not clear memory for struct virtio_net_hdr twice.

2016-11-18 Thread Jarno Rajahalme
virtio_net_hdr_from_skb() clears the memory for the header, so there
is no point for the callers to do the same.

Signed-off-by: Jarno Rajahalme 
---
 drivers/net/tun.c  | 3 +--
 include/linux/virtio_net.h | 2 +-
 net/packet/af_packet.c | 2 --
 3 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 3b8d8cc..64e694c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1360,8 +1360,7 @@ static ssize_t tun_put_user(struct tun_struct *tun,
}
 
if (vnet_hdr_sz) {
-   struct virtio_net_hdr gso = { 0 }; /* no info leak */
-   int ret;
+   struct virtio_net_hdr gso;
 
if (iov_iter_count(iter) < vnet_hdr_sz)
return -EINVAL;
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 74f1e33..6620400 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -58,7 +58,7 @@ static inline int virtio_net_hdr_from_skb(const struct 
sk_buff *skb,
  struct virtio_net_hdr *hdr,
  bool little_endian)
 {
-   memset(hdr, 0, sizeof(*hdr));
+   memset(hdr, 0, sizeof(*hdr));   /* no info leak */
 
if (skb_is_gso(skb)) {
struct skb_shared_info *sinfo = skb_shinfo(skb);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d2238b2..abe6c0b 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1970,8 +1970,6 @@ static unsigned int run_filter(struct sk_buff *skb,
 static int __packet_rcv_vnet(const struct sk_buff *skb,
 struct virtio_net_hdr *vnet_hdr)
 {
-   *vnet_hdr = (const struct virtio_net_hdr) { 0 };
-
if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le()))
BUG();
 
-- 
2.1.4
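
The reasoning in this patch (the callee already zeroes the header, so callers need not) can be illustrated with a small userspace sketch. It is not the kernel code: `struct demo_hdr`, `demo_hdr_fill()` and `demo_hdr_all_zero()` are hypothetical stand-ins, and the field layout and the value written to `gso_type` are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct virtio_net_hdr; layout is illustrative. */
struct demo_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/*
 * Fill *hdr.  The whole object is zeroed first, so fields that are not
 * written below (and any padding) never carry stale caller memory.
 */
static int demo_hdr_fill(struct demo_hdr *hdr, int is_gso, uint16_t gso_size)
{
	assert(hdr != NULL);
	memset(hdr, 0, sizeof(*hdr));	/* no info leak */

	if (is_gso) {
		hdr->gso_type = 1;	/* illustrative value only */
		hdr->gso_size = gso_size;
	}
	return 0;
}

/* Return 1 if every byte of *hdr is zero, 0 otherwise. */
static int demo_hdr_all_zero(const struct demo_hdr *hdr)
{
	const unsigned char *p = (const unsigned char *)hdr;
	size_t i;

	for (i = 0; i < sizeof(*hdr); i++)
		if (p[i] != 0)
			return 0;
	return 1;
}
```

A caller can pass an uninitialized (or deliberately poisoned) buffer and still get a fully defined header back, which is why a second clear at the call site would be redundant in this sketch.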



[PATCH net-next v2 4/5] af_packet: Use virtio_net_hdr_to_skb().

2016-11-18 Thread Jarno Rajahalme
Use the common virtio_net_hdr_to_skb() instead of open coding it.
Other call sites were changed by commit fd2a0437dc, but this one was
missed, maybe because it is split in two parts of the source code.

Interim comparisons of 'vnet_hdr->gso_type' still work as both the
vnet_hdr and skb notion of gso_type is zero when there is no gso.

Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
Signed-off-by: Jarno Rajahalme 
---
 net/packet/af_packet.c | 51 +++---
 1 file changed, 3 insertions(+), 48 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index abe6c0b..1816b77 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2388,8 +2388,6 @@ static void tpacket_set_protocol(const struct net_device 
*dev,
 
 static int __packet_snd_vnet_parse(struct virtio_net_hdr *vnet_hdr, size_t len)
 {
-   unsigned short gso_type = 0;
-
if ((vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
(__virtio16_to_cpu(vio_le(), vnet_hdr->csum_start) +
 __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset) + 2 >
@@ -2401,29 +2399,6 @@ static int __packet_snd_vnet_parse(struct virtio_net_hdr 
*vnet_hdr, size_t len)
if (__virtio16_to_cpu(vio_le(), vnet_hdr->hdr_len) > len)
return -EINVAL;
 
-   if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
-   switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
-   case VIRTIO_NET_HDR_GSO_TCPV4:
-   gso_type = SKB_GSO_TCPV4;
-   break;
-   case VIRTIO_NET_HDR_GSO_TCPV6:
-   gso_type = SKB_GSO_TCPV6;
-   break;
-   case VIRTIO_NET_HDR_GSO_UDP:
-   gso_type = SKB_GSO_UDP;
-   break;
-   default:
-   return -EINVAL;
-   }
-
-   if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
-   gso_type |= SKB_GSO_TCP_ECN;
-
-   if (vnet_hdr->gso_size == 0)
-   return -EINVAL;
-   }
-
-   vnet_hdr->gso_type = gso_type;  /* changes type, temporary storage */
return 0;
 }
 
@@ -2443,27 +2418,6 @@ static int packet_snd_vnet_parse(struct msghdr *msg, 
size_t *len,
return __packet_snd_vnet_parse(vnet_hdr, *len);
 }
 
-static int packet_snd_vnet_gso(struct sk_buff *skb,
-  struct virtio_net_hdr *vnet_hdr)
-{
-   if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
-   u16 s = __virtio16_to_cpu(vio_le(), vnet_hdr->csum_start);
-   u16 o = __virtio16_to_cpu(vio_le(), vnet_hdr->csum_offset);
-
-   if (!skb_partial_csum_set(skb, s, o))
-   return -EINVAL;
-   }
-
-   skb_shinfo(skb)->gso_size =
-   __virtio16_to_cpu(vio_le(), vnet_hdr->gso_size);
-   skb_shinfo(skb)->gso_type = vnet_hdr->gso_type;
-
-   /* Header must be checked, and gso_segs computed. */
-   skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
-   skb_shinfo(skb)->gso_segs = 0;
-   return 0;
-}
-
 static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
void *frame, struct net_device *dev, void *data, int tp_len,
__be16 proto, unsigned char *addr, int hlen, int copylen,
@@ -2723,7 +2677,8 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
}
}
 
-   if (po->has_vnet_hdr && packet_snd_vnet_gso(skb, vnet_hdr)) {
+   if (po->has_vnet_hdr && virtio_net_hdr_to_skb(skb, vnet_hdr,
+ vio_le())) {
tp_len = -EINVAL;
goto tpacket_error;
}
@@ -2914,7 +2869,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
packet_pick_tx_queue(dev, skb);
 
if (po->has_vnet_hdr) {
-   err = packet_snd_vnet_gso(skb, &vnet_hdr);
+   err = virtio_net_hdr_to_skb(skb, &vnet_hdr, vio_le());
if (err)
goto out_free;
len += sizeof(vnet_hdr);
-- 
2.1.4
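
For readers following the removed open-coded logic, here is a minimal userspace sketch of the same translation and validation steps. The `DEMO_VIRTIO_GSO_*` values are assumed to mirror the uapi virtio definitions, while the `DEMO_SKB_GSO_*` bits and the function name `demo_gso_type_from_virtio()` are hypothetical and do not correspond to real kernel values or helpers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the uapi virtio values; treat as illustrative. */
#define DEMO_VIRTIO_GSO_NONE	0
#define DEMO_VIRTIO_GSO_TCPV4	1
#define DEMO_VIRTIO_GSO_UDP	3
#define DEMO_VIRTIO_GSO_TCPV6	4
#define DEMO_VIRTIO_GSO_ECN	0x80

/* Hypothetical stack-side flag bits, not the real SKB_GSO_* values. */
#define DEMO_SKB_GSO_TCPV4	(1u << 0)
#define DEMO_SKB_GSO_UDP	(1u << 1)
#define DEMO_SKB_GSO_TCP_ECN	(1u << 2)
#define DEMO_SKB_GSO_TCPV6	(1u << 3)

/*
 * Translate a virtio gso_type into stack-side flags.
 * Returns 0 and stores the flags in *out, or -EINVAL for an unknown
 * type or a GSO packet with gso_size == 0.
 */
static int demo_gso_type_from_virtio(uint8_t vtype, uint16_t gso_size,
				     unsigned int *out)
{
	unsigned int gso_type = 0;

	assert(out != NULL);

	if (vtype != DEMO_VIRTIO_GSO_NONE) {
		/* The ECN bit is a modifier; mask it off before matching. */
		switch (vtype & ~DEMO_VIRTIO_GSO_ECN) {
		case DEMO_VIRTIO_GSO_TCPV4:
			gso_type = DEMO_SKB_GSO_TCPV4;
			break;
		case DEMO_VIRTIO_GSO_TCPV6:
			gso_type = DEMO_SKB_GSO_TCPV6;
			break;
		case DEMO_VIRTIO_GSO_UDP:
			gso_type = DEMO_SKB_GSO_UDP;
			break;
		default:
			return -EINVAL;
		}

		if (vtype & DEMO_VIRTIO_GSO_ECN)
			gso_type |= DEMO_SKB_GSO_TCP_ECN;

		if (gso_size == 0)
			return -EINVAL;
	}

	*out = gso_type;
	return 0;
}
```

Note that the "no GSO" case yields zero on both sides, which is the property the commit message relies on for the interim comparisons of `gso_type`.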



[PATCH net-next v2 2/5] virtio_net.h: Fix comment.

2016-11-18 Thread Jarno Rajahalme
Fix incorrect comment after the final #endif.

Signed-off-by: Jarno Rajahalme 
---
 include/linux/virtio_net.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index 1c912f8..74f1e33 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct 
sk_buff *skb,
return 0;
 }
 
-#endif /* _LINUX_VIRTIO_BYTEORDER */
+#endif /* _LINUX_VIRTIO_NET_H */
-- 
2.1.4



[PATCH net-next v2 5/5] af_packet: Use virtio_net_hdr_from_skb() directly.

2016-11-18 Thread Jarno Rajahalme
Remove static function __packet_rcv_vnet(), which only called
virtio_net_hdr_from_skb() and BUG()ged out if an error code was
returned.  Instead, call virtio_net_hdr_from_skb() from the former
call sites of __packet_rcv_vnet() and actually use the error handling
code that is already there.

Signed-off-by: Jarno Rajahalme 
---
 net/packet/af_packet.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 1816b77..fab9bbf 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1967,15 +1967,6 @@ static unsigned int run_filter(struct sk_buff *skb,
return res;
 }
 
-static int __packet_rcv_vnet(const struct sk_buff *skb,
-struct virtio_net_hdr *vnet_hdr)
-{
-   if (virtio_net_hdr_from_skb(skb, vnet_hdr, vio_le()))
-   BUG();
-
-   return 0;
-}
-
 static int packet_rcv_vnet(struct msghdr *msg, const struct sk_buff *skb,
   size_t *len)
 {
@@ -1985,7 +1976,7 @@ static int packet_rcv_vnet(struct msghdr *msg, const 
struct sk_buff *skb,
return -EINVAL;
*len -= sizeof(vnet_hdr);
 
-   if (__packet_rcv_vnet(skb, &vnet_hdr))
+   if (virtio_net_hdr_from_skb(skb, &vnet_hdr, vio_le()))
return -EINVAL;
 
return memcpy_to_msg(msg, (void *)&vnet_hdr, sizeof(vnet_hdr));
@@ -2244,8 +2235,9 @@ static int tpacket_rcv(struct sk_buff *skb, struct 
net_device *dev,
spin_unlock(&sk->sk_receive_queue.lock);
 
if (po->has_vnet_hdr) {
-   if (__packet_rcv_vnet(skb, h.raw + macoff -
-  sizeof(struct virtio_net_hdr))) {
+   if (virtio_net_hdr_from_skb(skb, h.raw + macoff -
+   sizeof(struct virtio_net_hdr),
+   vio_le())) {
spin_lock(&sk->sk_receive_queue.lock);
goto drop_n_account;
}
-- 
2.1.4



[PATCH net-next v2 1/5] virtio_net: Simplify call sites for virtio_net_hdr_{from,to}_skb().

2016-11-18 Thread Jarno Rajahalme
No point storing the return value of virtio_net_hdr_to_skb() or
virtio_net_hdr_from_skb() to a variable when the value is used only
once as a boolean in an immediately following if statement.

Signed-off-by: Jarno Rajahalme 
---
 drivers/net/macvtap.c | 5 ++---
 drivers/net/tun.c | 8 +++-
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 070e329..5da9861 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
if (iov_iter_count(iter) < vnet_hdr_len)
return -EINVAL;
 
-   ret = virtio_net_hdr_from_skb(skb, &vnet_hdr,
- macvtap_is_little_endian(q));
-   if (ret)
+   if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
+   macvtap_is_little_endian(q)))
BUG();
 
if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 1588469..3b8d8cc 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1252,8 +1252,7 @@ static ssize_t tun_get_user(struct tun_struct *tun, 
struct tun_file *tfile,
return -EFAULT;
}
 
-   err = virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun));
-   if (err) {
+   if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
this_cpu_inc(tun->pcpu_stats->rx_frame_errors);
kfree_skb(skb);
return -EINVAL;
@@ -1367,9 +1366,8 @@ static ssize_t tun_put_user(struct tun_struct *tun,
if (iov_iter_count(iter) < vnet_hdr_sz)
return -EINVAL;
 
-   ret = virtio_net_hdr_from_skb(skb, &gso,
- tun_is_little_endian(tun));
-   if (ret) {
+   if (virtio_net_hdr_from_skb(skb, &gso,
+   tun_is_little_endian(tun))) {
struct skb_shared_info *sinfo = skb_shinfo(skb);
pr_err("unexpected GSO type: "
   "0x%x, gso_size %d, hdr_len %d\n",
-- 
2.1.4



Re: [mm PATCH v3 21/23] mm: Add support for releasing multiple instances of a page

2016-11-18 Thread Andrew Morton
On Thu, 10 Nov 2016 06:36:06 -0500 Alexander Duyck wrote:

> This patch adds a function that allows us to batch free a page that has
> multiple references outstanding.  Specifically this function can be used to
> drop a page being used in the page frag alloc cache.  With this drivers can
> make use of functionality similar to the page frag alloc cache without
> having to do any workarounds for the fact that there is no function that
> frees multiple references.
> 
> ...
>
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -506,6 +506,8 @@ extern void free_hot_cold_page(struct page *page, bool 
> cold);
>  extern void free_hot_cold_page_list(struct list_head *list, bool cold);
>  
>  struct page_frag_cache;
> +extern void __page_frag_drain(struct page *page, unsigned int order,
> +   unsigned int count);
>  extern void *__alloc_page_frag(struct page_frag_cache *nc,
>  unsigned int fragsz, gfp_t gfp_mask);
>  extern void __free_page_frag(void *addr);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0fbfead..54fea40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3912,6 +3912,20 @@ static struct page *__page_frag_refill(struct 
> page_frag_cache *nc,
>   return page;
>  }
>  
> +void __page_frag_drain(struct page *page, unsigned int order,
> +unsigned int count)
> +{
> + VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
> +
> + if (page_ref_sub_and_test(page, count)) {
> + if (order == 0)
> + free_hot_cold_page(page, false);
> + else
> + __free_pages_ok(page, order);
> + }
> +}
> +EXPORT_SYMBOL(__page_frag_drain);

It's an exported-to-modules library function.  It should be documented,
please?  The page-frag API is only partially documented, but that's no
excuse.

And perhaps documentation will help explain the naming choice.  Why
"drain"?  I'd have expected "put"?

And why the leading underscores.  The page-frag API is pretty weird :(

And inconsistent.  __alloc_page_frag -> page_frag_alloc,
__free_page_frag -> page_frag_free(), etc.  I must have been asleep
when I let that lot through.
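
To make the semantics under discussion concrete, here is a userspace sketch of "drop several references at once, free on the last one", with the kind of comment a documented exported helper might carry. It is only an analogy: `struct demo_page` and `demo_page_drain()` are hypothetical, C11 atomics stand in for the kernel's page refcount helpers, and setting a flag stands in for actually freeing the page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted page. */
struct demo_page {
	atomic_uint refcount;
	bool freed;		/* set instead of really freeing memory */
};

/**
 * demo_page_drain - drop @count references to @page in one step
 * @page:  object whose references are being released
 * @count: number of references the caller owns and gives up
 *
 * Returns true if this call dropped the last reference (and the object
 * was released), false if other references remain.
 */
static bool demo_page_drain(struct demo_page *page, unsigned int count)
{
	unsigned int old;

	/* The caller must really hold @count references. */
	assert(atomic_load(&page->refcount) >= count);

	/* fetch_sub returns the value before the subtraction. */
	old = atomic_fetch_sub(&page->refcount, count);
	if (old == count) {
		page->freed = true;
		return true;
	}
	return false;
}
```

A driver that pre-charged a page with a large reference count could hand back its unused references with one call instead of looping over single puts, which is the use case the patch description gives.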


Re: [PATCH 3/5] virtio_net: Add XDP support

2016-11-18 Thread Eric Dumazet
On Fri, 2016-11-18 at 11:00 -0800, John Fastabend wrote:
> From: Shrijeet Mukherjee 


>  #include 
> @@ -81,6 +82,8 @@ struct receive_queue {
>  
>   struct napi_struct napi;
>  
> + struct bpf_prog *xdp_prog;

Please add proper sparse annotation, as in 

struct bpf_prog __rcu *xdp_prog;

And run sparse ;)

CONFIG_SPARSE_RCU_POINTER=y

make C=2 drivers/net/virtio_net.o






Re: [PATCH 3/5] virtio_net: Add XDP support

2016-11-18 Thread Eric Dumazet
On Fri, 2016-11-18 at 11:00 -0800, John Fastabend wrote:


>  static void free_receive_bufs(struct virtnet_info *vi)
>  {
> + struct bpf_prog *old_prog;
>   int i;
>  
>   for (i = 0; i < vi->max_queue_pairs; i++) {
>   while (vi->rq[i].pages)
>   __free_pages(get_a_page(&vi->rq[i], GFP_KERNEL), 0);
> +
> + old_prog = rcu_dereference(vi->rq[i].xdp_prog);

Seems wrong to me.

Are you sure lockdep (with CONFIG_PROVE_RCU=y) was happy with this ?

> + RCU_INIT_POINTER(vi->rq[i].xdp_prog, NULL);
> + if (old_prog)
> + bpf_prog_put(old_prog);
>   }
>  }
>  
> 




[PATCH] net: fix bogus cast in skb_pagelen() and use unsigned variables

2016-11-18 Thread Alexey Dobriyan
1) cast to "int" is unnecessary:
   u8 will be promoted to int before decrementing,
   small positive numbers fit into "int", so their values won't be changed
   during promotion.

   Once everything is int including loop counters, signedness doesn't
   matter: 32-bit operations will stay 32-bit operations.

   But! Someone tried to make this loop smart by making everything of
   the same type apparently in an attempt to optimise it.
   Do the optimization, just differently.
   Do the cast where it matters. :^)

2) frag size is an unsigned entity and the sum of fragment sizes is also
   unsigned.

Make everything unsigned, leave no MOVSX instruction behind.


add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
function                                     old     new   delta
skb_cow_data                                 835     834      -1
ip_do_fragment                              2549    2548      -1
ip6_fragment                                3130    3128      -2
Total: Before=154865032, After=154865028, chg -0.00%


Signed-off-by: Alexey Dobriyan 
---

 include/linux/skbuff.h |6 +++---
 net/ipv4/ip_output.c   |2 +-
 net/ipv6/ip6_output.c  |2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1799,11 +1799,11 @@ static inline unsigned int skb_headlen(const struct 
sk_buff *skb)
return skb->len - skb->data_len;
 }
 
-static inline int skb_pagelen(const struct sk_buff *skb)
+static inline unsigned int skb_pagelen(const struct sk_buff *skb)
 {
-   int i, len = 0;
+   unsigned int i, len = 0;
 
-   for (i = (int)skb_shinfo(skb)->nr_frags - 1; i >= 0; i--)
+   for (i = skb_shinfo(skb)->nr_frags - 1; (int)i >= 0; i--)
len += skb_frag_size(&skb_shinfo(skb)->frags[i]);
return len + skb_headlen(skb);
 }
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -581,7 +581,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct 
sk_buff *skb,
 */
if (skb_has_frag_list(skb)) {
struct sk_buff *frag, *frag2;
-   int first_len = skb_pagelen(skb);
+   unsigned int first_len = skb_pagelen(skb);
 
if (first_len - hlen > mtu ||
((first_len - hlen) & 7) ||
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -625,7 +625,7 @@ int ip6_fragment(struct net *net, struct sock *sk, struct 
sk_buff *skb,
 
hroom = LL_RESERVED_SPACE(rt->dst.dev);
if (skb_has_frag_list(skb)) {
-   int first_len = skb_pagelen(skb);
+   unsigned int first_len = skb_pagelen(skb);
struct sk_buff *frag2;
 
if (first_len - hlen > mtu ||
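
The loop shape introduced in skb_pagelen() above can be tried in isolation. The following is a userspace sketch under assumptions: `struct demo_shinfo`, `DEMO_MAX_FRAGS` and `demo_pagelen()` are hypothetical stand-ins, and the fragment sizes live in a plain array rather than in real frag descriptors.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_FRAGS 17	/* illustrative bound, an assumption */

/* Hypothetical stand-in for the shared info that holds the fragments. */
struct demo_shinfo {
	uint8_t nr_frags;
	unsigned int frag_size[DEMO_MAX_FRAGS];
};

/* Sum of all fragment sizes plus the linear (head) length. */
static unsigned int demo_pagelen(const struct demo_shinfo *sh,
				 unsigned int headlen)
{
	unsigned int i, len = 0;

	assert(sh->nr_frags <= DEMO_MAX_FRAGS);

	/*
	 * nr_frags is promoted to int before the subtraction, so with
	 * zero fragments the initial value is -1, stored in i as
	 * UINT_MAX.  The (int) cast in the condition reads that back
	 * as -1 on the usual two's-complement targets (the conversion
	 * is implementation-defined in ISO C), so the body never runs.
	 */
	for (i = sh->nr_frags - 1; (int)i >= 0; i--)
		len += sh->frag_size[i];

	return len + headlen;
}
```

The cast stays only in the loop condition; the index used for the array access and the accumulated length are both unsigned, which is the combination the patch argues for.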


[PATCH] netlink: smaller nla_attr_minlen table

2016-11-18 Thread Alexey Dobriyan
Length of a netlink attribute may be u16 but lengths of basic attributes
are much smaller, so small we can save 16 bytes of .rodata and pocket
change inside .text.

16-bit is worse on x86-64 than 8-bit because of operand size override prefix.

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-19 (-19)
function                                     old     new   delta
validate_nla                                 418     417      -1
nla_policy_len                                66      64      -2
nla_attr_minlen                               32      16     -16
Total: Before=154865051, After=154865032, chg -0.00%


Signed-off-by: Alexey Dobriyan 
---

 lib/nlattr.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -14,7 +14,7 @@
 #include 
 #include 
 
-static const u16 nla_attr_minlen[NLA_TYPE_MAX+1] = {
+static const u8 nla_attr_minlen[NLA_TYPE_MAX+1] = {
[NLA_U8]= sizeof(u8),
[NLA_U16]   = sizeof(u16),
[NLA_U32]   = sizeof(u32),
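
The .rodata saving reported above follows directly from the element type. Here is a small sketch under assumptions: the `DEMO_*` enum is hypothetical, and `DEMO_TYPE_MAX` is set to 15 only so that the table has 16 entries, matching the 32 to 16 byte change shown in the size report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute types; 16 slots in total (an assumption). */
enum {
	DEMO_U8 = 1,
	DEMO_U16,
	DEMO_U32,
	DEMO_U64,
	DEMO_TYPE_MAX = 15
};

/* Same contents, two element widths. */
static const uint16_t demo_minlen16[DEMO_TYPE_MAX + 1] = {
	[DEMO_U8]  = sizeof(uint8_t),
	[DEMO_U16] = sizeof(uint16_t),
	[DEMO_U32] = sizeof(uint32_t),
	[DEMO_U64] = sizeof(uint64_t),
};

static const uint8_t demo_minlen8[DEMO_TYPE_MAX + 1] = {
	[DEMO_U8]  = sizeof(uint8_t),
	[DEMO_U16] = sizeof(uint16_t),
	[DEMO_U32] = sizeof(uint32_t),
	[DEMO_U64] = sizeof(uint64_t),
};

/* Minimum payload length for a type; 0 means "no minimum recorded". */
static unsigned int demo_minlen(unsigned int type)
{
	assert(type <= DEMO_TYPE_MAX);
	return demo_minlen8[type];
}
```

Every stored value is at most 8, so the narrower element type loses nothing; lookups return the same numbers from either table.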


Re: [net-next] af_packet: Use virtio_net_hdr_to_skb().

2016-11-18 Thread Jarno Rajahalme
Sorry for my transgressions and wasting your time. I’ll send a v2 in a moment.

  Jarno
 
> On Nov 18, 2016, at 8:35 AM, David Miller  wrote:
> 
> From: Jarno Rajahalme 
> Date: Wed, 16 Nov 2016 18:06:42 -0800
> 
>> Use the common virtio_net_hdr_to_skb() instead of open coding it.
>> Other call sites were changed by commit fd2a0437dc, but this one was
>> missed, maybe because it is split in two parts of the source code.
>> 
>> Also fix other call sites to be more uniform.
>> 
>> Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
>> Signed-off-by: Jarno Rajahalme 
> 
> This patch is doing many more things than just this.
> 
> Do not mix unrelated changes together:
> 
>> @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>>  if (iov_iter_count(iter) < vnet_hdr_len)
>>  return -EINVAL;
>> 
>> -ret = virtio_net_hdr_from_skb(skb, &vnet_hdr,
>> -  macvtap_is_little_endian(q));
>> -if (ret)
>> +if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
>> +macvtap_is_little_endian(q)))
>>  BUG();
>> 
>>  if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=
> 
> This has nothing to do with modifying code to use virtio_net_hdr_to_skb(), it
> doesn't belong in this patch.
> 
>> @@ -1361,15 +1360,12 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>>  }
>> 
>>  if (vnet_hdr_sz) {
>> -struct virtio_net_hdr gso = { 0 }; /* no info leak */
>> -int ret;
>> -
>> +struct virtio_net_hdr gso;
> 
> This is _extremely_ opaque.  The initializer is trying to prevent kernel memory
> info leaks onto the network or into user space.
> 
> Maybe this transformation is valid but:
> 
> 1) YOU DON'T EVEN MENTION IT IN YOUR COMMIT MESSAGE.
> 
> 2) It's unrelated to this specific change, therefore it belongs in
>   a separate change.
> 
> 3) You don't explain that it is a valid transformation, nor why.
> 
> It is extremely disappointing to catch unrelated, potentially far
> reaching things embedded in a patch when I review it.
> 
> Please do not ever do this.
> 
>> @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct 
>> sk_buff *skb,
>>  return 0;
>> }
>> 
>> -#endif /* _LINUX_VIRTIO_BYTEORDER */
>> +#endif /* _LINUX_VIRTIO_NET_H */
> 
> Another unrelated change.
> 
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 11db0d6..09abb88 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1971,8 +1971,6 @@ static unsigned int run_filter(struct sk_buff *skb,
>> static int __packet_rcv_vnet(const struct sk_buff *skb,
>>   struct virtio_net_hdr *vnet_hdr)
>> {
>> -*vnet_hdr = (const struct virtio_net_hdr) { 0 };
>> -
> 
> There is no way this belongs in this patch, and again you do not explain
> why removing this initializer is valid.



[PATCH] netlink: use "unsigned int" in nla_next()

2016-11-18 Thread Alexey Dobriyan
->nla_len is an unsigned entity (it's a length after all) and u16,
thus it can't overflow when being aligned into int/unsigned int.

(nlmsg_next has the same code, but I didn't yet convince myself
it is correct to do so).

There is pointer arithmetic in this function and offset being
unsigned is better:

add/remove: 0/0 grow/shrink: 1/64 up/down: 5/-309 (-304)
function                                     old     new   delta
nl80211_set_wiphy                           1444    1449      +5
team_nl_cmd_options_set                      997     995      -2
tcf_em_tree_validate                         872     870      -2
switchdev_port_bridge_setlink                352     350      -2
switchdev_port_br_afspec                     312     310      -2
rtm_to_fib_config                            428     426      -2
qla4xxx_sysfs_ddb_set_param                 2193    2191      -2
qla4xxx_iface_set_param                     4470    4468      -2
ovs_nla_free_flow_actions                    152     150      -2
output_userspace                             518     516      -2
...
nl80211_set_reg                              654     649      -5
validate_scan_freqs                          148     142      -6
validate_linkmsg                             288     282      -6
nl80211_parse_connkeys                       489     483      -6
nlattr_set                                   231     224      -7
nf_tables_delsetelem                         267     260      -7
do_setlink                                  3416    3408      -8
netlbl_cipsov4_add_std                      1672    1659     -13
nl80211_parse_sched_scan                    2902    2888     -14
nl80211_trigger_scan                        1738    1720     -18
do_execute_actions                          2821    2738     -83
Total: Before=154865355, After=154865051, chg -0.00%

Signed-off-by: Alexey Dobriyan 
---

 include/net/netlink.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -713,7 +713,7 @@ static inline bool nla_ok(const struct nlattr *nla, int 
remaining)
  */
 static inline struct nlattr *nla_next(const struct nlattr *nla, int *remaining)
 {
-   int totlen = NLA_ALIGN(nla->nla_len);
+   unsigned int totlen = NLA_ALIGN(nla->nla_len);
 
*remaining -= totlen;
return (struct nlattr *) ((char *) nla + totlen);
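
The stepping logic in nla_next() can be exercised outside the kernel. What follows is a minimal sketch, not the netlink implementation: `struct demo_attr`, `DEMO_ALIGN()`, `demo_next()` and `demo_count()` are hypothetical names, and the 4-byte alignment is an assumption mirroring the usual netlink attribute alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ALIGNTO	4u
#define DEMO_ALIGN(len)	(((len) + DEMO_ALIGNTO - 1) & ~(DEMO_ALIGNTO - 1))

/* Hypothetical attribute header: len covers the header plus payload. */
struct demo_attr {
	uint16_t len;
	uint16_t type;
};

/* Step to the next attribute and account for the bytes consumed. */
static const struct demo_attr *demo_next(const struct demo_attr *a,
					 int *remaining)
{
	/* len is 16 bits wide, so the aligned value cannot overflow. */
	unsigned int totlen = DEMO_ALIGN(a->len);

	*remaining -= totlen;
	return (const struct demo_attr *)((const char *)a + totlen);
}

/* Count well-formed attributes in a buffer of len bytes. */
static int demo_count(const void *buf, int len)
{
	const struct demo_attr *a = buf;
	int remaining = len;
	int n = 0;

	assert(buf != NULL);

	while (remaining >= (int)sizeof(*a) &&
	       a->len >= sizeof(*a) &&
	       a->len <= remaining) {
		n++;
		a = demo_next(a, &remaining);
	}
	return n;
}
```

The offset added to the pointer is unsigned here, as in the patch, while the running `remaining` count stays a signed int so the caller's bounds check can still detect running past the end.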


[PATCH] net: make struct napi_alloc_cache::skb_count unsigned int

2016-11-18 Thread Alexey Dobriyan
size_t is way too much for an integer not exceeding 64.

Space savings: 10 bytes!

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-10 (-10)
function                                     old     new   delta
napi_consume_skb                             165     163      -2
__kfree_skb_flush                             56      53      -3
__kfree_skb_defer                             97      92      -5
Total: Before=154865639, After=154865629, chg -0.00%

Signed-off-by: Alexey Dobriyan 
---

 net/core/skbuff.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -354,7 +354,7 @@ EXPORT_SYMBOL(build_skb);
 
 struct napi_alloc_cache {
struct page_frag_cache page;
-   size_t skb_count;
+   unsigned int skb_count;
void *skb_cache[NAPI_SKB_CACHE_SIZE];
 };
 


[RFC 10/10] IB/hfi1: VNIC SDMA support

2016-11-18 Thread Vishwanathapura, Niranjana
HFI1 VNIC SDMA support enables transmission of VNIC packets over SDMA.
Map VNIC queues to SDMA engines and support halting and wakeup of the
VNIC queues.

Change-Id: I2d2d23bda9fb8a7194d9722e23bc69b110cdcf86
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
---
 drivers/infiniband/hw/hfi1/hfi.h |   1 +
 drivers/infiniband/hw/hfi1/vnic.h|  30 +++-
 drivers/infiniband/hw/hfi1/vnic_device.c |   2 +-
 drivers/infiniband/hw/hfi1/vnic_main.c   |  22 ++-
 drivers/infiniband/hw/hfi1/vnic_sdma.c   | 260 +++
 5 files changed, 311 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 2ff3453..f476188 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -855,6 +855,7 @@ struct hfi1_asic_data {
 /* Virtual NIC information */
 struct hfi1_vnic_data {
struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT];
+   struct kmem_cache *txreq_cache;
u8 num_vports;
struct hfi_vnic_ctrl_device *ctrl_dev;
struct idr vesw_idr;
diff --git a/drivers/infiniband/hw/hfi1/vnic.h 
b/drivers/infiniband/hw/hfi1/vnic.h
index d91c35b..4bdfe2b 100644
--- a/drivers/infiniband/hw/hfi1/vnic.h
+++ b/drivers/infiniband/hw/hfi1/vnic.h
@@ -49,6 +49,7 @@
 
 #include "hfi_vnic.h"
 #include "hfi.h"
+#include "sdma.h"
 
 #define HFI1_VNIC_ICRC_LEN   4
 #define HFI1_VNIC_TAIL_LEN   1
@@ -90,6 +91,26 @@
 #define HFI1_VNIC_SC_SHIFT  4
 
 /**
+ * struct hfi1_vnic_sdma - VNIC per Tx ring SDMA information
+ * @dd - device data pointer
+ * @sde - sdma engine
+ * @vinfo - vnic info pointer
+ * @wait - iowait structure
+ * @stx - sdma tx request
+ * @state - vnic Tx ring SDMA state
+ * @q_idx - vnic Tx queue index
+ */
+struct hfi1_vnic_sdma {
+   struct hfi1_devdata *dd;
+   struct sdma_engine  *sde;
+   struct hfi1_vnic_vport_info *vinfo;
+   struct iowait wait;
+   struct sdma_txreq stx;
+   unsigned int state;
+   u8 q_idx;
+};
+
+/**
  * struct hfi1_vnic_notifier - VNIC notifer structure
  * @cb - vnic callback function
  */
@@ -104,6 +125,7 @@ struct hfi1_vnic_notifier {
  * @event_flags: event notification flags
  * @notifier: vnic notifier
  * @skbq: Array of queues for received socket buffers
+ * @sdma: VNIC SDMA structure per TXQ
  */
 struct hfi1_vnic_vport_info {
struct hfi1_devdata *dd;
@@ -112,7 +134,8 @@ struct hfi1_vnic_vport_info {
DECLARE_BITMAP(event_flags, HFI_VNIC_NUM_EVTS);
struct hfi_vnic_device *vdev;
 
-   struct sk_buff_head skbq[HFI1_NUM_VNIC_CTXT];
+   struct sk_buff_headskbq[HFI1_NUM_VNIC_CTXT];
+   struct hfi1_vnic_sdma  sdma[HFI1_VNIC_MAX_TXQ];
 };
 
 static inline struct hfi1_devdata *vnic_dev2dd(struct hfi_vnic_device *vdev)
@@ -131,10 +154,15 @@ static inline void hfi1_vnic_update_pad(unsigned char 
*pad, u8 plen)
 /* vnic hfi1 internal functions */
 int hfi1_vnic_setup(struct hfi1_devdata *dd);
 void hfi1_vnic_cleanup(struct hfi1_devdata *dd);
+int hfi1_vnic_txreq_init(struct hfi1_devdata *dd);
+void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd);
 int hfi1_vnic_add_ctrl_port(struct hfi1_devdata *dd, struct device *parent);
 void hfi1_vnic_rem_ctrl_port(struct hfi1_devdata *dd);
 
 void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet);
+void hfi1_vnic_sdma_init(struct hfi1_vnic_vport_info *vinfo);
+bool hfi1_vnic_sdma_write_avail(struct hfi1_vnic_vport_info *vinfo,
+   u8 q_idx);
 
 /* vnic device bus ops */
 int hfi1_vnic_init(struct hfi_vnic_device *vdev);
diff --git a/drivers/infiniband/hw/hfi1/vnic_device.c 
b/drivers/infiniband/hw/hfi1/vnic_device.c
index 468e197..5fb1a49 100644
--- a/drivers/infiniband/hw/hfi1/vnic_device.c
+++ b/drivers/infiniband/hw/hfi1/vnic_device.c
@@ -85,7 +85,7 @@ static int hfi1_vdev_create(struct hfi_vnic_ctrl_device *cdev,
return -ENOMEM;
 
vinfo->dd = dd;
-   hfi_info.num_tx_q = 1;
+   hfi_info.num_tx_q = dd->chip_sdma_engines;
hfi_info.num_rx_q = HFI1_NUM_VNIC_CTXT;
hfi_info.cap = HFI_VNIC_CAP_SG;
vdev = hfi_vnic_device_register(cdev, port_num, vport_num, vinfo,
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c 
b/drivers/infiniband/hw/hfi1/vnic_main.c
index 82e30bd..a21e4cd 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -294,15 +294,21 @@ int hfi1_vnic_put_skb(struct hfi_vnic_device *vdev,
 
 u8 hfi1_vnic_select_queue(struct hfi_vnic_device *vdev, u8 vl, u8 entropy)
 {
-   return 0;
+   struct hfi1_devdata *dd = (struct hfi1_devdata *)vdev->cdev->hfi_priv;
+   struct sdma_engine *sde;
+
+   sde = sdma_select_engine_vl(dd, entropy, vl);
+   return sde->this_idx;
 }
 
 bool hfi1_vnic_get_write_avail(struct hfi_vnic_device *vdev, u8 q_idx)
 {
+   struct hfi1_vnic_vport_info *vinfo = vdev->hfi_priv;
+
   

[RFC 08/10] IB/hfi-vnic: VNIC Ethernet Management Agent (VEMA) driver

2016-11-18 Thread Vishwanathapura, Niranjana
HFI VEMA driver interfaces with the Infiniband MAD stack to exchange the
management information packets with the Ethernet Manager (EM).
It interfaces with the HFI VNIC netdev driver to SET/GET the management
information. The information exchanged with the EM includes class port
details, encapsulation configuration, various counters, unicast and
multicast MAC list and the MAC table. It also supports sending traps
to the EM.

Change-Id: I7439f96858c9019455da1e924a0201eb27177b85
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Tanya K Jajodia 
Signed-off-by: Sudeep Dutt 
---
 drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile |2 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h |9 +
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c   |9 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c | 1024 
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c   |2 +-
 5 files changed, 1043 insertions(+), 3 deletions(-)
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
index 375cd09..e05b72b 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
@@ -5,4 +5,4 @@ ccflags-y += -I$(src)/../include
 obj-$(CONFIG_HFI_VNIC) += hfi_vnic.o
 
 hfi_vnic-y := hfi_vnic_netdev.o hfi_vnic_encap.o hfi_vnic_ethtool.o \
-  hfi_vnic_vema_iface.o
+  hfi_vnic_vema.o hfi_vnic_vema_iface.o
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
index 8ebed89..fbebf68 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
@@ -268,6 +268,8 @@ struct hfi_vnic_rx_queue {
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
+ * @trap_timeout: trap timeout
+ * @trap_count: no. of traps allowed within timeout period
  * @q_sum_cntrs: per queue EM summary counters
  * @q_err_cntrs: per queue EM error counters
  * @q_rx_logic_errors: per queue rx logic (default) errors
@@ -301,6 +303,8 @@ struct hfi_vnic_adapter {
struct mutex stats_lock;
 
u8 flow_tbl[HFI_VNIC_FLOW_TBL_SIZE];
+   unsigned long trap_timeout;
+   u8trap_count;
 
struct __hfi_vnic_summary_counters  q_sum_cntrs[HFI_VNIC_MAX_QUEUE];
struct __hfi_vnic_error_countersq_err_cntrs[HFI_VNIC_MAX_QUEUE];
@@ -410,4 +414,9 @@ void hfi_vnic_set_per_veswport_info(struct hfi_vnic_adapter 
*adapter,
 void hfi_vnic_vema_report_event(struct hfi_vnic_adapter *adapter, u8 event);
 void hfi_vnic_set_ethtool_ops(struct net_device *ndev);
 
+int hfi_vnic_vema_init(void);
+void hfi_vnic_vema_deinit(void);
+void hfi_vnic_vema_send_trap(struct hfi_vnic_adapter *adapter,
+struct __hfi_veswport_trap *data, u32 lid);
+
 #endif /* _HFI_VNIC_INTERNAL_H */
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
index 75a3fd2..4ee5bb6 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c
@@ -855,9 +855,15 @@ static int __init hfi_vnic_init_module(void)
pr_info("HFI Virtual Network Driver - %s\n",
hfi_vnic_driver_version);
 
-   rc = hfi_vnic_driver_register(_vnic_drv);
+   rc = hfi_vnic_vema_init();
if (rc)
+   return rc;
+
+   rc = hfi_vnic_driver_register(_vnic_drv);
+   if (rc) {
pr_err("VNIC driver register failed %d\n", rc);
+   hfi_vnic_vema_deinit();
+   }
 
return rc;
 }
@@ -867,6 +873,7 @@ static int __init hfi_vnic_init_module(void)
 static void __exit hfi_vnic_exit_module(void)
 {
hfi_vnic_driver_unregister(_vnic_drv);
+   hfi_vnic_vema_deinit();
 }
 module_exit(hfi_vnic_exit_module);
 
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c
new file mode 100644
index 000..b947cdf
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema.c
@@ -0,0 +1,1024 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the 

[RFC 09/10] IB/hfi1: Virtual Network Interface Controller (VNIC) support

2016-11-18 Thread Vishwanathapura, Niranjana
Add HFI1 HW specific support for VNIC functionality. Add support to create
VNIC devices on the HFI VNIC bus. Also implement the bus operations to
allocate resources and to transmit and receive Omni-Path encapsulated
Ethernet packets.

Dynamically allocate a set of contexts for VNIC when the first VNIC
port is instantiated. Allocate VNIC contexts from the user context pool
and return them to the same pool when they are freed. Set aside
enough MSI-X interrupts for VNIC contexts and assign them when the
contexts are allocated. On the receive side, use an RSM rule to
spread TCP/UDP streams among VNIC contexts.

Change-Id: I1b275a7585d6c2e3573039a9137014031f1f5c7e
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Andrzej Kacprowski 
---
 drivers/infiniband/hw/hfi1/Kconfig|   2 +-
 drivers/infiniband/hw/hfi1/Makefile   |   3 +-
 drivers/infiniband/hw/hfi1/aspm.h |  13 +-
 drivers/infiniband/hw/hfi1/chip.c | 270 ---
 drivers/infiniband/hw/hfi1/chip.h |   2 +
 drivers/infiniband/hw/hfi1/debugfs.c  |   6 +-
 drivers/infiniband/hw/hfi1/driver.c   |  78 -
 drivers/infiniband/hw/hfi1/file_ops.c |  25 +-
 drivers/infiniband/hw/hfi1/hfi.h  |  50 ++-
 drivers/infiniband/hw/hfi1/init.c |  44 ++-
 drivers/infiniband/hw/hfi1/mad.c  |   8 +-
 drivers/infiniband/hw/hfi1/pio.c  |  17 +
 drivers/infiniband/hw/hfi1/pio.h  |   6 +
 drivers/infiniband/hw/hfi1/sysfs.c|   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   6 +-
 drivers/infiniband/hw/hfi1/user_pages.c   |   3 +-
 drivers/infiniband/hw/hfi1/vnic.h | 155 +
 drivers/infiniband/hw/hfi1/vnic_device.c  | 168 +
 drivers/infiniband/hw/hfi1/vnic_main.c| 555 ++
 drivers/infiniband/hw/hfi1/vnic_sdma.c|  60 
 include/rdma/opa_port_info.h  |   2 +-
 21 files changed, 1376 insertions(+), 99 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/vnic.h
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_device.c
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_main.c
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_sdma.c

diff --git a/drivers/infiniband/hw/hfi1/Kconfig 
b/drivers/infiniband/hw/hfi1/Kconfig
index f6ea088..6c07117 100644
--- a/drivers/infiniband/hw/hfi1/Kconfig
+++ b/drivers/infiniband/hw/hfi1/Kconfig
@@ -1,6 +1,6 @@
 config INFINIBAND_HFI1
tristate "Intel OPA Gen1 support"
-   depends on X86_64 && INFINIBAND_RDMAVT && I2C
+   depends on X86_64 && INFINIBAND_RDMAVT && I2C && HFI_VNIC_BUS
select MMU_NOTIFIER
select CRC32
select I2C_ALGOBIT
diff --git a/drivers/infiniband/hw/hfi1/Makefile 
b/drivers/infiniband/hw/hfi1/Makefile
index 0cf97a0..c579f98 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -6,13 +6,14 @@
 # Called from the kernel module build system.
 #
 obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o
+ccflags-y += -I$(src)/../../sw/intel/vnic/include
 
 hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
eprom.o file_ops.o firmware.o \
init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
-   verbs_txreq.o
+   verbs_txreq.o vnic_main.o vnic_device.o vnic_sdma.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/hw/hfi1/aspm.h 
b/drivers/infiniband/hw/hfi1/aspm.h
index 0d58fe3..3a01b69 100644
--- a/drivers/infiniband/hw/hfi1/aspm.h
+++ b/drivers/infiniband/hw/hfi1/aspm.h
@@ -229,14 +229,17 @@ static inline void aspm_ctx_timer_function(unsigned long 
data)
spin_unlock_irqrestore(>aspm_lock, flags);
 }
 
-/* Disable interrupt processing for verbs contexts when PSM contexts are open 
*/
+/*
+ * Disable interrupt processing for verbs contexts when PSM or VNIC contexts
+ * are open.
+ */
 static inline void aspm_disable_all(struct hfi1_devdata *dd)
 {
struct hfi1_ctxtdata *rcd;
unsigned long flags;
unsigned i;
 
-   for (i = 0; i < dd->first_user_ctxt; i++) {
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
rcd = dd->rcd[i];
del_timer_sync(>aspm_timer);
spin_lock_irqsave(>aspm_lock, flags);
@@ -260,7 +263,7 @@ static inline void aspm_enable_all(struct hfi1_devdata *dd)
if (aspm_mode != ASPM_MODE_DYNAMIC)
return;
 
-   for (i = 0; i < dd->first_user_ctxt; i++) {
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
rcd = dd->rcd[i];
spin_lock_irqsave(>aspm_lock, flags);
rcd->aspm_intr_enable = true;
@@ -276,7 +279,7 @@ static inline void aspm_ctx_init(struct hfi1_ctxtdata *rcd)
 

[RFC 07/10] IB/hfi-vnic: VNIC Ethernet Management Agent (VEMA) interface

2016-11-18 Thread Vishwanathapura, Niranjana
HFI VNIC EMA interface functions are the management interfaces to the HFI
VNIC netdev driver. Implement the required GET/SET management interface
functions and the processing of new management information. Add support to
send trap notifications on events such as interface status change,
unicast/multicast MAC list update and MAC address change.
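
One way to picture the MAC list update trap is to keep a hash of the
interface's MAC list and raise an event only when that hash changes. The
sketch below is user-space C and only assumes this reading of the
umac_hash/mmac_hash fields added by the patch; the FNV-1a hash and the
names mac_list_hash() and mac_list_changed() are illustrative and are not
taken from the driver.

```c
#include <stddef.h>
#include <stdint.h>

#define MAC_LEN 6

/* FNV-1a over every address in the list (the hash choice is an assumption). */
static uint32_t mac_list_hash(const uint8_t (*macs)[MAC_LEN], size_t n)
{
	uint32_t h = 2166136261u;
	size_t i, j;

	for (i = 0; i < n; i++) {
		for (j = 0; j < MAC_LEN; j++) {
			h ^= macs[i][j];
			h *= 16777619u;
		}
	}
	return h;
}

/*
 * Compare the current list against the stored snapshot hash.
 * Returns 1 (and updates *stored) when the list changed, which is the
 * point where a trap would be queued; returns 0 otherwise.
 */
static int mac_list_changed(uint32_t *stored,
			    const uint8_t (*macs)[MAC_LEN], size_t n)
{
	uint32_t h = mac_list_hash(macs, n);

	if (h == *stored)
		return 0;
	*stored = h;
	return 1;
}
```

Storing only a 32-bit hash keeps the per-adapter state small, at the cost
of a (rare) missed update if two different lists hash to the same value.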

Change-Id: I18ccdc0a898ecd7ddcaca795f0a3d205c24b7e6b
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Tanya K Jajodia 
---
 drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile |   3 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h|   4 +
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h |  24 ++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c   | 159 -
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c   | 385 +
 5 files changed, 572 insertions(+), 3 deletions(-)
 create mode 100644 
drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_vema_iface.c

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
index a05b2f5..375cd09 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
@@ -4,4 +4,5 @@
 ccflags-y += -I$(src)/../include
 obj-$(CONFIG_HFI_VNIC) += hfi_vnic.o
 
-hfi_vnic-y := hfi_vnic_netdev.o hfi_vnic_encap.o hfi_vnic_ethtool.o
+hfi_vnic-y := hfi_vnic_netdev.o hfi_vnic_encap.o hfi_vnic_ethtool.o \
+  hfi_vnic_vema_iface.o
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
index 9ed5221..4e6f367 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
@@ -99,6 +99,10 @@
 #define HFI_VNIC_STATE_DROP_ALL0x1
 #define HFI_VNIC_STATE_FORWARDING  0x3
 
+/* VNIC Ethernet link status */
+#define HFI_VNIC_ETH_LINK_UP 1
+#define HFI_VNIC_ETH_LINK_DOWN   2
+
 /**
  * struct hfi_vesw_info - HFI vnic switch information
  * @fabric_id: 10-bit fabric id
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
index 21a43f6..8ebed89 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
@@ -261,6 +261,9 @@ struct hfi_vnic_rx_queue {
  * @lock: adapter lock
  * @rxq: receive queue array
  * @info: virtual ethernet switch port information
+ * @vema_mac_addr: mac address configured by vema
+ * @umac_hash: unicast maclist hash
+ * @mmac_hash: multicast maclist hash
  * @mactbl: hash table of MAC entries
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
@@ -286,6 +289,9 @@ struct hfi_vnic_adapter {
struct hfi_vnic_rx_queue  rxq[HFI_VNIC_MAX_QUEUE];
 
struct __hfi_veswport_info  info;
+   u8  vema_mac_addr[ETH_ALEN];
+   u32 umac_hash;
+   u32 mmac_hash;
struct hlist_head  __rcu   *mactbl;
 
/* Lock used to protect updates to mac table */
@@ -378,12 +384,30 @@ struct hfi_vnic_mac_tbl_node {
 int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, struct sk_buff *skb);
 int hfi_vnic_decap_skb(struct hfi_vnic_rx_queue *rxq, struct sk_buff *skb);
 u8 hfi_vnic_calc_entropy(struct hfi_vnic_adapter *adapter, struct sk_buff 
*skb);
+void hfi_vnic_process_vema_config(struct hfi_vnic_adapter *adapter);
 void hfi_vnic_release_mac_tbl(struct hfi_vnic_adapter *adapter);
 void hfi_vnic_query_mac_tbl(struct hfi_vnic_adapter *adapter,
struct hfi_veswport_mactable *tbl);
 int hfi_vnic_update_mac_tbl(struct hfi_vnic_adapter *adapter,
struct hfi_veswport_mactable *tbl);
+void hfi_vnic_query_ucast_macs(struct hfi_vnic_adapter *adapter,
+  struct hfi_veswport_iface_macs *macs);
+void hfi_vnic_query_mcast_macs(struct hfi_vnic_adapter *adapter,
+  struct hfi_veswport_iface_macs *macs);
 void hfi_vnic_update_stats(struct net_device *netdev);
+void hfi_vnic_get_summary_counters(struct hfi_vnic_adapter *adapter,
+  struct hfi_veswport_summary_counters *cntrs);
+void hfi_vnic_get_error_counters(struct hfi_vnic_adapter *adapter,
+struct hfi_veswport_error_counters *cntrs);
+void hfi_vnic_get_vesw_info(struct hfi_vnic_adapter *adapter,
+   struct hfi_vesw_info *info);
+void hfi_vnic_set_vesw_info(struct hfi_vnic_adapter *adapter,
+   struct hfi_vesw_info *info);
+void hfi_vnic_get_per_veswport_info(struct hfi_vnic_adapter *adapter,
+  

[RFC 03/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) netdev driver

2016-11-18 Thread Vishwanathapura, Niranjana
The HFI VNIC netdev driver supports Ethernet functionality over the Omni-Path
fabric by encapsulating Ethernet packets with an Omni-Path packet header.
It interfaces with the network stack to provide standard Ethernet network
interfaces to the user. It binds with the HFI VNIC device and invokes the
bus operations supported by it.
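
As a small illustration of the header fill, the encap code quoted in this
thread writes a LID into a low field and a separate ">> 20" high field
(hdr->dlid / hdr->dlid_high). The sketch below shows that split and the
matching join in plain user-space C. The 4-bit width of the high field
(i.e. a 24-bit LID) and the names lid_split()/lid_join() are assumptions
made for the example, not something the patch text states.

```c
#include <stdint.h>

#define LID_LOW_BITS	20
#define LID_LOW_MASK	((1u << LID_LOW_BITS) - 1)
#define LID_HIGH_MASK	0xfu	/* assumed 4-bit high field */

struct lid_fields {
	uint32_t low;	/* lower 20 bits of the LID */
	uint32_t high;	/* bits above bit 19 */
};

/* Split a LID into the two header fields. */
static struct lid_fields lid_split(uint32_t lid)
{
	struct lid_fields f;

	f.low = lid & LID_LOW_MASK;
	f.high = (lid >> LID_LOW_BITS) & LID_HIGH_MASK;
	return f;
}

/* Rebuild the LID from the two header fields. */
static uint32_t lid_join(struct lid_fields f)
{
	return (f.high << LID_LOW_BITS) | f.low;
}
```

In the driver the masking presumably falls out of the bitfield widths of
the header structure; the explicit masks here just make the sketch
independent of that layout.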

Change-Id: I2613b2c36e548182828e732181e4bde99b8d01dc
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Sudeep Dutt 
Signed-off-by: Tanya K Jajodia 
Signed-off-by: Andrzej Kacprowski 
---
 drivers/infiniband/Kconfig |   1 +
 drivers/infiniband/sw/intel/vnic/Makefile  |   1 +
 drivers/infiniband/sw/intel/vnic/hfi_vnic/Kconfig  |   8 +
 drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile |   7 +
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c| 239 +++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h|  62 +++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c  |  81 
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h | 221 ++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c   | 469 +
 9 files changed, 1089 insertions(+)
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/Kconfig
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
 create mode 100644 
drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 7fe9095..0c419d2 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -85,6 +85,7 @@ source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
 
 source "drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig"
+source "drivers/infiniband/sw/intel/vnic/hfi_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 
diff --git a/drivers/infiniband/sw/intel/vnic/Makefile 
b/drivers/infiniband/sw/intel/vnic/Makefile
index 083e55b..bb22e22 100644
--- a/drivers/infiniband/sw/intel/vnic/Makefile
+++ b/drivers/infiniband/sw/intel/vnic/Makefile
@@ -1 +1,2 @@
 obj-$(CONFIG_HFI_VNIC_BUS) += hfi_vnic_bus/
+obj-$(CONFIG_HFI_VNIC) += hfi_vnic/
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Kconfig 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Kconfig
new file mode 100644
index 000..d03efc9
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Kconfig
@@ -0,0 +1,8 @@
+config HFI_VNIC
+   tristate "Intel HFI VNIC support"
+   depends on X86_64 && INFINIBAND && HFI_VNIC_BUS
+   ---help---
+   This is HFI Virtual Network Interface Controller (VNIC) driver
+   for Ethernet over HFI feature. It implements the HW independent
+   VNIC functionality. It interfaces with Linux stack for data path
+   and IB MAD for the control path.
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
new file mode 100644
index 000..a05b2f5
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/Makefile
@@ -0,0 +1,7 @@
+# Makefile - Intel HFI Virtual Network Controller driver
+# Copyright(c) 2016, Intel Corporation.
+#
+ccflags-y += -I$(src)/../include
+obj-$(CONFIG_HFI_VNIC) += hfi_vnic.o
+
+hfi_vnic-y := hfi_vnic_netdev.o hfi_vnic_encap.o hfi_vnic_ethtool.o
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
new file mode 100644
index 000..9804c6d
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
@@ -0,0 +1,239 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following 

[RFC 05/10] IB/hfi-vnic: VNIC statistics support

2016-11-18 Thread Vishwanathapura, Niranjana
HFI VNIC driver statistics support maintains various counters, including
the standard netdev counters and the counters defined by the Ethernet
manager. Add the ethtool hook to read the counters.
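
The summary counters exposed through ethtool include per-size buckets
(tx_64_size, tx_65_127, ... tx_1519_max). A minimal user-space sketch of
how a frame length could be sorted into those buckets is shown below. The
structure and function names are hypothetical, and treating every frame
of 64 bytes or less as belonging to the first bucket is an assumption;
only the bucket boundaries come from the counter names in the patch.

```c
#include <stdint.h>

/* Hypothetical mirror of the size buckets named in the ethtool strings. */
struct size_cntrs {
	uint64_t s_64;
	uint64_t s_65_127;
	uint64_t s_128_255;
	uint64_t s_256_511;
	uint64_t s_512_1023;
	uint64_t s_1024_1518;
	uint64_t s_1519_max;
};

/* Increment the bucket that matches the frame length in bytes. */
static void count_frame_size(struct size_cntrs *c, unsigned int len)
{
	if (len <= 64)
		c->s_64++;
	else if (len <= 127)
		c->s_65_127++;
	else if (len <= 255)
		c->s_128_255++;
	else if (len <= 511)
		c->s_256_511++;
	else if (len <= 1023)
		c->s_512_1023++;
	else if (len <= 1518)
		c->s_1024_1518++;
	else
		c->s_1519_max++;
}
```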

Change-Id: I6d828c2ce5eeae73d611174a985ff41f83480562
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
---
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c|  19 +-
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c  | 131 +++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h |  84 +++
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c   | 260 -
 4 files changed, 486 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
index 9804c6d..5a5e5a7 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
@@ -210,8 +210,10 @@ int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, 
struct sk_buff *skb)
hdr->slid_high = info->vport.encap_slid >> 20;
 
dlid = hfi_vnic_get_dlid(adapter, skb, def_port);
-   if (unlikely(!dlid))
+   if (unlikely(!dlid)) {
+   adapter->q_err_cntrs[skb->queue_mapping].tx_dlid_zero++;
return -EFAULT;
+   }
 
hdr->dlid = dlid;
hdr->dlid_high = dlid >> 20;
@@ -234,6 +236,19 @@ int hfi_vnic_encap_skb(struct hfi_vnic_adapter *adapter, 
struct sk_buff *skb)
 /* hfi_vnic_decap_skb - strip OPA header from the skb (ethernet) packet */
 int hfi_vnic_decap_skb(struct hfi_vnic_rx_queue *rxq, struct sk_buff *skb)
 {
+   struct hfi_vnic_adapter *adapter = rxq->adapter;
+   int max_len = adapter->netdev->mtu + VLAN_ETH_HLEN;
+   int rc = -EFAULT;
+
skb_pull(skb, HFI_VNIC_HDR_LEN);
-   return 0;
+
+   /* Validate Packet length */
+   if (skb->len > max_len)
+   adapter->q_err_cntrs[rxq->idx].rx_oversize++;
+   else if (skb->len < ETH_ZLEN)
+   adapter->q_err_cntrs[rxq->idx].rx_runt++;
+   else
+   rc = 0;
+
+   return rc;
 }
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
index 32bb9ce..ab4b00d 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_ethtool.c
@@ -54,6 +54,83 @@
 #include "hfi_vnic.h"
 #include "hfi_vnic_internal.h"
 
+enum {NETDEV_STATS, VNIC_STATS};
+
+struct vnic_stats {
+   char stat_string[ETH_GSTRING_LEN];
+   struct {
+   int type;
+   int sizeof_stat;
+   int stat_offset;
+   };
+};
+
+#define VNIC_STAT(m){ VNIC_STATS,   \
+ FIELD_SIZEOF(struct hfi_vnic_adapter, m), \
+ offsetof(struct hfi_vnic_adapter, m) }
+#define VNIC_NETDEV_STAT(m) { NETDEV_STATS, \
+ FIELD_SIZEOF(struct net_device, m),   \
+ offsetof(struct net_device, m) }
+
+static struct vnic_stats vnic_gstrings_stats[] = {
+   /* NETDEV stats */
+   {"rx_packets", VNIC_NETDEV_STAT(stats.rx_packets)},
+   {"tx_packets", VNIC_NETDEV_STAT(stats.tx_packets)},
+   {"rx_bytes", VNIC_NETDEV_STAT(stats.rx_bytes)},
+   {"tx_bytes", VNIC_NETDEV_STAT(stats.tx_bytes)},
+   {"rx_errors", VNIC_NETDEV_STAT(stats.rx_errors)},
+   {"tx_errors", VNIC_NETDEV_STAT(stats.tx_errors)},
+   {"rx_dropped", VNIC_NETDEV_STAT(stats.rx_dropped)},
+   {"tx_dropped", VNIC_NETDEV_STAT(stats.tx_dropped)},
+
+   {"rx_fifo_errors", VNIC_NETDEV_STAT(stats.rx_fifo_errors)},
+   {"rx_missed_errors", VNIC_NETDEV_STAT(stats.rx_missed_errors)},
+   {"tx_carrier_errors", VNIC_NETDEV_STAT(stats.tx_carrier_errors)},
+   {"tx_fifo_errors", VNIC_NETDEV_STAT(stats.tx_fifo_errors)},
+
+   /* SUMMARY counters */
+   {"tx_unicast", VNIC_STAT(sum_cntrs.tx_grp.unicast)},
+   {"tx_mcastbcast", VNIC_STAT(sum_cntrs.tx_grp.mcastbcast)},
+   {"tx_untagged", VNIC_STAT(sum_cntrs.tx_grp.untagged)},
+   {"tx_vlan", VNIC_STAT(sum_cntrs.tx_grp.vlan)},
+
+   {"tx_64_size", VNIC_STAT(sum_cntrs.tx_grp.xx_64_size)},
+   {"tx_65_127", VNIC_STAT(sum_cntrs.tx_grp.xx_65_127)},
+   {"tx_128_255", VNIC_STAT(sum_cntrs.tx_grp.xx_128_255)},
+   {"tx_256_511", VNIC_STAT(sum_cntrs.tx_grp.xx_256_511)},
+   {"tx_512_1023", VNIC_STAT(sum_cntrs.tx_grp.xx_512_1023)},
+   {"tx_1024_1518", VNIC_STAT(sum_cntrs.tx_grp.xx_1024_1518)},
+   {"tx_1519_max", VNIC_STAT(sum_cntrs.tx_grp.xx_1519_max)},
+
+   {"rx_unicast", VNIC_STAT(sum_cntrs.rx_grp.unicast)},
+   {"rx_mcastbcast", VNIC_STAT(sum_cntrs.rx_grp.mcastbcast)},
+   {"rx_untagged", 

[RFC 06/10] IB/hfi-vnic: VNIC MAC table support

2016-11-18 Thread Vishwanathapura, Niranjana
The HFI VNIC MAC table contains the MAC address to DLID mappings provided by
the Ethernet manager. During transmission, the MAC table provides the MAC
address to DLID translation. Implement the MAC table as a simple hash list.
Also support updating and querying the MAC table from the Ethernet manager.
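
To make the data structure concrete, here is a user-space sketch of a
chained hash table keyed on the last octet of the MAC address, which is
the keying the patch comment describes. It is not the kernel code: it uses
malloc() and plain singly linked lists instead of hlist and RCU, and all
names (mac_tbl_add(), mac_tbl_lookup(), mac_tbl_free()) are made up for
the example. Returning 0 for "no mapping" is also an assumption, chosen
because the encap path in this series treats a zero DLID as an error.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MAC_LEN		6
#define TBL_BUCKETS	256	/* one bucket per possible last-octet value */

struct mac_node {
	uint8_t mac[MAC_LEN];
	uint32_t dlid;
	struct mac_node *next;
};

struct mac_tbl {
	struct mac_node *bucket[TBL_BUCKETS];
};

/* The hash key is simply the last octet of the MAC address. */
static uint8_t mac_key(const uint8_t *mac)
{
	return mac[MAC_LEN - 1];
}

/* Insert a MAC -> DLID mapping at the head of its bucket. */
static int mac_tbl_add(struct mac_tbl *t, const uint8_t *mac, uint32_t dlid)
{
	struct mac_node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	memcpy(n->mac, mac, MAC_LEN);
	n->dlid = dlid;
	n->next = t->bucket[mac_key(mac)];
	t->bucket[mac_key(mac)] = n;
	return 0;
}

/* Return the DLID for a MAC address, or 0 when there is no mapping. */
static uint32_t mac_tbl_lookup(const struct mac_tbl *t, const uint8_t *mac)
{
	const struct mac_node *n;

	for (n = t->bucket[mac_key(mac)]; n; n = n->next)
		if (!memcmp(n->mac, mac, MAC_LEN))
			return n->dlid;
	return 0;
}

/* Free every node and leave all buckets empty. */
static void mac_tbl_free(struct mac_tbl *t)
{
	int i;

	for (i = 0; i < TBL_BUCKETS; i++) {
		while (t->bucket[i]) {
			struct mac_node *n = t->bucket[i];

			t->bucket[i] = n->next;
			free(n);
		}
	}
}
```

Addresses that share a last octet land in the same bucket, so lookup
still compares the full address while walking the chain.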

Change-Id: Ibe88bcd65ac47c316d2ac4ef746b12f82dcea274
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
---
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c| 236 +
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h |  53 -
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_netdev.c   |   4 +
 3 files changed, 292 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
index 5a5e5a7..ffdd7b3 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.c
@@ -105,6 +105,238 @@
 
 #define HFI_VNIC_SC_MASK 0x1f
 
+/*
+ * Using a simple hash table for mac table implementation with the last octet
+ * of mac address as a key.
+ */
+static void hfi_vnic_free_mac_tbl(struct hlist_head *mactbl)
+{
+   struct hfi_vnic_mac_tbl_node *node;
+   struct hlist_node *tmp;
+   int bkt;
+
+   if (!mactbl)
+   return;
+
+   vnic_hash_for_each_safe(mactbl, bkt, tmp, node, hlist) {
+   hash_del(>hlist);
+   kfree(node);
+   }
+   kfree(mactbl);
+}
+
+static struct hlist_head *hfi_vnic_alloc_mac_tbl(void)
+{
+   u32 size = sizeof(struct hlist_head) * HFI_VNIC_MAC_TBL_SIZE;
+   struct hlist_head *mactbl;
+
+   mactbl = kzalloc(size, GFP_KERNEL);
+   if (!mactbl)
+   return ERR_PTR(-ENOMEM);
+
+   vnic_hash_init(mactbl);
+   return mactbl;
+}
+
+/* hfi_vnic_release_mac_tbl - empty and free the mac table */
+void hfi_vnic_release_mac_tbl(struct hfi_vnic_adapter *adapter)
+{
+   struct hlist_head *mactbl;
+
+   mutex_lock(>mactbl_lock);
+   mactbl = rcu_access_pointer(adapter->mactbl);
+   rcu_assign_pointer(adapter->mactbl, NULL);
+   synchronize_rcu();
+   hfi_vnic_free_mac_tbl(mactbl);
+   mutex_unlock(>mactbl_lock);
+}
+
+/*
+ * hfi_vnic_query_mac_tbl - query the mac table for a section
+ *
+ * This function implements a query of a specific section of the mac table.
+ * The function also expects the requested range to be valid.
+ */
+void hfi_vnic_query_mac_tbl(struct hfi_vnic_adapter *adapter,
+   struct hfi_veswport_mactable *tbl)
+{
+   struct hfi_vnic_mac_tbl_node *node;
+   struct hlist_head *mactbl;
+   int bkt;
+   u16 loffset, lnum_entries;
+
+   rcu_read_lock();
+   mactbl = rcu_dereference(adapter->mactbl);
+   if (!mactbl)
+   goto get_mac_done;
+
+   loffset = be16_to_cpu(tbl->offset);
+   lnum_entries = be16_to_cpu(tbl->num_entries);
+
+   vnic_hash_for_each(mactbl, bkt, node, hlist) {
+   struct __hfi_vnic_mactable_entry *nentry = >entry;
+   struct hfi_veswport_mactable_entry *entry;
+
+   if ((node->index < loffset) ||
+   (node->index >= (loffset + lnum_entries)))
+   continue;
+
+   /* populate entry in the tbl corresponding to the index */
+   entry = >tbl_entries[node->index - loffset];
+   memcpy(entry->mac_addr, nentry->mac_addr,
+  ARRAY_SIZE(entry->mac_addr));
+   memcpy(entry->mac_addr_mask, nentry->mac_addr_mask,
+  ARRAY_SIZE(entry->mac_addr_mask));
+   entry->dlid_sd.dw = cpu_to_be32(nentry->dlid_sd.dw);
+   }
+   tbl->mac_tbl_digest = cpu_to_be32(adapter->info.vport.mac_tbl_digest);
+get_mac_done:
+   rcu_read_unlock();
+}
+
+/*
+ * hfi_vnic_update_mac_tbl - update mac table section
+ *
+ * This function updates the specified section of the mac table.
+ * The procedure includes the following steps.
+ *  - Allocate a new mac (hash) table.
+ *  - Add the specified entries to the new table.
+ *(except the ones that are requested to be deleted).
+ *  - Add all the other entries from the old mac table.
+ *  - If there is a failure, free the new table and return.
+ *  - Switch to the new table.
+ *  - Free the old table and return.
+ *
+ * The function also expects the requested range to be valid.
+ */
+int hfi_vnic_update_mac_tbl(struct hfi_vnic_adapter *adapter,
+   struct hfi_veswport_mactable *tbl)
+{
+   struct hfi_vnic_mac_tbl_node *node, *new_node;
+   struct hlist_head *new_mactbl, *old_mactbl;
+   int i, bkt, rc = 0;
+   u8 key;
+   u16 loffset, lnum_entries;
+
+   mutex_lock(>mactbl_lock);
+   /* allocate new mac table */
+   

[RFC 02/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) Bus driver

2016-11-18 Thread Vishwanathapura, Niranjana
The HFI VNIC bus driver interfaces between the hardware independent VNIC
functionality and the hardware dependent VNIC functionality.
Support creation of Intel HFI VNIC devices and binding with Intel
HFI VNIC drivers. Define the bus operations the HFI VNIC device
should support.

Change-Id: I91f65d0957d4866b133ee2b6b5246c49cbc0ba69
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
---
 MAINTAINERS|   7 +
 drivers/infiniband/Kconfig |   1 +
 drivers/infiniband/sw/Makefile |   1 +
 drivers/infiniband/sw/intel/vnic/Makefile  |   1 +
 .../infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig  |   8 +
 .../infiniband/sw/intel/vnic/hfi_vnic_bus/Makefile |   5 +
 .../sw/intel/vnic/hfi_vnic_bus/hfi_vnic_bus.c  | 366 +
 .../infiniband/sw/intel/vnic/include/hfi_vnic.h| 282 
 8 files changed, 671 insertions(+)
 create mode 100644 drivers/infiniband/sw/intel/vnic/Makefile
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Makefile
 create mode 100644 drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/hfi_vnic_bus.c
 create mode 100644 drivers/infiniband/sw/intel/vnic/include/hfi_vnic.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 3d838cf..8c37878 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5628,6 +5628,13 @@ F:   drivers/block/cciss*
 F: include/linux/cciss_ioctl.h
 F: include/uapi/linux/cciss_ioctl.h
 
+HFI-VNIC DRIVER
+M: Dennis Dalessandro 
+M: Niranjana Vishwanathapura 
+L: linux-r...@vger.kernel.org
+S: Supported
+F: drivers/infiniband/sw/intel/vnic
+
 HFI1 DRIVER
 M: Mike Marciniszyn 
 M: Dennis Dalessandro 
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index fb3fb89..7fe9095 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -84,6 +84,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
 
+source "drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 
diff --git a/drivers/infiniband/sw/Makefile b/drivers/infiniband/sw/Makefile
index 8b095b2..4fa6058 100644
--- a/drivers/infiniband/sw/Makefile
+++ b/drivers/infiniband/sw/Makefile
@@ -1,2 +1,3 @@
 obj-$(CONFIG_INFINIBAND_RDMAVT)+= rdmavt/
 obj-$(CONFIG_RDMA_RXE) += rxe/
+obj-$(CONFIG_INFINIBAND)   += intel/vnic/
diff --git a/drivers/infiniband/sw/intel/vnic/Makefile 
b/drivers/infiniband/sw/intel/vnic/Makefile
new file mode 100644
index 000..083e55b
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_HFI_VNIC_BUS) += hfi_vnic_bus/
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig
new file mode 100644
index 000..85952d6
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Kconfig
@@ -0,0 +1,8 @@
+config HFI_VNIC_BUS
+   tristate "Intel HFI VNIC bus support"
+   depends on X86_64
+   ---help---
+   This is HFI Virtual Network Interface Controller (VNIC) Bus driver
+   for binding Intel HFI VNIC devices and drivers. It separates the
+   hardware independent VNIC functionality from the hw dependent. It
+   provides APIs to register and unregister VNIC devices and drivers.
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Makefile 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Makefile
new file mode 100644
index 000..5fac098
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/Makefile
@@ -0,0 +1,5 @@
+# Makefile - Intel HFI Virtual Network Controller bus driver
+# Copyright(c) 2016, Intel Corporation.
+#
+ccflags-y += -I$(src)/../include
+obj-$(CONFIG_HFI_VNIC_BUS) += hfi_vnic_bus.o
diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/hfi_vnic_bus.c 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/hfi_vnic_bus.c
new file mode 100644
index 000..5455fc7
--- /dev/null
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic_bus/hfi_vnic_bus.c
@@ -0,0 +1,366 @@
+/*
+ * Copyright(c) 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * 

[RFC 01/10] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) documentation

2016-11-18 Thread Vishwanathapura, Niranjana
Add HFI VNIC design document explaining the VNIC architecture and the
driver design.

Change-Id: I7baa39444579dc582fe1e49b86e9cfc71f0a41a4
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
---
 Documentation/infiniband/hfi_vnic.txt | 97 +++
 1 file changed, 97 insertions(+)
 create mode 100644 Documentation/infiniband/hfi_vnic.txt

diff --git a/Documentation/infiniband/hfi_vnic.txt 
b/Documentation/infiniband/hfi_vnic.txt
new file mode 100644
index 000..3501288
--- /dev/null
+++ b/Documentation/infiniband/hfi_vnic.txt
@@ -0,0 +1,97 @@
+Intel Omni-Path Host Fabric Interface (HFI) Virtual Network Interface
+Controller (VNIC) feature supports Ethernet functionality over Omni-Path
+fabric by encapsulating the Ethernet packets between HFI nodes.
+
+The pattern of exchange of Omni-Path encapsulated Ethernet packets
+involves one or more virtual Ethernet switches overlaid on the Omni-Path
+fabric topology. A subset of HFI nodes on the Omni-Path fabric are
+permitted to exchange encapsulated Ethernet packets across a particular
+virtual Ethernet switch. The virtual Ethernet switches are logical
+abstractions achieved by configuring the HFI nodes on the fabric for
+header generation and processing. In the simplest configuration all HFI
+nodes across the fabric exchange encapsulated Ethernet packets over a
+single virtual Ethernet switch. A virtual Ethernet switch is effectively
+an independent Ethernet network. The configuration is performed by an
+Ethernet Manager (EM) which is part of the trusted Fabric Manager (FM)
+application. HFI nodes can have multiple VNICs each connected to a
+different virtual Ethernet switch. The below diagram presents a case
+of two virtual Ethernet switches with two HFI nodes.
+
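+                             +-------------------+
+                             |      Subnet/      |
+                             |     Ethernet      |
+                             |      Manager      |
+                             +-------------------+
+                                /          /
+                              /           /
+                            /            /
+                          /             /
++-----------------------------+  +------------------------------+
+|  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch     |
+|  +---------+    +---------+ |  | +---------+    +---------+   |
+|  | VPORT   |    |  VPORT  | |  | |  VPORT  |    |  VPORT  |   |
++--+---------+----+---------+-+  +-+---------+----+---------+---+
+         |                 \        /                 |
+         |                   \    /                   |
+         |                     \/                     |
+         |                    /  \                    |
+         |                  /      \                  |
+     +-----------+------------+  +-----------+------------+
+     |   VNIC    |    VNIC    |  |    VNIC   |    VNIC    |
+     +-----------+------------+  +-----------+------------+
+     |          HFI           |  |          HFI           |
+     +------------------------+  +------------------------+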
+
+Intel HFI VNIC software design is presented in the below diagram.
+HFI VNIC functionality has a HW dependent component and a HW
+independent component. HFI VNIC Bus module decouples these two
+functionalities.
+
+The HW dependent VNIC functionality is part of the HFI1 driver. It
+implements the bus operations to do various tasks including HW resource
+allocation for VNIC functionality and actual transmission and reception
+of encapsulated Ethernet packets over the fabric. It creates a control
+device (per HFI) on the HFI VNIC bus for the control plane operations
+and VNIC devices for the data plane. Each VNIC device on the HFI VNIC
+bus is addressed by the HFI instance, HFI port, and a VNIC port number
+on the HFI port.
+
+The HFI VNIC module implements the HW independent VNIC functionality.
+It consists of two drivers. The VNIC Ethernet Management Agent (VEMA)
+driver binds with the control device on the VNIC bus and interfaces with
+the Infiniband MAD stack. It exchanges the management information with
+the Ethernet Manager (EM). The VNIC netdev driver binds with the VNIC
+devices on the HFI VNIC bus and interfaces with the Linux network stack,
+thus providing standard Ethernet network interfaces. The VNIC netdev
+driver encapsulates the Ethernet packets with an Omni-Path header before
+passing them to the HFI1 driver for transmission. Similarly, it
+de-encapsulates the received Omni-Path packets before passing them to the
+network stack. For each VNIC interface, the information required for
+encapsulation is configured by EM via VEMA MAD interface.
+
+
++---+ +--+
+|   | |   Linux  |
+| IB MAD| |  Network |
+| 

[RFC 04/10] IB/hfi-vnic: VNIC Ethernet Management (EM) structure definitions

2016-11-18 Thread Vishwanathapura, Niranjana
Define VNIC EM MAD structures and the associated macros. These structures
are used for information exchange between VNIC EM agent on the HFI host
and the Ethernet manager.

Change-Id: If4837ec74e5b0eecc81774a52ab92fffea4b6338
Reviewed-by: Dennis Dalessandro 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Tanya K Jajodia 
---
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h| 444 +
 .../sw/intel/vnic/hfi_vnic/hfi_vnic_internal.h |  35 +-
 2 files changed, 478 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h 
b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
index 6786cce..9ed5221 100644
--- a/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
+++ b/drivers/infiniband/sw/intel/vnic/hfi_vnic/hfi_vnic_encap.h
@@ -52,11 +52,455 @@
  * and decapsulation of Ethernet packets
  */
 
+#include 
+#include 
+
+/* Maximum number of vnics supported */
+#define HFI_MAX_VPORTS_SUPPORTED 256
+
+/* EMA class version */
+#define HFI_EMA_CLASS_VERSION   0x80
+
+/*
+ * Define the Intel vendor management class for HFI
+ * ETHERNET MANAGEMENT
+ */
+#define HFI_MGMT_CLASS_INTEL_EMA0x34
+
+/* EM attribute IDs */
+#define HFI_EM_ATTR_CLASS_PORT_INFO 0x0001
+#define HFI_EM_ATTR_VESWPORT_INFO   0x0011
+#define HFI_EM_ATTR_VESWPORT_MAC_ENTRIES0x0012
+#define HFI_EM_ATTR_IFACE_UCAST_MACS0x0013
+#define HFI_EM_ATTR_IFACE_MCAST_MACS0x0014
+#define HFI_EM_ATTR_DELETE_VESW 0x0015
+#define HFI_EM_ATTR_VESWPORT_SUMMARY_COUNTERS   0x0020
+#define HFI_EM_ATTR_VESWPORT_ERROR_COUNTERS 0x0022
+
 #define HFI_VESW_MAX_NUM_DEF_PORT   16
 #define HFI_VNIC_MAX_NUM_PCP8
 
+#define HFI_VNIC_EMA_DATA(OPA_MGMT_MAD_SIZE - IB_MGMT_VENDOR_HDR)
+
+/* Defines for vendor specific notice(trap) attributes */
+#define HFI_INTEL_EMA_NOTICE_TYPE_INFO 0x04
+
+/* INTEL OUI */
+#define INTEL_OUI_1 0x00
+#define INTEL_OUI_2 0x06
+#define INTEL_OUI_3 0x6a
+
+/* Trap opcodes sent from VNIC */
+#define HFI_VESWPORT_TRAP_IFACE_UCAST_MAC_CHANGE 0x1
+#define HFI_VESWPORT_TRAP_IFACE_MCAST_MAC_CHANGE 0x2
+#define HFI_VESWPORT_TRAP_ETH_LINK_STATUS_CHANGE 0x3
+
 /* VNIC configured and operational state values */
 #define HFI_VNIC_STATE_DROP_ALL0x1
 #define HFI_VNIC_STATE_FORWARDING  0x3
 
+/**
+ * struct hfi_vesw_info - HFI vnic switch information
+ * @fabric_id: 10-bit fabric id
+ * @vesw_id: 12-bit virtual ethernet switch id
+ * @def_port_mask: bitmask of default ports
+ * @pkey: partition key
+ * @u_mcast_dlid: unknown multicast dlid
+ * @u_ucast_dlid: array of unknown unicast dlids
+ * @eth_mtu: MTUs for each vlan PCP
+ * @eth_mtu_non_vlan: MTU for non vlan packets
+ */
+struct hfi_vesw_info {
+   __be16  fabric_id;
+   __be16  vesw_id;
+
+   u8  rsvd0[6];
+   __be16  def_port_mask;
+
+   u8  rsvd1[2];
+   __be16  pkey;
+
+   u8  rsvd2[4];
+   __be32  u_mcast_dlid;
+   __be32  u_ucast_dlid[HFI_VESW_MAX_NUM_DEF_PORT];
+
+   u8  rsvd3[44];
+   __be16  eth_mtu[HFI_VNIC_MAX_NUM_PCP];
+   __be16  eth_mtu_non_vlan;
+   u8  rsvd4[2];
+} __packed;
+
+/**
+ * struct hfi_per_veswport_info - HFI vnic per port information
+ * @port_num: port number
+ * @eth_link_status: current ethernet link state
+ * @base_mac_addr: base mac address
+ * @config_state: configured port state
+ * @oper_state: operational port state
+ * @max_mac_tbl_ent: max number of mac table entries
+ * @max_smac_ent: max smac entries in mac table
+ * @mac_tbl_digest: mac table digest
+ * @encap_slid: base slid for the port
+ * @pcp_to_sc_uc: sc by pcp index for unicast ethernet packets
+ * @pcp_to_vl_uc: vl by pcp index for unicast ethernet packets
+ * @pcp_to_sc_mc: sc by pcp index for multicast ethernet packets
+ * @pcp_to_vl_mc: vl by pcp index for multicast ethernet packets
+ * @non_vlan_sc_uc: sc for non-vlan unicast ethernet packets
+ * @non_vlan_vl_uc: vl for non-vlan unicast ethernet packets
+ * @non_vlan_sc_mc: sc for non-vlan multicast ethernet packets
+ * @non_vlan_vl_mc: vl for non-vlan multicast ethernet packets
+ * @uc_macs_gen_count: generation count for unicast macs list
+ * @mc_macs_gen_count: generation count for multicast macs list
+ */
+struct hfi_per_veswport_info {
+   __be32  port_num;
+
+   u8  eth_link_status;
+   u8  rsvd0[3];
+
+   u8  base_mac_addr[ETH_ALEN];
+   u8  config_state;
+   u8  oper_state;
+
+   __be16  max_mac_tbl_ent;
+   __be16  max_smac_ent;
+   __be32  mac_tbl_digest;
+   u8  rsvd1[4];
+
+   __be32  encap_slid;
+
+   u8  pcp_to_sc_uc[HFI_VNIC_MAX_NUM_PCP];
+   u8  pcp_to_vl_uc[HFI_VNIC_MAX_NUM_PCP];
+   u8  

[RFC 00/10] HFI Virtual Network Interface Controller (VNIC)

2016-11-18 Thread Vishwanathapura, Niranjana
Intel Omni-Path Host Fabric Interface (HFI) Virtual Network Interface
Controller (VNIC) feature supports Ethernet functionality over Omni-Path
fabric by encapsulating the Ethernet packets between HFI nodes.

The pattern of exchange of Omni-Path encapsulated Ethernet packets
involves one or more virtual Ethernet switches overlaid on the Omni-Path
fabric topology. A subset of HFI nodes on the Omni-Path fabric are
permitted to exchange encapsulated Ethernet packets across a particular
virtual Ethernet switch. The virtual Ethernet switches are logical
abstractions achieved by configuring the HFI nodes on the fabric for
header generation and processing. In the simplest configuration all HFI
nodes across the fabric exchange encapsulated Ethernet packets over a
single virtual Ethernet switch. A virtual Ethernet switch is effectively
an independent Ethernet network. The configuration is performed by an
Ethernet Manager (EM) which is part of the trusted Fabric Manager (FM)
application. HFI nodes can have multiple VNICs each connected to a
different virtual Ethernet switch. The diagram below presents a case
of two virtual Ethernet switches with two HFI nodes.

 +---+
 |  Subnet/  |
 | Ethernet  |
 |  Manager  |
 +---+
/  /
  /   /
//
  / /
+-+  +--+
|  Virtual Ethernet Switch|  |  Virtual Ethernet Switch |
|  +-++-+ |  | +-++-+   |
|  | VPORT   ||  VPORT  | |  | |  VPORT  ||  VPORT  |   |
+--+-++-+-+  +-+-++-+---+
 | \/ |
 |   \/   |
 | \/ |
 |/  \|
 |  /  \  |
 +---++  +---++
 |   VNIC|VNIC|  |VNIC   |VNIC|
 +---++  +---++
 |  HFI   |  |  HFI   |
 ++  ++

Intel HFI VNIC software design is presented in the below diagram.
HFI VNIC functionality has a HW dependent component and a HW
independent component. HFI VNIC Bus module decouples these two
functionalities.

The HW dependent VNIC functionality is part of the HFI1 driver. It
implements the bus operations to do various tasks including HW resource
allocation for VNIC functionality and actual transmission and reception
of encapsulated Ethernet packets over the fabric. It creates a control
device (per HFI) on the HFI VNIC bus for the control plane operations
and VNIC devices for the data plane. Each VNIC device on the HFI VNIC
bus is addressed by the HFI instance, HFI port, and a VNIC port number
on the HFI port.

The HFI VNIC module implements the HW independent VNIC functionality.
It consists of two drivers. The VNIC Ethernet Management Agent (VEMA)
driver binds with the control device on the VNIC bus and interfaces with
the Infiniband MAD stack. It exchanges the management information with
the Ethernet Manager (EM). The VNIC netdev driver binds with the VNIC
devices on the HFI VNIC bus and interfaces with the Linux network stack,
thus providing standard Ethernet network interfaces. The VNIC netdev
driver encapsulates the Ethernet packets with an Omni-Path header before
passing them to the HFI1 driver for transmission. Similarly, it
de-encapsulates the received Omni-Path packets before passing them to the
network stack. For each VNIC interface, the information required for
encapsulation is configured by EM via VEMA MAD interface.


+---+ +--+
|   | |   Linux  |
| IB MAD| |  Network |
|   | |   Stack  |
+---+ +--+
 |   |
 |   |
++
||
| HFI VNIC Module|
|(HFI VNIC Netdev and EMA drivers)   |
||
++
 |
 |
++
|  HFI VNIC Bus  |
++
  

[PATCH] net: macb: add check for dma mapping error in start_xmit()

2016-11-18 Thread Alexey Khoroshilov
at91ether_start_xmit() does not check for dma mapping errors.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov 
---
 drivers/net/ethernet/cadence/macb.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/cadence/macb.c 
b/drivers/net/ethernet/cadence/macb.c
index b32444a3ed79..533653bd7aec 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -2673,6 +2673,12 @@ static int at91ether_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
lp->skb_length = skb->len;
lp->skb_physaddr = dma_map_single(NULL, skb->data, skb->len,
DMA_TO_DEVICE);
+   if (dma_mapping_error(NULL, lp->skb_physaddr)) {
+   dev_kfree_skb_any(skb);
+   dev->stats.tx_dropped++;
+   netdev_err(dev, "%s: DMA mapping error\n", __func__);
+   return NETDEV_TX_OK;
+   }
 
/* Set address of the data in the Transmit Address register */
macb_writel(lp, TAR, lp->skb_physaddr);
-- 
2.7.4



Potential deadlock BUG in drivers/net/wireless/st/cw1200/sta.c (Linux 4.9)

2016-11-18 Thread Iago Abal
Hi,

With the help of a static bug finder (EBA -
https://github.com/models-team/eba) I have found a potential deadlock
in drivers/net/wireless/st/cw1200/sta.c. This happens due to a recursive
mutex_lock on `priv->conf_mutex'.

If this is indeed a bug, I will be happy to help with a patch.

A quick (not elegant) fix could be to unlock before the call to
`cw1200_do_unjoin' in line 1174, and lock again afterwards. It seems
that `cw1200_join_complete' is always called with `priv->conf_mutex'
held. Another option could be to add a Boolean parameter to
`cw1200_do_unjoin' to choose whether this function should take the
lock itself. Yet another option would be to have a
`__cw1200_do_unjoin' that does not lock, and make `cw1200_do_unjoin' a
wrapper over this that adds the locking; `cw1200_join_complete' would
call `__cw1200_do_unjoin' instead.

Someone who is actually familiar with this code may have a better
proposal though.
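
For illustration only, the third option above (a lock-free body plus a
locking wrapper) could look roughly like the userspace sketch below. It
uses pthread mutexes rather than the kernel mutex API, and every name
starting with demo_ is hypothetical; it is not the cw1200 code and has
not been checked against the driver.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's private state. */
struct demo_priv {
	pthread_mutex_t conf_mutex;
	bool joined;
	int unjoin_calls;
};

/* Lock-free body: the caller must already hold priv->conf_mutex. */
static void demo_do_unjoin_locked(struct demo_priv *priv)
{
	/* Crude "lock is held" check: trylock on a held normal mutex fails. */
	assert(pthread_mutex_trylock(&priv->conf_mutex) != 0);
	priv->joined = false;
	priv->unjoin_calls++;
}

/* Locking wrapper for callers that do not hold the mutex. */
static void demo_do_unjoin(struct demo_priv *priv)
{
	pthread_mutex_lock(&priv->conf_mutex);
	demo_do_unjoin_locked(priv);
	pthread_mutex_unlock(&priv->conf_mutex);
}

/* Always entered with conf_mutex held, so it calls the _locked variant. */
static void demo_join_complete(struct demo_priv *priv, bool join_failed)
{
	if (join_failed)
		demo_do_unjoin_locked(priv);
	else
		priv->joined = true;
}

/* Work handler: takes the mutex once for the whole call chain. */
static void demo_join_complete_work(struct demo_priv *priv, bool join_failed)
{
	pthread_mutex_lock(&priv->conf_mutex);
	demo_join_complete(priv, join_failed);
	pthread_mutex_unlock(&priv->conf_mutex);
}
```

With this split the mutex is taken exactly once on the work path, while
other callers keep using the wrapper unchanged.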

The trace is as follows:

1. Function `cw1200_join_complete_work' takes the first lock in line 1189:

// see 
https://github.com/torvalds/linux/blob/v4.9-rc5/drivers/net/wireless/st/cw1200/sta.c#L1189
mutex_lock(& priv->conf_mutex);

2. and subsequently calls `cw1200_join_complete';
3. which calls `cw1200_do_unjoin' in line 1174;
4. and this latter function takes the lock for the second time in line 1387:

// see 
https://github.com/torvalds/linux/blob/v4.9-rc5/drivers/net/wireless/st/cw1200/sta.c#L1387
mutex_lock(& priv->conf_mutex);

Hope it helps!

--
iago


[PATCH net] l2tp: fix racy SOCK_ZAPPED flag check in l2tp_ip{,6}_bind()

2016-11-18 Thread Guillaume Nault
Lock socket before checking the SOCK_ZAPPED flag in l2tp_ip6_bind().
Without lock, a concurrent call could modify the socket flags between
the sock_flag(sk, SOCK_ZAPPED) test and the lock_sock() call. This way,
a socket could be inserted twice in l2tp_ip6_bind_table. Releasing it
would then leave a stale pointer there, generating use-after-free
errors when walking through the list or modifying adjacent entries.
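
As a rough userspace illustration of the ordering described above (take
the lock first, then test the flag), the sketch below uses pthread
mutexes and a small array in place of the real bind table. All demo_
names are hypothetical and the code is not taken from net/l2tp.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_TABLE_SIZE 4

/* Hypothetical stand-in for a socket with a "not yet bound" flag. */
struct demo_sock {
	pthread_mutex_t lock;
	bool zapped;	/* true until the socket has been bound */
};

static pthread_mutex_t demo_table_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_sock *demo_bind_table[DEMO_TABLE_SIZE];
static int demo_bind_count;

static int demo_bind(struct demo_sock *sk)
{
	int err = -EINVAL;

	assert(sk != NULL);

	/* Take the socket lock BEFORE looking at the flag. */
	pthread_mutex_lock(&sk->lock);
	if (!sk->zapped)
		goto out;	/* already bound: refuse a second insert */

	pthread_mutex_lock(&demo_table_lock);
	if (demo_bind_count < DEMO_TABLE_SIZE) {
		demo_bind_table[demo_bind_count++] = sk;
		sk->zapped = false;
		err = 0;
	} else {
		err = -ENOSPC;
	}
	pthread_mutex_unlock(&demo_table_lock);
out:
	pthread_mutex_unlock(&sk->lock);
	return err;
}
```

Because the flag is read and cleared inside one critical section, two
concurrent callers cannot both see the socket as unbound, so it cannot
end up in the table twice.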

BUG: KASAN: use-after-free in l2tp_ip6_close+0x22e/0x290 at addr 
8800081b0ed8
Write of size 8 by task syz-executor/10987
CPU: 0 PID: 10987 Comm: syz-executor Not tainted 4.8.0+ #39
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
 880031d97838 829f835b 88001b5a1640 8800081b0ec0
 8800081b15a0 8800081b6d20 880031d97860 8174d3cc
 880031d978f0 8800081b0e80 88001b5a1640 880031d978e0
Call Trace:
 [] dump_stack+0xb3/0x118 lib/dump_stack.c:15
 [] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
 [< inline >] print_address_description mm/kasan/report.c:194
 [] kasan_report_error+0x1f6/0x4d0 mm/kasan/report.c:283
 [< inline >] kasan_report mm/kasan/report.c:303
 [] __asan_report_store8_noabort+0x3e/0x40 
mm/kasan/report.c:329
 [< inline >] __write_once_size ./include/linux/compiler.h:249
 [< inline >] __hlist_del ./include/linux/list.h:622
 [< inline >] hlist_del_init ./include/linux/list.h:637
 [] l2tp_ip6_close+0x22e/0x290 net/l2tp/l2tp_ip6.c:239
 [] inet_release+0xed/0x1c0 net/ipv4/af_inet.c:415
 [] inet6_release+0x50/0x70 net/ipv6/af_inet6.c:422
 [] sock_release+0x8d/0x1d0 net/socket.c:570
 [] sock_close+0x16/0x20 net/socket.c:1017
 [] __fput+0x28c/0x780 fs/file_table.c:208
 [] fput+0x15/0x20 fs/file_table.c:244
 [] task_work_run+0xf9/0x170
 [] do_exit+0x85e/0x2a00
 [] do_group_exit+0x108/0x330
 [] get_signal+0x617/0x17a0 kernel/signal.c:2307
 [] do_signal+0x7f/0x18f0
 [] exit_to_usermode_loop+0xbf/0x150 
arch/x86/entry/common.c:156
 [< inline >] prepare_exit_to_usermode arch/x86/entry/common.c:190
 [] syscall_return_slowpath+0x1a0/0x1e0 
arch/x86/entry/common.c:259
 [] entry_SYSCALL_64_fastpath+0xc4/0xc6
Object at 8800081b0ec0, in cache L2TP/IPv6 size: 1448
Allocated:
PID = 10987
 [ 1116.897025] [] save_stack_trace+0x16/0x20
 [ 1116.897025] [] save_stack+0x46/0xd0
 [ 1116.897025] [] kasan_kmalloc+0xad/0xe0
 [ 1116.897025] [] kasan_slab_alloc+0x12/0x20
 [ 1116.897025] [< inline >] slab_post_alloc_hook mm/slab.h:417
 [ 1116.897025] [< inline >] slab_alloc_node mm/slub.c:2708
 [ 1116.897025] [< inline >] slab_alloc mm/slub.c:2716
 [ 1116.897025] [] kmem_cache_alloc+0xc8/0x2b0 mm/slub.c:2721
 [ 1116.897025] [] sk_prot_alloc+0x69/0x2b0 
net/core/sock.c:1326
 [ 1116.897025] [] sk_alloc+0x38/0xae0 net/core/sock.c:1388
 [ 1116.897025] [] inet6_create+0x2d7/0x1000 
net/ipv6/af_inet6.c:182
 [ 1116.897025] [] __sock_create+0x37b/0x640 net/socket.c:1153
 [ 1116.897025] [< inline >] sock_create net/socket.c:1193
 [ 1116.897025] [< inline >] SYSC_socket net/socket.c:1223
 [ 1116.897025] [] SyS_socket+0xef/0x1b0 net/socket.c:1203
 [ 1116.897025] [] entry_SYSCALL_64_fastpath+0x23/0xc6
Freed:
PID = 10987
 [ 1116.897025] [] save_stack_trace+0x16/0x20
 [ 1116.897025] [] save_stack+0x46/0xd0
 [ 1116.897025] [] kasan_slab_free+0x71/0xb0
 [ 1116.897025] [< inline >] slab_free_hook mm/slub.c:1352
 [ 1116.897025] [< inline >] slab_free_freelist_hook mm/slub.c:1374
 [ 1116.897025] [< inline >] slab_free mm/slub.c:2951
 [ 1116.897025] [] kmem_cache_free+0xc8/0x330 mm/slub.c:2973
 [ 1116.897025] [< inline >] sk_prot_free net/core/sock.c:1369
 [ 1116.897025] [] __sk_destruct+0x32b/0x4f0 
net/core/sock.c:1444
 [ 1116.897025] [] sk_destruct+0x44/0x80 net/core/sock.c:1452
 [ 1116.897025] [] __sk_free+0x53/0x220 net/core/sock.c:1460
 [ 1116.897025] [] sk_free+0x23/0x30 net/core/sock.c:1471
 [ 1116.897025] [] sk_common_release+0x28c/0x3e0 
./include/net/sock.h:1589
 [ 1116.897025] [] l2tp_ip6_close+0x1fe/0x290 
net/l2tp/l2tp_ip6.c:243
 [ 1116.897025] [] inet_release+0xed/0x1c0 
net/ipv4/af_inet.c:415
 [ 1116.897025] [] inet6_release+0x50/0x70 
net/ipv6/af_inet6.c:422
 [ 1116.897025] [] sock_release+0x8d/0x1d0 net/socket.c:570
 [ 1116.897025] [] sock_close+0x16/0x20 net/socket.c:1017
 [ 1116.897025] [] __fput+0x28c/0x780 fs/file_table.c:208
 [ 1116.897025] [] fput+0x15/0x20 fs/file_table.c:244
 [ 1116.897025] [] task_work_run+0xf9/0x170
 [ 1116.897025] [] do_exit+0x85e/0x2a00
 [ 1116.897025] [] do_group_exit+0x108/0x330
 [ 1116.897025] [] get_signal+0x617/0x17a0 
kernel/signal.c:2307
 [ 1116.897025] [] do_signal+0x7f/0x18f0
 [ 1116.897025] [] exit_to_usermode_loop+0xbf/0x150 
arch/x86/entry/common.c:156
 [ 1116.897025] [< inline >] prepare_exit_to_usermode 
arch/x86/entry/common.c:190
 [ 1116.897025] [] syscall_return_slowpath+0x1a0/0x1e0 
arch/x86/entry/common.c:259
 [ 

Re: [PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread Jakub Kicinski
Looks very cool! :)

On Fri, 18 Nov 2016 11:00:41 -0800, John Fastabend wrote:
> @@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, 
> struct bpf_prog *prog)
>   return -EINVAL;
>   }
>  
> + curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
> + if (prog)
> + xdp_qp = num_online_cpus();

Is num_online_cpus() correct here?
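
To make the concern a bit more concrete: patch 5/5 picks the XDP queue
by adding smp_processor_id() to a base index, so the number of XDP
queues has to cover the highest CPU id that can show up, not just the
count of CPUs. The userspace sketch below models a CPU mask as a bool
array; the demo_ helpers are hypothetical and only mimic the idea
behind num_online_cpus() and nr_cpu_ids, they are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of set entries: the analogue of counting CPUs in a mask. */
static unsigned int demo_count_set(const bool *mask, size_t len)
{
	unsigned int n = 0;

	assert(mask != NULL);
	for (size_t i = 0; i < len; i++)
		if (mask[i])
			n++;
	return n;
}

/* Highest set index plus one: the size of the id space in use. */
static unsigned int demo_id_space(const bool *mask, size_t len)
{
	unsigned int top = 0;

	assert(mask != NULL);
	for (size_t i = 0; i < len; i++)
		if (mask[i])
			top = (unsigned int)i + 1;
	return top;
}

/* True if every set CPU id is a valid index into a table of that size. */
static bool demo_ids_fit(const bool *mask, size_t len, unsigned int table_size)
{
	assert(mask != NULL);
	for (size_t i = 0; i < len; i++)
		if (mask[i] && i >= table_size)
			return false;
	return true;
}
```

With a sparse mask such as CPUs {0, 2}, a table sized by the count has
two slots and CPU 2 indexes past its end, while a table sized by the id
space has three slots and every id fits.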


Re: [PATCH -next] tcp: make undo_cwnd mandatory for congestion modules

2016-11-18 Thread Neal Cardwell
On Fri, Nov 18, 2016 at 1:54 PM, Florian Westphal  wrote:
> David Miller  wrote:
>> If you really suspect that highspeed et al. need to implement their own
>> undo_cwnd instead of using the default reno fallback, I would really
>> rather that this gets either fixed or explicitly marked as likely wrong
>> (in an "XXX" comment or similar).
>
> Ok, fair enough.  I am not familiar with these algorithms, I will check
> what they're doing in more detail and if absolutely needed resubmit this
> patch with XXX/FIXME/TODO comments added.

BTW, FWIW I really like the idea of making undo_cwnd required. It
simplifies the core code and forces CC modules to think about what
undo should look like for their CC module.

And I suspect you are right that those CC modules have an issue that
should be fixed.

neal


[PATCH net-next] mlx4: avoid unnecessary dirtying of critical fields

2016-11-18 Thread Eric Dumazet
From: Eric Dumazet 

While stressing a 40Gbit mlx4 NIC with busy polling, I found false
sharing in mlx4 driver that can be easily avoided.

This patch brings an additional 7 % performance improvement in UDP_RR
workload.

1) If we received no frame during one mlx4_en_process_rx_cq()
   invocation, no need to call mlx4_cq_set_ci() and/or dirty ring->cons

2) Do not refill rx buffers if we have plenty of them.
   This avoids false sharing and allows some bulk/batch optimizations.
   Page allocator and its locks will thank us.

Finally, mlx4_en_poll_rx_cq() should not return 0 if it determined
cpu handling NIC IRQ should be changed. We should return budget-1
instead, to not fool net_rx_action() and its netdev_budget.

Signed-off-by: Eric Dumazet 
Cc: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   51 +++
 1 file changed, 32 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 22f08f9ef464..2112494ff43b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -688,18 +688,23 @@ static void validate_loopback(struct mlx4_en_priv *priv, 
struct sk_buff *skb)
dev_kfree_skb_any(skb);
 }
 
-static void mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
-struct mlx4_en_rx_ring *ring)
+static bool mlx4_en_refill_rx_buffers(struct mlx4_en_priv *priv,
+ struct mlx4_en_rx_ring *ring)
 {
-   int index = ring->prod & ring->size_mask;
+   u32 missing = ring->actual_size - (ring->prod - ring->cons);
 
-   while ((u32) (ring->prod - ring->cons) < ring->actual_size) {
-   if (mlx4_en_prepare_rx_desc(priv, ring, index,
+   /* Try to batch allocations, but not too much. */
+   if (missing < 8)
+   return false;
+   do {
+   if (mlx4_en_prepare_rx_desc(priv, ring,
+   ring->prod & ring->size_mask,
GFP_ATOMIC | __GFP_COLD))
break;
ring->prod++;
-   index = ring->prod & ring->size_mask;
-   }
+   } while (--missing);
+
+   return true;
 }
 
 /* When hardware doesn't strip the vlan, we need to calculate the checksum
@@ -1081,15 +1086,20 @@ int mlx4_en_process_rx_cq(struct net_device *dev, 
struct mlx4_en_cq *cq, int bud
 
 out:
rcu_read_unlock();
-   if (doorbell_pending)
-   mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
-
-   AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
-   mlx4_cq_set_ci(&cq->mcq);
-   wmb(); /* ensure HW sees CQ consumer before we post new buffers */
-   ring->cons = cq->mcq.cons_index;
-   mlx4_en_refill_rx_buffers(priv, ring);
-   mlx4_en_update_rx_prod_db(ring);
+
+   if (polled) {
+   if (doorbell_pending)
+   mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
+
+   AVG_PERF_COUNTER(priv->pstats.rx_coal_avg, polled);
+   mlx4_cq_set_ci(&cq->mcq);
+   wmb(); /* ensure HW sees CQ consumer before we post new buffers */
+   ring->cons = cq->mcq.cons_index;
+   }
+
+   if (mlx4_en_refill_rx_buffers(priv, ring))
+   mlx4_en_update_rx_prod_db(ring);
+
return polled;
 }
 
@@ -1131,10 +1141,13 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int 
budget)
return budget;
 
/* Current cpu is not according to smp_irq_affinity -
-* probably affinity changed. need to stop this NAPI
-* poll, and restart it on the right CPU
+* probably affinity changed. Need to stop this NAPI
+* poll, and restart it on the right CPU.
+* Try to avoid returning a too small value (like 0),
+* to not fool net_rx_action() and its netdev_budget
 */
-   done = 0;
+   if (done)
+   done--;
}
/* Done for now */
if (napi_complete_done(napi, done))
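
A small userspace sketch of the refill batching in the patch above may
help when reading the hunk. The struct and function names with a demo_
prefix are hypothetical, buffer allocation is assumed to always
succeed, and only the threshold of 8 and the counter arithmetic are
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_REFILL_BATCH 8u	/* threshold used in the patch */

/* Hypothetical, trimmed-down view of an RX ring. */
struct demo_ring {
	uint32_t prod;		/* free-running producer counter */
	uint32_t cons;		/* free-running consumer counter */
	uint32_t actual_size;	/* number of usable descriptors */
};

/*
 * Returns true if buffers were posted, i.e. the caller should update
 * the producer doorbell; returns false if the ring was left untouched.
 */
static bool demo_refill(struct demo_ring *ring)
{
	/* Unsigned subtraction stays correct when the counters wrap. */
	uint32_t missing = ring->actual_size - (ring->prod - ring->cons);

	assert(missing <= ring->actual_size);

	/* Skip small refills so ring->prod is not dirtied needlessly. */
	if (missing < DEMO_REFILL_BATCH)
		return false;

	do {
		/* A real driver would allocate and map a buffer here. */
		ring->prod++;
	} while (--missing);

	return true;
}
```

The boolean return value is what lets the caller skip the producer
doorbell write when nothing was posted.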




Re: [PATCH] mlxsw: switchib: add MLXSW_PCI dependency

2016-11-18 Thread David Miller
From: Arnd Bergmann 
Date: Fri, 18 Nov 2016 17:01:14 +0100

> The newly added switchib driver fails to link if MLXSW_PCI=m:
> 
> drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In 
> function `mlxsw_sib_module_exit':
> switchib.c:(.exit.text+0x8): undefined reference to 
> `mlxsw_pci_driver_unregister'
> switchib.c:(.exit.text+0x10): undefined reference to 
> `mlxsw_pci_driver_unregister'
> drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function 
> `mlxsw_sib_module_init':
> switchib.c:(.init.text+0x28): undefined reference to 
> `mlxsw_pci_driver_register'
> switchib.c:(.init.text+0x38): undefined reference to 
> `mlxsw_pci_driver_register'
> switchib.c:(.init.text+0x48): undefined reference to 
> `mlxsw_pci_driver_unregister'
> 
> The other two such sub-drivers have a dependency, so add the same one
> here. In theory we could allow this driver if MLXSW_PCI is disabled,
> but it's probably not worth it.
> 
> Signed-off-by: Arnd Bergmann 

Please resubmit this with a proper fixes tag that identifies the commit that
added the switchib driver.

Thanks.


Re: [PATCH net] rtnetlink: fix FDB size computation

2016-11-18 Thread David Miller
From: Sabrina Dubroca 
Date: Fri, 18 Nov 2016 15:50:39 +0100

> Add missing NDA_VLAN attribute's size.
> 
> Fixes: 1e53d5bb8878 ("net: Pass VLAN ID to rtnl_fdb_notify.")
> Signed-off-by: Sabrina Dubroca 

Applied and queued up for -stable, thanks.


Re: [PATCH 0/5] XDP for virtio_net

2016-11-18 Thread John Fastabend
On 16-11-18 10:59 AM, John Fastabend wrote:
> This implements virtio_net for the mergeable buffers and big_packet
> modes. I tested this with vhost_net running on qemu and did not see
> any issues.
> 
> There are some restrictions for XDP to be enabled (see patch 3) for
> more details.
> 
>   1. LRO must be off
>   2. MTU must be less than PAGE_SIZE
>   3. queues must be available to dedicate to XDP
>   4. num_bufs received in mergeable buffers must be 1
>   5. big_packet mode must have all data on single page
> 
> Please review any comments/feedback welcome as always.
> 
> Thanks,
> John
> ---
> 

Hi Dave,

Should be obvious, but this is for net-next; I dropped the tag from my
git-send command.

Also I missed probably the most important person on the CC/TO list.

+Michael Tsirkin.

Thanks,
John


Re: [PATCH net-next] cxgb4: Allocate Tx queues dynamically

2016-11-18 Thread David Miller
From: Atul Gupta 
Date: Fri, 18 Nov 2016 16:37:40 +0530

> From: Hariprasad Shenai 
> 
> Allocate resources dynamically for Upper layer driver's (ULD) like
> cxgbit, iw_cxgb4, cxgb4i and chcr. The resources allocated include Tx
> queues which are allocated when ULD register with cxgb4 driver and freed
> while un-registering. The Tx queues which are shared by ULD shall be
> allocated by first registering driver and un-allocated by last
> unregistering driver.
> 
> Signed-off-by: Atul Gupta 

Applied.


Re: [PATCH] liquidio CN23XX: check if PENDING bit is clear using logical and

2016-11-18 Thread David Miller
From: Colin King 
Date: Fri, 18 Nov 2016 18:45:32 +

> From: Colin Ian King 
> 
> the mbox state should be bitwise anded rather than logically anded
> with OCTEON_MBOX_STATE_RESPONSE_PENDING. Fix this by using the
> correct & operator instead of &&.
> 
> Signed-off-by: Colin Ian King 

Dan Carpenter already submitted a fix for this.
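
For readers skimming the thread, the difference between the two
operators can be shown with a tiny sketch. The flag values below are
made up for the example and are not the driver's real
OCTEON_MBOX_STATE_* definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, chosen only for this example. */
#define DEMO_STATE_REQUEST_RECEIVED	0x1u
#define DEMO_STATE_RESPONSE_PENDING	0x4u

/* Logical AND: true whenever both operands are non-zero. */
static bool demo_test_logical(unsigned int state, unsigned int flag)
{
	assert(flag != 0);
	return state && flag;
}

/* Bitwise AND: true only if the flag's bit is set in state. */
static bool demo_test_bitwise(unsigned int state, unsigned int flag)
{
	assert(flag != 0);
	return (state & flag) != 0;
}
```

With state set to DEMO_STATE_REQUEST_RECEIVED only, the logical test
reports the pending flag as set even though its bit is clear, while the
bitwise test does not.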


Re: [patch net-next] liquidio CN23XX: bitwise vs logical AND typo

2016-11-18 Thread David Miller
From: Dan Carpenter 
Date: Fri, 18 Nov 2016 14:47:35 +0300

> We obviously intended a bitwise AND here, not a logical one.
> 
> Fixes: 8c978d059224 ("liquidio CN23XX: Mailbox support")
> Signed-off-by: Dan Carpenter 

Applied.


Re: pull-request: mac80211 2016-11-18

2016-11-18 Thread David Miller
From: Johannes Berg 
Date: Fri, 18 Nov 2016 08:52:00 +0100

> Due to travel/vacation, this is a bit late, but there aren't
> that many fixes either. Most interesting/important are the
> fixes from Felix and perhaps the scan entry limit.
> 
> Please pull and let me know if there's any problem.

Pulled, thanks a lot Johannes.


[PATCH 4/5] virtio_net: add dedicated XDP transmit queues

2016-11-18 Thread John Fastabend
XDP requires using isolated transmit queues to avoid interference
with normal networking stack (BQL, NETDEV_TX_BUSY, etc). This patch
adds an XDP queue per CPU when an XDP program is loaded and does not
expose the queues to the OS via the normal API call to
netif_set_real_num_tx_queues(). This way the stack will never push
an skb to these queues.

However virtio/vhost/qemu implementation only allows for creating
TX/RX queue pairs at this time so creating only TX queues was not
possible. And because the associated RX queues are being created I
went ahead and exposed these to the stack and let the backend use
them. This creates more RX queues visible to the network stack than
TX queues which is worth mentioning but does not cause any issues as
far as I can tell.

Signed-off-by: John Fastabend 
---
 drivers/net/virtio_net.c |   32 +---
 1 file changed, 29 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 16c257d..631ee07 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -114,6 +114,9 @@ struct virtnet_info {
/* # of queue pairs currently used by the driver */
u16 curr_queue_pairs;
 
+   /* # of XDP queue pairs currently used by the driver */
+   u16 xdp_queue_pairs;
+
/* I like... big packets and I cannot lie! */
bool big_packets;
 
@@ -1525,7 +1528,8 @@ static int virtnet_xdp_set(struct net_device *dev, struct 
bpf_prog *prog)
 {
struct virtnet_info *vi = netdev_priv(dev);
struct bpf_prog *old_prog;
-   int i;
+   u16 xdp_qp = 0, curr_qp;
+   int err, i;
 
if ((dev->features & NETIF_F_LRO) && prog) {
netdev_warn(dev, "can't set XDP while LRO is on, disable LRO 
first\n");
@@ -1542,12 +1546,34 @@ static int virtnet_xdp_set(struct net_device *dev, 
struct bpf_prog *prog)
return -EINVAL;
}
 
+   curr_qp = vi->curr_queue_pairs - vi->xdp_queue_pairs;
+   if (prog)
+   xdp_qp = num_online_cpus();
+
+   /* XDP requires extra queues for XDP_TX */
+   if (curr_qp + xdp_qp > vi->max_queue_pairs) {
+   netdev_warn(dev, "request %i queues but max is %i\n",
+   curr_qp + xdp_qp, vi->max_queue_pairs);
+   return -ENOMEM;
+   }
+
+   err = virtnet_set_queues(vi, curr_qp + xdp_qp);
+   if (err) {
+   dev_warn(&dev->dev, "XDP Device queue allocation failure.\n");
+   return err;
+   }
+
if (prog) {
-   prog = bpf_prog_add(prog, vi->max_queue_pairs - 1);
-   if (IS_ERR(prog))
+   prog = bpf_prog_add(prog, vi->max_queue_pairs);
+   if (IS_ERR(prog)) {
+   virtnet_set_queues(vi, curr_qp);
return PTR_ERR(prog);
+   }
}
 
+   vi->xdp_queue_pairs = xdp_qp;
+   netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
+
for (i = 0; i < vi->max_queue_pairs; i++) {
old_prog = rcu_dereference(vi->rq[i].xdp_prog);
rcu_assign_pointer(vi->rq[i].xdp_prog, prog);



[PATCH 5/5] virtio_net: add XDP_TX support

2016-11-18 Thread John Fastabend
This adds support for the XDP_TX action to virtio_net. When an XDP
program is run and returns the XDP_TX action the virtio_net XDP
implementation will transmit the packet on a TX queue that aligns
with the current CPU that the XDP packet was processed on.

Before sending the packet the header is zeroed.  Also XDP is expected
to handle checksum correctly so no checksum offload  support is
provided.

Signed-off-by: John Fastabend 
---
 drivers/net/virtio_net.c |   57 --
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 631ee07..4b22938 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -330,12 +330,40 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
*vi,
return skb;
 }
 
+static void virtnet_xdp_xmit(struct virtnet_info *vi,
+unsigned int qnum, struct xdp_buff *xdp)
+{
+   struct send_queue *sq = &vi->sq[qnum];
+   struct virtio_net_hdr_mrg_rxbuf *hdr;
+   unsigned int num_sg, len;
+   void *xdp_sent;
+
+   /* Free up any pending old buffers before queueing new ones. */
+   while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+   struct page *page = virt_to_head_page(xdp_sent);
+
+   put_page(page);
+   }
+
+   /* Zero header and leave csum up to XDP layers */
+   hdr = xdp->data;
+   memset(hdr, 0, vi->hdr_len);
+   hdr->hdr.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+   hdr->hdr.flags = VIRTIO_NET_HDR_F_DATA_VALID;
+
+   num_sg = 1;
+   sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+   virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, xdp->data, GFP_ATOMIC);
+   virtqueue_kick(sq->vq);
+}
+
 static u32 do_xdp_prog(struct virtnet_info *vi,
   struct bpf_prog *xdp_prog,
   struct page *page, int offset, int len)
 {
int hdr_padded_len;
struct xdp_buff xdp;
+   unsigned int qp;
u32 act;
u8 *buf;
 
@@ -353,9 +381,15 @@ static u32 do_xdp_prog(struct virtnet_info *vi,
switch (act) {
case XDP_PASS:
return XDP_PASS;
+   case XDP_TX:
+   qp = vi->curr_queue_pairs -
+   vi->xdp_queue_pairs +
+   smp_processor_id();
+   xdp.data = buf + (vi->mergeable_rx_bufs ? 0 : 4);
+   virtnet_xdp_xmit(vi, qp, &xdp);
+   return XDP_TX;
default:
bpf_warn_invalid_xdp_action(act);
-   case XDP_TX:
case XDP_ABORTED:
case XDP_DROP:
return XDP_DROP;
@@ -386,8 +420,15 @@ static struct sk_buff *receive_big(struct net_device *dev,
if (xdp_prog) {
u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
 
-   if (act == XDP_DROP)
+   switch (act) {
+   case XDP_PASS:
+   break;
+   case XDP_TX:
+   goto xdp_xmit;
+   case XDP_DROP:
+   default:
goto err;
+   }
}
 
skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
@@ -399,6 +440,7 @@ static struct sk_buff *receive_big(struct net_device *dev,
 err:
dev->stats.rx_dropped++;
give_pages(rq, page);
+xdp_xmit:
return NULL;
 }
 
@@ -417,6 +459,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
struct sk_buff *head_skb, *curr_skb;
struct bpf_prog *xdp_prog;
 
+   head_skb = NULL;
xdp_prog = rcu_dereference(rq->xdp_prog);
if (xdp_prog) {
u32 act;
@@ -427,8 +470,15 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
}
 
act = do_xdp_prog(vi, xdp_prog, page, offset, len);
-   if (act == XDP_DROP)
+   switch (act) {
+   case XDP_PASS:
+   break;
+   case XDP_TX:
+   goto xdp_xmit;
+   case XDP_DROP:
+   default:
goto err_skb;
+   }
}
 
head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
@@ -502,6 +552,7 @@ static struct sk_buff *receive_mergeable(struct net_device 
*dev,
 err_buf:
dev->stats.rx_dropped++;
dev_kfree_skb(head_skb);
+xdp_xmit:
return NULL;
 }
 



Re: [PATCH] netns: fix get_net_ns_by_fd(int pid) typo

2016-11-18 Thread David Miller
From: Stefan Hajnoczi 
Date: Fri, 18 Nov 2016 09:41:46 +

> The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
> descriptor not a pid.  Fix the typo.
> 
> Signed-off-by: Stefan Hajnoczi 

Applied.


[PATCH 1/5] net: virtio dynamically disable/enable LRO

2016-11-18 Thread John Fastabend
This adds support for dynamically setting the LRO feature flag. The
message to control guest features in the backend uses the
CTRL_GUEST_OFFLOADS msg type.

Signed-off-by: John Fastabend 
---
 drivers/net/virtio_net.c |   43 +++
 1 file changed, 43 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 2cafd12..0758cae 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1419,6 +1419,41 @@ static void virtnet_init_settings(struct net_device *dev)
.set_settings = virtnet_set_settings,
 };
 
+static int virtnet_set_features(struct net_device *netdev,
+   netdev_features_t features)
+{
+   struct virtnet_info *vi = netdev_priv(netdev);
+   struct virtio_device *vdev = vi->vdev;
+   struct scatterlist sg;
+   u64 offloads = 0;
+
+   if (features & NETIF_F_LRO)
+   offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
+   (1 << VIRTIO_NET_F_GUEST_TSO6);
+
+   if (features & NETIF_F_RXCSUM)
+   offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
+
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
+   sg_init_one(&sg, &offloads, sizeof(uint64_t));
+   if (!virtnet_send_command(vi,
+ VIRTIO_NET_CTRL_GUEST_OFFLOADS,
+ VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
+ &sg)) {
+   dev_warn(&netdev->dev,
+"Failed to set guest offloads by virtnet command.\n");
+   return -EINVAL;
+   }
+   } else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
+  !virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
+   dev_warn(&netdev->dev,
+"No support for setting offloads pre version_1.\n");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
.ndo_open= virtnet_open,
.ndo_stop= virtnet_close,
@@ -1435,6 +1470,7 @@ static void virtnet_init_settings(struct net_device *dev)
 #ifdef CONFIG_NET_RX_BUSY_POLL
.ndo_busy_poll  = virtnet_busy_poll,
 #endif
+   .ndo_set_features   = virtnet_set_features,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -1810,6 +1846,12 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
dev->features |= NETIF_F_RXCSUM;
 
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
+   virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
+   dev->features |= NETIF_F_LRO;
+   dev->hw_features |= NETIF_F_LRO;
+   }
+
dev->vlan_features = dev->features;
 
/* MTU range: 68 - 65535 */
@@ -2049,6 +2091,7 @@ static int virtnet_restore(struct virtio_device *vdev)
VIRTIO_NET_F_CTRL_MAC_ADDR,
VIRTIO_F_ANY_LAYOUT,
VIRTIO_NET_F_MTU,
+   VIRTIO_NET_F_CTRL_GUEST_OFFLOADS,
 };
 
 static struct virtio_driver virtio_net_driver = {
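
The feature-to-offload translation in virtnet_set_features() above can be
sketched in userspace as below. This is only an illustration: the three
VIRTIO_NET_F_* bit positions follow the virtio-net specification, but the
DEMO_F_* flags are hypothetical stand-ins for NETIF_F_LRO and NETIF_F_RXCSUM
(the real values differ), and guest_offloads_from_features() is a name
introduced here. The sketch shifts 1ULL so the arithmetic is done in 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bit positions from the virtio-net specification. */
#define VIRTIO_NET_F_GUEST_CSUM 1
#define VIRTIO_NET_F_GUEST_TSO4 7
#define VIRTIO_NET_F_GUEST_TSO6 8

/* Hypothetical stand-ins for NETIF_F_RXCSUM and NETIF_F_LRO. */
#define DEMO_F_RXCSUM (1u << 0)
#define DEMO_F_LRO    (1u << 1)

/* Build the 64-bit offload mask that would be sent to the backend. */
uint64_t guest_offloads_from_features(uint32_t features)
{
	uint64_t offloads = 0;

	if (features & DEMO_F_LRO)
		offloads |= (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
			    (1ULL << VIRTIO_NET_F_GUEST_TSO6);

	if (features & DEMO_F_RXCSUM)
		offloads |= 1ULL << VIRTIO_NET_F_GUEST_CSUM;

	return offloads;
}

/* Usage example: turning LRO off leaves only the checksum offload. */
void guest_offloads_example(void)
{
	assert(guest_offloads_from_features(DEMO_F_RXCSUM) == 0x2);
}
```

Clearing LRO therefore asks the device to stop merging TCP segments in both
address families at once, which is what XDP needs before it can be attached.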



[PATCH 2/5] net: xdp: add invalid buffer warning

2016-11-18 Thread John Fastabend
This adds a warning for drivers to use when encountering an invalid
buffer for XDP. In normal cases this should not happen, but a standard
warning is useful for catching behavior from the emulation layer in
virtual/qemu setups that I may not have expected.

Signed-off-by: John Fastabend 
---
 include/linux/filter.h |1 +
 net/core/filter.c  |6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 1f09c52..0c79004 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -595,6 +595,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter 
__user *filter,
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
 void bpf_warn_invalid_xdp_action(u32 act);
+void bpf_warn_invalid_xdp_buffer(void);
 
 #ifdef CONFIG_BPF_JIT
 extern int bpf_jit_enable;
diff --git a/net/core/filter.c b/net/core/filter.c
index cd9e2ba..b8fb57c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2722,6 +2722,12 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+void bpf_warn_invalid_xdp_buffer(void)
+{
+   WARN_ONCE(1, "Illegal XDP buffer encountered, expect packet loss\n");
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_buffer);
+
 static u32 sk_filter_convert_ctx_access(enum bpf_access_type type, int dst_reg,
int src_reg, int ctx_off,
struct bpf_insn *insn_buf,
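
The warn-once behaviour that bpf_warn_invalid_xdp_buffer() gets from
WARN_ONCE() can be approximated in userspace as below. This is a rough
sketch, assuming a single-threaded caller: warn_invalid_xdp_buffer_once() is
a hypothetical name, it prints only the message (the kernel macro also emits
a backtrace), and the static flag is not thread-safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Set after the first report so later calls stay silent. */
static bool xdp_buffer_warned;

/* Returns 1 if the warning was printed by this call, 0 if suppressed. */
int warn_invalid_xdp_buffer_once(void)
{
	if (xdp_buffer_warned)
		return 0;

	xdp_buffer_warned = true;
	fprintf(stderr, "Illegal XDP buffer encountered, expect packet loss\n");
	return 1;
}
```

Reporting once keeps a misbehaving backend from flooding the log on every
received packet while still leaving a visible trace of the problem.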



Re: [Patch net v2] af_unix: conditionally use freezable blocking calls in read

2016-11-18 Thread David Miller
From: Cong Wang 
Date: Thu, 17 Nov 2016 15:55:26 -0800

> Commit 2b15af6f95 ("af_unix: use freezable blocking calls in read")
> converts schedule_timeout() to its freezable version, it was probably
> correct at that time, but later, commit 2b514574f7e8
> ("net: af_unix: implement splice for stream af_unix sockets") breaks
> the strong requirement for a freezable sleep, according to
> commit 0f9548ca1091:
> 
> We shouldn't try_to_freeze if locks are held.  Holding a lock can cause a
> deadlock if the lock is later acquired in the suspend or hibernate path
> (e.g.  by dpm).  Holding a lock can also cause a deadlock in the case of
> cgroup_freezer if a lock is held inside a frozen cgroup that is later
> acquired by a process outside that group.
> 
> The pipe_lock is still held at that point.
> 
> So use freezable version only for the recvmsg call path, avoid impact for
> Android.
> 
> Fixes: 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix 
> sockets")
> Reported-by: Dmitry Vyukov 
> Cc: Tejun Heo 
> Cc: Colin Cross 
> Cc: Rafael J. Wysocki 
> Cc: Hannes Frederic Sowa 
> Signed-off-by: Cong Wang 

Applied and queued up for -stable, thanks.


[PATCH 3/5] virtio_net: Add XDP support

2016-11-18 Thread John Fastabend
From: Shrijeet Mukherjee 

This adds XDP support to virtio_net. Some requirements must be
met for XDP to be enabled depending on the mode. First it will
only be supported with LRO disabled so that data is not pushed
across multiple buffers. The MTU must be less than a page size
to avoid having to handle XDP across multiple pages.

If mergeable receive is enabled this first series only supports
the case where header and data are in the same buf which we can
check when a packet is received by looking at num_buf. If the
num_buf is greater than 1 and an XDP program is loaded the packet
is dropped and a warning is thrown. When any_header_sg is set this
does not happen and both header and data are put in a single buffer
as expected so we check this when XDP programs are loaded. Note I
have only tested this with Linux vhost backend.

If big packets mode is enabled and MTU/LRO conditions above are
met then XDP is allowed.

A follow-on patch can be generated to solve the mergeable receive
case with num_bufs equal to 2. Buffers greater than two may not
be handled as easily.

Suggested-by: Shrijeet Mukherjee 
Signed-off-by: John Fastabend 
---
 drivers/net/virtio_net.c |  144 +-
 1 file changed, 140 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0758cae..16c257d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -81,6 +82,8 @@ struct receive_queue {
 
struct napi_struct napi;
 
+   struct bpf_prog *xdp_prog;
+
/* Chain pages by the private ptr. */
struct page *pages;
 
@@ -324,6 +327,38 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
return skb;
 }
 
+static u32 do_xdp_prog(struct virtnet_info *vi,
+  struct bpf_prog *xdp_prog,
+  struct page *page, int offset, int len)
+{
+   int hdr_padded_len;
+   struct xdp_buff xdp;
+   u32 act;
+   u8 *buf;
+
+   buf = page_address(page) + offset;
+
+   if (vi->mergeable_rx_bufs)
+   hdr_padded_len = sizeof(struct virtio_net_hdr_mrg_rxbuf);
+   else
+   hdr_padded_len = sizeof(struct padded_vnet_hdr);
+
+   xdp.data = buf + hdr_padded_len;
+   xdp.data_end = xdp.data + (len - vi->hdr_len);
+
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+   switch (act) {
+   case XDP_PASS:
+   return XDP_PASS;
+   default:
+   bpf_warn_invalid_xdp_action(act);
+   case XDP_TX:
+   case XDP_ABORTED:
+   case XDP_DROP:
+   return XDP_DROP;
+   }
+}
+
 static struct sk_buff *receive_small(struct virtnet_info *vi, void *buf, 
unsigned int len)
 {
struct sk_buff * skb = buf;
@@ -340,9 +375,19 @@ static struct sk_buff *receive_big(struct net_device *dev,
   void *buf,
   unsigned int len)
 {
+   struct bpf_prog *xdp_prog;
struct page *page = buf;
-   struct sk_buff *skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
+   struct sk_buff *skb;
+
+   xdp_prog = rcu_dereference(rq->xdp_prog);
+   if (xdp_prog) {
+   u32 act = do_xdp_prog(vi, xdp_prog, page, 0, len);
+
+   if (act == XDP_DROP)
+   goto err;
+   }
 
+   skb = page_to_skb(vi, rq, page, 0, len, PAGE_SIZE);
if (unlikely(!skb))
goto err;
 
@@ -366,10 +411,25 @@ static struct sk_buff *receive_mergeable(struct 
net_device *dev,
struct page *page = virt_to_head_page(buf);
int offset = buf - page_address(page);
unsigned int truesize = max(len, mergeable_ctx_to_buf_truesize(ctx));
+   struct sk_buff *head_skb, *curr_skb;
+   struct bpf_prog *xdp_prog;
 
-   struct sk_buff *head_skb = page_to_skb(vi, rq, page, offset, len,
-  truesize);
-   struct sk_buff *curr_skb = head_skb;
+   xdp_prog = rcu_dereference(rq->xdp_prog);
+   if (xdp_prog) {
+   u32 act;
+
+   if (num_buf > 1) {
+   bpf_warn_invalid_xdp_buffer();
+   goto err_skb;
+   }
+
+   act = do_xdp_prog(vi, xdp_prog, page, offset, len);
+   if (act == XDP_DROP)
+   goto err_skb;
+   }
+
+   head_skb = page_to_skb(vi, rq, page, offset, len, truesize);
+   curr_skb = head_skb;
 
if (unlikely(!curr_skb))
goto err_skb;
@@ -1328,6 +1388,13 @@ static int virtnet_set_channels(struct net_device *dev,
if (queue_pairs > vi->max_queue_pairs || queue_pairs == 0)
return -EINVAL;
 
+   /* For now we don't support modifying channels while XDP is loaded
+* also when XDP 

[PATCH 0/5] XDP for virtio_net

2016-11-18 Thread John Fastabend
This implements XDP for virtio_net in the mergeable buffers and big_packet
modes. I tested this with vhost_net running on qemu and did not see
any issues.

There are some restrictions for XDP to be enabled (see patch 3 for
more details).

  1. LRO must be off
  2. MTU must be less than PAGE_SIZE
  3. queues must be available to dedicate to XDP
  4. num_bufs received in mergeable buffers must be 1
  5. big_packet mode must have all data on single page
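
The restrictions listed above can be read as two kinds of check, sketched
below in plain C. This is a hedged illustration rather than the driver
logic: struct xdp_precheck, xdp_can_attach() and xdp_buffer_ok() are
hypothetical names, and requiring one spare queue per CPU is an assumption
taken from the discussion of patch 4/5, not from this cover letter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of device state at program-attach time. */
struct xdp_precheck {
	bool lro_enabled;
	unsigned int mtu;
	unsigned int page_size;
	unsigned int spare_queues;	/* queues free to dedicate to XDP */
	unsigned int cpus;		/* assumed: one XDP queue per CPU */
};

/* Attach-time checks: restrictions 1 to 3 in the list. */
bool xdp_can_attach(const struct xdp_precheck *c)
{
	if (c->lro_enabled)
		return false;
	if (c->mtu >= c->page_size)
		return false;
	if (c->spare_queues < c->cpus)
		return false;
	return true;
}

/* Per-packet check for mergeable buffers: restriction 4. */
bool xdp_buffer_ok(unsigned int num_buf)
{
	return num_buf == 1;
}

/* Usage example: a 1500-byte MTU on 4 KiB pages with enough queues. */
void xdp_precheck_example(void)
{
	struct xdp_precheck c = { false, 1500, 4096, 4, 4 };

	assert(xdp_can_attach(&c));
}
```

The first three conditions can be rejected when the program is loaded, while
the buffer-count condition can only be checked as each packet arrives.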

Please review; any comments/feedback welcome, as always.

Thanks,
John
---

John Fastabend (4):
  net: virtio dynamically disable/enable LRO
  net: xdp: add invalid buffer warning
  virtio_net: add dedicated XDP transmit queues
  virtio_net: add XDP_TX support

Shrijeet Mukherjee (1):
  virtio_net: Add XDP support


 drivers/net/virtio_net.c |  264 +-
 include/linux/filter.h   |1 
 net/core/filter.c|6 +
 3 files changed, 267 insertions(+), 4 deletions(-)

--
Signature


Re: [PATCH v2 net-next] lan78xx: relocate mdix setting to phy driver

2016-11-18 Thread David Miller
From: 
Date: Thu, 17 Nov 2016 22:10:02 +

> From: Woojung Huh 
> 
> Relocate mdix code to phy driver to be called at config_init().
> 
> Signed-off-by: Woojung Huh 

Applied, thank you.


Re: [PATCH -next] tcp: make undo_cwnd mandatory for congestion modules

2016-11-18 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Thu, 17 Nov 2016 13:56:51 +0100
> 
> > The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
> > which un-does reno halving behaviour.
> > 
> > It seems more appropriate to let congctl algorithms pair .ssthresh
> > and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
> > up for all congestion algorithms that used to rely on the fallback.
> > 
> > highspeed, illinois, scalable, veno and yeah use 'reno undo' while their
> > .ssthresh implementation doesn't halve the slowstart threshold, this
> > might point to similar issue as the one fixed for dctcp in
> > ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss").
> > 
> > Cc: Eric Dumazet 
> > Cc: Yuchung Cheng 
> > Cc: Neal Cardwell 
> > Signed-off-by: Florian Westphal 
> 
> If you really suspect that highspeed et al. need to implement their own
> undo_cwnd instead of using the default reno fallback, I would really
> rather that this gets either fixed or explicitly marked as likely wrong
> (in an "XXX" comment or similar).

Ok, fair enough.  I am not familiar with these algorithms, I will check
what they're doing in more detail and if absolutely needed resubmit this
patch with XXX/FIXME/TODO comments added.

> Otherwise nobody is going to remember this down the road.

Agreed.
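
The pairing problem discussed in this thread can be shown with a little
arithmetic. The sketch below is an assumption-laden illustration, not kernel
code: reno_undo_cwnd() models the fallback as "the larger of the current
cwnd and twice ssthresh", and gentle_ssthresh() is a hypothetical backoff
that reduces cwnd by one eighth instead of halving it.

```c
#include <assert.h>
#include <stdint.h>

static uint32_t max_u32(uint32_t a, uint32_t b)
{
	return a > b ? a : b;
}

/* Reno-style ssthresh: halve cwnd, never below 2 segments. */
uint32_t reno_ssthresh(uint32_t cwnd)
{
	return max_u32(cwnd >> 1, 2);
}

/* Reno-style undo: double ssthresh, but never shrink cwnd. */
uint32_t reno_undo_cwnd(uint32_t cwnd, uint32_t ssthresh)
{
	return max_u32(cwnd, ssthresh << 1);
}

/* Hypothetical gentler backoff: reduce cwnd by 1/8 on loss. */
uint32_t gentle_ssthresh(uint32_t cwnd)
{
	return max_u32(cwnd - (cwnd >> 3), 2);
}

/* Usage example: halving followed by the reno undo restores cwnd. */
void undo_cwnd_example(void)
{
	uint32_t ssthresh = reno_ssthresh(100);

	assert(reno_undo_cwnd(ssthresh, ssthresh) == 100);
}
```

With a starting cwnd of 100, the reno pair returns to 100 after an undo,
while pairing the gentler backoff with the reno undo yields 176, which is
the kind of bogus growth the dctcp fix mentioned above was addressing.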


Re: [PATCH net-next v4 0/5] net: Enable COMPILE_TEST for Marvell & Freescale drivers

2016-11-18 Thread David Miller
From: Florian Fainelli 
Date: Thu, 17 Nov 2016 11:19:09 -0800

> This patch series allows building the Freescale and Marvell Ethernet
> network drivers with COMPILE_TEST.

Thanks for doing this work, this kind of thing helps me a lot.

Series applied, thanks.


Re: [PATCH net v2 0/7] net: cpsw: fix leaks and probe deferral

2016-11-18 Thread David Miller
From: Johan Hovold 
Date: Thu, 17 Nov 2016 17:39:57 +0100

> This series fixes a number of leaks and issues in the cpsw probe-error
> and driver-unbind paths, some of which specifically prevented deferred
> probing.
 ...
> v2
>  - Keep platform device runtime-resumed throughout probe instead of
>resuming in the probe error path as suggested by Grygorii (patch
>1/7).
> 
>  - Runtime-resume platform device before registering any children in
>order to make sure it is synchronously suspended after deregistering
>children in the error path (patch 3/7).

Series applied, thanks.


Re: [PATCH net v2 7/7] net: ethernet: ti: cpsw: fix fixed-link phy probe deferral

2016-11-18 Thread David Miller
From: Johan Hovold 
Date: Thu, 17 Nov 2016 18:19:20 +0100

> On Thu, Nov 17, 2016 at 12:04:16PM -0500, David Miller wrote:
>> From: Johan Hovold 
>> Date: Thu, 17 Nov 2016 17:40:04 +0100
>> 
>> > Make sure to propagate errors from of_phy_register_fixed_link() which
>> > can fail with -EPROBE_DEFER.
>> > 
>> > Fixes: 1f71e8c96fc6 ("drivers: net: cpsw: Add support for fixed-link
>> > PHY")
>> > Signed-off-by: Johan Hovold 
>> 
>> Johan, when you update a patch within a series you must post the
>> entire series freshly to the lists, cover posting and all.
> 
> I'm quite sure that is exactly what I did. Did you only get this last
> patch out of the seven?

I ended up getting it delayed, thanks.


Re: [PATCH net 0/3] mlx4 fix for shutdown flow

2016-11-18 Thread David Miller
From: Tariq Toukan 
Date: Thu, 17 Nov 2016 17:40:48 +0200

> This patchset fixes an invalid reference to mdev in mlx4 shutdown flow.
> 
> In patch 1, we make sure netif_device_detach() is called from shutdown flow 
> only,
> since we want to keep it present during a simple configuration change.
> 
> In patches 2 and 3, we add checks that were missing in:
> * dev_get_phys_port_id
> * dev_get_phys_port_name
> We check the presence of the network device before calling the driver's
> callbacks. This already exists for all other ndo's.
> 
> Series generated against net commit:
> e5f6f564fd19 bnxt: add a missing rcu synchronization

I don't like where this is going nor the precedent it is setting.

If you are taking the device into a state where it cannot be safely
accessed by ndo operations, then you _MUST_ do whatever is necessary
to make sure the device is unregistered and cannot be found in the
various global lists and tables of network devices.

This is mandatory.

And this is how we must fix these kinds of problems instead of
peppering device presence tests all over the place.  That will be
error prone and in the long term a huge maintenance burden.

I'm not applying this series, sorry.  You have to fix this properly.



[PATCH] liquidio CN23XX: check if PENDING bit is clear using logical and

2016-11-18 Thread Colin King
From: Colin Ian King 

the mbox state should be bitwise anded rather than logically anded
with OCTEON_MBOX_STATE_RESPONSE_PENDING. Fix this by using the
correct & operator instead of &&.

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c 
b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
index 5309384..73696b42 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_mailbox.c
@@ -301,7 +301,7 @@ int octeon_mbox_process_message(struct octeon_mbox *mbox)
   sizeof(struct octeon_mbox_cmd));
if (!mbox_cmd.msg.s.resp_needed) {
mbox->state &= ~OCTEON_MBOX_STATE_REQUEST_RECEIVED;
-   if (!(mbox->state &&
+   if (!(mbox->state &
  OCTEON_MBOX_STATE_RESPONSE_PENDING))
mbox->state = OCTEON_MBOX_STATE_IDLE;
writeq(OCTEON_PFVFSIG, mbox->mbox_read_reg);
-- 
2.10.2
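
The difference between the two operators in this fix can be demonstrated
with a small standalone program. The flag values below are hypothetical (the
real OCTEON_MBOX_STATE_* constants may differ) and both helper names are
introduced here for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state flags; the driver's actual values may differ. */
#define STATE_IDLE             (1u << 0)
#define STATE_RESPONSE_PENDING (1u << 4)

/* Buggy form: true whenever state is non-zero, whatever bits are set. */
bool pending_logical(uint32_t state)
{
	return state && STATE_RESPONSE_PENDING;
}

/* Fixed form: true only when the PENDING bit itself is set. */
bool pending_bitwise(uint32_t state)
{
	return (state & STATE_RESPONSE_PENDING) != 0;
}

/* Usage example: an idle mailbox is wrongly reported as pending. */
void pending_check_example(void)
{
	assert(pending_logical(STATE_IDLE));
	assert(!pending_bitwise(STATE_IDLE));
}
```

With the logical operator the "is PENDING clear" test only succeeds when the
whole state word is zero, so the transition to idle could be skipped.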



Re: [PATCH -next] tcp: make undo_cwnd mandatory for congestion modules

2016-11-18 Thread David Miller
From: Florian Westphal 
Date: Thu, 17 Nov 2016 13:56:51 +0100

> The undo_cwnd fallback in the stack doubles cwnd based on ssthresh,
> which un-does reno halving behaviour.
> 
> It seems more appropriate to let congctl algorithms pair .ssthresh
> and .undo_cwnd properly. Add a 'tcp_reno_undo_cwnd' function and wire it
> up for all congestion algorithms that used to rely on the fallback.
> 
> highspeed, illinois, scalable, veno and yeah use 'reno undo' while their
> .ssthresh implementation doesn't halve the slowstart threshold, this
> might point to similar issue as the one fixed for dctcp in
> ce6dd23329b1e ("dctcp: avoid bogus doubling of cwnd after loss").
> 
> Cc: Eric Dumazet 
> Cc: Yuchung Cheng 
> Cc: Neal Cardwell 
> Signed-off-by: Florian Westphal 

If you really suspect that highspeed et al. need to implement their own
undo_cwnd instead of using the default reno fallback, I would really
rather that this gets either fixed or explicitly marked as likely wrong
(in an "XXX" comment or similar).

Otherwise nobody is going to remember this down the road.


Re: [PATCH] net: sky2: Fix shutdown crash

2016-11-18 Thread David Miller
From: Jeremy Linton 
Date: Thu, 17 Nov 2016 09:14:25 -0600

> The sky2 frequently crashes during machine shutdown with:
> 
> sky2_get_stats+0x60/0x3d8 [sky2]
> dev_get_stats+0x68/0xd8
> rtnl_fill_stats+0x54/0x140
> rtnl_fill_ifinfo+0x46c/0xc68
> rtmsg_ifinfo_build_skb+0x7c/0xf0
> rtmsg_ifinfo.part.22+0x3c/0x70
> rtmsg_ifinfo+0x50/0x5c
> netdev_state_change+0x4c/0x58
> linkwatch_do_dev+0x50/0x88
> __linkwatch_run_queue+0x104/0x1a4
> linkwatch_event+0x30/0x3c
> process_one_work+0x140/0x3e0
> worker_thread+0x60/0x44c
> kthread+0xdc/0xf0
> ret_from_fork+0x10/0x50
> 
> This is caused by the sky2 being called after it has been shutdown.
> A previous thread about this can be found here:
> 
> https://lkml.org/lkml/2016/4/12/410
> 
> An alternative fix is to assure that IFF_UP gets cleared by
> calling dev_close() during shutdown. This is similar to what the
> bnx2/tg3/xgene and maybe others are doing to assure that the driver
> isn't being called following _shutdown().
> 
> Signed-off-by: Jeremy Linton 

Applied and queued up for -stable, thanks.


Re: [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()

2016-11-18 Thread Bjorn Andersson
On Wed 16 Nov 10:49 PST 2016, Kalle Valo wrote:

> Bjorn Andersson  wrote:
> > The correct include file for getting errno constants and ERR_PTR() is
> > linux/err.h, rather than linux/errno.h, so fix the include.
> > 
> > Fixes: e8b123e60084 ("soc: qcom: smem_state: Add stubs for disabled 
> > smem_state")
> > Acked-by: Andy Gross 
> > Signed-off-by: Bjorn Andersson 
> 
> For some reason this fails to compile now. Can you take a look, please?
> 
> ERROR: "qcom_wcnss_open_channel" 
> [drivers/net/wireless/ath/wcn36xx/wcn36xx.ko] undefined!
> make[1]: *** [__modpost] Error 1
> make: *** [modules] Error 2
> 
> 5 patches set to Changes Requested.
> 
> 9429045 [v5,1/5] soc: qcom: smem_state: Fix include for ERR_PTR()
> 9429047 [v5,2/5] wcn36xx: Transition driver to SMD client

This patch was updated with the necessary depends in Kconfig to catch
this exact issue and when I pull in your .config (which has QCOM_SMD=n,
QCOM_WCNSS_CTRL=n and WCN36XX=y) I can build this just fine.

I've tested the various combinations and it seems to work fine. Do you
have any other patches in your tree? Any stale objects?

Would you mind retesting this, before I invest more time in trying to
reproduce the issue you're seeing?

Regards,
Bjorn


Re: Synopsys Ethernet QoS Driver

2016-11-18 Thread Eric Dumazet
On Fri, 2016-11-18 at 16:40 +, Joao Pinto wrote:

> That helps a lot, thank you!
> Let's start working then :)

Please read this very useful document first, so that you can avoid
common mistakes ;)


https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt

Thanks




Re: [PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-11-18 Thread Pablo Neira Ayuso
On Fri, Nov 18, 2016 at 09:17:18AM -0800, Alexei Starovoitov wrote:
> On Fri, Nov 18, 2016 at 01:37:32PM +0100, Pablo Neira Ayuso wrote:
> > On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
> > [...]
> > > @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, 
> > > struct sk_buff *skb)
> > >   skb->dev = dev;
> > >   skb->protocol = htons(ETH_P_IP);
> > >  
> > > + ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > + if (ret) {
> > > + kfree_skb(skb);
> > > + return ret;
> > > + }
> > > +
> > >   /*
> > >*  Multicasts are looped back for other local users
> > >*/
> > > @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, 
> > > struct sk_buff *skb)
> > >  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> > >  {
> > >   struct net_device *dev = skb_dst(skb)->dev;
> > > + int ret;
> > >  
> > >   IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
> > >  
> > >   skb->dev = dev;
> > >   skb->protocol = htons(ETH_P_IP);
> > >  
> > > + ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > > + if (ret) {
> > > + kfree_skb(skb);
> > > + return ret;
> > > + }
> > > +
> > >   return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
> > >   net, sk, skb, NULL, dev,
> > >   ip_finish_output,
> > 
> > Please, place this after the netfilter hook.
> > 
> > > Since this new hook may mangle output packets, any mangling
> > > potentially interferes with and breaks conntrack.
> 
> actually this hook cannot mangle the packets, so no conntrack
> concerns.  Also this was brought up by Lorenzo earlier and consensus
> was that it's cleaner to leave it in this order.

Not yet probably, but this could be used to implement snat at some
point, you have potentially the infrastructure to do so in place
already.

> My reply:
> http://www.spinics.net/lists/cgroups/msg16675.html
> and Daniel's:
> http://www.spinics.net/lists/cgroups/msg16677.html
> and the rest of that thread.

Please place this afterwards since I don't want to update Netfilter
documentation to indicate that there is a new spot to debug before
POSTROUTING that may drop packets. People are used to debugging things
in a certain way, if packets are dropped after POSTROUTING, then
netfilter tracing will indicate the packet has successfully left our
framework and people will notice that packets are dropped somewhere
else, so they have a clue that it is probably this new layer.

Actually I remember you mentioned in a previous email that this hook
can be placed anywhere, and that they don't really need a fixed
location, if so, then it should not be much of a problem to change
this.

I can live with this new scenario where the kernel becomes a place
where everyone can push bpf blobs everywhere and your "code decides"
submission policy if others do as well, even if I frankly don't like
it. No problem. But please don't use the word "consensus" to justify
this, because this was not exactly what was shown during Netconf.

So just send a v9 with this change I'm requesting and you have my word
I will not interfere anymore with this submission.

Thank you.


Re: [PATCH] netns: fix get_net_ns_by_fd(int pid) typo

2016-11-18 Thread Rami Rosen
On 18 November 2016 at 11:41, Stefan Hajnoczi  wrote:
> The argument to get_net_ns_by_fd() is a /proc/$PID/ns/net file
> descriptor not a pid.  Fix the typo.
>

Acked-by: Rami Rosen 


Re: [PATCH net-next] amd-xgbe: Update connection validation for backplane mode

2016-11-18 Thread David Miller
From: Tom Lendacky 
Date: Thu, 17 Nov 2016 08:43:37 -0600

> Update the connection type enumeration for backplane mode and return
> an error when there is a mismatch between the mode and the connection
> type.
> 
> Signed-off-by: Tom Lendacky 

Applied, thanks.


Re: [PATCH v8 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs

2016-11-18 Thread Alexei Starovoitov
On Fri, Nov 18, 2016 at 01:37:32PM +0100, Pablo Neira Ayuso wrote:
> On Thu, Nov 17, 2016 at 07:27:08PM +0100, Daniel Mack wrote:
> [...]
> > @@ -312,6 +314,12 @@ int ip_mc_output(struct net *net, struct sock *sk, 
> > struct sk_buff *skb)
> > skb->dev = dev;
> > skb->protocol = htons(ETH_P_IP);
> >  
> > +   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > +   if (ret) {
> > +   kfree_skb(skb);
> > +   return ret;
> > +   }
> > +
> > /*
> >  *  Multicasts are looped back for other local users
> >  */
> > @@ -364,12 +372,19 @@ int ip_mc_output(struct net *net, struct sock *sk, 
> > struct sk_buff *skb)
> >  int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
> >  {
> > struct net_device *dev = skb_dst(skb)->dev;
> > +   int ret;
> >  
> > IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
> >  
> > skb->dev = dev;
> > skb->protocol = htons(ETH_P_IP);
> >  
> > +   ret = BPF_CGROUP_RUN_PROG_INET_EGRESS(sk, skb);
> > +   if (ret) {
> > +   kfree_skb(skb);
> > +   return ret;
> > +   }
> > +
> > return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
> > net, sk, skb, NULL, dev,
> > ip_finish_output,
> 
> Please, place this after the netfilter hook.
> 
> Since this new hook may mangle output packets, any mangling
> potentially interferes with and breaks conntrack.

actually this hook cannot mangle the packets, so no conntrack concerns.
Also this was brought up by Lorenzo earlier
and consensus was that it's cleaner to leave it in this order.
My reply:
http://www.spinics.net/lists/cgroups/msg16675.html
and Daniel's:
http://www.spinics.net/lists/cgroups/msg16677.html
and the rest of that thread.

Thanks



Re: [PATCH net-next v3 0/5] Adding PHY-Tunables and downshift support

2016-11-18 Thread David Miller
From: "Allan W. Nielsen" 
Date: Thu, 17 Nov 2016 13:07:19 +0100

> This series add support for PHY tunables, and uses this facility to
> configure downshifting. The downshifting mechanism is implemented for MSCC
> phys.

Series applied, thanks.


Re: Netperf UDP issue with connected sockets

2016-11-18 Thread Jesper Dangaard Brouer
On Thu, 17 Nov 2016 13:44:02 -0800
Eric Dumazet  wrote:

> On Thu, 2016-11-17 at 22:19 +0100, Jesper Dangaard Brouer wrote:
> 
> > 
> > Maybe you can share your udp flood "udpsnd" program source?  
> 
> Very ugly. This is based on what I wrote when tracking the UDP v6
> checksum bug (4f2e4ad56a65f3b7d64c258e373cb71e8d2499f4 net: mangle zero
> checksum in skb_checksum_help()), because netperf sends the same message
> over and over...

Thanks a lot, hope you don't mind; I added the code to my github repo:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_snd.c

So I identified the difference, and reason behind the route lookups.
Your program is using send() and I was using sendmsg().  Given
udp_flood is designed to test different calls, I simply added --send as
a new possibility.
 https://github.com/netoptimizer/network-testing/commit/16166c2cd1fa8

If I use --write instead, then I can also avoid the fib_table_lookup
and __ip_route_output_key_hash calls.


> Use -d 2   to remove the ip_idents_reserve() overhead.

#define IP_PMTUDISC_DO  2 /* Always DF  */

Added a --pmtu option to my udp_flood program
 https://github.com/netoptimizer/network-testing/commit/23a78caf4bb5b
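
The "-d 2" setting discussed above corresponds to IP_PMTUDISC_DO on the
IP_MTU_DISCOVER socket option. A minimal Linux-only sketch of setting and
reading it back is shown below; udp_socket_with_pmtu() and
udp_socket_pmtu_mode() are hypothetical helper names, not functions from
udp_flood or from the program quoted in this mail.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a UDP/IPv4 socket with the given IP_MTU_DISCOVER mode, or -1. */
int udp_socket_with_pmtu(int mode)
{
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
		       &mode, sizeof(mode)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Read the current IP_MTU_DISCOVER mode back, or -1 on error. */
int udp_socket_pmtu_mode(int fd)
{
	int mode = -1;
	socklen_t len = sizeof(mode);

	if (getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, &len) < 0)
		return -1;
	return mode;
}

/* Usage example: request "always DF" and confirm the kernel stored it. */
void udp_pmtu_example(void)
{
	int fd = udp_socket_with_pmtu(IP_PMTUDISC_DO);

	assert(fd >= 0);
	assert(udp_socket_pmtu_mode(fd) == IP_PMTUDISC_DO);
	close(fd);
}
```

Reading the option back is a cheap way to confirm the mode took effect
before starting a benchmark run.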

 
> #define _GNU_SOURCE
> 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> #include 
> 
> char buffer[1400];
> 
> int main(int argc, char** argv) {
>   int fd, i;
>   struct sockaddr_in6 addr;
>   char *host = "2002:af6:798::1";
>   int family = AF_INET6;
>   int discover = -1;
> 
>   while ((i = getopt(argc, argv, "4H:d:")) != -1) {
> switch (i) {
> case 'H': host = optarg; break;
> case '4': family = AF_INET; break;
> case 'd': discover = atoi(optarg); break;
> }
>   }
>   fd = socket(family, SOCK_DGRAM, 0);
>   if (fd < 0)
> error(1, errno, "failed to create socket");
>   if (discover != -1)
> setsockopt(fd, SOL_IP, IP_MTU_DISCOVER,
>&discover, sizeof(discover));
> 
>   memset(&addr, 0, sizeof(addr));
>   if (family == AF_INET6) {
> addr.sin6_family = AF_INET6;
> addr.sin6_port = htons(9);
>   inet_pton(family, host, (void *)&addr.sin6_addr.s6_addr);
>   } else {
> struct sockaddr_in *in = (struct sockaddr_in *)&addr;
> in->sin_family = family;
> in->sin_port = htons(9);
>   inet_pton(family, host, &in->sin_addr);
>   }
>   connect(fd, (struct sockaddr *)&addr,
>   (family == AF_INET6) ? sizeof(addr) :
>  sizeof(struct sockaddr_in));
>   memset(buffer, 1, 1400);
>   for (i = 0; i < 65536; i++) {
> memcpy(buffer, &i, sizeof(i));
> send(fd, buffer, 100 + rand() % 200, 0);

Using send() avoids the fib_table_lookup, on a connected UDP socket.

>   }
>   return 0;
> }


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net-next V2 0/8] Mellanox 100G mlx5 update 2016-11-15

2016-11-18 Thread David Miller
From: Saeed Mahameed 
Date: Thu, 17 Nov 2016 13:45:54 +0200

> This series contains four humble mlx5 features.
> 
> From Gal, 
>  - Add the support for PCIe statistics and expose them in ethtool
> 
> From Huy,
>  - Add the support for port module events reporting and statistics
>  - Add the support for driver version setting into FW (for display purposes 
> only)
> 
> From Mohamad,
>  - Extended the command interface cache flexibility
> 
> This series was generated against commit
> 6a02f5eb6a8a ("Merge branch 'mlxsw-i2c")
> 
> V2:
>  - Changed plain "unsigned" to "unsigned int"

Series applied, thanks.


Re: [PATCH net-next 0/5] sfc: Firmware-Assisted TSO version 2

2016-11-18 Thread David Miller
From: Edward Cree 
Date: Thu, 17 Nov 2016 10:49:42 +

> The firmware on 8000 series SFC NICs supports a new TSO API ("FATSOv2"), and
>  7000 series NICs will also support this in an imminent release.  This series
>  adds driver support for this TSO implementation.
> The series also removes SWTSO, as it's now equivalent to GSO.  This does not
>  actually remove very much code, because SWTSO was grotesquely intertwingled
>  with FATSOv1, which will also be removed once 7000 series supports FATSOv2.

Series applied, thanks.


Re: Synopsys Ethernet QoS Driver

2016-11-18 Thread Joao Pinto
On 18-11-2016 16:35, Florian Fainelli wrote:
> 
> 
> On 11/18/2016 08:31 AM, Joao Pinto wrote:
>>  Hi Florian,
>>
>> On 18-11-2016 14:53, Florian Fainelli wrote:
>>> On November 18, 2016 4:28:30 AM PST, Joao Pinto  
>>> wrote:


snip (...)

>>>> I would also gladly be available to be its maintainer if you agree with
>>>> it.
>>>
>>> Since you have both the hardware and a clear todo list for this driver, 
>>> start submitting patches, get them included in David's tree and over time 
>>> chances are that you will become the maintainer, either explicitly by 
>>> adding an entry in the MAINTAINERS file or just by consistently 
>>> contributing to this area.
>>
>> Thanks for the feedback.
>>
>> So I found 2 suitable git trees:
>>  a) kernel/git/davem/net.git
>>  b) kernel/git/davem/net-next.git
>>
>> We should submit to net.git correct? The net-next.git is a tree with selected
>> patches for upstream only?
> 
> net-next.git is the git tree where new features/enhancements can be
> submitted, while net.git is for bug fixes. Unless you absolutely need
> to, it is common practice to avoid having changes in net-next.git depend
> on net.git and vice versa.
> 
> Hope this helps.
> 

That helps a lot, thank you!
Let's start working then :)

Thanks,
Joao


Re: Synopsys Ethernet QoS Driver

2016-11-18 Thread Florian Fainelli


On 11/18/2016 08:31 AM, Joao Pinto wrote:
>  Hi Florian,
> 
> On 18-11-2016 14:53, Florian Fainelli wrote:
>> On November 18, 2016 4:28:30 AM PST, Joao Pinto  
>> wrote:
>>>
>>> Dear all,
>>>
>>> My name is Joao Pinto and I work at Synopsys.
>>> I am a kernel developer with special focus in mainline collaboration,
>>> both Linux
>>> and Buildroot. I was recently named one of the maintainers of the PCIe
>>> Designware core driver and I was the author of the Designware UFS
>>> driver stack.
>>>
>>> I am sending you this e-mail because you were the suggested contacts
>>> from the
>>> get_maintainers script concerning Ethernet drivers :).
>>>
>>> Currently I have the task to work on the mainline Ethernet QoS driver
>>> in which
>>> you are the author. The work would consist of the following:
>>>
>>> a) Separate the current driver in a Core driver (common ops) + platform
>>> glue
>>> driver + pci glue driver
>>> b) Add features that are currently only available internally
>>> c) Add specific phy support using the PHY framework
>>>
>>> I would also gladly be available to be its maintainer if you agree with
>>> it.
>>
>> Since you have both the hardware and a clear todo list for this driver, 
>> start submitting patches, get them included in David's tree and over time 
>> chances are that you will become the maintainer, either explicitly by adding 
>> an entry in the MAINTAINERS file or just by consistently contributing to 
>> this area.
> 
> Thanks for the feedback.
> 
> So I found 2 suitable git trees:
>  a) kernel/git/davem/net.git
>  b) kernel/git/davem/net-next.git
> 
> We should submit to net.git correct? The net-next.git is a tree with selected
> patches for upstream only?

net-next.git is the git tree where new features/enhancements can be
submitted, while net.git is for bug fixes. Unless you absolutely need
to, it is common practice to avoid having changes in net-next.git depend
on net.git and vice versa.

Hope this helps.
-- 
Florian


Re: [net-next] af_packet: Use virtio_net_hdr_to_skb().

2016-11-18 Thread David Miller
From: Jarno Rajahalme 
Date: Wed, 16 Nov 2016 18:06:42 -0800

> Use the common virtio_net_hdr_to_skb() instead of open coding it.
> Other call sites were changed by commit fd2a0437dc, but this one was
> missed, maybe because it is split in two parts of the source code.
> 
> Also fix other call sites to be more uniform.
> 
> Fixes: fd2a0437dc ("virtio_net: introduce virtio_net_hdr_{from,to}_skb")
> Signed-off-by: Jarno Rajahalme 

This patch is doing many more things than just this.

Do not mix unrelated changes together:

> @@ -821,9 +821,8 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>   if (iov_iter_count(iter) < vnet_hdr_len)
>   return -EINVAL;
>  
> - ret = virtio_net_hdr_from_skb(skb, &vnet_hdr,
> -   macvtap_is_little_endian(q));
> - if (ret)
> + if (virtio_net_hdr_from_skb(skb, &vnet_hdr,
> + macvtap_is_little_endian(q)))
>   BUG();
>  
>   if (copy_to_iter(&vnet_hdr, sizeof(vnet_hdr), iter) !=

This has nothing to do with modifying code to use virtio_net_hdr_to_skb(), it
doesn't belong in this patch.

> @@ -1361,15 +1360,12 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>   }
>  
>   if (vnet_hdr_sz) {
> - struct virtio_net_hdr gso = { 0 }; /* no info leak */
> - int ret;
> -
> + struct virtio_net_hdr gso;

This is _extremely_ opaque.  The initializer is trying to prevent kernel memory
info leaks onto the network or into user space.

Maybe this transformation is valid but:

1) YOU DON'T EVEN MENTION IT IN YOUR COMMIT MESSAGE.

2) It's unrelated to this specific change, therefore it belongs in
   a separate change.

3) You don't explain that it is a valid transformation, nor why.

It is extremely disappointing to catch unrelated, potentially far
reaching things embedded in a patch when I review it.

Please do not ever do this.

> @@ -98,4 +98,4 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb,
>   return 0;
>  }
>  
> -#endif /* _LINUX_VIRTIO_BYTEORDER */
> +#endif /* _LINUX_VIRTIO_NET_H */

Another unrelated change.

> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 11db0d6..09abb88 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -1971,8 +1971,6 @@ static unsigned int run_filter(struct sk_buff *skb,
>  static int __packet_rcv_vnet(const struct sk_buff *skb,
>struct virtio_net_hdr *vnet_hdr)
>  {
> - *vnet_hdr = (const struct virtio_net_hdr) { 0 };
> -

There is no way this belongs in this patch, and again you do not explain
why removing this initializer is valid.
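
Editorial sketch, not part of the thread: the info-leak concern behind the
removed initializers can be illustrated in plain userspace C. The struct
below is a hypothetical stand-in with the same field names as
struct virtio_net_hdr, and sketch_fill_hdr() and SKETCH_F_NEEDS_CSUM are
invented for illustration. The point is only that clearing the object before
the conditional fill guarantees that fields no branch touches hold zero
rather than stale memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct virtio_net_hdr; not the kernel definition. */
struct sketch_vnet_hdr {
	uint8_t  flags;
	uint8_t  gso_type;
	uint16_t hdr_len;
	uint16_t gso_size;
	uint16_t csum_start;
	uint16_t csum_offset;
};

#define SKETCH_F_NEEDS_CSUM 1	/* assumed flag value, for illustration only */

/*
 * Fill a header that will later be copied out to an untrusted reader.
 * The whole object is cleared first, so any field that the branch below
 * does not write is zero instead of whatever was in memory before.
 */
void sketch_fill_hdr(struct sketch_vnet_hdr *hdr, int needs_csum,
		     uint16_t csum_start, uint16_t csum_offset)
{
	assert(hdr != NULL);

	memset(hdr, 0, sizeof(*hdr));

	if (needs_csum) {
		hdr->flags = SKETCH_F_NEEDS_CSUM;
		hdr->csum_start = csum_start;
		hdr->csum_offset = csum_offset;
	}
	/* Otherwise every field stays zero; nothing is left uninitialized. */
}
```

Dropping the memset() here would only be safe if every path provably wrote
every byte of the header, which is the explanation the review asks for.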


Re: Synopsys Ethernet QoS Driver

2016-11-18 Thread Joao Pinto
 Hi Florian,

On 18-11-2016 14:53, Florian Fainelli wrote:
> On November 18, 2016 4:28:30 AM PST, Joao Pinto  
> wrote:
>>
>> Dear all,
>>
>> My name is Joao Pinto and I work at Synopsys.
>> I am a kernel developer with special focus in mainline collaboration,
>> both Linux
>> and Buildroot. I was recently named one of the maintainers of the PCIe
>> Designware core driver and I was the author of the Designware UFS
>> driver stack.
>>
>> I am sending you this e-mail because you were the suggested contacts
>> from the
>> get_maintainers script concerning Ethernet drivers :).
>>
>> Currently I have the task to work on the mainline Ethernet QoS driver
>> in which
>> you are the author. The work would consist of the following:
>>
>> a) Separate the current driver in a Core driver (common ops) + platform
>> glue
>> driver + pci glue driver
>> b) Add features that are currently only available internally
>> c) Add specific phy support using the PHY framework
>>
>> I would also gladly be available to be its maintainer if you agree with
>> it.
> 
> Since you have both the hardware and a clear todo list for this driver, start 
> submitting patches, get them included in David's tree and over time chances 
> are that you will become the maintainer, either explicitly by adding an entry 
> in the MAINTAINERS file or just by consistently contributing to this area.

Thanks for the feedback.

So I found 2 suitable git trees:
 a) kernel/git/davem/net.git
 b) kernel/git/davem/net-next.git

We should submit to net.git correct? The net-next.git is a tree with selected
patches for upstream only?

> 



Re: Synopsys Ethernet QoS Driver

2016-11-18 Thread Florian Fainelli
On November 18, 2016 4:28:30 AM PST, Joao Pinto  wrote:
>
>Dear all,
>
>My name is Joao Pinto and I work at Synopsys.
>I am a kernel developer with special focus in mainline collaboration,
>both Linux
>and Buildroot. I was recently named one of the maintainers of the PCIe
>Designware core driver and I was the author of the Designware UFS
>driver stack.
>
>I am sending you this e-mail because you were the suggested contacts
>from the
>get_maintainers script concerning Ethernet drivers :).
>
>Currently I have the task to work on the mainline Ethernet QoS driver
>in which
>you are the author. The work would consist of the following:
>
>a) Separate the current driver in a Core driver (common ops) + platform
>glue
>driver + pci glue driver
>b) Add features that are currently only available internally
>c) Add specific phy support using the PHY framework
>
>I would also gladly be available to be its maintainer if you agree with
>it.

Since you have both the hardware and a clear todo list for this driver, start 
submitting patches, get them included in David's tree and over time chances are 
that you will become the maintainer, either explicitly by adding an entry in 
the MAINTAINERS file or just by consistently contributing to this area.

-- 
Florian


Re: [mm PATCH v3 00/23] Add support for DMA writable pages being writable by the network stack

2016-11-18 Thread Alexander Duyck
On Thu, Nov 10, 2016 at 3:34 AM, Alexander Duyck
 wrote:
> The first 19 patches in the set add support for the DMA attribute
> DMA_ATTR_SKIP_CPU_SYNC on multiple platforms/architectures.  This is needed
> so that we can flag the calls to dma_map/unmap_page so that we do not
> invalidate cache lines that do not currently belong to the device.  Instead
> we have to take care of this in the driver via a call to
> sync_single_range_for_cpu prior to freeing the Rx page.
>
> Patch 20 adds support for dma_map_page_attrs and dma_unmap_page_attrs so
> that we can unmap and map a page using the DMA_ATTR_SKIP_CPU_SYNC
> attribute.
>
> Patch 21 adds support for freeing a page that has multiple references being
> held by a single caller.  This way we can free page fragments that were
> allocated by a given driver.
>
> The last 2 patches use these updates in the igb driver, and lay the
> groundwork to allow for us to reimplement the use of build_skb.
>
> v1: Minor fixes based on issues found by kernel build bot
> Few minor changes for issues found on code review
> Added Acked-by for patches that were acked and not changed
>
> v2: Added a few more Acked-by
> Submitting patches to mm instead of net-next
>
> v3: Added Acked-by for PowerPC architecture
> Dropped first 3 patches which were accepted into swiotlb tree
> Dropped comments describing swiotlb changes.
>
> ---
>
> Alexander Duyck (23):
>   arch/arc: Add option to skip sync on DMA mapping
>   arch/arm: Add option to skip sync on DMA map and unmap
>   arch/avr32: Add option to skip sync on DMA map
>   arch/blackfin: Add option to skip sync on DMA map
>   arch/c6x: Add option to skip sync on DMA map and unmap
>   arch/frv: Add option to skip sync on DMA map
>   arch/hexagon: Add option to skip DMA sync as a part of mapping
>   arch/m68k: Add option to skip DMA sync as a part of mapping
>   arch/metag: Add option to skip DMA sync as a part of map and unmap
>   arch/microblaze: Add option to skip DMA sync as a part of map and unmap
>   arch/mips: Add option to skip DMA sync as a part of map and unmap
>   arch/nios2: Add option to skip DMA sync as a part of map and unmap
>   arch/openrisc: Add option to skip DMA sync as a part of mapping
>   arch/parisc: Add option to skip DMA sync as a part of map and unmap
>   arch/powerpc: Add option to skip DMA sync as a part of mapping
>   arch/sh: Add option to skip DMA sync as a part of mapping
>   arch/sparc: Add option to skip DMA sync as a part of map and unmap
>   arch/tile: Add option to skip DMA sync as a part of map and unmap
>   arch/xtensa: Add option to skip DMA sync as a part of mapping
>   dma: Add calls for dma_map_page_attrs and dma_unmap_page_attrs
>   mm: Add support for releasing multiple instances of a page
>   igb: Update driver to make use of DMA_ATTR_SKIP_CPU_SYNC
>   igb: Update code to better handle incrementing page count
>
>
>  arch/arc/mm/dma.c |5 ++
>  arch/arm/common/dmabounce.c   |   16 --
>  arch/avr32/mm/dma-coherent.c  |7 ++-
>  arch/blackfin/kernel/dma-mapping.c|8 +++
>  arch/c6x/kernel/dma.c |   14 -
>  arch/frv/mb93090-mb00/pci-dma-nommu.c |   14 -
>  arch/frv/mb93090-mb00/pci-dma.c   |9 +++
>  arch/hexagon/kernel/dma.c |6 ++
>  arch/m68k/kernel/dma.c|8 +++
>  arch/metag/kernel/dma.c   |   16 +-
>  arch/microblaze/kernel/dma.c  |   10 +++-
>  arch/mips/loongson64/common/dma-swiotlb.c |2 -
>  arch/mips/mm/dma-default.c|8 ++-
>  arch/nios2/mm/dma-mapping.c   |   26 +++---
>  arch/openrisc/kernel/dma.c|3 +
>  arch/parisc/kernel/pci-dma.c  |   20 ++--
>  arch/powerpc/kernel/dma.c |9 +++
>  arch/sh/kernel/dma-nommu.c|7 ++-
>  arch/sparc/kernel/iommu.c |4 +-
>  arch/sparc/kernel/ioport.c|4 +-
>  arch/tile/kernel/pci-dma.c|   12 -
>  arch/xtensa/kernel/pci-dma.c  |7 ++-
>  drivers/net/ethernet/intel/igb/igb.h  |7 ++-
>  drivers/net/ethernet/intel/igb/igb_main.c |   77 +++--
>  include/linux/dma-mapping.h   |   20 +---
>  include/linux/gfp.h   |2 +
>  mm/page_alloc.c   |   14 +
>  27 files changed, 246 insertions(+), 89 deletions(-)
>

So I am just wondering if I need to resubmit this to pick up the new
"Acked-by"s or if I should just wait?

As I said in the description my hope is to get this into the -mm tree
and I am not familiar with what the process is for being accepted
there.

Thanks.

- Alex


Re: [PATCH] netns: make struct pernet_operations::id unsigned int

2016-11-18 Thread David Miller
From: Alexey Dobriyan 
Date: Thu, 17 Nov 2016 04:58:21 +0300

> Make struct pernet_operations::id unsigned.
 ...
> Signed-off-by: Alexey Dobriyan 

Applied, thank you.


[PATCH] mlxsw: switchib: add MLXSW_PCI dependency

2016-11-18 Thread Arnd Bergmann
The newly added switchib driver fails to link if MLXSW_PCI=m:

drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_exit':
switchib.c:(.exit.text+0x8): undefined reference to `mlxsw_pci_driver_unregister'
switchib.c:(.exit.text+0x10): undefined reference to `mlxsw_pci_driver_unregister'
drivers/net/ethernet/mellanox/mlxsw/mlxsw_switchib.o: In function `mlxsw_sib_module_init':
switchib.c:(.init.text+0x28): undefined reference to `mlxsw_pci_driver_register'
switchib.c:(.init.text+0x38): undefined reference to `mlxsw_pci_driver_register'
switchib.c:(.init.text+0x48): undefined reference to `mlxsw_pci_driver_unregister'

The other two such sub-drivers have a dependency, so add the same one
here. In theory we could allow this driver if MLXSW_PCI is disabled,
but it's probably not worth it.

Signed-off-by: Arnd Bergmann 
---
 drivers/net/ethernet/mellanox/mlxsw/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Kconfig b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
index bac2e5e826e2..49237a24605e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
@@ -31,7 +31,7 @@ config MLXSW_PCI
 
 config MLXSW_SWITCHIB
tristate "Mellanox Technologies SwitchIB and SwitchIB-2 support"
-   depends on MLXSW_CORE && NET_SWITCHDEV
+   depends on MLXSW_CORE && MLXSW_PCI && NET_SWITCHDEV
default m
---help---
  This driver supports Mellanox Technologies SwitchIB and SwitchIB-2
-- 
2.9.0



Re: [net-next PATCH v2] net: dummy: Introduce dummy virtual functions

2016-11-18 Thread Phil Sutter
Hi,

On Fri, Nov 18, 2016 at 12:04:14AM +0200, Or Gerlitz wrote:
> On Mon, Nov 14, 2016 at 3:02 PM, Phil Sutter  wrote:
> 
> > Due to the assumption that all PFs are PCI devices, this implementation
> > is not completely straightforward: In order to allow for
> > rtnl_fill_ifinfo() to see the dummy VFs, a fake PCI parent device is
> > attached to the dummy netdev. This has to happen at the right spot so
> > register_netdevice() does not get confused. This patch abuses
> > ndo_fix_features callback for that. In ndo_uninit callback, the fake
> > parent is removed again for the same purpose.
> 
> So you mimicked a PCI interface. How do you let the user configure
> the number of VFs? Through a module param? Why? If the module
> param only serves to say how many VFs the device supports, maybe
> support the maximum possible by PCI spec and skip the module param?

Yes, this is controlled via a module parameter. It doesn't say how many
VFs are supported, though, but rather how many dummy VFs are to be
created for each dummy interface.

> > +module_param(num_vfs, int, 0);
> > +MODULE_PARM_DESC(num_vfs, "Number of dummy VFs per dummy device");
> > +
> 
> > @@ -190,6 +382,7 @@ static int __init dummy_init_one(void)
> > err = register_netdevice(dev_dummy);
> > if (err < 0)
> > goto err;
> > +
> > return 0;
> 
> nit, remove this added blank line..

Oh yes, thanks. Spontaneous reviewer's blindness. :)

The implementation is problematic in another aspect though: Upon reboot,
it seems like no netdev ops are being called but the fake PCI parent's
kobject is being freed which does not work (and leads to an oops).

Cheers, Phil


Re: [PATCH net-next] udp: enable busy polling for all sockets

2016-11-18 Thread David Miller
From: Eric Dumazet 
Date: Wed, 16 Nov 2016 09:10:42 -0800

> From: Eric Dumazet 
> 
> UDP busy polling is restricted to connected UDP sockets.
> 
> This is because sk_busy_loop() only takes care of one NAPI context.
> 
> There are cases where it could be extended.
> 
> 1) Some hosts receive traffic on a single NIC, with one RX queue.
> 
> 2) Some applications use SO_REUSEPORT and associated BPF filter
>to split the incoming traffic on one UDP socket per RX
> queue/thread/cpu
> 
> 3) Some UDP sockets are used to send/receive traffic for one flow, but
> they do not bother with connect()
> 
> 
> This patch records the napi_id of first received skb, giving more
> reach to busy polling.
> 
> Tested:
> 
> lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
> lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
> 
> lpaa23:~# for f in `seq 1 10`; do ./super_netperf 1 -H lpaa24 -t UDP_RR -l 5; 
> done
> 
> Before patch :
>27867   28870   37324   41060   41215
>36764   36838   44455   41282   43843
> After patch :
>73920   73213   70147   74845   71697
>68315   68028   75219   70082   73707
> 
> Signed-off-by: Eric Dumazet 

Applied, thanks Eric.
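
Editorial sketch, not part of the thread: the "record the napi_id of the
first received skb" idea from the commit message can be modelled in a few
lines of userspace C. The struct and function names below are hypothetical
stand-ins, not the kernel's types or helpers, and the real patch also has to
deal with concurrent access, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the socket and the skb; not kernel structures. */
struct sketch_sock {
	unsigned int napi_id;	/* 0 means "no NAPI context recorded yet" */
};

struct sketch_skb {
	unsigned int napi_id;	/* NAPI context the packet arrived on */
};

/* Record the NAPI context of the first packet only; later packets are ignored. */
void sketch_mark_napi_id_once(struct sketch_sock *sk,
			      const struct sketch_skb *skb)
{
	assert(sk != NULL && skb != NULL);

	if (sk->napi_id == 0)
		sk->napi_id = skb->napi_id;
}

/* Busy polling needs a known NAPI context to spin on. */
int sketch_can_busy_poll(const struct sketch_sock *sk)
{
	assert(sk != NULL);

	return sk->napi_id != 0;
}
```

Under this model an unconnected socket becomes eligible for busy polling as
soon as its first packet arrives, which matches the three use cases listed
in the commit message where all traffic comes in through one RX queue.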


Re: [PATCH v3 3/5] net: asix: Fix AX88772x resume failures

2016-11-18 Thread Jon Hunter
Hi Allan,

On 14/11/16 09:45, ASIX_Allan [Office] wrote:
> Hi Jon,
> 
> Please help to double check if the USB host controller of your Tegra
> platform had been powered OFF while running the ax88772_suspend() routine or
> not? 

Sorry for the delay. Today I set up a local board to reproduce this on
and was able to recreate the same problem. The Tegra xhci driver does
not power off during suspend and simply calls xhci_suspend(). I also
checked vbus to see if it was turning off but it is not. Furthermore I
don't see a new USB device detected after the error and so I don't see
any evidence that it ever disconnects.

Cheers
Jon

-- 
nvpublic


[PATCH net] rtnetlink: fix FDB size computation

2016-11-18 Thread Sabrina Dubroca
Add missing NDA_VLAN attribute's size.

Fixes: 1e53d5bb8878 ("net: Pass VLAN ID to rtnl_fdb_notify.")
Signed-off-by: Sabrina Dubroca 
---
 net/core/rtnetlink.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2a75127f0e9e..92e75af2dc6a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2863,7 +2863,10 @@ static int nlmsg_populate_fdb_fill(struct sk_buff *skb,
 
 static inline size_t rtnl_fdb_nlmsg_size(void)
 {
-   return NLMSG_ALIGN(sizeof(struct ndmsg)) + nla_total_size(ETH_ALEN);
+   return NLMSG_ALIGN(sizeof(struct ndmsg)) +
+  nla_total_size(ETH_ALEN) +   /* NDA_LLADDR */
+  nla_total_size(sizeof(u16)) +/* NDA_VLAN */
+  0;
 }
 
 static void rtnl_fdb_notify(struct net_device *dev, u8 *addr, u16 vid, int type,
-- 
2.10.2
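
Editorial sketch, not part of the patch: the effect of the fix can be checked
with a userspace re-implementation of the size arithmetic. Everything below
uses hypothetical sketch_ names; it assumes the usual netlink rules (4-byte
alignment, 4-byte attribute header) and a 12-byte stand-in for struct ndmsg,
and it does not call any kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ALIGNTO  4u	/* netlink messages and attributes align to 4 bytes */
#define SKETCH_NLA_HDR  4u	/* attribute header: u16 length + u16 type */
#define SKETCH_ETH_ALEN 6u	/* length of an Ethernet address */

/* Hypothetical stand-in with the same layout as struct ndmsg (12 bytes). */
struct sketch_ndmsg {
	uint8_t  family;
	uint8_t  pad1;
	uint16_t pad2;
	int32_t  ifindex;
	uint16_t state;
	uint8_t  flags;
	uint8_t  type;
};

/* Round len up to the next multiple of SKETCH_ALIGNTO. */
size_t sketch_align(size_t len)
{
	return (len + SKETCH_ALIGNTO - 1) & ~(size_t)(SKETCH_ALIGNTO - 1);
}

/* Space one attribute occupies: header plus payload, padded. */
size_t sketch_nla_total_size(size_t payload)
{
	return sketch_align(SKETCH_NLA_HDR + payload);
}

/* Size estimate before the fix: NDA_LLADDR only. */
size_t sketch_fdb_size_before(void)
{
	assert(sizeof(struct sketch_ndmsg) == 12);

	return sketch_align(sizeof(struct sketch_ndmsg)) +
	       sketch_nla_total_size(SKETCH_ETH_ALEN);
}

/* Size estimate after the fix: NDA_LLADDR plus NDA_VLAN. */
size_t sketch_fdb_size_after(void)
{
	return sketch_fdb_size_before() +
	       sketch_nla_total_size(sizeof(uint16_t));
}
```

With these assumptions the estimate grows from 24 to 32 bytes, that is, by
the 8 bytes a padded u16 attribute occupies.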



Re: [PATCH] net: fec: Detect and recover receive queue hangs

2016-11-18 Thread Chris Lesiak
On 11/18/2016 12:44 AM, Andy Duan wrote:
> From: Chris Lesiak  Sent: Friday, November 18, 2016 
> 5:15 AM
>  >To: Andy Duan 
>  >Cc: netdev@vger.kernel.org; linux-ker...@vger.kernel.org; Jaccon
>  >Bastiaansen ; chris.les...@licor.com
>  >Subject: [PATCH] net: fec: Detect and recover receive queue hangs
>  >
>  >This corrects a problem that appears to be similar to ERR006358.  But while
>  >ERR006358 is a race when the tx queue transitions from empty to not empty,
>  >this problem is a race when the rx queue transitions from full to not full.
>  >
>  >The symptom is a receive queue that is stuck.  The ENET_RDAR register will
>  >read 0, indicating that there are no empty receive descriptors in the receive
>  >ring.  Since no additional frames can be queued, no RXF interrupts occur.
>  >
>  >This problem can be triggered with a 1 Gb link and about 400 Mbps of traffic.

I can cause the error by running the following on an imx6q: iperf -s -u
And sending packets from the other end of a 1 Gbps link:
iperf -c $IPADDR -u -b4pps

A few others have seen this problem.
See: https://community.nxp.com/thread/322882

>  >
>  >This patch detects this condition, sets the work_rx bit, and reschedules the
>  >poll method.
>  >
>  >Signed-off-by: Chris Lesiak 
>  >---
>  > drivers/net/ethernet/freescale/fec_main.c | 31
>  >+++
>  > 1 file changed, 31 insertions(+)
>  >
> Firstly, how to reproduce the issue, pls list the reproduce steps. Thanks.
> Secondly, pls check below comments.
>
>  >diff --git a/drivers/net/ethernet/freescale/fec_main.c
>  >b/drivers/net/ethernet/freescale/fec_main.c
>  >index fea0f33..8a87037 100644
>  >--- a/drivers/net/ethernet/freescale/fec_main.c
>  >+++ b/drivers/net/ethernet/freescale/fec_main.c
>  >@@ -1588,6 +1588,34 @@ fec_enet_interrupt(int irq, void *dev_id)
>  >return ret;
>  > }
>  >
>  >+static inline bool
>  >+fec_enet_recover_rxq(struct fec_enet_private *fep, u16 queue_id) {
>  >+   int work_bit = (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
>  >+
>  >+   if (readl(fep->rx_queue[queue_id]->bd.reg_desc_active))
> If the rx ring is really empty in light-throughput cases, rdar is always
> cleared, so napi will always be rescheduled.

I think that you are concerned that if rdar is zero due to this hardware
problem, but the rx ring is actually empty, then fec_enet_rx_queue will
never do a write to rdar so that it can be non-zero.  That will cause napi
to always be rescheduled.

I suppose that might be the case with zero rx traffic, and I was concerned
that it might be true even when there was rx traffic.  I suspected that the
hardware, seeing that rdar is zero, would never queue another packet, even
if there were in fact empty descriptors.  But that doesn't seem to be the
case.  It does reschedule multiple times, but eventually sees some packets
in the rx ring and recovers.

I admit that I do not completely understand how that can happen.  I did
confirm that fec_enet_active_rxring is not being called.

Maybe someone with a deeper understanding of the fec than I can provide an
explanation.

>
>  >+   return false;
>  >+
>  >+   dev_notice_once(&fep->pdev->dev, "Recovered rx queue\n");
>  >+
>  >+   fep->work_rx |= 1 << work_bit;
>  >+
>  >+   return true;
>  >+}
>  >+
>  >+static inline bool fec_enet_recover_rxqs(struct fec_enet_private *fep)
>  >+{
>  >+   unsigned int q;
>  >+   bool ret = false;
>  >+
>  >+   for (q = 0; q < fep->num_rx_queues; q++) {
>  >+   if (fec_enet_recover_rxq(fep, q))
>  >+   ret = true;
>  >+   }
>  >+
>  >+   return ret;
>  >+}
>  >+
>  > static int fec_enet_rx_napi(struct napi_struct *napi, int budget)  {
>  >struct net_device *ndev = napi->dev;
>  >@@ -1601,6 +1629,9 @@ static int fec_enet_rx_napi(struct napi_struct *napi,
>  >int budget)
>  >if (pkts < budget) {
>  >napi_complete(napi);
>  >writel(FEC_DEFAULT_IMASK, fep->hwp + FEC_IMASK);
>  >+
>  >+   if (fec_enet_recover_rxqs(fep) && napi_reschedule(napi))
>  >+   writel(FEC_NAPI_IMASK, fep->hwp + FEC_IMASK);
>  >}
>  >return pkts;
>  > }
>  >--
>  >2.5.5
>
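
Editorial sketch, not part of the thread: the detection step of the quoted
patch can be modelled in userspace C to make the queue-to-bit mapping easier
to follow. The rdar[] array is a hypothetical stand-in for the values read
back from each queue's descriptor-active register, and the sketch_ names are
invented; only the mapping expression is copied from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RXQ 3u	/* assumed upper bound on rx queues for this sketch */

/* Queue-to-work-bit mapping, copied from the quoted patch. */
unsigned int sketch_work_bit(unsigned int queue_id)
{
	assert(queue_id < SKETCH_MAX_RXQ);

	return (queue_id == 0) ? 2 : ((queue_id == 1) ? 0 : 1);
}

/*
 * rdar[q] is the value read from queue q's descriptor-active register.
 * Returns the bits to OR into work_rx; a nonzero result means at least one
 * queue looked stuck and the poll method should be rescheduled.
 */
uint32_t sketch_recover_rxqs(const uint32_t *rdar, unsigned int num_rx_queues)
{
	uint32_t work_rx = 0;
	unsigned int q;

	assert(rdar != NULL && num_rx_queues <= SKETCH_MAX_RXQ);

	for (q = 0; q < num_rx_queues; q++) {
		if (rdar[q] != 0)
			continue;	/* hardware still sees free descriptors */

		work_rx |= (uint32_t)1 << sketch_work_bit(q);
	}

	return work_rx;
}
```

Note that the model flags a queue whenever its register reads zero, so it
shares the property raised in the review: an idle but healthy queue that
reads zero would also trigger a reschedule.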


-- 
Chris Lesiak
Principal Design Engineer, Software
LI-COR Biosciences
chris.les...@licor.com

Any opinions expressed are those of the author and
do not necessarily represent those of his employer.




[RFC PATCH v2 1/2] macb: Add 1588 support in Cadence GEM.

2016-11-18 Thread Andrei Pistirica
Cadence GEM provides a 102 bit time counter with 48 bits for seconds,
30 bits for nsecs and 24 bits for sub-nsecs to control 1588 timestamping.

This patch does the following:
- Registers to ptp clock framework
- Timer initialization is done by writing time of day to the timer counter.
- ns increment register is programmed as NSEC_PER_SEC/TSU_CLK.
  For a 16 bit subns precision, the subns increment equals
  remainder of (NSEC_PER_SEC/TSU_CLK) * (2^16).
- HW time stamp capabilities are advertised via ethtool and macb ioctl is
  updated accordingly.
- Timestamps are obtained from the TX/RX PTP event/PEER registers.
  The timestamp obtained thus is updated in skb for upper layers to access.
- The drivers register functions with ptp to perform time and frequency
  adjustment.
- Time adjustment is done by writing to the 1588_ADJUST register.
  The controller will read the delta in this register and update the timer
  counter register. Alternatively, for large time offset adjustments,
  the driver reads the secs and nsecs counter values, adds/subtracts the
  delta and updates the timer counter.
- Frequency adjustment is not directly supported by this IP.
  addend is the initial value of the ns increment, and similarly addendsub
  for the subns increment.
  The ppb (parts per billion) provided is used as
  ns_incr = addend +/- (ppb/rate).
  Similarly the remainder of the above is used to populate subns increment.
  In case the ppb requested is negative AND subns adjustment greater than
  the addendsub, ns_incr is reduced by 1 and subns_incr is adjusted in
  positive accordingly.
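
Editorial sketch, not part of the patch: one way to read the increment
programming described in the list above, written as userspace C. The
sketch_ names are hypothetical, and the interpretation of the subns value
(remainder of the division, scaled by 2^16 and divided by the clock rate) is
an assumption about what the description means, not taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL
#define SKETCH_SUBNS_BITS   16	/* 16 bit subns precision, as in the description */

/* Hypothetical holder for the two increment register values. */
struct sketch_tsu_incr {
	uint32_t ns;		/* whole nanoseconds added per TSU clock tick */
	uint32_t sub_ns;	/* fractional part, in units of 2^-16 ns */
};

struct sketch_tsu_incr sketch_tsu_calc_incr(uint32_t tsu_clk_hz)
{
	struct sketch_tsu_incr incr;
	uint64_t rem;

	assert(tsu_clk_hz != 0);

	/* Integer part of the tick period in nanoseconds. */
	incr.ns = (uint32_t)(SKETCH_NSEC_PER_SEC / tsu_clk_hz);

	/* What the integer division left over, scaled to 16 fractional bits. */
	rem = SKETCH_NSEC_PER_SEC % tsu_clk_hz;
	incr.sub_ns = (uint32_t)((rem << SKETCH_SUBNS_BITS) / tsu_clk_hz);

	return incr;
}
```

For a 100 MHz clock this yields ns = 10 and sub_ns = 0; for an assumed
133333333 Hz clock it yields ns = 7 and sub_ns = 32768, that is, 7.5 ns per
tick.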

Signed-off-by: Andrei Pistirica 
Signed-off-by: Harini Katakam 
---
Version 2 patch for: https://patchwork.kernel.org/patch/9310989/.
Modifications:
- bitfields for TSU are named according to SAMA5D2 data sheet
- identify GEM-PTP support based on platform capability
- add spinlock for TSU access
- change macb_ptp_adjfreq and use fewer 64bit divisions

 drivers/net/ethernet/cadence/Kconfig|  10 +-
 drivers/net/ethernet/cadence/Makefile   |   8 +-
 drivers/net/ethernet/cadence/macb.h |  80 +++
 drivers/net/ethernet/cadence/macb_ptp.c | 229 
 4 files changed, 325 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/cadence/macb_ptp.c

diff --git a/drivers/net/ethernet/cadence/Kconfig b/drivers/net/ethernet/cadence/Kconfig
index f0bcb15..ebbc65f 100644
--- a/drivers/net/ethernet/cadence/Kconfig
+++ b/drivers/net/ethernet/cadence/Kconfig
@@ -29,6 +29,14 @@ config MACB
  support for the MACB/GEM chip.
 
  To compile this driver as a module, choose M here: the module
- will be called macb.
+ will be called cadence-macb.
+
+config MACB_USE_HWSTAMP
+   bool "Use IEEE 1588 hwstamp"
+   depends on MACB
+   default y
+   select PTP_1588_CLOCK
+   ---help---
+ Enable IEEE 1588 Precision Time Protocol (PTP) support for MACB.
 
 endif # NET_CADENCE
diff --git a/drivers/net/ethernet/cadence/Makefile b/drivers/net/ethernet/cadence/Makefile
index 91f79b1..4402d42 100644
--- a/drivers/net/ethernet/cadence/Makefile
+++ b/drivers/net/ethernet/cadence/Makefile
@@ -2,4 +2,10 @@
 # Makefile for the Atmel network device drivers.
 #
 
-obj-$(CONFIG_MACB) += macb.o
+cadence-macb-y := macb.o
+
+ifeq ($(CONFIG_MACB_USE_HWSTAMP),y)
+cadence-macb-y += macb_ptp.o
+endif
+
+obj-$(CONFIG_MACB) += cadence-macb.o
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 3f385ab..2ee9af8 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -10,6 +10,10 @@
 #ifndef _MACB_H
 #define _MACB_H
 
+#include 
+#include 
+#include 
+
 #define MACB_GREGS_NBR 16
 #define MACB_GREGS_VERSION 2
 #define MACB_MAX_QUEUES 8
@@ -129,6 +133,20 @@
 #define GEM_RXIPCCNT   0x01a8 /* IP header Checksum Error Counter */
 #define GEM_RXTCPCCNT  0x01ac /* TCP Checksum Error Counter */
 #define GEM_RXUDPCCNT  0x01b0 /* UDP Checksum Error Counter */
+#define GEM_TISUBN 0x01bc /* 1588 Timer Increment Sub-ns */
+#define GEM_TSH0x01c0 /* 1588 Timer Seconds High */
+#define GEM_TSL0x01d0 /* 1588 Timer Seconds Low */
+#define GEM_TN 0x01d4 /* 1588 Timer Nanoseconds */
+#define GEM_TA 0x01d8 /* 1588 Timer Adjust */
+#define GEM_TI 0x01dc /* 1588 Timer Increment */
+#define GEM_EFTSL  0x01e0 /* PTP Event Frame Tx Seconds Low */
+#define GEM_EFTN   0x01e4 /* PTP Event Frame Tx Nanoseconds */
+#define GEM_EFRSL  0x01e8 /* PTP Event Frame Rx Seconds Low */
+#define GEM_EFRN   0x01ec /* PTP Event Frame Rx Nanoseconds */
+#define GEM_PEFTSL 0x01f0 /* PTP Peer Event Frame Tx Secs Low */
+#define GEM_PEFTN  0x01f4 /* PTP Peer Event Frame Tx Ns */
+#define GEM_PEFRSL 0x01f8 /* PTP Peer 

[RFC PATCH v2 2/2] macb: Enable 1588 support in SAMA5D2 platform.

2016-11-18 Thread Andrei Pistirica
Hardware time stamps on PTP Ethernet packets are received using the
SO_TIMESTAMPING API, where the timestamps are obtained from the PTP event/peer
registers.

Signed-off-by: Andrei Pistirica 
---
Version 2 patch for: https://patchwork.kernel.org/patch/9310991/
Modifications:
- add PTP caps for SAMA5D2/3/4 platforms
- and cosmetic changes

 drivers/net/ethernet/cadence/macb.c |  24 +++-
 drivers/net/ethernet/cadence/macb.h |  13 ++
 drivers/net/ethernet/cadence/macb_ptp.c | 222 
 3 files changed, 254 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index d975882..eb66b76 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -697,6 +697,8 @@ static void macb_tx_interrupt(struct macb_queue *queue)
 
/* First, update TX stats if needed */
if (skb) {
+   macb_ptp_do_txstamp(bp, skb);
+
netdev_vdbg(bp->dev, "skb %u (data %p) TX 
complete\n",
macb_tx_ring_wrap(tail), skb->data);
bp->stats.tx_packets++;
@@ -853,6 +855,8 @@ static int gem_rx(struct macb *bp, int budget)
GEM_BFEXT(RX_CSUM, ctrl) & GEM_RX_CSUM_CHECKED_MASK)
skb->ip_summed = CHECKSUM_UNNECESSARY;
 
+   macb_ptp_do_rxstamp(bp, skb);
+
bp->stats.rx_packets++;
bp->stats.rx_bytes += skb->len;
 
@@ -1946,6 +1950,8 @@ static int macb_open(struct net_device *dev)
 
netif_tx_start_all_queues(dev);
 
+   macb_ptp_init(dev);
+
return 0;
 }
 
@@ -2204,7 +2210,7 @@ static const struct ethtool_ops gem_ethtool_ops = {
.get_regs_len   = macb_get_regs_len,
.get_regs   = macb_get_regs,
.get_link   = ethtool_op_get_link,
-   .get_ts_info= ethtool_op_get_ts_info,
+   .get_ts_info= macb_get_ts_info,
.get_ethtool_stats  = gem_get_ethtool_stats,
.get_strings= gem_get_ethtool_strings,
.get_sset_count = gem_get_sset_count,
@@ -2221,7 +2227,14 @@ static int macb_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
if (!phydev)
return -ENODEV;
 
-   return phy_mii_ioctl(phydev, rq, cmd);
+   switch (cmd) {
+   case SIOCSHWTSTAMP:
+   return macb_hwtst_set(dev, rq, cmd);
+   case SIOCGHWTSTAMP:
+   return macb_hwtst_get(dev, rq);
+   default:
+   return phy_mii_ioctl(phydev, rq, cmd);
+   }
 }
 
 static int macb_set_features(struct net_device *netdev,
@@ -2812,7 +2825,7 @@ static const struct macb_config pc302gem_config = {
 };
 
 static const struct macb_config sama5d2_config = {
-   .caps = MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII,
+   .caps = MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII | MACB_CAPS_GEM_HAS_PTP,
.dma_burst_length = 16,
.clk_init = macb_clk_init,
.init = macb_init,
@@ -2820,14 +2833,15 @@ static const struct macb_config sama5d2_config = {
 
 static const struct macb_config sama5d3_config = {
.caps = MACB_CAPS_SG_DISABLED | MACB_CAPS_GIGABIT_MODE_AVAILABLE
- | MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII,
+ | MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII
+ | MACB_CAPS_GEM_HAS_PTP,
.dma_burst_length = 16,
.clk_init = macb_clk_init,
.init = macb_init,
 };
 
 static const struct macb_config sama5d4_config = {
-   .caps = MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII,
+   .caps = MACB_CAPS_USRIO_DEFAULT_IS_MII_GMII | MACB_CAPS_GEM_HAS_PTP,
.dma_burst_length = 4,
.clk_init = macb_clk_init,
.init = macb_init,
diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
index 2ee9af8..3ac824a 100644
--- a/drivers/net/ethernet/cadence/macb.h
+++ b/drivers/net/ethernet/cadence/macb.h
@@ -918,8 +918,21 @@ struct macb {
 
 #ifdef CONFIG_MACB_USE_HWSTAMP
 void macb_ptp_init(struct net_device *ndev);
+void macb_ptp_do_rxstamp(struct macb *bp, struct sk_buff *skb);
+void macb_ptp_do_txstamp(struct macb *bp, struct sk_buff *skb);
+int macb_ptp_get_ts_info(struct net_device *dev, struct ethtool_ts_info *info);
+#define macb_get_ts_info macb_ptp_get_ts_info
+int macb_hwtst_set(struct net_device *netdev, struct ifreq *ifr, int cmd);
+int macb_hwtst_get(struct net_device *netdev, struct ifreq *ifr);
 #else
 void macb_ptp_init(struct net_device *ndev) { }
+void macb_ptp_do_rxstamp(struct macb *bp, struct sk_buff *skb) { }
+void macb_ptp_do_txstamp(struct macb *bp, struct sk_buff *skb) { }
+#define macb_get_ts_info ethtool_op_get_ts_info
+int macb_hwtst_set(struct net_device *netdev, struct ifreq *ifr, int cmd)
+   { return -EOPNOTSUPP; }
+int macb_hwtst_get(struct net_device 

Re: Synopsys Ethernet QoS Driver

2016-11-18 Thread Joao Pinto
Hello Ozgur,

Thanks for your feedback.

On 18-11-2016 13:09, mued dib wrote:
> Dear Joao;
> 
> Thanks for the support; this project is good. I have some questions: Linux
> already supports QoS with "tc", right?
> 
> Can you send us a list of the driver files you are interested in?

For now we are interested in improving the Synopsys QoS driver under
/net/ethernet/synopsys. The driver currently consists of a single file
called dwc_eth_qos.c, containing Synopsys Ethernet QoS common ops and platform
related stuff.

Our strategy would be:

a) Implement a platform glue driver (dwc_eth_qos_pltfm.c)
b) Implement a pci glue driver (dwc_eth_qos_pci.c)
c) Implement a "core driver" (dwc_eth_qos.c) that would only have Ethernet QoS
related stuff to be reused by the platform / pci drivers
d) Add a set of features to the "core driver" that we have available internally

Thanks,
Joao


> 
> Regards,
> 
> Ozgur Karatas
> 
> 2016-11-18 15:28 GMT+03:00 Joao Pinto :
> 
>>
>> Dear all,
>>
>> My name is Joao Pinto and I work at Synopsys.
>> I am a kernel developer with special focus in mainline collaboration, both
>> Linux
>> and Buildroot. I was recently named one of the maintainers of the PCIe
>> Designware core driver and I was the author of the Designware UFS driver
>> stack.
>>
>> I am sending you this e-mail because you were the suggested contacts from
>> the
>> get_maintainers script concerning Ethernet drivers :).
>>
>> Currently I have the task to work on the mainline Ethernet QoS driver in
>> which
>> you are the author. The work would consist of the following:
>>
>> a) Separate the current driver in a Core driver (common ops) + platform
>> glue
>> driver + pci glue driver
>> b) Add features that are currently only available internally
>> c) Add specific phy support using the PHY framework
>>
>> I would also gladly be available to be its maintainer if you agree with it.
>>
>> It would be great to have your collaboration in the project if you are
>> available
>> to review the work in progress.
>>
>> Thank you and I am looking forward for your feedback!
>>
>> Joao Pinto
>>
> 


