[net-next:master 332/339] kernel//bpf/syscall.c:1404:23: warning: cast to pointer from integer of different size

2017-09-28 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   fa8fefaa678ea390b873195d19c09930da84a4bb
commit: cb4d2b3f03d8eed90be3a194e5b54b734ec4bbe9 [332/339] bpf: Add name, 
load_time, uid and map_ids to bpf_prog_info
config: blackfin-allmodconfig (attached as .config)
compiler: bfin-uclinux-gcc (GCC) 6.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout cb4d2b3f03d8eed90be3a194e5b54b734ec4bbe9
# save the attached .config to linux build tree
make.cross ARCH=blackfin 

All warnings (new ones prefixed by >>):

   kernel//bpf/syscall.c: In function 'bpf_prog_get_info_by_fd':
>> kernel//bpf/syscall.c:1404:23: warning: cast to pointer from integer of 
>> different size [-Wint-to-pointer-cast]
  u32 *user_map_ids = (u32 *)info.map_ids;
  ^

vim +1404 kernel//bpf/syscall.c

  1371  
  1372  static int bpf_prog_get_info_by_fd(struct bpf_prog *prog,
  1373 const union bpf_attr *attr,
  1374 union bpf_attr __user *uattr)
  1375  {
  1376  struct bpf_prog_info __user *uinfo = 
u64_to_user_ptr(attr->info.info);
  1377  struct bpf_prog_info info = {};
  1378  u32 info_len = attr->info.info_len;
  1379  char __user *uinsns;
  1380  u32 ulen;
  1381  int err;
  1382  
  1383  err = check_uarg_tail_zero(uinfo, sizeof(info), info_len);
  1384  if (err)
  1385  return err;
  1386  info_len = min_t(u32, sizeof(info), info_len);
  1387  
  1388  if (copy_from_user(&info, uinfo, info_len))
  1389  return -EFAULT;
  1390  
  1391  info.type = prog->type;
  1392  info.id = prog->aux->id;
  1393  info.load_time = prog->aux->load_time;
  1394  info.created_by_uid = from_kuid_munged(current_user_ns(),
  1395 prog->aux->user->uid);
  1396  
  1397  memcpy(info.tag, prog->tag, sizeof(prog->tag));
  1398  memcpy(info.name, prog->aux->name, sizeof(prog->aux->name));
  1399  
  1400  ulen = info.nr_map_ids;
  1401  info.nr_map_ids = prog->aux->used_map_cnt;
  1402  ulen = min_t(u32, info.nr_map_ids, ulen);
  1403  if (ulen) {
> 1404  u32 *user_map_ids = (u32 *)info.map_ids;
  1405  u32 i;
  1406  
  1407  for (i = 0; i < ulen; i++)
  1408  if (put_user(prog->aux->used_maps[i]->id,
  1409   &user_map_ids[i]))
  1410  return -EFAULT;
  1411  }
  1412  
  1413  if (!capable(CAP_SYS_ADMIN)) {
  1414  info.jited_prog_len = 0;
  1415  info.xlated_prog_len = 0;
  1416  goto done;
  1417  }
  1418  
  1419  ulen = info.jited_prog_len;
  1420  info.jited_prog_len = prog->jited_len;
  1421  if (info.jited_prog_len && ulen) {
  1422  uinsns = u64_to_user_ptr(info.jited_prog_insns);
  1423  ulen = min_t(u32, info.jited_prog_len, ulen);
  1424  if (copy_to_user(uinsns, prog->bpf_func, ulen))
  1425  return -EFAULT;
  1426  }
  1427  
  1428  ulen = info.xlated_prog_len;
  1429  info.xlated_prog_len = bpf_prog_insn_size(prog);
  1430  if (info.xlated_prog_len && ulen) {
  1431  uinsns = u64_to_user_ptr(info.xlated_prog_insns);
  1432  ulen = min_t(u32, info.xlated_prog_len, ulen);
  1433  if (copy_to_user(uinsns, prog->insnsi, ulen))
  1434  return -EFAULT;
  1435  }
  1436  
  1437  done:
  1438  if (copy_to_user(uinfo, &info, info_len) ||
  1439  put_user(info_len, &uattr->info.info_len))
  1440  return -EFAULT;
  1441  
  1442  return 0;
  1443  }
  1444  

---
0-DAY kernel test infrastructure, Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT

2017-09-28 Thread Jesper Dangaard Brouer
On Fri, 29 Sep 2017 00:45:40 +0200
Daniel Borkmann  wrote:

> On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
> > Introducing a new way to redirect XDP frames.  Notice how no driver
> > changes are necessary given the design of XDP_REDIRECT.
> >
> > This redirect map type is called 'cpumap', as it allows redirecting
> > XDP frames to remote CPUs.  The remote CPU will do the SKB allocation
> > and start the network stack invocation on that CPU.
> >
> > This is a scalability and isolation mechanism that allows separating
> > the early driver network XDP layer from the rest of the netstack, and
> > assigning dedicated CPUs for this stage.  The sysadm controls/configures
> > the RX-CPU to NIC-RX queue mapping (as usual) via procfs smp_affinity, and
> > how many queues are configured via ethtool --set-channels.  Benchmarks
> > show that a single CPU can handle approx 11Mpps.  Thus, only assigning
> > two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s
> > wirespeed at the smallest packet size, 14.88Mpps.  Reducing the number
> > of queues has the advantage that more packets are available in "bulk"
> > per hardware interrupt[1].
> >
> > [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> >
> > Use-cases:
> >
> > 1. End-host based pre-filtering for DDoS mitigation.  This is fast
> > enough to allow software to see and filter all packets at wirespeed.
> > Thus, no packets get silently dropped by hardware.
> >
> > 2. Given that NIC HW unevenly distributes packets across RX queues, this
> > mechanism can be used for redistributing load across CPUs.  This
> > usually happens when the HW is unaware of a new protocol.  This
> > resembles RPS (Receive Packet Steering), just faster, but with more
> > responsibility placed on the BPF program for correct steering.
> >
> > 3. Auto-scaling or power saving via only activating the appropriate
> > number of remote CPUs for handling the current load.  The cpumap
> > tracepoints can function as a feedback loop for this purpose.  
> 
> Interesting work, thanks! Still digesting the code a bit. I think
> it pretty much goes into the direction that Eric describes in his
> netdev paper quoted above; not on a generic level though but specific
> to XDP at least; theoretically XDP could just run transparently on
> the CPU doing the filtering, and raw buffers are handed to remote
> CPU with similar batching, but it would need some different config
> interface at minimum.

Good that you noticed this is implicitly implementing RX bulking, which
is where much of the performance gain originates.

It is true, I am inspired by Eric's paper (I love it). Do notice that
this is not blocking or interfering with Eric's/others' continued work in
this area.  This implementation just shows that the paper's "break the
pipe!" idea works very well for XDP.

More on config knobs below.
 
> Shouldn't we take the CPU(s) running XDP on the RX queues out from
> the normal process scheduler, so that we have a guarantee that user
> space or unrelated kernel tasks cannot interfere with them anymore,
> and we could then turn them into busy polling eventually (e.g. as
> long as XDP is running there and once off could put them back into
> normal scheduling domain transparently)?

We should be careful not to invent networking config knobs that belong
to other parts of the kernel, like the scheduler.  We already have the
ability to control where IRQs land via procfs smp_affinity.  And if
you want CPU isolation, we can use the boot cmdline "isolcpus" (hint:
this is what DPDK recommends/uses for zero-loss configs).  It is the
userspace tool (or sysadm) loading the XDP program that is responsible
for configuring the CPU smp_affinity alignment.
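A hypothetical sketch of that smp_affinity alignment (the IRQ numbers and device name below are placeholders; the real per-queue IRQs come from /proc/interrupts):

```shell
# Hypothetical example: restrict the NIC to two RX queues and steer
# each queue's IRQ to a dedicated CPU.  IRQ numbers 124/125 and the
# name eth0 are placeholders -- look up your NIC's IRQs first:
#   grep eth0 /proc/interrupts
ethtool --set-channels eth0 combined 2     # only two RX queues
echo 1 > /proc/irq/124/smp_affinity        # queue 0 -> CPU0 (mask 0x1)
echo 2 > /proc/irq/125/smp_affinity        # queue 1 -> CPU1 (mask 0x2)
```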

Making NAPI busy-poll is out of scope for this patchset. Someone
should work on this separately.  It would just help/improve this kind
of scheme.

I actually think it would be more relevant to add/put the "remote" CPUs
in the 'cpumap' into a separate scheduler group, to implement stuff
like auto-scaling and power-saving.


> What about RPS/RFS in the sense that once you punt them to remote
> CPU, could we reuse application locality information so they'd end
> up on the right CPU in the first place (w/o backlog detour), or is
> the intent to rather disable it and have some own orchestration
> with relation to the CPU map?

An advanced bpf orchestration could basically implement what you
describe, combined with a userspace-side tool that pins applications
with taskset.  To know when a task can move between CPUs, you use the
tracepoints to see when the CPU queue is empty (hint: time_limit=true
and processed=0).

For now, I'm not targeting such advanced use-cases.  My main target is
a customer that has double-tagged VLANs, which ixgbe cannot RSS
distribute, so they all end up on queue 0.  And as I demonstrated (in
another email), RPS is too slow to fix this.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat

Re: [Patch net-next] net_sched: use idr to allocate u32 filter handles

2017-09-28 Thread Simon Horman
On Thu, Sep 28, 2017 at 03:19:05PM -0700, Cong Wang wrote:
> On Thu, Sep 28, 2017 at 12:34 AM, Simon Horman
>  wrote:
> > Hi Cong,
> >
> > this looks like a nice enhancement to me. Did you measure any performance
> > benefit from it?  Perhaps it could be described in the changelog. I also
> > have a more detailed question below.
> 
> No, I was inspired by commit c15ab236d69d; I didn't measure it.

Perhaps it would be nice to note that in the changelog.

> >> ---
> >>  net/sched/cls_u32.c | 108 
> >> 
> >>  1 file changed, 67 insertions(+), 41 deletions(-)
> >>
> >> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> >> index 10b8d851fc6b..316b8a791b13 100644
> >> --- a/net/sched/cls_u32.c
> >> +++ b/net/sched/cls_u32.c
> >> @@ -46,6 +46,7 @@
> >
> > ...
> >
> >> @@ -937,22 +940,33 @@ static int u32_change(struct net *net, struct 
> >> sk_buff *in_skb,
> >>   return -EINVAL;
> >>   if (TC_U32_KEY(handle))
> >>   return -EINVAL;
> >> - if (handle == 0) {
> >> - handle = gen_new_htid(tp->data);
> >> - if (handle == 0)
> >> - return -ENOMEM;
> >> - }
> >>   ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), 
> >> GFP_KERNEL);
> >>   if (ht == NULL)
> >>   return -ENOBUFS;
> >> + if (handle == 0) {
> >> + handle = gen_new_htid(tp->data, ht);
> >> + if (handle == 0) {
> >> + kfree(ht);
> >> + return -ENOMEM;
> >> + }
> >> + } else {
> >> + err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
> >> + handle, handle + 1, GFP_KERNEL);
> >> + if (err) {
> >> + kfree(ht);
> >> + return err;
> >> + }
> >
> > The above seems to check that handle is not already in use and mark it as
> > in use. But I don't see that logic in the code prior to this patch.
> > Am I missing something? If not perhaps this portion should be a separate
> > patch or described in the changelog.
> 
> The logic is in the upper layer, tc_ctl_tfilter(). It tries to get a
> filter by handle (if non-zero), and errors out if we are creating a new
> filter with the same handle.
> 
> At the point you quote above, 'n' is already NULL and 'handle' is non-zero,
> which means no existing filter has the same handle, so it is safe to just
> mark it as in-use.

Thanks for the clarification, that seems fine to me.

Reviewed-by: Simon Horman 



Re: [PATCH net-next v9] openvswitch: enable NSH support

2017-09-28 Thread Yang, Yi
On Fri, Sep 29, 2017 at 02:28:38AM +0800, Pravin Shelar wrote:
> On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi  wrote:
> > On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
> >> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
> >> > After push_nsh, the packet won't be recirculated to flow pipeline, so
> >> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
> >> > will be recirculated to flow pipeline, it will be reparsed, so
> >> > key->eth.type will be set in packet parse function, we needn't handle it
> >> > in pop_nsh.
> >>
> >> This seems to be a very different approach than what we currently have.
> >> Looking at the code, the requirement after "destructive" actions such
> >> as pushing or popping headers is to recirculate.
> >
> > This is an optimization proposed by Jan Scheurich; recirculating after push_nsh
> > will impact performance, while recirculating after pop_nsh is unavoidable, so
> > also cc jan.scheur...@ericsson.com.
> >
> > Actually, all the keys set before push_nsh are still there after push_nsh,
> > and push_nsh has updated all the NSH keys, so recirculating remains avoidable.
> >
> 
> 
> We should keep the existing model for this patch. Later you can submit
> an optimization patch with specific use cases and performance numbers,
> so that we can evaluate code complexity and benefits.

Ok, I'll remove the below line in push_nsh and send out v11, thanks.

key->eth.type = htons(ETH_P_NSH);

> 
> >>
> >> Setting key->eth.type to satisfy conditions in the output path without
> >> updating the rest of the key looks very hacky and fragile to me. There
> >> might be other conditions and dependencies that are not obvious.
> >> I don't think the code was written with such code path in mind.
> >>
> >> I'd like to hear what Pravin thinks about this.
> >>
> >>  Jiri


Re: [pull request][net 00/11] Mellanox, mlx5 fixes 2017-09-28

2017-09-28 Thread David Miller
From: Saeed Mahameed 
Date: Thu, 28 Sep 2017 07:41:21 +0300

> This series provides misc fixes for the mlx5 driver.
> 
> Please pull and let me know if there's any problem.

Pulled.

> for -stable:
>   net/mlx5e: IPoIB, Fix access to invalid memory address (Kernels >= 4.12)

Queued up for -stable, thanks.


Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower

2017-09-28 Thread Jiri Pirko
Thu, Sep 28, 2017 at 09:22:15PM CEST, amritha.namb...@intel.com wrote:
>On 9/14/2017 1:00 AM, Nambiar, Amritha wrote:
>> On 9/13/2017 6:26 AM, Jiri Pirko wrote:
>>> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.namb...@intel.com wrote:
 This patch enables tc-flower based hardware offloads. tc flower
 filter provided by the kernel is configured as driver specific
 cloud filter. The patch implements functions and admin queue
 commands needed to support cloud filters in the driver and
 adds cloud filters to configure these tc-flower filters.

 The only action supported is to redirect packets to a traffic class
 on the same device.
>>>
>>> So basically you are not doing redirect, you are just setting tclass for
>>> matched packets, right? Why do you use mirred for this? I think that
>>> you might consider extending g_act for that:
>>>
>>> # tc filter add dev eth0 protocol ip ingress \
>>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>>   action tclass 0
>>>
>> Yes, this doesn't work like a typical egress redirect, but is aimed at
>> forwarding the matched packets to a different queue-group/traffic class
>> on the same device, so some sort of ingress redirect in the hardware. I
>> possibly may not need the mirred redirect, as you say; I'll look into the
>> g_act way of doing this with a new gact tc action.
>> 
>
>I was looking at introducing a new gact tclass action to TC. In the HW
>offload path, this sets a traffic class value for certain matched
>packets so they will be processed in a queue belonging to the traffic class.
>
># tc filter add dev eth0 protocol ip parent :\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw\
>  action tclass 2
>
>But, I'm having trouble defining what this action means in the kernel
>datapath. For ingress, this action could just take the default path and
>do nothing and only have meaning in the HW offloaded path. For egress,

Sounds ok.


>certain qdiscs like 'multiq' and 'prio' could use this 'tclass' value
>for band selection, while the 'mqprio' qdisc selects the traffic class
>based on the skb priority in netdev_pick_tx(), so what would this action
>mean for the 'mqprio' qdisc?

I don't see why this action would have any special meaning for specific
qdiscs. The qdiscs already have mechanisms for band mapping. I don't see
why to mix that up with the tclass action.

Also, you can use tclass action on qdisc clsact egress to do band
mapping. That would be symmetrical with ingress.


>
>It looks like the 'prio' qdisc uses band selection based on the
>'classid', so I was thinking of using the 'classid' through the cls
>flower filter and offloading it to HW as the traffic class index; this
>way we would have the same behavior in HW offload and SW fallback, and
>there would be no need for a separate tc action.
>
>In HW:
># tc filter add dev eth0 protocol ip parent :\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_sw classid 1:2\
>
>filter pref 2 flower chain 0
>filter pref 2 flower chain 0 handle 0x1 classid 1:2
>  eth_type ipv4
>  ip_proto udp
>  dst_ip 192.168.3.5
>  dst_port 25
>  skip_sw
>  in_hw
>
>This will be used to route packets to traffic class 2.
>
>In SW:
># tc filter add dev eth0 protocol ip parent :\
>  prio 2 flower dst_ip 192.168.3.5/32\
>  ip_proto udp dst_port 25 skip_hw classid 1:2
>
>filter pref 2 flower chain 0
>filter pref 2 flower chain 0 handle 0x1 classid 1:2
>  eth_type ipv4
>  ip_proto udp
>  dst_ip 192.168.3.5
>  dst_port 25
>  skip_hw
>  not_in_hw
>
>>>

 # tc qdisc add dev eth0 ingress
 # ethtool -K eth0 hw-tc-offload on

 # tc filter add dev eth0 protocol ip parent :\
  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
  action mirred ingress redirect dev eth0 tclass 0

 # tc filter add dev eth0 protocol ip parent :\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw\
  action mirred ingress redirect dev eth0 tclass 1

 # tc filter add dev eth0 protocol ipv6 parent :\
  prio 3 flower dst_ip fe8::200:1\
  ip_proto udp dst_port 66 skip_sw\
  action mirred ingress redirect dev eth0 tclass 1

 Delete tc flower filter:
 Example:

 # tc filter del dev eth0 parent : prio 3 handle 0x1 flower
 # tc filter del dev eth0 parent :

 Flow Director Sideband is disabled while configuring cloud filters
 via tc-flower and until any cloud filter exists.

 Unsupported matches when cloud filters are added using enhanced
 big buffer cloud filter mode of underlying switch include:
 1. source port and source IP
 2. Combined MAC address and IP fields.
 3. Not specifying L4 port

 These filter matches can however be used to redirect traffic to
 the main VSI (tc 0) which does not require the enhanced big buffer
 cloud filter support.

 v3: Cleaned up some lengthy function names. Changed ipv6 address t

Re: [patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field

2017-09-28 Thread Jiri Pirko
Thu, Sep 28, 2017 at 07:49:03PM CEST, and...@lunn.ch wrote:
>On Thu, Sep 28, 2017 at 07:34:09PM +0200, Jiri Pirko wrote:
>> From: Yotam Gigi 
>> 
>> Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
>> used to allow partial offloading of MFC multicast routes.
>
>> The reason why the already existing "offload_fwd_mark" bit cannot be used
>> is that a switchdev driver would want to make the distinction between a
>> packet that has already gone through L2 forwarding but did not go through
>> multicast forwarding, and a packet that has already gone through both L2
>> and multicast forwarding.
>
>Hi Jiri
>
>So we are talking about l2 vs l3. So why not call this
>offload_l3_fwd_mark?
>
>Is there anything really specific to multicast here?

Currently it is; I'm not sure if it is going to be used for anything else
later on. In case it is, it could be renamed very easily.


>
>   Thanks
>  Andrew


Re: [PATCH v4 2/2] ip_tunnel: add mpls over gre encapsulation

2017-09-28 Thread Tom Herbert
On Thu, Sep 28, 2017 at 2:34 AM, Amine Kherbouche
 wrote:
> This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
> API.
>
> Encap:
>   - Add a new iptunnel type mpls.
>   - Share tx path: gre type mpls loaded from skb->protocol.
>
> Decap:
>   - pull gre hdr and call mpls_forward().
>
> Signed-off-by: Amine Kherbouche 
> Acked-by: Roopa Prabhu 
> ---
>  include/net/gre.h  |  1 +
>  include/uapi/linux/if_tunnel.h |  1 +
>  net/ipv4/gre_demux.c   | 27 +++
>  net/ipv4/ip_gre.c  |  3 +++
>  net/ipv6/ip6_gre.c |  3 +++
>  net/mpls/af_mpls.c | 36 
>  6 files changed, 71 insertions(+)
>
> diff --git a/include/net/gre.h b/include/net/gre.h
> index d25d836..aa3c4d3 100644
> --- a/include/net/gre.h
> +++ b/include/net/gre.h
> @@ -35,6 +35,7 @@ struct net_device *gretap_fb_dev_create(struct net *net, 
> const char *name,
>u8 name_assign_type);
>  int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>  bool *csum_err, __be16 proto, int nhs);
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len);
>
>  static inline int gre_calc_hlen(__be16 o_flags)
>  {
> diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
> index 2e52088..a2f48c0 100644
> --- a/include/uapi/linux/if_tunnel.h
> +++ b/include/uapi/linux/if_tunnel.h
> @@ -84,6 +84,7 @@ enum tunnel_encap_types {
> TUNNEL_ENCAP_NONE,
> TUNNEL_ENCAP_FOU,
> TUNNEL_ENCAP_GUE,
> +   TUNNEL_ENCAP_MPLS,
>  };
>
>  #define TUNNEL_ENCAP_FLAG_CSUM (1<<0)
> diff --git a/net/ipv4/gre_demux.c b/net/ipv4/gre_demux.c
> index b798862..40484a3 100644
> --- a/net/ipv4/gre_demux.c
> +++ b/net/ipv4/gre_demux.c
> @@ -23,6 +23,9 @@
>  #include 
>  #include 
>  #include 
> +#if IS_ENABLED(CONFIG_MPLS)
> +#include 
> +#endif
>  #include 
>  #include 
>
> @@ -122,6 +125,30 @@ int gre_parse_header(struct sk_buff *skb, struct 
> tnl_ptk_info *tpi,
>  }
>  EXPORT_SYMBOL(gre_parse_header);
>
> +#if IS_ENABLED(CONFIG_MPLS)
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +   if (unlikely(!pskb_may_pull(skb, gre_hdr_len)))
> +   goto drop;
> +
> +   /* Pop GRE hdr and reset the skb */
> +   skb_pull(skb, gre_hdr_len);
> +   skb_reset_network_header(skb);
> +

I don't see why MPLS/GRE needs to be a special case in gre_rcv. Can't
we just follow the normal processing path, which calls the proto ops
handler for the protocol in the GRE header? Also, if protocol-specific
code is added to the rcv function, that most likely means we need to
update the related offloads as well (granted, MPLS doesn't support
GRO, but it looks like it supports GSO). Additionally, we'd need to
consider whether flow dissector needs a similar special case (I will
point out that my recently posted patches there eliminated TEB as the
one special case in GRE dissection).

Thanks,
Tom

> +   return mpls_forward(skb, skb->dev, NULL, NULL);
> +drop:
> +   kfree_skb(skb);
> +   return NET_RX_DROP;
> +}
> +#else
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
> +{
> +   kfree_skb(skb);
> +   return NET_RX_DROP;
> +}
> +#endif
> +EXPORT_SYMBOL(mpls_gre_rcv);
> +
>  static int gre_rcv(struct sk_buff *skb)
>  {
> const struct gre_protocol *proto;
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 9cee986..7a50e4f 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -412,6 +412,9 @@ static int gre_rcv(struct sk_buff *skb)
> return 0;
> }
>
> +   if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
> +   return mpls_gre_rcv(skb, hdr_len);
> +
> if (ipgre_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
> return 0;
>
> diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
> index c82d41e..440efb1 100644
> --- a/net/ipv6/ip6_gre.c
> +++ b/net/ipv6/ip6_gre.c
> @@ -476,6 +476,9 @@ static int gre_rcv(struct sk_buff *skb)
> if (hdr_len < 0)
> goto drop;
>
> +   if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
> +   return mpls_gre_rcv(skb, hdr_len);
> +
> if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false))
> goto drop;
>
> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
> index 36ea2ad..4274243 100644
> --- a/net/mpls/af_mpls.c
> +++ b/net/mpls/af_mpls.c
> @@ -16,6 +16,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #if IS_ENABLED(CONFIG_IPV6)
>  #include 
> @@ -39,6 +40,36 @@ static int one = 1;
>  static int label_limit = (1 << 20) - 1;
>  static int ttl_max = 255;
>
> +#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
> +size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
> +{
> +   return sizeof(struct mpls_shim_hdr);
> +}
> +
> +static const struct ip_tunnel_encap_ops mpls_iptun_ops = {
> +   .encap_hlen = ipgre_mpls_enca

Re: [net-next PATCH 1/5] bpf: introduce new bpf cpu map type BPF_MAP_TYPE_CPUMAP

2017-09-28 Thread Alexei Starovoitov
On Thu, Sep 28, 2017 at 02:57:08PM +0200, Jesper Dangaard Brouer wrote:
> The 'cpumap' is primarily used as a backend map for the XDP BPF helper
> call bpf_redirect_map() and the XDP_REDIRECT action, like 'devmap'.
> 
> This patch implements the main part of the map.  It is not connected to
> the XDP redirect system yet, and no SKB allocations are done yet.
> 
> The main concern in this patch is to ensure the datapath can run
> without any locking.  This adds complexity to the setup and tear-down
> procedure, whose assumptions are extra carefully documented in the
> code comments.
> 
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  include/linux/bpf_types.h  |1 
>  include/uapi/linux/bpf.h   |1 
>  kernel/bpf/Makefile|1 
>  kernel/bpf/cpumap.c|  547 
> 
>  kernel/bpf/syscall.c   |8 +
>  tools/include/uapi/linux/bpf.h |1 
>  6 files changed, 558 insertions(+), 1 deletion(-)
>  create mode 100644 kernel/bpf/cpumap.c
> 
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 6f1a567667b8..814c1081a4a9 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -41,4 +41,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
>  #ifdef CONFIG_STREAM_PARSER
>  BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
>  #endif
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index e43491ac4823..f14e15702533 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -111,6 +111,7 @@ enum bpf_map_type {
>   BPF_MAP_TYPE_HASH_OF_MAPS,
>   BPF_MAP_TYPE_DEVMAP,
>   BPF_MAP_TYPE_SOCKMAP,
> + BPF_MAP_TYPE_CPUMAP,
>  };
>  
>  enum bpf_prog_type {
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 897daa005b23..dba0bd33a43c 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -4,6 +4,7 @@ obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o 
> helpers.o tnum.o
>  obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o 
> bpf_lru_list.o lpm_trie.o map_in_map.o
>  ifeq ($(CONFIG_NET),y)
>  obj-$(CONFIG_BPF_SYSCALL) += devmap.o
> +obj-$(CONFIG_BPF_SYSCALL) += cpumap.o
>  ifeq ($(CONFIG_STREAM_PARSER),y)
>  obj-$(CONFIG_BPF_SYSCALL) += sockmap.o
>  endif
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> new file mode 100644
> index ..f0948af82e65
> --- /dev/null
> +++ b/kernel/bpf/cpumap.c
> @@ -0,0 +1,547 @@
> +/* bpf/cpumap.c
> + *
> + * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
> + * Released under terms in GPL version 2.  See COPYING.
> + */
> +
> +/* The 'cpumap' is primarily used as a backend map for XDP BPF helper
> + * call bpf_redirect_map() and XDP_REDIRECT action, like 'devmap'.
> + *
> + * Unlike devmap, which redirects XDP frames out another NIC device,
> + * this map type redirects raw XDP frames to another CPU.  The remote
> + * CPU will do SKB-allocation and call the normal network stack.
> + *
> + * This is a scalability and isolation mechanism that allows
> + * separating the early driver network XDP layer from the rest of the
> + * netstack, and assigning dedicated CPUs for this stage.  This
> + * basically allows for 10G wirespeed pre-filtering via bpf.
> + */
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +/*
> + * General idea: XDP packets that get redirected to another CPU
> + * will at most be stored/queued for one driver ->poll() call.  It is
> + * guaranteed that setting the flush bit and the flush operation happen
> + * on the same CPU.  Thus, the cpu_map_flush operation can deduce via
> + * this_cpu_ptr() which queue in bpf_cpu_map_entry contains packets.
> + */
> +
> +#define CPU_MAP_BULK_SIZE 8  /* 8 == one cacheline on 64-bit archs */
> +struct xdp_bulk_queue {
> + void *q[CPU_MAP_BULK_SIZE];
> + unsigned int count;
> +};
> +
> +/* Struct for every remote "destination" CPU in map */
> +struct bpf_cpu_map_entry {
> + u32 cpu;/* kthread CPU and map index */
> + int map_id; /* Back reference to map */
> + u32 qsize;  /* Redundant queue size for map lookup */
> +
> + /* XDP can run multiple RX-ring queues, need __percpu enqueue store */
> + struct xdp_bulk_queue __percpu *bulkq;
> +
> + /* Queue with potential multi-producers, and single-consumer kthread */
> + struct ptr_ring *queue;
> + struct task_struct *kthread;
> + struct work_struct kthread_stop_wq;
> +
> + atomic_t refcnt; /* Control when this struct can be free'ed */
> + struct rcu_head rcu;
> +};
> +
> +struct bpf_cpu_map {
> + struct bpf_map map;
> + /* Below members specific for map type */
> + struct bpf_cpu_map_entry **cpu_map;
> + unsigned long __percpu *flush_needed;
> +};
> +
> +static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
> +  struct xdp_bulk_queue *bq);
> +
> +static u64 cpu_map_bitmap_


Re: [lkp-robot] [mac80211] 31e9170bde: hwsim.sta_dynamic_down_up.fail

2017-09-28 Thread Xiang Gao
Thanks, I will look into it.
Xiang Gao


2017-09-28 4:06 GMT-04:00 kernel test robot :
>
> FYI, we noticed the following commit:
>
> commit: 31e9170bdeb6ebe66426337b4e2b9924683a412b ("mac80211: aead api to 
> reduce redundancy")
> url: 
> https://github.com/0day-ci/linux/commits/Xiang-Gao/mac80211-aead-api-to-reduce-redundancy/20170926-053110
> base: https://git.kernel.org/cgit/linux/kernel/git/jberg/mac80211-next.git 
> master
>
> in testcase: hwsim
> with following parameters:
>
> group: hwsim-10
>
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu host -smp 2 -m 2G
>
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
>
>
> 2017-09-27 16:04:27 ./run-tests.py sta_dynamic_down_up
> DEV: wlan0: 02:00:00:00:00:00
> DEV: wlan1: 02:00:00:00:01:00
> DEV: wlan2: 02:00:00:00:02:00
> APDEV: wlan3
> APDEV: wlan4
> START sta_dynamic_down_up 1/1
> Test: Dynamically added wpa_supplicant interface down/up
> Starting AP wlan3
> Create a dynamic wpa_supplicant interface and connect
> Connect STA wlan5 to AP
> dev1->dev2 unicast data delivery failed
> Traceback (most recent call last):
>   File "./run-tests.py", line 453, in main
> t(dev, apdev)
>   File "/lkp/benchmarks/hwsim/tests/hwsim/test_sta_dynamic.py", line 122, in 
> test_sta_dynamic_down_up
> hwsim_utils.test_connectivity(wpas, hapd)
>   File "/lkp/benchmarks/hwsim/tests/hwsim/hwsim_utils.py", line 165, in 
> test_connectivity
> raise Exception(last_err)
> Exception: dev1->dev2 unicast data delivery failed
> FAIL sta_dynamic_down_up 5.397413 2017-09-27 16:04:32.540689
> passed 0 test case(s)
> skipped 0 test case(s)
> failed tests: sta_dynamic_down_up
>
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k  job-script  # job-script is attached in 
> this email
>
>
>
> Thanks,
> Xiaolong


Re: linux-next: build failure after merge of the net-next tree

2017-09-28 Thread Florian Fainelli
On 09/28/17 at 18:36, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the net-next tree, today's linux-next build (arm
> multi_v7_defconfig) failed like this:
> 
> net/dsa/slave.c: In function 'dsa_slave_create':
> net/dsa/slave.c:1191:18: error: 'struct dsa_slave_priv' has no member named 
> 'phy'
>   phy_disconnect(p->phy);
>   ^
> 
> Caused by commit
> 
>   0115dcd1787d ("net: dsa: use slave device phydev")
> 
> Interacting with commit
> 
>   e804441cfe0b ("net: dsa: Fix network device registration order")
> 
> from the net tree.
> 
> I applied the following merge fix patch (which I am not sure about):

Your resolution looks fine to me, thanks Stephen!

> 
> From: Stephen Rothwell 
> Date: Fri, 29 Sep 2017 11:28:45 +1000
> Subject: [PATCH] net: dsa: merge fix patch for removal of phy
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  net/dsa/slave.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 8869954485db..9191c929c6c8 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -1188,7 +1188,7 @@ int dsa_slave_create(struct dsa_port *port, const char 
> *name)
>   return 0;
>  
>  out_phy:
> - phy_disconnect(p->phy);
> + phy_disconnect(slave_dev->phydev);
>   if (of_phy_is_fixed_link(p->dp->dn))
>   of_phy_deregister_fixed_link(p->dp->dn);
>  out_free:
> 


-- 
Florian


linux-next: build failure after merge of the net-next tree

2017-09-28 Thread Stephen Rothwell
Hi all,

After merging the net-next tree, today's linux-next build (arm
multi_v7_defconfig) failed like this:

net/dsa/slave.c: In function 'dsa_slave_create':
net/dsa/slave.c:1191:18: error: 'struct dsa_slave_priv' has no member named 
'phy'
  phy_disconnect(p->phy);
  ^

Caused by commit

  0115dcd1787d ("net: dsa: use slave device phydev")

Interacting with commit

  e804441cfe0b ("net: dsa: Fix network device registration order")

from the net tree.

I applied the following merge fix patch (which I am not sure about):

From: Stephen Rothwell 
Date: Fri, 29 Sep 2017 11:28:45 +1000
Subject: [PATCH] net: dsa: merge fix patch for removal of phy

Signed-off-by: Stephen Rothwell 
---
 net/dsa/slave.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 8869954485db..9191c929c6c8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1188,7 +1188,7 @@ int dsa_slave_create(struct dsa_port *port, const char 
*name)
return 0;
 
 out_phy:
-   phy_disconnect(p->phy);
+   phy_disconnect(slave_dev->phydev);
if (of_phy_is_fixed_link(p->dp->dn))
of_phy_deregister_fixed_link(p->dp->dn);
 out_free:
-- 
2.14.1

-- 
Cheers,
Stephen Rothwell


Re: [PATCH V4] r8152: add Linksys USB3GIGV1 id

2017-09-28 Thread Doug Anderson
Grant,

On Thu, Sep 28, 2017 at 11:35 AM, Grant Grundler  wrote:
> This linksys dongle by default comes up in cdc_ether mode.
> This patch allows r8152 to claim the device:
>Bus 002 Device 002: ID 13b1:0041 Linksys
>
> Signed-off-by: Grant Grundler 
> ---
>  drivers/net/usb/cdc_ether.c | 10 ++
>  drivers/net/usb/r8152.c |  2 ++
>  2 files changed, 12 insertions(+)

This seems nice to me now.  Thanks for all the fixes!  I'm no expert
in this area, but as far as I know this is ready to go now, so FWIW:

Reviewed-by: Douglas Anderson 


[PATCH v4 net-next 7/8] fou: Support flow dissection

2017-09-28 Thread Tom Herbert
Populate the offload flow_dissect callback appropriately for fou and gue.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 63 ++
 1 file changed, 63 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 1540db65241a..a831dd49fb28 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -282,6 +282,20 @@ static int fou_gro_complete(struct sock *sk, struct 
sk_buff *skb,
return err;
 }
 
+static enum flow_dissect_ret fou_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   *p_ip_proto = fou_from_sock(sk)->protocol;
+   *p_nhoff += sizeof(struct udphdr);
+
+   return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off,
  struct guehdr *guehdr, void *data,
  size_t hdrlen, struct gro_remcsum *grc,
@@ -500,6 +514,53 @@ static int gue_gro_complete(struct sock *sk, struct 
sk_buff *skb, int nhoff)
return err;
 }
 
+static enum flow_dissect_ret gue_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   struct guehdr *guehdr, _guehdr;
+
+   guehdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+ sizeof(_guehdr), data, *p_hlen, &_guehdr);
+   if (!guehdr)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   switch (guehdr->version) {
+   case 0:
+   if (unlikely(guehdr->control))
+   return FLOW_DISSECT_RET_CONTINUE;
+
+   *p_ip_proto = guehdr->proto_ctype;
+   *p_nhoff += sizeof(struct udphdr) +
+   sizeof(*guehdr) + (guehdr->hlen << 2);
+
+   break;
+   case 1:
+   switch (((struct iphdr *)guehdr)->version) {
+   case 4:
+   *p_ip_proto = IPPROTO_IPIP;
+   break;
+   case 6:
+   *p_ip_proto = IPPROTO_IPV6;
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   *p_nhoff += sizeof(struct udphdr);
+
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static int fou_add_to_port_list(struct net *net, struct fou *fou)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
@@ -570,12 +631,14 @@ static int fou_create(struct net *net, struct fou_cfg 
*cfg,
tunnel_cfg.encap_rcv = fou_udp_recv;
tunnel_cfg.gro_receive = fou_gro_receive;
tunnel_cfg.gro_complete = fou_gro_complete;
+   tunnel_cfg.flow_dissect = fou_flow_dissect;
fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
tunnel_cfg.encap_rcv = gue_udp_recv;
tunnel_cfg.gro_receive = gue_gro_receive;
tunnel_cfg.gro_complete = gue_gro_complete;
+   tunnel_cfg.flow_dissect = gue_flow_dissect;
break;
default:
err = -EINVAL;
-- 
2.11.0



[PATCH v4 net-next 6/8] udp: flow dissector offload

2017-09-28 Thread Tom Herbert
Add support to perform UDP-specific flow dissection. This is
primarily intended for dissecting the inner packets of UDP
encapsulations.

This patch adds a flow_dissect offload for UDP4 and UDP6. The backend
function performs a socket lookup and calls the flow_dissect function
if a socket is found.

Signed-off-by: Tom Herbert 
---
 include/linux/udp.h  |  8 
 include/net/udp.h|  8 
 include/net/udp_tunnel.h |  8 
 net/ipv4/udp_offload.c   | 48 
 net/ipv4/udp_tunnel.c|  1 +
 net/ipv6/udp_offload.c   | 16 
 6 files changed, 89 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..2e90b189ef6a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,14 @@ struct udp_sock {
int (*gro_complete)(struct sock *sk,
struct sk_buff *skb,
int nhoff);
+   /* Flow dissector function for a UDP socket */
+   enum flow_dissect_ret (*flow_dissect)(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
/* udp_recvmsg try to use this before splicing sk_receive_queue */
struct sk_buff_head reader_queue cacheline_aligned_in_smp;
diff --git a/include/net/udp.h b/include/net/udp.h
index c6b1c5d8d3c9..4867f329538c 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,6 +176,14 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb,
 struct udphdr *uh, udp_lookup_t lookup);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
+enum flow_dissect_ret udp_flow_dissect(struct sk_buff *skb,
+   udp_lookup_t lookup,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
+
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
struct udphdr *uh;
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 10cce0dd4450..b7102e0f41a9 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -69,6 +69,13 @@ typedef struct sk_buff **(*udp_tunnel_gro_receive_t)(struct 
sock *sk,
 struct sk_buff *skb);
 typedef int (*udp_tunnel_gro_complete_t)(struct sock *sk, struct sk_buff *skb,
 int nhoff);
+typedef enum flow_dissect_ret (*udp_tunnel_flow_dissect_t)(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
 struct udp_tunnel_sock_cfg {
void *sk_user_data; /* user data used by encap_rcv call back */
@@ -78,6 +85,7 @@ struct udp_tunnel_sock_cfg {
udp_tunnel_encap_destroy_t encap_destroy;
udp_tunnel_gro_receive_t gro_receive;
udp_tunnel_gro_complete_t gro_complete;
+   udp_tunnel_flow_dissect_t flow_dissect;
 };
 
 /* Setup the given (UDP) sock to receive UDP encapsulated packets */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index a744bb515455..fddf923ef433 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -335,11 +335,59 @@ static int udp4_gro_complete(struct sk_buff *skb, int 
nhoff)
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
 
+enum flow_dissect_ret udp_flow_dissect(struct sk_buff *skb,
+   udp_lookup_t lookup,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   struct udphdr *uh, _uh;
+   struct sock *sk;
+
+   uh = __skb_header_pointer(skb, *p_nhoff, sizeof(_uh), data,
+ *p_hlen, &_uh);
+   if (!uh)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   rcu_r

[PATCH v4 net-next 4/8] flow_dissector: Add protocol specific flow dissection offload

2017-09-28 Thread Tom Herbert
Add offload capability for performing protocol specific flow dissection
(either by EtherType or IP protocol).

Specifically:

- Add flow_dissect to offload callbacks
- Move flow_dissect_ret enum to flow_dissector.h, cleanup names and add a
  couple of values
- Unify handling of functions that return flow_dissect_ret enum
- In __skb_flow_dissect, add default case for switch(proto) as well as
  switch(ip_proto) that looks up and calls protocol specific flow
  dissection

Signed-off-by: Tom Herbert 
---
 include/linux/netdevice.h| 27 ++
 include/net/flow_dissector.h |  1 +
 net/core/dev.c   | 65 
 net/core/flow_dissector.c| 16 +--
 net/ipv4/route.c |  4 ++-
 5 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 06b173200e23..f186b6ab480a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2207,12 +2207,25 @@ struct offload_callbacks {
struct sk_buff  **(*gro_receive)(struct sk_buff **head,
 struct sk_buff *skb);
int (*gro_complete)(struct sk_buff *skb, int nhoff);
+   enum flow_dissect_ret (*flow_dissect)(struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 };
 
 struct packet_offload {
__be16   type;  /* This is really htons(ether_type). */
u16  priority;
struct offload_callbacks callbacks;
+   enum flow_dissect_ret (*proto_flow_dissect)(struct sk_buff *skb,
+   u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
struct list_head list;
 };
 
@@ -3252,6 +3265,20 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
+enum flow_dissect_ret flow_dissect_by_type(struct sk_buff *skb,
+   __be16 type,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
+enum flow_dissect_ret flow_dissect_by_type_proto(struct sk_buff *skb,
+   __be16 type, u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index fc3dce730a6b..ad75bbfd1c9c 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -213,6 +213,7 @@ enum flow_dissector_key_id {
 #define FLOW_DISSECTOR_F_STOP_AT_L3		BIT(1)
 #define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	BIT(2)
 #define FLOW_DISSECTOR_F_STOP_AT_ENCAP		BIT(3)
+#define FLOW_DISSECTOR_F_STOP_AT_L4		BIT(4)
 
 struct flow_dissector_key {
enum flow_dissector_key_id key_id;
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..f3cd884bd04b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -104,6 +104,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -4907,6 +4908,70 @@ struct packet_offload *gro_find_complete_by_type(__be16 
type)
 }
 EXPORT_SYMBOL(gro_find_complete_by_type);
 
+enum flow_dissect_ret flow_dissect_by_type(struct sk_buff *skb,
+   __be16 type,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   struct list_head *offload_head = &offload_base;
+   struct packet_offload *ptype;
+
+ 

Re: [PATCH V2] r8152: add Linksys USB3GIGV1 id

2017-09-28 Thread Doug Anderson
Hi,

On Thu, Sep 28, 2017 at 3:28 PM, Rustad, Mark D  wrote:
>
>> On Sep 27, 2017, at 9:39 AM, Grant Grundler  wrote:
>>
>> On Wed, Sep 27, 2017 at 12:15 AM, Oliver Neukum  wrote:
>>> Am Dienstag, den 26.09.2017, 08:19 -0700 schrieb Doug Anderson:

 I know that for at least some of the adapters in the CDC Ethernet
 blacklist it was claimed that the CDC Ethernet support in the adapter
 was kinda broken anyway so the blacklist made sense.  ...but for the
 Linksys Gigabit adapter the CDC Ethernet driver seems to work OK, it's
 just not quite as full featured / efficient as the R8152 driver.

 Is that not a concern?  I guess you could tell people in this
 situation that they simply need to enable the R8152 driver to get
 continued support for their Ethernet adapter?
>>>
>>> Hi,
>>>
>>> yes, it is a valid concern. An #ifdef will be needed.
>>
>> Good idea - I will post V3 shortly.
>>
>> I'm assuming you mean to add #ifdef CONFIG_USB_RTL8152 around the
>> blacklist entry in cdc_ether driver.
>
> Shouldn't that be an #if IS_ENABLED(...) test, since that seems to be the 
> proper way to check configured drivers.

Yes, I had the same feedback on v3.  See my comments at
.  Grant has fixed it in
v4.  Please see .  :)

-Doug


[PATCH v4 net-next 5/8] ip: Add callbacks to flow dissection by IP protocol

2017-09-28 Thread Tom Herbert
Populate the proto_flow_dissect function for IPv4 and IPv6 packet
offloads. This allows the caller to flow dissect a packet starting
at the given IP protocol (as parsed to that point by flow dissector
for instance).

Signed-off-by: Tom Herbert 
---
 net/ipv4/af_inet.c | 27 +++
 net/ipv6/ip6_offload.c | 27 +++
 2 files changed, 54 insertions(+)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..18c1d884999a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1440,6 +1440,32 @@ static struct sk_buff **ipip_gro_receive(struct sk_buff 
**head,
return inet_gro_receive(head, skb);
 }
 
+static enum flow_dissect_ret inet_proto_flow_dissect(struct sk_buff *skb,
+   u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   const struct net_offload *ops;
+
+   rcu_read_lock();
+
+   ops = rcu_dereference(inet_offloads[proto]);
+   if (ops && ops->callbacks.flow_dissect)
+   ret =  ops->callbacks.flow_dissect(skb, key_control,
+  flow_dissector,
+  target_container,
+  data, p_proto, p_ip_proto,
+  p_nhoff, p_hlen, flags);
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
 #define SECONDS_PER_DAY	86400
 
 /* inet_current_timestamp - Return IP network timestamp
@@ -1763,6 +1789,7 @@ static int ipv4_proc_init(void);
 
 static struct packet_offload ip_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IP),
+   .proto_flow_dissect = inet_proto_flow_dissect,
.callbacks = {
.gso_segment = inet_gso_segment,
.gro_receive = inet_gro_receive,
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index cdb3728faca7..a33a2b40b3d6 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -339,8 +339,35 @@ static int ip4ip6_gro_complete(struct sk_buff *skb, int 
nhoff)
return inet_gro_complete(skb, nhoff);
 }
 
+static enum flow_dissect_ret inet6_proto_flow_dissect(struct sk_buff *skb,
+   u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   const struct net_offload *ops;
+
+   rcu_read_lock();
+
+   ops = rcu_dereference(inet6_offloads[proto]);
+   if (ops && ops->callbacks.flow_dissect)
+   ret =  ops->callbacks.flow_dissect(skb, key_control,
+  flow_dissector,
+  target_container, data,
+  p_proto, p_ip_proto, p_nhoff,
+  p_hlen, flags);
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
+   .proto_flow_dissect = inet6_proto_flow_dissect,
.callbacks = {
.gso_segment = ipv6_gso_segment,
.gro_receive = ipv6_gro_receive,
-- 
2.11.0



[PATCH v4 net-next 8/8] vxlan: support flow dissect

2017-09-28 Thread Tom Herbert
Populate offload flow_dissect callback appropriately for VXLAN and
VXLAN-GPE.

Signed-off-by: Tom Herbert 
---
 drivers/net/vxlan.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index d7c49cf1d5e9..80227050b2d4 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1327,6 +1327,45 @@ static bool vxlan_ecn_decapsulate(struct vxlan_sock *vs, 
void *oiph,
return err <= 1;
 }
 
+static enum flow_dissect_ret vxlan_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   __be16 protocol = htons(ETH_P_TEB);
+   struct vxlanhdr *vhdr, _vhdr;
+   struct vxlan_sock *vs;
+
+   vhdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+   sizeof(_vhdr), data, *p_hlen, &_vhdr);
+   if (!vhdr)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   vs = rcu_dereference_sk_user_data(sk);
+   if (!vs)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   if (vs->flags & VXLAN_F_GPE) {
+   struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)vhdr;
+
+   /* Need to have Next Protocol set for interfaces in GPE mode. */
+   if (gpe->version != 0 || !gpe->np_applied || gpe->oam_flag)
+   return FLOW_DISSECT_RET_CONTINUE;
+
+   protocol = tun_p_from_eth_p(gpe->next_protocol);
+   if (!protocol)
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   *p_nhoff += sizeof(struct udphdr) + sizeof(_vhdr);
+   *p_proto = protocol;
+
+   return FLOW_DISSECT_RET_PROTO_AGAIN;
+}
+
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 {
@@ -2846,6 +2885,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net 
*net, bool ipv6,
tunnel_cfg.encap_destroy = NULL;
tunnel_cfg.gro_receive = vxlan_gro_receive;
tunnel_cfg.gro_complete = vxlan_gro_complete;
+   tunnel_cfg.flow_dissect = vxlan_flow_dissect;
 
setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 
-- 
2.11.0



[PATCH v4 net-next 3/8] udp: Check static key udp_encap_needed in udp_gro_receive

2017-09-28 Thread Tom Herbert
Currently, the only support for udp gro is provided by UDP encapsulation
protocols. Since they always set udp_encap_needed we can check that in
udp_gro_receive functions before performing a socket lookup.

Signed-off-by: Tom Herbert 
---
 include/net/udp.h  | 2 ++
 net/ipv4/udp.c | 4 +++-
 net/ipv4/udp_offload.c | 7 +++
 net/ipv6/udp_offload.c | 7 +++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 12dfbfe2e2d7..c6b1c5d8d3c9 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -97,6 +97,8 @@ static inline struct udp_hslot *udp_hashslot2(struct 
udp_table *table,
 
 extern struct proto udp_prot;
 
+extern struct static_key udp_encap_needed;
+
 extern atomic_long_t udp_memory_allocated;
 
 /* sysctl variables for udp */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..2788843e8eb2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1813,7 +1813,9 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return 0;
 }
 
-static struct static_key udp_encap_needed __read_mostly;
+struct static_key udp_encap_needed __read_mostly;
+EXPORT_SYMBOL(udp_encap_needed);
+
 void udp_encap_enable(void)
 {
static_key_enable(&udp_encap_needed);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 97658bfc1b58..a744bb515455 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -261,6 +261,13 @@ static struct sk_buff **udp4_gro_receive(struct sk_buff 
**head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);
 
+   if (!static_key_false(&udp_encap_needed)) {
+   /* Currently udp_gro_receive only does something if
+* a UDP encapsulation has been set.
+*/
+   goto flush;
+   }
+
if (unlikely(!uh))
goto flush;
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 455fd4e39333..111b026e4f03 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -34,6 +34,13 @@ static struct sk_buff **udp6_gro_receive(struct sk_buff 
**head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);
 
+   if (!static_key_false(&udp_encap_needed)) {
+   /* Currently udp_gro_receive only does something if
+* a UDP encapsulation has been set.
+*/
+   goto flush;
+   }
+
if (unlikely(!uh))
goto flush;
 
-- 
2.11.0



[PATCH v4 net-next 2/8] flow_dissector: Move ETH_P_TEB processing to main switch

2017-09-28 Thread Tom Herbert
Support for processing TEB is currently in GRE flow dissection as a
special case. This can be moved to a case in the main proto switch in
__skb_flow_dissect.

Signed-off-by: Tom Herbert 
---
 net/core/flow_dissector.c | 45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 76f5e5bc3177..c15b41f96cbe 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -282,27 +282,8 @@ __skb_flow_dissect_gre(const struct sk_buff *skb,
if (hdr->flags & GRE_SEQ)
offset += sizeof(((struct pptp_gre_header *) 0)->seq);
 
-   if (gre_ver == 0) {
-   if (*p_proto == htons(ETH_P_TEB)) {
-   const struct ethhdr *eth;
-   struct ethhdr _eth;
-
-   eth = __skb_header_pointer(skb, *p_nhoff + offset,
-  sizeof(_eth),
-  data, *p_hlen, &_eth);
-   if (!eth)
-   return FLOW_DISSECT_RET_OUT_BAD;
-   *p_proto = eth->h_proto;
-   offset += sizeof(*eth);
-
-   /* Cap headers that we access via pointers at the
-* end of the Ethernet header as our maximum alignment
-* at that point is only 2 bytes.
-*/
-   if (NET_IP_ALIGN)
-   *p_hlen = *p_nhoff + offset;
-   }
-   } else { /* version 1, must be PPTP */
+   /* version 1, must be PPTP */
+   if (gre_ver == 1) {
u8 _ppp_hdr[PPP_HDRLEN];
u8 *ppp_hdr;
 
@@ -595,6 +576,28 @@ bool __skb_flow_dissect(struct sk_buff *skb,
 
break;
}
+   case htons(ETH_P_TEB): {
+   const struct ethhdr *eth;
+   struct ethhdr _eth;
+
+   eth = __skb_header_pointer(skb, nhoff, sizeof(_eth),
+  data, hlen, &_eth);
+   if (!eth)
+   goto out_bad;
+
+   proto = eth->h_proto;
+   nhoff += sizeof(*eth);
+
+   /* Cap headers that we access via pointers at the
+* end of the Ethernet header as our maximum alignment
+* at that point is only 2 bytes.
+*/
+   if (NET_IP_ALIGN)
+   hlen = nhoff;
+
+   fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
+   break;
+   }
case htons(ETH_P_8021AD):
case htons(ETH_P_8021Q): {
const struct vlan_hdr *vlan;
-- 
2.11.0



[PATCH v4 net-next 1/8] flow_dissector: Change skbuf argument to be non const

2017-09-28 Thread Tom Herbert
Change the skb argument of __skb_flow_dissect to be non-constant so
that the function can call functions that take non-constant skb
arguments. This is needed if we are to call socket lookup or BPF in the
flow dissector path.

The changes include unraveling the call chain into __skb_flow_dissect so
that those callers also use non-constant skbs.

Signed-off-by: Tom Herbert 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.c   |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede.h   |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c|  2 +-
 drivers/net/ethernet/sfc/efx.h|  2 +-
 drivers/net/ethernet/sfc/falcon/efx.h |  2 +-
 drivers/net/ethernet/sfc/falcon/rx.c  |  2 +-
 drivers/net/ethernet/sfc/rx.c |  2 +-
 include/linux/netdevice.h |  4 ++--
 include/linux/skbuff.h| 12 ++--
 include/net/ip_fib.h  |  4 ++--
 include/net/route.h   |  4 ++--
 net/core/flow_dissector.c | 10 +-
 net/ipv4/fib_semantics.c  |  2 +-
 net/ipv4/route.c  |  6 +++---
 net/sched/sch_sfq.c   |  2 +-
 20 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 5ba49938ba55..29f5cf6bea4a 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7344,7 +7344,7 @@ static bool bnxt_fltr_match(struct bnxt_ntuple_filter *f1,
return false;
 }
 
-static int bnxt_rx_flow_steer(struct net_device *dev, const struct sk_buff 
*skb,
+static int bnxt_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
  u16 rxq_index, u32 flow_id)
 {
struct bnxt *bp = netdev_priv(dev);
diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.c 
b/drivers/net/ethernet/cisco/enic/enic_clsf.c
index 3c677ed3c29e..7ee2aa1c3184 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.c
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.c
@@ -167,7 +167,7 @@ static struct enic_rfs_fltr_node *htbl_key_search(struct 
hlist_head *h,
return NULL;
 }
 
-int enic_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int enic_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
   u16 rxq_index, u32 flow_id)
 {
struct flow_keys keys;
diff --git a/drivers/net/ethernet/cisco/enic/enic_clsf.h 
b/drivers/net/ethernet/cisco/enic/enic_clsf.h
index 4bfbf25f9ddc..0e7f533f81b9 100644
--- a/drivers/net/ethernet/cisco/enic/enic_clsf.h
+++ b/drivers/net/ethernet/cisco/enic/enic_clsf.h
@@ -13,7 +13,7 @@ void enic_rfs_flw_tbl_free(struct enic *enic);
 struct enic_rfs_fltr_node *htbl_fltr_search(struct enic *enic, u16 fltr_id);
 
 #ifdef CONFIG_RFS_ACCEL
-int enic_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int enic_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
   u16 rxq_index, u32 flow_id);
 void enic_flow_may_expire(unsigned long data);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 9c218f1cfc6c..9f7afbfb09f9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -348,7 +348,7 @@ mlx4_en_filter_find(struct mlx4_en_priv *priv, __be32 
src_ip, __be32 dst_ip,
 }
 
 static int
-mlx4_en_filter_rfs(struct net_device *net_dev, const struct sk_buff *skb,
+mlx4_en_filter_rfs(struct net_device *net_dev, struct sk_buff *skb,
   u16 rxq_index, u32 flow_id)
 {
struct mlx4_en_priv *priv = netdev_priv(net_dev);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index cc13d3dbd366..897c9d46702c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -1017,7 +1017,7 @@ int mlx5e_arfs_create_tables(struct mlx5e_priv *priv);
 void mlx5e_arfs_destroy_tables(struct mlx5e_priv *priv);
 int mlx5e_arfs_enable(struct mlx5e_priv *priv);
 int mlx5e_arfs_disable(struct mlx5e_priv *priv);
-int mlx5e_rx_flow_steer(struct net_device *dev, const struct sk_buff *skb,
+int mlx5e_rx_flow_steer(struct net_device *dev, struct sk_buff *skb,
u16 rxq_index, u32 flow_id);
 #endif
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c
index 12d3ced61114..f5e182bd613d 100644
--- a/drivers/net/ethernet/mel

[PATCH v4 net-next 0/8] flow_dissector: Protocol specific flow dissector offload

2017-09-28 Thread Tom Herbert
This patch set adds a new offload type to perform flow dissection for
specific protocols (either by EtherType or by IP protocol). This is
primarily useful to crack open UDP encapsulations (like VXLAN, GUE) for
the purposes of parsing the encapsulated packet.

Items in this patch set:
- Create new protocol case in __skb_flow_dissect for ETH_P_TEB. This is based
  on the code in the GRE dissect function and the special handling in
  GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
  goto proto_again is done)
- Add infrastructure for protocol specific flow dissection offload
- Add infrastructure to perform UDP flow dissection. Uses same model of
  GRO where a flow_dissect callback can be associated with a UDP
  socket
- Use the infrastructure to support flow dissection of VXLAN and GUE

Tested:

Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
that inner packet was being properly dissected.

v2: Add signed off

v3:
   - Make skb argument of flow dissector to be non const
   - Change UDP GRO to only do something if encap_needed static
 key is set
   - don't reference inet6_offloads or inet_offloads, get to
 them through ptype

v4:
   - skb argument to ndo_rx_flow_steer also needs to become
 non constant

Tom Herbert (8):
  flow_dissector: Change skbuf argument to be non const
  flow_dissector: Move ETH_P_TEB processing to main switch
  udp: Check static key udp_encap_needed in udp_gro_receive
  flow_dissector: Add protocol specific flow dissection offload
  ip: Add callbacks to flow dissection by IP protocol
  udp: flow dissector offload
  fou: Support flow dissection
  vxlan: support flow dissect

 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.c   |  2 +-
 drivers/net/ethernet/cisco/enic/enic_clsf.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c|  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_arfs.c |  2 +-
 drivers/net/ethernet/qlogic/qede/qede.h   |  2 +-
 drivers/net/ethernet/qlogic/qede/qede_filter.c|  2 +-
 drivers/net/ethernet/sfc/efx.h|  2 +-
 drivers/net/ethernet/sfc/falcon/efx.h |  2 +-
 drivers/net/ethernet/sfc/falcon/rx.c  |  2 +-
 drivers/net/ethernet/sfc/rx.c |  2 +-
 drivers/net/vxlan.c   | 40 +
 include/linux/netdevice.h | 31 +-
 include/linux/skbuff.h| 12 ++--
 include/linux/udp.h   |  8 +++
 include/net/flow_dissector.h  |  1 +
 include/net/ip_fib.h  |  4 +-
 include/net/route.h   |  4 +-
 include/net/udp.h | 10 
 include/net/udp_tunnel.h  |  8 +++
 net/core/dev.c| 65 +
 net/core/flow_dissector.c | 71 ++-
 net/ipv4/af_inet.c| 27 +
 net/ipv4/fib_semantics.c  |  2 +-
 net/ipv4/fou.c| 63 
 net/ipv4/route.c  | 10 ++--
 net/ipv4/udp.c|  4 +-
 net/ipv4/udp_offload.c| 55 ++
 net/ipv4/udp_tunnel.c |  1 +
 net/ipv6/ip6_offload.c| 27 +
 net/ipv6/udp_offload.c| 23 
 net/sched/sch_sfq.c   |  2 +-
 33 files changed, 433 insertions(+), 59 deletions(-)

-- 
2.11.0



Re: [REGRESSION] Warning in tcp_fastretrans_alert() of net/ipv4/tcp_input.c

2017-09-28 Thread Yuchung Cheng
On Thu, Sep 28, 2017 at 1:14 AM, Oleksandr Natalenko
 wrote:
> Hi.
>
> Won't tell about panic in tcp_sacktag_walk() since I cannot trigger it
> intentionally, but setting net.ipv4.tcp_retrans_collapse to 0 *does not* fix
> warning in tcp_fastretrans_alert() for me.

Hi Oleksandr: no, retrans_collapse should not matter for that warning
in tcp_fastretrans_alert(). The warning, as I explained earlier, is
likely a false positive. Neal and I are more concerned about the panic in
tcp_sacktag_walk(). This was just a blind shot, but thanks for retrying.

We can submit a one-liner to remove the fast retrans warning but want
to nail the bigger issue first.

>
> On středa 27. září 2017 2:18:32 CEST Yuchung Cheng wrote:
>> On Tue, Sep 26, 2017 at 5:12 PM, Yuchung Cheng  wrote:
>> > On Tue, Sep 26, 2017 at 6:10 AM, Roman Gushchin  wrote:
>> >>> On Wed, Sep 20, 2017 at 6:46 PM, Roman Gushchin  wrote:
>> >>> > > Hello.
>> >>> > >
>> >>> > > Since, IIRC, v4.11, there is some regression in TCP stack resulting
>> >>> > > in the
>> >>> > > warning shown below. Most of the time it is harmless, but rarely it
>> >>> > > just
>> >>> > > causes either freeze or (I believe, this is related too) panic in
>> >>> > > tcp_sacktag_walk() (because sk_buff passed to this function is
>> >>> > > NULL).
>> >>> > > Unfortunately, I still do not have proper stacktrace from panic, but
>> >>> > > will try to capture it if possible.
>> >>> > >
>> >>> > > Also, I have custom settings regarding TCP stack, shown below as
>> >>> > > well. ifb is used to shape traffic with tc.
>> >>> > >
>> >>> > > Please note this regression was already reported as BZ [1] and as a
>> >>> > > letter to ML [2], but got neither attention nor resolution. It is
>> >>> > > reproducible for (not only) me on my home router since v4.11 till
>> >>> > > v4.13.1 incl.
>> >>> > >
>> >>> > > Please advise on how to deal with it. I'll provide any additional
>> >>> > > info if
>> >>> > > necessary, also ready to test patches if any.
>> >>> > >
>> >>> > > Thanks.
>> >>> > >
>> >>> > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=195835
>> >>> > > [2]
>> >>> > > https://www.spinics.net/lists/netdev/msg436158.html
>> >>> >
>> >>> > We're experiencing the same problems on some machines in our fleet.
>> >>> > Exactly the same symptoms: tcp_fastretrans_alert() warnings and
>> >>> > sometimes panics in tcp_sacktag_walk().
>> >>
>> >>> > Here is an example of a backtrace with the panic log:
>> >> Hi Yuchung!
>> >>
>> >>> do you still see the panics if you disable RACK?
>> >>> sysctl net.ipv4.tcp_recovery=0?
>> >>
>> >> No, we haven't seen any crash since that.
>> >
>> > I am out of ideas how RACK can potentially cause tcp_sacktag_walk to
>> > take an empty skb :-( Do you have a stack trace or any hint on which call
>> > to tcp_sacktag_walk triggered the panic? Internally at Google we never
>> > see that.
>>
>> hmm something just struck me: could you try
>> sysctl net.ipv4.tcp_recovery=1 net.ipv4.tcp_retrans_collapse=0
>> and see if kernel still panics on sack processing?
>>
>> >>> Also, have you experienced any SACK reneging? Could you post the output of
>> >>> 'nstat | grep -i TCP'? Thanks.
>> >>
>> >> hostname TcpActiveOpens          2289680       0.0
>> >> hostname TcpPassiveOpens         3592758       0.0
>> >> hostname TcpAttemptFails         746910        0.0
>> >> hostname TcpEstabResets          154988        0.0
>> >> hostname TcpInSegs               16258678255   0.0
>> >> hostname TcpOutSegs              46967011611   0.0
>> >> hostname TcpRetransSegs          13724310      0.0
>> >> hostname TcpInErrs               2             0.0
>> >> hostname TcpOutRsts              9418798       0.0
>> >> hostname TcpExtEmbryonicRsts     2303          0.0
>> >> hostname TcpExtPruneCalled       90192         0.0
>> >> hostname TcpExtOfoPruned         57274         0.0
>> >> hostname TcpExtOutOfWindowIcmps  3             0.0
>> >> hostname TcpExtTW                1164705       0.0
>> >> hostname TcpExtTWRecycled        2             0.0
>> >> hostname TcpExtPAWSEstab         159           0.0
>> >> hostname TcpExtDelayedACKs       209207209     0.0
>> >> hostname TcpExtDelayedACKLocked  508571        0.0
>> >> hostname TcpExtDelayedACKLost    1713248       0.0
>> >> hostname TcpExtListenOverflows   625           0.0
>> >> hostname TcpExtListenDrops       625           0.0

Re: [net-next PATCH 3/5] bpf: cpumap xdp_buff to skb conversion and allocation

2017-09-28 Thread Daniel Borkmann

On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:
[...]

+/* Convert xdp_buff to xdp_pkt */
+static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
+{
+   struct xdp_pkt *xdp_pkt;
+   int headroom;
+
+   /* Assure headroom is available for storing info */
+   headroom = xdp->data - xdp->data_hard_start;
+   if (headroom < sizeof(*xdp_pkt))
+   return NULL;
+
+   /* Store info in top of packet */
+   xdp_pkt = xdp->data_hard_start;


(You'd also need to handle data_meta here if set, and likewise for
cpu_map_build_skb() below, e.g. headroom is data_meta - data_hard_start.)


+   xdp_pkt->data = xdp->data;
+   xdp_pkt->len  = xdp->data_end - xdp->data;
+   xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
+
+   return xdp_pkt;
+}
+
+static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
+struct xdp_pkt *xdp_pkt)
+{
+   unsigned int frame_size;
+   void *pkt_data_start;
+   struct sk_buff *skb;
+
+   /* build_skb need to place skb_shared_info after SKB end, and
+* also want to know the memory "truesize".  Thus, need to

[...]

  static int cpu_map_kthread_run(void *data)
  {
+   const unsigned long busy_poll_jiffies = usecs_to_jiffies(2000);
+   unsigned long time_limit = jiffies + busy_poll_jiffies;
struct bpf_cpu_map_entry *rcpu = data;
+   unsigned int empty_cnt = 0;

set_current_state(TASK_INTERRUPTIBLE);
while (!kthread_should_stop()) {
+   unsigned int processed = 0, drops = 0;
struct xdp_pkt *xdp_pkt;

-   schedule();
-   /* Do work */
-   while ((xdp_pkt = ptr_ring_consume(rcpu->queue))) {
-   /* For now just "refcnt-free" */
-   page_frag_free(xdp_pkt);
+   /* Release CPU reschedule checks */
+   if ((time_after_eq(jiffies, time_limit) || empty_cnt > 25) &&
+   __ptr_ring_empty(rcpu->queue)) {
+   empty_cnt++;
+   schedule();
+   time_limit = jiffies + busy_poll_jiffies;
+   WARN_ON(smp_processor_id() != rcpu->cpu);
+   } else {
+   cond_resched();
}
+
+   /* Process packets in rcpu->queue */
+   local_bh_disable();
+   /*
+* The bpf_cpu_map_entry is single consumer, with this
+* kthread CPU pinned. Lockless access to ptr_ring
+* consume side valid as no-resize allowed of queue.
+*/
+   while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
+   struct sk_buff *skb;
+   int ret;
+
+   /* Allow busy polling again */
+   empty_cnt = 0;
+
+   skb = cpu_map_build_skb(rcpu, xdp_pkt);
+   if (!skb) {
+   page_frag_free(xdp_pkt);
+   continue;
+   }
+
+   /* Inject into network stack */
+   ret = netif_receive_skb(skb);


Have you looked into whether it's feasible to reuse the GRO
engine here as well?


+   if (ret == NET_RX_DROP)
+   drops++;
+
+   /* Limit BH-disable period */
+   if (++processed == 8)
+   break;
+   }
+   local_bh_enable();
+
__set_current_state(TASK_INTERRUPTIBLE);
}
put_cpu_map_entry(rcpu);

[...]


Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-28 Thread Vinicius Costa Gomes
Hi,

Cong Wang  writes:

[...]

>>>
>>> I am not sure how we can solve this elegantly, perhaps you should
>>> extend mqprio rather than add a new one?
>>
>> Is the alternative hinted in the FIXME worse? Instead of passing the
>> index of the hardware queue to the driver we pass the pointer to a
>> netdev_queue to the driver and it "discovers" the HW queue from that.
>
> Does this way solve the dependency on mqprio? If yes then it is good.
> And you have to fix it before merge; we don't have any qdisc that depends
> on a specific type of qdisc as its parent.

Yes, it does. And if we do as Jesus pointed out, we can do this on the
CBS qdisc side, with no need to change the driver.


Cheers,
--
Vinicius


Re: [PATCH] Add a driver for Renesas uPD60620 and uPD60620A PHYs

2017-09-28 Thread Andrew Lunn
Hi Bernd

> >> +  if (phy_state & BMSR_ANEGCOMPLETE) {
> > 
> > It is worth comparing this against genphy_read_status() which is the
> > reference implementation. You would normally check if auto negotiation
> > is enabled, not if it has completed. If it is enabled you read the
> > current negotiated state, even if it is not completed.
> > 
> 
> Do you suggest that there are cases where auto negotiation does not
> reach completion, and still provides a usable link status?

My experience is that it often returns 10/half, since everything should
support that. And depending on what the MAC is doing, packets can
sometimes get across the link.
 
> I have tried to connect to link partners with fixed configuration
> but even then the auto negotiation always completes normally.

Which is a bit odd.

There are a few different possibilities here.  The peer PHY driver is
broken. Rather than doing fixed, it actually set the possible
negotiation options to just the one setting you tried to fix it
to. And hence the uPD60620 device negotiated fine. Or the uPD60620 is
broken and said it negotiated, but in fact it failed.

What was the result? 10/Half, or the fixed values you set the peer to?
 
> 
> >From 2e101aed8466b314251972d1eaccfb43cf177078 Mon Sep 17 00:00:00 2001
> From: Bernd Edlinger 
> Date: Thu, 21 Sep 2017 15:46:16 +0200
> Subject: [PATCH 2/5] Add a driver for Renesas uPD60620 and uPD60620A PHYs.
> 
> Signed-off-by: Bernd Edlinger 

Please send this as a new patch. If we were to take this as is, all
the comments above would end up in the commit message.

> ---

Under the --- you can however add comments which don't go into the
commit log. Good practice is to list the things you changed since the
previous version.

Thanks
Andrew


Re: [net-next PATCH 0/5] New bpf cpumap type for XDP_REDIRECT

2017-09-28 Thread Daniel Borkmann

On 09/28/2017 02:57 PM, Jesper Dangaard Brouer wrote:

Introducing a new way to redirect XDP frames.  Notice how no driver
changes are necessary given the design of XDP_REDIRECT.

This redirect map type is called 'cpumap', as it allows redirecting
XDP frames to remote CPUs.  The remote CPU will do the SKB allocation
and start the network stack invocation on that CPU.

This is a scalability and isolation mechanism that allows separating
the early driver network XDP layer from the rest of the netstack, and
assigning dedicated CPUs for this stage.  The sysadmin controls the
RX-CPU to NIC-RX-queue mapping (as usual) via procfs smp_affinity, and
how many queues are configured via ethtool --set-channels.  Benchmarks
show that a single CPU can handle approx 11Mpps.  Thus, only assigning
two NIC RX-queues (and two CPUs) is sufficient for handling 10Gbit/s
wirespeed smallest packet 14.88Mpps.  Reducing the number of queues
has the advantage that more packets are available in "bulk" per hard
interrupt[1].

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
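The queue/CPU setup described above uses the usual tools. A hypothetical example follows; the interface name and IRQ numbers are made up, not from the patch set, and must be looked up in /proc/interrupts on a real system:

```shell
# Restrict the NIC to two RX queues and pin their IRQs to CPU 0 and 1.
# 'eth0' and the IRQ numbers are placeholders.
ethtool --set-channels eth0 combined 2
echo 1 > /proc/irq/120/smp_affinity   # queue 0 -> CPU 0
echo 2 > /proc/irq/121/smp_affinity   # queue 1 -> CPU 1
```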

Use-cases:

1. End-host based pre-filtering for DDoS mitigation.  This is fast
enough to allow software to see and filter all packets wirespeed.
Thus, no packets getting silently dropped by hardware.

2. Given that NIC HW unevenly distributes packets across RX queues, this
mechanism can be used to redistribute load across CPUs.  This
usually happens when HW is unaware of a new protocol.  This
resembles RPS (Receive Packet Steering), just faster, but with more
responsibility placed on the BPF program for correct steering.

3. Auto-scaling or power saving via only activating the appropriate
number of remote CPUs for handling the current load.  The cpumap
tracepoints can function as a feedback loop for this purpose.


Interesting work, thanks! Still digesting the code a bit. I think
it pretty much goes into the direction that Eric describes in his
netdev paper quoted above; not on a generic level though but specific
to XDP at least; theoretically XDP could just run transparently on
the CPU doing the filtering, and raw buffers are handed to remote
CPU with similar batching, but it would need some different config
interface at minimum.

Shouldn't we take the CPU(s) running XDP on the RX queues out from
the normal process scheduler, so that we have a guarantee that user
space or unrelated kernel tasks cannot interfere with them anymore,
and we could then turn them into busy polling eventually (e.g. as
long as XDP is running there and once off could put them back into
normal scheduling domain transparently)?

What about RPS/RFS in the sense that once you punt them to remote
CPU, could we reuse application locality information so they'd end
up on the right CPU in the first place (w/o backlog detour), or is
the intent to rather disable it and have some own orchestration
with relation to the CPU map?

Cheers,
Daniel


Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-28 Thread Cong Wang
On Wed, Sep 27, 2017 at 2:14 PM, Vinicius Costa Gomes
 wrote:
> Hi,
>
> Cong Wang  writes:
>
>> On Tue, Sep 26, 2017 at 4:39 PM, Vinicius Costa Gomes
>>  wrote:
>>> +static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
>>> +{
>>> +   struct cbs_sched_data *q = qdisc_priv(sch);
>>> +   struct net_device *dev = qdisc_dev(sch);
>>> +
>>> +   if (!opt)
>>> +   return -EINVAL;
>>> +
>>> +   /* FIXME: this means that we can only install this qdisc
>>> +* "under" mqprio. Do we need a more generic way to retrieve
>>> +* the queue, or do we pass the netdev_queue to the driver?
>>> +*/
>>> +   q->queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
>>> +
>>> +   return cbs_change(sch, opt);
>>> +}
>>
>> Yeah it is ugly to assume its parent is mqprio, at least you should
>> error out if it is not the case.
>
> Will add an error for this, for now.
>
>>
>> I am not sure how we can solve this elegantly, perhaps you should
>> extend mqprio rather than add a new one?
>
> Is the alternative hinted in the FIXME worse? Instead of passing the
> index of the hardware queue to the driver we pass the pointer to a
> netdev_queue to the driver and it "discovers" the HW queue from that.

Does this way solve the dependency on mqprio? If yes then it is good.
And you have to fix it before merge, we don't have any qdisc depending
a specific type of qdisc to be its parent.


Re: [PATCH V2] r8152: add Linksys USB3GIGV1 id

2017-09-28 Thread Rustad, Mark D

> On Sep 27, 2017, at 9:39 AM, Grant Grundler  wrote:
> 
> On Wed, Sep 27, 2017 at 12:15 AM, Oliver Neukum  wrote:
>> Am Dienstag, den 26.09.2017, 08:19 -0700 schrieb Doug Anderson:
>>> 
>>> I know that for at least some of the adapters in the CDC Ethernet
>>> blacklist it was claimed that the CDC Ethernet support in the adapter
>>> was kinda broken anyway so the blacklist made sense.  ...but for the
>>> Linksys Gigabit adapter the CDC Ethernet driver seems to work OK, it's
>>> just not quite as full featured / efficient as the R8152 driver.
>>> 
>>> Is that not a concern?  I guess you could tell people in this
>>> situation that they simply need to enable the R8152 driver to get
>>> continued support for their Ethernet adapter?
>> 
>> Hi,
>> 
>> yes, it is a valid concern. An #ifdef will be needed.
> 
> Good idea - I will post V3 shortly.
> 
> I'm assuming you mean to add #ifdef CONFIG_USB_RTL8152 around the
> blacklist entry in cdc_ether driver.

Shouldn't that be an #if IS_ENABLED(...) test, since that seems to be the
proper way to check for configured drivers?

--
Mark Rustad, Networking Division, Intel Corporation





Re: [Patch net-next] net_sched: use idr to allocate u32 filter handles

2017-09-28 Thread Cong Wang
On Thu, Sep 28, 2017 at 12:34 AM, Simon Horman
 wrote:
> Hi Cong,
>
> this looks like a nice enhancement to me. Did you measure any performance
> benefit from it?  Perhaps it could be described in the changelog. I also
> have a more detailed question below.

No, I was inspired by commit c15ab236d69d; I didn't measure it.


>
>> ---
>>  net/sched/cls_u32.c | 108 
>> 
>>  1 file changed, 67 insertions(+), 41 deletions(-)
>>
>> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
>> index 10b8d851fc6b..316b8a791b13 100644
>> --- a/net/sched/cls_u32.c
>> +++ b/net/sched/cls_u32.c
>> @@ -46,6 +46,7 @@
>
> ...
>
>> @@ -937,22 +940,33 @@ static int u32_change(struct net *net, struct sk_buff 
>> *in_skb,
>>   return -EINVAL;
>>   if (TC_U32_KEY(handle))
>>   return -EINVAL;
>> - if (handle == 0) {
>> - handle = gen_new_htid(tp->data);
>> - if (handle == 0)
>> - return -ENOMEM;
>> - }
>>   ht = kzalloc(sizeof(*ht) + divisor*sizeof(void *), GFP_KERNEL);
>>   if (ht == NULL)
>>   return -ENOBUFS;
>> + if (handle == 0) {
>> + handle = gen_new_htid(tp->data, ht);
>> + if (handle == 0) {
>> + kfree(ht);
>> + return -ENOMEM;
>> + }
>> + } else {
>> + err = idr_alloc_ext(&tp_c->handle_idr, ht, NULL,
>> + handle, handle + 1, GFP_KERNEL);
>> + if (err) {
>> + kfree(ht);
>> + return err;
>> + }
>
> The above seems to check that handle is not already in use and mark it as
> in use. But I don't see that logic in the code prior to this patch.
> Am I missing something? If not perhaps this portion should be a separate
> patch or described in the changelog.

The logic is in the upper layer, tc_ctl_tfilter(). It tries to get a
filter by handle
(if non-zero), and errors out if we are creating a new filter with the same
handle.

At the point you quote above, 'n' is already NULL and 'handle' is non-zero,
which means no existing filter has the same handle, so it is safe to just
mark it as in-use.

Thanks.


[PATCH v3 net-next 8/8] vxlan: support flow dissect

2017-09-28 Thread Tom Herbert
Populate offload flow_dissect callback appropriately for VXLAN and
VXLAN-GPE.

Signed-off-by: Tom Herbert 
---
 drivers/net/vxlan.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index d7c49cf1d5e9..80227050b2d4 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1327,6 +1327,45 @@ static bool vxlan_ecn_decapsulate(struct vxlan_sock *vs, 
void *oiph,
return err <= 1;
 }
 
+static enum flow_dissect_ret vxlan_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   __be16 protocol = htons(ETH_P_TEB);
+   struct vxlanhdr *vhdr, _vhdr;
+   struct vxlan_sock *vs;
+
+   vhdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+   sizeof(_vhdr), data, *p_hlen, &_vhdr);
+   if (!vhdr)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   vs = rcu_dereference_sk_user_data(sk);
+   if (!vs)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   if (vs->flags & VXLAN_F_GPE) {
+   struct vxlanhdr_gpe *gpe = (struct vxlanhdr_gpe *)vhdr;
+
+   /* Need to have Next Protocol set for interfaces in GPE mode. */
+   if (gpe->version != 0 || !gpe->np_applied || gpe->oam_flag)
+   return FLOW_DISSECT_RET_CONTINUE;
+
+   protocol = tun_p_from_eth_p(gpe->next_protocol);
+   if (!protocol)
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   *p_nhoff += sizeof(struct udphdr) + sizeof(_vhdr);
+   *p_proto = protocol;
+
+   return FLOW_DISSECT_RET_PROTO_AGAIN;
+}
+
 /* Callback from net/ipv4/udp.c to receive packets */
 static int vxlan_rcv(struct sock *sk, struct sk_buff *skb)
 {
@@ -2846,6 +2885,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net 
*net, bool ipv6,
tunnel_cfg.encap_destroy = NULL;
tunnel_cfg.gro_receive = vxlan_gro_receive;
tunnel_cfg.gro_complete = vxlan_gro_complete;
+   tunnel_cfg.flow_dissect = vxlan_flow_dissect;
 
setup_udp_tunnel_sock(net, sock, &tunnel_cfg);
 
-- 
2.11.0



[PATCH v3 net-next 5/8] ip: Add callbacks to flow dissection by IP protocol

2017-09-28 Thread Tom Herbert
Populate the proto_flow_dissect function for IPv4 and IPv6 packet
offloads. This allows the caller to flow dissect a packet starting
at the given IP protocol (as parsed to that point by flow dissector
for instance).

Signed-off-by: Tom Herbert 
---
 net/ipv4/af_inet.c | 27 +++
 net/ipv6/ip6_offload.c | 27 +++
 2 files changed, 54 insertions(+)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index e31108e5ef79..18c1d884999a 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1440,6 +1440,32 @@ static struct sk_buff **ipip_gro_receive(struct sk_buff 
**head,
return inet_gro_receive(head, skb);
 }
 
+static enum flow_dissect_ret inet_proto_flow_dissect(struct sk_buff *skb,
+   u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   const struct net_offload *ops;
+
+   rcu_read_lock();
+
+   ops = rcu_dereference(inet_offloads[proto]);
+   if (ops && ops->callbacks.flow_dissect)
+   ret =  ops->callbacks.flow_dissect(skb, key_control,
+  flow_dissector,
+  target_container,
+  data, p_proto, p_ip_proto,
+  p_nhoff, p_hlen, flags);
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
 #define SECONDS_PER_DAY86400
 
 /* inet_current_timestamp - Return IP network timestamp
@@ -1763,6 +1789,7 @@ static int ipv4_proc_init(void);
 
 static struct packet_offload ip_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IP),
+   .proto_flow_dissect = inet_proto_flow_dissect,
.callbacks = {
.gso_segment = inet_gso_segment,
.gro_receive = inet_gro_receive,
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index cdb3728faca7..a33a2b40b3d6 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -339,8 +339,35 @@ static int ip4ip6_gro_complete(struct sk_buff *skb, int 
nhoff)
return inet_gro_complete(skb, nhoff);
 }
 
+static enum flow_dissect_ret inet6_proto_flow_dissect(struct sk_buff *skb,
+   u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   const struct net_offload *ops;
+
+   rcu_read_lock();
+
+   ops = rcu_dereference(inet6_offloads[proto]);
+   if (ops && ops->callbacks.flow_dissect)
+   ret =  ops->callbacks.flow_dissect(skb, key_control,
+  flow_dissector,
+  target_container, data,
+  p_proto, p_ip_proto, p_nhoff,
+  p_hlen, flags);
+
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static struct packet_offload ipv6_packet_offload __read_mostly = {
.type = cpu_to_be16(ETH_P_IPV6),
+   .proto_flow_dissect = inet6_proto_flow_dissect,
.callbacks = {
.gso_segment = ipv6_gso_segment,
.gro_receive = ipv6_gro_receive,
-- 
2.11.0



[PATCH v3 net-next 7/8] fou: Support flow dissection

2017-09-28 Thread Tom Herbert
Populate the offload flow_dissect callback appropriately for fou and gue.

Signed-off-by: Tom Herbert 
---
 net/ipv4/fou.c | 63 ++
 1 file changed, 63 insertions(+)

diff --git a/net/ipv4/fou.c b/net/ipv4/fou.c
index 1540db65241a..a831dd49fb28 100644
--- a/net/ipv4/fou.c
+++ b/net/ipv4/fou.c
@@ -282,6 +282,20 @@ static int fou_gro_complete(struct sock *sk, struct 
sk_buff *skb,
return err;
 }
 
+static enum flow_dissect_ret fou_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   *p_ip_proto = fou_from_sock(sk)->protocol;
+   *p_nhoff += sizeof(struct udphdr);
+
+   return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static struct guehdr *gue_gro_remcsum(struct sk_buff *skb, unsigned int off,
  struct guehdr *guehdr, void *data,
  size_t hdrlen, struct gro_remcsum *grc,
@@ -500,6 +514,53 @@ static int gue_gro_complete(struct sock *sk, struct 
sk_buff *skb, int nhoff)
return err;
 }
 
+static enum flow_dissect_ret gue_flow_dissect(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   struct guehdr *guehdr, _guehdr;
+
+   guehdr = __skb_header_pointer(skb, *p_nhoff + sizeof(struct udphdr),
+ sizeof(_guehdr), data, *p_hlen, &_guehdr);
+   if (!guehdr)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   switch (guehdr->version) {
+   case 0:
+   if (unlikely(guehdr->control))
+   return FLOW_DISSECT_RET_CONTINUE;
+
+   *p_ip_proto = guehdr->proto_ctype;
+   *p_nhoff += sizeof(struct udphdr) +
+   sizeof(*guehdr) + (guehdr->hlen << 2);
+
+   break;
+   case 1:
+   switch (((struct iphdr *)guehdr)->version) {
+   case 4:
+   *p_ip_proto = IPPROTO_IPIP;
+   break;
+   case 6:
+   *p_ip_proto = IPPROTO_IPV6;
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   *p_nhoff += sizeof(struct udphdr);
+
+   break;
+   default:
+   return FLOW_DISSECT_RET_CONTINUE;
+   }
+
+   return FLOW_DISSECT_RET_IPPROTO_AGAIN;
+}
+
 static int fou_add_to_port_list(struct net *net, struct fou *fou)
 {
struct fou_net *fn = net_generic(net, fou_net_id);
@@ -570,12 +631,14 @@ static int fou_create(struct net *net, struct fou_cfg 
*cfg,
tunnel_cfg.encap_rcv = fou_udp_recv;
tunnel_cfg.gro_receive = fou_gro_receive;
tunnel_cfg.gro_complete = fou_gro_complete;
+   tunnel_cfg.flow_dissect = fou_flow_dissect;
fou->protocol = cfg->protocol;
break;
case FOU_ENCAP_GUE:
tunnel_cfg.encap_rcv = gue_udp_recv;
tunnel_cfg.gro_receive = gue_gro_receive;
tunnel_cfg.gro_complete = gue_gro_complete;
+   tunnel_cfg.flow_dissect = gue_flow_dissect;
break;
default:
err = -EINVAL;
-- 
2.11.0



[PATCH v3 net-next 3/8] udp: Check static key udp_encap_needed in udp_gro_receive

2017-09-28 Thread Tom Herbert
Currently, the only support for UDP GRO is provided by UDP encapsulation
protocols. Since they always set udp_encap_needed, we can check that in the
udp_gro_receive functions before performing a socket lookup.

Signed-off-by: Tom Herbert 
---
 include/net/udp.h  | 2 ++
 net/ipv4/udp.c | 4 +++-
 net/ipv4/udp_offload.c | 7 +++
 net/ipv6/udp_offload.c | 7 +++
 4 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/include/net/udp.h b/include/net/udp.h
index 12dfbfe2e2d7..c6b1c5d8d3c9 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -97,6 +97,8 @@ static inline struct udp_hslot *udp_hashslot2(struct 
udp_table *table,
 
 extern struct proto udp_prot;
 
+extern struct static_key udp_encap_needed;
+
 extern atomic_long_t udp_memory_allocated;
 
 /* sysctl variables for udp */
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 784ced0b9150..2788843e8eb2 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1813,7 +1813,9 @@ static int __udp_queue_rcv_skb(struct sock *sk, struct 
sk_buff *skb)
return 0;
 }
 
-static struct static_key udp_encap_needed __read_mostly;
+struct static_key udp_encap_needed __read_mostly;
+EXPORT_SYMBOL(udp_encap_needed);
+
 void udp_encap_enable(void)
 {
static_key_enable(&udp_encap_needed);
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 97658bfc1b58..a744bb515455 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -261,6 +261,13 @@ static struct sk_buff **udp4_gro_receive(struct sk_buff 
**head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);
 
+   if (!static_key_false(&udp_encap_needed)) {
+   /* Currently udp_gro_receive only does something if
+* a UDP encapsulation has been set.
+*/
+   goto flush;
+   }
+
if (unlikely(!uh))
goto flush;
 
diff --git a/net/ipv6/udp_offload.c b/net/ipv6/udp_offload.c
index 455fd4e39333..111b026e4f03 100644
--- a/net/ipv6/udp_offload.c
+++ b/net/ipv6/udp_offload.c
@@ -34,6 +34,13 @@ static struct sk_buff **udp6_gro_receive(struct sk_buff 
**head,
 {
struct udphdr *uh = udp_gro_udphdr(skb);
 
+   if (!static_key_false(&udp_encap_needed)) {
+   /* Currently udp_gro_receive only does something if
+* a UDP encapsulation has been set.
+*/
+   goto flush;
+   }
+
if (unlikely(!uh))
goto flush;
 
-- 
2.11.0



[PATCH v3 net-next 4/8] flow_dissector: Add protocol specific flow dissection offload

2017-09-28 Thread Tom Herbert
Add offload capability for performing protocol specific flow dissection
(either by EtherType or IP protocol).

Specifically:

- Add flow_dissect to offload callbacks
- Move flow_dissect_ret enum to flow_dissector.h, cleanup names and add a
  couple of values
- Unify handling of functions that return flow_dissect_ret enum
- In __skb_flow_dissect, add default case for switch(proto) as well as
  switch(ip_proto) that looks up and calls protocol specific flow
  dissection

Signed-off-by: Tom Herbert 
---
 include/linux/netdevice.h| 27 ++
 include/net/flow_dissector.h |  1 +
 net/core/dev.c   | 65 
 net/core/flow_dissector.c| 16 +--
 net/ipv4/route.c |  4 ++-
 5 files changed, 110 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..565d7cdfe967 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2207,12 +2207,25 @@ struct offload_callbacks {
struct sk_buff  **(*gro_receive)(struct sk_buff **head,
 struct sk_buff *skb);
int (*gro_complete)(struct sk_buff *skb, int nhoff);
+   enum flow_dissect_ret (*flow_dissect)(struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 };
 
 struct packet_offload {
__be16   type;  /* This is really htons(ether_type). */
u16  priority;
struct offload_callbacks callbacks;
+   enum flow_dissect_ret (*proto_flow_dissect)(struct sk_buff *skb,
+   u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
struct list_head list;
 };
 
@@ -3252,6 +3265,20 @@ struct sk_buff *napi_get_frags(struct napi_struct *napi);
 gro_result_t napi_gro_frags(struct napi_struct *napi);
 struct packet_offload *gro_find_receive_by_type(__be16 type);
 struct packet_offload *gro_find_complete_by_type(__be16 type);
+enum flow_dissect_ret flow_dissect_by_type(struct sk_buff *skb,
+   __be16 type,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
+enum flow_dissect_ret flow_dissect_by_type_proto(struct sk_buff *skb,
+   __be16 type, u8 proto,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
 static inline void napi_free_frags(struct napi_struct *napi)
 {
diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index fc3dce730a6b..ad75bbfd1c9c 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -213,6 +213,7 @@ enum flow_dissector_key_id {
 #define FLOW_DISSECTOR_F_STOP_AT_L3		BIT(1)
 #define FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL	BIT(2)
 #define FLOW_DISSECTOR_F_STOP_AT_ENCAP		BIT(3)
+#define FLOW_DISSECTOR_F_STOP_AT_L4		BIT(4)
 
 struct flow_dissector_key {
enum flow_dissector_key_id key_id;
diff --git a/net/core/dev.c b/net/core/dev.c
index e350c768d4b5..f3cd884bd04b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -104,6 +104,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -4907,6 +4908,70 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
 }
 EXPORT_SYMBOL(gro_find_complete_by_type);
 
+enum flow_dissect_ret flow_dissect_by_type(struct sk_buff *skb,
+   __be16 type,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   struct list_head *offload_head = &offload_base;
+   struct packet_offload *ptype;
+
+ 

[PATCH v3 net-next 1/8] flow_dissector: Change skbuf argument to be non const

2017-09-28 Thread Tom Herbert
Change the skb argument of __skb_flow_dissect to be non-const so
that the function can call functions that take non-const skb
arguments. This is needed if we are to call socket lookup or BPF in the
flow dissector path.

The changes include unraveling the call chain into __skb_flow_dissect so
that those callers also use non-const skbs.

Signed-off-by: Tom Herbert 
---
 include/linux/skbuff.h| 12 ++--
 include/net/ip_fib.h  |  4 ++--
 include/net/route.h   |  4 ++--
 net/core/flow_dissector.c | 10 +-
 net/ipv4/fib_semantics.c  |  2 +-
 net/ipv4/route.c  |  6 +++---
 net/sched/sch_sfq.c   |  2 +-
 7 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19e64bfb1a66..5a6e765e120f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1155,8 +1155,8 @@ __skb_set_sw_hash(struct sk_buff *skb, __u32 hash, bool is_l4)
 }
 
 void __skb_get_hash(struct sk_buff *skb);
-u32 __skb_get_hash_symmetric(const struct sk_buff *skb);
-u32 skb_get_poff(const struct sk_buff *skb);
+u32 __skb_get_hash_symmetric(struct sk_buff *skb);
+u32 skb_get_poff(struct sk_buff *skb);
 u32 __skb_get_poff(const struct sk_buff *skb, void *data,
   const struct flow_keys *keys, int hlen);
 __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
@@ -1172,13 +1172,13 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
 const struct flow_dissector_key *key,
 unsigned int key_count);
 
-bool __skb_flow_dissect(const struct sk_buff *skb,
+bool __skb_flow_dissect(struct sk_buff *skb,
struct flow_dissector *flow_dissector,
void *target_container,
void *data, __be16 proto, int nhoff, int hlen,
unsigned int flags);
 
-static inline bool skb_flow_dissect(const struct sk_buff *skb,
+static inline bool skb_flow_dissect(struct sk_buff *skb,
struct flow_dissector *flow_dissector,
void *target_container, unsigned int flags)
 {
@@ -1186,7 +1186,7 @@ static inline bool skb_flow_dissect(const struct sk_buff *skb,
  NULL, 0, 0, 0, flags);
 }
 
-static inline bool skb_flow_dissect_flow_keys(const struct sk_buff *skb,
+static inline bool skb_flow_dissect_flow_keys(struct sk_buff *skb,
  struct flow_keys *flow,
  unsigned int flags)
 {
@@ -1225,7 +1225,7 @@ static inline __u32 skb_get_hash_flowi6(struct sk_buff *skb, const struct flowi6
return skb->hash;
 }
 
-__u32 skb_get_hash_perturb(const struct sk_buff *skb, u32 perturb);
+__u32 skb_get_hash_perturb(struct sk_buff *skb, u32 perturb);
 
 static inline __u32 skb_get_hash_raw(const struct sk_buff *skb)
 {
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1a7f7e424320..a376dfe1ad44 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -374,11 +374,11 @@ int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 int fib_multipath_hash(const struct fib_info *fi, const struct flowi4 *fl4,
-  const struct sk_buff *skb);
+  struct sk_buff *skb);
 #endif
 void fib_select_multipath(struct fib_result *res, int hash);
 void fib_select_path(struct net *net, struct fib_result *res,
-struct flowi4 *fl4, const struct sk_buff *skb);
+struct flowi4 *fl4, struct sk_buff *skb);
 
 /* Exported by fib_trie.c */
 void fib_trie_init(void);
diff --git a/include/net/route.h b/include/net/route.h
index 57dfc6850d37..cb95b79f0117 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -114,10 +114,10 @@ int ip_rt_init(void);
 void rt_cache_flush(struct net *net);
 void rt_flush_dev(struct net_device *dev);
 struct rtable *ip_route_output_key_hash(struct net *net, struct flowi4 *flp,
-   const struct sk_buff *skb);
+   struct sk_buff *skb);
struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *flp,
struct fib_result *res,
-   const struct sk_buff *skb);
+   struct sk_buff *skb);
 
 static inline struct rtable *__ip_route_output_key(struct net *net,
   struct flowi4 *flp)
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0a977373d003..76f5e5bc3177 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -424,7 +424,7 @@ static bool skb_flow_dissect_allowed(int *num_hdrs)
  *
  * Caller must take care of zeroing target container memory.
  */
-bool __skb_flow_dissec

[PATCH v3 net-next 2/8] flow_dissector: Move ETH_P_TEB processing to main switch

2017-09-28 Thread Tom Herbert
Support for processing TEB is currently in GRE flow dissection as a
special case. This can be moved to be a case in the main proto switch in
__skb_flow_dissect.

Signed-off-by: Tom Herbert 
---
 net/core/flow_dissector.c | 45 -
 1 file changed, 24 insertions(+), 21 deletions(-)

diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 76f5e5bc3177..c15b41f96cbe 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -282,27 +282,8 @@ __skb_flow_dissect_gre(const struct sk_buff *skb,
if (hdr->flags & GRE_SEQ)
offset += sizeof(((struct pptp_gre_header *) 0)->seq);
 
-   if (gre_ver == 0) {
-   if (*p_proto == htons(ETH_P_TEB)) {
-   const struct ethhdr *eth;
-   struct ethhdr _eth;
-
-   eth = __skb_header_pointer(skb, *p_nhoff + offset,
-  sizeof(_eth),
-  data, *p_hlen, &_eth);
-   if (!eth)
-   return FLOW_DISSECT_RET_OUT_BAD;
-   *p_proto = eth->h_proto;
-   offset += sizeof(*eth);
-
-   /* Cap headers that we access via pointers at the
-* end of the Ethernet header as our maximum alignment
-* at that point is only 2 bytes.
-*/
-   if (NET_IP_ALIGN)
-   *p_hlen = *p_nhoff + offset;
-   }
-   } else { /* version 1, must be PPTP */
+   /* version 1, must be PPTP */
+   if (gre_ver == 1) {
u8 _ppp_hdr[PPP_HDRLEN];
u8 *ppp_hdr;
 
@@ -595,6 +576,28 @@ bool __skb_flow_dissect(struct sk_buff *skb,
 
break;
}
+   case htons(ETH_P_TEB): {
+   const struct ethhdr *eth;
+   struct ethhdr _eth;
+
+   eth = __skb_header_pointer(skb, nhoff, sizeof(_eth),
+  data, hlen, &_eth);
+   if (!eth)
+   goto out_bad;
+
+   proto = eth->h_proto;
+   nhoff += sizeof(*eth);
+
+   /* Cap headers that we access via pointers at the
+* end of the Ethernet header as our maximum alignment
+* at that point is only 2 bytes.
+*/
+   if (NET_IP_ALIGN)
+   hlen = nhoff;
+
+   fdret = FLOW_DISSECT_RET_PROTO_AGAIN;
+   break;
+   }
case htons(ETH_P_8021AD):
case htons(ETH_P_8021Q): {
const struct vlan_hdr *vlan;
-- 
2.11.0



[PATCH v3 net-next 6/8] udp: flow dissector offload

2017-09-28 Thread Tom Herbert
Add support to perform UDP specific flow dissection. This is
primarily intended for dissecting encapsulated packets in UDP
encapsulation.

This patch adds a flow_dissect offload for UDP4 and UDP6. The backend
function performs a socket lookup and calls the flow_dissect function
if a socket is found.

Signed-off-by: Tom Herbert 
---
 include/linux/udp.h  |  8 
 include/net/udp.h|  8 
 include/net/udp_tunnel.h |  8 
 net/ipv4/udp_offload.c   | 48 
 net/ipv4/udp_tunnel.c|  1 +
 net/ipv6/udp_offload.c   | 16 
 6 files changed, 89 insertions(+)

diff --git a/include/linux/udp.h b/include/linux/udp.h
index eaea63bc79bb..2e90b189ef6a 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -79,6 +79,14 @@ struct udp_sock {
int (*gro_complete)(struct sock *sk,
struct sk_buff *skb,
int nhoff);
+   /* Flow dissector function for a UDP socket */
+   enum flow_dissect_ret (*flow_dissect)(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
/* udp_recvmsg try to use this before splicing sk_receive_queue */
struct sk_buff_head reader_queue cacheline_aligned_in_smp;
diff --git a/include/net/udp.h b/include/net/udp.h
index c6b1c5d8d3c9..4867f329538c 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -176,6 +176,14 @@ struct sk_buff **udp_gro_receive(struct sk_buff **head, struct sk_buff *skb,
 struct udphdr *uh, udp_lookup_t lookup);
 int udp_gro_complete(struct sk_buff *skb, int nhoff, udp_lookup_t lookup);
 
+enum flow_dissect_ret udp_flow_dissect(struct sk_buff *skb,
+   udp_lookup_t lookup,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
+
 static inline struct udphdr *udp_gro_udphdr(struct sk_buff *skb)
 {
struct udphdr *uh;
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 10cce0dd4450..b7102e0f41a9 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -69,6 +69,13 @@ typedef struct sk_buff **(*udp_tunnel_gro_receive_t)(struct sock *sk,
 struct sk_buff *skb);
 typedef int (*udp_tunnel_gro_complete_t)(struct sock *sk, struct sk_buff *skb,
 int nhoff);
+typedef enum flow_dissect_ret (*udp_tunnel_flow_dissect_t)(struct sock *sk,
+   const struct sk_buff *skb,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags);
 
 struct udp_tunnel_sock_cfg {
void *sk_user_data; /* user data used by encap_rcv call back */
@@ -78,6 +85,7 @@ struct udp_tunnel_sock_cfg {
udp_tunnel_encap_destroy_t encap_destroy;
udp_tunnel_gro_receive_t gro_receive;
udp_tunnel_gro_complete_t gro_complete;
+   udp_tunnel_flow_dissect_t flow_dissect;
 };
 
 /* Setup the given (UDP) sock to receive UDP encapsulated packets */
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index a744bb515455..fddf923ef433 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -335,11 +335,59 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
 
+enum flow_dissect_ret udp_flow_dissect(struct sk_buff *skb,
+   udp_lookup_t lookup,
+   struct flow_dissector_key_control *key_control,
+   struct flow_dissector *flow_dissector,
+   void *target_container, void *data,
+   __be16 *p_proto, u8 *p_ip_proto, int *p_nhoff,
+   int *p_hlen, unsigned int flags)
+{
+   enum flow_dissect_ret ret = FLOW_DISSECT_RET_CONTINUE;
+   struct udphdr *uh, _uh;
+   struct sock *sk;
+
+   uh = __skb_header_pointer(skb, *p_nhoff, sizeof(_uh), data,
+ *p_hlen, &_uh);
+   if (!uh)
+   return FLOW_DISSECT_RET_OUT_BAD;
+
+   rcu_r

[PATCH v3 net-next 0/8] flow_dissector: Protocol specific flow dissector offload

2017-09-28 Thread Tom Herbert
This patch set adds a new offload type to perform flow dissection for
specific protocols (either by EtherType or by IP protocol). This is
primarily useful to crack open UDP encapsulations (like VXLAN, GUE) for
the purposes of parsing the encapsulated packet.

Items in this patch set:
- Create new protocol case in __skb_dissect for ETH_P_TEB. This is based
  on the code in the GRE dissect function and the special handling in
  GRE can now be removed (it sets protocol to ETH_P_TEB and returns so
  goto proto_again is done)
- Add infrastructure for protocol specific flow dissection offload
- Add infrastructure to perform UDP flow dissection. Uses same model of
  GRO where a flow_dissect callback can be associated with a UDP
  socket
- Use the infrastructure to support flow dissection of VXLAN and GUE

Tested:

Forced RPS to call flow dissection for VXLAN, FOU, and GUE. Observed
that inner packet was being properly dissected.

v2: Add signed off

v3:
   - Make skb argument of flow dissector to be non const
   - Change UDP GRO to only do something if encap_needed static
 key is set
   - don't reference inet6_offloads or inet_offloads, get to
 them through ptype

Tom Herbert (8):
  flow_dissector: Change skbuf argument to be non const
  flow_dissector: Move ETH_P_TEB processing to main switch
  udp: Check static key udp_encap_needed in udp_gro_receive
  flow_dissector: Add protocol specific flow dissection offload
  ip: Add callbacks to flow dissection by IP protocol
  udp: flow dissector offload
  fou: Support flow dissection
  vxlan: support flow dissect

 drivers/net/vxlan.c  | 40 +
 include/linux/netdevice.h| 27 +
 include/linux/skbuff.h   | 12 
 include/linux/udp.h  |  8 +
 include/net/flow_dissector.h |  1 +
 include/net/ip_fib.h |  4 +--
 include/net/route.h  |  4 +--
 include/net/udp.h| 10 +++
 include/net/udp_tunnel.h |  8 +
 net/core/dev.c   | 65 
 net/core/flow_dissector.c| 71 +++-
 net/ipv4/af_inet.c   | 27 +
 net/ipv4/fib_semantics.c |  2 +-
 net/ipv4/fou.c   | 63 +++
 net/ipv4/route.c | 10 ---
 net/ipv4/udp.c   |  4 ++-
 net/ipv4/udp_offload.c   | 55 ++
 net/ipv4/udp_tunnel.c|  1 +
 net/ipv6/ip6_offload.c   | 27 +
 net/ipv6/udp_offload.c   | 23 ++
 net/sched/sch_sfq.c  |  2 +-
 21 files changed, 419 insertions(+), 45 deletions(-)

-- 
2.11.0



linux-next: Signed-off-by missing for commit in the net-next tree

2017-09-28 Thread Stephen Rothwell
Hi all,

Commit

  8f1975e31d8e ("inetpeer: speed up inetpeer_invalidate_tree()")

is missing a Signed-off-by from its author.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH net-next] libbpf: use map_flags when creating maps

2017-09-28 Thread Daniel Borkmann

On 09/28/2017 07:33 PM, Craig Gallek wrote:

On Wed, Sep 27, 2017 at 6:03 PM, Daniel Borkmann  wrote:

On 09/27/2017 06:29 PM, Alexei Starovoitov wrote:

On 9/27/17 7:04 AM, Craig Gallek wrote:

From: Craig Gallek 

[...]


yes it will break loading of pre-compiled .o
Instead of breaking, let's fix the loader to do it the way
samples/bpf/bpf_load.c does.
See commit 156450d9d964 ("samples/bpf: make bpf_load.c code compatible
with ELF maps section changes")


+1, iproute2 loader also does map spec fixup

For libbpf it would be good also such that it reduces the diff
further between the libbpf and bpf_load so that it allows move
to libbpf for samples in future.


Fair enough, I'll try to get this to work more dynamically.  I did
notice that the fields of struct bpf_map_def in
selftests/.../bpf_helpers.h and iproute2's struct bpf_elf_map have
diverged. The flags field is the only thing missing from libbpf right
now (and they are at the same offset for both), so it won't be an
issue for this change, but it is going to make unifying all of these
things under libbpf not trivial at some point...


Yes, iproute2 uses its own loader with own specifics related to
iproute2. With the above I rather meant that we can reduce the
gap between libbpf and bpf_load from the BPF samples, so that
it would allow us to migrate samples over entirely, there was
unfortunately never a follow-up on 156450d9d964 though, but given
there is need to extend libbpf (I guess mostly due to lpm map
handling ;)), we need to do a similar thing there as well to not
break existing obj files.


Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Waskiewicz Jr, Peter
On 9/28/17 2:23 PM, John Fastabend wrote:
> [...]
> 
>> I'm pretty sure I misunderstood what you were going after with
>> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a
>> few weeks) will populate the headroom coming out of the driver to XDP,
>> and then once the XDP program extracts whatever hints it wants via
>> helpers, I fully expect that area in the headroom to get stomped by
>> something else.  If we want to send any of that hint data up farther,
>> we'll already have it extracted via the helpers, and the eBPF program
>> can happily assign it to wherever in the outbound metadata area.
> 
> In case its not obvious with the latest xdp metadata patches the outbound
> metadata can then be pushed into skb fields via a tc_cls program if needed.

Yes, that was what I was alluding to with "can happily assign it to 
wherever."  The patches we're working on are driver->XDP, then anything 
else using the latest meta-data patches would be XDP->anywhere else.  So 
I don't think we're going to step on any toes.

Thanks John,
-PJ


Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Daniel Borkmann

On 09/28/2017 10:52 PM, Waskiewicz Jr, Peter wrote:

On 9/28/17 12:59 PM, Andy Gospodarek wrote:

On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
 wrote:

On 9/26/17 10:21 AM, Andy Gospodarek wrote:

On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:

On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
[...]

First, thanks for this detailed description.  It was helpful to read
along with the patches.

My only concern about this area being generic is that you are now in a
state where any bpf program must know about all the bpf programs in the
receive pipeline before it can properly parse what is stored in the
meta-data and add it to an skb (or perform any other action).
Especially if each program adds its own meta-data along the way.

Maybe this isn't a big concern based on the number of users of this
today, but it just starts to seem like a concern as there are these
hints being passed between layers that are challenging to track due to a
lack of a standard format for passing data between.


Btw, we do have similar kind of programmable scratch buffer also today
wrt skb cb[] that you can program from tc side, the perf ring buffer,
which doesn't have any fixed layout for the slots, or a per-cpu map
where you can transfer data between tail calls for example, then tail
calls themselves that need to coordinate, or simply mangling of packets
itself if you will, but more below to your use case ...


The main reason I bring this up is that Michael and I had discussed and
designed a way for drivers to communicate between each other that rx
resources could be freed after a tx completion on an XDP_REDIRECT
action.  Much like this code, it involved adding an new element to
struct xdp_md that could point to the important information.  Now that
there is a generic way to handle this, it would seem nice to be able to
leverage it, but I'm not sure how reliable this meta-data area would be
without the ability to mark it in some manner.

For additional background, the minimum amount of data needed in the case
Michael and I were discussing was really 2 words.  One to serve as a
pointer to an rx_ring structure and one to have a counter to the rx
producer entry.  This data could be accessed by the driver processing the
tx completions and callback to the driver that received the frame off the wire
to perform any needed processing.  (For those curious this would also require a
new callback/netdev op to act on this data stored in the XDP buffer.)


What you describe above doesn't seem to be fitting to the use-case of
this set, meaning the area here is fully programmable out of the BPF
program, the infrastructure you're describing is some sort of means of
communication between drivers for the XDP_REDIRECT, and should be
outside of the control of the BPF program to mangle.


OK, I understand that perspective.  I think saying this is really meant
as a BPF<->BPF communication channel for now is fine.


You could probably reuse the base infra here and make a part of that
inaccessible for the program with some sort of a fixed layout, but I
haven't seen your code yet to be able to fully judge. Intention here
is to allow for programmability within the BPF prog in a generic way,
such that based on the use-case it can be populated in specific ways
and propagated to the skb w/o having to define a fixed layout and
bloat xdp_buff all the way to an skb while still retaining all the
flexibility.


Some level of reuse might be proper, but I'd rather it be explicit for
my use since it's not exclusively something that will need to be used by
a BPF prog, but rather the driver.  I'll produce some patches this week
for reference.


Sorry for chiming in late, I've been offline.

We're looking to add some functionality from driver to XDP inside this
xdp_buff->data_meta region.  We want to assign it to an opaque
structure, that would be specific per driver (think of a flex descriptor
coming out of the hardware).  We'd like to pass these offloaded
computations into XDP programs to help accelerate them, such as packet
type, where headers are located, etc.  It's similar to Jesper's RFC
patches back in May when passing through the mlx Rx descriptor to XDP.

This is actually what a few of us are planning to present at NetDev 2.2
in November.  If you're hoping to restrict this headroom in the xdp_buff
for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
that further.


No sweat, PJ, thanks for replying.  I saw the notes for your accepted
session and I'm looking forward to it.

John's suggestion earlier in the thread was actually similar to the
conclusion I reached when thinking about Daniel's patch a bit more.
(I like John's better though as it doesn't get constrained by UAPI.)
Since redirect actions happen at a point where no other programs will
run on the buffer, that space can be used for this redirect data and
there are no conflicts.


Yep fully agree, it's not read anywhere else anymore or could go up
the stack where we'd 

Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread John Fastabend
[...]

> I'm pretty sure I misunderstood what you were going after with 
> XDP_REDIRECT reserving the headroom.  Our use case (patches coming in a 
> few weeks) will populate the headroom coming out of the driver to XDP, 
> and then once the XDP program extracts whatever hints it wants via 
> helpers, I fully expect that area in the headroom to get stomped by 
> something else.  If we want to send any of that hint data up farther, 
> we'll already have it extracted via the helpers, and the eBPF program 
> can happily assign it to wherever in the outbound metadata area.

In case its not obvious with the latest xdp metadata patches the outbound
metadata can then be pushed into skb fields via a tc_cls program if needed.

.John

> 
>> (There's also Jesper's series from today -- I've seen it but have not
>> had time to fully grok all of those changes.)
> 
> I'm also working through my inbox to get to that series.  I have some 
> email to catch up on...
> 
> Thanks Andy,
> -PJ
> 



[PATCH net-next] ibmvnic: Set state UP

2017-09-28 Thread Mick Tarsel
State is initially reported as UNKNOWN. Call netif_carrier_off() before
registering the netdev. Once the device is opened, call netif_carrier_on()
in order to set the state to UP.

Signed-off-by: Mick Tarsel 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index cb8182f..4bc14a9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -927,6 +927,7 @@ static int ibmvnic_open(struct net_device *netdev)
}
 
rc = __ibmvnic_open(netdev);
+   netif_carrier_on(netdev);
mutex_unlock(&adapter->reset_lock);
 
return rc;
@@ -3899,6 +3900,7 @@ static int ibmvnic_probe(struct vio_dev *dev, const struct vio_device_id *id)
if (rc)
goto ibmvnic_init_fail;
 
+   netif_carrier_off(netdev);
rc = register_netdev(netdev);
if (rc) {
dev_err(&dev->dev, "failed to register netdev rc=%d\n", rc);
-- 
1.8.3.1



Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Waskiewicz Jr, Peter
On 9/28/17 12:59 PM, Andy Gospodarek wrote:
> On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
>  wrote:
>> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
 On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
 [...]
> First, thanks for this detailed description.  It was helpful to read
> along with the patches.
>
> My only concern about this area being generic is that you are now in a
> state where any bpf program must know about all the bpf programs in the
> receive pipeline before it can properly parse what is stored in the
> meta-data and add it to an skb (or perform any other action).
> Especially if each program adds its own meta-data along the way.
>
> Maybe this isn't a big concern based on the number of users of this
> today, but it just starts to seem like a concern as there are these
> hints being passed between layers that are challenging to track due to a
> lack of a standard format for passing data between.

 Btw, we do have similar kind of programmable scratch buffer also today
 wrt skb cb[] that you can program from tc side, the perf ring buffer,
 which doesn't have any fixed layout for the slots, or a per-cpu map
 where you can transfer data between tail calls for example, then tail
 calls themselves that need to coordinate, or simply mangling of packets
 itself if you will, but more below to your use case ...

> The main reason I bring this up is that Michael and I had discussed and
> designed a way for drivers to communicate between each other that rx
> resources could be freed after a tx completion on an XDP_REDIRECT
> action.  Much like this code, it involved adding an new element to
> struct xdp_md that could point to the important information.  Now that
> there is a generic way to handle this, it would seem nice to be able to
> leverage it, but I'm not sure how reliable this meta-data area would be
> without the ability to mark it in some manner.
>
> For additional background, the minimum amount of data needed in the case
> Michael and I were discussing was really 2 words.  One to serve as a
> pointer to an rx_ring structure and one to have a counter to the rx
> producer entry.  This data could be accessed by the driver processing the
> tx completions and callback to the driver that received the frame off the
> wire to perform any needed processing.  (For those curious this would also
> require a new callback/netdev op to act on this data stored in the XDP
> buffer.)

 What you describe above doesn't seem to be fitting to the use-case of
 this set, meaning the area here is fully programmable out of the BPF
 program, the infrastructure you're describing is some sort of means of
 communication between drivers for the XDP_REDIRECT, and should be
 outside of the control of the BPF program to mangle.
>>>
>>> OK, I understand that perspective.  I think saying this is really meant
>>> as a BPF<->BPF communication channel for now is fine.
>>>
 You could probably reuse the base infra here and make a part of that
 inaccessible for the program with some sort of a fixed layout, but I
 haven't seen your code yet to be able to fully judge. Intention here
 is to allow for programmability within the BPF prog in a generic way,
 such that based on the use-case it can be populated in specific ways
 and propagated to the skb w/o having to define a fixed layout and
 bloat xdp_buff all the way to an skb while still retaining all the
 flexibility.
>>>
>>> Some level of reuse might be proper, but I'd rather it be explicit for
>>> my use since it's not exclusively something that will need to be used by
>>> a BPF prog, but rather the driver.  I'll produce some patches this week
>>> for reference.
>>
>> Sorry for chiming in late, I've been offline.
>>
>> We're looking to add some functionality from driver to XDP inside this
>> xdp_buff->data_meta region.  We want to assign it to an opaque
>> structure, that would be specific per driver (think of a flex descriptor
>> coming out of the hardware).  We'd like to pass these offloaded
>> computations into XDP programs to help accelerate them, such as packet
>> type, where headers are located, etc.  It's similar to Jesper's RFC
>> patches back in May when passing through the mlx Rx descriptor to XDP.
>>
>> This is actually what a few of us are planning to present at NetDev 2.2
>> in November.  If you're hoping to restrict this headroom in the xdp_buff
>> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
>> that further.
>>
> 
> No sweat, PJ, thanks for replying.  I saw the notes for your accepted
> session and I'm looking forward to it.
> 
> John's suggestion earlier in the thread was actually similar to the
> conclusion I reached when thinking about Daniel's patch

[PATCH net-next 07/10] sctp: add sockopt to get/set stream scheduler

2017-09-28 Thread Marcelo Ricardo Leitner
As defined per RFC Draft ndata Section 4.3.2, named as
SCTP_STREAM_SCHEDULER.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner 
---
 include/uapi/linux/sctp.h |  1 +
 net/sctp/socket.c | 75 +++
 2 files changed, 76 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 4487e7625ddbd48be1868a8292a807ecd0a314bc..0050f10087d224bad87c8c54ad318003381aee12 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -122,6 +122,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_RESET_ASSOC   120
 #define SCTP_ADD_STREAMS   121
 #define SCTP_SOCKOPT_PEELOFF_FLAGS 122
+#define SCTP_STREAM_SCHEDULER  123
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE  0x
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d207734326b085e60625e4333f74221481114892..ae35dbf2810f78c71ce77115ffe4b0e27a672abc 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -79,6 +79,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Forward declarations for internal helper functions. */
 static int sctp_writeable(struct sock *sk);
@@ -3914,6 +3915,36 @@ static int sctp_setsockopt_add_streams(struct sock *sk,
return retval;
 }
 
+static int sctp_setsockopt_scheduler(struct sock *sk,
+char __user *optval,
+unsigned int optlen)
+{
+   struct sctp_association *asoc;
+   struct sctp_assoc_value params;
+   int retval = -EINVAL;
+
+   if (optlen < sizeof(params))
+   goto out;
+
+   optlen = sizeof(params);
+   if (copy_from_user(&params, optval, optlen)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   if (params.assoc_value > SCTP_SS_MAX)
+   goto out;
+
+   asoc = sctp_id2assoc(sk, params.assoc_id);
+   if (!asoc)
+   goto out;
+
+   retval = sctp_sched_set_sched(asoc, params.assoc_value);
+
+out:
+   return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4095,6 +4126,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
case SCTP_ADD_STREAMS:
retval = sctp_setsockopt_add_streams(sk, optval, optlen);
break;
+   case SCTP_STREAM_SCHEDULER:
+   retval = sctp_setsockopt_scheduler(sk, optval, optlen);
+   break;
default:
retval = -ENOPROTOOPT;
break;
@@ -6793,6 +6827,43 @@ static int sctp_getsockopt_enable_strreset(struct sock *sk, int len,
return retval;
 }
 
+static int sctp_getsockopt_scheduler(struct sock *sk, int len,
+char __user *optval,
+int __user *optlen)
+{
+   struct sctp_assoc_value params;
+   struct sctp_association *asoc;
+   int retval = -EFAULT;
+
+   if (len < sizeof(params)) {
+   retval = -EINVAL;
+   goto out;
+   }
+
+   len = sizeof(params);
+   if (copy_from_user(&params, optval, len))
+   goto out;
+
+   asoc = sctp_id2assoc(sk, params.assoc_id);
+   if (!asoc) {
+   retval = -EINVAL;
+   goto out;
+   }
+
+   params.assoc_value = sctp_sched_get_sched(asoc);
+
+   if (put_user(len, optlen))
+   goto out;
+
+   if (copy_to_user(optval, &params, len))
+   goto out;
+
+   retval = 0;
+
+out:
+   return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
   char __user *optval, int __user *optlen)
 {
@@ -6975,6 +7046,10 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
retval = sctp_getsockopt_enable_strreset(sk, len, optval,
 optlen);
break;
+   case SCTP_STREAM_SCHEDULER:
+   retval = sctp_getsockopt_scheduler(sk, len, optval,
+  optlen);
+   break;
default:
retval = -ENOPROTOOPT;
break;
-- 
2.13.5



[PATCH net-next 10/10] sctp: introduce round robin stream scheduler

2017-09-28 Thread Marcelo Ricardo Leitner
This patch introduces RFC Draft ndata section 3.2 Round Robin
Scheduler (SCTP_SS_RR).

Works by maintaining a list of enqueued streams and tracking the last
one used to send data. When the datamsg is done, it switches to the next
stream.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner 
---
 include/net/sctp/structs.h |  11 +++
 include/uapi/linux/sctp.h  |   3 +-
 net/sctp/Makefile  |   3 +-
 net/sctp/stream_sched.c|   2 +
 net/sctp/stream_sched_rr.c | 201 +
 5 files changed, 218 insertions(+), 2 deletions(-)
 create mode 100644 net/sctp/stream_sched_rr.c

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 40eb8d66a37c3ecee39141dc111663f7aac7326a..16f949eef52fdfd7c90fa15b44093334d1355aaf 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1348,6 +1348,10 @@ struct sctp_stream_out_ext {
struct list_head prio_list;
struct sctp_stream_priorities *prio_head;
};
+   /* Fields used by RR scheduler */
+   struct {
+   struct list_head rr_list;
+   };
};
 };
 
@@ -1374,6 +1378,13 @@ struct sctp_stream {
/* List of priorities scheduled */
struct list_head prio_list;
};
+   /* Fields used by RR scheduler */
+   struct {
+   /* List of streams scheduled */
+   struct list_head rr_list;
+   /* The next stream in line */
+   struct sctp_stream_out_ext *rr_next;
+   };
};
 };
 
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 850fa8b29d7e8163dc4ee88af192309bb2535ae9..6cd7d416ca406e59d3214976fc425bb805f5c6cc 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1100,7 +1100,8 @@ struct sctp_add_streams {
 enum sctp_sched_type {
SCTP_SS_FCFS,
SCTP_SS_PRIO,
-   SCTP_SS_MAX = SCTP_SS_PRIO
+   SCTP_SS_RR,
+   SCTP_SS_MAX = SCTP_SS_RR
 };
 
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 647c9cfd4e95be4429d25792e5832d7be2efc5c8..bf90c53977190ff563c2b43af31afb7c431d4534 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,8 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
  inqueue.o outqueue.o ulpqueue.o \
  tsnmap.o bind_addr.o socket.o primitive.o \
  output.o input.o debug.o stream.o auth.o \
- offload.o stream_sched.o stream_sched_prio.o
+ offload.o stream_sched.o stream_sched_prio.o \
+ stream_sched_rr.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index 115ddb7651695cca7417cb63004a1a59c93523b8..03513a9fa110b5317af4502f98ab37702c1eddb9 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -122,10 +122,12 @@ static struct sctp_sched_ops sctp_sched_fcfs = {
 /* API to other parts of the stack */
 
 extern struct sctp_sched_ops sctp_sched_prio;
+extern struct sctp_sched_ops sctp_sched_rr;
 
 struct sctp_sched_ops *sctp_sched_ops[] = {
&sctp_sched_fcfs,
&sctp_sched_prio,
+   &sctp_sched_rr,
 };
 
 int sctp_sched_set_sched(struct sctp_association *asoc,
diff --git a/net/sctp/stream_sched_rr.c b/net/sctp/stream_sched_rr.c
new file mode 100644
index ..7612a438c5b939ae1c26c4acc06902749b601524
--- /dev/null
+++ b/net/sctp/stream_sched_rr.c
@@ -0,0 +1,201 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * This file is part of the SCTP kernel implementation
+ *
+ * These functions manipulate sctp stream queue/scheduling.
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * .
+ *
+ * Please send any bug reports or fixes you make to the
+ * email address(es):
+ *lksctp developers 
+ *
+ * Written or modified by:
+ *Marcelo Ricardo Leitner 
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+/* Priority handling
+ * RFC DRAFT ndata section 3.2
+ */
+static void sctp_sched_rr_unsched_all(struct sctp_stream *stream);
+
+static void s

[PATCH net-next 04/10] sctp: introduce struct sctp_stream_out_ext

2017-09-28 Thread Marcelo Ricardo Leitner
With the stream schedulers, sctp_stream_out will become too big to be
allocated by kmalloc and as we need to allocate with BH disabled, we
cannot use __vmalloc in sctp_stream_init().

This patch moves out the stats from sctp_stream_out to
sctp_stream_out_ext, which will be allocated only when the application
tries to sendmsg something on it.

Just the introduction of sctp_stream_out_ext would already fix the issue
described above by splitting the allocation in two. Moving the stats
to it also reduces the pressure on the allocator as we will ask for less
memory atomically when creating the socket and we will use GFP_KERNEL
later.

Then, for stream schedulers, we will just use sctp_stream_out_ext.

Signed-off-by: Marcelo Ricardo Leitner 
---
 include/net/sctp/structs.h | 10 --
 net/sctp/chunk.c   |  6 +++---
 net/sctp/outqueue.c|  4 ++--
 net/sctp/socket.c  | 27 +--
 net/sctp/stream.c  | 16 
 5 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 0477945de1a3cf5c27348e99d9a30e02c491d1de..9b2b30b3ba4dfd10c24c3e06ed80779180a06baf 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -84,6 +84,7 @@ struct sctp_ulpq;
 struct sctp_ep_common;
 struct crypto_shash;
 struct sctp_stream;
+struct sctp_stream_out;
 
 
 #include 
@@ -380,6 +381,7 @@ struct sctp_sender_hb_info {
 
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 gfp_t gfp);
+int sctp_stream_init_ext(struct sctp_stream *stream, __u16 sid);
 void sctp_stream_free(struct sctp_stream *stream);
 void sctp_stream_clear(struct sctp_stream *stream);
 void sctp_stream_update(struct sctp_stream *stream, struct sctp_stream *new);
@@ -1315,11 +1317,15 @@ struct sctp_inithdr_host {
__u32 initial_tsn;
 };
 
+struct sctp_stream_out_ext {
+   __u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
+   __u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+};
+
 struct sctp_stream_out {
__u16   ssn;
__u8state;
-   __u64   abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
-   __u64   abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
+   struct sctp_stream_out_ext *ext;
 };
 
 struct sctp_stream_in {
diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
index 3afac275ee82dbec825dd71378dffe69a53718a7..7b261afc47b9d709fdd780a93aaba874f35d79be 100644
--- a/net/sctp/chunk.c
+++ b/net/sctp/chunk.c
@@ -311,10 +311,10 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
 
if (chunk->sent_count) {
chunk->asoc->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
-   streamout->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
+   streamout->ext->abandoned_sent[SCTP_PR_INDEX(TTL)]++;
} else {
chunk->asoc->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
-   streamout->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
+   streamout->ext->abandoned_unsent[SCTP_PR_INDEX(TTL)]++;
}
return 1;
} else if (SCTP_PR_RTX_ENABLED(chunk->sinfo.sinfo_flags) &&
@@ -323,7 +323,7 @@ int sctp_chunk_abandoned(struct sctp_chunk *chunk)
&chunk->asoc->stream.out[chunk->sinfo.sinfo_stream];
 
chunk->asoc->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
-   streamout->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
+   streamout->ext->abandoned_sent[SCTP_PR_INDEX(RTX)]++;
return 1;
} else if (!SCTP_PR_POLICY(chunk->sinfo.sinfo_flags) &&
   chunk->msg->expires_at &&
diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 2966ff400755fe93e3658e09d3bb44b9d7d19d2e..746b07b7937d8730824b9e09917d947aa7863ec6 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -366,7 +366,7 @@ static int sctp_prsctp_prune_sent(struct sctp_association *asoc,
streamout = &asoc->stream.out[chk->sinfo.sinfo_stream];
asoc->sent_cnt_removable--;
asoc->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
-   streamout->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
+   streamout->ext->abandoned_sent[SCTP_PR_INDEX(PRIO)]++;
 
if (!chk->tsn_gap_acked) {
if (chk->transport)
@@ -404,7 +404,7 @@ static int sctp_prsctp_prune_unsent(struct sctp_association *asoc,
struct sctp_stream_out *streamout =
&asoc->stream.out[chk->sinfo.sinfo_stream];
 
-   streamout->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
+   streamout->ext->abandoned_unsent[SCTP_PR_INDEX(PRIO)]++;
}
 
msg_len -= SCTP_DATA_SNDSIZE(chk) +
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d4730ada7f3233367be7a0e3bb10e286a25602c8..d207734326b085e60625e4333f74221481114892 100644
-

[PATCH net-next 02/10] sctp: factor out stream->out allocation

2017-09-28 Thread Marcelo Ricardo Leitner
There is one place allocating it and two others reallocating it. Move
these procedures into a common function.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/stream.c | 52 
 1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 1afa9555808390d5fc736727422d9700a3855613..6d0e997d301f89e165367106c02e82f8a6c3a877 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -35,6 +35,30 @@
 #include 
 #include 
 
+static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
+gfp_t gfp)
+{
+   struct sctp_stream_out *out;
+
+   out = kmalloc_array(outcnt, sizeof(*out), gfp);
+   if (!out)
+   return -ENOMEM;
+
+   if (stream->out) {
+   memcpy(out, stream->out, min(outcnt, stream->outcnt) *
+sizeof(*out));
+   kfree(stream->out);
+   }
+
+   if (outcnt > stream->outcnt)
+   memset(out + stream->outcnt, 0,
+  (outcnt - stream->outcnt) * sizeof(*out));
+
+   stream->out = out;
+
+   return 0;
+}
+
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 gfp_t gfp)
 {
@@ -48,11 +72,9 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
if (outcnt == stream->outcnt)
goto in;
 
-   kfree(stream->out);
-
-   stream->out = kcalloc(outcnt, sizeof(*stream->out), gfp);
-   if (!stream->out)
-   return -ENOMEM;
+   i = sctp_stream_alloc_out(stream, outcnt, gfp);
+   if (i)
+   return i;
 
stream->outcnt = outcnt;
for (i = 0; i < stream->outcnt; i++)
@@ -276,15 +298,9 @@ int sctp_send_add_streams(struct sctp_association *asoc,
}
 
if (out) {
-   struct sctp_stream_out *streamout;
-
-   streamout = krealloc(stream->out, outcnt * sizeof(*streamout),
-GFP_KERNEL);
-   if (!streamout)
+   retval = sctp_stream_alloc_out(stream, outcnt, GFP_KERNEL);
+   if (retval)
goto out;
-
-   memset(streamout + stream->outcnt, 0, out * sizeof(*streamout));
-   stream->out = streamout;
}
 
chunk = sctp_make_strreset_addstrm(asoc, out, in);
@@ -682,10 +698,10 @@ struct sctp_chunk *sctp_process_strreset_addstrm_in(
struct sctp_strreset_addstrm *addstrm = param.v;
struct sctp_stream *stream = &asoc->stream;
__u32 result = SCTP_STRRESET_DENIED;
-   struct sctp_stream_out *streamout;
struct sctp_chunk *chunk = NULL;
__u32 request_seq, outcnt;
__u16 out, i;
+   int ret;
 
request_seq = ntohl(addstrm->request_seq);
if (TSN_lt(asoc->strreset_inseq, request_seq) ||
@@ -714,14 +730,10 @@ struct sctp_chunk *sctp_process_strreset_addstrm_in(
if (!out || outcnt > SCTP_MAX_STREAM)
goto out;
 
-   streamout = krealloc(stream->out, outcnt * sizeof(*streamout),
-GFP_ATOMIC);
-   if (!streamout)
+   ret = sctp_stream_alloc_out(stream, outcnt, GFP_ATOMIC);
+   if (ret)
goto out;
 
-   memset(streamout + stream->outcnt, 0, out * sizeof(*streamout));
-   stream->out = streamout;
-
chunk = sctp_make_strreset_addstrm(asoc, out, 0);
if (!chunk)
goto out;
-- 
2.13.5



[PATCH net-next 06/10] sctp: introduce stream scheduler foundations

2017-09-28 Thread Marcelo Ricardo Leitner
This patch introduces the hooks necessary to do stream scheduling, as
per RFC Draft ndata.  It also introduces the first scheduler, which is
what we do today but now factored out: first come first served (FCFS).

With stream scheduling we now have to track which chunk was enqueued on
which stream and be able to select a chunk other than the one at the
front of the main outqueue. So we introduce a list on the
sctp_stream_out_ext structure for this purpose.

We reuse sctp_chunk->transmitted_list space for the list above, as the
chunk cannot belong to the two lists at the same time. By using the
union in there, we can have distinct names for these moments.

sctp_sched_ops are the operations expected to be implemented by each
scheduler. The dequeueing is a bit particular to this implementation,
but it matches how we dequeue packets today: we first dequeue, then
check whether the chunk fits the packet and, if not, requeue it at the
head. That is why we don't have a peek operation but a dequeue_done
instead, which is called once the chunk can safely be considered
transmitted.

The check removed from sctp_outq_flush is now performed by
sctp_stream_outq_migrate, which is only called during assoc setup.
(sctp_sendmsg() also checks for it)

The only operation that is foreseen but not yet added here is a way to
signalize that a new packet is starting or that the packet is done, for
round robin scheduler per packet, but is intentionally left to the
patch that actually implements it.

Support for IDATA chunks, also described in this RFC, with user message
interleaving is straightforward as it just requires the schedulers to
probe for the feature and ignore datamsg boundaries when dequeueing.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner 
---
 include/net/sctp/stream_sched.h |  72 +++
 include/net/sctp/structs.h  |  15 ++-
 include/uapi/linux/sctp.h   |   6 +
 net/sctp/Makefile   |   2 +-
 net/sctp/outqueue.c |  59 +
 net/sctp/sm_sideeffect.c|   3 +
 net/sctp/stream.c   |  88 +++--
 net/sctp/stream_sched.c | 270 
 8 files changed, 477 insertions(+), 38 deletions(-)
 create mode 100644 include/net/sctp/stream_sched.h
 create mode 100644 net/sctp/stream_sched.c

diff --git a/include/net/sctp/stream_sched.h b/include/net/sctp/stream_sched.h
new file mode 100644
index ..c676550a4c7dd0ea27ac0e14437d0a2b451ef499
--- /dev/null
+++ b/include/net/sctp/stream_sched.h
@@ -0,0 +1,72 @@
+/* SCTP kernel implementation
+ * (C) Copyright Red Hat Inc. 2017
+ *
+ * These are definitions used by the stream schedulers, defined in RFC
+ * draft ndata (https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-11)
+ *
+ * This SCTP implementation is free software;
+ * you can redistribute it and/or modify it under the terms of
+ * the GNU General Public License as published by
+ * the Free Software Foundation; either version 2, or (at your option)
+ * any later version.
+ *
+ * This SCTP implementation  is distributed in the hope that it
+ * will be useful, but WITHOUT ANY WARRANTY; without even the implied
+ * warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ * See the GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GNU CC; see the file COPYING.  If not, see
+ * .
+ *
+ * Please send any bug reports or fixes you make to the
+ * email addresses:
+ *lksctp developers 
+ *
+ * Written or modified by:
+ *   Marcelo Ricardo Leitner 
+ */
+
+#ifndef __sctp_stream_sched_h__
+#define __sctp_stream_sched_h__
+
+struct sctp_sched_ops {
+   /* Property handling for a given stream */
+   int (*set)(struct sctp_stream *stream, __u16 sid, __u16 value,
+  gfp_t gfp);
+   int (*get)(struct sctp_stream *stream, __u16 sid, __u16 *value);
+
+   /* Init the specific scheduler */
+   int (*init)(struct sctp_stream *stream);
+   /* Init a stream */
+   int (*init_sid)(struct sctp_stream *stream, __u16 sid, gfp_t gfp);
+   /* Frees the entire thing */
+   void (*free)(struct sctp_stream *stream);
+
+   /* Enqueue a chunk */
+   void (*enqueue)(struct sctp_outq *q, struct sctp_datamsg *msg);
+   /* Dequeue a chunk */
+   struct sctp_chunk *(*dequeue)(struct sctp_outq *q);
+   /* Called only if the chunk fit the packet */
+   void (*dequeue_done)(struct sctp_outq *q, struct sctp_chunk *chunk);
+   /* Sched all chunks already enqueued */
+   void (*sched_all)(struct sctp_stream *stream);
+   /* Unsched all chunks already enqueued */
+   void (*unsched_all)(struct sctp_stream *stream);
+};
+
+int sctp_sched_set_sched(struct sctp_association *asoc,
+enum sctp_sched_type sche

[PATCH net-next 09/10] sctp: introduce priority based stream scheduler

2017-09-28 Thread Marcelo Ricardo Leitner
This patch introduces RFC Draft ndata section 3.4 Priority Based
Scheduler (SCTP_SS_PRIO).

It works by having a struct sctp_stream_priorities for each priority
configured. This struct is then enlisted on a queue ordered by priority
if, and only if, there is a stream with data queued, so that dequeueing
is very straightforward: either finish the current datamsg or simply
dequeue from the highest priority queued, which points at the next
stream in line, and that's it.

If there are multiple streams assigned with the same priority and with
data queued, it will do round robin amongst them while respecting
datamsgs boundaries (when not using idata chunks), to be reasonably
fair.

We intentionally don't maintain a list of priorities nor a list of all
streams with the same priority to save memory. The first would mean at
least 2 other pointers per priority (which, for 1000 priorities, can
mean 16kB) and the second would also mean 2 other pointers but per
stream. As SCTP supports up to 65535 streams on a given asoc, that's
1MB. This impacts when giving a priority to some stream, as we have to
find out if the new priority is already being used and if we can free
the old one, and also when tearing down.

The new fields in struct sctp_stream_out_ext and sctp_stream are added
under a union because that memory is to be shared with other schedulers.
It could be defined as an opaque area like skb->cb, but that would make
the list handling a nightmare.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner 
---
 include/net/sctp/structs.h   |  24 +++
 include/uapi/linux/sctp.h|   3 +-
 net/sctp/Makefile|   2 +-
 net/sctp/stream_sched.c  |   3 +
 net/sctp/stream_sched_prio.c | 347 +++
 5 files changed, 377 insertions(+), 2 deletions(-)
 create mode 100644 net/sctp/stream_sched_prio.c

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 3c22a30fd71b4ef87419a77cf69b00807a5986bb..40eb8d66a37c3ecee39141dc111663f7aac7326a 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -1328,10 +1328,27 @@ struct sctp_inithdr_host {
__u32 initial_tsn;
 };
 
+struct sctp_stream_priorities {
+   /* List of priorities scheduled */
+   struct list_head prio_sched;
+   /* List of streams scheduled */
+   struct list_head active;
+   /* The next stream in line */
+   struct sctp_stream_out_ext *next;
+   __u16 prio;
+};
+
 struct sctp_stream_out_ext {
__u64 abandoned_unsent[SCTP_PR_INDEX(MAX) + 1];
__u64 abandoned_sent[SCTP_PR_INDEX(MAX) + 1];
struct list_head outq; /* chunks enqueued by this stream */
+   union {
+   struct {
+   /* Scheduled streams list */
+   struct list_head prio_list;
+   struct sctp_stream_priorities *prio_head;
+   };
+   };
 };
 
 struct sctp_stream_out {
@@ -1351,6 +1368,13 @@ struct sctp_stream {
__u16 incnt;
/* Current stream being sent, if any */
struct sctp_stream_out *out_curr;
+   union {
+   /* Fields used by priority scheduler */
+   struct {
+   /* List of priorities scheduled */
+   struct list_head prio_list;
+   };
+   };
 };
 
 #define SCTP_STREAM_CLOSED 0x00
diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 00ac417d2c4f8468ea2aad32e59806be5c5aa08d..850fa8b29d7e8163dc4ee88af192309bb2535ae9 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -1099,7 +1099,8 @@ struct sctp_add_streams {
 /* SCTP Stream schedulers */
 enum sctp_sched_type {
SCTP_SS_FCFS,
-   SCTP_SS_MAX = SCTP_SS_FCFS
+   SCTP_SS_PRIO,
+   SCTP_SS_MAX = SCTP_SS_PRIO
 };
 
 #endif /* _UAPI_SCTP_H */
diff --git a/net/sctp/Makefile b/net/sctp/Makefile
index 0f6e6d1d69fd336b4a99f896851b0120f9a0d1e0..647c9cfd4e95be4429d25792e5832d7be2efc5c8 100644
--- a/net/sctp/Makefile
+++ b/net/sctp/Makefile
@@ -12,7 +12,7 @@ sctp-y := sm_statetable.o sm_statefuns.o sm_sideeffect.o \
  inqueue.o outqueue.o ulpqueue.o \
  tsnmap.o bind_addr.o socket.o primitive.o \
  output.o input.o debug.o stream.o auth.o \
- offload.o stream_sched.o
+ offload.o stream_sched.o stream_sched_prio.o
 
 sctp_probe-y := probe.o
 
diff --git a/net/sctp/stream_sched.c b/net/sctp/stream_sched.c
index 40a9a9de2b98a56786a4c8585f5ad514be9189af..115ddb7651695cca7417cb63004a1a59c93523b8 100644
--- a/net/sctp/stream_sched.c
+++ b/net/sctp/stream_sched.c
@@ -121,8 +121,11 @@ static struct sctp_sched_ops sctp_sched_fcfs = {
 
 /* API to other parts of the stack */
 
+extern struct sctp_sched_ops sctp_sched_prio;
+
 struct sctp_sched_ops *sctp_sched_ops[] = {
&sctp_sched_fcfs,
+   &sctp_sched_prio,
 };
 
 int sctp_sched_set_sched(struct sctp_asso

[PATCH net-next 08/10] sctp: add sockopt to get/set stream scheduler parameters

2017-09-28 Thread Marcelo Ricardo Leitner
As defined per RFC Draft ndata Section 4.3.3, named as
SCTP_STREAM_SCHEDULER_VALUE.

See-also: https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13
Signed-off-by: Marcelo Ricardo Leitner 
---
 include/uapi/linux/sctp.h |  7 +
 net/sctp/socket.c | 77 +++
 2 files changed, 84 insertions(+)

diff --git a/include/uapi/linux/sctp.h b/include/uapi/linux/sctp.h
index 0050f10087d224bad87c8c54ad318003381aee12..00ac417d2c4f8468ea2aad32e59806be5c5aa08d 100644
--- a/include/uapi/linux/sctp.h
+++ b/include/uapi/linux/sctp.h
@@ -123,6 +123,7 @@ typedef __s32 sctp_assoc_t;
 #define SCTP_ADD_STREAMS   121
 #define SCTP_SOCKOPT_PEELOFF_FLAGS 122
 #define SCTP_STREAM_SCHEDULER  123
+#define SCTP_STREAM_SCHEDULER_VALUE124
 
 /* PR-SCTP policies */
 #define SCTP_PR_SCTP_NONE  0x
@@ -815,6 +816,12 @@ struct sctp_assoc_value {
 uint32_tassoc_value;
 };
 
+struct sctp_stream_value {
+   sctp_assoc_t assoc_id;
+   uint16_t stream_id;
+   uint16_t stream_value;
+};
+
 /*
  * 7.2.2 Peer Address Information
  *
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index ae35dbf2810f78c71ce77115ffe4b0e27a672abc..88c28421ec151e83665efcbcbd8a6403b122205a 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -3945,6 +3945,34 @@ static int sctp_setsockopt_scheduler(struct sock *sk,
return retval;
 }
 
+static int sctp_setsockopt_scheduler_value(struct sock *sk,
+  char __user *optval,
+  unsigned int optlen)
+{
+   struct sctp_association *asoc;
+   struct sctp_stream_value params;
+   int retval = -EINVAL;
+
+   if (optlen < sizeof(params))
+   goto out;
+
+   optlen = sizeof(params);
+   if (copy_from_user(&params, optval, optlen)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   asoc = sctp_id2assoc(sk, params.assoc_id);
+   if (!asoc)
+   goto out;
+
+   retval = sctp_sched_set_value(asoc, params.stream_id,
+ params.stream_value, GFP_KERNEL);
+
+out:
+   return retval;
+}
+
 /* API 6.2 setsockopt(), getsockopt()
  *
  * Applications use setsockopt() and getsockopt() to set or retrieve
@@ -4129,6 +4157,9 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
case SCTP_STREAM_SCHEDULER:
retval = sctp_setsockopt_scheduler(sk, optval, optlen);
break;
+   case SCTP_STREAM_SCHEDULER_VALUE:
+   retval = sctp_setsockopt_scheduler_value(sk, optval, optlen);
+   break;
default:
retval = -ENOPROTOOPT;
break;
@@ -6864,6 +6895,48 @@ static int sctp_getsockopt_scheduler(struct sock *sk, int len,
return retval;
 }
 
+static int sctp_getsockopt_scheduler_value(struct sock *sk, int len,
+  char __user *optval,
+  int __user *optlen)
+{
+   struct sctp_stream_value params;
+   struct sctp_association *asoc;
+   int retval = -EFAULT;
+
+   if (len < sizeof(params)) {
+   retval = -EINVAL;
+   goto out;
+   }
+
+   len = sizeof(params);
+   if (copy_from_user(&params, optval, len))
+   goto out;
+
+   asoc = sctp_id2assoc(sk, params.assoc_id);
+   if (!asoc) {
+   retval = -EINVAL;
+   goto out;
+   }
+
+   retval = sctp_sched_get_value(asoc, params.stream_id,
+ &params.stream_value);
+   if (retval)
+   goto out;
+
+   if (put_user(len, optlen)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+   if (copy_to_user(optval, &params, len)) {
+   retval = -EFAULT;
+   goto out;
+   }
+
+out:
+   return retval;
+}
+
 static int sctp_getsockopt(struct sock *sk, int level, int optname,
   char __user *optval, int __user *optlen)
 {
@@ -7050,6 +7123,10 @@ static int sctp_getsockopt(struct sock *sk, int level, int optname,
retval = sctp_getsockopt_scheduler(sk, len, optval,
   optlen);
break;
+   case SCTP_STREAM_SCHEDULER_VALUE:
+   retval = sctp_getsockopt_scheduler_value(sk, len, optval,
+optlen);
+   break;
default:
retval = -ENOPROTOOPT;
break;
-- 
2.13.5



[PATCH net-next 03/10] sctp: factor out stream->in allocation

2017-09-28 Thread Marcelo Ricardo Leitner
Same idea as previous patch.

Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/stream.c | 36 
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 6d0e997d301f89e165367106c02e82f8a6c3a877..952437d656cc71ad1c133a736c539eff9a8d80c2 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -59,6 +59,31 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
return 0;
 }
 
+static int sctp_stream_alloc_in(struct sctp_stream *stream, __u16 incnt,
+   gfp_t gfp)
+{
+   struct sctp_stream_in *in;
+
+   in = kmalloc_array(incnt, sizeof(*stream->in), gfp);
+
+   if (!in)
+   return -ENOMEM;
+
+   if (stream->in) {
+   memcpy(in, stream->in, min(incnt, stream->incnt) *
+  sizeof(*in));
+   kfree(stream->in);
+   }
+
+   if (incnt > stream->incnt)
+   memset(in + stream->incnt, 0,
+  (incnt - stream->incnt) * sizeof(*in));
+
+   stream->in = in;
+
+   return 0;
+}
+
 int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 gfp_t gfp)
 {
@@ -84,8 +109,8 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
if (!incnt)
return 0;
 
-   stream->in = kcalloc(incnt, sizeof(*stream->in), gfp);
-   if (!stream->in) {
+   i = sctp_stream_alloc_in(stream, incnt, gfp);
+   if (i) {
kfree(stream->out);
stream->out = NULL;
return -ENOMEM;
@@ -623,7 +648,6 @@ struct sctp_chunk *sctp_process_strreset_addstrm_out(
struct sctp_strreset_addstrm *addstrm = param.v;
struct sctp_stream *stream = &asoc->stream;
__u32 result = SCTP_STRRESET_DENIED;
-   struct sctp_stream_in *streamin;
__u32 request_seq, incnt;
__u16 in, i;
 
@@ -670,13 +694,9 @@ struct sctp_chunk *sctp_process_strreset_addstrm_out(
if (!in || incnt > SCTP_MAX_STREAM)
goto out;
 
-   streamin = krealloc(stream->in, incnt * sizeof(*streamin),
-   GFP_ATOMIC);
-   if (!streamin)
+   if (sctp_stream_alloc_in(stream, incnt, GFP_ATOMIC))
goto out;
 
-   memset(streamin + stream->incnt, 0, in * sizeof(*streamin));
-   stream->in = streamin;
stream->incnt = incnt;
 
result = SCTP_STRRESET_PERFORMED;
-- 
2.13.5



[PATCH net-next 05/10] sctp: introduce sctp_chunk_stream_no

2017-09-28 Thread Marcelo Ricardo Leitner
Add a helper to fetch the stream number from a given chunk.

Signed-off-by: Marcelo Ricardo Leitner 
---
 include/net/sctp/structs.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 9b2b30b3ba4dfd10c24c3e06ed80779180a06baf..c48f7999fe9b80c5b5e41910a3608059b94140a7 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -642,6 +642,11 @@ void sctp_init_addrs(struct sctp_chunk *, union sctp_addr *,
 union sctp_addr *);
 const union sctp_addr *sctp_source(const struct sctp_chunk *chunk);
 
+static inline __u16 sctp_chunk_stream_no(struct sctp_chunk *ch)
+{
+   return ntohs(ch->subh.data_hdr->stream);
+}
+
 enum {
SCTP_ADDR_NEW,  /* new address added to assoc/ep */
SCTP_ADDR_SRC,  /* address can be used as source */
-- 
2.13.5



[PATCH net-next 00/10] Introduce SCTP Stream Schedulers

2017-09-28 Thread Marcelo Ricardo Leitner
This patchset introduces the SCTP Stream Schedulers as defined by
https://tools.ietf.org/html/draft-ietf-tsvwg-sctp-ndata-13

It provides 3 schedulers at the moment: FCFS, Priority and Round Robin.
The other 3, Round Robin per packet, Fair Capacity and Weighted Fair
Capacity will be added later. More specifically, WFQ is required by
WebRTC Datachannels.

The draft also defines the idata chunk, allowing a usermsg to be
interrupted by another piece of idata from another stream. This patchset
*doesn't* include it. It will be posted later by Xin Long.  Its
integration with this patchset is very simple and it basically only
requires a tweak in sctp_sched_dequeue_done(), to ignore datamsg
boundaries.

The first 5 patches are a preparation for the next ones. The most
relevant patches are the 4th and 6th ones. More details are available on
each patch.

Marcelo Ricardo Leitner (10):
  sctp: silence warns on sctp_stream_init allocations
  sctp: factor out stream->out allocation
  sctp: factor out stream->in allocation
  sctp: introduce struct sctp_stream_out_ext
  sctp: introduce sctp_chunk_stream_no
  sctp: introduce stream scheduler foundations
  sctp: add sockopt to get/set stream scheduler
  sctp: add sockopt to get/set stream scheduler parameters
  sctp: introduce priority based stream scheduler
  sctp: introduce round robin stream scheduler

 include/net/sctp/stream_sched.h |  72 +
 include/net/sctp/structs.h  |  63 +++-
 include/uapi/linux/sctp.h   |  16 ++
 net/sctp/Makefile   |   3 +-
 net/sctp/chunk.c|   6 +-
 net/sctp/outqueue.c |  63 
 net/sctp/sm_sideeffect.c|   3 +
 net/sctp/socket.c   | 179 -
 net/sctp/stream.c   | 196 +++
 net/sctp/stream_sched.c | 275 +++
 net/sctp/stream_sched_prio.c| 347 
 net/sctp/stream_sched_rr.c  | 201 +++
 12 files changed, 1347 insertions(+), 77 deletions(-)
 create mode 100644 include/net/sctp/stream_sched.h
 create mode 100644 net/sctp/stream_sched.c
 create mode 100644 net/sctp/stream_sched_prio.c
 create mode 100644 net/sctp/stream_sched_rr.c

-- 
2.13.5



[PATCH net-next 01/10] sctp: silence warns on sctp_stream_init allocations

2017-09-28 Thread Marcelo Ricardo Leitner
As SCTP supports up to 65535 streams, that can lead to very large
allocations in sctp_stream_init(). As Xin Long noticed, systems with
small amounts of memory are more likely to fail these allocations and
dump warnings to dmesg, triggered by ordinary user actions. Thus,
silence them.

Also, if the reallocation of stream->out is not necessary, skip it and
keep the memory we already have.

Reported-by: Xin Long 
Signed-off-by: Marcelo Ricardo Leitner 
---
 net/sctp/stream.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/sctp/stream.c b/net/sctp/stream.c
index 63ea1550371493ec8863627c7a43f46a22f4a4c9..1afa9555808390d5fc736727422d9700a3855613 100644
--- a/net/sctp/stream.c
+++ b/net/sctp/stream.c
@@ -40,9 +40,14 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
 {
int i;
 
+   gfp |= __GFP_NOWARN;
+
/* Initial stream->out size may be very big, so free it and alloc
-* a new one with new outcnt to save memory.
+* a new one with new outcnt to save memory if needed.
 */
+   if (outcnt == stream->outcnt)
+   goto in;
+
kfree(stream->out);
 
stream->out = kcalloc(outcnt, sizeof(*stream->out), gfp);
@@ -53,6 +58,7 @@ int sctp_stream_init(struct sctp_stream *stream, __u16 outcnt, __u16 incnt,
for (i = 0; i < stream->outcnt; i++)
stream->out[i].state = SCTP_STREAM_OPEN;
 
+in:
if (!incnt)
return 0;
 
-- 
2.13.5



Re: [PATCH 1/4] dt-bindings: net: ravb: Document optional reset-gpios property

2017-09-28 Thread Sergei Shtylyov

Hello!

On 09/28/2017 06:53 PM, Geert Uytterhoeven wrote:


The optional "reset-gpios" property (part of the generic MDIO bus
properties) lets us describe the GPIO used for resetting the Ethernet
PHY.

Signed-off-by: Geert Uytterhoeven 
---
  Documentation/devicetree/bindings/net/renesas,ravb.txt | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt
index c902261893b913f5..4a6ec1ba32d0bf16 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -52,6 +52,7 @@ Optional properties:
 AVB_LINK signal.
  - renesas,ether-link-active-low: boolean, specify when the AVB_LINK signal is
 active-low instead of normal active-high.
+- reset-gpios: see mdio.txt in the same directory.


   Sigh, I can only repeat that this was a terrible prop name choice -- when 
applied to a MAC node... what reset does it mean? MAC?


MBR, Sergei


Re: [PATCH net-next RFC 3/9] net: dsa: mv88e6xxx: add support for GPIO configuration

2017-09-28 Thread Vivien Didelot
Hi Brandon,

>> Would there be any value in implementing a proper gpiochip structure
>> here such that other pieces of SW can see this GPIO controller as a
>> provider and you can reference it from e.g: Device Tree using GPIO
>> descriptors?
>
> That would be my preference as well, or maybe a pinctrl driver.

Indeed seeing a gpio_chip or a pinctrl controller registered from a
gpio.c or pinctrl.c file in a separate patchset would be great.


Thanks,

Vivien


Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-28 Thread Andy Gospodarek
On Thu, Sep 28, 2017 at 1:59 AM, Waskiewicz Jr, Peter
 wrote:
> On 9/26/17 10:21 AM, Andy Gospodarek wrote:
>> On Mon, Sep 25, 2017 at 08:50:28PM +0200, Daniel Borkmann wrote:
>>> On 09/25/2017 08:10 PM, Andy Gospodarek wrote:
>>> [...]
 First, thanks for this detailed description.  It was helpful to read
 along with the patches.

 My only concern about this area being generic is that you are now in a
 state where any bpf program must know about all the bpf programs in the
 receive pipeline before it can properly parse what is stored in the
 meta-data and add it to an skb (or perform any other action).
 Especially if each program adds its own meta-data along the way.

 Maybe this isn't a big concern based on the number of users of this
 today, but it just starts to seem like a concern as there are these
 hints being passed between layers that are challenging to track due to a
 lack of a standard format for passing data between.
>>>
>>> Btw, we do have similar kind of programmable scratch buffer also today
>>> wrt skb cb[] that you can program from tc side, the perf ring buffer,
>>> which doesn't have any fixed layout for the slots, or a per-cpu map
>>> where you can transfer data between tail calls for example, then tail
>>> calls themselves that need to coordinate, or simply mangling of packets
>>> itself if you will, but more below to your use case ...
>>>
 The main reason I bring this up is that Michael and I had discussed and
 designed a way for drivers to communicate between each other that rx
 resources could be freed after a tx completion on an XDP_REDIRECT
 action.  Much like this code, it involved adding an new element to
 struct xdp_md that could point to the important information.  Now that
 there is a generic way to handle this, it would seem nice to be able to
 leverage it, but I'm not sure how reliable this meta-data area would be
 without the ability to mark it in some manner.

 For additional background, the minimum amount of data needed in the case
 Michael and I were discussing was really 2 words.  One to serve as a
 pointer to an rx_ring structure and one to have a counter to the rx
 producer entry.  This data could be accessed by the driver processing the
 tx completions and callback to the driver that received the frame off the
 wire to perform any needed processing.  (For those curious this would also
 require a new callback/netdev op to act on this data stored in the XDP buffer.)
>>>
>>> What you describe above doesn't seem to be fitting to the use-case of
>>> this set, meaning the area here is fully programmable out of the BPF
>>> program, the infrastructure you're describing is some sort of means of
>>> communication between drivers for the XDP_REDIRECT, and should be
>>> outside of the control of the BPF program to mangle.
>>
>> OK, I understand that perspective.  I think saying this is really meant
>> as a BPF<->BPF communication channel for now is fine.
>>
>>> You could probably reuse the base infra here and make a part of that
>>> inaccessible for the program with some sort of a fixed layout, but I
>>> haven't seen your code yet to be able to fully judge. Intention here
>>> is to allow for programmability within the BPF prog in a generic way,
>>> such that based on the use-case it can be populated in specific ways
>>> and propagated to the skb w/o having to define a fixed layout and
>>> bloat xdp_buff all the way to an skb while still retaining all the
>>> flexibility.
>>
>> Some level of reuse might be proper, but I'd rather it be explicit for
>> my use since it's not exclusively something that will need to be used by
>> a BPF prog, but rather the driver.  I'll produce some patches this week
>> for reference.
>
> Sorry for chiming in late, I've been offline.
>
> We're looking to add some functionality from driver to XDP inside this
> xdp_buff->data_meta region.  We want to assign it to an opaque
> structure, that would be specific per driver (think of a flex descriptor
> coming out of the hardware).  We'd like to pass these offloaded
> computations into XDP programs to help accelerate them, such as packet
> type, where headers are located, etc.  It's similar to Jesper's RFC
> patches back in May when passing through the mlx Rx descriptor to XDP.
>
> This is actually what a few of us are planning to present at NetDev 2.2
> in November.  If you're hoping to restrict this headroom in the xdp_buff
> for an exclusive use case with XDP_REDIRECT, then I'd like to discuss
> that further.
>

No sweat, PJ, thanks for replying.  I saw the notes for your accepted
session and I'm looking forward to it.

John's suggestion earlier in the thread was actually similar to the
conclusion I reached when thinking about Daniel's patch a bit more.
(I like John's better though as it doesn't get constrained by UAPI.)
Since redirect actions happen at a point where no other prog

Re: [PATCH/RFC net-next] ravb: RX checksum offload

2017-09-28 Thread Sergei Shtylyov

Hello!

On 09/28/2017 01:49 PM, Simon Horman wrote:


Add support for RX checksum offload. This is enabled by default and
may be disabled and re-enabled using ethtool:

  # ethtool -K eth0 rx off
  # ethtool -K eth0 rx on

The RAVB provides a simple checksumming scheme which appears to be
completely compatible with CHECKSUM_COMPLETE: a 1's complement sum of


Hm, the gen2/3 manuals say calculation doesn't involve bit inversion...


Yes, I believe that matches my observation of the values supplied by
the hardware. Empirically this appears to be what the kernel expects.


   Then why do you talk of 1's complement here?


all packet data after the L2 header is appended to packet data; this may
be trivially read by the driver and used to update the skb accordingly.

In terms of performance, throughput is close to gigabit line-rate both with
and without RX checksum offload enabled. Perf output, however, appears to
indicate that significantly less time is spent in do_csum(). This is as
expected.


[...]


By inspection this also appears to be compatible with the ravb found
on R-Car Gen 2 SoCs, however, this patch is currently untested on such
hardware.


I probably won't be able to test it on gen2 too...


Signed-off-by: Simon Horman 


I'm generally OK with the patch but have some questions/comments below...


Thanks, I will try to address them.


---
  drivers/net/ethernet/renesas/ravb_main.c | 58 +++-
  1 file changed, 57 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
index fdf30bfa403b..7c6438cd7de7 100644
--- a/drivers/net/ethernet/renesas/ravb_main.c
+++ b/drivers/net/ethernet/renesas/ravb_main.c

[...]

@@ -1842,6 +1859,41 @@ static int ravb_do_ioctl(struct net_device *ndev, struct ifreq *req, int cmd)
return phy_mii_ioctl(phydev, req, cmd);
  }
+static void ravb_set_rx_csum(struct net_device *ndev, bool enable)
+{
+   struct ravb_private *priv = netdev_priv(ndev);
+   unsigned long flags;
+
+   spin_lock_irqsave(&priv->lock, flags);
+
+   /* Disable TX and RX */
+   ravb_rcv_snd_disable(ndev);
+
+   /* Modify RX Checksum setting */
+   if (enable)
+   ravb_modify(ndev, ECMR, 0, ECMR_RCSC);


Please use ECMR_RCSC as the 3rd argument too, to conform to the common driver
style.


+   else
+   ravb_modify(ndev, ECMR, ECMR_RCSC, 0);


This *if* can easily be folded into a single ravb_modify() call...


Thanks, something like this?

ravb_modify(ndev, ECMR, ECMR_RCSC, enable ? ECMR_RCSC : 0);


   Yes, exactly! :-)


[...]

@@ -2004,6 +2057,9 @@ static int ravb_probe(struct platform_device *pdev)
if (!ndev)
return -ENOMEM;
+   ndev->features |= NETIF_F_RXCSUM;
+   ndev->hw_features |= ndev->features;


Hum, both fields are 0 before this? Then why not use '=' instead of '|='?
Even if not, why not just use the same value as both the rvalues?


I don't feel strongly about this, how about?

ndev->features = NETIF_F_RXCSUM;
ndev->hw_features = NETIF_F_RXCSUM;


   Yes, I think it should work...

MBR, Sergei


Re: [PATCH RFC 3/5] Add KSZ8795 switch driver

2017-09-28 Thread Andrew Lunn
On Mon, Sep 18, 2017 at 08:27:13PM +, tristram...@microchip.com wrote:
> > > +/**
> > > + * Some counters do not need to be read too often because they are less
> > likely
> > > + * to increase much.
> > > + */
> > 
> > What does comment mean? Are you caching statistics, and updating
> > different values at different rates?
> > 
> 
> There are 34 counters.  In normal case using generic bus I/O or PCI to read
> them is very quick, but the switch is mostly accessed using SPI, or even
> I2C.  As the SPI access is very slow.

How slow is it? The Marvell switches all use MDIO. It is probably a
bit faster than I2C, but it is a lot slower than MMIO or PCI.

ethtool -S lan0 takes about 25ms.

No other driver does caching. So I'm hesitant to add one which does.

>  These accesses can be getting 1588 PTP timestamps and opening/closing ports.

You could drop the mutex between each statistic read, so allowing
something else access to the switch. That should reduce the jitter PTP
experiences.

Andrew


Re: [PATCH net-next RFC 5/9] net: dsa: forward hardware timestamping ioctls to switch driver

2017-09-28 Thread Vivien Didelot
Hi Brandon,

Brandon Streiff  writes:

>  static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>  {
> + struct dsa_slave_priv *p = netdev_priv(dev);
> + struct dsa_switch *ds = p->dp->ds;
> + int port = p->dp->index;
> +
>   if (!dev->phydev)
>   return -ENODEV;

Move this check below:

>  
> - return phy_mii_ioctl(dev->phydev, ifr, cmd);
> + switch (cmd) {
> + case SIOCGMIIPHY:
> + case SIOCGMIIREG:
> + case SIOCSMIIREG:
> + if (dev->phydev)
> + return phy_mii_ioctl(dev->phydev, ifr, cmd);
> + else
> + return -EOPNOTSUPP;

if (!dev->phydev)
return -ENODEV;

return phy_mii_ioctl(dev->phydev, ifr, cmd);

> + case SIOCGHWTSTAMP:
> + if (ds->ops->port_hwtstamp_get)
> + return ds->ops->port_hwtstamp_get(ds, port, ifr);
> + else
> + return -EOPNOTSUPP;

Here you can replace the else statement with break;

> + case SIOCSHWTSTAMP:
> + if (ds->ops->port_hwtstamp_set)
> + return ds->ops->port_hwtstamp_set(ds, port, ifr);
> + else
> + return -EOPNOTSUPP;

Same here;

> + default:
> + return -EOPNOTSUPP;
> + }

Then drop the default case and return -EOPNOTSUPP after the switch.

>  }


Thanks,

Vivien


Re: [RFC PATCH v3 7/7] i40e: Enable cloud filters via tc-flower

2017-09-28 Thread Nambiar, Amritha
On 9/14/2017 1:00 AM, Nambiar, Amritha wrote:
> On 9/13/2017 6:26 AM, Jiri Pirko wrote:
>> Wed, Sep 13, 2017 at 11:59:50AM CEST, amritha.namb...@intel.com wrote:
>>> This patch enables tc-flower based hardware offloads. tc flower
>>> filter provided by the kernel is configured as driver specific
>>> cloud filter. The patch implements functions and admin queue
>>> commands needed to support cloud filters in the driver and
>>> adds cloud filters to configure these tc-flower filters.
>>>
>>> The only action supported is to redirect packets to a traffic class
>>> on the same device.
>>
>> So basically you are not doing redirect, you are just setting tclass for
>> matched packets, right? Why you use mirred for this? I think that
>> you might consider extending g_act for that:
>>
>> # tc filter add dev eth0 protocol ip ingress \
>>   prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw \
>>   action tclass 0
>>
> Yes, this doesn't work like a typical egress redirect, but is aimed at
> forwarding the matched packets to a different queue-group/traffic class
> on the same device, so some sort-of ingress redirect in the hardware. I
> possibly may not need the mirred-redirect as you say, I'll look into the
> g_act way of doing this with a new gact tc action.
> 

I was looking at introducing a new gact tclass action to TC. In the HW
offload path, this sets a traffic class value for certain matched
packets so they will be processed in a queue belonging to the traffic class.

# tc filter add dev eth0 protocol ip parent :\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw\
  action tclass 2

But, I'm having trouble defining what this action means in the kernel
datapath. For ingress, this action could just take the default path and
do nothing and only have meaning in the HW offloaded path. For egress,
certain qdiscs like 'multiq' and 'prio' could use this 'tclass' value
for band selection, while the 'mqprio' qdisc selects the traffic class
based on the skb priority in netdev_pick_tx(), so what would this action
mean for the 'mqprio' qdisc?

It looks like the 'prio' qdisc uses band selection based on the
'classid', so I was thinking of using the 'classid' through the cls
flower filter and offload it to HW for the traffic class index, this way
we would have the same behavior in HW offload and SW fallback and there
would be no need for a separate tc action.

In HW:
# tc filter add dev eth0 protocol ip parent :\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_sw classid 1:2\

filter pref 2 flower chain 0
filter pref 2 flower chain 0 handle 0x1 classid 1:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.3.5
  dst_port 25
  skip_sw
  in_hw

This will be used to route packets to traffic class 2.

In SW:
# tc filter add dev eth0 protocol ip parent :\
  prio 2 flower dst_ip 192.168.3.5/32\
  ip_proto udp dst_port 25 skip_hw classid 1:2

filter pref 2 flower chain 0
filter pref 2 flower chain 0 handle 0x1 classid 1:2
  eth_type ipv4
  ip_proto udp
  dst_ip 192.168.3.5
  dst_port 25
  skip_hw
  not_in_hw

>>
>>>
>>> # tc qdisc add dev eth0 ingress
>>> # ethtool -K eth0 hw-tc-offload on
>>>
>>> # tc filter add dev eth0 protocol ip parent :\
>>>  prio 1 flower dst_mac 3c:fd:fe:a0:d6:70 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 0
>>>
>>> # tc filter add dev eth0 protocol ip parent :\
>>>  prio 2 flower dst_ip 192.168.3.5/32\
>>>  ip_proto udp dst_port 25 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 1
>>>
>>> # tc filter add dev eth0 protocol ipv6 parent :\
>>>  prio 3 flower dst_ip fe8::200:1\
>>>  ip_proto udp dst_port 66 skip_sw\
>>>  action mirred ingress redirect dev eth0 tclass 1
>>>
>>> Delete tc flower filter:
>>> Example:
>>>
>>> # tc filter del dev eth0 parent : prio 3 handle 0x1 flower
>>> # tc filter del dev eth0 parent :
>>>
>>> Flow Director Sideband is disabled while configuring cloud filters
>>> via tc-flower and until any cloud filter exists.
>>>
>>> Unsupported matches when cloud filters are added using enhanced
>>> big buffer cloud filter mode of underlying switch include:
>>> 1. source port and source IP
>>> 2. Combined MAC address and IP fields.
>>> 3. Not specifying L4 port
>>>
>>> These filter matches can however be used to redirect traffic to
>>> the main VSI (tc 0) which does not require the enhanced big buffer
>>> cloud filter support.
>>>
>>> v3: Cleaned up some lengthy function names. Changed ipv6 address to
>>> __be32 array instead of u8 array. Used macro for IP version. Minor
>>> formatting changes.
>>> v2:
>>> 1. Moved I40E_SWITCH_MODE_MASK definition to i40e_type.h
>>> 2. Moved dev_info for add/deleting cloud filters in else condition
>>> 3. Fixed some format specifier in dev_err logs
>>> 4. Refactored i40e_get_capabilities to take an additional
>>>   list_type parameter and use it to query device and function
>>>   level capabilities.
>>> 5. Fixed parsing tc redirect action to check for the is_tcf

Re: [PATCH 2/4] ravb: Add optional PHY reset during system resume

2017-09-28 Thread Florian Fainelli
On 09/28/2017 11:45 AM, Geert Uytterhoeven wrote:
> Hi Florian,
> 
> On Thu, Sep 28, 2017 at 7:22 PM, Florian Fainelli  
> wrote:
>> On 09/28/2017 08:53 AM, Geert Uytterhoeven wrote:
>>> If the optional "reset-gpios" property is specified in DT, the generic
>>> MDIO bus code takes care of resetting the PHY during device probe.
>>> However, the PHY may still have to be reset explicitly after system
>>> resume.
>>>
>>> This allows to restore Ethernet operation after resume from s2ram on
>>> Salvator-XS, where the enable pin of the regulator providing PHY power
>>> is connected to PRESETn, and PSCI suspend powers down the SoC.
>>>
>>> Signed-off-by: Geert Uytterhoeven 
>>> ---
>>>  drivers/net/ethernet/renesas/ravb_main.c | 9 +
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>>> index fdf30bfa403bf416..96d1d48e302f8c9a 100644
>>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>>> @@ -19,6 +19,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include 
>>>  #include 
>>> @@ -2268,6 +2269,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>>  {
>>>   struct net_device *ndev = dev_get_drvdata(dev);
>>>   struct ravb_private *priv = netdev_priv(ndev);
>>> + struct mii_bus *bus = priv->mii_bus;
>>>   int ret = 0;
>>>
>>>   if (priv->wol_enabled) {
>>> @@ -2302,6 +2304,13 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>>* reopen device if it was running before system suspended.
>>>*/
>>>
>>> + /* PHY reset */
>>> + if (bus->reset_gpiod) {
>>> + gpiod_set_value_cansleep(bus->reset_gpiod, 1);
>>> + udelay(bus->reset_delay_us);
>>> + gpiod_set_value_cansleep(bus->reset_gpiod, 0);
>>> + }
>>
>> This is a clever hack, but unfortunately this is also misusing the MDIO
>> bus reset line as a PHY reset line. As commented in patch 3, if this
>> reset line is tied to the PHY, then this should be a PHY property and
> 
> OK.
> 
>> you cannot (ab)use the MDIO bus GPIO reset logic anymore...
> 
> And then I should add reset-gpios support to drivers/net/phy/micrel.c?
> Or is there already generic code to handle per-PHY reset? I couldn't find it.

There is not such a thing unfortunately, but it would presumably be
called within drivers/net/phy/mdio_bus.c during bus->reset() time
because you need the PHY reset to be deasserted before you can
successfully read/write from the PHY, and if you can't read/write from
the PHY, the MDIO bus layer cannot read the PHY ID, and therefore cannot
match a PHY device with its driver, so things don't work.

NB: you could move this entirely to the Micrel PHY driver if you specify
a compatible string that has a the PHY OUI in it, because that bypasses
the need to match the PHY driver with the PHY device, but this may not
be an acceptable solution for non-DT platforms or other platforms where
the PHY can't be determined based on the board DTS.

I was going to suggest writing some sort of generic helper that walks
the list of child nodes from a MDIO bus device node and deassert reset
lines and enables clocks, but there is absolutely nothing generic about
that. Things like which of the reset should come first, and if there are
multiple, in which order, etc.

> 
>> Should not you also try to manage this reset line during ravb_open() to
>> achieve better power savings?
> 
> I don't know. The Micrel KSZ9031RNXVA datasheet doesn't mention if it's
> safe or not to assert reset for a prolonged time.
> 
> Thanks!
> 
> Gr{oetje,eeting}s,
> 
> Geert
> 
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> ge...@linux-m68k.org
> 
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like 
> that.
> -- Linus Torvalds
> 


-- 
Florian


Re: [PATCH RFC 3/5] Add KSZ8795 switch driver

2017-09-28 Thread Florian Fainelli
On 09/28/2017 11:40 AM, Pavel Machek wrote:
> Hi!
> 
> On Mon 2017-09-18 20:27:13, tristram...@microchip.com wrote:
 +/**
 + * Some counters do not need to be read too often because they are less
>>> likely
 + * to increase much.
 + */
>>>
>>> What does comment mean? Are you caching statistics, and updating
>>> different values at different rates?
>>>
>>
>> There are 34 counters.  In normal case using generic bus I/O or PCI to read
>> them is very quick, but the switch is mostly accessed using SPI, or even
>> I2C.  As the SPI access is very slow and cannot run in interrupt context I
>> keep worrying reading the MIB counters in a loop for 5 or more ports will
>> prevent other critical hardware access from executing soon enough.  These
>> accesses can be getting 1588 PTP timestamps and opening/closing ports.
>> (RSTP Conformance Test sends test traffic to port supposed to be
>> closed/opened after receiving specific RSTP BPDU.)
> 
> Hmm. Ok, interesting.
> 
> I wonder how well this is going to work if userspace actively 'does
> something' with the switch.
> 
> It seems to me that even if your statistics code is careful not to do
> 'a lot' of accesses at the same time, userspace can use other parts of
> the driver to do the same, and thus cause same unwanted effects...

A few switches have a MIB snapshot feature that is implemented such that
accessing the snapshot does not hog the remainder of the switch
registers, is this something possible on KSZ switches?

Tangential: net-next is currently open, so now would be a good time to
send a revised version of your patch series to target possibly 4.15 with
an initial implementation. Please fix the cover-letter and patch
threading such that they look like the following:

[PATCH 0/X]
   [PATCH 1/X]
   [PATCH 2/X]
   etc..

Right now this shows up as separate emails/patches and this is very
annoying to follow as a thread.

Thank you
-- 
Florian


Re: [PATCH 2/4] ravb: Add optional PHY reset during system resume

2017-09-28 Thread Geert Uytterhoeven
Hi Florian,

On Thu, Sep 28, 2017 at 7:22 PM, Florian Fainelli  wrote:
> On 09/28/2017 08:53 AM, Geert Uytterhoeven wrote:
>> If the optional "reset-gpios" property is specified in DT, the generic
>> MDIO bus code takes care of resetting the PHY during device probe.
>> However, the PHY may still have to be reset explicitly after system
>> resume.
>>
>> This allows to restore Ethernet operation after resume from s2ram on
>> Salvator-XS, where the enable pin of the regulator providing PHY power
>> is connected to PRESETn, and PSCI suspend powers down the SoC.
>>
>> Signed-off-by: Geert Uytterhoeven 
>> ---
>>  drivers/net/ethernet/renesas/ravb_main.c | 9 +
>>  1 file changed, 9 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/renesas/ravb_main.c b/drivers/net/ethernet/renesas/ravb_main.c
>> index fdf30bfa403bf416..96d1d48e302f8c9a 100644
>> --- a/drivers/net/ethernet/renesas/ravb_main.c
>> +++ b/drivers/net/ethernet/renesas/ravb_main.c
>> @@ -19,6 +19,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -2268,6 +2269,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>  {
>>   struct net_device *ndev = dev_get_drvdata(dev);
>>   struct ravb_private *priv = netdev_priv(ndev);
>> + struct mii_bus *bus = priv->mii_bus;
>>   int ret = 0;
>>
>>   if (priv->wol_enabled) {
>> @@ -2302,6 +2304,13 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>* reopen device if it was running before system suspended.
>>*/
>>
>> + /* PHY reset */
>> + if (bus->reset_gpiod) {
>> + gpiod_set_value_cansleep(bus->reset_gpiod, 1);
>> + udelay(bus->reset_delay_us);
>> + gpiod_set_value_cansleep(bus->reset_gpiod, 0);
>> + }
>
> This is a clever hack, but unfortunately this is also misusing the MDIO
> bus reset line as a PHY reset line. As commented in patch 3, if this
> reset line is tied to the PHY, then this should be a PHY property and

OK.

> you cannot (ab)use the MDIO bus GPIO reset logic anymore...

And then I should add reset-gpios support to drivers/net/phy/micrel.c?
Or is there already generic code to handle per-PHY reset? I couldn't find it.

> Should not you also try to manage this reset line during ravb_open() to
> achieve better power savings?

I don't know. The Micrel KSZ9031RNXVA datasheet doesn't mention if it's
safe or not to assert reset for a prolonged time.

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH RFC 3/5] Add KSZ8795 switch driver

2017-09-28 Thread Pavel Machek
Hi!

On Mon 2017-09-18 20:27:13, tristram...@microchip.com wrote:
> > > +/**
> > > + * Some counters do not need to be read too often because they are less
> > likely
> > > + * to increase much.
> > > + */
> > 
> > What does comment mean? Are you caching statistics, and updating
> > different values at different rates?
> > 
> 
> There are 34 counters.  In normal case using generic bus I/O or PCI to read
> them is very quick, but the switch is mostly accessed using SPI, or even
> I2C.  As the SPI access is very slow and cannot run in interrupt context I
> keep worrying reading the MIB counters in a loop for 5 or more ports will
> prevent other critical hardware access from executing soon enough.  These
> accesses can be getting 1588 PTP timestamps and opening/closing ports.
> (RSTP Conformance Test sends test traffic to port supposed to be
> closed/opened after receiving specific RSTP BPDU.)

Hmm. Ok, interesting.

I wonder how well this is going to work if userspace actively 'does
something' with the switch.

It seems to me that even if your statistics code is careful not to do
'a lot' of accesses at the same time, userspace can use other parts of
the driver to do the same, and thus cause same unwanted effects...
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




[PATCH V4] r8152: add Linksys USB3GIGV1 id

2017-09-28 Thread Grant Grundler
This linksys dongle by default comes up in cdc_ether mode.
This patch allows r8152 to claim the device:
   Bus 002 Device 002: ID 13b1:0041 Linksys

Signed-off-by: Grant Grundler 
---
 drivers/net/usb/cdc_ether.c | 10 ++
 drivers/net/usb/r8152.c |  2 ++
 2 files changed, 12 insertions(+)

V4: use IS_ENABLED() to check CONFIG_USB_RTL8152 is m or y.
(verified by adding #error to the new code and trying to compile.
 Thanks Doug for the tip!)
Add LINKSYS vendor #define in same order for both drivers.

V3: for backwards compat, add #ifdef CONFIG_USB_RTL8152 around
the cdc_ether blacklist entry so the cdc_ether driver can
still claim the device if r8152 driver isn't configured.

V2: add LINKSYS_VENDOR_ID to cdc_ether blacklist



diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 8ab281b478f2..677a85360db1 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -547,6 +547,7 @@ static const struct driver_info wwan_info = {
 #define REALTEK_VENDOR_ID  0x0bda
 #define SAMSUNG_VENDOR_ID  0x04e8
 #define LENOVO_VENDOR_ID   0x17ef
+#define LINKSYS_VENDOR_ID  0x13b1
 #define NVIDIA_VENDOR_ID   0x0955
 #define HP_VENDOR_ID   0x03f0
 #define MICROSOFT_VENDOR_ID0x045e
@@ -737,6 +738,15 @@ static const struct usb_device_id  products[] = {
.driver_info = 0,
 },
 
+#if IS_ENABLED(CONFIG_USB_RTL8152)
+/* Linksys USB3GIGV1 Ethernet Adapter */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(LINKSYS_VENDOR_ID, 0x0041, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+#endif
+
 /* ThinkPad USB-C Dock (based on Realtek RTL8153) */
 {
USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x3062, USB_CLASS_COMM,
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index ceb78e2ea4f0..941ece08ba78 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -613,6 +613,7 @@ enum rtl8152_flags {
 #define VENDOR_ID_MICROSOFT0x045e
 #define VENDOR_ID_SAMSUNG  0x04e8
 #define VENDOR_ID_LENOVO   0x17ef
+#define VENDOR_ID_LINKSYS  0x13b1
 #define VENDOR_ID_NVIDIA   0x0955
 
 #define MCU_TYPE_PLA   0x0100
@@ -5316,6 +5317,7 @@ static const struct usb_device_id rtl8152_table[] = {
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7205)},
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x720c)},
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7214)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LINKSYS, 0x0041)},
{REALTEK_USB_DEVICE(VENDOR_ID_NVIDIA,  0x09ff)},
{}
 };
-- 
2.14.2.822.g60be5d43e6-goog



Re: [PATCH net-next v9] openvswitch: enable NSH support

2017-09-28 Thread Pravin Shelar
On Tue, Sep 26, 2017 at 6:39 PM, Yang, Yi  wrote:
> On Tue, Sep 26, 2017 at 06:49:14PM +0800, Jiri Benc wrote:
>> On Tue, 26 Sep 2017 12:55:39 +0800, Yang, Yi wrote:
>> > After push_nsh, the packet won't be recirculated to flow pipeline, so
>> > key->eth.type must be set explicitly here, but for pop_nsh, the packet
>> > will be recirculated to flow pipeline, it will be reparsed, so
>> > key->eth.type will be set in packet parse function, we needn't handle it
>> > in pop_nsh.
>>
>> This seems to be a very different approach than what we currently have.
>> Looking at the code, the requirement after "destructive" actions such
>> as pushing or popping headers is to recirculate.
>
> This is an optimization proposed by Jan Scheurich; recirculating after push_nsh
> will impact performance, while recirculating after pop_nsh is unavoidable, so
> also cc jan.scheur...@ericsson.com.
>
> Actually all the keys before push_nsh are still there after push_nsh, and
> push_nsh has updated all the nsh keys, so recirculating remains avoidable.
>


We should keep the existing model for this patch. Later you can submit
an optimization patch with specific use cases and performance
improvements, so that we can evaluate code complexity and benefits.

>>
>> Setting key->eth.type to satisfy conditions in the output path without
>> updating the rest of the key looks very hacky and fragile to me. There
>> might be other conditions and dependencies that are not obvious.
>> I don't think the code was written with such code path in mind.
>>
>> I'd like to hear what Pravin thinks about this.
>>
>>  Jiri


[PATCH net-next] Revert "net: dsa: bcm_sf2: Defer port enabling to calling port_enable"

2017-09-28 Thread Florian Fainelli
This reverts commit e85ec74ace29 ("net: dsa: bcm_sf2: Defer port
enabling to calling port_enable") because it causes an unbind followed
by a bind to fail when connecting to the integrated PHY.

What this patch missed is that we need the PHY to be enabled with
bcm_sf2_gphy_enable_set() before probing it on the MDIO bus. This is
correctly done in the ops->setup() function, but by the time
ops->port_enable() runs, this is too late. Upon unbind we would power
down the PHY, and so when we would bind again, the PHY would be left
powered off.

Fixes: e85ec74ace29 ("net: dsa: bcm_sf2: Defer port enabling to calling 
port_enable")
Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/bcm_sf2.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2.c b/drivers/net/dsa/bcm_sf2.c
index 898d5642b516..7aecc98d0a18 100644
--- a/drivers/net/dsa/bcm_sf2.c
+++ b/drivers/net/dsa/bcm_sf2.c
@@ -754,11 +754,14 @@ static int bcm_sf2_sw_setup(struct dsa_switch *ds)
struct bcm_sf2_priv *priv = bcm_sf2_to_priv(ds);
unsigned int port;
 
-   /* Disable unused ports and configure IMP port */
+   /* Enable all valid ports and disable those unused */
for (port = 0; port < priv->hw_params.num_ports; port++) {
-   if (dsa_is_cpu_port(ds, port))
+   /* IMP port receives special treatment */
+   if ((1 << port) & ds->enabled_port_mask)
+   bcm_sf2_port_setup(ds, port, NULL);
+   else if (dsa_is_cpu_port(ds, port))
bcm_sf2_imp_setup(ds, port);
-   else if (!((1 << port) & ds->enabled_port_mask))
+   else
bcm_sf2_port_disable(ds, port, NULL);
}
 
-- 
2.9.3



Re: [PATCH] Add a driver for Renesas uPD60620 and uPD60620A PHYs

2017-09-28 Thread Bernd Edlinger
On 09/22/17 19:59, Andrew Lunn wrote:
> On Fri, Sep 22, 2017 at 05:08:45PM +, Bernd Edlinger wrote:
>>
>> +config RENESAS_PHY
>> +tristate "Driver for Renesas PHYs"
>> +---help---
>> +  Supports the uPD60620 and uPD60620A PHYs.
>> +
> 
> Hi Bernd
> 
> Please call this "Renesas PHYs" and place it in alphabetical order.
> 

Done.

>> +
>> +/* Extended Registers and values */
>> +/* PHY Special Control/Status*/
>> +#define PHY_PHYSCR 0x1F  /* PHY.31 */
>> +#define PHY_PHYSCR_10MB0x0004/* PHY speed = 10mb */
>> +#define PHY_PHYSCR_100MB   0x0008/* PHY speed = 100mb */
>> +#define PHY_PHYSCR_DUPLEX  0x0010/* PHY Duplex */
>> +#define PHY_PHYSCR_RSVD5   0x0020/* Reserved Bit 5 */
>> +#define PHY_PHYSCR_MIIMOD  0x0040/* Enable 4B5B MII mode */
> 
> Are any of these comments actually useful. It seems like the defines
> are pretty obvious.
> 
>> +#define PHY_PHYSCR_RSVD7   0x0080/* Reserved Bit 7 */
>> +#define PHY_PHYSCR_RSVD8   0x0100/* Reserved Bit 8 */
>> +#define PHY_PHYSCR_RSVD9   0x0200/* Reserved Bit 9 */
>> +#define PHY_PHYSCR_RSVD10  0x0400/* Reserved Bit 10 */
>> +#define PHY_PHYSCR_RSVD11  0x0800/* Reserved Bit 11 */
>> +#define PHY_PHYSCR_ANDONE  0x1000/* Auto negotiation done */
>> +#define PHY_PHYSCR_RSVD13  0x2000/* Reserved Bit 13 */
>> +#define PHY_PHYSCR_RSVD14  0x4000/* Reserved Bit 14 */
>> +#define PHY_PHYSCR_RSVD15  0x8000/* Reserved Bit 15 */
> 
> It looks like the only register you use is SCR and SPM. Maybe delete
> all the rest? Or do you plan to add more features making use of these
> registers?
> 

No, I removed all unused defines for now.

>> +phydev->link = 0;
>> +phydev->lp_advertising = 0;
>> +phydev->pause = 0;
>> +phydev->asym_pause = 0;
>> +
>> +if (phy_state & BMSR_ANEGCOMPLETE) {
> 
> It is worth comparing this against genphy_read_status() which is the
> reference implementation. You would normally check if auto negotiation
> is enabled, not if it has completed. If it is enabled you read the
> current negotiated state, even if it is not completed.
> 

Do you suggest that there are cases where auto negotiation does not
reach completion, and still provides a usable link status?

I have tried to connect to link partners with fixed configuration,
but even then the auto negotiation always completes normally.
 

>> +phy_state = phy_read(phydev, PHY_PHYSCR);
>> +if (phy_state < 0)
>> +return phy_state;
>> +
>> +if (phy_state & (PHY_PHYSCR_10MB | PHY_PHYSCR_100MB)) {
>> +phydev->link = 1;
>> +phydev->speed = SPEED_10;
>> +phydev->duplex = DUPLEX_HALF;
>> +
>> +if (phy_state & PHY_PHYSCR_100MB)
>> +phydev->speed = SPEED_100;
>> +if (phy_state & PHY_PHYSCR_DUPLEX)
>> +phydev->duplex = DUPLEX_FULL;
>> +
>> +phy_state = phy_read(phydev, MII_LPA);
>> +if (phy_state < 0)
>> +return phy_state;
>> +
>> +phydev->lp_advertising
>> += mii_lpa_to_ethtool_lpa_t(phy_state);
>> +
>> +if (phydev->duplex == DUPLEX_FULL) {
>> +if (phy_state & LPA_PAUSE_CAP)
>> +phydev->pause = 1;
>> +if (phy_state & LPA_PAUSE_ASYM)
>> +phydev->asym_pause = 1;
>> +}
>> +}
>> +} else if (phy_state & BMSR_LSTATUS) {
> 
> The else clause is then for a fixed configuration. Since all you are
> looking at is BMCR, you can probably just cut/paste from
> genphy_read_status().
> 

I think I can fold the fixed speed case into the auto negotiation case:
the PHYSCR always has the correct values for fixed settings.
I was initially unsure if I should look at it while autonegotiation is
not complete, but as you pointed out, that is the generally accepted
practice.


Thanks
Bernd.


From 2e101aed8466b314251972d1eaccfb43cf177078 Mon Sep 17 00:00:00 2001
From: Bernd Edlinger 
Date: Thu, 21 Sep 2017 15:46:16 +0200
Subject: [PATCH 2/5] Add a driver for Renesas uPD60620 and uPD60620A PHYs.

Signed-off-by: Bernd Edlinger 
---
 drivers/net/phy/Kconfig|   5 +++
 drivers/net/phy/Makefile   |   1 +
 drivers/net/phy/uPD60620.c | 109 +
 3 files changed, 115 insertions(+)
 create mode 100644 drivers/net/phy/uPD60620.c

diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index a9d16a3..f67943b 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -366,6 +366,11 @@ config REALTEK_PHY
---help---
  Supports the Realtek 821x PHY.
 
+config RENESAS_PHY
+   tristate "Driver for Renesas PHYs"
+   ---help---
+ Supports the Renesas PHYs uPD60620

[PATCH v2] lib: fix multiple strlcpy definition

2017-09-28 Thread Baruch Siach
Some C libraries, like uClibc and musl, provide BSD compatible
strlcpy(). Add check_strlcpy() to configure, and avoid defining strlcpy
and strlcat when the C library provides them.

This fixes the following static link error with uClibc-ng:

.../sysroot/usr/lib/libc.a(strlcpy.os): In function `strlcpy':
strlcpy.c:(.text+0x0): multiple definition of `strlcpy'
../lib/libutil.a(utils.o):utils.c:(.text+0x1ddc): first defined here
collect2: error: ld returned 1 exit status

Acked-by: Phil Sutter 
Signed-off-by: Baruch Siach 
---
v2: Fix the order of strlcpy parameters
---
 configure| 24 
 lib/Makefile |  4 
 lib/utils.c  |  2 ++
 3 files changed, 30 insertions(+)

diff --git a/configure b/configure
index 7be8fb113cc9..e0982f34a992 100755
--- a/configure
+++ b/configure
@@ -326,6 +326,27 @@ EOF
 rm -f $TMPDIR/dbtest.c $TMPDIR/dbtest
 }
 
+check_strlcpy()
+{
+cat >$TMPDIR/strtest.c <<EOF
+#include <string.h>
+int main(int argc, char **argv) {
+   char dst[10];
+   strlcpy(dst, "test", sizeof(dst));
+   return 0;
+}
+EOF
+$CC -I$INCLUDE -o $TMPDIR/strtest $TMPDIR/strtest.c >/dev/null 2>&1
+if [ $? -eq 0 ]
+then
+   echo "no"
+else
+   echo "NEED_STRLCPY:=y" >>$CONFIG
+   echo "yes"
+fi
+rm -f $TMPDIR/strtest.c $TMPDIR/strtest
+}
+
 quiet_config()
 {
	cat <<EOF

Re: [PATCH net-next RFC 3/9] net: dsa: mv88e6xxx: add support for GPIO configuration

2017-09-28 Thread Andrew Lunn
On Thu, Sep 28, 2017 at 10:45:03AM -0700, Florian Fainelli wrote:
> On 09/28/2017 08:25 AM, Brandon Streiff wrote:
> > The Scratch/Misc register is a windowed interface that provides access
> > to the GPIO configuration. Provide a new method for configuration of
> > GPIO functions.
> > 
> > Signed-off-by: Brandon Streiff 
> > ---
> 
> > +/* Offset 0x1A: Scratch and Misc. Register */
> > +static int mv88e6xxx_g2_scratch_reg_read(struct mv88e6xxx_chip *chip,
> > +int reg, u8 *data)
> > +{
> > +   int err;
> > +   u16 value;
> > +
> > +   err = mv88e6xxx_g2_write(chip, MV88E6XXX_G2_SCRATCH_MISC_MISC,
> > +reg << 8);
> > +   if (err)
> > +   return err;
> > +
> > +   err = mv88e6xxx_g2_read(chip, MV88E6XXX_G2_SCRATCH_MISC_MISC, &value);
> > +   if (err)
> > +   return err;
> > +
> > +   *data = (value & MV88E6XXX_G2_SCRATCH_MISC_DATA_MASK);
> > +
> > +   return 0;
> > +}
> 
> With the write and read acquiring and then releasing the lock
> immediately, is there no room for this sequence to be interrupted in the
> middle and end up returning inconsistent reads?

Hi Florian

The general pattern in this code is that the lock chip->reg_lock is
taken at a higher level. That protects against other threads. The
driver tends to do that at the highest levels, at the entry points
into the driver. I've not yet checked that this code follows the
pattern. However, we have a check in the low level to ensure the lock has
been taken. So it seems likely the lock is held.
 
> Would there be any value in implementing a proper gpiochip structure
> here such that other pieces of SW can see this GPIO controller as a
> provider and you can reference it from e.g: Device Tree using GPIO
> descriptors?

That would be my preference as well, or maybe a pinctrl driver.

 Andrew


Re: [patch net-next 3/7] ipv4: ipmr: Don't forward packets already forwarded by hardware

2017-09-28 Thread Florian Fainelli
On 09/28/2017 10:34 AM, Jiri Pirko wrote:
> From: Yotam Gigi 
> 
> Change the ipmr module to not forward packets if:
>  - The packet is marked with the offload_mr_fwd_mark, and
>  - Both input interface and output interface share the same parent ID.
> 
> This way, a packet can go through partial multicast forwarding in the
> hardware, where it will be forwarded only to the devices that share the
> same parent ID (AKA, reside inside the same hardware). The kernel will
> forward the packet to all other interfaces.
> 
> To do this, add the ipmr_offload_forward helper, which per skb, ingress VIF
> and egress VIF, returns whether the forwarding was offloaded to hardware.
> The ipmr_queue_xmit frees the skb and does not forward it if the result is
> a true value.
> 
> All the forwarding path code compiles out when the CONFIG_NET_SWITCHDEV is
> not set.
> 
> Signed-off-by: Yotam Gigi 
> Reviewed-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 
> ---
>  net/ipv4/ipmr.c | 37 -
>  1 file changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 4566c54..deba569 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -1857,10 +1857,33 @@ static inline int ipmr_forward_finish(struct net 
> *net, struct sock *sk,
>   return dst_output(net, sk, skb);
>  }
>  
> +#ifdef CONFIG_NET_SWITCHDEV
> +static bool ipmr_forward_offloaded(struct sk_buff *skb, struct mr_table *mrt,
> +int in_vifi, int out_vifi)
> +{
> + struct vif_device *out_vif = &mrt->vif_table[out_vifi];
> + struct vif_device *in_vif = &mrt->vif_table[in_vifi];

Nit: in_vifi and out_vifi may be better named as in_vif_idx and
out_vif_idx, oh well you are just replicating the existing naming
conventions used down below, never mind then.
-- 
Florian


Re: [PATCH net-next RFC 0/9] net: dsa: PTP timestamping for mv88e6xxx

2017-09-28 Thread Florian Fainelli
On 09/28/2017 10:36 AM, Andrew Lunn wrote:
>> - Patch #3: The GPIO config support is handled in a very simple manner.
>>   I suspect a longer term goal would be to use pinctrl here.
> 
> I assume ptp already has the core code to use pinctrl and Linux
> standard GPIOs? What does the device tree binding look like? How do
> you specify the GPIOs to use?
> 
> What we want to avoid is defining an ABI now, otherwise it is going to
> be hard to swap to pinctrl later.
> 
>> - Patch #6: the dsa_switch pointer and port index is plumbed from
>>   dsa_device_ops::rcv so that we can call the correct port_rxtstamp
>>   method. This involved instrumenting all of the *_tag_rcv functions in
>>   a way that's kind of a kludge and that I'm not terribly happy with.
> 
> Yes, this is ugly. I will see if i can find a better way to do
> this. 

See my reply in patch 6; I may be missing something, but once
dst->rcv() has been called, skb->dev points to the slave network device
which already contains the switch port and switch information in
dsa_slave_priv, so that should lift the need for asking the individual
taggers' rcv() callback to tell us about it.
-- 
Florian


Re: [patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field

2017-09-28 Thread Andrew Lunn
On Thu, Sep 28, 2017 at 07:34:09PM +0200, Jiri Pirko wrote:
> From: Yotam Gigi 
> 
> Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
> used to allow partial offloading of MFC multicast routes.

> The reason why the already existing "offload_fwd_mark" bit cannot be used
> is that a switchdev driver would want to make the distinction between a
> packet that has already gone through L2 forwarding but did not go through
> multicast forwarding, and a packet that has already gone through both L2
> and multicast forwarding.

Hi Jiri

So we are talking about l2 vs l3. So why not call this
offload_l3_fwd_mark?

Is there anything really specific to multicast here?

   Thanks
  Andrew


Re: [PATCH v5 1/4] ipv4: Namespaceify tcp_fastopen knob

2017-09-28 Thread David Miller
From: Haishuang Yan 
Date: Wed, 27 Sep 2017 11:35:40 +0800

> Different namespace applications might require enabling the TCP Fast Open
> feature independently of the host.
> 
> This patch series continues making more of the TCP Fast Open related
> sysctl knobs be per net-namespace.
> 
> Reported-by: Luca BRUNO 
> Signed-off-by: Haishuang Yan 

Applied.


Re: [PATCH] net-ipv6: remove unused IP6_ECN_clear() function

2017-09-28 Thread David Miller
From: Maciej Żenczykowski 
Date: Tue, 26 Sep 2017 20:37:22 -0700

> From: Maciej Żenczykowski 
> 
> This function is unused, and furthermore it is buggy since it suffers
> from the same issue that requires IP6_ECN_set_ce() to take a pointer
> to the skb so that it may (in case of CHECKSUM_COMPLETE) update skb->csum
> 
> Instead of fixing it, let's just outright remove it.
> 
> Tested: builds, and 'git grep IP6_ECN_clear' comes up empty
> 
> Signed-off-by: Maciej Żenczykowski 

Applied to net-next.


Re: [PATCH v5 2/4] ipv4: Remove the 'publish' logic in tcp_fastopen_init_key_once

2017-09-28 Thread David Miller
From: Haishuang Yan 
Date: Wed, 27 Sep 2017 11:35:41 +0800

> The 'publish' logic is not necessary after commit dfea2aa65424 ("tcp:
> Do not call tcp_fastopen_reset_cipher from interrupt context"), because
> in tcp_fastopen_cookie_gen, it wouldn't call tcp_fastopen_init_key_once.
> 
> Signed-off-by: Haishuang Yan 

Applied.


Re: [PATCH v5 3/4] ipv4: Namespaceify tcp_fastopen_key knob

2017-09-28 Thread David Miller
From: Haishuang Yan 
Date: Wed, 27 Sep 2017 11:35:42 +0800

> Different namespace applications might require a different tcp_fastopen_key
> independently of the host.
> 
> David Miller pointed out there is a leak without releasing the context
> of tcp_fastopen_key during netns teardown. So add the release action in
> exit_batch path.
> 
> Tested:
> 1. Container namespace:
> # cat /proc/sys/net/ipv4/tcp_fastopen_key:
> 2817fff2-f803cf97-eadfd1f3-78c0992b
> 
> cookie key in tcp syn packets:
> Fast Open Cookie
> Kind: TCP Fast Open Cookie (34)
> Length: 10
> Fast Open Cookie: 1e5dd82a8c492ca9
> 
> 2. Host:
> # cat /proc/sys/net/ipv4/tcp_fastopen_key:
> 107d7c5f-68eb2ac7-02fb06e6-ed341702
> 
> cookie key in tcp syn packets:
> Fast Open Cookie
> Kind: TCP Fast Open Cookie (34)
> Length: 10
> Fast Open Cookie: e213c02bf0afbc8a
> 
> Signed-off-by: Haishuang Yan 

Applied.


Re: [PATCH v5 4/4] ipv4: Namespaceify tcp_fastopen_blackhole_timeout knob

2017-09-28 Thread David Miller
From: Haishuang Yan 
Date: Wed, 27 Sep 2017 11:35:43 +0800

> Different namespace applications might require a different time period in
> seconds to disable Fastopen on active TCP sockets.
> 
> Tested:
> Simulate following similar situation that the server's data gets dropped
> after 3WHS.
> C  syn-data ---> S
> C <--- syn/ack - S
> C  ack > S
> S (accept & write)
> C?  X <- data -- S
>   [retry and timeout]
> 
> And then print netstat of TCPFastOpenBlackhole, the counter increased as
> expected when the firewall blackhole issue is detected and active TFO is
> disabled.
> # cat /proc/net/netstat | awk '{print $91}'
> TCPFastOpenBlackhole
> 1
> 
> Signed-off-by: Haishuang Yan 

Applied.


Re: [PATCH net-next RFC 3/9] net: dsa: mv88e6xxx: add support for GPIO configuration

2017-09-28 Thread Florian Fainelli
On 09/28/2017 08:25 AM, Brandon Streiff wrote:
> The Scratch/Misc register is a windowed interface that provides access
> to the GPIO configuration. Provide a new method for configuration of
> GPIO functions.
> 
> Signed-off-by: Brandon Streiff 
> ---

> +/* Offset 0x1A: Scratch and Misc. Register */
> +static int mv88e6xxx_g2_scratch_reg_read(struct mv88e6xxx_chip *chip,
> +  int reg, u8 *data)
> +{
> + int err;
> + u16 value;
> +
> + err = mv88e6xxx_g2_write(chip, MV88E6XXX_G2_SCRATCH_MISC_MISC,
> +  reg << 8);
> + if (err)
> + return err;
> +
> + err = mv88e6xxx_g2_read(chip, MV88E6XXX_G2_SCRATCH_MISC_MISC, &value);
> + if (err)
> + return err;
> +
> + *data = (value & MV88E6XXX_G2_SCRATCH_MISC_DATA_MASK);
> +
> + return 0;
> +}

With the write and read acquiring and then releasing the lock
immediately, is there no room for this sequence to be interrupted in the
middle and end up returning inconsistent reads?

> +
> +static int mv88e6xxx_g2_scratch_reg_write(struct mv88e6xxx_chip *chip,
> +   int reg, u8 data)
> +{
> + u16 value = (reg << 8) | data;
> +
> + return mv88e6xxx_g2_update(chip, MV88E6XXX_G2_SCRATCH_MISC_MISC, value);
> +}
> +
> +/* Configures the specified pin for the specified function. This function
> + * does not unset other pins configured for the same function. If multiple
> + * pins are configured for the same function, the lower-index pin gets
> + * that function and the higher-index pin goes back to being GPIO.
> + */
> +int mv88e6xxx_g2_set_gpio_config(struct mv88e6xxx_chip *chip, int pin,
> +  int func, int dir)
> +{
> + int mode_reg = MV88E6XXX_G2_SCRATCH_GPIO_MODE(pin);
> + int dir_reg = MV88E6XXX_G2_SCRATCH_GPIO_DIR(pin);
> + int err;
> + u8 val;
> +
> + if (pin < 0 || pin >= mv88e6xxx_num_gpio(chip))
> + return -ERANGE;
> +
> + /* Set function first */
> + err = mv88e6xxx_g2_scratch_reg_read(chip, mode_reg, &val);
> + if (err)
> + return err;
> +
> + /* Zero bits in the field for this GPIO and OR in new config */
> + val &= ~MV88E6XXX_G2_SCRATCH_GPIO_MODE_MASK(pin);
> + val |= (func << MV88E6XXX_G2_SCRATCH_GPIO_MODE_OFFSET(pin));
> +
> + err = mv88e6xxx_g2_scratch_reg_write(chip, mode_reg, val);
> + if (err)
> + return err;
> +
> + /* Set direction */
> + err = mv88e6xxx_g2_scratch_reg_read(chip, dir_reg, &val);
> + if (err)
> + return err;
> +
> + /* Zero bits in the field for this GPIO and OR in new config */
> + val &= ~MV88E6XXX_G2_SCRATCH_GPIO_DIR_MASK(pin);
> + val |= (dir << MV88E6XXX_G2_SCRATCH_GPIO_DIR_OFFSET(pin));
> +
> + return mv88e6xxx_g2_scratch_reg_write(chip, dir_reg, val);
> +}

Would there be any value in implementing a proper gpiochip structure
here such that other pieces of SW can see this GPIO controller as a
provider and you can reference it from e.g: Device Tree using GPIO
descriptors?
-- 
Florian


Re: [PATCH net-next RFC 6/9] net: dsa: forward timestamping callbacks to switch drivers

2017-09-28 Thread Florian Fainelli
On 09/28/2017 08:25 AM, Brandon Streiff wrote:
> Forward the rx/tx timestamp machinery from the dsa infrastructure to the
> switch driver.
> 
> On the rx side, defer delivery of skbs until we have an rx timestamp.
> This mimics the behavior of skb_defer_rx_timestamp. The implementation
> does have to thread through the tagging protocol handlers, because
> that is where we know which switch and port the skb goes to.
> 
> On the tx side, identify PTP packets, clone them, and pass them to the
> underlying switch driver before we transmit. This mimics the behavior
> of skb_tx_timestamp.
> 
> Signed-off-by: Brandon Streiff 
> ---
>  include/net/dsa.h | 13 +++--
>  net/dsa/dsa.c | 39 ++-
>  net/dsa/slave.c   | 25 +
>  net/dsa/tag_brcm.c|  6 +-
>  net/dsa/tag_dsa.c |  6 +-
>  net/dsa/tag_edsa.c|  6 +-
>  net/dsa/tag_ksz.c |  6 +-
>  net/dsa/tag_lan9303.c |  6 +-
>  net/dsa/tag_mtk.c |  6 +-
>  net/dsa/tag_qca.c |  6 +-
>  net/dsa/tag_trailer.c |  6 +-
>  11 files changed, 114 insertions(+), 11 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 1163af1..4daf7f7 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -101,11 +101,14 @@ struct dsa_platform_data {
>  };
>  
>  struct packet_type;
> +struct dsa_switch;
>  
>  struct dsa_device_ops {
>   struct sk_buff *(*xmit)(struct sk_buff *skb, struct net_device *dev);
>   struct sk_buff *(*rcv)(struct sk_buff *skb, struct net_device *dev,
> -struct packet_type *pt);
> +struct packet_type *pt,
> +struct dsa_switch **src_dev,
> +int *src_port);
>   int (*flow_dissect)(const struct sk_buff *skb, __be16 *proto,
>   int *offset);
>  };
> @@ -134,7 +137,9 @@ struct dsa_switch_tree {
>   /* Copy of tag_ops->rcv for faster access in hot path */
>   struct sk_buff *(*rcv)(struct sk_buff *skb,
>  struct net_device *dev,
> -struct packet_type *pt);
> +struct packet_type *pt,
> +struct dsa_switch **src_dev,
> +int *src_port);
>  
>   /*
>* The switch port to which the CPU is attached.
> @@ -449,6 +454,10 @@ struct dsa_switch_ops {
>struct ifreq *ifr);
>   int (*port_hwtstamp_set)(struct dsa_switch *ds, int port,
>struct ifreq *ifr);
> + void(*port_txtstamp)(struct dsa_switch *ds, int port,
> +  struct sk_buff *clone, unsigned int type);
> + bool(*port_rxtstamp)(struct dsa_switch *ds, int port,
> +  struct sk_buff *skb, unsigned int type);
>  };
>  
>  struct dsa_switch_driver {
> diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
> index 81c852e..42e7286 100644
> --- a/net/dsa/dsa.c
> +++ b/net/dsa/dsa.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  
> @@ -157,6 +158,37 @@ struct net_device *dsa_dev_to_net_device(struct device 
> *dev)
>  }
>  EXPORT_SYMBOL_GPL(dsa_dev_to_net_device);
>  
> +/* Determine if we should defer delivery of skb until we have a rx timestamp.
> + *
> + * Called from dsa_switch_rcv. For now, this will only work if tagging is
> + * enabled on the switch. Normally the MAC driver would retrieve the hardware
> + * timestamp when it reads the packet out of the hardware. However in a DSA
> + * switch, the DSA driver owning the interface to which the packet is
> + * delivered is never notified unless we do so here.
> + */
> +static bool dsa_skb_defer_rx_timestamp(struct dsa_switch *ds, int port,
> +struct sk_buff *skb)

You should not need the port information here because it's already
implied from skb->dev which points to the DSA slave network device, see
below.

> +{
> + unsigned int type;
> +
> + if (skb_headroom(skb) < ETH_HLEN)
> + return false;

Are you positive this is necessary? Because we called dst->rcv() we have
called eth_type_trans() which already made sure about that

> +
> + __skb_push(skb, ETH_HLEN);
> +
> + type = ptp_classify_raw(skb);
> +
> + __skb_pull(skb, ETH_HLEN);
> +
> + if (type == PTP_CLASS_NONE)
> + return false;
> +
> + if (likely(ds->ops->port_rxtstamp))
> + return ds->ops->port_rxtstamp(ds, port, skb, type);
> +
> + return false;
> +}

Can we also have a fast-path bypass in case time stamping is not
supported by the switch so we don't have to even try to classify this
packet only to realize we don't have a port_rxtsamp() operation later?
You can either gate this with a compile-time option, or use e.g: a
sta

Re: [PATCH net-next RFC 0/9] net: dsa: PTP timestamping for mv88e6xxx

2017-09-28 Thread Andrew Lunn
> - Patch #3: The GPIO config support is handled in a very simple manner.
>   I suspect a longer term goal would be to use pinctrl here.

I assume ptp already has the core code to use pinctrl and Linux
standard GPIOs? What does the device tree binding look like? How do
you specify the GPIOs to use?

What we want to avoid is defining an ABI now, otherwise it is going to
be hard to swap to pinctrl later.

> - Patch #6: the dsa_switch pointer and port index is plumbed from
>   dsa_device_ops::rcv so that we can call the correct port_rxtstamp
>   method. This involved instrumenting all of the *_tag_rcv functions in
>   a way that's kind of a kludge and that I'm not terribly happy with.

Yes, this is ugly. I will see if i can find a better way to do
this. 

  Andrew


[iproute PATCH] ip-route: Fix for listing routes with RTAX_LOCK attribute

2017-09-28 Thread Phil Sutter
This fixes a corner-case for routes with a certain metric locked to
zero:

| ip route add 192.168.7.0/24 dev eth0 window 0
| ip route add 192.168.7.0/24 dev eth0 window lock 0

Since the kernel doesn't dump the attribute if it is zero, both routes
added above would appear as if they were equal although they are not.

Fix this by taking the mxlock value for the given metric into account
before skipping it when the attribute is not present.

Reported-by: Thomas Haller 
Signed-off-by: Phil Sutter 
---
 ip/iproute.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index a8733f45bf881..e81bc05ec16cb 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -574,10 +574,10 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
for (i = 2; i <= RTAX_MAX; i++) {
__u32 val = 0U;
 
-   if (mxrta[i] == NULL)
+   if (mxrta[i] == NULL && !(mxlock & (1 << i)))
continue;
 
-   if (i != RTAX_CC_ALGO)
+   if (mxrta[i] != NULL && i != RTAX_CC_ALGO)
val = rta_getattr_u32(mxrta[i]);
 
if (i == RTAX_HOPLIMIT && (int)val == -1)
-- 
2.13.1



Re: [PATCH v3 net-next 00/10] Add support for DCB feature in hns3 driver

2017-09-28 Thread David Miller
From: Yunsheng Lin 
Date: Wed, 27 Sep 2017 09:45:22 +0800

> The patchset contains some enhancement related to DCB before
> adding support for DCB feature.

Series applied, thanks.


Re: [PATCH net] net: Set sk_prot_creator when cloning sockets to the right proto

2017-09-28 Thread David Miller
From: Christoph Paasch 
Date: Tue, 26 Sep 2017 17:38:50 -0700

> sk->sk_prot and sk->sk_prot_creator can differ when the app uses
> IPV6_ADDRFORM (transforming an IPv6-socket to an IPv4-one).
> Which is why sk_prot_creator is there to make sure that sk_prot_free()
> does the kmem_cache_free() on the right kmem_cache slab.
> 
> Now, if such a socket gets transformed back to a listening socket (using
> connect() with AF_UNSPEC) we will allocate an IPv4 tcp_sock through
> sk_clone_lock() when a new connection comes in. But sk_prot_creator will
> still point to the IPv6 kmem_cache (as everything got copied in
> sk_clone_lock()). When freeing, we will thus put this
> memory back into the IPv6 kmem_cache although it was allocated in the
> IPv4 cache. I have seen memory corruption happening because of this.
> 
> With slub-debugging and MEMCG_KMEM enabled this gives the warning
>   "cache_from_obj: Wrong slab cache. TCPv6 but object is from TCP"
> 
> A C-program to trigger this:
 ...
> As far as I can see, this bug has been there since the beginning of the
> git-days.
> 
> Signed-off-by: Christoph Paasch 

Applied and queued up for -stable, thanks.


[patch net-next 1/7] skbuff: Add the offload_mr_fwd_mark field

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

Similarly to the offload_fwd_mark field, the offload_mr_fwd_mark field is
used to allow partial offloading of MFC multicast routes.

Switchdev drivers can offload MFC multicast routes to the hardware by
registering to the FIB notification chain. When one of the route output
interfaces is not offload-able, i.e. has a different parent ID, the route
cannot be fully offloaded by the hardware. Examples of non-offload-able
devices are a management NIC, dummy device, pimreg device, etc.

Similar problem exists in the bridge module, as one bridge can hold
interfaces with different parent IDs. At the bridge, the problem is solved
by the offload_fwd_mark skb field.

Currently, when a route cannot go through full offload, the only solution
for a switchdev driver is not to offload it at all and let the packet go
through slow path.

Using the offload_mr_fwd_mark field, a driver can indicate that a packet
was already forwarded by hardware to all the devices with the same parent
ID as the input device. Further patches in this patch-set are going to
enhance ipmr to skip multicast forwarding to devices with the same parent
ID if a packet is marked with that field.

The reason why the already existing "offload_fwd_mark" bit cannot be used
is that a switchdev driver would want to make the distinction between a
packet that has already gone through L2 forwarding but did not go through
multicast forwarding, and a packet that has already gone through both L2
and multicast forwarding.

For example: when a packet is ingressing from a switchport enslaved to a
bridge, which is configured with multicast forwarding, the following
scenarios are possible:
 - The packet can be trapped to the CPU due to exception while multicast
   forwarding (for example, MTU error). In that case, it had already gone
   through L2 forwarding in the hardware, thus a switchdev driver would
   want to set the skb->offload_fwd_mark and not the
   skb->offload_mr_fwd_mark.
 - The packet can also be trapped due to a pimreg/dummy device used as one
   of the output interfaces. In that case, it can go through both L2 and
   (partial) multicast forwarding inside the hardware, thus a switchdev
   driver would want to set both the skb->offload_fwd_mark and
   skb->offload_mr_fwd_mark.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 include/linux/skbuff.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 19e64bf..ada8214 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -772,6 +772,7 @@ struct sk_buff {
__u8remcsum_offload:1;
 #ifdef CONFIG_NET_SWITCHDEV
__u8offload_fwd_mark:1;
+   __u8offload_mr_fwd_mark:1;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
__u8tc_skip_classify:1;
-- 
2.9.5



[patch net-next 3/7] ipv4: ipmr: Don't forward packets already forwarded by hardware

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

Change the ipmr module to not forward packets if:
 - The packet is marked with the offload_mr_fwd_mark, and
 - Both input interface and output interface share the same parent ID.

This way, a packet can go through partial multicast forwarding in the
hardware, where it will be forwarded only to the devices that share the
same parent ID (AKA, reside inside the same hardware). The kernel will
forward the packet to all other interfaces.

To do this, add the ipmr_offload_forward helper, which per skb, ingress VIF
and egress VIF, returns whether the forwarding was offloaded to hardware.
The ipmr_queue_xmit frees the skb and does not forward it if the result is
a true value.

All the forwarding path code compiles out when the CONFIG_NET_SWITCHDEV is
not set.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 net/ipv4/ipmr.c | 37 -
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 4566c54..deba569 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -1857,10 +1857,33 @@ static inline int ipmr_forward_finish(struct net *net, struct sock *sk,
return dst_output(net, sk, skb);
 }
 
+#ifdef CONFIG_NET_SWITCHDEV
+static bool ipmr_forward_offloaded(struct sk_buff *skb, struct mr_table *mrt,
+  int in_vifi, int out_vifi)
+{
+   struct vif_device *out_vif = &mrt->vif_table[out_vifi];
+   struct vif_device *in_vif = &mrt->vif_table[in_vifi];
+
+   if (!skb->offload_mr_fwd_mark)
+   return false;
+   if (!out_vif->dev_parent_id_valid || !in_vif->dev_parent_id_valid)
+   return false;
+   return netdev_phys_item_id_same(&out_vif->dev_parent_id,
+   &in_vif->dev_parent_id);
+}
+#else
+static bool ipmr_forward_offloaded(struct sk_buff *skb, struct mr_table *mrt,
+  int in_vifi, int out_vifi)
+{
+   return false;
+}
+#endif
+
 /* Processing handlers for ipmr_forward */
 
 static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
-   struct sk_buff *skb, struct mfc_cache *c, int vifi)
+   int in_vifi, struct sk_buff *skb,
+   struct mfc_cache *c, int vifi)
 {
const struct iphdr *iph = ip_hdr(skb);
struct vif_device *vif = &mrt->vif_table[vifi];
@@ -1881,6 +1904,9 @@ static void ipmr_queue_xmit(struct net *net, struct mr_table *mrt,
goto out_free;
}
 
+   if (ipmr_forward_offloaded(skb, mrt, in_vifi, vifi))
+   goto out_free;
+
if (vif->flags & VIFF_TUNNEL) {
rt = ip_route_output_ports(net, &fl4, NULL,
   vif->remote, vif->local,
@@ -2058,8 +2084,8 @@ static void ip_mr_forward(struct net *net, struct mr_table *mrt,
 			struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
if (skb2)
-   ipmr_queue_xmit(net, mrt, skb2, cache,
-   psend);
+   ipmr_queue_xmit(net, mrt, true_vifi,
+   skb2, cache, psend);
}
psend = ct;
}
@@ -2070,9 +2096,10 @@ static void ip_mr_forward(struct net *net, struct mr_table *mrt,
struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
if (skb2)
-   ipmr_queue_xmit(net, mrt, skb2, cache, psend);
+   ipmr_queue_xmit(net, mrt, true_vifi, skb2,
+   cache, psend);
} else {
-   ipmr_queue_xmit(net, mrt, skb, cache, psend);
+   ipmr_queue_xmit(net, mrt, true_vifi, skb, cache, psend);
return;
}
}
-- 
2.9.5



[patch net-next 2/7] ipv4: ipmr: Add the parent ID field to VIF struct

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

In order to allow the ipmr module to do partial multicast forwarding
according to the device parent ID, add the device parent ID field to the
VIF struct. This way, the forwarding path can use the parent ID field
without invoking switchdev calls, which require the RTNL lock.

When a new VIF is added, set the device parent ID field in it by invoking
the switchdev_port_attr_get call.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 include/linux/mroute.h | 2 ++
 net/ipv4/ipmr.c| 9 +
 2 files changed, 11 insertions(+)

diff --git a/include/linux/mroute.h b/include/linux/mroute.h
index b072a84..a46577f 100644
--- a/include/linux/mroute.h
+++ b/include/linux/mroute.h
@@ -57,6 +57,8 @@ static inline bool ipmr_rule_default(const struct fib_rule *rule)
 
 struct vif_device {
 	struct net_device	*dev;			/* Device we are using */
+	struct netdev_phys_item_id dev_parent_id;	/* Device parent ID */
+	bool			dev_parent_id_valid;
 	unsigned long		bytes_in,bytes_out;
 	unsigned long		pkt_in,pkt_out;		/* Statistics */
 	unsigned long		rate_limit;		/* Traffic shaping (NI) */
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 292a8e8..4566c54 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -67,6 +67,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct ipmr_rule {
struct fib_rule common;
@@ -868,6 +869,9 @@ static int vif_add(struct net *net, struct mr_table *mrt,
   struct vifctl *vifc, int mrtsock)
 {
int vifi = vifc->vifc_vifi;
+   struct switchdev_attr attr = {
+   .id = SWITCHDEV_ATTR_ID_PORT_PARENT_ID,
+   };
struct vif_device *v = &mrt->vif_table[vifi];
struct net_device *dev;
struct in_device *in_dev;
@@ -942,6 +946,11 @@ static int vif_add(struct net *net, struct mr_table *mrt,
 
/* Fill in the VIF structures */
 
+   attr.orig_dev = dev;
+   if (!switchdev_port_attr_get(dev, &attr)) {
+   v->dev_parent_id_valid = true;
+   memcpy(v->dev_parent_id.id, attr.u.ppid.id, attr.u.ppid.id_len);
+   }
v->rate_limit = vifc->vifc_rate_limit;
v->local = vifc->vifc_lcl_addr.s_addr;
v->remote = vifc->vifc_rmt_addr.s_addr;
-- 
2.9.5



[patch net-next 6/7] mlxsw: spectrum: mr_tcam: Add trap-and-forward multicast route

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

In addition to the current multicast route actions, which include a trap
route action and a forward route action, add a trap-and-forward multicast
route action, and implement it in the multicast routing hardware logic.

To implement that, add a trap-and-forward ACL action as the last action in
the route's flexible action set. The trap used is the ACL2 trap, which marks
the packets with offload_mr_fwd_mark, to prevent the packet from being
forwarded again by the kernel.

Note: at this stage the offloading logic does not support trap-and-forward
multicast routes. This patch adds the support only in the hardware logic.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h  | 1 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c | 8 
 2 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h
index c851b23..5d26a12 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h
@@ -42,6 +42,7 @@
 enum mlxsw_sp_mr_route_action {
MLXSW_SP_MR_ROUTE_ACTION_FORWARD,
MLXSW_SP_MR_ROUTE_ACTION_TRAP,
+   MLXSW_SP_MR_ROUTE_ACTION_TRAP_AND_FORWARD,
 };
 
 enum mlxsw_sp_mr_route_prio {
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
index cda9e9a..3ffb28d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c
@@ -253,6 +253,7 @@ mlxsw_sp_mr_tcam_afa_block_create(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err;
break;
+   case MLXSW_SP_MR_ROUTE_ACTION_TRAP_AND_FORWARD:
case MLXSW_SP_MR_ROUTE_ACTION_FORWARD:
/* If we are about to append a multicast router action, commit
 * the erif_list.
@@ -266,6 +267,13 @@ mlxsw_sp_mr_tcam_afa_block_create(struct mlxsw_sp *mlxsw_sp,
  erif_list->kvdl_index);
if (err)
goto err;
+
+   if (route_action == MLXSW_SP_MR_ROUTE_ACTION_TRAP_AND_FORWARD) {
+		err = mlxsw_afa_block_append_trap_and_forward(afa_block,
+							      MLXSW_TRAP_ID_ACL2);
+   if (err)
+   goto err;
+   }
break;
default:
err = -EINVAL;
-- 
2.9.5



[patch net-next 7/7] mlxsw: spectrum: mr: Support trap-and-forward routes

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

Add support for the trap-and-forward route action in the multicast routing
offloading logic. A route will be set to the trap-and-forward action if one
(or more) of its output interfaces is not offload-able, i.e. does not have a
valid Spectrum RIF.

This way, a route with a mixed output VIF list, containing both offload-able
and non-offload-able devices, can go through partial offloading in hardware,
with the rest done by the kernel ipmr module.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
index 0912025..4c0848e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c
@@ -114,9 +114,9 @@ static bool mlxsw_sp_mr_vif_valid(const struct mlxsw_sp_mr_vif *vif)
return mlxsw_sp_mr_vif_regular(vif) && vif->dev && vif->rif;
 }
 
-static bool mlxsw_sp_mr_vif_rif_invalid(const struct mlxsw_sp_mr_vif *vif)
+static bool mlxsw_sp_mr_vif_exists(const struct mlxsw_sp_mr_vif *vif)
 {
-   return mlxsw_sp_mr_vif_regular(vif) && vif->dev && !vif->rif;
+   return vif->dev;
 }
 
 static bool
@@ -182,14 +182,13 @@ mlxsw_sp_mr_route_action(const struct mlxsw_sp_mr_route *mr_route)
if (!mlxsw_sp_mr_route_valid_evifs_num(mr_route))
return MLXSW_SP_MR_ROUTE_ACTION_TRAP;
 
-   /* If either one of the eVIFs is not regular (VIF of type pimreg or
-* tunnel) or one of the VIFs has no matching RIF, trap the packet.
+   /* If one of the eVIFs has no RIF, trap-and-forward the route as there
+* is some more routing to do in software too.
 */
-   list_for_each_entry(rve, &mr_route->evif_list, route_node) {
-   if (!mlxsw_sp_mr_vif_regular(rve->mr_vif) ||
-   mlxsw_sp_mr_vif_rif_invalid(rve->mr_vif))
-   return MLXSW_SP_MR_ROUTE_ACTION_TRAP;
-   }
+   list_for_each_entry(rve, &mr_route->evif_list, route_node)
+   if (mlxsw_sp_mr_vif_exists(rve->mr_vif) && !rve->mr_vif->rif)
+   return MLXSW_SP_MR_ROUTE_ACTION_TRAP_AND_FORWARD;
+
return MLXSW_SP_MR_ROUTE_ACTION_FORWARD;
 }
 
-- 
2.9.5



[patch net-next 4/7] mlxsw: acl: Introduce ACL trap and forward action

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

Use the trap/discard flex action to implement trap-and-forward. The action
will later be used for multicast routing, as the multicast routing mechanism
is implemented using ACL flexible actions in Spectrum hardware. Using that
action, it will be possible to implement a trap-and-forward route.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c | 17 +
 .../net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h |  2 ++
 2 files changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
index bc55d0e..6a979a0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
@@ -676,6 +676,7 @@ enum mlxsw_afa_trapdisc_trap_action {
 MLXSW_ITEM32(afa, trapdisc, trap_action, 0x00, 24, 4);
 
 enum mlxsw_afa_trapdisc_forward_action {
+   MLXSW_AFA_TRAPDISC_FORWARD_ACTION_FORWARD = 1,
MLXSW_AFA_TRAPDISC_FORWARD_ACTION_DISCARD = 3,
 };
 
@@ -729,6 +730,22 @@ int mlxsw_afa_block_append_trap(struct mlxsw_afa_block *block, u16 trap_id)
 }
 EXPORT_SYMBOL(mlxsw_afa_block_append_trap);
 
+int mlxsw_afa_block_append_trap_and_forward(struct mlxsw_afa_block *block,
+   u16 trap_id)
+{
+   char *act = mlxsw_afa_block_append_action(block,
+ MLXSW_AFA_TRAPDISC_CODE,
+ MLXSW_AFA_TRAPDISC_SIZE);
+
+   if (!act)
+   return -ENOBUFS;
+   mlxsw_afa_trapdisc_pack(act, MLXSW_AFA_TRAPDISC_TRAP_ACTION_TRAP,
+   MLXSW_AFA_TRAPDISC_FORWARD_ACTION_FORWARD,
+   trap_id);
+   return 0;
+}
+EXPORT_SYMBOL(mlxsw_afa_block_append_trap_and_forward);
+
 /* Forwarding Action
  * -
  * Forwarding Action can be used to implement Policy Based Switching (PBS)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
index 06b0be4..a8d3314 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
@@ -61,6 +61,8 @@ int mlxsw_afa_block_continue(struct mlxsw_afa_block *block);
 int mlxsw_afa_block_jump(struct mlxsw_afa_block *block, u16 group_id);
 int mlxsw_afa_block_append_drop(struct mlxsw_afa_block *block);
 int mlxsw_afa_block_append_trap(struct mlxsw_afa_block *block, u16 trap_id);
+int mlxsw_afa_block_append_trap_and_forward(struct mlxsw_afa_block *block,
+   u16 trap_id);
 int mlxsw_afa_block_append_fwd(struct mlxsw_afa_block *block,
   u8 local_port, bool in_port);
 int mlxsw_afa_block_append_vlan_modify(struct mlxsw_afa_block *block,
-- 
2.9.5



[patch net-next 5/7] mlxsw: spectrum: Add trap for multicast trap-and-forward routes

2017-09-28 Thread Jiri Pirko
From: Yotam Gigi 

When a multicast route is configured with trap-and-forward action, the
packets should be marked with skb->offload_mr_fwd_mark, in order to prevent
the packets from being forwarded again by the kernel ipmr module.

Due to this, it is not possible to use the already existing multicast trap
(MLXSW_TRAP_ID_ACL1), as the packet should be marked differently. Add
MLXSW_TRAP_ID_ACL2, which is used for trap-and-forward multicast routes, and
set the skb's offload_mr_fwd_mark field in its handler.

Signed-off-by: Yotam Gigi 
Reviewed-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 13 +
 drivers/net/ethernet/mellanox/mlxsw/trap.h |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index e9b9443..3adf237 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -3312,6 +3312,14 @@ static void mlxsw_sp_rx_listener_mark_func(struct sk_buff *skb, u8 local_port,
return mlxsw_sp_rx_listener_no_mark_func(skb, local_port, priv);
 }
 
+static void mlxsw_sp_rx_listener_mr_mark_func(struct sk_buff *skb,
+ u8 local_port, void *priv)
+{
+   skb->offload_mr_fwd_mark = 1;
+   skb->offload_fwd_mark = 1;
+   return mlxsw_sp_rx_listener_no_mark_func(skb, local_port, priv);
+}
+
 static void mlxsw_sp_rx_listener_sample_func(struct sk_buff *skb, u8 local_port,
 void *priv)
 {
@@ -3355,6 +3363,10 @@ static void mlxsw_sp_rx_listener_sample_func(struct sk_buff *skb, u8 local_port,
MLXSW_RXL(mlxsw_sp_rx_listener_mark_func, _trap_id, _action,\
_is_ctrl, SP_##_trap_group, DISCARD)
 
+#define MLXSW_SP_RXL_MR_MARK(_trap_id, _action, _trap_group, _is_ctrl) \
+   MLXSW_RXL(mlxsw_sp_rx_listener_mr_mark_func, _trap_id, _action, \
+   _is_ctrl, SP_##_trap_group, DISCARD)
+
 #define MLXSW_SP_EVENTL(_func, _trap_id)   \
MLXSW_EVENTL(_func, _trap_id, SP_EVENT)
 
@@ -3425,6 +3437,7 @@ static const struct mlxsw_listener mlxsw_sp_listener[] = {
MLXSW_SP_RXL_MARK(IPV4_PIM, TRAP_TO_CPU, PIM, false),
MLXSW_SP_RXL_MARK(RPF, TRAP_TO_CPU, RPF, false),
MLXSW_SP_RXL_MARK(ACL1, TRAP_TO_CPU, MULTICAST, false),
+   MLXSW_SP_RXL_MR_MARK(ACL2, TRAP_TO_CPU, MULTICAST, false),
 };
 
 static int mlxsw_sp_cpu_policers_set(struct mlxsw_core *mlxsw_core)
diff --git a/drivers/net/ethernet/mellanox/mlxsw/trap.h b/drivers/net/ethernet/mellanox/mlxsw/trap.h
index a981035..ec6cef8 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/trap.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/trap.h
@@ -93,6 +93,8 @@ enum {
MLXSW_TRAP_ID_ACL0 = 0x1C0,
/* Multicast trap used for routes with trap action */
MLXSW_TRAP_ID_ACL1 = 0x1C1,
+   /* Multicast trap used for routes with trap-and-forward action */
+   MLXSW_TRAP_ID_ACL2 = 0x1C2,
 
MLXSW_TRAP_ID_MAX = 0x1FF
 };
-- 
2.9.5



[patch net-next 0/7] mlxsw: Add support for partial multicast route offload

2017-09-28 Thread Jiri Pirko
From: Jiri Pirko 

Yotam says:

Previous patchset introduced support for offloading multicast MFC routes to
the Spectrum hardware. As described in that patchset, no partial offloading
is supported, i.e. if a route has one output interface which is not a valid
offloadable device (e.g. pimreg device, dummy device, management NIC), the
route is trapped to the CPU and the forwarding is done in the slow path.

Add support for partial offloading of multicast routes, by letting the
hardware forward the packet to all the in-hardware devices, while the
kernel ipmr module continues forwarding to all other interfaces.

Similarly to the bridge, the kernel ipmr module will forward a marked
packet to an interface only if the interface has a different parent ID than
the packet's ingress interface.
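The forwarding rule just described can be sketched in isolation. The types and names below are illustrative models, not the kernel's (the real check in this series is ipmr_forward_offloaded in patch 3):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model of a port with a switch "parent ID" (identifies the ASIC). */
struct port {
	char parent_id[32];
	bool parent_id_valid;
};

/* Software forwarding is needed unless the packet was already
 * multicast-forwarded in hardware AND ingress and egress ports provably
 * live on the same ASIC (same, valid parent IDs).
 */
static bool sw_forward_needed(bool offload_mr_fwd_mark,
			      const struct port *in, const struct port *out)
{
	if (!offload_mr_fwd_mark)
		return true;	/* HW did not multicast-forward this packet */
	if (!in->parent_id_valid || !out->parent_id_valid)
		return true;	/* can't prove same ASIC: forward in SW */
	return strcmp(in->parent_id, out->parent_id) != 0;
}
```

This mirrors the partial-offload split: egress ports on the packet's own ASIC are skipped by the kernel, everything else is still forwarded in software.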

The first patch introduces the offload_mr_fwd_mark skb field, which can be
used by offloading drivers to indicate that a packet had already gone
through multicast forwarding in hardware, similarly to the offload_fwd_mark
field that indicates that a packet had already gone through L2 forwarding
in hardware.

Patches 2 and 3 change the ipmr module to not forward packets that have
already been forwarded by the hardware, i.e. packets that are marked with
offload_mr_fwd_mark and whose ingress VIF shares the same parent ID as the
egress VIF.

Patches 4, 5, 6 and 7 add the support in the mlxsw Spectrum driver for trap
and forward routes, while marking the trapped packets with the
offload_mr_fwd_mark.

Yotam Gigi (7):
  skbuff: Add the offload_mr_fwd_mark field
  ipv4: ipmr: Add the parent ID field to VIF struct
  ipv4: ipmr: Don't forward packets already forwarded by hardware
  mlxsw: acl: Introduce ACL trap and forward action
  mlxsw: spectrum: Add trap for multicast trap-and-forward routes
  mlxsw: spectrum: mr_tcam: Add trap-and-forward multicast route
  mlxsw: spectrum: mr: Support trap-and-forward routes

 .../mellanox/mlxsw/core_acl_flex_actions.c | 17 
 .../mellanox/mlxsw/core_acl_flex_actions.h |  2 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 13 ++
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.c  | 17 
 drivers/net/ethernet/mellanox/mlxsw/spectrum_mr.h  |  1 +
 .../net/ethernet/mellanox/mlxsw/spectrum_mr_tcam.c |  8 
 drivers/net/ethernet/mellanox/mlxsw/trap.h |  2 +
 include/linux/mroute.h |  2 +
 include/linux/skbuff.h |  1 +
 net/ipv4/ipmr.c| 46 +++---
 10 files changed, 95 insertions(+), 14 deletions(-)

-- 
2.9.5



Re: [PATCH net-next] libbpf: use map_flags when creating maps

2017-09-28 Thread Craig Gallek
On Wed, Sep 27, 2017 at 6:03 PM, Daniel Borkmann  wrote:
> On 09/27/2017 06:29 PM, Alexei Starovoitov wrote:
>>
>> On 9/27/17 7:04 AM, Craig Gallek wrote:
>>>
>>> From: Craig Gallek 
>>>
>>> This extends struct bpf_map_def to include a flags field.  Note that
>>> this has the potential to break the validation logic in
>>> bpf_object__validate_maps and bpf_object__init_maps as they use
>>> sizeof(struct bpf_map_def) as a minimal allowable size of a map section.
>>> Any bpf program compiled with a smaller struct bpf_map_def will fail this
>>> check.
>>>
>>> I don't believe this will be an issue in practice as both compile-time
>>> definitions of struct bpf_map_def (in samples/bpf/bpf_load.h and
>>> tools/testing/selftests/bpf/bpf_helpers.h) have always been larger
>>> than this newly updated version in libbpf.h.
>>>
>>> Signed-off-by: Craig Gallek 
>>> ---
>>>  tools/lib/bpf/libbpf.c | 2 +-
>>>  tools/lib/bpf/libbpf.h | 1 +
>>>  2 files changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>>> index 35f6dfcdc565..6bea85f260a3 100644
>>> --- a/tools/lib/bpf/libbpf.c
>>> +++ b/tools/lib/bpf/libbpf.c
>>> @@ -874,7 +874,7 @@ bpf_object__create_maps(struct bpf_object *obj)
>>>def->key_size,
>>>def->value_size,
>>>def->max_entries,
>>> -  0);
>>> +  def->map_flags);
>>>  if (*pfd < 0) {
>>>  size_t j;
>>>  int err = *pfd;
>>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>>> index 7959086eb9c9..6e20003109e0 100644
>>> --- a/tools/lib/bpf/libbpf.h
>>> +++ b/tools/lib/bpf/libbpf.h
>>> @@ -207,6 +207,7 @@ struct bpf_map_def {
>>>  unsigned int key_size;
>>>  unsigned int value_size;
>>>  unsigned int max_entries;
>>> +unsigned int map_flags;
>>>  };
>>
>>
>> yes it will break loading of pre-compiled .o
>> Instead of breaking, let's fix the loader to do it the way
>> samples/bpf/bpf_load.c does.
>> See commit 156450d9d964 ("samples/bpf: make bpf_load.c code compatible
>> with ELF maps section changes")
>
>
> +1, iproute2 loader also does map spec fixup
>
> For libbpf it would be good also such that it reduces the diff
> further between the libbpf and bpf_load so that it allows move
> to libbpf for samples in future.

Fair enough, I'll try to get this to work more dynamically.  I did
notice that the fields of struct bpf_map_def in
selftests/.../bpf_helpers.h and iproute2's struct bpf_elf_map have
diverged. The flags field is the only thing missing from libbpf right
now (and it is at the same offset in both), so it won't be an
issue for this change, but it is going to make unifying all of these
things under libbpf non-trivial at some point...
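The "map spec fixup" approach referenced above (as in samples/bpf/bpf_load.c, commit 156450d9d964) can be sketched roughly as follows: copy min(on-disk record size, struct size) and zero-fill the rest, so objects built against a smaller or larger struct bpf_map_def still load. The struct and function below are trimmed, illustrative versions, not libbpf's API:

```c
#include <assert.h>
#include <string.h>

/* Illustrative, trimmed map definition; the real one lives in libbpf.h. */
struct map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;	/* newly added field */
};

/* Size-tolerant read of one record from the ELF "maps" section:
 * older objects (smaller records) get zeroed new fields, newer objects
 * (larger records) have their extra trailing bytes ignored.
 */
static void read_map_def(struct map_def *dst, const void *elf_rec,
			 size_t rec_size)
{
	size_t n = rec_size < sizeof(*dst) ? rec_size : sizeof(*dst);

	memset(dst, 0, sizeof(*dst));	/* new fields default to 0 */
	memcpy(dst, elf_rec, n);	/* copy only what exists on disk */
}
```

With this shape, a pre-compiled .o whose struct bpf_map_def lacks map_flags simply loads with map_flags == 0 instead of failing a strict size check.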


Re: [PATCH] arp: make arp_hdr_len() return unsigned int

2017-09-28 Thread David Miller
From: Alexey Dobriyan 
Date: Tue, 26 Sep 2017 23:12:28 +0300

> Negative ARP header lengths are not a thing.
> 
> Constify arguments while I'm at it.
> 
> Space savings:
> 
>   add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-3 (-3)
>   function                               old     new   delta
>   arpt_do_table                         1163    1160      -3
> 
> Signed-off-by: Alexey Dobriyan 

Applied.

