Re: [Bridge] [RFC net-next v2] bridge lwtunnel, VPLS & NVGRE

2017-08-21 Thread Roopa Prabhu
On Mon, Aug 21, 2017 at 10:15 AM, David Lamparter  wrote:
> Hi all,
>
>
> this is an update on the earlier "[RFC net-next] VPLS support".  Note
> I've changed the subject lines on some of the patches to better reflect
> what they really do (tbh the earlier subject lines were crap.)
>
> As previously, iproute2 / FRR patches are at:
> - https://github.com/eqvinox/vpls-iproute2
> - https://github.com/opensourcerouting/frr/commits/vpls
> while this patchset is also available at:
> - https://github.com/eqvinox/vpls-linux-kernel
> (but please be aware that I'm amending and rebasing commits)
>
> The NVGRE implementation in the 3rd patch in this series is actually an
> accident - I was just wiring up gretap as a reference;  only after I was
> done did I notice that it sums up to NVGRE, more or less.  IMHO, it does
> serve well to demonstrate the bridge changes are not VPLS-specific.
>
> To recap some notes from the first announcement mail:
>> I've tested some basic setups, the chain from LDP down into the kernel
>> works at least in these.  FRR has some testcases around from OpenBSD
>> VPLS support, I haven't wired that up to run against Linux / this
>> patchset yet.
>
> Same as before (API didn't change).
>
>> The patchset needs a lot of polishing (yes I left my TODO notes in the
>> commit messages), for now my primary concern is overall design
>> feedback.  Roopa has already provided a lot of input (Thanks!);  the
>> major topic I'm expecting to get discussion on is the bridge FDB
>> changes.
>
> Got some useful input;  but still need feedback on the bridge FDB
> changes (first 2 patches).  I don't believe they have a significant
> impact on existing bridge operation, and I believe a multipoint tunnel
> driver without its own FDB (e.g. NVGRE in this set) should perform
> better than one with its own FDB (e.g. existing VXLAN).
>
>> P.S.: For a little context on the bridge FDB changes - I'm hoping to
>> find some time to extend this to the MDB to allow aggregating dst
>> metadata and handing down a list of dst metas on TX.  This isn't
>> specifically for VPLS but rather to give sufficient information to the
>> 802.11 stack to allow it to optimize selecting rates (or unicasting)
>> for multicast traffic by having the multicast subscriber list known.
>> This is done by major commercial wifi solutions (e.g. google "dynamic
>> multicast optimization".)
>
> You can find hacks at this on:
> https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
> Please note that the patches in that branch are not at an acceptable
> quality level, but you can see the semantic relation to 802.11.
>
> I would, however, like to point out that this branch has pseudo-working
> IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
> (I'll do that as soon as I get to it, it'll pop up on that branch too.)
>
> This is relevant to the discussion because it's a feature where it's
> non-obvious (to me) how to implement with the VXLAN model of having an
> entirely separate FDB.  Meanwhile, with this architecture, the proof of
> concept / hack is coming in at a measly cost of:
> 8 files changed, 176 insertions(+), 15 deletions(-)

David, what is special about the vpls igmp/mld snooping code? Do you
have to snoop vpls attrs?
In the vxlan model, the vxlan driver can snoop its own attrs, e.g.
vxlan-id, remote dst, etc., and the pkt is passed up to the bridge
where it will hit the normal bridge igmp/mld snooping code.
Can you please elaborate?

keeping vpls-specific code and api in a separate vpls driver allows
for cleanly extending it in the future.


Re: [Bridge] [RFC net-next v2] bridge lwtunnel, VPLS & NVGRE

2017-08-21 Thread David Lamparter
On Mon, Aug 21, 2017 at 05:01:51PM -0700, Stephen Hemminger wrote:
> On Mon, 21 Aug 2017 19:15:17 +0200 David Lamparter  wrote:
> > > P.S.: For a little context on the bridge FDB changes - I'm hoping to
> > > find some time to extend this to the MDB to allow aggregating dst
> > > metadata and handing down a list of dst metas on TX.  This isn't
> > > specifically for VPLS but rather to give sufficient information to the
> > > 802.11 stack to allow it to optimize selecting rates (or unicasting)
> > > for multicast traffic by having the multicast subscriber list known.
> > > This is done by major commercial wifi solutions (e.g. google "dynamic
> > > multicast optimization".)  
> > 
> > You can find hacks at this on:
> > https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
> > Please note that the patches in that branch are not at an acceptable
> > quality level, but you can see the semantic relation to 802.11.
> > 
> > I would, however, like to point out that this branch has pseudo-working
> > IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
> > (I'll do that as soon as I get to it, it'll pop up on that branch too.)
> > 
> > This is relevant to the discussion because it's a feature where it's
> > non-obvious (to me) how to implement with the VXLAN model of having an
> > entirely separate FDB.  Meanwhile, with this architecture, the proof of
> > concept / hack is coming in at a measly cost of:
> > 8 files changed, 176 insertions(+), 15 deletions(-)
> 
> I know the bridge is an easy target to extend L2 forwarding, but it is not
> the only option. Have you considered building a new driver

Yes I have;  I dismissed the approach because, even though an FDB is
reasonable to duplicate, I did not believe replicating multicast
snooping code into both VPLS and 802.11 (and possibly VXLAN) to be a
viable option.  ...is it?

> (like VXLAN does) which does the forwarding you want? Having all
> features in one driver makes for worse performance and increased
> complexity.

Can you elaborate?  I agree with that sentence as a general statement,
but a general statement still needs to hold for the specific situation
at hand.  As discussed in the previous thread with Nikolay, checking
skb->_refdst against 0 should be doable without touching additional
cachelines, so the performance cost should be rather small.  As for
complexity: it's keeping one extra pointer around, which is
semantically bound to the existing net_bridge_fdb_entry->dst.  On the
other hand, it spares us another copy of an FDB implementation, and
two copies of multicast snooping code...  I honestly believe this
patchset is a good approach.
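
(For reference, a minimal sketch of the check in question -- the helper
name here is made up, it is not part of the patchset: skb->_skb_refdst
is a field of the sk_buff itself, so comparing it against 0 touches no
cacheline the forwarding path isn't already on; only entries that
actually carry tunnel metadata pay for the extra dereference.)

	/* Hedged sketch; hypothetical helper, not from the series. */
	static inline bool skb_has_metadata_dst(const struct sk_buff *skb)
	{
		if (!skb->_skb_refdst)	/* common case: nothing attached */
			return false;
		/* skb_metadata_dst() returns NULL unless DST_METADATA is set */
		return skb_metadata_dst(skb) != NULL;
	}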


-David


Re: [Bridge] [RFC net-next v2] bridge lwtunnel, VPLS & NVGRE

2017-08-21 Thread Stephen Hemminger
On Mon, 21 Aug 2017 19:15:17 +0200
David Lamparter  wrote:

> Hi all,
> 
> 
> this is an update on the earlier "[RFC net-next] VPLS support".  Note
> I've changed the subject lines on some of the patches to better reflect
> what they really do (tbh the earlier subject lines were crap.)
> 
> As previously, iproute2 / FRR patches are at:
> - https://github.com/eqvinox/vpls-iproute2
> - https://github.com/opensourcerouting/frr/commits/vpls
> while this patchset is also available at:
> - https://github.com/eqvinox/vpls-linux-kernel
> (but please be aware that I'm amending and rebasing commits)
> 
> The NVGRE implementation in the 3rd patch in this series is actually an
> accident - I was just wiring up gretap as a reference;  only after I was
> done did I notice that it sums up to NVGRE, more or less.  IMHO, it does
> serve well to demonstrate the bridge changes are not VPLS-specific.
> 
> To recap some notes from the first announcement mail:
> > I've tested some basic setups, the chain from LDP down into the kernel
> > works at least in these.  FRR has some testcases around from OpenBSD
> > VPLS support, I haven't wired that up to run against Linux / this
> > patchset yet.  
> 
> Same as before (API didn't change).
> 
> > The patchset needs a lot of polishing (yes I left my TODO notes in the
> > commit messages), for now my primary concern is overall design
> > feedback.  Roopa has already provided a lot of input (Thanks!);  the
> > major topic I'm expecting to get discussion on is the bridge FDB
> > changes.  
> 
> Got some useful input;  but still need feedback on the bridge FDB
> changes (first 2 patches).  I don't believe they have a significant
> impact on existing bridge operation, and I believe a multipoint tunnel
> driver without its own FDB (e.g. NVGRE in this set) should perform
> better than one with its own FDB (e.g. existing VXLAN).
> 
> > P.S.: For a little context on the bridge FDB changes - I'm hoping to
> > find some time to extend this to the MDB to allow aggregating dst
> > metadata and handing down a list of dst metas on TX.  This isn't
> > specifically for VPLS but rather to give sufficient information to the
> > 802.11 stack to allow it to optimize selecting rates (or unicasting)
> > for multicast traffic by having the multicast subscriber list known.
> > This is done by major commercial wifi solutions (e.g. google "dynamic
> > multicast optimization".)  
> 
> You can find hacks at this on:
> https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
> Please note that the patches in that branch are not at an acceptable
> quality level, but you can see the semantic relation to 802.11.
> 
> I would, however, like to point out that this branch has pseudo-working
> IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
> (I'll do that as soon as I get to it, it'll pop up on that branch too.)
> 
> This is relevant to the discussion because it's a feature where it's
> non-obvious (to me) how to implement with the VXLAN model of having an
> entirely separate FDB.  Meanwhile, with this architecture, the proof of
> concept / hack is coming in at a measly cost of:
> 8 files changed, 176 insertions(+), 15 deletions(-)
> 
> 
> Cheers,
> 
> -David
> 
> 
> --- diffstat:
> include/linux/netdevice.h  |  18 ++
> include/net/dst_metadata.h |  51 ++---
> include/net/ip_tunnels.h   |   5 ++
> include/uapi/linux/lwtunnel.h  |   8 +++
> include/uapi/linux/neighbour.h |   2 +
> include/uapi/linux/rtnetlink.h |   5 ++
> net/bridge/br.c|   2 +-
> net/bridge/br_device.c |   4 ++
> net/bridge/br_fdb.c| 119 
> net/bridge/br_input.c  |   6 +-
> net/bridge/br_private.h|   6 +-
> net/core/lwtunnel.c|   1 +
> net/ipv4/ip_gre.c  |  40 --
> net/ipv4/ip_tunnel.c   |   1 +
> net/ipv4/ip_tunnel_core.c  |  87 +++--
> net/mpls/Kconfig   |  11 
> net/mpls/Makefile  |   1 +
> net/mpls/af_mpls.c | 113 --
> net/mpls/internal.h|  44 +--
> net/mpls/vpls.c| 550 +++
> 20 files changed, 990 insertions(+), 84 deletions(-)

I know the bridge is an easy target to extend L2 forwarding, but it is not
the only option. Have you considered building a new driver (like VXLAN does)
which does the forwarding you want? Having all features in one driver
makes for worse performance and increased complexity.


[Bridge] [PATCH 5/6] mpls: add VPLS entry points

2017-08-21 Thread David Lamparter
This wires up the necessary calls for VPLS into the MPLS forwarding
pieces.  Since CONFIG_MPLS_VPLS doesn't exist yet in Kconfig, it'll
never be enabled, so we're on the stubs for now.

[TODO: maybe rename VPLS_FLAGS to MPLS_FLAGS and use it for
non-pseudowire OAM bits too (e.g. enabling G-ACh or LSP ping.)]
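
As a quick illustration of what the flags mean on RX (a sketch only --
the real handling belongs to vpls_rcv() in the VPLS driver patch, and
the length constant is a hypothetical local one; see RFC 4448 for the
control word itself):

	/* Hedged sketch: a negotiated pseudowire control word sits between
	 * the MPLS label stack and the Ethernet payload.  Strip it on RX
	 * when RTA_VPLS_F_CW_RX is set; TX would prepend a zeroed word
	 * when RTA_VPLS_F_CW_TX is set. */
	#define VPLS_CW_LEN 4	/* hypothetical constant */

	if (rt->rt_vpls_flags & RTA_VPLS_F_CW_RX) {
		if (!pskb_may_pull(skb, VPLS_CW_LEN))
			goto drop;
		skb_pull(skb, VPLS_CW_LEN);
	}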

Signed-off-by: David Lamparter 
---
 include/uapi/linux/rtnetlink.h |  5 
 net/mpls/af_mpls.c | 65 ++
 net/mpls/internal.h| 40 ++
 3 files changed, 104 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index dab7dad9e01a..b5a34e0e4327 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -326,6 +326,8 @@ enum rtattr_type_t {
RTA_PAD,
RTA_UID,
RTA_TTL_PROPAGATE,
+   RTA_VPLS_IF,
+   RTA_VPLS_FLAGS,
__RTA_MAX
 };
 
@@ -334,6 +336,9 @@ enum rtattr_type_t {
#define RTM_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct rtmsg))))
 #define RTM_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct rtmsg))
 
+#define RTA_VPLS_F_CW_RX   (1 << 0)
+#define RTA_VPLS_F_CW_TX   (1 << 1)
+
 /* RTM_MULTIPATH --- array of struct rtnexthop.
  *
  * "struct rtnexthop" describes all necessary nexthop information,
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 1de2b3501dc8..7a21d230f9e7 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -299,6 +299,11 @@ static bool mpls_egress(struct net *net, struct mpls_route *rt,
success = true;
break;
}
+   case MPT_VPLS:
+   /* nothing to do here, no TTL in Ethernet
+* (and we shouldn't mess with the TTL in inner IP packets,
+* pseudowires are supposed to be transparent) */
+   break;
case MPT_UNSPEC:
/* Should have decided which protocol it is by now */
break;
@@ -349,6 +354,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
goto drop;
}
 
+   if (rt->rt_payload_type == MPT_VPLS)
+   return vpls_rcv(skb, dev, pt, rt, hdr, orig_dev);
 
/* Pop the label */
skb_pull(skb, sizeof(*hdr));
@@ -469,6 +476,8 @@ static const struct nla_policy rtm_mpls_policy[RTA_MAX+1] = {
 struct mpls_route_config {
u32 rc_protocol;
u32 rc_ifindex;
+   u32 rc_vpls_ifindex;
+   u8  rc_vpls_flags;
u8  rc_via_table;
u8  rc_via_alen;
u8  rc_via[MAX_VIA_ALEN];
@@ -541,6 +550,8 @@ static void mpls_route_update(struct net *net, unsigned index,
rt = rtnl_dereference(platform_label[index]);
rcu_assign_pointer(platform_label[index], new);
 
+   vpls_label_update(index, rt, new);
+
mpls_notify_route(net, index, rt, new, info);
 
/* If we removed a route free it now */
@@ -942,6 +953,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
struct mpls_route __rcu **platform_label;
struct net *net = cfg->rc_nlinfo.nl_net;
struct mpls_route *rt, *old;
+   struct net_device *vpls_dev = NULL;
int err = -EINVAL;
u8 max_via_alen;
unsigned index;
@@ -996,6 +1008,24 @@ static int mpls_route_add(struct mpls_route_config *cfg,
goto errout;
}
 
+   if (cfg->rc_vpls_ifindex) {
+   vpls_dev = dev_get_by_index(net, cfg->rc_vpls_ifindex);
+   if (!vpls_dev) {
+   err = -ENODEV;
+   NL_SET_ERR_MSG(extack, "Invalid VPLS ifindex");
+   goto errout;
+   }
+   /* we're under RTNL; and we'll drop routes when we're
+* notified the device is going away. */
+   dev_put(vpls_dev);
+
+   if (!is_vpls_dev(vpls_dev)) {
+   err = -ENODEV;
+   NL_SET_ERR_MSG(extack, "Not a VPLS device");
+   goto errout;
+   }
+   }
+
err = -ENOMEM;
rt = mpls_rt_alloc(nhs, max_via_alen, max_labels);
if (IS_ERR(rt)) {
@@ -1006,6 +1036,8 @@ static int mpls_route_add(struct mpls_route_config *cfg,
rt->rt_protocol = cfg->rc_protocol;
rt->rt_payload_type = cfg->rc_payload_type;
rt->rt_ttl_propagate = cfg->rc_ttl_propagate;
+   rt->rt_vpls_dev = vpls_dev;
+   rt->rt_vpls_flags = cfg->rc_vpls_flags;
 
if (cfg->rc_mp)
err = mpls_nh_build_multi(cfg, rt, max_labels, extack);
@@ -1430,6 +1462,14 @@ static void mpls_ifdown(struct net_device *dev, int event)
if (!rt)
continue;
 
+   if (rt->rt_vpls_dev == dev) {
+   

[Bridge] [PATCH 4/6] mpls: split forwarding path on rx/tx boundary

2017-08-21 Thread David Lamparter
This makes mpls_rt_xmit() available for use in upcoming VPLS code.  Same
for mpls_route_input_rcu().
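
To make the purpose concrete, a hedged sketch of how a VPLS TX path
might use the two exported halves (the function body is my guess at
the shape, not code from this series):

	/* Hedged sketch: resolve the route for an outgoing pseudowire
	 * label under RCU and hand the frame to the MPLS TX half. */
	static int vpls_xmit_sketch(struct sk_buff *skb, struct net *net,
				    u32 label)
	{
		struct mpls_entry_decoded dec = {
			.label = label,
			.ttl = 255,
			.bos = 1,
		};
		struct mpls_route *rt = mpls_route_input_rcu(net, label);

		if (!rt)
			return -ENOENT;
		return mpls_rt_xmit(skb, rt, dec);
	}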

Signed-off-by: David Lamparter 
---
 net/mpls/af_mpls.c  | 48 ++--
 net/mpls/internal.h |  4 
 2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c5b9ce41d66f..1de2b3501dc8 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -43,7 +43,7 @@ static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
   struct nlmsghdr *nlh, struct net *net, u32 portid,
   unsigned int nlm_flags);
 
-static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
+struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index)
 {
struct mpls_route *rt = NULL;
 
@@ -313,15 +313,8 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
struct net *net = dev_net(dev);
struct mpls_shim_hdr *hdr;
struct mpls_route *rt;
-   struct mpls_nh *nh;
struct mpls_entry_decoded dec;
-   struct net_device *out_dev;
-   struct mpls_dev *out_mdev;
struct mpls_dev *mdev;
-   unsigned int hh_len;
-   unsigned int new_header_size;
-   unsigned int mtu;
-   int err;
 
/* Careful this entire function runs inside of an rcu critical section */
 
@@ -356,9 +349,6 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
goto drop;
}
 
-   nh = mpls_select_multipath(rt, skb);
-   if (!nh)
-   goto err;
 
/* Pop the label */
skb_pull(skb, sizeof(*hdr));
@@ -376,6 +366,32 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
goto err;
dec.ttl -= 1;
 
+   if (mpls_rt_xmit(skb, rt, dec))
+   goto drop;
+   return 0;
+
+err:
+   MPLS_INC_STATS(mdev, rx_errors);
+drop:
+   kfree_skb(skb);
+   return NET_RX_DROP;
+}
+
+int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
+struct mpls_entry_decoded dec)
+{
+   struct mpls_nh *nh;
+   struct net_device *out_dev = NULL;
+   struct mpls_dev *out_mdev;
+   unsigned int hh_len;
+   unsigned int new_header_size;
+   unsigned int mtu;
+   int err;
+
+   nh = mpls_select_multipath(rt, skb);
+   if (!nh)
+   goto tx_err;
+
/* Find the output device */
out_dev = rcu_dereference(nh->nh_dev);
if (!mpls_output_possible(out_dev))
@@ -401,8 +417,9 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
if (unlikely(!new_header_size && dec.bos)) {
/* Penultimate hop popping */
if (!mpls_egress(dev_net(out_dev), rt, skb, dec))
-   goto err;
+   goto tx_err;
} else {
+   struct mpls_shim_hdr *hdr;
bool bos;
int i;
skb_push(skb, new_header_size);
@@ -435,12 +452,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
out_mdev = out_dev ? mpls_dev_get(out_dev) : NULL;
if (out_mdev)
MPLS_INC_STATS(out_mdev, tx_errors);
-   goto drop;
-err:
-   MPLS_INC_STATS(mdev, rx_errors);
-drop:
-   kfree_skb(skb);
-   return NET_RX_DROP;
+   return -1;
 }
 
 static struct packet_type mpls_packet_type __read_mostly = {
diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index cf65aec2e551..b70c6663d4f3 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h
@@ -210,4 +210,8 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu);
 void mpls_stats_inc_outucastpkts(struct net_device *dev,
 const struct sk_buff *skb);
 
+struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index);
+int mpls_rt_xmit(struct sk_buff *skb, struct mpls_route *rt,
+struct mpls_entry_decoded dec);
+
 #endif /* MPLS_INTERNAL_H */
-- 
2.13.0



[Bridge] [PATCH 3/6] gretap: support lwtunnel under bridge (NVGRE)

2017-08-21 Thread David Lamparter
This allows using an IPv4 multicast destination for gretap and enables
learning unicast destinations through the bridge FDB.  Alternatively, a
zero destination can be used together with manual entry creation via
netlink.

This is essentially basic NVGRE support, if the GRE key is configured
appropriately.  VLAN to key mapping is not supported.  Implementing
NVGRE was actually an unintentional side effect.
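
To spell out what a learned entry encodes (a hedged paraphrase of the
ip_tunnel_key_init() call in the RX hunk below, not extra code): the
outer addresses are flipped on RX, so that a later TX toward the
learned MAC unicasts straight back to the sender's tunnel endpoint.

	/* Hedged paraphrase of the reply-dst construction below: */
	key->u.ipv4.src = tunnel->parms.iph.saddr; /* our local endpoint   */
	key->u.ipv4.dst = iph->saddr;              /* peer we learned from */
	key->tun_id     = tun_id;                  /* GRE key, if present  */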

[v2: move src/dst flipping to RX path, allow zero tunnel daddr]
Signed-off-by: David Lamparter 
---
 net/ipv4/ip_gre.c| 40 +++-
 net/ipv4/ip_tunnel.c |  1 +
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 7a7829e839c2..1596f709e5b6 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -266,15 +266,35 @@ static int __ipgre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi,
skb_pop_mac_header(skb);
else
skb_reset_mac_header(skb);
-   if (tunnel->collect_md) {
+   if (tunnel->collect_md ||
+   tunnel->parms.iph.daddr == 0 ||
+   ipv4_is_multicast(tunnel->parms.iph.daddr)) {
__be16 flags;
__be64 tun_id;
 
flags = tpi->flags & (TUNNEL_CSUM | TUNNEL_KEY);
tun_id = key32_to_tunnel_id(tpi->key);
-   tun_dst = ip_tun_rx_dst(skb, flags, tun_id, 0);
-   if (!tun_dst)
-   return PACKET_REJECT;
+   if (tunnel->collect_md) {
+   tun_dst = ip_tun_rx_dst(skb, flags, tun_id, 0);
+   if (!tun_dst)
+   return PACKET_REJECT;
+   } else {
+   tun_dst = metadata_dst_alloc(0,
+   METADATA_IP_TUNNEL, GFP_ATOMIC);
+   if (!tun_dst)
+   return PACKET_REJECT;
+
+   /* build dst appropriate for responding */
+   tun_dst->u.tun_info.options_len = 0;
+   tun_dst->u.tun_info.mode = IP_TUNNEL_INFO_TX;
+
+   ip_tunnel_key_init(&tun_dst->u.tun_info.key,
+  tunnel->parms.iph.saddr,
+  iph->saddr,
+  tunnel->parms.iph.tos,
+  tunnel->parms.iph.ttl,
+  0, 0, 0, tun_id, flags);
+   }
}
 
ip_tunnel_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
@@ -507,11 +527,14 @@ static netdev_tx_t gre_tap_xmit(struct sk_buff *skb,
struct net_device *dev)
 {
struct ip_tunnel *tunnel = netdev_priv(dev);
+   struct ip_tunnel_info *tun_info = skb_tunnel_info(skb);
 
-   if (tunnel->collect_md) {
+   if (tunnel->collect_md || tun_info) {
gre_fb_xmit(skb, dev, htons(ETH_P_TEB));
return NETDEV_TX_OK;
}
+   /* tunnel layer doesn't expect a metadata dst */
+   skb_dst_drop(skb);
 
if (gre_handle_offloads(skb, !!(tunnel->parms.o_flags & TUNNEL_CSUM)))
goto free_skb;
@@ -933,6 +956,7 @@ static int gre_tap_init(struct net_device *dev)
 {
__gre_tunnel_init(dev);
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+   netif_keep_dst(dev);
 
return ip_tunnel_init(dev);
 }
@@ -940,6 +964,10 @@ static int gre_tap_init(struct net_device *dev)
 static const struct net_device_ops gre_tap_netdev_ops = {
.ndo_init   = gre_tap_init,
.ndo_uninit = ip_tunnel_uninit,
+#ifdef CONFIG_NET_IPGRE_BROADCAST
+   .ndo_open   = ipgre_open,
+   .ndo_stop   = ipgre_close,
+#endif
.ndo_start_xmit = gre_tap_xmit,
.ndo_set_mac_address= eth_mac_addr,
.ndo_validate_addr  = eth_validate_addr,
@@ -947,6 +975,8 @@ static const struct net_device_ops gre_tap_netdev_ops = {
.ndo_get_stats64= ip_tunnel_get_stats64,
.ndo_get_iflink = ip_tunnel_get_iflink,
.ndo_fill_metadata_dst  = gre_fill_metadata_dst,
+   .ndo_metadst_fill   = ip_tunnel_fill_metadst,
+   .ndo_metadst_build  = ip_tunnel_build_metadst,
 };
 
 static void ipgre_tap_setup(struct net_device *dev)
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 129d1a3616f8..451c11fc9ae5 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -140,6 +140,7 @@ struct ip_tunnel *ip_tunnel_lookup(struct ip_tunnel_net *itn,
 
hlist_for_each_entry_rcu(t, head, 

[Bridge] [PATCH 2/6] bridge: lwtunnel netlink interface

2017-08-21 Thread David Lamparter
This makes each FDB entry's metadata dst accessible through the same
ENCAP uapi as lwtunnel uses.  The function signature is slightly
different due to metadata_dst vs. lwtunnel_state.

Netlink encapsulation is done by callbacks in net_device_ops.  This is
because the metadata is always used in the context of a port / device on
the bridge; it's not meaningful in a "vacuum".  It makes no sense to
allow inputting metadata of a type that doesn't match the device
(whereas in lwtunnel it does make sense, by just switching the
encapsulation.)  Also, this way a device can do extended validity
checks on incoming data from the user, ensuring it is actually usable.

Note this is not related to ndo_fill_metadata_dst(); that one is used
only by OVS and operates on a packet that is currently being switched,
i.e. data plane.  The API in this patch is control plane.
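
For illustration, a minimal sketch of the fill side for an IP-tunnel
style device (hypothetical device name; the real encoder added by this
series is ip_tunnel_fill_metadst() in ip_tunnel_core.c):

	/* Hedged sketch: dump the entry's tunnel destination as lwtunnel
	 * netlink attributes and report which encap type was written. */
	static int foo_metadst_fill(struct sk_buff *skb,
				    struct metadata_dst *dst)
	{
		struct ip_tunnel_info *info = &dst->u.tun_info;

		if (nla_put_in_addr(skb, LWTUNNEL_IP_DST,
				    info->key.u.ipv4.dst) ||
		    nla_put_be64(skb, LWTUNNEL_IP_ID, info->key.tun_id,
				 LWTUNNEL_IP_PAD))
			return -EMSGSIZE;
		return LWTUNNEL_ENCAP_IP;
	}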

[TODO: maybe just pass the entire netlink attr block down?]
Signed-off-by: David Lamparter 
---
 include/linux/netdevice.h  | 18 +
 include/net/ip_tunnels.h   |  5 +++
 include/uapi/linux/neighbour.h |  2 +
 net/bridge/br.c|  2 +-
 net/bridge/br_fdb.c| 79 +++---
 net/bridge/br_private.h|  1 +
 net/ipv4/ip_tunnel_core.c  | 87 +-
 7 files changed, 162 insertions(+), 32 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0f1c4cb2441e..2de46f8b3f4f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -828,6 +828,8 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct metadata_dst;
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1128,6 +1130,15 @@ struct xfrmdev_ops {
  * void (*ndo_xdp_flush)(struct net_device *dev);
  * This function is used to inform the driver to flush a paticular
  * xpd tx queue. Must be called on same CPU as xdp_xmit.
+ * int (*ndo_metadst_fill)(struct sk_buff *skb, struct metadata_dst *dst);
+ * Used to encapsulate a metadata_dst that is associated with this
+ * netdevice into the appropriate netlink attributes on skb.
+ * Needs to return a lwtunnel_encap_types value if valid data was filled.
+ * int (*ndo_metadst_build)(struct net_device *dev, struct nlattr *meta,
+ * struct metadata_dst **dst,
+ * struct netlink_ext_ack *extack);
+ * Reverse of the previous function, build a metadata_dst from netlink
+ * attributes.  Should perform appropriate validation.
  */
 struct net_device_ops {
int (*ndo_init)(struct net_device *dev);
@@ -1314,6 +1325,13 @@ struct net_device_ops {
int (*ndo_xdp_xmit)(struct net_device *dev,
struct xdp_buff *xdp);
void(*ndo_xdp_flush)(struct net_device *dev);
+
+   int (*ndo_metadst_fill)(struct sk_buff *skb,
+   struct metadata_dst *dst);
+   int (*ndo_metadst_build)(struct net_device *dev,
+struct nlattr *meta,
+struct metadata_dst **dst,
+struct netlink_ext_ack *extack);
 };
 
 /**
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 520809912f03..e6181fb83324 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -451,6 +451,11 @@ void __init ip_tunnel_core_init(void);
 void ip_tunnel_need_metadata(void);
 void ip_tunnel_unneed_metadata(void);
 
+int ip_tunnel_fill_metadst(struct sk_buff *skb, struct metadata_dst *md_dst);
+int ip_tunnel_build_metadst(struct net_device *dev, struct nlattr *meta,
+   struct metadata_dst **dst,
+   struct netlink_ext_ack *extack);
+
 #else /* CONFIG_INET */
 
static inline struct ip_tunnel_info *lwt_tun_info(struct lwtunnel_state *lwtstate)
diff --git a/include/uapi/linux/neighbour.h b/include/uapi/linux/neighbour.h
index 3199d28980b3..cd98ce4b8dd9 100644
--- a/include/uapi/linux/neighbour.h
+++ b/include/uapi/linux/neighbour.h
@@ -27,6 +27,8 @@ enum {
NDA_MASTER,
NDA_LINK_NETNSID,
NDA_SRC_VNI,
+   NDA_ENCAP_TYPE,
+   NDA_ENCAP,
__NDA_MAX
 };
 
diff --git a/net/bridge/br.c b/net/bridge/br.c
index 1407d1ba7577..822dfcef2649 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -140,7 +140,7 @@ static int br_switchdev_event(struct notifier_block *unused,
switch (event) {
case SWITCHDEV_FDB_ADD_TO_BRIDGE:
fdb_info = ptr;
-   err = br_fdb_external_learn_add(br, p, fdb_info->addr,
+   err = br_fdb_external_learn_add(br, p, NULL, fdb_info->addr,
   

[Bridge] [PATCH 1/6] bridge: lwtunnel support in FDB

2017-08-21 Thread David Lamparter
This implements holding tunnel config in the form of metadata_dst
information in the bridge layer, though only for unicast right now.
Multicast handling is still left to be designed and implemented.

While struct lwtunnel_state might seem the more appropriate structure to
use here, there are two problems with that:
- I haven't found a good way to stuff it onto an SKB
  (there's dst_entry->lwtstate, but if we're adding a dst, we might as
  well go with a metadata_dst)
- it also needs to propagate upwards on received packets, which is
  already in place for tunnel metadata collection
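
The resulting TX-path use looks roughly like this (condensed from the
br_device.c hunk below; error handling and the flood paths omitted):

	/* Hedged condensation: if the FDB entry learned tunnel metadata,
	 * attach it to the skb without taking a reference (we're under
	 * RCU) so the tunnel device behind br_forward() can consume it. */
	struct metadata_dst *md_dst = rcu_dereference(dst->md_dst);

	if (md_dst)
		skb_dst_set_noref(skb, &md_dst->dst);
	br_forward(dst->dst, skb, false, true);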

[v2: fixed race in fdb update with atomic_xchg]
[v3: consistently use metadata_dst pointer]
[v4: patch renamed]
Signed-off-by: David Lamparter 
---
 include/net/dst_metadata.h | 27 ++-
 net/bridge/br_device.c |  4 
 net/bridge/br_fdb.c| 46 --
 net/bridge/br_input.c  |  6 --
 net/bridge/br_private.h|  5 -
 5 files changed, 62 insertions(+), 26 deletions(-)

diff --git a/include/net/dst_metadata.h b/include/net/dst_metadata.h
index a803129a4849..4bcc0f314853 100644
--- a/include/net/dst_metadata.h
+++ b/include/net/dst_metadata.h
@@ -24,7 +24,7 @@ struct metadata_dst {
} u;
 };
 
-static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
+static inline struct metadata_dst *skb_metadata_dst(const struct sk_buff *skb)
 {
struct metadata_dst *md_dst = (struct metadata_dst *) skb_dst(skb);
 
@@ -34,6 +34,11 @@ static inline struct metadata_dst *skb_metadata_dst(struct sk_buff *skb)
return NULL;
 }
 
+static inline struct metadata_dst *metadata_dst_clone(struct metadata_dst *md_dst)
+{
+   return (struct metadata_dst *)dst_clone(&md_dst->dst);
+}
+
 static inline struct ip_tunnel_info *skb_tunnel_info(struct sk_buff *skb)
 {
struct metadata_dst *md_dst = skb_metadata_dst(skb);
@@ -56,17 +61,12 @@ static inline bool skb_valid_dst(const struct sk_buff *skb)
return dst && !(dst->flags & DST_METADATA);
 }
 
-static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
-  const struct sk_buff *skb_b)
+static inline int metadata_dst_cmp(const struct metadata_dst *a,
+  const struct metadata_dst *b)
 {
-   const struct metadata_dst *a, *b;
-
-   if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
+   if (!(a || b))
return 0;
 
-   a = (const struct metadata_dst *) skb_dst(skb_a);
-   b = (const struct metadata_dst *) skb_dst(skb_b);
-
if (!a != !b || a->type != b->type)
return 1;
 
@@ -83,6 +83,15 @@ static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
}
 }
 
+static inline int skb_metadata_dst_cmp(const struct sk_buff *skb_a,
+  const struct sk_buff *skb_b)
+{
+   if (!(skb_a->_skb_refdst | skb_b->_skb_refdst))
+   return 0;
+   return metadata_dst_cmp(skb_metadata_dst(skb_a),
+   skb_metadata_dst(skb_b));
+}
+
 void metadata_dst_free(struct metadata_dst *);
 struct metadata_dst *metadata_dst_alloc(u8 optslen, enum metadata_type type,
gfp_t flags);
diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index 861ae2a165f4..f98bc2016ddd 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -53,6 +53,7 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
brstats->tx_bytes += skb->len;
u64_stats_update_end(&brstats->syncp);
 
+   skb_dst_drop(skb);
BR_INPUT_SKB_CB(skb)->brdev = dev;
 
skb_reset_mac_header(skb);
@@ -81,6 +82,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
else
br_flood(br, skb, BR_PKT_MULTICAST, false, true);
} else if ((dst = br_fdb_find_rcu(br, dest, vid)) != NULL) {
+   struct metadata_dst *md_dst = rcu_dereference(dst->md_dst);
+   if (md_dst)
+   skb_dst_set_noref(skb, &md_dst->dst);
br_forward(dst->dst, skb, false, true);
} else {
br_flood(br, skb, BR_PKT_UNICAST, false, true);
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index a79b648aac88..6ac3b916c39b 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -25,11 +25,13 @@
 #include 
 #include 
 #include 
+#include <net/dst_metadata.h>
 #include "br_private.h"
 
 static struct kmem_cache *br_fdb_cache __read_mostly;
 static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
- const unsigned char *addr, u16 vid);
+ struct metadata_dst *md_dst, const unsigned char *addr,
+ u16 vid);
 static void fdb_notify(struct net_bridge *br,
   const struct net_bridge_fdb_entry *, int);
 
@@ -174,6 +176,8 @@ static void fdb_delete(struct 

[Bridge] [RFC net-next v2] bridge lwtunnel, VPLS & NVGRE

2017-08-21 Thread David Lamparter
Hi all,


this is an update on the earlier "[RFC net-next] VPLS support".  Note
I've changed the subject lines on some of the patches to better reflect
what they really do (tbh the earlier subject lines were crap.)

As previously, iproute2 / FRR patches are at:
- https://github.com/eqvinox/vpls-iproute2
- https://github.com/opensourcerouting/frr/commits/vpls
while this patchset is also available at:
- https://github.com/eqvinox/vpls-linux-kernel
(but please be aware that I'm amending and rebasing commits)

The NVGRE implementation in the 3rd patch in this series is actually an
accident - I was just wiring up gretap as a reference;  only after I was
done did I notice that it sums up to NVGRE, more or less.  IMHO, it does
serve well to demonstrate the bridge changes are not VPLS-specific.

To recap some notes from the first announcement mail:
> I've tested some basic setups, the chain from LDP down into the kernel
> works at least in these.  FRR has some testcases around from OpenBSD
> VPLS support, I haven't wired that up to run against Linux / this
> patchset yet.

Same as before (API didn't change).

> The patchset needs a lot of polishing (yes I left my TODO notes in the
> commit messages), for now my primary concern is overall design
> feedback.  Roopa has already provided a lot of input (Thanks!);  the
> major topic I'm expecting to get discussion on is the bridge FDB
> changes.

Got some useful input;  but still need feedback on the bridge FDB
changes (first 2 patches).  I don't believe they have a significant
impact on existing bridge operation, and I believe a multipoint tunnel
driver without its own FDB (e.g. NVGRE in this set) should perform
better than one with its own FDB (e.g. existing VXLAN).

> P.S.: For a little context on the bridge FDB changes - I'm hoping to
> find some time to extend this to the MDB to allow aggregating dst
> metadata and handing down a list of dst metas on TX.  This isn't
> specifically for VPLS but rather to give sufficient information to the
> 802.11 stack to allow it to optimize selecting rates (or unicasting)
> for multicast traffic by having the multicast subscriber list known.
> This is done by major commercial wifi solutions (e.g. google "dynamic
> multicast optimization".)

You can find hacks at this on:
https://github.com/eqvinox/vpls-linux-kernel/tree/mdb-hack
Please note that the patches in that branch are not at an acceptable
quality level, but you can see the semantic relation to 802.11.

I would, however, like to point out that this branch has pseudo-working
IGMP/MLD snooping for VPLS, and it'd be 20-ish lines to add it to NVGRE
(I'll do that as soon as I get to it, it'll pop up on that branch too.)

This is relevant to the discussion because it's a feature where it's
non-obvious (to me) how to implement with the VXLAN model of having an
entirely separate FDB.  Meanwhile, with this architecture, the proof of
concept / hack is coming in at a measly cost of:
8 files changed, 176 insertions(+), 15 deletions(-)


Cheers,

-David


--- diffstat:
include/linux/netdevice.h  |  18 ++
include/net/dst_metadata.h |  51 ++---
include/net/ip_tunnels.h   |   5 ++
include/uapi/linux/lwtunnel.h  |   8 +++
include/uapi/linux/neighbour.h |   2 +
include/uapi/linux/rtnetlink.h |   5 ++
net/bridge/br.c|   2 +-
net/bridge/br_device.c |   4 ++
net/bridge/br_fdb.c| 119 
net/bridge/br_input.c  |   6 +-
net/bridge/br_private.h|   6 +-
net/core/lwtunnel.c|   1 +
net/ipv4/ip_gre.c  |  40 --
net/ipv4/ip_tunnel.c   |   1 +
net/ipv4/ip_tunnel_core.c  |  87 +++--
net/mpls/Kconfig   |  11 
net/mpls/Makefile  |   1 +
net/mpls/af_mpls.c | 113 --
net/mpls/internal.h|  44 +--
net/mpls/vpls.c| 550 +++
20 files changed, 990 insertions(+), 84 deletions(-)