Re: [PATCHv3 0/4] macvlan: add vepa and bridge mode
From: Patrick McHardy Date: Thu, 26 Nov 2009 17:26:17 +0100 > Arnd Bergmann wrote: >> Version 2 description: >> The patch to iproute2 has not changed, so I'm not including >> it this time. Patch 4/4 (the netlink interface) is basically >> unchanged as well but included for completeness. >> >> The other changes have moved forward a bit, to the point where >> I find them a lot cleaner and am more confident in the code >> being ready for inclusion. The implementation hardly resembles >> Erics original patch now, so I've dropped his signed-off-by. >> >> Please take a look and ack if you are happy so we can get it >> into 2.6.33. > > Looks good to me, nice work. > > Acked-by: Patrick McHardy > > for the entire series. All applied to net-next-2.6, thanks everyone! ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] veth: move loopback logic to common location
Arnd Bergmann wrote: > On Thursday 26 November 2009, Patrick McHardy wrote: >> In addition to those already handled, I'd say >> >> - priority: affects qdisc classification, may refer to classes of the >> old namespace >> - ipvs_property: might cause packets to incorrectly skip netfilter hooks >> - nf_trace: might trigger packet tracing >> - nf_bridge: contains references to network devices in the old NS, >> also indicates packet was bridged >> - iif: index is only valid in the originating namespace >> - probably secmark. > > ok > >> - tc_index: classification result, should only be set in the namespace >> of the classifier >> - tc_verd: RTTL etc. should begin at zero again > > Wouldn't that defeat the purpose of RTTL? If you create a loop > across two devices in different namespaces, it may no longer get > detected. Or is that a different problem again? Mhh good point, that would indeed be possible. OTOH using ingress filtering in one namespace currently might cause the packet to get dropped in a different namespace because the ttl runs out. For now I'd suggest to go the safe route and keep the TTL intact until we can come up with something better. > +void skb_set_dev(struct sk_buff *skb, struct net_device *dev) > +{ > + if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) { > + secpath_reset(skb); > + skb_dst_drop(skb); > + nf_reset(skb); > + skb_init_secmark(skb); > + skb->mark = 0; > + skb->priority = 0; > + skb->nf_trace = 0; > + skb->ipvs_property = 0; > +#ifdef CONFIG_NET_SCHED > + skb->tc_index = 0; > +#ifdef CONFIG_NET_CLS_ACT > + skb->tc_verd = SET_TC_VERD(skb->tc_verd, 0); > + skb->tc_verd = SET_TC_RTTL(skb->tc_verd, 0); > +#endif > +#endif This makes we wonder which ones we actually should keep. Most of the others get reinitialized anyways, so maybe its better to simply clear the entire area up until ->tail like f.i. skb_recycle_check(). > + } > + skb->dev = dev; > + skb->skb_iif = skb->dev->ifindex; This doesn't seem necessary, if the packet goes through netif_receive_skb, it will be set anyways. > +} > +EXPORT_SYMBOL(skb_set_dev); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] veth: move loopback logic to common location
On Thursday 26 November 2009, Patrick McHardy wrote: > In addition to those already handled, I'd say > > - priority: affects qdisc classification, may refer to classes of the > old namespace > - ipvs_property: might cause packets to incorrectly skip netfilter hooks > - nf_trace: might trigger packet tracing > - nf_bridge: contains references to network devices in the old NS, > also indicates packet was bridged > - iif: index is only valid in the originating namespace > - probably secmark. ok > - tc_index: classification result, should only be set in the namespace > of the classifier > - tc_verd: RTTL etc. should begin at zero again Wouldn't that defeat the purpose of RTTL? If you create a loop across two devices in different namespaces, it may no longer get detected. Or is that a different problem again? Arnd <>< --- net: maintain namespace isolation between vlan and real device In the vlan and macvlan drivers, the start_xmit function forwards data to the dev_queue_xmit function for another device, which may potentially belong to a different namespace. To make sure that classification stays within a single namespace, this resets the potentially critical fields. Still needs testing, don't apply Signed-off-by: Arnd Bergmann --- drivers/net/macvlan.c |2 +- include/linux/netdevice.h |9 + net/8021q/vlan_dev.c |2 +- net/core/dev.c| 37 + 4 files changed, 44 insertions(+), 6 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 322112c..edcebf1 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) } xmit_world: - skb->dev = vlan->lowerdev; + skb_set_dev(skb, vlan->lowerdev); return dev_queue_xmit(skb); } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 9428793..fdf4a1a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1009,6 +1009,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev) return 0; } +#ifdef CONFIG_NET_NS +static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev) +{ + skb->dev = dev; +} +#else /* CONFIG_NET_NS */ +void skb_set_dev(struct sk_buff *skb, struct net_device *dev); +#endif + static inline bool netdev_uses_trailer_tags(struct net_device *dev) { #ifdef CONFIG_NET_DSA_TAG_TRAILER diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index de0dc6b..51fcfff 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -323,7 +323,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb, } - skb->dev = vlan_dev_info(dev)->real_dev; + skb_set_dev(skb, vlan_dev_info(dev)->real_dev); len = skb->len; ret = dev_queue_xmit(skb); diff --git a/net/core/dev.c b/net/core/dev.c index f8baa15..220d4e4 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) if (skb->len > (dev->mtu + dev->hard_header_len)) return NET_RX_DROP; - skb_dst_drop(skb); + skb_set_dev(skb, dev); skb->tstamp.tv64 = 0; skb->pkt_type = PACKET_HOST; skb->protocol = eth_type_trans(skb, dev); - skb->mark = 0; - secpath_reset(skb); - nf_reset(skb); return netif_rx(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); @@ -1614,6 +1611,39 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb) return false; } +/** + * skb_dev_set -- assign a buffer to a new device + * @skb: buffer for the new device + * @dev: network device + * + * If an skb is owned by a device already, we have to reset + * all data private to the namespace a device belongs to + * before assigning it a new device. + */ +void skb_set_dev(struct sk_buff *skb, struct net_device *dev) +{ + if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) { + secpath_reset(skb); + skb_dst_drop(skb); + nf_reset(skb); + skb_init_secmark(skb); + skb->mark = 0; + skb->priority = 0; + skb->nf_trace = 0; + skb->ipvs_property = 0; +#ifdef CONFIG_NET_SCHED + skb->tc_index = 0; +#ifdef CONFIG_NET_CLS_ACT + skb->tc_verd = SET_TC_VERD(skb->tc_verd, 0); + skb->tc_verd = SET_TC_RTTL(skb->tc_verd, 0); +#endif +#endif + } + skb->dev = dev; + skb->skb_iif = skb->dev->ifindex; +} +EXPORT_SYMBOL(skb_set_dev); + /* * Invalidate hardware checksum when packet is to be mangled, and * complete checksum manually on outgoing path. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] veth: move loopback logic to common location
Patrick McHardy writes: > Arnd Bergmann wrote: >> On Tuesday 24 November 2009, Patrick McHardy wrote: >>> Eric W. Biederman wrote: I don't quite follow what you intend with dev_queue_xmit when the macvlan is in one namespace and the real physical device is in another. Are you mentioning that the packet classifier runs in the namespace where the primary device lives with packets from a different namespace? >>> Exactly. And I think we should make sure that the namespace of >>> the macvlan device can't (deliberately or accidentally) cause >>> misclassification. >> >> This is independent of my series and a preexisting problem, right? > > Correct. > >> Which fields do you think need to be reset to maintain namespace >> isolation for the outbound path in macvlan? > > In addition to those already handled, I'd say > > - priority: affects qdisc classification, may refer to classes of the > old namespace > - ipvs_property: might cause packets to incorrectly skip netfilter hooks > - nf_trace: might trigger packet tracing > - nf_bridge: contains references to network devices in the old NS, > also indicates packet was bridged > - iif: index is only valid in the originating namespace > - tc_index: classification result, should only be set in the namespace > of the classifier > - tc_verd: RTTL etc. should begin at zero again > - probably secmark. Wow. I thought we were trying to reduce skbuff, where did all of those fields come from? Regarless that sounds like a good list to get stomped. Eric ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCHv3 0/4] macvlan: add vepa and bridge mode
Arnd Bergmann wrote: > Version 2 description: > The patch to iproute2 has not changed, so I'm not including > it this time. Patch 4/4 (the netlink interface) is basically > unchanged as well but included for completeness. > > The other changes have moved forward a bit, to the point where > I find them a lot cleaner and am more confident in the code > being ready for inclusion. The implementation hardly resembles > Erics original patch now, so I've dropped his signed-off-by. > > Please take a look and ack if you are happy so we can get it > into 2.6.33. Looks good to me, nice work. Acked-by: Patrick McHardy for the entire series. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv3 2/4] macvlan: cleanup rx statistics
We have very similar code for rx statistics in two places in the macvlan driver, with a third one being added in the next patch. Consolidate them into one function to improve overall readability of the driver. Signed-off-by: Arnd Bergmann --- drivers/net/macvlan.c | 70 1 files changed, 41 insertions(+), 29 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index ae2b5c7..1e7faf9 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -116,42 +116,58 @@ static int macvlan_addr_busy(const struct macvlan_port *port, return 0; } +static inline void macvlan_count_rx(const struct macvlan_dev *vlan, + unsigned int len, bool success, + bool multicast) +{ + struct macvlan_rx_stats *rx_stats; + + rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id()); + if (likely(success)) { + rx_stats->rx_packets++;; + rx_stats->rx_bytes += len; + if (multicast) + rx_stats->multicast++; + } else { + rx_stats->rx_errors++; + } +} + +static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev, +const struct ethhdr *eth) +{ + if (!skb) + return NET_RX_DROP; + + skb->dev = dev; + if (!compare_ether_addr_64bits(eth->h_dest, + dev->broadcast)) + skb->pkt_type = PACKET_BROADCAST; + else + skb->pkt_type = PACKET_MULTICAST; + + return netif_rx(skb); +} + static void macvlan_broadcast(struct sk_buff *skb, const struct macvlan_port *port) { const struct ethhdr *eth = eth_hdr(skb); const struct macvlan_dev *vlan; struct hlist_node *n; - struct net_device *dev; struct sk_buff *nskb; unsigned int i; - struct macvlan_rx_stats *rx_stats; + int err; if (skb->protocol == htons(ETH_P_PAUSE)) return; for (i = 0; i < MACVLAN_HASH_SIZE; i++) { hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) { - dev = vlan->dev; - rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id()); - nskb = skb_clone(skb, GFP_ATOMIC); - if (nskb == NULL) { - rx_stats->rx_errors++; - continue; - } - - rx_stats->rx_bytes += skb->len + ETH_HLEN; - rx_stats->rx_packets++; - rx_stats->multicast++; - - nskb->dev = dev; - if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast)) - nskb->pkt_type = PACKET_BROADCAST; - else - nskb->pkt_type = PACKET_MULTICAST; - - netif_rx(nskb); + err = macvlan_broadcast_one(nskb, vlan->dev, eth); + macvlan_count_rx(vlan, skb->len + ETH_HLEN, +err == NET_RX_SUCCESS, 1); } } } @@ -163,7 +179,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) const struct macvlan_port *port; const struct macvlan_dev *vlan; struct net_device *dev; - struct macvlan_rx_stats *rx_stats; + unsigned int len; port = rcu_dereference(skb->dev->macvlan_port); if (port == NULL) @@ -183,15 +199,11 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) kfree_skb(skb); return NULL; } - rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id()); + len = skb->len + ETH_HLEN; skb = skb_share_check(skb, GFP_ATOMIC); - if (skb == NULL) { - rx_stats->rx_errors++; + macvlan_count_rx(vlan, len, skb != NULL, 0); + if (!skb) return NULL; - } - - rx_stats->rx_bytes += skb->len + ETH_HLEN; - rx_stats->rx_packets++; skb->dev = dev; skb->pkt_type = PACKET_HOST; -- 1.6.3.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv3 1/4] veth: move loopback logic to common location
The veth driver contains code to forward an skb from the start_xmit function of one network device into the receive path of another device. Moving that code into a common location lets us reuse the code for direct forwarding of data between macvlan ports, and possibly in other drivers. Signed-off-by: Arnd Bergmann --- drivers/net/veth.c| 17 +++-- include/linux/netdevice.h |2 ++ net/core/dev.c| 40 3 files changed, 45 insertions(+), 14 deletions(-) diff --git a/drivers/net/veth.c b/drivers/net/veth.c index 2d657f2..6c4b5a2 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -155,8 +155,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) struct veth_net_stats *stats, *rcv_stats; int length, cpu; - skb_orphan(skb); - priv = netdev_priv(dev); rcv = priv->peer; rcv_priv = netdev_priv(rcv); @@ -168,20 +166,12 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) if (!(rcv->flags & IFF_UP)) goto tx_drop; - if (skb->len > (rcv->mtu + MTU_PAD)) - goto rx_drop; - -skb->tstamp.tv64 = 0; - skb->pkt_type = PACKET_HOST; - skb->protocol = eth_type_trans(skb, rcv); if (dev->features & NETIF_F_NO_CSUM) skb->ip_summed = rcv_priv->ip_summed; - skb->mark = 0; - secpath_reset(skb); - nf_reset(skb); - - length = skb->len; + length = skb->len + ETH_HLEN; + if (dev_forward_skb(rcv, skb) != NET_RX_SUCCESS) + goto rx_drop; stats->tx_bytes += length; stats->tx_packets++; @@ -189,7 +179,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev) rcv_stats->rx_bytes += length; rcv_stats->rx_packets++; - netif_rx(skb); return NETDEV_TX_OK; tx_drop: diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 97873e3..9428793 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1562,6 +1562,8 @@ extern intdev_set_mac_address(struct net_device *, extern int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq); +extern int dev_forward_skb(struct net_device *dev, + struct sk_buff *skb); extern int netdev_budget; diff --git a/net/core/dev.c b/net/core/dev.c index ccefa24..f8baa15 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -105,6 +105,7 @@ #include #include #include +#include #include #include #include @@ -1419,6 +1420,45 @@ static inline void net_timestamp(struct sk_buff *skb) skb->tstamp.tv64 = 0; } +/** + * dev_forward_skb - loopback an skb to another netif + * + * @dev: destination network device + * @skb: buffer to forward + * + * return values: + * NET_RX_SUCCESS (no congestion) + * NET_RX_DROP (packet was dropped) + * + * dev_forward_skb can be used for injecting an skb from the + * start_xmit function of one device into the receive queue + * of another device. + * + * The receiving device may be in another namespace, so + * we have to clear all information in the skb that could + * impact namespace isolation. + */ +int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) +{ + skb_orphan(skb); + + if (!(dev->flags & IFF_UP)) + return NET_RX_DROP; + + if (skb->len > (dev->mtu + dev->hard_header_len)) + return NET_RX_DROP; + + skb_dst_drop(skb); + skb->tstamp.tv64 = 0; + skb->pkt_type = PACKET_HOST; + skb->protocol = eth_type_trans(skb, dev); + skb->mark = 0; + secpath_reset(skb); + nf_reset(skb); + return netif_rx(skb); +} +EXPORT_SYMBOL_GPL(dev_forward_skb); + /* * Support routine. Sends outgoing frames to any network * taps currently in use. -- 1.6.3.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv3 4/4] macvlan: export macvlan mode through netlink
In order to support all three modes of macvlan at runtime, extend the existing netlink protocol to allow choosing the mode per macvlan slave interface. This depends on a matching patch to iproute2 in order to become accessible in user land. Signed-off-by: Arnd Bergmann --- drivers/net/macvlan.c | 56 +- include/linux/if_link.h | 15 2 files changed, 65 insertions(+), 6 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index d6bd843..322112c 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -33,12 +33,6 @@ #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE) -enum macvlan_mode { - MACVLAN_MODE_PRIVATE= 1, - MACVLAN_MODE_VEPA = 2, - MACVLAN_MODE_BRIDGE = 4, -}; - struct macvlan_port { struct net_device *dev; struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; @@ -614,6 +608,17 @@ static int macvlan_validate(struct nlattr *tb[], struct nlattr *data[]) if (!is_valid_ether_addr(nla_data(tb[IFLA_ADDRESS]))) return -EADDRNOTAVAIL; } + + if (data && data[IFLA_MACVLAN_MODE]) { + switch (nla_get_u32(data[IFLA_MACVLAN_MODE])) { + case MACVLAN_MODE_PRIVATE: + case MACVLAN_MODE_VEPA: + case MACVLAN_MODE_BRIDGE: + break; + default: + return -EINVAL; + } + } return 0; } @@ -678,6 +683,10 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev, vlan->dev = dev; vlan->port = port; + vlan->mode = MACVLAN_MODE_VEPA; + if (data && data[IFLA_MACVLAN_MODE]) + vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]); + err = register_netdevice(dev); if (err < 0) return err; @@ -699,6 +708,36 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head) macvlan_port_destroy(port->dev); } +static int macvlan_changelink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]) +{ + struct macvlan_dev *vlan = netdev_priv(dev); + if (data && data[IFLA_MACVLAN_MODE]) + vlan->mode = nla_get_u32(data[IFLA_MACVLAN_MODE]); + return 0; +} + +static size_t macvlan_get_size(const struct net_device *dev) +{ + return nla_total_size(4); +} + +static int macvlan_fill_info(struct sk_buff *skb, + const struct net_device *dev) +{ + struct macvlan_dev *vlan = netdev_priv(dev); + + NLA_PUT_U32(skb, IFLA_MACVLAN_MODE, vlan->mode); + return 0; + +nla_put_failure: + return -EMSGSIZE; +} + +static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = { + [IFLA_MACVLAN_MODE] = { .type = NLA_U32 }, +}; + static struct rtnl_link_ops macvlan_link_ops __read_mostly = { .kind = "macvlan", .priv_size = sizeof(struct macvlan_dev), @@ -707,6 +746,11 @@ static struct rtnl_link_ops macvlan_link_ops __read_mostly = { .validate = macvlan_validate, .newlink= macvlan_newlink, .dellink= macvlan_dellink, + .maxtype= IFLA_MACVLAN_MAX, + .policy = macvlan_policy, + .changelink = macvlan_changelink, + .get_size = macvlan_get_size, + .fill_info = macvlan_fill_info, }; static int macvlan_device_event(struct notifier_block *unused, diff --git a/include/linux/if_link.h b/include/linux/if_link.h index 1d3b242..6674791 100644 --- a/include/linux/if_link.h +++ b/include/linux/if_link.h @@ -181,4 +181,19 @@ struct ifla_vlan_qos_mapping { __u32 to; }; +/* MACVLAN section */ +enum { + IFLA_MACVLAN_UNSPEC, + IFLA_MACVLAN_MODE, + __IFLA_MACVLAN_MAX, +}; + +#define IFLA_MACVLAN_MAX (__IFLA_MACVLAN_MAX - 1) + +enum macvlan_mode { + MACVLAN_MODE_PRIVATE = 1, /* don't talk to other macvlans */ + MACVLAN_MODE_VEPA= 2, /* talk to other ports through ext bridge */ + MACVLAN_MODE_BRIDGE = 4, /* talk to bridge ports directly */ +}; + #endif /* _LINUX_IF_LINK_H */ -- 1.6.3.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCHv3 3/4] macvlan: implement bridge, VEPA and private mode
This allows each macvlan slave device to be in one of three modes, depending on the use case: MACVLAN_PRIVATE: The device never communicates with any other device on the same upper_dev. This even includes frames coming back from a reflective relay, where supported by the adjacent bridge. MACVLAN_VEPA: The new Virtual Ethernet Port Aggregator (VEPA) mode, we assume that the adjacent bridge returns all frames where both source and destination are local to the macvlan port, i.e. the bridge is set up as a reflective relay. Broadcast frames coming in from the upper_dev get flooded to all macvlan interfaces in VEPA mode. We never deliver any frames locally. MACVLAN_BRIDGE: We provide the behavior of a simple bridge between different macvlan interfaces on the same port. Frames from one interface to another one get delivered directly and are not sent out externally. Broadcast frames get flooded to all other bridge ports and to the external interface, but when they come back from a reflective relay, we don't deliver them again. Since we know all the MAC addresses, the macvlan bridge mode does not require learning or STP like the bridge module does. Based on an earlier patch "macvlan: Reflect macvlan packets meant for other macvlan devices" by Eric Biederman. Signed-off-by: Arnd Bergmann Cc: Eric Biederman --- drivers/net/macvlan.c | 80 - 1 files changed, 72 insertions(+), 8 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 1e7faf9..d6bd843 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -29,9 +29,16 @@ #include #include #include +#include #define MACVLAN_HASH_SIZE (1 << BITS_PER_BYTE) +enum macvlan_mode { + MACVLAN_MODE_PRIVATE= 1, + MACVLAN_MODE_VEPA = 2, + MACVLAN_MODE_BRIDGE = 4, +}; + struct macvlan_port { struct net_device *dev; struct hlist_head vlan_hash[MACVLAN_HASH_SIZE]; @@ -59,6 +66,7 @@ struct macvlan_dev { struct macvlan_port *port; struct net_device *lowerdev; struct macvlan_rx_stats *rx_stats; + enum macvlan_mode mode; }; @@ -134,11 +142,14 @@ static inline void macvlan_count_rx(const struct macvlan_dev *vlan, } static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev, -const struct ethhdr *eth) +const struct ethhdr *eth, bool local) { if (!skb) return NET_RX_DROP; + if (local) + return dev_forward_skb(dev, skb); + skb->dev = dev; if (!compare_ether_addr_64bits(eth->h_dest, dev->broadcast)) @@ -150,7 +161,9 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev, } static void macvlan_broadcast(struct sk_buff *skb, - const struct macvlan_port *port) + const struct macvlan_port *port, + struct net_device *src, + enum macvlan_mode mode) { const struct ethhdr *eth = eth_hdr(skb); const struct macvlan_dev *vlan; @@ -164,8 +177,12 @@ static void macvlan_broadcast(struct sk_buff *skb, for (i = 0; i < MACVLAN_HASH_SIZE; i++) { hlist_for_each_entry_rcu(vlan, n, &port->vlan_hash[i], hlist) { + if (vlan->dev == src || !(vlan->mode & mode)) + continue; + nskb = skb_clone(skb, GFP_ATOMIC); - err = macvlan_broadcast_one(nskb, vlan->dev, eth); + err = macvlan_broadcast_one(nskb, vlan->dev, eth, +mode == MACVLAN_MODE_BRIDGE); macvlan_count_rx(vlan, skb->len + ETH_HLEN, err == NET_RX_SUCCESS, 1); } @@ -178,6 +195,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) const struct ethhdr *eth = eth_hdr(skb); const struct macvlan_port *port; const struct macvlan_dev *vlan; + const struct macvlan_dev *src; struct net_device *dev; unsigned int len; @@ -186,7 +204,25 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb) return skb; if (is_multicast_ether_addr(eth->h_dest)) { - macvlan_broadcast(skb, port); + src = macvlan_hash_lookup(port, eth->h_source); + if (!src) + /* frame comes from an external address */ + macvlan_broadcast(skb, port, NULL, + MACVLAN_MODE_PRIVATE | + MACVLAN_MODE_VEPA| + MACVLAN_MODE_BRIDGE); + el
[PATCHv3 0/4] macvlan: add vepa and bridge mode
Not many changes this time, just integrated a bug fix and all the coding style feedback from Eric Dumazet and Patrick McHardy. I'll keep the patch for network namespaces on the tx path out of this series for now, because the discussion is still ongoing and it addresses an unrelated issue. --- Version 2 description: The patch to iproute2 has not changed, so I'm not including it this time. Patch 4/4 (the netlink interface) is basically unchanged as well but included for completeness. The other changes have moved forward a bit, to the point where I find them a lot cleaner and am more confident in the code being ready for inclusion. The implementation hardly resembles Erics original patch now, so I've dropped his signed-off-by. Please take a look and ack if you are happy so we can get it into 2.6.33. --- Version 1 description: This is based on an earlier patch from Eric Biederman adding forwarding between macvlans. I extended his approach to allow the administrator to choose the mode for each macvlan, and to implement a functional VEPA between macvlan. Still missing from this is support for communication between the lower device that the macvlans are based on. This would be extremely useful but as others have found out before me requires significant changes not only to macvlan but also to the common transmit path. I've tested VEPA operation with the hairpin support added to the bridge driver by Anna Fischer. --- Arnd Bergmann (4): veth: move loopback logic to common location macvlan: cleanup rx statistics macvlan: implement bridge, VEPA and private mode macvlan: export macvlan mode through netlink drivers/net/macvlan.c | 188 + drivers/net/veth.c| 17 + include/linux/if_link.h | 15 include/linux/netdevice.h |2 + net/core/dev.c| 40 ++ 5 files changed, 214 insertions(+), 48 deletions(-) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] veth: move loopback logic to common location
Arnd Bergmann wrote: > On Tuesday 24 November 2009, Patrick McHardy wrote: >> Eric W. Biederman wrote: >>> I don't quite follow what you intend with dev_queue_xmit when the macvlan >>> is in one namespace and the real physical device is in another. Are >>> you mentioning that the packet classifier runs in the namespace where >>> the primary device lives with packets from a different namespace? >> Exactly. And I think we should make sure that the namespace of >> the macvlan device can't (deliberately or accidentally) cause >> misclassification. > > This is independent of my series and a preexisting problem, right? Correct. > Which fields do you think need to be reset to maintain namespace > isolation for the outbound path in macvlan? In addition to those already handled, I'd say - priority: affects qdisc classification, may refer to classes of the old namespace - ipvs_property: might cause packets to incorrectly skip netfilter hooks - nf_trace: might trigger packet tracing - nf_bridge: contains references to network devices in the old NS, also indicates packet was bridged - iif: index is only valid in the originating namespace - tc_index: classification result, should only be set in the namespace of the classifier - tc_verd: RTTL etc. should begin at zero again - probably secmark. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/4] veth: move loopback logic to common location
On Tuesday 24 November 2009, Patrick McHardy wrote: > Eric W. Biederman wrote: > > I don't quite follow what you intend with dev_queue_xmit when the macvlan > > is in one namespace and the real physical device is in another. Are > > you mentioning that the packet classifier runs in the namespace where > > the primary device lives with packets from a different namespace? > > Exactly. And I think we should make sure that the namespace of > the macvlan device can't (deliberately or accidentally) cause > misclassification. This is independent of my series and a preexisting problem, right? Which fields do you think need to be reset to maintain namespace isolation for the outbound path in macvlan? --- net: maintain namespace isolation between vlan and real device In the vlan and macvlan drivers, the start_xmit function forwards data to the dev_queue_xmit function for another device, which may potentially belong to a different namespace. To make sure that classification stays within a single namespace, this resets the potentially critical fields. Signed-off-by: Arnd Bergmann --- drivers/net/macvlan.c |2 +- include/linux/netdevice.h |9 + net/8021q/vlan_dev.c |2 +- net/core/dev.c| 26 ++ 4 files changed, 33 insertions(+), 6 deletions(-) diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 322112c..edcebf1 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev) } xmit_world: - skb->dev = vlan->lowerdev; + skb_set_dev(skb, vlan->lowerdev); return dev_queue_xmit(skb); } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 9428793..fdf4a1a 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1009,6 +1009,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev) return 0; } +#ifdef CONFIG_NET_NS +static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev) +{ + skb->dev = dev; +} +#else /* CONFIG_NET_NS */ +void skb_set_dev(struct sk_buff *skb, struct net_device *dev); +#endif + static inline bool netdev_uses_trailer_tags(struct net_device *dev) { #ifdef CONFIG_NET_DSA_TAG_TRAILER diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index de0dc6b..51fcfff 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -323,7 +323,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb, } - skb->dev = vlan_dev_info(dev)->real_dev; + skb_set_dev(skb, vlan_dev_info(dev)->real_dev); len = skb->len; ret = dev_queue_xmit(skb); diff --git a/net/core/dev.c b/net/core/dev.c index f8baa15..7179b58 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) if (skb->len > (dev->mtu + dev->hard_header_len)) return NET_RX_DROP; - skb_dst_drop(skb); + skb_set_dev(skb, dev); skb->tstamp.tv64 = 0; skb->pkt_type = PACKET_HOST; skb->protocol = eth_type_trans(skb, dev); - skb->mark = 0; - secpath_reset(skb); - nf_reset(skb); return netif_rx(skb); } EXPORT_SYMBOL_GPL(dev_forward_skb); @@ -1614,6 +1611,27 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb) return false; } +/** + * skb_dev_set -- assign a buffer to a new device + * @skb: buffer for the new device + * @dev: network device + * + * If an skb is owned by a device already, we have to reset + * all data private to the namespace a device belongs to + * before assigning it a new device. + */ +void skb_set_dev(struct sk_buff *skb, struct net_device *dev) +{ + if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) { + secpath_reset(skb); + skb_dst_drop(skb); + nf_reset(skb); + skb->mark = 0; + } + skb->dev = dev; +} +EXPORT_SYMBOL(skb_set_dev); + /* * Invalidate hardware checksum when packet is to be mangled, and * complete checksum manually on outgoing path. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization