Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: + + if (tb[IFLA_IFNAME]) + nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ); + else + snprintf(ifname, IFNAMSIZ, DRV_NAME %%d); Does this work? The other device is not registered at this time, so I think the allocated names could clash .. Oh! And one more thing, I've just thought about - even two devices with similar name would be ok when we have a namespaces and one of the ends of this tunnel is to be in some other namespace rather that the init one. Yes, but we don't have network namespaces yet. Which brings up another question: is this driver of any use without namespaces? Sure. For example, you can join two bridges together. Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: Pavel Emelianov wrote: + if (data != NULL data[VETH_INFO_PEER] != NULL) { + err = nla_parse_nested(tb, IFLA_INFO_MAX, + data[VETH_INFO_PEER], ifla_policy); + if (err 0) + return err; + } One more suggestion regarding the PEER attribute: you only nest IFLA attributes below it, but some information that might be interesting to use on device creation is contained in ifinfomsg (flags and ifindex). So I think it would be better to use a complete message, including header. I don't get it. Can you elaborate, please. You don't have a struct ifinfomsg for the peer device. At some point we might want to add support for specifying initial flags for the device (some easily supportable ones are IFF_PROMISC, IFF_ALLMULTI, IFF_NOARP) and ideally that should also be possible for the peer device. So I suggest you use a complete ifinfomsg including the header instead of just the attributes for VETH_INFO_PEER. I don't see any information from this struct being used on the link-creation paths... Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/2] Virtual ethernet device (v3)
Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun with the sysfs interface. This implementation uses another interface - the RTM_NRELINK message introduced by Patric. The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel. Changes from v.2.1: * Made the generic routine for link creation to be used by veth driver, any other tunnel driver that needs to create several devices at once and rtnl_newlink() code. Changes from v.2: * Rebase over latest netdev tree. No actual changes; * Small code rework. Changes from v.1: * percpu statistics; * standard convention for nla policy names; * module alias added; * xmit function fixes noticed by Patric; * code cleanup. The patch for an ip utility is also provided. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Cc: Patrick McHardy [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] Introduce the generic rtnl_create_link()
This routine gets the parsed rtnl attributes and creates a new link with generic info (IFLA_LINKINFO policy). Its intention is to help the drivers, that need to create several links at once (like VETH). This is nothing but a copy-paste-ed part of rtnl_newlink() function that is responsible for creation of new device. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- include/net/rtnetlink.h |4 ++ net/core/rtnetlink.c| 83 ++-- 2 files changed, 57 insertions(+), 30 deletions(-) diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index 3861c05..8218288 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -78,6 +78,10 @@ extern void __rtnl_link_unregister(struc extern int rtnl_link_register(struct rtnl_link_ops *ops); extern voidrtnl_link_unregister(struct rtnl_link_ops *ops); +extern struct net_device *rtnl_create_link(char *ifname, + const struct rtnl_link_ops *ops, struct nlattr *tb[]); +extern const struct nla_policy ifla_policy[IFLA_MAX+1]; + #define MODULE_ALIAS_RTNL_LINK(kind) MODULE_ALIAS(rtnl-link- kind) #endif diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 864cbdf..02e00e8 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -714,7 +714,7 @@ cont: return skb-len; } -static const struct nla_policy ifla_policy[IFLA_MAX+1] = { +const struct nla_policy ifla_policy[IFLA_MAX+1] = { [IFLA_IFNAME] = { .type = NLA_STRING, .len = IFNAMSIZ-1 }, [IFLA_ADDRESS] = { .type = NLA_BINARY, .len = MAX_ADDR_LEN }, [IFLA_BROADCAST]= { .type = NLA_BINARY, .len = MAX_ADDR_LEN }, @@ -941,6 +941,50 @@ static int rtnl_dellink(struct sk_buff * return 0; } +struct net_device *rtnl_create_link(char *ifname, + const struct rtnl_link_ops *ops, struct nlattr *tb[]) +{ + int err; + struct net_device *dev; + + err = -ENOMEM; + dev = alloc_netdev(ops-priv_size, ifname, ops-setup); + if (!dev) + goto err; + + if (strchr(dev-name, '%')) { + err = dev_alloc_name(dev, dev-name); + if (err 0) + goto err_free; + } + + dev-rtnl_link_ops = ops; + + if (tb[IFLA_MTU]) + dev-mtu = nla_get_u32(tb[IFLA_MTU]); + if (tb[IFLA_ADDRESS]) + memcpy(dev-dev_addr, nla_data(tb[IFLA_ADDRESS]), + nla_len(tb[IFLA_ADDRESS])); + if (tb[IFLA_BROADCAST]) + memcpy(dev-broadcast, nla_data(tb[IFLA_BROADCAST]), + nla_len(tb[IFLA_BROADCAST])); + if (tb[IFLA_TXQLEN]) + dev-tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); + if (tb[IFLA_WEIGHT]) + dev-weight = nla_get_u32(tb[IFLA_WEIGHT]); + if (tb[IFLA_OPERSTATE]) + set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE])); + if (tb[IFLA_LINKMODE]) + dev-link_mode = nla_get_u8(tb[IFLA_LINKMODE]); + + return dev; + +err_free: + free_netdev(dev); +err: + return ERR_PTR(err); +} + static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg) { const struct rtnl_link_ops *ops; @@ -1051,40 +1095,17 @@ replay: if (!ifname[0]) snprintf(ifname, IFNAMSIZ, %s%%d, ops-kind); - dev = alloc_netdev(ops-priv_size, ifname, ops-setup); - if (!dev) - return -ENOMEM; - - if (strchr(dev-name, '%')) { - err = dev_alloc_name(dev, dev-name); - if (err 0) - goto err_free; - } - dev-rtnl_link_ops = ops; - if (tb[IFLA_MTU]) - dev-mtu = nla_get_u32(tb[IFLA_MTU]); - if (tb[IFLA_ADDRESS]) - memcpy(dev-dev_addr, nla_data(tb[IFLA_ADDRESS]), - nla_len(tb[IFLA_ADDRESS])); - if (tb[IFLA_BROADCAST]) - memcpy(dev-broadcast, nla_data(tb[IFLA_BROADCAST]), - nla_len(tb[IFLA_BROADCAST])); - if (tb[IFLA_TXQLEN]) - dev-tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]); - if (tb[IFLA_WEIGHT]) - dev-weight = nla_get_u32(tb[IFLA_WEIGHT]); - if (tb[IFLA_OPERSTATE]) - set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE])); - if (tb[IFLA_LINKMODE]) - dev-link_mode = nla_get_u8(tb[IFLA_LINKMODE]); + dev = rtnl_create_link(ifname, ops, tb); - if (ops-newlink) + if (IS_ERR(dev)) + err = PTR_ERR(dev); + else if (ops-newlink) err = ops-newlink(dev, tb, data); else
[PATCH 2/2] Virtual ethernet device driver
Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel. This implementation uses another interface - the RTM_NRELINK message introduced by Patric. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- drivers/net/Kconfig |6 drivers/net/Makefile |1 drivers/net/veth.c | 426 +++ include/net/veth.h | 12 + 4 files changed, 445 insertions(+) diff-tree 293fc37c28eee0a27e19179ab0d809460c73eced (from 15028aad00ddf241581fbe74a02ec89cbb28d35d) Author: Pavel [EMAIL PROTECTED] Date: Thu Jul 12 13:05:44 2007 +0400 Veth support diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index d4e39ff..7bdc6e3 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -124,6 +124,12 @@ config TUN If you don't know what to use this for, you don't need it. +config VETH + tristate Virtual ethernet device + ---help--- + The device is an ethernet tunnel. Devices are created in pairs. When + one end receives the packet it appears on its pair and vice versa. + config NET_SB1000 tristate General Instruments Surfboard 1000 depends on PNP diff --git a/drivers/net/Makefile b/drivers/net/Makefile index a2241e6..d7b1103 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -189,6 +189,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o obj-$(CONFIG_MACMACE) += macmace.o obj-$(CONFIG_MAC89x0) += mac89x0.o obj-$(CONFIG_TUN) += tun.o +obj-$(CONFIG_VETH) += veth.o obj-$(CONFIG_NET_NETX) += netx-eth.o obj-$(CONFIG_DL2K) += dl2k.o obj-$(CONFIG_R8169) += r8169.o diff --git a/drivers/net/veth.c b/drivers/net/veth.c new file mode 100644 index 000..caa7b62 --- /dev/null +++ b/drivers/net/veth.c @@ -0,0 +1,426 @@ +/* + * drivers/net/veth.c + * + * Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc + * + * Author: Pavel Emelianov [EMAIL PROTECTED] + * Ethtool interface from: Eric W. Biederman [EMAIL PROTECTED] + * + */ + +#include linux/list.h +#include linux/netdevice.h +#include linux/ethtool.h +#include linux/etherdevice.h + +#include net/dst.h +#include net/xfrm.h +#include net/veth.h + +#define DRV_NAME veth +#define DRV_VERSION1.0 + +struct veth_net_stats { + unsigned long rx_packets; + unsigned long tx_packets; + unsigned long rx_bytes; + unsigned long tx_bytes; + unsigned long tx_dropped; +}; + +struct veth_priv { + struct net_device *peer; + struct net_device *dev; + struct list_head list; + struct veth_net_stats *stats; + unsigned ip_summed; +}; + +static LIST_HEAD(veth_list); + +/* + * ethtool interface + */ + +static struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { peer_ifindex }, +}; + +static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) +{ + cmd-supported = 0; + cmd-advertising= 0; + cmd-speed = SPEED_1; + cmd-duplex = DUPLEX_FULL; + cmd-port = PORT_TP; + cmd-phy_address= 0; + cmd-transceiver= XCVR_INTERNAL; + cmd-autoneg= AUTONEG_DISABLE; + cmd-maxtxpkt = 0; + cmd-maxrxpkt = 0; + return 0; +} + +static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) +{ + strcpy(info-driver, DRV_NAME); + strcpy(info-version, DRV_VERSION); + strcpy(info-fw_version, N/A); +} + +static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch(stringset) { + case ETH_SS_STATS: + memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + } +} + +static int veth_get_stats_count(struct net_device *dev) +{ + return ARRAY_SIZE(ethtool_stats_keys); +} + +static void veth_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *stats, u64 *data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + data[0] = priv-peer-ifindex; +} + +static u32 veth_get_rx_csum(struct net_device *dev) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + return priv-ip_summed == CHECKSUM_UNNECESSARY; +} + +static int veth_set_rx_csum(struct net_device *dev, u32 data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + priv-ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE; + return 0; +} + +static u32 veth_get_tx_csum(struct net_device *dev) +{ + return (dev-features NETIF_F_NO_CSUM) != 0; +} + +static int
[PATCH 0/2] Module for ip utility to support veth device (v.3)
The usage is # ip link add [name] type veth [peer name] [mac mac] [peer_mac mac] This consists of two parts: 1/2 makes some copy-paste changes in ip/iplink.c for generic CLI attributes parsing, 2/2 the module itself. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] Intruduce iplink_parse() routine
This routine parses CLI attributes, describing generic link parameters such as name, address, etc. This is mostly copy-pasted from iplink_modify(). Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- include/utils.h |3 + ip/iplink.c | 127 2 files changed, 76 insertions(+), 54 deletions(-) diff --git a/include/utils.h b/include/utils.h index a3fd335..3fd851d 100644 --- a/include/utils.h +++ b/include/utils.h @@ -146,4 +146,7 @@ extern int cmdlineno; extern size_t getcmdline(char **line, size_t *len, FILE *in); extern int makeargs(char *line, char *argv[], int maxargs); +struct iplink_req; +int iplink_parse(int argc, char **argv, struct iplink_req *req, + char **name, char **type, char **link, char **dev); #endif /* __UTILS_H__ */ diff --git a/ip/iplink.c b/ip/iplink.c index cfacdab..9ee269c 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -142,140 +142,159 @@ static int iplink_have_newlink(void) } #endif /* ! IPLINK_IOCTL_COMPAT */ -static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv) +struct iplink_req { + struct nlmsghdr n; + struct ifinfomsgi; + charbuf[1024]; +}; + +int iplink_parse(int argc, char **argv, struct iplink_req *req, + char **name, char **type, char **link, char **dev) { + int ret, len; + char abuf[32]; int qlen = -1; int mtu = -1; - int len; - char abuf[32]; - char *dev = NULL; - char *name = NULL; - char *link = NULL; - char *type = NULL; - struct link_util *lu = NULL; - struct { - struct nlmsghdr n; - struct ifinfomsgi; - charbuf[1024]; - } req; - memset(req, 0, sizeof(req)); - - req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)); - req.n.nlmsg_flags = NLM_F_REQUEST|flags; - req.n.nlmsg_type = cmd; - req.i.ifi_family = preferred_family; + ret = argc; while (argc 0) { if (strcmp(*argv, up) == 0) { - req.i.ifi_change |= IFF_UP; - req.i.ifi_flags |= IFF_UP; + req-i.ifi_change |= IFF_UP; + req-i.ifi_flags |= IFF_UP; } else if (strcmp(*argv, down) == 0) { - req.i.ifi_change |= IFF_UP; - req.i.ifi_flags = ~IFF_UP; + req-i.ifi_change |= IFF_UP; + req-i.ifi_flags = ~IFF_UP; } else if (strcmp(*argv, name) == 0) { NEXT_ARG(); - name = *argv; + *name = *argv; } else if (matches(*argv, link) == 0) { NEXT_ARG(); - link = *argv; + *link = *argv; } else if (matches(*argv, address) == 0) { NEXT_ARG(); len = ll_addr_a2n(abuf, sizeof(abuf), *argv); - addattr_l(req.n, sizeof(req), IFLA_ADDRESS, abuf, len); + addattr_l(req-n, sizeof(req), IFLA_ADDRESS, abuf, len); } else if (matches(*argv, broadcast) == 0 || - strcmp(*argv, brd) == 0) { + strcmp(*argv, brd) == 0) { NEXT_ARG(); len = ll_addr_a2n(abuf, sizeof(abuf), *argv); - addattr_l(req.n, sizeof(req), IFLA_BROADCAST, abuf, len); + addattr_l(req-n, sizeof(req), IFLA_BROADCAST, abuf, len); } else if (matches(*argv, txqueuelen) == 0 || - strcmp(*argv, qlen) == 0 || - matches(*argv, txqlen) == 0) { + strcmp(*argv, qlen) == 0 || + matches(*argv, txqlen) == 0) { NEXT_ARG(); if (qlen != -1) duparg(txqueuelen, *argv); if (get_integer(qlen, *argv, 0)) invarg(Invalid \txqueuelen\ value\n, *argv); - addattr_l(req.n, sizeof(req), IFLA_TXQLEN, qlen, 4); + addattr_l(req-n, sizeof(req), IFLA_TXQLEN, qlen, 4); } else if (strcmp(*argv, mtu) == 0) { NEXT_ARG(); if (mtu != -1) duparg(mtu, *argv); if (get_integer(mtu, *argv, 0)) invarg(Invalid \mtu\ value\n, *argv); - addattr_l(req.n, sizeof(req), IFLA_MTU, mtu, 4); + addattr_l(req-n, sizeof(req), IFLA_MTU, mtu, 4); } else if (strcmp(*argv, multicast) == 0
Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: +static int veth_newlink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]) +{ +int err; +struct net_device *peer; +struct veth_priv *priv; +char ifname[IFNAMSIZ]; + +/* + * prepare the devices info + */ + +if (tb[IFLA_ADDRESS] == NULL) +random_ether_addr(dev-dev_addr); + +if (data != NULL data[VETH_INFO_PEER] != NULL) { +err = nla_parse_nested(tb, IFLA_INFO_MAX, +data[VETH_INFO_PEER], ifla_policy); +if (err 0) +return err; +} Not having a peer should be an error, no? No. That's the intention - if the user doesn't specify peer in the command line then two _identical_ devices are created. Of course, if he specifies one name - there'll be a collision, but one can say my_own_veth_number_%d and everything will be ok. Or just use the default name provided. E.g. ip link add type veth will send a packet with data[VETH_INFO_PEER} == NULL, but this is OK! User just wants a default tunnel and he will get it :) Does this answer your second comment below? + +if (tb[IFLA_IFNAME]) +nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ); +else +snprintf(ifname, IFNAMSIZ, DRV_NAME %%d); Does this work? The other device is not registered at this time, so I think the allocated names could clash .. If it does work you could also perform name allocation in the rtnl_create_link function. + +peer = rtnl_create_link(ifname, veth_link_ops, tb); +if (IS_ERR(peer)) +return PTR_ERR(peer); + +if (tb[IFLA_ADDRESS] == NULL) +random_ether_addr(peer-dev_addr); + +/* + * register devices + */ + +err = register_netdevice(peer); +if (err 0) +goto err_register_peer; + +err = register_netdevice(dev); +if (err 0) +goto err_register_dev; + +/* + * tie the deviced together + */ + +priv = netdev_priv(dev); +priv-dev = dev; +priv-peer = peer; +list_add(priv-list, veth_list); + +priv = netdev_priv(peer); +priv-dev = peer; +priv-peer = dev; +INIT_LIST_HEAD(priv-list); +return 0; + +err_register_dev: +unregister_netdevice(peer); +return err; + +err_register_peer: +free_netdev(peer); +return err; +} +static __exit void veth_exit(void) +{ +struct veth_priv *priv, *next; + +rtnl_lock(); +__rtnl_link_unregister(veth_link_ops); + +list_for_each_entry_safe(priv, next, veth_list, list) +veth_dellink(priv-dev); +rtnl_unlock(); Devices are unregistered automatically through the dellink function, rtnl_link_unregister(..) is enough. OK. This looks like a minor and not-significant comment, so do I need to resend the patch or David is OK to take it and I will send an incremental one? Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: +static int veth_newlink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]) +{ +int err; +struct net_device *peer; +struct veth_priv *priv; +char ifname[IFNAMSIZ]; + +/* + * prepare the devices info + */ + +if (tb[IFLA_ADDRESS] == NULL) +random_ether_addr(dev-dev_addr); + +if (data != NULL data[VETH_INFO_PEER] != NULL) { +err = nla_parse_nested(tb, IFLA_INFO_MAX, +data[VETH_INFO_PEER], ifla_policy); +if (err 0) +return err; +} Not having a peer should be an error, no? + +if (tb[IFLA_IFNAME]) +nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ); +else +snprintf(ifname, IFNAMSIZ, DRV_NAME %%d); Does this work? The other device is not registered at this time, so I think the allocated names could clash .. Oh! And one more thing, I've just thought about - even two devices with similar name would be ok when we have a namespaces and one of the ends of this tunnel is to be in some other namespace rather that the init one. If it does work you could also perform name allocation in the rtnl_create_link function. + +peer = rtnl_create_link(ifname, veth_link_ops, tb); +if (IS_ERR(peer)) +return PTR_ERR(peer); + +if (tb[IFLA_ADDRESS] == NULL) +random_ether_addr(peer-dev_addr); + +/* + * register devices + */ + +err = register_netdevice(peer); +if (err 0) +goto err_register_peer; + +err = register_netdevice(dev); +if (err 0) +goto err_register_dev; + +/* + * tie the deviced together + */ + +priv = netdev_priv(dev); +priv-dev = dev; +priv-peer = peer; +list_add(priv-list, veth_list); + +priv = netdev_priv(peer); +priv-dev = peer; +priv-peer = dev; +INIT_LIST_HEAD(priv-list); +return 0; + +err_register_dev: +unregister_netdevice(peer); +return err; + +err_register_peer: +free_netdev(peer); +return err; +} +static __exit void veth_exit(void) +{ +struct veth_priv *priv, *next; + +rtnl_lock(); +__rtnl_link_unregister(veth_link_ops); + +list_for_each_entry_safe(priv, next, veth_list, list) +veth_dellink(priv-dev); +rtnl_unlock(); Devices are unregistered automatically through the dellink function, rtnl_link_unregister(..) is enough. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: Pavel Emelianov wrote: +static int veth_newlink(struct net_device *dev, + struct nlattr *tb[], struct nlattr *data[]) +{ + int err; + struct net_device *peer; + struct veth_priv *priv; + char ifname[IFNAMSIZ]; + + /* + * prepare the devices info + */ + + if (tb[IFLA_ADDRESS] == NULL) + random_ether_addr(dev-dev_addr); + + if (data != NULL data[VETH_INFO_PEER] != NULL) { + err = nla_parse_nested(tb, IFLA_INFO_MAX, + data[VETH_INFO_PEER], ifla_policy); + if (err 0) + return err; + } Not having a peer should be an error, no? No. That's the intention - if the user doesn't specify peer in the command line then two _identical_ devices are created. Of course, if he specifies one name - there'll be a collision, but one can say my_own_veth_number_%d and everything will be ok. Or just use the default name provided. E.g. ip link add type veth will send a packet with data[VETH_INFO_PEER} == NULL, but this is OK! User just wants a default tunnel and he will get it :) I see. Does this answer your second comment below? No, to get unique names the sequence has to be: dev_alloc_name register_netdevice dev_alloc_name register_netdevice But you have: dev_alloc_name dev_alloc_name (- might allocate same name as first call) register_netdevice register_netdevice Oops :) You're right. That's the problem. I was carried away by testing the peer options and checking for names rather than veth%d to work... By the way, that will create some problems. You see, your patches imply that the register_netdevice() will be called at the very end of the -newlink callback. Otherwise, the error path of any code following the registering will have to call unregister_netdevice() which will BUG() in free_netdev() in rtnl_newlink() - the device state will be neither UNINITIALIZED nor UNREGISTERED :( +static __exit void veth_exit(void) +{ + struct veth_priv *priv, *next; + + rtnl_lock(); + __rtnl_link_unregister(veth_link_ops); + + list_for_each_entry_safe(priv, next, veth_list, list) + veth_dellink(priv-dev); + rtnl_unlock(); Devices are unregistered automatically through the dellink function, rtnl_link_unregister(..) is enough. OK. This looks like a minor and not-significant comment, so do I need to resend the patch or David is OK to take it and I will send an incremental one? An incremental patch for this is fine I guess, your code is correct, its merely a simplification. Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: No, to get unique names the sequence has to be: dev_alloc_name register_netdevice dev_alloc_name register_netdevice But you have: dev_alloc_name dev_alloc_name (- might allocate same name as first call) register_netdevice register_netdevice Oops :) You're right. That's the problem. I was carried away by testing the peer options and checking for names rather than veth%d to work... By the way, that will create some problems. You see, your patches imply that the register_netdevice() will be called at the very end of the -newlink callback. Otherwise, the error path of any code following the registering will have to call unregister_netdevice() which will BUG() in free_netdev() in rtnl_newlink() - the device state will be neither UNINITIALIZED nor UNREGISTERED :( Thats true. I think you could do: - use name of the supplied device for the second device - register second device - allocate new name for first device - register first device Not 100% like this (since the first name is expected to be associated with the device with the first address), but reallocating the name for the first device sounds like a good idea. David, which way would be more preferable - to fix booth issues pointed by Patric and send the v.4 patch, or to wait for this patch to be committed and then send two incremental fixes? Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] Virtual ethernet device driver
Patrick McHardy wrote: Pavel Emelianov wrote: +if (data != NULL data[VETH_INFO_PEER] != NULL) { +err = nla_parse_nested(tb, IFLA_INFO_MAX, +data[VETH_INFO_PEER], ifla_policy); +if (err 0) +return err; +} One more suggestion regarding the PEER attribute: you only nest IFLA attributes below it, but some information that might be interesting to use on device creation is contained in ifinfomsg (flags and ifindex). So I think it would be better to use a complete message, including header. I don't get it. Can you elaborate, please. Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Virtual ethernet device (v2.1)
= Since the netlink NEWLINK interface is now in the netdev tree I resend the veth driver patch as submittion for inclusion. = LOG: Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun with the sysfs interface. This implementation uses another interface - the RTM_NRELINK message introduced by Patric. The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel. Changes from v.2: * Rebase over latest netdev tree. No actual changes; * Small code rework. Changes from v.1: * percpu statistics; * standard convention for nla policy names; * module alias added; * xmit function fixes noticed by Patric; * code cleanup. The patch for an ip utility is also provided. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- drivers/net/Kconfig |6 drivers/net/Makefile |1 drivers/net/veth.c | 452 +++ include/net/veth.h | 14 + 4 files changed, 473 insertions(+) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index d4e39ff..7bdc6e3 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -124,6 +124,12 @@ config TUN If you don't know what to use this for, you don't need it. +config VETH + tristate Virtual ethernet device + ---help--- + The device is an ethernet tunnel. Devices are created in pairs. When + one end receives the packet it appears on its pair and vice versa. + config NET_SB1000 tristate General Instruments Surfboard 1000 depends on PNP diff --git a/drivers/net/Makefile b/drivers/net/Makefile index a2241e6..d7b1103 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -189,6 +189,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o obj-$(CONFIG_MACMACE) += macmace.o obj-$(CONFIG_MAC89x0) += mac89x0.o obj-$(CONFIG_TUN) += tun.o +obj-$(CONFIG_VETH) += veth.o obj-$(CONFIG_NET_NETX) += netx-eth.o obj-$(CONFIG_DL2K) += dl2k.o obj-$(CONFIG_R8169) += r8169.o diff --git a/drivers/net/veth.c b/drivers/net/veth.c new file mode 100644 index 000..bf56e0e --- /dev/null +++ b/drivers/net/veth.c @@ -0,0 +1,452 @@ +/* + * drivers/net/veth.c + * + * Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc + * + * Author: Pavel Emelianov [EMAIL PROTECTED] + * Ethtool interface from: Eric W. Biederman [EMAIL PROTECTED] + * + */ + +#include linux/list.h +#include linux/netdevice.h +#include linux/ethtool.h +#include linux/etherdevice.h + +#include net/dst.h +#include net/xfrm.h +#include net/veth.h + +#define DRV_NAME veth +#define DRV_VERSION1.0 + +struct veth_net_stats { + unsigned long rx_packets; + unsigned long tx_packets; + unsigned long rx_bytes; + unsigned long tx_bytes; + unsigned long tx_dropped; +}; + +struct veth_priv { + struct net_device *peer; + struct net_device *dev; + struct list_head list; + struct veth_net_stats *stats; + unsigned ip_summed; +}; + +static LIST_HEAD(veth_list); + +/* + * ethtool interface + */ + +static struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { peer_ifindex }, +}; + +static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) +{ + cmd-supported = 0; + cmd-advertising= 0; + cmd-speed = SPEED_1; + cmd-duplex = DUPLEX_FULL; + cmd-port = PORT_TP; + cmd-phy_address= 0; + cmd-transceiver= XCVR_INTERNAL; + cmd-autoneg= AUTONEG_DISABLE; + cmd-maxtxpkt = 0; + cmd-maxrxpkt = 0; + return 0; +} + +static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) +{ + strcpy(info-driver, DRV_NAME); + strcpy(info-version, DRV_VERSION); + strcpy(info-fw_version, N/A); +} + +static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch(stringset) { + case ETH_SS_STATS: + memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + } +} + +static int veth_get_stats_count(struct net_device *dev) +{ + return ARRAY_SIZE(ethtool_stats_keys); +} + +static void veth_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *stats, u64 *data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + data[0] = priv-peer-ifindex; +} + +static u32 veth_get_rx_csum(struct net_device *dev) +{ + struct veth_priv *priv
[PATCH] Module for ip utility to support veth device (v.2.1)
The usage is # ip link add [name] type veth [peer name] [mac mac] [peer_mac mac] This version doesn't include the fix for ip/iplink.c as Patrick said that he had included it into his patches already. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- Makefile|6 +++- link_veth.c | 89 veth.h | 14 + 3 files changed, 108 insertions(+), 1 deletion(-) diff --git a/ip/Makefile b/ip/Makefile index 9a5bfe3..b46bce3 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o ALLOBJ=$(IPOBJ) $(RTMONOBJ) SCRIPTS=ifcfg rtpr routel routef TARGETS=ip rtmon +LIBS=link_veth.so -all: $(TARGETS) $(SCRIPTS) +all: $(TARGETS) $(SCRIPTS) $(LIBS) ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL) @@ -24,3 +25,6 @@ clean: LDLIBS += -ldl LDFLAGS+= -Wl,-export-dynamic + +%.so: %.c + $(CC) $(CFLAGS) -shared $ -o $@ diff --git a/ip/link_veth.c b/ip/link_veth.c new file mode 100644 index 000..7760684 --- /dev/null +++ b/ip/link_veth.c @@ -0,0 +1,89 @@ +/* + * link_veth.c veth driver module + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Pavel Emelianov [EMAIL PROTECTED] + * + */ + +#include string.h + +#include utils.h +#include ip_common.h +#include veth.h + +#define ETH_ALEN 6 + +static void usage(void) +{ + printf(Usage: ip link add ... type veth + [peer peer-name] [mac mac] [peer_mac mac]\n); +} + +static int veth_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *hdr) +{ + __u8 mac[ETH_ALEN]; + + for (; argc != 0; argv++, argc--) { + if (strcmp(*argv, peer) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + addattr_l(hdr, 1024, VETH_INFO_PEER, + *argv, strlen(*argv)); + + continue; + } + + if (strcmp(*argv, mac) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL) + return -1; + + addattr_l(hdr, 1024, VETH_INFO_MAC, + mac, ETH_ALEN); + continue; + } + + if (strcmp(*argv, peer_mac) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL) + return -1; + + addattr_l(hdr, 1024, VETH_INFO_PEER_MAC, + mac, ETH_ALEN); + continue; + } + + usage(); + return -1; + } + + return 0; +} + +struct link_util veth_link_util = { + .id = veth, + .parse_opt = veth_parse_opt, +}; diff --git a/ip/veth.h b/ip/veth.h new file mode 100644 index 000..b84a530 --- /dev/null +++ b/ip/veth.h @@ -0,0 +1,14 @@ +#ifndef __NET_VETH_H__ +#define __NET_VETH_H__ + +enum { + VETH_INFO_UNSPEC, + VETH_INFO_MAC, + VETH_INFO_PEER, + VETH_INFO_PEER_MAC, + + __VETH_INFO_MAX +#define VETH_INFO_MAX (__VETH_INFO_MAX - 1) +}; + +#endif - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet device (v2.1)
Christian Borntraeger wrote: Am Mittwoch, 11. Juli 2007 schrieb Pavel Emelianov: drivers/net/veth.c | 452 include/net/veth.h | 14 + I know, I am late in the game, but wont the name collide somewhat with drivers/net/ibmveth.h, drivers/net/iseries_veth.c, and drivers/net/ibmveth.c? David asked the same thing after v.1. The answer is : No it will not. These devices use eth%d strings to generate their names. Christian Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet device (v2.1)
Patrick McHardy wrote: Pavel Emelianov wrote: +static const struct nla_policy veth_policy[VETH_INFO_MAX + 1] = { +[VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN }, +[VETH_INFO_PEER]= { .type = NLA_STRING }, +[VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN }, +}; Looks good, just one question. What happended to the IFLA_PARTNER attribute idea? I have a patch to allow specifying the initial MAC address for a device, IFLA_PARTNER would make the whole thing symetrical. Well, the notion of a partner is not applicable to any generic link unlike the device name, physical (MAC) address or MTU value. So i think that it's better to keep this as a specific driver information not to confuse the generic code. I think it's worth making this as some more generic code than it is now, but since this driver is the only user of partner maybe it's better not to make it right now. Later, when we do know what do we want partner to be. Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet device (v2.1)
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: Pavel Emelianov wrote: +static const struct nla_policy veth_policy[VETH_INFO_MAX + 1] = { + [VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN }, + [VETH_INFO_PEER]= { .type = NLA_STRING }, + [VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN }, +}; Looks good, just one question. What happended to the IFLA_PARTNER attribute idea? I have a patch to allow specifying the initial MAC address for a device, IFLA_PARTNER would make the whole thing symetrical. Well, the notion of a partner is not applicable to any generic link unlike the device name, physical (MAC) address or MTU value. So i think that it's better to keep this as a specific driver information not to confuse the generic code. I think it's worth making this as some more generic code than it is now, but since this driver is the only user of partner maybe it's better not to make it right now. Later, when we do know what do we want partner to be. Mhh doing it later means dealing with compatibility issues, which is why I'm asking now. We currently support IFLA_NAME, IFLA_MTU, Oh, I see. IFLA_TXQLEN, IFLA_WEIGTH, IFLA_OPERSTATE and IFLA_LINKMODE, and with my patch additionally IFLA_ADDRESS and IFLA_BROADCAST. AFAICT they are all applicable for the partner link as well. Agree. Maybe it is better to make some generic routine to create the device with the parameters specified in the netlink packet. Then the generic code creates one end of a tunnel and calls -new_link callback. This callback extracts the PARTNER packet part and calls this generic routine to create the second pair. Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet device (v2.1)
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: Mhh doing it later means dealing with compatibility issues, which is why I'm asking now. We currently support IFLA_NAME, IFLA_MTU, Oh, I see. IFLA_TXQLEN, IFLA_WEIGTH, IFLA_OPERSTATE and IFLA_LINKMODE, and with my patch additionally IFLA_ADDRESS and IFLA_BROADCAST. AFAICT they are all applicable for the partner link as well. Agree. Maybe it is better to make some generic routine to create the device with the parameters specified in the netlink packet. Then the generic code creates one end of a tunnel and calls -new_link callback. This callback extracts the PARTNER packet part and calls this generic routine to create the second pair. Something like that. Moving the part between NLM_F_CREATE and the ops-newlink call of rtnl_newlink to a new function should work. For now you could even parse the IFLA_PARTNER attribute and nested IFLA_NAME/IFLA_ADDRESS attributes yourself and ignore the rest, this will at least leave us the option of handling it generically later. OK. I'll try to make the generic call. Could you please send me the patches with IFLA_ADDRESS support for booth kernel and ip utility. Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: + skb-pkt_type = PACKET_HOST; + skb-protocol = eth_type_trans(skb, rcv); + if (dev-features NETIF_F_NO_CSUM) + skb-ip_summed = rcv_priv-ip_summed; + + dst_release(skb-dst); + skb-dst = NULL; + + secpath_reset(skb); + nf_reset(skb); Is skb-mark supposed to survive communication between different namespaces? I guess it must not. Thanks. I guess there are a few others that should be cleared as well, like the tc related members, secmark, ipvs_property, ... It seems like we are about to have some skb_reset_all() routine to make the skb look like newborn. The rtnl_link codes looks fine. I don't like the VETH_INFO_MAC attribute very much though, we already have a generic device attribute for MAC addresses. Of course that only allows you to supply one MAC address, so I'm wondering what you think of allocating only a single device per newlink operation and binding them in a seperate enslave operation? I did this at the very first version, but Alexey showed me that this would be wrong. Look. When we create the second device it must be in the other namespace as it is useless to have them in one namespace. But if we have the device in the other namespace the RTNL_NEWLINK message from kernel would come into this namespace thus confusing ip utility in the init namespace. Creating the device in the init ns and moving it into the new one is rather a complex task. But with such approach the creation looks really logical. We send a packet to the kernel and have a single response about the new device appearance. At the same time we have a RTNL_NEWLINK message arrived at the destination namespace informing that a new device has appeared there as well. The question is how to proceed. I haven't read all mails yet, but it seems there is some disagreement about whether to create all devices in the same namespace and move them later or create them directly in The agreement was that we can make any of the above. We can create booth devices in the init namespace and then move one of them into the desired namespace, or we can explicitly specify which namespace to create the pair in. their target namespace. For now I guess it doesn't matter much, so can everyone agree to adding a IFLA_PARTNER attribute that includes a complete ifinfomsg and the attributes and you later decide how to handle namespaces? +enum { + VETH_INFO_UNSPEC, + VETH_INFO_MAC, + VETH_INFO_PEER, + VETH_INFO_PEER_MAC, + + VETH_INFO_MAX +}; Please follow the #define VETH_INFO_MAX (__VETH_INFO_MAX - 1) convention here. Could you please clarify this point. I saw the lines enum { ... RTNL_NEWLINK #define RTNL_NEWLINK RTNL_NEWLINK ... } and had my brains exploded imagining what this would mean :( Thats just to make the new attributes visible as preprocessor symbols so userspace can use them for #ifdefs. We usually use it when adding new attributes/message types, but its not necessary for the initial set of attributes if you already have some other preprocessor-visisble symbol (like VETH_INFO_MAX) userspace can use. What I was refering to is this convention: enum { ... __IFLA_MAX }; #define IFLA_MAX (__IFLA_MAX - 1) Which is used to make sure that IFLA_MAX is really the max and not max + 1 and additionally people won't forget to update it. OK thanks. This is already done in the v2. Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel
Patrick McHardy wrote: Pavel Emelianov wrote: Patrick McHardy wrote: The question is how to proceed. I haven't read all mails yet, but it seems there is some disagreement about whether to create all devices in the same namespace and move them later or create them directly in The agreement was that we can make any of the above. We can create booth devices in the init namespace and then move one of them into the desired namespace, or we can explicitly specify which namespace to create the pair in. I'm going to push my latest patches to Dave today, the easiest way is probably is you just add whatever you need to the API afterwards. OK. Dave didn't object against the driver. Hope he will accept it as well. I have also found a BUG in your API. Look, when you declare the alias with the MODULE_ALIAS_RTNL_LINK in drivers you use strings as an argument. But this macro stringifyes the argument itself which results in bad aliases. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index d744198..f627e1f 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -77,6 +77,6 @@ extern void __rtnl_link_unregister(struc extern int rtnl_link_register(struct rtnl_link_ops *ops); extern voidrtnl_link_unregister(struct rtnl_link_ops *ops); -#define MODULE_ALIAS_RTNL_LINK(name) MODULE_ALIAS(rtnl-link- #name) +#define MODULE_ALIAS_RTNL_LINK(name) MODULE_ALIAS(rtnl-link- name) #endif - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel (v.2)
Ben Greear wrote: [snip] I would also like some way to identify veth from other device types, preferably something like a value in sysfs. However, that should not hold up We can do this with ethtool. It can get and print the driver name of the device. I think I'd like something in sysfs that we could query for any interface. Possible return strings could be: VLAN VETH ETH PPP BRIDGE AP /* wifi access point interface */ STA /* wifi station */ I will cook up a patch for consideration after veth goes in. Ben, could you please tell what sysfs features do you plan to implement? Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel
Patrick McHardy wrote: Pavel Emelianov wrote: Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun. This implementation uses another interface - the RTM_NRELINK message introduced by Patric. The patch fits today netdev tree with Patrick's patches. The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel. +struct veth_priv { +struct net_device *peer; +struct net_device *dev; +struct list_head list; +struct net_device_stats stats; You can use dev-stats instead. OK. Actually I planned to use percpu stats to reduce cacheline trashing (Stephen has noticed it also). The reason I didn't do it here is that the patch would look more complicated, but I wanted to show and approve the netlink interface first. +static int veth_xmit(struct sk_buff *skb, struct net_device *dev) +{ +struct net_device *rcv = NULL; +struct veth_priv *priv, *rcv_priv; +int length; + +skb_orphan(skb); + +priv = netdev_priv(dev); +rcv = priv-peer; +rcv_priv = netdev_priv(rcv); + +if (!(rcv-flags IFF_UP)) +goto outf; + +skb-dev = rcv; eth_type_trans already sets skb-dev. Ok. Thanks. +skb-pkt_type = PACKET_HOST; +skb-protocol = eth_type_trans(skb, rcv); +if (dev-features NETIF_F_NO_CSUM) +skb-ip_summed = rcv_priv-ip_summed; + +dst_release(skb-dst); +skb-dst = NULL; + +secpath_reset(skb); +nf_reset(skb); Is skb-mark supposed to survive communication between different namespaces? I guess it must not. Thanks. +static const struct nla_policy veth_policy[VETH_INFO_MAX] = { +[VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN }, +[VETH_INFO_PEER]= { .type = NLA_STRING }, +[VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN }, +}; The rtnl_link codes looks fine. I don't like the VETH_INFO_MAC attribute very much though, we already have a generic device attribute for MAC addresses. Of course that only allows you to supply one MAC address, so I'm wondering what you think of allocating only a single device per newlink operation and binding them in a seperate enslave operation? I did this at the very first version, but Alexey showed me that this would be wrong. Look. When we create the second device it must be in the other namespace as it is useless to have them in one namespace. But if we have the device in the other namespace the RTNL_NEWLINK message from kernel would come into this namespace thus confusing ip utility in the init namespace. Creating the device in the init ns and moving it into the new one is rather a complex task. But with such approach the creation looks really logical. We send a packet to the kernel and have a single response about the new device appearance. At the same time we have a RTNL_NEWLINK message arrived at the destination namespace informing that a new device has appeared there as well. +enum { +VETH_INFO_UNSPEC, +VETH_INFO_MAC, +VETH_INFO_PEER, +VETH_INFO_PEER_MAC, + +VETH_INFO_MAX +}; Please follow the #define VETH_INFO_MAX (__VETH_INFO_MAX - 1) convention here. Could you please clarify this point. I saw the lines enum { ... RTNL_NEWLINK #define RTNL_NEWLINK RTNL_NEWLINK ... } and had my brains exploded imagining what this would mean :( - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel
I did this at the very first version, but Alexey showed me that this would be wrong. Look. When we create the second device it must be in the other namespace as it is useless to have them in one namespace. But if we have the device in the other namespace the RTNL_NEWLINK message from kernel would come into this namespace thus confusing ip utility in the init namespace. Creating the device in the init ns and moving it into the new one is rather a complex task. Pavel, moving the netdevice to another namespace is not a complex task. Eric Biederman did it in its patchset ( cf. http://lxc.sf.net/network ) By saying complex I didn't mean that this is difficult to implement, but that it consists (must consist) of many stages. I.e. composite. Making the device right in the namespace is liter. When the pair device is created, both extremeties are into the init namespace and you can choose to which namespace to move one extremity. I do not mind that. When the network namespace dies, the netdev is moved back to the init namespace. That facilitate network device management. Concerning netlink events, this is automatically generated when the network device is moved through namespaces. IMHO, we should have the network device movement between namespaces in order to be able to move a physical network device too (eg. you have 4 NIC and you want to create 3 containers and assign 3 NIC to each of them) Agree. Moving the devices is a must-have functionality. I do not mind making the pair in the init namespace and move the second one into the desired namespace. But if we *always* will have two ends in different namespaces what to complicate things for? Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Virtual ethernet tunnel (v.2)
Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun. This implementation uses another interface - the RTM_NRELINK message introduced by Patric. The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel. Changes from v.1: * percpu statistics; * standard convention for nla policy names; * module alias added; * xmit function fixes noticed by Patric; * code cleanup. The patch for an ip utility is also provided. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Since ethtool interface was taken from Eric's patch, I think that he would like to see his Signed-off line as well (however he didn't answer yesterday). --- diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 7d57f4a..7e144be 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -119,6 +119,12 @@ config TUN If you don't know what to use this for, you don't need it. +config VETH + tristate Virtual ethernet device + ---help--- + The device is an ethernet tunnel. Devices are created in pairs. When + one end receives the packet it appears on its pair and vice versa. + config NET_SB1000 tristate General Instruments Surfboard 1000 depends on PNP diff --git a/drivers/net/Makefile b/drivers/net/Makefile index a77affa..4764119 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o obj-$(CONFIG_MACMACE) += macmace.o obj-$(CONFIG_MAC89x0) += mac89x0.o obj-$(CONFIG_TUN) += tun.o +obj-$(CONFIG_VETH) += veth.o obj-$(CONFIG_NET_NETX) += netx-eth.o obj-$(CONFIG_DL2K) += dl2k.o obj-$(CONFIG_R8169) += r8169.o diff --git a/drivers/net/veth.c b/drivers/net/veth.c new file mode 100644 index 000..e7ad43d --- /dev/null +++ b/drivers/net/veth.c @@ -0,0 +1,442 @@ +/* + * drivers/net/veth.c + * + * Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc + * + * Author: Pavel Emelianov [EMAIL PROTECTED] + * + */ + +#include linux/list.h +#include linux/netdevice.h +#include linux/ethtool.h +#include linux/etherdevice.h + +#include net/dst.h +#include net/xfrm.h +#include net/veth.h + +#define DRV_NAME veth +#define DRV_VERSION1.0 + +struct veth_device_stats { + unsigned long rx_packets; + unsigned long tx_packets; + unsigned long rx_bytes; + unsigned long tx_bytes; + unsigned long tx_dropped; +}; + +struct veth_priv { + struct net_device *peer; + struct net_device *dev; + struct list_head list; + struct veth_device_stats *stats; + unsigned ip_summed; +}; + +static LIST_HEAD(veth_list); + +/* + * ethtool interface + */ + +static struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { peer_ifindex }, +}; + +static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) +{ + cmd-supported = 0; + cmd-advertising= 0; + cmd-speed = SPEED_1; + cmd-duplex = DUPLEX_FULL; + cmd-port = PORT_TP; + cmd-phy_address= 0; + cmd-transceiver= XCVR_INTERNAL; + cmd-autoneg= AUTONEG_DISABLE; + cmd-maxtxpkt = 0; + cmd-maxrxpkt = 0; + return 0; +} + +static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) +{ + strcpy(info-driver, DRV_NAME); + strcpy(info-version, DRV_VERSION); + strcpy(info-fw_version, N/A); +} + +static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch(stringset) { + case ETH_SS_STATS: + memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + } +} + +static int veth_get_stats_count(struct net_device *dev) +{ + return ARRAY_SIZE(ethtool_stats_keys); +} + +static void veth_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *stats, u64 *data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + data[0] = priv-peer-ifindex; +} + +static u32 veth_get_rx_csum(struct net_device *dev) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + return priv-ip_summed == CHECKSUM_UNNECESSARY; +} + +static int veth_set_rx_csum(struct net_device *dev, u32 data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + priv-ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE; + return 0; +} + +static u32 veth_get_tx_csum(struct net_device *dev) +{ + return (dev-features NETIF_F_NO_CSUM) != 0; +} + +static int veth_set_tx_csum(struct net_device *dev, u32 data
[PATCH] Module for ip utility to support veth device (v.2)
The usage is # ip link add [name] type veth [peer name] [mac mac] [peer_mac mac] This version doesn't include the fix for ip/iplink.c as Patrick said that he had included it into his patches already. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/ip/Makefile b/ip/Makefile index 9a5bfe3..b46bce3 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o ALLOBJ=$(IPOBJ) $(RTMONOBJ) SCRIPTS=ifcfg rtpr routel routef TARGETS=ip rtmon +LIBS=link_veth.so -all: $(TARGETS) $(SCRIPTS) +all: $(TARGETS) $(SCRIPTS) $(LIBS) ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL) @@ -24,3 +25,6 @@ clean: LDLIBS += -ldl LDFLAGS+= -Wl,-export-dynamic + +%.so: %.c + $(CC) $(CFLAGS) -shared $ -o $@ diff --git a/ip/link_veth.c b/ip/link_veth.c new file mode 100644 index 000..f2e4079 --- /dev/null +++ b/ip/link_veth.c @@ -0,0 +1,86 @@ +/* + * ip/link_veth.c + * + * Virtual ETHernet tunnel supprt. + * + * Author: Pavel Emelianov [EMAIL PROTECTED] + */ + +#include stdio.h +#include string.h + +#include utils.h +#include ip_common.h +#include veth.h + +#define ETH_ALEN 6 + +static void usage(void) +{ + printf(Usage: ip link add ... + [peer peer-name] [mac mac] [peer_mac mac]\n); +} + +static int veth_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *hdr) +{ + __u8 mac[ETH_ALEN]; + + for (; argc != 0; argv++, argc--) { + if (strcmp(*argv, peer) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + addattr_l(hdr, 1024, VETH_INFO_PEER, + *argv, strlen(*argv)); + + continue; + } + + if (strcmp(*argv, mac) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL) + return -1; + + addattr_l(hdr, 1024, VETH_INFO_MAC, + mac, ETH_ALEN); + continue; + } + + if (strcmp(*argv, peer_mac) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL) + return -1; + + addattr_l(hdr, 1024, VETH_INFO_PEER_MAC, + mac, ETH_ALEN); + continue; + } + + usage(); + return -1; + } + + return 0; +} + +struct link_util veth_link_util = { + .id = veth, + .parse_opt = veth_parse_opt, +}; diff --git a/ip/veth.h b/ip/veth.h new file mode 100644 index 000..d52e0c5 --- /dev/null +++ b/ip/veth.h @@ -0,0 +1,15 @@ +#ifndef __NET_VETH_H__ +#define __NET_VETH_H__ + +enum { + VETH_INFO_UNSPEC, + VETH_INFO_MAC, + VETH_INFO_PEER, + VETH_INFO_PEER_MAC, + + __VETH_INFO_MAX +}; + +#define VETH_INFO_MAX (__VETH_INFO_MAX - 1) + +#endif - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel
Daniel Lezcano wrote: Pavel Emelianov wrote: I did this at the very first version, but Alexey showed me that this would be wrong. Look. When we create the second device it must be in the other namespace as it is useless to have them in one namespace. But if we have the device in the other namespace the RTNL_NEWLINK message from kernel would come into this namespace thus confusing ip utility in the init namespace. Creating the device in the init ns and moving it into the new one is rather a complex task. Pavel, moving the netdevice to another namespace is not a complex task. Eric Biederman did it in its patchset ( cf. http://lxc.sf.net/network ) By saying complex I didn't mean that this is difficult to implement, but that it consists (must consist) of many stages. I.e. composite. Making the device right in the namespace is liter. When the pair device is created, both extremeties are into the init namespace and you can choose to which namespace to move one extremity. I do not mind that. When the network namespace dies, the netdev is moved back to the init namespace. That facilitate network device management. Concerning netlink events, this is automatically generated when the network device is moved through namespaces. IMHO, we should have the network device movement between namespaces in order to be able to move a physical network device too (eg. you have 4 NIC and you want to create 3 containers and assign 3 NIC to each of them) Agree. Moving the devices is a must-have functionality. I do not mind making the pair in the init namespace and move the second one into the desired namespace. But if we *always* will have two ends in different namespaces what to complicate things for? Just to provide a netdev sufficiently generic to be used by people who don't want namespaces but just want to do some network testing, like Ben Greear does. He mentioned in a previous email, he will be happy to stop redirecting people to out of tree patch. This patch creates booth devices in the init namespace. That's what you want, isn't it? When we have the namespaces we will be able to create the pair with booth ends in the init namespace - just do not specify the namespace id to create the 2nd end in and the driver will leave it int the init one. https://lists.linux-foundation.org/pipermail/containers/2007-April/004420.html Thanks, Pavel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel
no one is against generic code and ability to create 2 interfaces in *one* namespace. (Like we currently allow to do so in OpenVZ) However, believe me, moving an interface is a *hard* operation. Much harder then netdev register from the scratch. Because it requires to take into account many things like: - packets in flight which requires synchronize and is slow on big machines - asynchronous sysfs entries registration/deregistration from rtln_unlock - netdev_run_todo - name/ifindex collisions - shutdown/cleanup of addresses/routes/qdisc and other similar stuff All of what you are describing is already implemented in the Eric's patchset. Daniel. We *agree* that this task *is implementable*. We just want to say that creating the device in the namespace is less comp... (oh, well, forget this word) faster than creating and then moving. You can have a look at : http://lxc.sourceforge.net/patches/2.6.20/2.6.20-netns1/broken_out/ And more precisly: for sysfs issues: http://lxc.sourceforge.net/patches/2.6.20/2.6.20-netns1/broken_out/0065-sysfs-Shadow-directory-support.patch for network device movement: http://lxc.sourceforge.net/patches/2.6.20/2.6.20-netns1/broken_out/0096-net-Implment-network-device-movement-between-namesp.patch Good job. Can you prove that this code is less buggy than the existing register_netdevice() one? This requires testing, doesn't it? But on the other hand we have the ability to create the device right in the namespace using well known (and thus well debugged) code with minimal actions from the kernel. Thanks, Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel (v.2)
Ben Greear wrote: Pavel Emelianov wrote: Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. As Dave mentioned, there is already a driver known as 'veth'. Maybe borrow the etun name as well? We have already seen that this driver uses ethXXX names for its devices and Dave agreed with veth one. Moreover Alexey Kuznetsov said that he would prefer the name veth for etun. I would also like some way to identify veth from other device types, preferably something like a value in sysfs. However, that should not hold up We can do this with ethtool. It can get and print the driver name of the device. consideration of this patch, and I am willing to submit a patch after this goes in to add the functionality I want... Ok. Thanks. +/* + * xmit + */ + +static int veth_xmit(struct sk_buff *skb, struct net_device *dev) +{ +struct net_device *rcv = NULL; +struct veth_device_stats *stats; +struct veth_priv *priv, *rcv_priv; +int length, cpu; + +skb_orphan(skb); + +priv = netdev_priv(dev); +cpu = smp_processor_id(); +stats = per_cpu_ptr(priv-stats, cpu); +rcv = priv-peer; + +if (!(rcv-flags IFF_UP)) +goto outf; I think you need at least the option to zero out the time-stamp, otherwise it will not be re-calculated when received on the peer, and it potentially spent significant time since it was last calculated (think netem delay or similar). +/* Zero out the time-stamp so that receiving code is forced + * to recalculate it. + */ +skb-tstamp.off_sec = 0; +skb-tstamp.off_usec = 0; + +rcv_priv = netdev_priv(rcv); +skb-pkt_type = PACKET_HOST; +skb-protocol = eth_type_trans(skb, rcv); +if (dev-features NETIF_F_NO_CSUM) +skb-ip_summed = rcv_priv-ip_summed; + +dst_release(skb-dst); +skb-dst = NULL; +secpath_reset(skb); +nf_reset(skb); +skb-mark = 0; + +length = skb-len; This should be done before you do the eth_type_trans, as that pulls the header and your byte counters will be off. This will be ETH_HLEN larger, do you mean this? I think this is normal as this device tries to look like an iron ethernet card :) + +stats-tx_bytes += length; +stats-tx_packets++; + +stats = per_cpu_ptr(rcv_priv-stats, cpu); +stats-rx_bytes += length; +stats-rx_packets++; + +netif_rx(skb); +return 0; + +outf: +kfree_skb(skb); +stats-tx_dropped++; +return 0; +} Thanks, Ben - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Virtual ethernet tunnel (v.2)
Ben Greear wrote: Pavel Emelianov wrote: Ben Greear wrote: Pavel Emelianov wrote: Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. As Dave mentioned, there is already a driver known as 'veth'. Maybe borrow the etun name as well? We have already seen that this driver uses ethXXX names for its devices and Dave agreed with veth one. Moreover Alexey Kuznetsov said that he would prefer the name veth for etun. Ok, fine by me. I started reading mail from the wrong direction this morning :) I would also like some way to identify veth from other device types, preferably something like a value in sysfs. However, that should not hold up We can do this with ethtool. It can get and print the driver name of the device. I think I'd like something in sysfs that we could query for any interface. Possible return strings could be: VLAN VETH ETH PPP BRIDGE AP /* wifi access point interface */ STA /* wifi station */ I will cook up a patch for consideration after veth goes in. OK. I think you need at least the option to zero out the time-stamp, otherwise it will not be re-calculated when received on the peer, and it potentially spent significant time since it was last calculated (think netem delay or similar). +/* Zero out the time-stamp so that receiving code is forced + * to recalculate it. + */ +skb-tstamp.off_sec = 0; +skb-tstamp.off_usec = 0; + +rcv_priv = netdev_priv(rcv); +skb-pkt_type = PACKET_HOST; +skb-protocol = eth_type_trans(skb, rcv); +if (dev-features NETIF_F_NO_CSUM) +skb-ip_summed = rcv_priv-ip_summed; + +dst_release(skb-dst); +skb-dst = NULL; +secpath_reset(skb); +nf_reset(skb); +skb-mark = 0; + +length = skb-len; This should be done before you do the eth_type_trans, as that pulls the header and your byte counters will be off. This will be ETH_HLEN larger, do you mean this? I think this is normal as this device tries to look like an iron ethernet card :) For device counters, it should count the number of bytes received, including all headers, but excluding the ethernet FCS. If an 'iron' card did differently, I'd consider it a bug. Hmm... The loopback must be doing bad things then. It first calls eth_type_trans and then accounts for the new skb-len. Thanks, Ben - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Virtual ethernet tunnel
Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun. This implementation uses another interface - the RTM_NRELINK message introduced by Patric. The patch fits today netdev tree with Patrick's patches. The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel. The patch for an ip utility is also provided. Eric, since ethtool interface was from your patch, I add your Signed-off-by line. Signed-off-by: Eric W. Biederman [EMAIL PROTECTED] Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 7d57f4a..7e144be 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -119,6 +119,12 @@ config TUN If you don't know what to use this for, you don't need it. +config VETH + tristate Virtual ethernet device + ---help--- + The device is an ethernet tunnel. Devices are created in pairs. When + one end receives the packet it appears on its pair and vice versa. + config NET_SB1000 tristate General Instruments Surfboard 1000 depends on PNP diff --git a/drivers/net/Makefile b/drivers/net/Makefile index a77affa..4764119 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o obj-$(CONFIG_MACMACE) += macmace.o obj-$(CONFIG_MAC89x0) += mac89x0.o obj-$(CONFIG_TUN) += tun.o +obj-$(CONFIG_VETH) += veth.o obj-$(CONFIG_NET_NETX) += netx-eth.o obj-$(CONFIG_DL2K) += dl2k.o obj-$(CONFIG_R8169) += r8169.o diff --git a/drivers/net/veth.c b/drivers/net/veth.c new file mode 100644 index 000..6746c91 --- /dev/null +++ b/drivers/net/veth.c @@ -0,0 +1,391 @@ +/* + * drivers/net/veth.c + * + * Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc + * + */ + +#include linux/list.h +#include linux/netdevice.h +#include linux/ethtool.h +#include linux/etherdevice.h + +#include net/dst.h +#include net/xfrm.h +#include net/veth.h + +#define DRV_NAME veth +#define DRV_VERSION1.0 + +struct veth_priv { + struct net_device *peer; + struct net_device *dev; + struct list_head list; + struct net_device_stats stats; + unsigned ip_summed; +}; + +static LIST_HEAD(veth_list); + +/* + * ethtool interface + */ + +static struct { + const char string[ETH_GSTRING_LEN]; +} ethtool_stats_keys[] = { + { peer_ifindex }, +}; + +static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd) +{ + cmd-supported = 0; + cmd-advertising= 0; + cmd-speed = SPEED_1; + cmd-duplex = DUPLEX_FULL; + cmd-port = PORT_TP; + cmd-phy_address= 0; + cmd-transceiver= XCVR_INTERNAL; + cmd-autoneg= AUTONEG_DISABLE; + cmd-maxtxpkt = 0; + cmd-maxrxpkt = 0; + return 0; +} + +static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) +{ + strcpy(info-driver, DRV_NAME); + strcpy(info-version, DRV_VERSION); + strcpy(info-fw_version, N/A); +} + +static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf) +{ + switch(stringset) { + case ETH_SS_STATS: + memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys)); + break; + } +} + +static int veth_get_stats_count(struct net_device *dev) +{ + return ARRAY_SIZE(ethtool_stats_keys); +} + +static void veth_get_ethtool_stats(struct net_device *dev, + struct ethtool_stats *stats, u64 *data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + data[0] = priv-peer-ifindex; +} + +static u32 veth_get_rx_csum(struct net_device *dev) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + return priv-ip_summed == CHECKSUM_UNNECESSARY; +} + +static int veth_set_rx_csum(struct net_device *dev, u32 data) +{ + struct veth_priv *priv; + + priv = netdev_priv(dev); + priv-ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE; + return 0; +} + +static u32 veth_get_tx_csum(struct net_device *dev) +{ + return (dev-features NETIF_F_NO_CSUM) != 0; +} + +static int veth_set_tx_csum(struct net_device *dev, u32 data) +{ + if (data) + dev-features |= NETIF_F_NO_CSUM; + else + dev-features = ~NETIF_F_NO_CSUM; + return 0; +} + +static struct ethtool_ops veth_ethtool_ops = { + .get_settings = veth_get_settings, + .get_drvinfo= veth_get_drvinfo, + .get_link = ethtool_op_get_link, + .get_rx_csum
[PATCH] Module for ip utility to support veth device
The usage is # ip link add [name] type veth [peer name] [mac mac] [peer_mac mac] The Makefile is maybe not as beautiful as it could be. It is to be discussed. One thing I noticed during testing is the following. When launching this with link_veth.so module and not specifying any module specific parameters, the kernel refuses to accept the packet when parsing the IFLA_LINKINFO. So the hunk for ip/iplink.c doesn't add an empty extra header if no extra data expected. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/ip/Makefile b/ip/Makefile index 9a5bfe3..b46bce3 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o ALLOBJ=$(IPOBJ) $(RTMONOBJ) SCRIPTS=ifcfg rtpr routel routef TARGETS=ip rtmon +LIBS=link_veth.so -all: $(TARGETS) $(SCRIPTS) +all: $(TARGETS) $(SCRIPTS) $(LIBS) ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL) @@ -24,3 +25,6 @@ clean: LDLIBS += -ldl LDFLAGS+= -Wl,-export-dynamic + +%.so: %.c + $(CC) $(CFLAGS) -shared $ -o $@ diff --git a/ip/iplink.c b/ip/iplink.c index 5170419..6975990 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign strlen(type)); lu = get_link_type(type); - if (lu) { + if (lu argc) { struct rtattr * data = NLMSG_TAIL(req.n); addattr_l(req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0); diff --git a/ip/link_veth.c b/ip/link_veth.c new file mode 100644 index 000..adfdef6 --- /dev/null +++ b/ip/link_veth.c @@ -0,0 +1,77 @@ +#include string.h + +#include utils.h +#include ip_common.h +#include veth.h + +#define ETH_ALEN 6 + +static void usage(void) +{ + printf(Usage: ip link add ... + [peer peer-name] [mac mac] [peer_mac mac]\n); +} + +static int veth_parse_opt(struct link_util *lu, int argc, char **argv, + struct nlmsghdr *hdr) +{ + __u8 mac[ETH_ALEN]; + + for (; argc != 0; argv++, argc--) { + if (strcmp(*argv, peer) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + addattr_l(hdr, 1024, VETH_INFO_PEER, + *argv, strlen(*argv)); + + continue; + } + + if (strcmp(*argv, mac) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL) + return -1; + + addattr_l(hdr, 1024, VETH_INFO_MAC, + mac, ETH_ALEN); + continue; + } + + if (strcmp(*argv, peer_mac) == 0) { + argv++; + argc--; + if (argc == 0) { + usage(); + return -1; + } + + if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL) + return -1; + + addattr_l(hdr, 1024, VETH_INFO_PEER_MAC, + mac, ETH_ALEN); + continue; + } + + usage(); + return -1; + } + + return 0; +} + +struct link_util veth_link_util = { + .id = veth, + .parse_opt = veth_parse_opt, +}; diff --git a/ip/veth.h b/ip/veth.h new file mode 100644 index 000..74c8e1e --- /dev/null +++ b/ip/veth.h @@ -0,0 +1,13 @@ +#ifndef __NET_VETH_H__ +#define __NET_VETH_H__ + +enum { + VETH_INFO_UNSPEC, + VETH_INFO_MAC, + VETH_INFO_PEER, + VETH_INFO_PEER_MAC, + + VETH_INFO_MAX +}; + +#endif - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Consolidate udp hash calculations
David Miller wrote: From: Pavel Emelianov [EMAIL PROTECTED] Date: Fri, 04 May 2007 18:51:36 +0400 Make access to udphash/udplitehash symmetrical to inet hashes. This may also help network namespaces, since they tend to use one hash for different namespaces by selecting the hash chain depending on a hash value and the namespace. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] This is not the UDP Hash Function, it is a mask to bring the hash value modulo of the hash table size. So your function name is misleading and makes the code even worse. I'm also not so sure this cleanup really adds any clarity, even with a proper function name. The same is applicable to inet_lhashfn and inet_bhashfn, but they are called hashfn-s and do exist in kernel. This hashfn will make things look similar. I can only assume you wish to do something with the UDP hash table sizes, and therefore only want to have a need to touch one function. You are right, I have told that this must help network namespaces, by making the hash differ depending on the namespace passed. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Consolidate checking for tcp orphan count being too big
tcp_out_of_resources() and tcp_close() perform the same checking of number of orphan sockets. Move this code into common place. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/include/net/tcp.h b/include/net/tcp.h index e22b4f0..a8af9ae 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -254,6 +254,12 @@ static inline int between(__u32 seq1, __ return seq3 - seq2 = seq1 - seq2; } +static inline int tcp_too_many_orphans(struct sock *sk, int num) +{ + return (num sysctl_tcp_max_orphans) || + (sk-sk_wmem_queued SOCK_MIN_SNDBUF +atomic_read(tcp_memory_allocated) sysctl_tcp_mem[2]); +} extern struct proto tcp_prot; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8b124ea..8fbe550 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1675,9 +1675,8 @@ adjudge_to_death: } if (sk-sk_state != TCP_CLOSE) { sk_stream_mem_reclaim(sk); - if (atomic_read(sk-sk_prot-orphan_count) sysctl_tcp_max_orphans || - (sk-sk_wmem_queued SOCK_MIN_SNDBUF -atomic_read(tcp_memory_allocated) sysctl_tcp_mem[2])) { + if (tcp_too_many_orphans(sk, + atomic_read(sk-sk_prot-orphan_count))) { if (net_ratelimit()) printk(KERN_INFO TCP: too many of orphaned sockets\n); diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c index 2ca97b2..e613401 100644 --- a/net/ipv4/tcp_timer.c +++ b/net/ipv4/tcp_timer.c @@ -78,9 +78,7 @@ static int tcp_out_of_resources(struct s if (sk-sk_err_soft) orphans = 1; - if (orphans = sysctl_tcp_max_orphans || - (sk-sk_wmem_queued SOCK_MIN_SNDBUF -atomic_read(tcp_memory_allocated) sysctl_tcp_mem[2])) { + if (tcp_too_many_orphans(sk, orphans)) { if (net_ratelimit()) printk(KERN_INFO Out of socket memory\n); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Consolidate udp hash calculations
Make access to udphash/udplitehash symmetrical to inet hashes. This may also help network namespaces, since they tend to use one hash for different namespaces by selecting the hash chain depending on a hash value and the namespace. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/include/linux/udp.h b/include/linux/udp.h index 6de445c..a40ccd4 100644 --- a/include/linux/udp.h +++ b/include/linux/udp.h @@ -49,6 +49,11 @@ static inline struct udphdr *udp_hdr(con #include net/inet_sock.h #define UDP_HTABLE_SIZE128 +static inline unsigned int udp_hashfn(unsigned int num) +{ + return num (UDP_HTABLE_SIZE - 1); +} + struct udp_sock { /* inet_sock has to be the first member */ struct inet_sock inet; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 113e0c4..8ea1c81 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -132,7 +132,7 @@ static inline int __udp_lib_port_inuse(u struct hlist_node *node; struct inet_sock *inet; - sk_for_each(sk, node, udptable[hash (UDP_HTABLE_SIZE - 1)]) { + sk_for_each(sk, node, udptable[udp_hashfn(hash)]) { if (sk-sk_hash != hash) continue; inet = inet_sk(sk); @@ -159,7 +159,6 @@ int __udp_lib_get_port(struct sock *sk, const struct sock *sk2 )) { struct hlist_node *node; - struct hlist_head *head; struct sock *sk2; unsigned int hash; interror = 1; @@ -175,10 +174,11 @@ int __udp_lib_get_port(struct sock *sk, best = result = *port_rover; for (i = 0; i UDP_HTABLE_SIZE; i++, result++) { int size; + struct hlist_head *head; hash = hash_port_and_addr(result, inet_sk(sk)-rcv_saddr); - head = udptable[hash (UDP_HTABLE_SIZE - 1)]; + head = udptable[udp_hashfn(hash)]; if (hlist_empty(head)) { if (result sysctl_local_port_range[1]) result = sysctl_local_port_range[0] + @@ -222,9 +222,7 @@ gotit: *port_rover = snum = result; } else { hash = hash_port_and_addr(snum, 0); - head = udptable[hash (UDP_HTABLE_SIZE - 1)]; - - sk_for_each(sk2, node, head) + sk_for_each(sk2, node, udptable[udp_hashfn(hash)]) if (sk2-sk_hash == hash sk2 != sk inet_sk(sk2)-num == snum @@ -237,9 +235,7 @@ gotit: if (inet_sk(sk)-rcv_saddr) { hash = hash_port_and_addr(snum, inet_sk(sk)-rcv_saddr); - head = udptable[hash (UDP_HTABLE_SIZE - 1)]; - - sk_for_each(sk2, node, head) + sk_for_each(sk2, node, udptable[udp_hashfn(hash)]) if (sk2-sk_hash == hash sk2 != sk inet_sk(sk2)-num == snum @@ -255,8 +251,7 @@ gotit: inet_sk(sk)-num = snum; sk-sk_hash = hash; if (sk_unhashed(sk)) { - head = udptable[hash (UDP_HTABLE_SIZE - 1)]; - sk_add_node(sk, head); + sk_add_node(sk, udptable[udp_hashfn(hash)]); sock_prot_inc_use(sk-sk_prot); } error = 0; @@ -304,7 +299,7 @@ static struct sock *__udp4_lib_lookup(__ lookup: - sk_for_each(sk, node, udptable[hash (UDP_HTABLE_SIZE - 1)]) { + sk_for_each(sk, node, udptable[udp_hashfn(hash)]) { struct inet_sock *inet = inet_sk(sk); if (sk-sk_hash != hash || ipv6_only_sock(sk) || @@ -1205,8 +1200,8 @@ static int __udp4_lib_mcast_deliver(stru read_lock(udp_hash_lock); - sk = sk_head(udptable[hash (UDP_HTABLE_SIZE - 1)]); - skw = sk_head(udptable[hashwild (UDP_HTABLE_SIZE - 1)]); + sk = sk_head(udptable[udp_hashfn(hash)]); + skw = sk_head(udptable[udp_hashfn(hashwild)]); sk = udp_v4_mcast_next(sk, hash, hport, daddr, uh-source, saddr, dif); if (!sk) { diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c index b083c09..a0a8915 100644 --- a/net/ipv6/udp.c +++ b/net/ipv6/udp.c @@ -67,7 +67,7 @@ static struct sock *__udp6_lib_lookup(st int badness = -1; read_lock(udp_hash_lock); - sk_for_each(sk, node, udptable[hnum (UDP_HTABLE_SIZE - 1)]) { + sk_for_each(sk, node, udptable[udp_hashfn(hnum)]) { struct inet_sock *inet = inet_sk(sk); if (sk-sk_hash == hnum sk-sk_family == PF_INET6) { @@ -350,7 +350,7 @@ static int __udp6_lib_mcast_deliver(stru int dif; read_lock(udp_hash_lock); - sk
Re: [PATCH] Rework dev_base via list_head
David Miller wrote: From: Pavel Emelianov [EMAIL PROTECTED] Date: Wed, 02 May 2007 17:40:56 +0400 Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev-next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Fits 2.6.21-rc7 tree. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Acked-by: Kirill Korotaev [EMAIL PROTECTED] Overall this looks mostly good. One thing I want to audit before applying this is loop termination conditions. With the old loop, if you do something like this: for (dev = dev_base; dev; dev = dev-next) { if (dev == what_I_want) break; } you can test for a successful find after the loop with: if (dev) { I_found_it(); } That doesn't work with for_each_netdev(), if the loop runs till the end of the list, the iterator will not be left at NULL. I just want to make sure you didn't leave any code around which wants that behavior still. My fault :( I've found some places where this was missed. I will make a new patch shortly. This is one of the subtle things about using the list iterators in linux/list.h, vs. a traditional by-hand singly linked list implementation. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Rework dev_base via list_head (v2)
Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev-next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Changes: * David's comment about loop termination; * Applies to todays netdev git tree. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Acked-by: Kirill Korotaev [EMAIL PROTECTED] --- diff --git a/arch/s390/appldata/appldata_net_sum.c b/arch/s390/appldata/appldata_net_sum.c index a43f348..2180ac1 100644 --- a/arch/s390/appldata/appldata_net_sum.c +++ b/arch/s390/appldata/appldata_net_sum.c @@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo tx_dropped = 0; collisions = 0; read_lock(dev_base_lock); - for (dev = dev_base; dev != NULL; dev = dev-next) { + for_each_netdev(dev) { stats = dev-get_stats(dev); rx_packets += stats-rx_packets; tx_packets += stats-tx_packets; diff --git a/arch/sparc64/solaris/ioctl.c b/arch/sparc64/solaris/ioctl.c index 330743c..18352a4 100644 --- a/arch/sparc64/solaris/ioctl.c +++ b/arch/sparc64/solaris/ioctl.c @@ -686,7 +686,8 @@ static inline int solaris_i(unsigned int int i = 0; read_lock_bh(dev_base_lock); - for (d = dev_base; d; d = d-next) i++; + for_each_netdev(d) + i++; read_unlock_bh(dev_base_lock); if (put_user (i, (int __user *)A(arg))) diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c index 1a6aeac..01fbdd3 100644 --- a/drivers/block/aoe/aoecmd.c +++ b/drivers/block/aoe/aoecmd.c @@ -194,15 +194,15 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne sl = sl_tail = NULL; read_lock(dev_base_lock); - for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) { + for_each_netdev(ifp) { dev_hold(ifp); if (!is_aoe_netif(ifp)) - continue; + goto cont; skb = new_skb(sizeof *h + sizeof *ch); if (skb == NULL) { printk(KERN_INFO aoe: skb alloc failure\n); - continue; + goto cont; } skb_put(skb, sizeof *h + sizeof *ch); skb-dev = ifp; @@ -221,6 +221,8 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne skb-next = sl; sl = skb; +cont: + dev_put(ifp); } read_unlock(dev_base_lock); diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c index 2a299a0..ef32a5c 100644 --- a/drivers/net/wireless/strip.c +++ b/drivers/net/wireless/strip.c @@ -1971,8 +1971,7 @@ static struct net_device *get_strip_dev( sizeof(zero_address))) { struct net_device *dev; read_lock_bh(dev_base_lock); - dev = dev_base; - while (dev) { + for_each_netdev(dev) { if (dev-type == strip_info-dev-type !memcmp(dev-dev_addr, strip_info-true_dev_addr, @@ -1983,7 +1982,6 @@ static struct net_device *get_strip_dev( read_unlock_bh(dev_base_lock); return (dev); } - dev = dev-next; } read_unlock_bh(dev_base_lock); } diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c index 3df82fe..98be288 100644 --- a/drivers/parisc/led.c +++ b/drivers/parisc/led.c @@ -365,7 +365,7 @@ static __inline__ int led_get_net_activi * for reading should be OK */ read_lock(dev_base_lock); rcu_read_lock(); - for (dev = dev_base; dev; dev = dev-next) { + for_each_netdev(dev) { struct net_device_stats *stats; struct in_device *in_dev = __in_dev_get_rcu(dev); if (!in_dev || !in_dev-ifa_list) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ac0c92b..782eb1d 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -304,7 +304,7 @@ struct net_device unsigned long state; - struct net_device *next; + struct list_headdev_list; /* The device initialization function. Called only once. */ int (*init)(struct net_device *dev); @@ -575,9 +575,31 @@ struct packet_type { #include linux/notifier.h extern struct net_device loopback_dev; /* The loopback */ -extern struct net_device *dev_base; /* All devices */ +extern struct list_head
Re: [PATCH] Rework dev_base via list_head (v2)
Patrick McHardy wrote: Pavel Emelianov wrote: Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev-next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Changes: * David's comment about loop termination; * Applies to todays netdev git tree. This seems to be missing fs/afs/netdevices.c. Hm... I've cloned the git repo this morning but there's not such file... $ ls fs/afs/netdevices.c /bin/ls: fs/afs/netdevices.c: No such file or directory Repo was cloned from git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Rework dev_base via list_head (v3)
Netdev tree was updated today soon after I sent a patch for it, so this version applies to it and the fs/afs/netdevices.c file is added. Hope I am not late yet again :) Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev-next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Acked-by: Kirill Korotaev [EMAIL PROTECTED] Cc: Patrick McHardy [EMAIL PROTECTED] --- diff --git a/arch/s390/appldata/appldata_net_sum.c b/arch/s390/appldata/appldata_net_sum.c index a43f348..2180ac1 100644 --- a/arch/s390/appldata/appldata_net_sum.c +++ b/arch/s390/appldata/appldata_net_sum.c @@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo tx_dropped = 0; collisions = 0; read_lock(dev_base_lock); - for (dev = dev_base; dev != NULL; dev = dev-next) { + for_each_netdev(dev) { stats = dev-get_stats(dev); rx_packets += stats-rx_packets; tx_packets += stats-tx_packets; diff --git a/arch/sparc64/solaris/ioctl.c b/arch/sparc64/solaris/ioctl.c index 330743c..18352a4 100644 --- a/arch/sparc64/solaris/ioctl.c +++ b/arch/sparc64/solaris/ioctl.c @@ -686,7 +686,8 @@ static inline int solaris_i(unsigned int int i = 0; read_lock_bh(dev_base_lock); - for (d = dev_base; d; d = d-next) i++; + for_each_netdev(d) + i++; read_unlock_bh(dev_base_lock); if (put_user (i, (int __user *)A(arg))) diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c index 1a6aeac..01fbdd3 100644 --- a/drivers/block/aoe/aoecmd.c +++ b/drivers/block/aoe/aoecmd.c @@ -194,15 +194,15 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne sl = sl_tail = NULL; read_lock(dev_base_lock); - for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) { + for_each_netdev(ifp) { dev_hold(ifp); if (!is_aoe_netif(ifp)) - continue; + goto cont; skb = new_skb(sizeof *h + sizeof *ch); if (skb == NULL) { printk(KERN_INFO aoe: skb alloc failure\n); - continue; + goto cont; } skb_put(skb, sizeof *h + sizeof *ch); skb-dev = ifp; @@ -221,6 +221,8 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne skb-next = sl; sl = skb; +cont: + dev_put(ifp); } read_unlock(dev_base_lock); diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c index 2a299a0..ef32a5c 100644 --- a/drivers/net/wireless/strip.c +++ b/drivers/net/wireless/strip.c @@ -1971,8 +1971,7 @@ static struct net_device *get_strip_dev( sizeof(zero_address))) { struct net_device *dev; read_lock_bh(dev_base_lock); - dev = dev_base; - while (dev) { + for_each_netdev(dev) { if (dev-type == strip_info-dev-type !memcmp(dev-dev_addr, strip_info-true_dev_addr, @@ -1983,7 +1982,6 @@ static struct net_device *get_strip_dev( read_unlock_bh(dev_base_lock); return (dev); } - dev = dev-next; } read_unlock_bh(dev_base_lock); } diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c index 3df82fe..98be288 100644 --- a/drivers/parisc/led.c +++ b/drivers/parisc/led.c @@ -365,7 +365,7 @@ static __inline__ int led_get_net_activi * for reading should be OK */ read_lock(dev_base_lock); rcu_read_lock(); - for (dev = dev_base; dev; dev = dev-next) { + for_each_netdev(dev) { struct net_device_stats *stats; struct in_device *in_dev = __in_dev_get_rcu(dev); if (!in_dev || !in_dev-ifa_list) diff --git a/fs/afs/netdevices.c b/fs/afs/netdevices.c index ce08977..fc27d4b 100644 --- a/fs/afs/netdevices.c +++ b/fs/afs/netdevices.c @@ -47,7 +47,7 @@ int afs_get_ipv4_interfaces(struct afs_i ASSERT(maxbufs 0); rtnl_lock(); - for (dev = dev_base; dev; dev = dev-next) { + for_each_netdev(dev) { if (dev-type == ARPHRD_LOOPBACK !wantloopback) continue; idev = __in_dev_get_rtnl(dev); diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 4428f1c..3044622 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h
[PATCH] Virtual ethernet device (tunnel)
Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun. This implementation is closer to the OpenVZ one and it lacks some unimportant features of etun driver (like ethtool_ops) for simplicity. The general difference from etun is that a netlink interface is used to create and destroy the pairs. The patch for an ip utility is also provided. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Acked-By: Kirill Korotaev [EMAIL PROTECTED] Acked-By: Dmitry Mishin [EMAIL PROTECTED] Acked-By: Alexey Kuznetsov [EMAIL PROTECTED] --- diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index c3f9f59..445dbc7 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -119,6 +119,12 @@ config TUN If you don't know what to use this for, you don't need it. +config VETH + tristate Virtual ethernet device + ---help--- + The device is an ethernet tunnel. Devices are created in pairs. When + one end receives the packet it appears on its pair and vice versa. + config NET_SB1000 tristate General Instruments Surfboard 1000 depends on PNP diff --git a/drivers/net/Makefile b/drivers/net/Makefile index 33af833..2730b80 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o obj-$(CONFIG_MACMACE) += macmace.o obj-$(CONFIG_MAC89x0) += mac89x0.o obj-$(CONFIG_TUN) += tun.o +obj-$(CONFIG_VETH) += veth.o obj-$(CONFIG_NET_NETX) += netx-eth.o obj-$(CONFIG_DL2K) += dl2k.o obj-$(CONFIG_R8169) += r8169.o diff --git a/drivers/net/veth.c b/drivers/net/veth.c new file mode 100644 index 000..6105f99 --- /dev/null +++ b/drivers/net/veth.c @@ -0,0 +1,387 @@ +/* + * drivers/net/veth.c + * + * Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc + * + */ + +#include linux/init.h +#include linux/if.h +#include linux/netdevice.h +#include linux/etherdevice.h +#include net/dst.h +#include net/xfrm.h +#include net/genetlink.h +#include net/veth.h + +struct veth_struct { + struct net_device *peer; + struct net_device *dev; + + struct list_headlist; + struct net_device_stats *real_stats; + + struct net_device_stats stats; +}; + +static LIST_HEAD(veth_list); + +static inline struct net_device_stats *veth_stats(struct veth_struct *veth, + int cpuid) +{ + return per_cpu_ptr(veth-real_stats, cpuid); +} + +/* + * Device functions + */ + +static int veth_open(struct net_device *dev) +{ + return 0; +} + +static int veth_close(struct net_device *dev) +{ + return 0; +} + +static void veth_destructor(struct net_device *dev) +{ + struct veth_struct *veth; + + veth = dev-priv; + free_percpu(veth-real_stats); + free_netdev(dev); +} + +static struct net_device_stats *veth_get_stats(struct net_device *dev) +{ + int i; + struct veth_struct *veth; + struct net_device_stats *stats; + struct net_device_stats *dev_stats; + + veth = dev-priv; + stats = veth-stats; + memset(stats, 0, sizeof(struct net_device_stats)); + + for_each_possible_cpu (i) { + dev_stats = veth_stats(veth, i); + stats-rx_bytes += dev_stats-rx_bytes; + stats-tx_bytes += dev_stats-tx_bytes; + stats-rx_packets += dev_stats-rx_packets; + stats-tx_packets += dev_stats-tx_packets; + } + + return stats; +} + +static int veth_xmit(struct sk_buff *skb, struct net_device *dev) +{ + struct net_device_stats *stats; + struct net_device *rcv = NULL; + struct veth_struct *veth; + int length, cpu; + + skb_orphan(skb); + + veth = dev-priv; + rcv = veth-peer; + + cpu = smp_processor_id(); + stats = veth_stats(veth, cpu); + + if (!(rcv-flags IFF_UP)) + goto outf; + + skb-dev = rcv; + skb-pkt_type = PACKET_HOST; + skb-protocol = eth_type_trans(skb, rcv); + + dst_release(skb-dst); + skb-dst = NULL; + + secpath_reset(skb); + nf_reset(skb); + + length = skb-len; + + stats-tx_bytes += length; + stats-tx_packets++; + + stats = veth_stats(rcv-priv, cpu); + stats-rx_bytes += length; + stats-rx_packets++; + + netif_rx(skb); + return 0; + +outf: + kfree_skb(skb); + stats-tx_dropped++; + return 0; +} + +/* + * Setup / remove routines + */ + +static int veth_init_dev(struct net_device *dev) +{ + struct veth_struct *veth; + + dev-hard_start_xmit = veth_xmit; + dev-get_stats = veth_get_stats; + dev-open = veth_open; + dev-stop = veth_close; + dev-destructor = veth_destructor; + + veth = dev
[PATCH] Make ip utility veth driver aware
The new command is called veth with the following syntax: * ip veth add dev1 dev2 creates interconnected pair of veth devices. * ip veth del dev destroys the pair of veth devices, where dev is either dev1 or dev2 used to create the pair. One question that is to be solved is whether or not to create a hard-coded netlink family for veth driver. Without it the family resolution code has to be moved to general place in ip utility (by now it is copy-paste-ed from one file to another till final decision). Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- diff --git a/ip/Makefile b/ip/Makefile index a749993..864e04e 100644 --- a/ip/Makefile +++ b/ip/Makefile @@ -1,7 +1,7 @@ IPOBJ=ip.o ipaddress.o iproute.o iprule.o \ rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \ ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o \ -ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o +ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o veth.o RTMONOBJ=rtmon.o diff --git a/ip/ip.c b/ip/ip.c index c084292..d3d3a8a 100644 --- a/ip/ip.c +++ b/ip/ip.c @@ -27,6 +27,7 @@ #include SNAPSHOT.h #include utils.h #include ip_common.h +#include veth.h int preferred_family = AF_UNSPEC; int show_stats = 0; @@ -46,7 +47,7 @@ static void usage(void) Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n ip [ -force ] [-batch filename\n where OBJECT := { link | addr | route | rule | neigh | ntable | tunnel |\n - maddr | mroute | monitor | xfrm }\n + maddr | mroute | monitor | xfrm | veth }\n OPTIONS := { -V[ersion] | -s[tatistics] | -r[esolve] |\n -f[amily] { inet | inet6 | ipx | dnet | link } |\n -o[neline] | -t[imestamp] }\n); @@ -76,6 +77,7 @@ static const struct cmd { { monitor,do_ipmonitor }, { xfrm, do_xfrm }, { mroute, do_multiroute }, + { veth, do_veth }, { help, do_help }, { 0 } }; diff --git a/ip/veth.c b/ip/veth.c new file mode 100644 index 000..d4eecc8 --- /dev/null +++ b/ip/veth.c @@ -0,0 +1,196 @@ +/* + * veth.c ethernet tunnel + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Authors:Pavel Emelianov, [EMAIL PROTECTED] + * + */ + +#include stdio.h +#include string.h +#include unistd.h +#include sys/types.h +#include sys/socket.h +#include linux/genetlink.h + +#include utils.h +#include veth.h + +#define GENLMSG_DATA(glh) ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN)) +#define NLA_DATA(na)((void *)((char*)(na) + NLA_HDRLEN)) + +static int do_veth_help(void) +{ + fprintf(stderr, Usage: ip veth add DEVICE PEER_NAME\n); + fprintf(stderr,del DEVICE\n); + exit(-1); +} + +static int genl_ctrl_resolve_family(const char *family) +{ + struct rtnl_handle rth; + struct nlmsghdr *nlh; + struct genlmsghdr *ghdr; + int ret = 0; + struct { + struct nlmsghdr n; + charbuf[4096]; + } req; + + memset(req, 0, sizeof(req)); + + nlh = req.n; + nlh-nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN); + nlh-nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; + nlh-nlmsg_type = GENL_ID_CTRL; + + ghdr = NLMSG_DATA(req.n); + ghdr-cmd = CTRL_CMD_GETFAMILY; + + if (rtnl_open_byproto(rth, 0, NETLINK_GENERIC) 0) { + fprintf(stderr, Cannot open generic netlink socket\n); + exit(1); + } + + addattr_l(nlh, 128, CTRL_ATTR_FAMILY_NAME, family, strlen(family) + 1); + + if (rtnl_talk(rth, nlh, 0, 0, nlh, NULL, NULL) 0) { + fprintf(stderr, Error talking to the kernel\n); + goto errout; + } + + { + struct rtattr *tb[CTRL_ATTR_MAX + 1]; + struct genlmsghdr *ghdr = NLMSG_DATA(nlh); + int len = nlh-nlmsg_len; + struct rtattr *attrs; + + if (nlh-nlmsg_type != GENL_ID_CTRL) { + fprintf(stderr, Not a controller message, nlmsg_len=%d + nlmsg_type=0x%x\n, nlh-nlmsg_len, nlh-nlmsg_type); + goto errout; + } + + if (ghdr-cmd != CTRL_CMD_NEWFAMILY) { + fprintf(stderr, Unkown controller command %d\n, ghdr-cmd); + goto errout; + } + + len -= NLMSG_LENGTH(GENL_HDRLEN); + + if (len 0) { + fprintf(stderr, wrong controller message len %d\n, len); + return -1; + } + + attrs = (struct rtattr *) ((char *) ghdr
Re: [Devel] [PATCH] Virtual ethernet device (tunnel)
Daniel Lezcano wrote: Pavel Emelianov wrote: Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other. Mainly it allows to communicate between network namespaces but it can be used as is as well. Eric recently sent a similar driver called etun. This implementation is closer to the OpenVZ one and it lacks some unimportant features of etun driver (like ethtool_ops) for simplicity. The general difference from etun is that a netlink interface is used to create and destroy the pairs. The patch for an ip utility is also provided. if etun and veth are similar, why didn't you put the netlink interface to the etun driver instead of sending a new driver ? Alexey said, that he preferred the name veth to etun. So an incremental patch would look too bad. -- Daniel - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Rework dev_base via list_head
Cleanup of dev_base list use, with the aim to simplify making device list per-namespace. In almost every occasion, use of dev_base variable and dev-next pointer could be easily replaced by for_each_netdev loop. A few most complicated places were converted to using first_netdev()/next_netdev(). Fits 2.6.21-rc7 tree. Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Acked-by: Kirill Korotaev [EMAIL PROTECTED] --- diff --git a/arch/s390/appldata/appldata_net_sum.c b/arch/s390/appldata/appldata_net_sum.c index f64b8c8..c271c9e 100644 --- a/arch/s390/appldata/appldata_net_sum.c +++ b/arch/s390/appldata/appldata_net_sum.c @@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo tx_dropped = 0; collisions = 0; read_lock(dev_base_lock); - for (dev = dev_base; dev != NULL; dev = dev-next) { + for_each_netdev(dev) { if (dev-get_stats == NULL) { continue; } diff --git a/arch/sparc64/solaris/ioctl.c b/arch/sparc64/solaris/ioctl.c index 330743c..18352a4 100644 --- a/arch/sparc64/solaris/ioctl.c +++ b/arch/sparc64/solaris/ioctl.c @@ -686,7 +686,8 @@ static inline int solaris_i(unsigned int int i = 0; read_lock_bh(dev_base_lock); - for (d = dev_base; d; d = d-next) i++; + for_each_netdev(d) + i++; read_unlock_bh(dev_base_lock); if (put_user (i, (int __user *)A(arg))) diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c index 8d17d8d..236ec10 100644 --- a/drivers/block/aoe/aoecmd.c +++ b/drivers/block/aoe/aoecmd.c @@ -193,15 +193,15 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne sl = sl_tail = NULL; read_lock(dev_base_lock); - for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) { + for_each_netdev(ifp) { dev_hold(ifp); if (!is_aoe_netif(ifp)) - continue; + goto cont; skb = new_skb(sizeof *h + sizeof *ch); if (skb == NULL) { printk(KERN_INFO aoe: skb alloc failure\n); - continue; + goto cont; } skb_put(skb, sizeof *h + sizeof *ch); skb-dev = ifp; @@ -220,6 +220,8 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne skb-next = sl; sl = skb; +cont: + dev_put(ifp); } read_unlock(dev_base_lock); diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c index f5ce1c6..200d131 100644 --- a/drivers/net/wireless/strip.c +++ b/drivers/net/wireless/strip.c @@ -1971,8 +1971,7 @@ static struct net_device *get_strip_dev( sizeof(zero_address))) { struct net_device *dev; read_lock_bh(dev_base_lock); - dev = dev_base; - while (dev) { + for_each_netdev(dev) { if (dev-type == strip_info-dev-type !memcmp(dev-dev_addr, strip_info-true_dev_addr, @@ -1983,7 +1982,6 @@ static struct net_device *get_strip_dev( read_unlock_bh(dev_base_lock); return (dev); } - dev = dev-next; } read_unlock_bh(dev_base_lock); } diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c index d190c05..5db0526 100644 --- a/drivers/parisc/led.c +++ b/drivers/parisc/led.c @@ -365,7 +365,7 @@ static __inline__ int led_get_net_activi * for reading should be OK */ read_lock(dev_base_lock); rcu_read_lock(); - for (dev = dev_base; dev; dev = dev-next) { + for_each_netdev(dev) { struct net_device_stats *stats; struct in_device *in_dev = __in_dev_get_rcu(dev); if (!in_dev || !in_dev-ifa_list) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 1a52854..7038e43 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -302,7 +302,7 @@ struct net_device unsigned long state; - struct net_device *next; + struct list_headdev_list; /* The device initialization function. Called only once. */ int (*init)(struct net_device *dev); @@ -569,9 +569,31 @@ struct packet_type { #include linux/notifier.h extern struct net_device loopback_dev; /* The loopback */ -extern struct net_device *dev_base; /* All devices */ +extern struct list_headdev_base_head; /* All devices */ extern rwlock_tdev_base_lock
Re: [BRIDGE] Unaligned access on IA64 when comparing ethernet addresses
[snip] --- linux-2.6.orig/net/bridge/br_private.h2007-04-17 13:26:48.0 -0700 +++ linux-2.6/net/bridge/br_private.h 2007-04-17 13:30:29.0 -0700 @@ -36,7 +36,7 @@ { unsigned char prio[2]; unsigned char addr[6]; -}; +} __attribute__((aligned(8))); Why 8? Mustn't it be 16? Address is to be 2-bytes aligned... struct mac_addr { - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[NETLINK] Don't attach callback to a going-away netlink socket
Sorry, I forgot to put netdev and David in Cc when I first sent it. There is a race between netlink_dump_start() and netlink_release() that can lead to the situation when a netlink socket with non-zero callback is freed. Here it is: CPU1: CPU2 netlink_release(): netlink_dump_start(): sk = netlink_lookup(); /* OK */ netlink_remove(); spin_lock(nlk-cb_lock); if (nlk-cb) { /* false */ ... } spin_unlock(nlk-cb_lock); spin_lock(nlk-cb_lock); if (nlk-cb) { /* false */ ... } nlk-cb = cb; spin_unlock(nlk-cb_lock); ... sock_orphan(sk); /* * proceed with releasing * the socket */ The proposal it to make sock_orphan before detaching the callback in netlink_release() and to check for the sock to be SOCK_DEAD in netlink_dump_start() before setting a new callback. Signed-off-by: Denis Lunev [EMAIL PROTECTED] Signed-off-by: Kirill Korotaev [EMAIL PROTECTED] Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] Acked-by: Patrick McHardy [EMAIL PROTECTED] --- --- a/net/netlink/af_netlink.c 2004-10-25 12:12:23.0 +0400 +++ b/net/netlink/af_netlink.c 2004-10-28 16:26:12.0 +0400 @@ -255,6 +255,7 @@ static int netlink_release(struct socket return 0; netlink_remove(sk); + sock_orphan(sk); nlk = nlk_sk(sk); spin_lock(nlk-cb_lock); @@ -269,7 +270,6 @@ static int netlink_release(struct socket /* OK. Socket is unlinked, and, therefore, no new packets will arrive */ - sock_orphan(sk); sock-sk = NULL; wake_up_interruptible_all(nlk-wait); @@ -942,9 +942,9 @@ int netlink_dump_start(struct sock *ssk, return -ECONNREFUSED; } nlk = nlk_sk(sk); - /* A dump is in progress... */ + /* A dump or destruction is in progress... */ spin_lock(nlk-cb_lock); - if (nlk-cb) { + if (nlk-cb || sock_flag(sk, SOCK_DEAD)) { spin_unlock(nlk-cb_lock); netlink_destroy_callback(cb); sock_put(sk); - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETLINK] Don't attach callback to a going-away netlink socket
Evgeniy Polyakov wrote: On Wed, Apr 18, 2007 at 12:16:18PM +0400, Pavel Emelianov ([EMAIL PROTECTED]) wrote: Sorry, I forgot to put netdev and David in Cc when I first sent it. There is a race between netlink_dump_start() and netlink_release() that can lead to the situation when a netlink socket with non-zero callback is freed. Out of curiosity, why not to fix a netlink_dump_start() to remove callback in error path, since in 'no-error' path it removes it in Error path is not relevant here. The problem is that we keep a calback on a socket that is about to be freed. netlink_dump(). And, btw, can release method be called while socket is being used, I thought about proper reference counters should prevent this, but not 100% sure with RCU dereferencing of the descriptor. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [BRIDGE] Unaligned access on IA64 when comparing ethernet addresses
David Miller wrote: From: Pavel Emelianov [EMAIL PROTECTED] Date: Wed, 18 Apr 2007 10:43:56 +0400 [snip] --- linux-2.6.orig/net/bridge/br_private.h 2007-04-17 13:26:48.0 -0700 +++ linux-2.6/net/bridge/br_private.h 2007-04-17 13:30:29.0 -0700 @@ -36,7 +36,7 @@ { unsigned char prio[2]; unsigned char addr[6]; -}; +} __attribute__((aligned(8))); Why 8? Mustn't it be 16? Address is to be 2-bytes aligned... Actually it could be made 2, the aligned() attribute is in bytes, not bits. Indeed :) My bad :( Thank you... - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETLINK] Don't attach callback to a going-away netlink socket
Evgeniy Polyakov wrote: On Wed, Apr 18, 2007 at 12:32:40PM +0400, Pavel Emelianov ([EMAIL PROTECTED]) wrote: Evgeniy Polyakov wrote: On Wed, Apr 18, 2007 at 12:16:18PM +0400, Pavel Emelianov ([EMAIL PROTECTED]) wrote: Sorry, I forgot to put netdev and David in Cc when I first sent it. There is a race between netlink_dump_start() and netlink_release() that can lead to the situation when a netlink socket with non-zero callback is freed. Out of curiosity, why not to fix a netlink_dump_start() to remove callback in error path, since in 'no-error' path it removes it in Error path is not relevant here. The problem is that we keep a calback on a socket that is about to be freed. Yes, you are right, that it will not be freed in netlink_release(), but it will be freed in netlink_dump() after it is processed (in no-error path only though). But error path will leak it. On success path we would have a leaked packet in sk_write_queue, since we did't see it in skb_queue_purge() while doing netlink_release(). Of course we can place the struts in code to handle the case when we have a released socket with the attached callback, but it is more correct (IMHO) not to allow to attach the callbacks to dead sockets. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[BRIDGE] Unaligned access on IA64 when comparing ethernet addresses
From: Evgeny Kravtsunov [EMAIL PROTECTED] compare_ether_addr() implicitly requires that the addresses passed are 2-bytes aligned in memory. This is not true for br_stp_change_bridge_id() and br_stp_recalculate_bridge_id() in which one of the addresses is unsigned char *, and thus may not be 2-bytes aligned. Signed-off-by: Evgeny Kravtsunov [EMAIL PROTECTED] Signed-off-by: Kirill Korotaev [EMAIL PROTECTED] Signed-off-by: Pavel Emelianov [EMAIL PROTECTED] --- --- a/net/bridge/br_stp_if.c2006-09-20 07:42:06.0 +0400 +++ b/net/bridge/br_stp_if.c2007-04-13 12:28:08.0 +0400 @@ -124,7 +124,9 @@ void br_stp_disable_port(struct net_brid /* called under bridge lock */ void br_stp_change_bridge_id(struct net_bridge *br, const unsigned char *addr) { - unsigned char oldaddr[6]; + /* should be aligned on 2 bytes for compare_ether_addr() */ + unsigned short oldaddr_aligned[ETH_ALEN 1]; + unsigned char *oldaddr = (unsigned char *)oldaddr_aligned; struct net_bridge_port *p; int wasroot; @@ -149,11 +151,14 @@ void br_stp_change_bridge_id(struct net_ br_become_root_bridge(br); } -static const unsigned char br_mac_zero[6]; +/* should be aligned on 2 bytes for compare_ether_addr() */ +static const unsigned short br_mac_zero_aligned[ETH_ALEN 1]; /* called under bridge lock */ void br_stp_recalculate_bridge_id(struct net_bridge *br) { + const unsigned char *br_mac_zero = + (const unsigned char *)br_mac_zero_aligned; const unsigned char *addr = br_mac_zero; struct net_bridge_port *p; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html