Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-16 Thread Pavel Emelianov

Patrick McHardy wrote:

Pavel Emelianov wrote:

Patrick McHardy wrote:


+
+   if (tb[IFLA_IFNAME])
+   nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
+   else
+   snprintf(ifname, IFNAMSIZ, DRV_NAME %%d);


Does this work? The other device is not registered at this time, so I
think the allocated names could clash ..


Oh! And one more thing, I've just thought about - even two devices with
similar name would be ok when we have a namespaces and one of the ends of
this tunnel is to be in some other namespace rather that the init one.



Yes, but we don't have network namespaces yet. Which brings up another
question: is this driver of any use without namespaces?




Sure. For example, you can join two bridges together.

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-16 Thread Pavel Emelianov

Patrick McHardy wrote:

Pavel Emelianov wrote:

Patrick McHardy wrote:


Pavel Emelianov wrote:


+   if (data != NULL  data[VETH_INFO_PEER] != NULL) {
+   err = nla_parse_nested(tb, IFLA_INFO_MAX,
+   data[VETH_INFO_PEER], ifla_policy);
+   if (err  0)
+   return err;
+   }


One more suggestion regarding the PEER attribute: you only nest
IFLA attributes below it, but some information that might be
interesting to use on device creation is contained in ifinfomsg
(flags and ifindex). So I think it would be better to use a
complete message, including header.


I don't get it. Can you elaborate, please.



You don't have a struct ifinfomsg for the peer device. At some
point we might want to add support for specifying initial flags
for the device (some easily supportable ones are IFF_PROMISC,
IFF_ALLMULTI, IFF_NOARP) and ideally that should also be possible
for the peer device.

So I suggest you use a complete ifinfomsg including the header
instead of just the attributes for VETH_INFO_PEER.




I don't see any information from this struct being used on the
link-creation paths...

Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/2] Virtual ethernet device (v3)

2007-07-12 Thread Pavel Emelianov
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces but
it can be used as is as well.

Eric recently sent a similar driver called etun with the sysfs
interface. This implementation uses another interface - the 
RTM_NRELINK message introduced by Patric.

The newlink callback is organized that way to make it easy to 
create the peer device in the separate namespace when we have 
them in kernel.

Changes from v.2.1:
 * Made the generic routine for link creation to be used
   by veth driver, any other tunnel driver that needs to
   create several devices at once and rtnl_newlink() code.

Changes from v.2:
 * Rebase over latest netdev tree. No actual changes;
 * Small code rework.

Changes from v.1:
 * percpu statistics;
 * standard convention for nla policy names;
 * module alias added;
 * xmit function fixes noticed by Patric;
 * code cleanup.

The patch for an ip utility is also provided.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
Cc: Patrick McHardy [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Introduce the generic rtnl_create_link()

2007-07-12 Thread Pavel Emelianov
This routine gets the parsed rtnl attributes and creates a new
link with generic info (IFLA_LINKINFO policy). Its intention
is to help the drivers, that need to create several links at 
once (like VETH).

This is nothing but a copy-paste-ed part of rtnl_newlink() function
that is responsible for creation of new device.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

 include/net/rtnetlink.h |4 ++
 net/core/rtnetlink.c|   83 ++--
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 3861c05..8218288 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -78,6 +78,10 @@ extern void  __rtnl_link_unregister(struc
 extern int rtnl_link_register(struct rtnl_link_ops *ops);
 extern voidrtnl_link_unregister(struct rtnl_link_ops *ops);
 
+extern struct net_device *rtnl_create_link(char *ifname,
+   const struct rtnl_link_ops *ops, struct nlattr *tb[]);
+extern const struct nla_policy ifla_policy[IFLA_MAX+1];
+
 #define MODULE_ALIAS_RTNL_LINK(kind) MODULE_ALIAS(rtnl-link- kind)
 
 #endif
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 864cbdf..02e00e8 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -714,7 +714,7 @@ cont:
return skb-len;
 }
 
-static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
+const struct nla_policy ifla_policy[IFLA_MAX+1] = {
[IFLA_IFNAME]   = { .type = NLA_STRING, .len = IFNAMSIZ-1 },
[IFLA_ADDRESS]  = { .type = NLA_BINARY, .len = MAX_ADDR_LEN },
[IFLA_BROADCAST]= { .type = NLA_BINARY, .len = MAX_ADDR_LEN },
@@ -941,6 +941,50 @@ static int rtnl_dellink(struct sk_buff *
return 0;
 }
 
+struct net_device *rtnl_create_link(char *ifname,
+   const struct rtnl_link_ops *ops, struct nlattr *tb[])
+{
+   int err;
+   struct net_device *dev;
+
+   err = -ENOMEM;
+   dev = alloc_netdev(ops-priv_size, ifname, ops-setup);
+   if (!dev)
+   goto err;
+
+   if (strchr(dev-name, '%')) {
+   err = dev_alloc_name(dev, dev-name);
+   if (err  0)
+   goto err_free;
+   }
+
+   dev-rtnl_link_ops = ops;
+
+   if (tb[IFLA_MTU])
+   dev-mtu = nla_get_u32(tb[IFLA_MTU]);
+   if (tb[IFLA_ADDRESS])
+   memcpy(dev-dev_addr, nla_data(tb[IFLA_ADDRESS]),
+   nla_len(tb[IFLA_ADDRESS]));
+   if (tb[IFLA_BROADCAST])
+   memcpy(dev-broadcast, nla_data(tb[IFLA_BROADCAST]),
+   nla_len(tb[IFLA_BROADCAST]));
+   if (tb[IFLA_TXQLEN])
+   dev-tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
+   if (tb[IFLA_WEIGHT])
+   dev-weight = nla_get_u32(tb[IFLA_WEIGHT]);
+   if (tb[IFLA_OPERSTATE])
+   set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
+   if (tb[IFLA_LINKMODE])
+   dev-link_mode = nla_get_u8(tb[IFLA_LINKMODE]);
+
+   return dev;
+
+err_free:
+   free_netdev(dev);
+err:
+   return ERR_PTR(err);
+}
+
 static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh, void *arg)
 {
const struct rtnl_link_ops *ops;
@@ -1051,40 +1095,17 @@ replay:
 
if (!ifname[0])
snprintf(ifname, IFNAMSIZ, %s%%d, ops-kind);
-   dev = alloc_netdev(ops-priv_size, ifname, ops-setup);
-   if (!dev)
-   return -ENOMEM;
-
-   if (strchr(dev-name, '%')) {
-   err = dev_alloc_name(dev, dev-name);
-   if (err  0)
-   goto err_free;
-   }
-   dev-rtnl_link_ops = ops;
 
-   if (tb[IFLA_MTU])
-   dev-mtu = nla_get_u32(tb[IFLA_MTU]);
-   if (tb[IFLA_ADDRESS])
-   memcpy(dev-dev_addr, nla_data(tb[IFLA_ADDRESS]),
-  nla_len(tb[IFLA_ADDRESS]));
-   if (tb[IFLA_BROADCAST])
-   memcpy(dev-broadcast, nla_data(tb[IFLA_BROADCAST]),
-  nla_len(tb[IFLA_BROADCAST]));
-   if (tb[IFLA_TXQLEN])
-   dev-tx_queue_len = nla_get_u32(tb[IFLA_TXQLEN]);
-   if (tb[IFLA_WEIGHT])
-   dev-weight = nla_get_u32(tb[IFLA_WEIGHT]);
-   if (tb[IFLA_OPERSTATE])
-   set_operstate(dev, nla_get_u8(tb[IFLA_OPERSTATE]));
-   if (tb[IFLA_LINKMODE])
-   dev-link_mode = nla_get_u8(tb[IFLA_LINKMODE]);
+   dev = rtnl_create_link(ifname, ops, tb);
 
-   if (ops-newlink)
+   if (IS_ERR(dev))
+   err = PTR_ERR(dev);
+   else if (ops-newlink)
err = ops-newlink(dev, tb, data);
else

[PATCH 2/2] Virtual ethernet device driver

2007-07-12 Thread Pavel Emelianov
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces but
it can be used as is as well.

The newlink callback is organized that way to make it easy to 
create the peer device in the separate namespace when we have 
them in kernel.

This implementation uses another interface - the RTM_NRELINK
message introduced by Patric.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

 drivers/net/Kconfig  |6 
 drivers/net/Makefile |1 
 drivers/net/veth.c   |  426 +++
 include/net/veth.h   |   12 +
 4 files changed, 445 insertions(+)

diff-tree 293fc37c28eee0a27e19179ab0d809460c73eced (from 
15028aad00ddf241581fbe74a02ec89cbb28d35d)
Author: Pavel [EMAIL PROTECTED]
Date:   Thu Jul 12 13:05:44 2007 +0400

Veth support

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index d4e39ff..7bdc6e3 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -124,6 +124,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate Virtual ethernet device
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate General Instruments Surfboard 1000
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a2241e6..d7b1103 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -189,6 +189,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..caa7b62
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,426 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ * Author: Pavel Emelianov [EMAIL PROTECTED]
+ * Ethtool interface from: Eric W. Biederman [EMAIL PROTECTED]
+ *
+ */
+
+#include linux/list.h
+#include linux/netdevice.h
+#include linux/ethtool.h
+#include linux/etherdevice.h
+
+#include net/dst.h
+#include net/xfrm.h
+#include net/veth.h
+
+#define DRV_NAME   veth
+#define DRV_VERSION1.0
+
+struct veth_net_stats {
+   unsigned long   rx_packets;
+   unsigned long   tx_packets;
+   unsigned long   rx_bytes;
+   unsigned long   tx_bytes;
+   unsigned long   tx_dropped;
+};
+
+struct veth_priv {
+   struct net_device *peer;
+   struct net_device *dev;
+   struct list_head list;
+   struct veth_net_stats *stats;
+   unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { peer_ifindex },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+   cmd-supported  = 0;
+   cmd-advertising= 0;
+   cmd-speed  = SPEED_1;
+   cmd-duplex = DUPLEX_FULL;
+   cmd-port   = PORT_TP;
+   cmd-phy_address= 0;
+   cmd-transceiver= XCVR_INTERNAL;
+   cmd-autoneg= AUTONEG_DISABLE;
+   cmd-maxtxpkt   = 0;
+   cmd-maxrxpkt   = 0;
+   return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo 
*info)
+{
+   strcpy(info-driver, DRV_NAME);
+   strcpy(info-version, DRV_VERSION);
+   strcpy(info-fw_version, N/A);
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   switch(stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   break;
+   }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+   return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+   struct ethtool_stats *stats, u64 *data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   data[0] = priv-peer-ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   return priv-ip_summed == CHECKSUM_UNNECESSARY;
+}
+
+static int veth_set_rx_csum(struct net_device *dev, u32 data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   priv-ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE;
+   return 0;
+}
+
+static u32 veth_get_tx_csum(struct net_device *dev)
+{
+   return (dev-features  NETIF_F_NO_CSUM) != 0;
+}
+
+static int

[PATCH 0/2] Module for ip utility to support veth device (v.3)

2007-07-12 Thread Pavel Emelianov
The usage is
# ip link add [name] type veth [peer name] [mac mac] [peer_mac mac]

This consists of two parts:
1/2 makes some copy-paste changes in ip/iplink.c for generic CLI
attributes parsing,
2/2 the module itself.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Intruduce iplink_parse() routine

2007-07-12 Thread Pavel Emelianov
This routine parses CLI attributes, describing generic
link parameters such as name, address, etc.

This is mostly copy-pasted from iplink_modify().

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

 include/utils.h |3 +
 ip/iplink.c |  127 
 2 files changed, 76 insertions(+), 54 deletions(-)

diff --git a/include/utils.h b/include/utils.h
index a3fd335..3fd851d 100644
--- a/include/utils.h
+++ b/include/utils.h
@@ -146,4 +146,7 @@ extern int cmdlineno;
 extern size_t getcmdline(char **line, size_t *len, FILE *in);
 extern int makeargs(char *line, char *argv[], int maxargs);
 
+struct iplink_req;
+int iplink_parse(int argc, char **argv, struct iplink_req *req,
+   char **name, char **type, char **link, char **dev);
 #endif /* __UTILS_H__ */
diff --git a/ip/iplink.c b/ip/iplink.c
index cfacdab..9ee269c 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -142,140 +142,159 @@ static int iplink_have_newlink(void)
 }
 #endif /* ! IPLINK_IOCTL_COMPAT */
 
-static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
+struct iplink_req {
+   struct nlmsghdr n;
+   struct ifinfomsgi;
+   charbuf[1024];
+};
+
+int iplink_parse(int argc, char **argv, struct iplink_req *req,
+   char **name, char **type, char **link, char **dev)
 {
+   int ret, len;
+   char abuf[32];
int qlen = -1;
int mtu = -1;
-   int len;
-   char abuf[32];
-   char *dev = NULL;
-   char *name = NULL;
-   char *link = NULL;
-   char *type = NULL;
-   struct link_util *lu = NULL;
-   struct {
-   struct nlmsghdr n;
-   struct ifinfomsgi;
-   charbuf[1024];
-   } req;
 
-   memset(req, 0, sizeof(req));
-
-   req.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
-   req.n.nlmsg_flags = NLM_F_REQUEST|flags;
-   req.n.nlmsg_type = cmd;
-   req.i.ifi_family = preferred_family;
+   ret = argc;
 
while (argc  0) {
if (strcmp(*argv, up) == 0) {
-   req.i.ifi_change |= IFF_UP;
-   req.i.ifi_flags |= IFF_UP;
+   req-i.ifi_change |= IFF_UP;
+   req-i.ifi_flags |= IFF_UP;
} else if (strcmp(*argv, down) == 0) {
-   req.i.ifi_change |= IFF_UP;
-   req.i.ifi_flags = ~IFF_UP;
+   req-i.ifi_change |= IFF_UP;
+   req-i.ifi_flags = ~IFF_UP;
} else if (strcmp(*argv, name) == 0) {
NEXT_ARG();
-   name = *argv;
+   *name = *argv;
} else if (matches(*argv, link) == 0) {
NEXT_ARG();
-   link = *argv;
+   *link = *argv;
} else if (matches(*argv, address) == 0) {
NEXT_ARG();
len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
-   addattr_l(req.n, sizeof(req), IFLA_ADDRESS, abuf, len);
+   addattr_l(req-n, sizeof(req), IFLA_ADDRESS, abuf, 
len);
} else if (matches(*argv, broadcast) == 0 ||
-  strcmp(*argv, brd) == 0) {
+   strcmp(*argv, brd) == 0) {
NEXT_ARG();
len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
-   addattr_l(req.n, sizeof(req), IFLA_BROADCAST, abuf, 
len);
+   addattr_l(req-n, sizeof(req), IFLA_BROADCAST, abuf, 
len);
} else if (matches(*argv, txqueuelen) == 0 ||
-  strcmp(*argv, qlen) == 0 ||
-  matches(*argv, txqlen) == 0) {
+   strcmp(*argv, qlen) == 0 ||
+   matches(*argv, txqlen) == 0) {
NEXT_ARG();
if (qlen != -1)
duparg(txqueuelen, *argv);
if (get_integer(qlen,  *argv, 0))
invarg(Invalid \txqueuelen\ value\n, *argv);
-   addattr_l(req.n, sizeof(req), IFLA_TXQLEN, qlen, 4);
+   addattr_l(req-n, sizeof(req), IFLA_TXQLEN, qlen, 4);
} else if (strcmp(*argv, mtu) == 0) {
NEXT_ARG();
if (mtu != -1)
duparg(mtu, *argv);
if (get_integer(mtu, *argv, 0))
invarg(Invalid \mtu\ value\n, *argv);
-   addattr_l(req.n, sizeof(req), IFLA_MTU, mtu, 4);
+   addattr_l(req-n, sizeof(req), IFLA_MTU, mtu, 4);
} else if (strcmp(*argv, multicast) == 0

Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-12 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 +static int veth_newlink(struct net_device *dev,
 + struct nlattr *tb[], struct nlattr *data[])
 +{
 +int err;
 +struct net_device *peer;
 +struct veth_priv *priv;
 +char ifname[IFNAMSIZ];
 +
 +/*
 + * prepare the devices info
 + */
 +
 +if (tb[IFLA_ADDRESS] == NULL)
 +random_ether_addr(dev-dev_addr);
 +
 +if (data != NULL  data[VETH_INFO_PEER] != NULL) {
 +err = nla_parse_nested(tb, IFLA_INFO_MAX,
 +data[VETH_INFO_PEER], ifla_policy);
 +if (err  0)
 +return err;
 +}
 
 
 Not having a peer should be an error, no?

No. That's the intention - if the user doesn't specify peer in the
command line then two _identical_ devices are created. Of course, if
he specifies one name - there'll be a collision, but one can say
my_own_veth_number_%d and everything will be ok. Or just use the 
default name provided. E.g. ip link add type veth will send a packet
with data[VETH_INFO_PEER} == NULL, but this is OK! User just wants a 
default tunnel and he will get it :)

Does this answer your second comment below?

 +
 +if (tb[IFLA_IFNAME])
 +nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
 +else
 +snprintf(ifname, IFNAMSIZ, DRV_NAME %%d);
 
 
 Does this work? The other device is not registered at this time, so I
 think the allocated names could clash ..
 
 If it does work you could also perform name allocation in the
 rtnl_create_link function.
 
 
 +
 +peer = rtnl_create_link(ifname, veth_link_ops, tb);
 +if (IS_ERR(peer))
 +return PTR_ERR(peer);
 +
 +if (tb[IFLA_ADDRESS] == NULL)
 +random_ether_addr(peer-dev_addr);
 +
 +/*
 + * register devices
 + */
 +
 +err = register_netdevice(peer);
 +if (err  0)
 +goto err_register_peer;
 +
 +err = register_netdevice(dev);
 +if (err  0)
 +goto err_register_dev;
 +
 +/*
 + * tie the deviced together
 + */
 +
 +priv = netdev_priv(dev);
 +priv-dev = dev;
 +priv-peer = peer;
 +list_add(priv-list, veth_list);
 +
 +priv = netdev_priv(peer);
 +priv-dev = peer;
 +priv-peer = dev;
 +INIT_LIST_HEAD(priv-list);
 +return 0;
 +
 +err_register_dev:
 +unregister_netdevice(peer);
 +return err;
 +
 +err_register_peer:
 +free_netdev(peer);
 +return err;
 +}
 
 +static __exit void veth_exit(void)
 +{
 +struct veth_priv *priv, *next;
 +
 +rtnl_lock();
 +__rtnl_link_unregister(veth_link_ops);
 +
 +list_for_each_entry_safe(priv, next, veth_list, list)
 +veth_dellink(priv-dev);
 +rtnl_unlock();
 
 
 Devices are unregistered automatically through the dellink function,
 rtnl_link_unregister(..) is enough.

OK. This looks like a minor and not-significant comment, so
do I need to resend the patch or David is OK to take it and
I will send an incremental one?

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-12 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 +static int veth_newlink(struct net_device *dev,
 + struct nlattr *tb[], struct nlattr *data[])
 +{
 +int err;
 +struct net_device *peer;
 +struct veth_priv *priv;
 +char ifname[IFNAMSIZ];
 +
 +/*
 + * prepare the devices info
 + */
 +
 +if (tb[IFLA_ADDRESS] == NULL)
 +random_ether_addr(dev-dev_addr);
 +
 +if (data != NULL  data[VETH_INFO_PEER] != NULL) {
 +err = nla_parse_nested(tb, IFLA_INFO_MAX,
 +data[VETH_INFO_PEER], ifla_policy);
 +if (err  0)
 +return err;
 +}
 
 
 Not having a peer should be an error, no?
 
 +
 +if (tb[IFLA_IFNAME])
 +nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
 +else
 +snprintf(ifname, IFNAMSIZ, DRV_NAME %%d);
 
 
 Does this work? The other device is not registered at this time, so I
 think the allocated names could clash ..

Oh! And one more thing, I've just thought about - even two devices with
similar name would be ok when we have a namespaces and one of the ends of
this tunnel is to be in some other namespace rather that the init one.

 If it does work you could also perform name allocation in the
 rtnl_create_link function.
 
 
 +
 +peer = rtnl_create_link(ifname, veth_link_ops, tb);
 +if (IS_ERR(peer))
 +return PTR_ERR(peer);
 +
 +if (tb[IFLA_ADDRESS] == NULL)
 +random_ether_addr(peer-dev_addr);
 +
 +/*
 + * register devices
 + */
 +
 +err = register_netdevice(peer);
 +if (err  0)
 +goto err_register_peer;
 +
 +err = register_netdevice(dev);
 +if (err  0)
 +goto err_register_dev;
 +
 +/*
 + * tie the deviced together
 + */
 +
 +priv = netdev_priv(dev);
 +priv-dev = dev;
 +priv-peer = peer;
 +list_add(priv-list, veth_list);
 +
 +priv = netdev_priv(peer);
 +priv-dev = peer;
 +priv-peer = dev;
 +INIT_LIST_HEAD(priv-list);
 +return 0;
 +
 +err_register_dev:
 +unregister_netdevice(peer);
 +return err;
 +
 +err_register_peer:
 +free_netdev(peer);
 +return err;
 +}
 
 +static __exit void veth_exit(void)
 +{
 +struct veth_priv *priv, *next;
 +
 +rtnl_lock();
 +__rtnl_link_unregister(veth_link_ops);
 +
 +list_for_each_entry_safe(priv, next, veth_list, list)
 +veth_dellink(priv-dev);
 +rtnl_unlock();
 
 
 Devices are unregistered automatically through the dellink function,
 rtnl_link_unregister(..) is enough.
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-12 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Patrick McHardy wrote:

 Pavel Emelianov wrote:

 +static int veth_newlink(struct net_device *dev,
 +   struct nlattr *tb[], struct nlattr *data[])
 +{
 +  int err;
 +  struct net_device *peer;
 +  struct veth_priv *priv;
 +  char ifname[IFNAMSIZ];
 +
 +  /*
 +   * prepare the devices info
 +   */
 +
 +  if (tb[IFLA_ADDRESS] == NULL)
 +  random_ether_addr(dev-dev_addr);
 +
 +  if (data != NULL  data[VETH_INFO_PEER] != NULL) {
 +  err = nla_parse_nested(tb, IFLA_INFO_MAX,
 +  data[VETH_INFO_PEER], ifla_policy);
 +  if (err  0)
 +  return err;
 +  }

 Not having a peer should be an error, no?

 No. That's the intention - if the user doesn't specify peer in the
 command line then two _identical_ devices are created. Of course, if
 he specifies one name - there'll be a collision, but one can say
 my_own_veth_number_%d and everything will be ok. Or just use the 
 default name provided. E.g. ip link add type veth will send a packet
 with data[VETH_INFO_PEER} == NULL, but this is OK! User just wants a 
 default tunnel and he will get it :)
 
 I see.
 
 Does this answer your second comment below?
 
 
 No, to get unique names the sequence has to be:
 
 dev_alloc_name
 register_netdevice
 dev_alloc_name
 register_netdevice
 
 But you have:
 
 dev_alloc_name
 dev_alloc_name (- might allocate same name as first call)
 register_netdevice
 register_netdevice

Oops :) You're right. That's the problem. I was carried away by
testing the peer options and checking for names rather than
veth%d to work...

By the way, that will create some problems. You see, your patches
imply that the register_netdevice() will be called at the very end
of the -newlink callback. Otherwise, the error path of any code
following the registering will have to call unregister_netdevice()
which will BUG() in free_netdev() in rtnl_newlink() - the device
state will be neither UNINITIALIZED nor UNREGISTERED :(

 +static __exit void veth_exit(void)
 +{
 +  struct veth_priv *priv, *next;
 +
 +  rtnl_lock();
 +  __rtnl_link_unregister(veth_link_ops);
 +
 +  list_for_each_entry_safe(priv, next, veth_list, list)
 +  veth_dellink(priv-dev);
 +  rtnl_unlock();

 Devices are unregistered automatically through the dellink function,
 rtnl_link_unregister(..) is enough.

 OK. This looks like a minor and not-significant comment, so
 do I need to resend the patch or David is OK to take it and
 I will send an incremental one?
 
 
 An incremental patch for this is fine I guess, your code is correct,
 its merely a simplification.
 

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-12 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Patrick McHardy wrote:

 No, to get unique names the sequence has to be:

 dev_alloc_name
 register_netdevice
 dev_alloc_name
 register_netdevice

 But you have:

 dev_alloc_name
 dev_alloc_name (- might allocate same name as first call)
 register_netdevice
 register_netdevice

 Oops :) You're right. That's the problem. I was carried away by
 testing the peer options and checking for names rather than
 veth%d to work...

 By the way, that will create some problems. You see, your patches
 imply that the register_netdevice() will be called at the very end
 of the -newlink callback. Otherwise, the error path of any code
 following the registering will have to call unregister_netdevice()
 which will BUG() in free_netdev() in rtnl_newlink() - the device
 state will be neither UNINITIALIZED nor UNREGISTERED :(
 
 
 Thats true. I think you could do:
 
 - use name of the supplied device for the second device
 - register second device
 - allocate new name for first device
 - register first device
 

Not 100% like this (since the first name is expected to be associated
with the device with the first address), but reallocating the name for
the first device sounds like a good idea.

David, which way would be more preferable - to fix booth issues pointed
by Patric and send the v.4 patch, or to wait for this patch to be committed
and then send two incremental fixes?

Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] Virtual ethernet device driver

2007-07-12 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 +if (data != NULL  data[VETH_INFO_PEER] != NULL) {
 +err = nla_parse_nested(tb, IFLA_INFO_MAX,
 +data[VETH_INFO_PEER], ifla_policy);
 +if (err  0)
 +return err;
 +}
 
 
 One more suggestion regarding the PEER attribute: you only nest
 IFLA attributes below it, but some information that might be
 interesting to use on device creation is contained in ifinfomsg
 (flags and ifindex). So I think it would be better to use a
 complete message, including header.

I don't get it. Can you elaborate, please.

Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Virtual ethernet device (v2.1)

2007-07-11 Thread Pavel Emelianov
=
Since the netlink NEWLINK interface is now in the netdev tree
I resend the veth driver patch as submittion for inclusion.
=

LOG:

Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces but
it can be used as is as well.

Eric recently sent a similar driver called etun with the sysfs
interface. This implementation uses another interface - the 
RTM_NRELINK message introduced by Patric.

The newlink callback is organized that way to make it easy to 
create the peer device in the separate namespace when we have 
them in kernel.

Changes from v.2:
 * Rebase over latest netdev tree. No actual changes;
 * Small code rework.

Changes from v.1:
 * percpu statistics;
 * standard convention for nla policy names;
 * module alias added;
 * xmit function fixes noticed by Patric;
 * code cleanup.

The patch for an ip utility is also provided.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

 drivers/net/Kconfig  |6 
 drivers/net/Makefile |1 
 drivers/net/veth.c   |  452 +++
 include/net/veth.h   |   14 +
 4 files changed, 473 insertions(+)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index d4e39ff..7bdc6e3 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -124,6 +124,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate Virtual ethernet device
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate General Instruments Surfboard 1000
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a2241e6..d7b1103 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -189,6 +189,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..bf56e0e
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,452 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ * Author: Pavel Emelianov [EMAIL PROTECTED]
+ * Ethtool interface from: Eric W. Biederman [EMAIL PROTECTED]
+ *
+ */
+
+#include linux/list.h
+#include linux/netdevice.h
+#include linux/ethtool.h
+#include linux/etherdevice.h
+
+#include net/dst.h
+#include net/xfrm.h
+#include net/veth.h
+
+#define DRV_NAME   veth
+#define DRV_VERSION1.0
+
+struct veth_net_stats {
+   unsigned long   rx_packets;
+   unsigned long   tx_packets;
+   unsigned long   rx_bytes;
+   unsigned long   tx_bytes;
+   unsigned long   tx_dropped;
+};
+
+struct veth_priv {
+   struct net_device *peer;
+   struct net_device *dev;
+   struct list_head list;
+   struct veth_net_stats *stats;
+   unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { peer_ifindex },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+   cmd-supported  = 0;
+   cmd-advertising= 0;
+   cmd-speed  = SPEED_1;
+   cmd-duplex = DUPLEX_FULL;
+   cmd-port   = PORT_TP;
+   cmd-phy_address= 0;
+   cmd-transceiver= XCVR_INTERNAL;
+   cmd-autoneg= AUTONEG_DISABLE;
+   cmd-maxtxpkt   = 0;
+   cmd-maxrxpkt   = 0;
+   return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo 
*info)
+{
+   strcpy(info-driver, DRV_NAME);
+   strcpy(info-version, DRV_VERSION);
+   strcpy(info-fw_version, N/A);
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   switch(stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   break;
+   }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+   return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+   struct ethtool_stats *stats, u64 *data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   data[0] = priv-peer-ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+   struct veth_priv *priv

[PATCH] Module for ip utility to support veth device (v.2.1)

2007-07-11 Thread Pavel Emelianov
The usage is
# ip link add [name] type veth [peer name] [mac mac] [peer_mac mac]

This version doesn't include the fix for ip/iplink.c as Patrick
said that he had included it into his patches already.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

 Makefile|6 +++-
 link_veth.c |   89 
 veth.h  |   14 +
 3 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/ip/Makefile b/ip/Makefile
index 9a5bfe3..b46bce3 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o
 ALLOBJ=$(IPOBJ) $(RTMONOBJ)
 SCRIPTS=ifcfg rtpr routel routef
 TARGETS=ip rtmon
+LIBS=link_veth.so
 
-all: $(TARGETS) $(SCRIPTS)
+all: $(TARGETS) $(SCRIPTS) $(LIBS)
 
 ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL)
 
@@ -24,3 +25,6 @@ clean:
 
 LDLIBS += -ldl
 LDFLAGS+= -Wl,-export-dynamic
+
+%.so: %.c
+   $(CC) $(CFLAGS) -shared $ -o $@
diff --git a/ip/link_veth.c b/ip/link_veth.c
new file mode 100644
index 000..7760684
--- /dev/null
+++ b/ip/link_veth.c
@@ -0,0 +1,89 @@
+/*
+ * link_veth.c veth driver module
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Pavel Emelianov [EMAIL PROTECTED]
+ *
+ */
+
+#include string.h
+
+#include utils.h
+#include ip_common.h
+#include veth.h
+
+#define ETH_ALEN   6
+
+static void usage(void)
+{
+   printf(Usage: ip link add ... type veth 
+   [peer peer-name] [mac mac] [peer_mac mac]\n);
+}
+
+static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *hdr)
+{
+   __u8 mac[ETH_ALEN];
+
+   for (; argc != 0; argv++, argc--) {
+   if (strcmp(*argv, peer) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER,
+   *argv, strlen(*argv));
+
+   continue;
+   }
+
+   if (strcmp(*argv, mac) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   if (strcmp(*argv, peer_mac) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   usage();
+   return -1;
+   }
+
+   return 0;
+}
+
+struct link_util veth_link_util = {
+   .id = veth,
+   .parse_opt = veth_parse_opt,
+};
diff --git a/ip/veth.h b/ip/veth.h
new file mode 100644
index 000..b84a530
--- /dev/null
+++ b/ip/veth.h
@@ -0,0 +1,14 @@
+#ifndef __NET_VETH_H__
+#define __NET_VETH_H__
+
+enum {
+   VETH_INFO_UNSPEC,
+   VETH_INFO_MAC,
+   VETH_INFO_PEER,
+   VETH_INFO_PEER_MAC,
+
+   __VETH_INFO_MAX
+#define VETH_INFO_MAX  (__VETH_INFO_MAX - 1)
+};
+
+#endif
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet device (v2.1)

2007-07-11 Thread Pavel Emelianov
Christian Borntraeger wrote:
 Am Mittwoch, 11. Juli 2007 schrieb Pavel Emelianov:
  drivers/net/veth.c   |  452 
  include/net/veth.h   |   14 +
 
 I know, I am late in the game, but wont the name collide somewhat with 
 drivers/net/ibmveth.h, drivers/net/iseries_veth.c, and drivers/net/ibmveth.c?

David asked the same thing after v.1. The answer is :
No it will not. These devices use eth%d strings to generate their names.

 Christian
 

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet device (v2.1)

2007-07-11 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 +static const struct nla_policy veth_policy[VETH_INFO_MAX + 1] = {
 +[VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN },
 +[VETH_INFO_PEER]= { .type = NLA_STRING },
 +[VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN },
 +};
 
 
 Looks good, just one question. What happended to the IFLA_PARTNER
 attribute idea? I have a patch to allow specifying the initial
 MAC address for a device, IFLA_PARTNER would make the whole thing
 symetrical.

Well, the notion of a partner is not applicable to any generic link
unlike the device name, physical (MAC) address or MTU value. So i 
think that it's better to keep this as a specific driver information 
not to confuse the generic code. I think it's worth making this as 
some more generic code than it is now, but since this driver is the 
only user of partner maybe it's better not to make it right now. 
Later, when we do know what do we want partner to be.

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet device (v2.1)

2007-07-11 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Patrick McHardy wrote:

 Pavel Emelianov wrote:

 +static const struct nla_policy veth_policy[VETH_INFO_MAX + 1] = {
 +  [VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN },
 +  [VETH_INFO_PEER]= { .type = NLA_STRING },
 +  [VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN },
 +};

 Looks good, just one question. What happended to the IFLA_PARTNER
 attribute idea? I have a patch to allow specifying the initial
 MAC address for a device, IFLA_PARTNER would make the whole thing
 symetrical.

 Well, the notion of a partner is not applicable to any generic link
 unlike the device name, physical (MAC) address or MTU value. So i 
 think that it's better to keep this as a specific driver information 
 not to confuse the generic code. I think it's worth making this as 
 some more generic code than it is now, but since this driver is the 
 only user of partner maybe it's better not to make it right now. 
 Later, when we do know what do we want partner to be.
 
 
 Mhh doing it later means dealing with compatibility issues, which
 is why I'm asking now. We currently support IFLA_NAME, IFLA_MTU,

Oh, I see.

 IFLA_TXQLEN, IFLA_WEIGTH, IFLA_OPERSTATE and IFLA_LINKMODE, and
 with my patch additionally IFLA_ADDRESS and IFLA_BROADCAST.
 AFAICT they are all applicable for the partner link as well.

Agree. Maybe it is better to make some generic routine to create the
device with the parameters specified in the netlink packet. Then the
generic code creates one end of a tunnel and calls -new_link callback.
This callback extracts the PARTNER packet part and calls this generic
routine to create the second pair.

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet device (v2.1)

2007-07-11 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Patrick McHardy wrote:

 Mhh doing it later means dealing with compatibility issues, which
 is why I'm asking now. We currently support IFLA_NAME, IFLA_MTU,

 Oh, I see.


 IFLA_TXQLEN, IFLA_WEIGTH, IFLA_OPERSTATE and IFLA_LINKMODE, and
 with my patch additionally IFLA_ADDRESS and IFLA_BROADCAST.
 AFAICT they are all applicable for the partner link as well.

 Agree. Maybe it is better to make some generic routine to create the
 device with the parameters specified in the netlink packet. Then the
 generic code creates one end of a tunnel and calls -new_link callback.
 This callback extracts the PARTNER packet part and calls this generic
 routine to create the second pair.
 
 
 Something like that. Moving the part between NLM_F_CREATE and the
 ops-newlink call of rtnl_newlink to a new function should work.
 
 For now you could even parse the IFLA_PARTNER attribute and nested
 IFLA_NAME/IFLA_ADDRESS attributes yourself and ignore the rest,
 this will at least leave us the option of handling it generically
 later.

OK. I'll try to make the generic call. Could you please send me the
patches with IFLA_ADDRESS support for booth kernel and ip utility.

Thanks,
Pavel

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel

2007-06-13 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Patrick McHardy wrote:

 
 +  skb-pkt_type = PACKET_HOST;
 +  skb-protocol = eth_type_trans(skb, rcv);
 +  if (dev-features  NETIF_F_NO_CSUM)
 +  skb-ip_summed = rcv_priv-ip_summed;
 +
 +  dst_release(skb-dst);
 +  skb-dst = NULL;
 +
 +  secpath_reset(skb);
 +  nf_reset(skb);

 Is skb-mark supposed to survive communication between different
 namespaces?

 I guess it must not. Thanks.
 
 
 I guess there are a few others that should be cleared as well,
 like the tc related members, secmark, ipvs_property, ...

It seems like we are about to have some skb_reset_all() routine to
make the skb look like newborn.

 The rtnl_link codes looks fine. I don't like the VETH_INFO_MAC attribute
 very much though, we already have a generic device attribute for MAC
 addresses. Of course that only allows you to supply one MAC address, so
 I'm wondering what you think of allocating only a single device per
 newlink operation and binding them in a seperate enslave operation?

 I did this at the very first version, but Alexey showed me that this
 would be wrong. Look. When we create the second device it must be in
 the other namespace as it is useless to have them in one namespace.
 But if we have the device in the other namespace the RTNL_NEWLINK 
 message from kernel would come into this namespace thus confusing ip
 utility in the init namespace. Creating the device in the init ns and
 moving it into the new one is rather a complex task.

 But with such approach the creation looks really logical. We send a 
 packet to the kernel and have a single response about the new device 
 appearance. At the same time we have a RTNL_NEWLINK message arrived at 
 the destination namespace informing that a new device has appeared 
 there as well.
 
 
 The question is how to proceed. I haven't read all mails yet, but it
 seems there is some disagreement about whether to create all devices
 in the same namespace and move them later or create them directly in

The agreement was that we can make any of the above. We can create
booth devices in the init namespace and then move one of them into the
desired namespace, or we can explicitly specify which namespace to create
the pair in.

 their target namespace. For now I guess it doesn't matter much, so
 can everyone agree to adding a IFLA_PARTNER attribute that includes
 a complete ifinfomsg and the attributes and you later decide how to
 handle namespaces?
 
 +enum {
 +  VETH_INFO_UNSPEC,
 +  VETH_INFO_MAC,
 +  VETH_INFO_PEER,
 +  VETH_INFO_PEER_MAC,
 +
 +  VETH_INFO_MAX
 +};
 Please follow the

 #define VETH_INFO_MAX   (__VETH_INFO_MAX - 1)

 convention here.

 Could you please clarify this point. I saw the lines
 enum {
  ...
  RTNL_NEWLINK
 #define RTNL_NEWLINK RTNL_NEWLINK
  ...
 }
 and had my brains exploded imagining what this would mean :(
 
 
 Thats just to make the new attributes visible as preprocessor
 symbols so userspace can use them for #ifdefs. We usually use
 it when adding new attributes/message types, but its not necessary
 for the initial set of attributes if you already have some other
 preprocessor-visisble symbol (like VETH_INFO_MAX) userspace can
 use.
 
 What I was refering to is this convention:
 
 enum {
 ...
 __IFLA_MAX
 };
 
 #define IFLA_MAX (__IFLA_MAX - 1)
 
 Which is used to make sure that IFLA_MAX is really the max and
 not max + 1 and additionally people won't forget to update it.

OK thanks. This is already done in the v2.

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel

2007-06-13 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Patrick McHardy wrote:

 The question is how to proceed. I haven't read all mails yet, but it
 seems there is some disagreement about whether to create all devices
 in the same namespace and move them later or create them directly in

 The agreement was that we can make any of the above. We can create
 booth devices in the init namespace and then move one of them into the
 desired namespace, or we can explicitly specify which namespace to create
 the pair in.
 
 
 I'm going to push my latest patches to Dave today, the easiest way is
 probably is you just add whatever you need to the API afterwards.
 
 

OK. Dave didn't object against the driver. Hope he will accept it as well.

I have also found a BUG in your API. Look, when you declare the alias with
the MODULE_ALIAS_RTNL_LINK in drivers you use strings as an argument. But 
this macro stringifyes the argument itself which results in bad aliases.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index d744198..f627e1f 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -77,6 +77,6 @@ extern void   __rtnl_link_unregister(struc
 extern int rtnl_link_register(struct rtnl_link_ops *ops);
 extern voidrtnl_link_unregister(struct rtnl_link_ops *ops);
 
-#define MODULE_ALIAS_RTNL_LINK(name) MODULE_ALIAS(rtnl-link- #name)
+#define MODULE_ALIAS_RTNL_LINK(name) MODULE_ALIAS(rtnl-link- name)
 
 #endif
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-08 Thread Pavel Emelianov
Ben Greear wrote:

[snip]

 I would also like some way to identify veth from other device types,
 preferably
 something like a value in sysfs.   However, that should not hold up
 

 We can do this with ethtool. It can get and print the driver name of
 the device.
   
 I think I'd like something in sysfs that we could query for any
 interface.  Possible return
 strings could be:
 VLAN
 VETH
 ETH
 PPP
 BRIDGE
 AP /* wifi access point interface */
 STA /* wifi station */
 
 
 I will cook up a patch for consideration after veth goes in.
 

Ben, could you please tell what sysfs features do you
plan to implement?

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel

2007-06-07 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Veth stands for Virtual ETHernet. It is a simple tunnel driver
 that works at the link layer and looks like a pair of ethernet
 devices interconnected with each other.

 Mainly it allows to communicate between network namespaces but
 it can be used as is as well.

 Eric recently sent a similar driver called etun. This
 implementation uses another interface - the RTM_NRELINK
 message introduced by Patric. The patch fits today netdev
 tree with Patrick's patches.

 The newlink callback is organized that way to make it easy
 to create the peer device in the separate namespace when we
 have them in kernel.

 
 +struct veth_priv {
 +struct net_device *peer;
 +struct net_device *dev;
 +struct list_head list;
 +struct net_device_stats stats;
 
 
 You can use dev-stats instead.

OK. Actually I planned to use percpu stats to reduce cacheline
trashing (Stephen has noticed it also). The reason I didn't do it
here is that the patch would look more complicated, but I wanted to
show and approve the netlink interface first.

 +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
 +{
 +struct net_device *rcv = NULL;
 +struct veth_priv *priv, *rcv_priv;
 +int length;
 +
 +skb_orphan(skb);
 +
 +priv = netdev_priv(dev);
 +rcv = priv-peer;
 +rcv_priv = netdev_priv(rcv);
 +
 +if (!(rcv-flags  IFF_UP))
 +goto outf;
 +
 +skb-dev = rcv;
 
 eth_type_trans already sets skb-dev.

Ok. Thanks.

 +skb-pkt_type = PACKET_HOST;
 +skb-protocol = eth_type_trans(skb, rcv);
 +if (dev-features  NETIF_F_NO_CSUM)
 +skb-ip_summed = rcv_priv-ip_summed;
 +
 +dst_release(skb-dst);
 +skb-dst = NULL;
 +
 +secpath_reset(skb);
 +nf_reset(skb);
 
 
 Is skb-mark supposed to survive communication between different
 namespaces?

I guess it must not. Thanks.

 +static const struct nla_policy veth_policy[VETH_INFO_MAX] = {
 +[VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN },
 +[VETH_INFO_PEER]= { .type = NLA_STRING },
 +[VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN },
 +};
 
 
 The rtnl_link codes looks fine. I don't like the VETH_INFO_MAC attribute
 very much though, we already have a generic device attribute for MAC
 addresses. Of course that only allows you to supply one MAC address, so
 I'm wondering what you think of allocating only a single device per
 newlink operation and binding them in a seperate enslave operation?

I did this at the very first version, but Alexey showed me that this
would be wrong. Look. When we create the second device it must be in
the other namespace as it is useless to have them in one namespace.
But if we have the device in the other namespace the RTNL_NEWLINK 
message from kernel would come into this namespace thus confusing ip
utility in the init namespace. Creating the device in the init ns and
moving it into the new one is rather a complex task.

But with such approach the creation looks really logical. We send a 
packet to the kernel and have a single response about the new device 
appearance. At the same time we have a RTNL_NEWLINK message arrived at 
the destination namespace informing that a new device has appeared 
there as well.

 +enum {
 +VETH_INFO_UNSPEC,
 +VETH_INFO_MAC,
 +VETH_INFO_PEER,
 +VETH_INFO_PEER_MAC,
 +
 +VETH_INFO_MAX
 +};
 
 Please follow the
 
 #define VETH_INFO_MAX (__VETH_INFO_MAX - 1)
 
 convention here.

Could you please clarify this point. I saw the lines
enum {
...
RTNL_NEWLINK
#define RTNL_NEWLINK RTNL_NEWLINK
...
}
and had my brains exploded imagining what this would mean :(

 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel

2007-06-07 Thread Pavel Emelianov
 I did this at the very first version, but Alexey showed me that this
 would be wrong. Look. When we create the second device it must be in
 the other namespace as it is useless to have them in one namespace.
 But if we have the device in the other namespace the RTNL_NEWLINK
 message from kernel would come into this namespace thus confusing ip
 utility in the init namespace. Creating the device in the init ns and
 moving it into the new one is rather a complex task.
   
 Pavel,
 
 moving the netdevice to another namespace is not a complex task. Eric
 Biederman did it in its patchset ( cf.  http://lxc.sf.net/network )

By saying complex I didn't mean that this is difficult to implement,
but that it consists (must consist) of many stages. I.e. composite.
Making the device right in the namespace is liter.

 When the pair device is created, both extremeties are into the init
 namespace and you can choose to which namespace to move one extremity.

I do not mind that.

 When the network namespace dies, the netdev is moved back to the init
 namespace.
 That facilitate network device management.
 
 Concerning netlink events, this is automatically generated when the
 network device is moved through namespaces.
 
 IMHO, we should have the network device movement between namespaces in
 order to be able to move a physical network device too (eg. you have 4
 NIC and you want to create 3 containers and assign 3 NIC to each of them)

Agree. Moving the devices is a must-have functionality.

I do not mind making the pair in the init namespace and move the second
one into the desired namespace. But if we *always* will have two ends in
different namespaces what to complicate things for?

Thanks,
Pavel
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Virtual ethernet tunnel (v.2)

2007-06-07 Thread Pavel Emelianov
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces 
but it can be used as is as well.

Eric recently sent a similar driver called etun. This 
implementation uses another interface - the RTM_NRELINK 
message introduced by Patric.

The newlink callback is organized that way to make it easy
to create the peer device in the separate namespace when we
have them in kernel.

Changes from v.1:
 * percpu statistics;
 * standard convention for nla policy names;
 * module alias added;
 * xmit function fixes noticed by Patric;
 * code cleanup.

The patch for an ip utility is also provided.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

Since ethtool interface was taken from Eric's patch, I think
that he would like to see his Signed-off line as well (however
he didn't answer yesterday).

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7d57f4a..7e144be 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate Virtual ethernet device
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate General Instruments Surfboard 1000
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a77affa..4764119 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..e7ad43d
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,442 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ *  Author: Pavel Emelianov [EMAIL PROTECTED]
+ *
+ */
+
+#include linux/list.h
+#include linux/netdevice.h
+#include linux/ethtool.h
+#include linux/etherdevice.h
+
+#include net/dst.h
+#include net/xfrm.h
+#include net/veth.h
+
+#define DRV_NAME   veth
+#define DRV_VERSION1.0
+
+struct veth_device_stats {
+   unsigned long   rx_packets;
+   unsigned long   tx_packets;
+   unsigned long   rx_bytes;
+   unsigned long   tx_bytes;
+   unsigned long   tx_dropped;
+};
+
+struct veth_priv {
+   struct net_device *peer;
+   struct net_device *dev;
+   struct list_head list;
+   struct veth_device_stats *stats;
+   unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { peer_ifindex },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+   cmd-supported  = 0;
+   cmd-advertising= 0;
+   cmd-speed  = SPEED_1;
+   cmd-duplex = DUPLEX_FULL;
+   cmd-port   = PORT_TP;
+   cmd-phy_address= 0;
+   cmd-transceiver= XCVR_INTERNAL;
+   cmd-autoneg= AUTONEG_DISABLE;
+   cmd-maxtxpkt   = 0;
+   cmd-maxrxpkt   = 0;
+   return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo 
*info)
+{
+   strcpy(info-driver, DRV_NAME);
+   strcpy(info-version, DRV_VERSION);
+   strcpy(info-fw_version, N/A);
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   switch(stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   break;
+   }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+   return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+   struct ethtool_stats *stats, u64 *data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   data[0] = priv-peer-ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   return priv-ip_summed == CHECKSUM_UNNECESSARY;
+}
+
+static int veth_set_rx_csum(struct net_device *dev, u32 data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   priv-ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE;
+   return 0;
+}
+
+static u32 veth_get_tx_csum(struct net_device *dev)
+{
+   return (dev-features  NETIF_F_NO_CSUM) != 0;
+}
+
+static int veth_set_tx_csum(struct net_device *dev, u32 data

[PATCH] Module for ip utility to support veth device (v.2)

2007-06-07 Thread Pavel Emelianov
The usage is
# ip link add [name] type veth [peer name] [mac mac] [peer_mac mac]

This version doesn't include the fix for ip/iplink.c as Patrick
said that he had included it into his patches already.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/ip/Makefile b/ip/Makefile
index 9a5bfe3..b46bce3 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o
 ALLOBJ=$(IPOBJ) $(RTMONOBJ)
 SCRIPTS=ifcfg rtpr routel routef
 TARGETS=ip rtmon
+LIBS=link_veth.so
 
-all: $(TARGETS) $(SCRIPTS)
+all: $(TARGETS) $(SCRIPTS) $(LIBS)
 
 ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL)
 
@@ -24,3 +25,6 @@ clean:
 
 LDLIBS += -ldl
 LDFLAGS+= -Wl,-export-dynamic
+
+%.so: %.c
+   $(CC) $(CFLAGS) -shared $ -o $@
diff --git a/ip/link_veth.c b/ip/link_veth.c
new file mode 100644
index 000..f2e4079
--- /dev/null
+++ b/ip/link_veth.c
@@ -0,0 +1,86 @@
+/*
+ * ip/link_veth.c
+ *
+ * Virtual ETHernet tunnel supprt.
+ *
+ * Author: Pavel Emelianov [EMAIL PROTECTED]
+ */
+
+#include stdio.h
+#include string.h
+
+#include utils.h
+#include ip_common.h
+#include veth.h
+
+#define ETH_ALEN   6
+
+static void usage(void)
+{
+   printf(Usage: ip link add ... 
+   [peer peer-name] [mac mac] [peer_mac mac]\n);
+}
+
+static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *hdr)
+{
+   __u8 mac[ETH_ALEN];
+
+   for (; argc != 0; argv++, argc--) {
+   if (strcmp(*argv, peer) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER,
+   *argv, strlen(*argv));
+
+   continue;
+   }
+
+   if (strcmp(*argv, mac) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   if (strcmp(*argv, peer_mac) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   usage();
+   return -1;
+   }
+
+   return 0;
+}
+
+struct link_util veth_link_util = {
+   .id = veth,
+   .parse_opt = veth_parse_opt,
+};
diff --git a/ip/veth.h b/ip/veth.h
new file mode 100644
index 000..d52e0c5
--- /dev/null
+++ b/ip/veth.h
@@ -0,0 +1,15 @@
+#ifndef __NET_VETH_H__
+#define __NET_VETH_H__
+
+enum {
+   VETH_INFO_UNSPEC,
+   VETH_INFO_MAC,
+   VETH_INFO_PEER,
+   VETH_INFO_PEER_MAC,
+
+   __VETH_INFO_MAX
+};
+
+#define VETH_INFO_MAX  (__VETH_INFO_MAX - 1)
+
+#endif
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel

2007-06-07 Thread Pavel Emelianov
Daniel Lezcano wrote:
 Pavel Emelianov wrote:
 I did this at the very first version, but Alexey showed me that this
 would be wrong. Look. When we create the second device it must be in
 the other namespace as it is useless to have them in one namespace.
 But if we have the device in the other namespace the RTNL_NEWLINK
 message from kernel would come into this namespace thus confusing ip
 utility in the init namespace. Creating the device in the init ns and
 moving it into the new one is rather a complex task.
 
 Pavel,

 moving the netdevice to another namespace is not a complex task. Eric
 Biederman did it in its patchset ( cf.  http://lxc.sf.net/network )
 

 By saying complex I didn't mean that this is difficult to implement,
 but that it consists (must consist) of many stages. I.e. composite.
 Making the device right in the namespace is liter.

  
 When the pair device is created, both extremeties are into the init
 namespace and you can choose to which namespace to move one extremity.
 

 I do not mind that.
  
 When the network namespace dies, the netdev is moved back to the init
 namespace.
 That facilitate network device management.

 Concerning netlink events, this is automatically generated when the
 network device is moved through namespaces.

 IMHO, we should have the network device movement between namespaces in
 order to be able to move a physical network device too (eg. you have 4
 NIC and you want to create 3 containers and assign 3 NIC to each of
 them)
 

 Agree. Moving the devices is a must-have functionality.

 I do not mind making the pair in the init namespace and move the second
 one into the desired namespace. But if we *always* will have two ends in
 different namespaces what to complicate things for?
   
 Just to provide a netdev sufficiently generic to be used by people who
 don't want namespaces but just want to do some network testing, like Ben
 Greear does. He mentioned in a previous email, he will be happy to stop
 redirecting people to out of tree patch.

This patch creates booth devices in the init namespace. That's what
you want, isn't it? When we have the namespaces we will be able to
create the pair with booth ends in the init namespace - just do not
specify the namespace id to create the 2nd end in and the driver will
leave it int the init one.

 https://lists.linux-foundation.org/pipermail/containers/2007-April/004420.html
 
 
 Thanks,
 Pavel

   
 
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel

2007-06-07 Thread Pavel Emelianov

 no one is against generic code and ability to create 2 interfaces in
 *one* namespace.
 (Like we currently allow to do so in OpenVZ)

 However, believe me, moving an interface is a *hard* operation. Much
 harder then netdev
 register from the scratch.

 Because it requires to take into account many things like:
 - packets in flight which requires synchronize and is slow on big
 machines
 - asynchronous sysfs entries registration/deregistration from
   rtln_unlock - netdev_run_todo
 - name/ifindex collisions
 - shutdown/cleanup of addresses/routes/qdisc and other similar stuff

   
 All of what you are describing is already implemented in the Eric's
 patchset.

Daniel. We *agree* that this task *is implementable*. We just want
to say that creating the device in the namespace is less comp...
(oh, well, forget this word) faster than creating and then moving.

 You can have a look at :
 
 http://lxc.sourceforge.net/patches/2.6.20/2.6.20-netns1/broken_out/
 
 And more precisly:
 
 for sysfs issues:
 http://lxc.sourceforge.net/patches/2.6.20/2.6.20-netns1/broken_out/0065-sysfs-Shadow-directory-support.patch
 
 
 for network device movement:
 http://lxc.sourceforge.net/patches/2.6.20/2.6.20-netns1/broken_out/0096-net-Implment-network-device-movement-between-namesp.patch

Good job. Can you prove that this code is less buggy than the existing
register_netdevice() one? This requires testing, doesn't it? But on the
other hand we have the ability to create the device right in the namespace
using well known (and thus well debugged) code with minimal actions from
the kernel.

 
 Thanks,
 Daniel
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-07 Thread Pavel Emelianov
Ben Greear wrote:
 Pavel Emelianov wrote:
 Veth stands for Virtual ETHernet. It is a simple tunnel driver
 that works at the link layer and looks like a pair of ethernet
 devices interconnected with each other.
   
 As Dave mentioned, there is already a driver known as 'veth'.  Maybe borrow
 the etun name as well?

We have already seen that this driver uses ethXXX names for
its devices and Dave agreed with veth one. Moreover Alexey
Kuznetsov said that he would prefer the name veth for etun.

 I would also like some way to identify veth from other device types,
 preferably
 something like a value in sysfs.   However, that should not hold up

We can do this with ethtool. It can get and print the driver 
name of the device.

 consideration of
 this patch, and I am willing to submit a patch after this goes in to add
 the functionality
 I want...

Ok. Thanks.

 +/*
 + * xmit
 + */
 +
 +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
 +{
 +struct net_device *rcv = NULL;
 +struct veth_device_stats *stats;
 +struct veth_priv *priv, *rcv_priv;
 +int length, cpu;
 +
 +skb_orphan(skb);
 +
 +priv = netdev_priv(dev);
 +cpu = smp_processor_id();
 +stats = per_cpu_ptr(priv-stats, cpu);
 +rcv = priv-peer;
 +
 +if (!(rcv-flags  IFF_UP))
 +goto outf;
   
 I think you need at least the option to zero out the time-stamp,
 otherwise it will
 not be re-calculated when received on the peer, and it potentially spent
 significant
 time since it was last calculated (think netem delay or similar).
 
 +/* Zero out the time-stamp so that receiving code is forced
 + * to recalculate it.
 + */
 +skb-tstamp.off_sec = 0;
 +skb-tstamp.off_usec = 0;
 
 +
 +rcv_priv = netdev_priv(rcv);
 +skb-pkt_type = PACKET_HOST;
 +skb-protocol = eth_type_trans(skb, rcv);
 +if (dev-features  NETIF_F_NO_CSUM)
 +skb-ip_summed = rcv_priv-ip_summed;
 +
 +dst_release(skb-dst);
 +skb-dst = NULL;
 +secpath_reset(skb);
 +nf_reset(skb);
 +skb-mark = 0;
 +
 +length = skb-len;
   
 This should be done before you do the eth_type_trans, as that pulls the
 header and your
 byte counters will be off.

This will be ETH_HLEN larger, do you mean this? I think this is
normal as this device tries to look like an iron ethernet card :)

 +
 +stats-tx_bytes += length;
 +stats-tx_packets++;
   +
 +stats = per_cpu_ptr(rcv_priv-stats, cpu);
 +stats-rx_bytes += length;
 +stats-rx_packets++;
 +
 +netif_rx(skb);
 +return 0;
 +
 +outf:
 +kfree_skb(skb);
 +stats-tx_dropped++;
 +return 0;
 +}
   
 Thanks,
 Ben
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Virtual ethernet tunnel (v.2)

2007-06-07 Thread Pavel Emelianov
Ben Greear wrote:
 Pavel Emelianov wrote:
 Ben Greear wrote:
  
 Pavel Emelianov wrote:

 Veth stands for Virtual ETHernet. It is a simple tunnel driver
 that works at the link layer and looks like a pair of ethernet
 devices interconnected with each other.
 
 As Dave mentioned, there is already a driver known as 'veth'.  Maybe
 borrow
 the etun name as well?
 

 We have already seen that this driver uses ethXXX names for
 its devices and Dave agreed with veth one. Moreover Alexey
 Kuznetsov said that he would prefer the name veth for etun.
   
 Ok, fine by me.  I started reading mail from the wrong direction this
 morning :)
  
 I would also like some way to identify veth from other device types,
 preferably
 something like a value in sysfs.   However, that should not hold up
 

 We can do this with ethtool. It can get and print the driver name of
 the device.
   
 I think I'd like something in sysfs that we could query for any
 interface.  Possible return
 strings could be:
 VLAN
 VETH
 ETH
 PPP
 BRIDGE
 AP /* wifi access point interface */
 STA /* wifi station */
 
 
 I will cook up a patch for consideration after veth goes in.

OK.

 I think you need at least the option to zero out the time-stamp,
 otherwise it will
 not be re-calculated when received on the peer, and it potentially spent
 significant
 time since it was last calculated (think netem delay or similar).

 +/* Zero out the time-stamp so that receiving code is forced
 + * to recalculate it.
 + */
 +skb-tstamp.off_sec = 0;
 +skb-tstamp.off_usec = 0;


 +
 +rcv_priv = netdev_priv(rcv);
 +skb-pkt_type = PACKET_HOST;
 +skb-protocol = eth_type_trans(skb, rcv);
 +if (dev-features  NETIF_F_NO_CSUM)
 +skb-ip_summed = rcv_priv-ip_summed;
 +
 +dst_release(skb-dst);
 +skb-dst = NULL;
 +secpath_reset(skb);
 +nf_reset(skb);
 +skb-mark = 0;
 +
 +length = skb-len;
 
 This should be done before you do the eth_type_trans, as that pulls the
 header and your
 byte counters will be off.
 

 This will be ETH_HLEN larger, do you mean this? I think this is
 normal as this device tries to look like an iron ethernet card :)
   
 For device counters, it should count the number of bytes received,
 including all headers,
 but excluding the ethernet FCS.   If an 'iron' card did differently, I'd
 consider it a bug.

Hmm... The loopback must be doing bad things then. It first calls
eth_type_trans and then accounts for the new skb-len.

 Thanks,
 Ben
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Virtual ethernet tunnel

2007-06-06 Thread Pavel Emelianov
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces but
it can be used as is as well.

Eric recently sent a similar driver called etun. This
implementation uses another interface - the RTM_NRELINK
message introduced by Patric. The patch fits today netdev
tree with Patrick's patches.

The newlink callback is organized that way to make it easy
to create the peer device in the separate namespace when we
have them in kernel.

The patch for an ip utility is also provided.

Eric, since ethtool interface was from your patch, I add
your Signed-off-by line.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7d57f4a..7e144be 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate Virtual ethernet device
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate General Instruments Surfboard 1000
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a77affa..4764119 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..6746c91
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,391 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ */
+
+#include linux/list.h
+#include linux/netdevice.h
+#include linux/ethtool.h
+#include linux/etherdevice.h
+
+#include net/dst.h
+#include net/xfrm.h
+#include net/veth.h
+
+#define DRV_NAME   veth
+#define DRV_VERSION1.0
+
+struct veth_priv {
+   struct net_device *peer;
+   struct net_device *dev;
+   struct list_head list;
+   struct net_device_stats stats;
+   unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { peer_ifindex },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+   cmd-supported  = 0;
+   cmd-advertising= 0;
+   cmd-speed  = SPEED_1;
+   cmd-duplex = DUPLEX_FULL;
+   cmd-port   = PORT_TP;
+   cmd-phy_address= 0;
+   cmd-transceiver= XCVR_INTERNAL;
+   cmd-autoneg= AUTONEG_DISABLE;
+   cmd-maxtxpkt   = 0;
+   cmd-maxrxpkt   = 0;
+   return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo 
*info)
+{
+   strcpy(info-driver, DRV_NAME);
+   strcpy(info-version, DRV_VERSION);
+   strcpy(info-fw_version, N/A);
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   switch(stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   break;
+   }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+   return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+   struct ethtool_stats *stats, u64 *data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   data[0] = priv-peer-ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   return priv-ip_summed == CHECKSUM_UNNECESSARY;
+}
+
+static int veth_set_rx_csum(struct net_device *dev, u32 data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   priv-ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE;
+   return 0;
+}
+
+static u32 veth_get_tx_csum(struct net_device *dev)
+{
+   return (dev-features  NETIF_F_NO_CSUM) != 0;
+}
+
+static int veth_set_tx_csum(struct net_device *dev, u32 data)
+{
+   if (data)
+   dev-features |= NETIF_F_NO_CSUM;
+   else
+   dev-features = ~NETIF_F_NO_CSUM;
+   return 0;
+}
+
+static struct ethtool_ops veth_ethtool_ops = {
+   .get_settings   = veth_get_settings,
+   .get_drvinfo= veth_get_drvinfo,
+   .get_link   = ethtool_op_get_link,
+   .get_rx_csum

[PATCH] Module for ip utility to support veth device

2007-06-06 Thread Pavel Emelianov
The usage is
# ip link add [name] type veth [peer name] [mac mac] [peer_mac mac]

The Makefile is maybe not as beautiful as it could be. It
is to be discussed.

One thing I noticed during testing is the following. When launching
this with link_veth.so module and not specifying any module specific
parameters, the kernel refuses to accept the packet when parsing the
IFLA_LINKINFO. So the hunk for ip/iplink.c doesn't add an empty extra 
header if no extra data expected.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/ip/Makefile b/ip/Makefile
index 9a5bfe3..b46bce3 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o
 ALLOBJ=$(IPOBJ) $(RTMONOBJ)
 SCRIPTS=ifcfg rtpr routel routef
 TARGETS=ip rtmon
+LIBS=link_veth.so
 
-all: $(TARGETS) $(SCRIPTS)
+all: $(TARGETS) $(SCRIPTS) $(LIBS)
 
 ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL)
 
@@ -24,3 +25,6 @@ clean:
 
 LDLIBS += -ldl
 LDFLAGS+= -Wl,-export-dynamic
+
+%.so: %.c
+   $(CC) $(CFLAGS) -shared $ -o $@
diff --git a/ip/iplink.c b/ip/iplink.c
index 5170419..6975990 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
 strlen(type));
 
lu = get_link_type(type);
-   if (lu) {
+   if (lu  argc) {
struct rtattr * data = NLMSG_TAIL(req.n);
addattr_l(req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);
 
diff --git a/ip/link_veth.c b/ip/link_veth.c
new file mode 100644
index 000..adfdef6
--- /dev/null
+++ b/ip/link_veth.c
@@ -0,0 +1,77 @@
+#include string.h
+
+#include utils.h
+#include ip_common.h
+#include veth.h
+
+#define ETH_ALEN   6
+
+static void usage(void)
+{
+   printf(Usage: ip link add ... 
+   [peer peer-name] [mac mac] [peer_mac mac]\n);
+}
+
+static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *hdr)
+{
+   __u8 mac[ETH_ALEN];
+
+   for (; argc != 0; argv++, argc--) {
+   if (strcmp(*argv, peer) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER,
+   *argv, strlen(*argv));
+
+   continue;
+   }
+
+   if (strcmp(*argv, mac) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   if (strcmp(*argv, peer_mac) == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   usage();
+   return -1;
+   }
+
+   return 0;
+}
+
+struct link_util veth_link_util = {
+   .id = veth,
+   .parse_opt = veth_parse_opt,
+};
diff --git a/ip/veth.h b/ip/veth.h
new file mode 100644
index 000..74c8e1e
--- /dev/null
+++ b/ip/veth.h
@@ -0,0 +1,13 @@
+#ifndef __NET_VETH_H__
+#define __NET_VETH_H__
+
+enum {
+   VETH_INFO_UNSPEC,
+   VETH_INFO_MAC,
+   VETH_INFO_PEER,
+   VETH_INFO_PEER_MAC,
+
+   VETH_INFO_MAX
+};
+
+#endif
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Consolidate udp hash calculations

2007-05-07 Thread Pavel Emelianov
David Miller wrote:
 From: Pavel Emelianov [EMAIL PROTECTED]
 Date: Fri, 04 May 2007 18:51:36 +0400
 
 Make access to udphash/udplitehash symmetrical to inet hashes.

 This may also help network namespaces, since they tend to use
 one hash for different namespaces by selecting the hash chain
 depending on a hash value and the namespace.

 Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
 
 This is not the UDP Hash Function, it is a mask to bring
 the hash value modulo of the hash table size.
 
 So your function name is misleading and makes the code
 even worse.
 
 I'm also not so sure this cleanup really adds any clarity,
 even with a proper function name.

The same is applicable to inet_lhashfn and inet_bhashfn, but they
are called hashfn-s and do exist in kernel. This hashfn will make
things look similar.

 I can only assume you wish to do something with the UDP hash table
 sizes, and therefore only want to have a need to touch one function.

You are right, I have told that this must help network namespaces,
by making the hash differ depending on the namespace passed.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Consolidate checking for tcp orphan count being too big

2007-05-04 Thread Pavel Emelianov
tcp_out_of_resources() and tcp_close() perform the
same checking of number of orphan sockets. Move this
code into common place.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e22b4f0..a8af9ae 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -254,6 +254,12 @@ static inline int between(__u32 seq1, __
return seq3 - seq2 = seq1 - seq2;
 }
 
+static inline int tcp_too_many_orphans(struct sock *sk, int num)
+{
+   return (num  sysctl_tcp_max_orphans) ||
+   (sk-sk_wmem_queued  SOCK_MIN_SNDBUF 
+atomic_read(tcp_memory_allocated)  sysctl_tcp_mem[2]);
+}
 
 extern struct proto tcp_prot;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8b124ea..8fbe550 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1675,9 +1675,8 @@ adjudge_to_death:
}
if (sk-sk_state != TCP_CLOSE) {
sk_stream_mem_reclaim(sk);
-   if (atomic_read(sk-sk_prot-orphan_count)  
sysctl_tcp_max_orphans ||
-   (sk-sk_wmem_queued  SOCK_MIN_SNDBUF 
-atomic_read(tcp_memory_allocated)  sysctl_tcp_mem[2])) {
+   if (tcp_too_many_orphans(sk,
+   atomic_read(sk-sk_prot-orphan_count))) {
if (net_ratelimit())
printk(KERN_INFO TCP: too many of orphaned 
   sockets\n);
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 2ca97b2..e613401 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -78,9 +78,7 @@ static int tcp_out_of_resources(struct s
if (sk-sk_err_soft)
orphans = 1;
 
-   if (orphans = sysctl_tcp_max_orphans ||
-   (sk-sk_wmem_queued  SOCK_MIN_SNDBUF 
-atomic_read(tcp_memory_allocated)  sysctl_tcp_mem[2])) {
+   if (tcp_too_many_orphans(sk, orphans)) {
if (net_ratelimit())
printk(KERN_INFO Out of socket memory\n);
 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Consolidate udp hash calculations

2007-05-04 Thread Pavel Emelianov
Make access to udphash/udplitehash symmetrical to inet hashes.

This may also help network namespaces, since they tend to use
one hash for different namespaces by selecting the hash chain
depending on a hash value and the namespace.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/include/linux/udp.h b/include/linux/udp.h
index 6de445c..a40ccd4 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -49,6 +49,11 @@ static inline struct udphdr *udp_hdr(con
 #include net/inet_sock.h
 #define UDP_HTABLE_SIZE128
 
+static inline unsigned int udp_hashfn(unsigned int num)
+{
+   return num  (UDP_HTABLE_SIZE - 1);
+}
+
 struct udp_sock {
/* inet_sock has to be the first member */
struct inet_sock inet;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 113e0c4..8ea1c81 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -132,7 +132,7 @@ static inline int __udp_lib_port_inuse(u
struct hlist_node *node;
struct inet_sock *inet;
 
-   sk_for_each(sk, node, udptable[hash  (UDP_HTABLE_SIZE - 1)]) {
+   sk_for_each(sk, node, udptable[udp_hashfn(hash)]) {
if (sk-sk_hash != hash)
continue;
inet = inet_sk(sk);
@@ -159,7 +159,6 @@ int __udp_lib_get_port(struct sock *sk, 
 const struct sock *sk2 ))
 {
struct hlist_node *node;
-   struct hlist_head *head;
struct sock *sk2;
unsigned int hash;
interror = 1;
@@ -175,10 +174,11 @@ int __udp_lib_get_port(struct sock *sk, 
best = result = *port_rover;
for (i = 0; i  UDP_HTABLE_SIZE; i++, result++) {
int size;
+   struct hlist_head *head;
 
hash = hash_port_and_addr(result,
inet_sk(sk)-rcv_saddr);
-   head = udptable[hash  (UDP_HTABLE_SIZE - 1)];
+   head = udptable[udp_hashfn(hash)];
if (hlist_empty(head)) {
if (result  sysctl_local_port_range[1])
result = sysctl_local_port_range[0] +
@@ -222,9 +222,7 @@ gotit:
*port_rover = snum = result;
} else {
hash = hash_port_and_addr(snum, 0);
-   head = udptable[hash  (UDP_HTABLE_SIZE - 1)];
-
-   sk_for_each(sk2, node, head)
+   sk_for_each(sk2, node, udptable[udp_hashfn(hash)])
if (sk2-sk_hash == hash 
sk2 != sk 
inet_sk(sk2)-num == snum 
@@ -237,9 +235,7 @@ gotit:
if (inet_sk(sk)-rcv_saddr) {
hash = hash_port_and_addr(snum,
  inet_sk(sk)-rcv_saddr);
-   head = udptable[hash  (UDP_HTABLE_SIZE - 1)];
-
-   sk_for_each(sk2, node, head)
+   sk_for_each(sk2, node, udptable[udp_hashfn(hash)])
if (sk2-sk_hash == hash 
sk2 != sk 
inet_sk(sk2)-num == snum 
@@ -255,8 +251,7 @@ gotit:
inet_sk(sk)-num = snum;
sk-sk_hash = hash;
if (sk_unhashed(sk)) {
-   head = udptable[hash  (UDP_HTABLE_SIZE - 1)];
-   sk_add_node(sk, head);
+   sk_add_node(sk, udptable[udp_hashfn(hash)]);
sock_prot_inc_use(sk-sk_prot);
}
error = 0;
@@ -304,7 +299,7 @@ static struct sock *__udp4_lib_lookup(__
 
 lookup:
 
-   sk_for_each(sk, node, udptable[hash  (UDP_HTABLE_SIZE - 1)]) {
+   sk_for_each(sk, node, udptable[udp_hashfn(hash)]) {
struct inet_sock *inet = inet_sk(sk);
 
if (sk-sk_hash != hash || ipv6_only_sock(sk) ||
@@ -1205,8 +1200,8 @@ static int __udp4_lib_mcast_deliver(stru
 
read_lock(udp_hash_lock);
 
-   sk = sk_head(udptable[hash  (UDP_HTABLE_SIZE - 1)]);
-   skw = sk_head(udptable[hashwild  (UDP_HTABLE_SIZE - 1)]);
+   sk = sk_head(udptable[udp_hashfn(hash)]);
+   skw = sk_head(udptable[udp_hashfn(hashwild)]);
 
sk = udp_v4_mcast_next(sk, hash, hport, daddr, uh-source, saddr, dif);
if (!sk) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index b083c09..a0a8915 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -67,7 +67,7 @@ static struct sock *__udp6_lib_lookup(st
int badness = -1;
 
read_lock(udp_hash_lock);
-   sk_for_each(sk, node, udptable[hnum  (UDP_HTABLE_SIZE - 1)]) {
+   sk_for_each(sk, node, udptable[udp_hashfn(hnum)]) {
struct inet_sock *inet = inet_sk(sk);
 
if (sk-sk_hash == hnum  sk-sk_family == PF_INET6) {
@@ -350,7 +350,7 @@ static int __udp6_lib_mcast_deliver(stru
int dif;
 
read_lock(udp_hash_lock);
-   sk

Re: [PATCH] Rework dev_base via list_head

2007-05-03 Thread Pavel Emelianov
David Miller wrote:
 From: Pavel Emelianov [EMAIL PROTECTED]
 Date: Wed, 02 May 2007 17:40:56 +0400
 
 Cleanup of dev_base list use, with the aim to simplify making
 device list per-namespace. In almost every occasion, use of
 dev_base variable and dev-next pointer could be easily replaced
 by  for_each_netdev loop. A few most complicated places were
 converted to using first_netdev()/next_netdev().

 Fits 2.6.21-rc7 tree.

 Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
 Acked-by: Kirill Korotaev [EMAIL PROTECTED]
 
 Overall this looks mostly good.
 
 One thing I want to audit before applying this is
 loop termination conditions.
 
 With the old loop, if you do something like this:
 
   for (dev = dev_base; dev; dev = dev-next) {
   if (dev == what_I_want)
   break;
   }
 
 you can test for a successful find after the loop with:
 
   if (dev) {
   I_found_it();
   }
 
 That doesn't work with for_each_netdev(), if the loop
 runs till the end of the list, the iterator will not
 be left at NULL.
 
 I just want to make sure you didn't leave any code around which wants
 that behavior still.

My fault :( I've found some places where this was missed.
I will make a new patch shortly.

 This is one of the subtle things about using the list iterators in
 linux/list.h, vs. a traditional by-hand singly linked list
 implementation.
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Rework dev_base via list_head (v2)

2007-05-03 Thread Pavel Emelianov
Cleanup of dev_base list use, with the aim to simplify making
device list per-namespace. In almost every occasion, use of
dev_base variable and dev-next pointer could be easily replaced
by  for_each_netdev loop. A few most complicated places were
converted to using first_netdev()/next_netdev().

Changes:
* David's comment about loop termination;
* Applies to todays netdev git tree.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
Acked-by: Kirill Korotaev [EMAIL PROTECTED]

---

diff --git a/arch/s390/appldata/appldata_net_sum.c 
b/arch/s390/appldata/appldata_net_sum.c
index a43f348..2180ac1 100644
--- a/arch/s390/appldata/appldata_net_sum.c
+++ b/arch/s390/appldata/appldata_net_sum.c
@@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo
tx_dropped = 0;
collisions = 0;
read_lock(dev_base_lock);
-   for (dev = dev_base; dev != NULL; dev = dev-next) {
+   for_each_netdev(dev) {
stats = dev-get_stats(dev);
rx_packets += stats-rx_packets;
tx_packets += stats-tx_packets;
diff --git a/arch/sparc64/solaris/ioctl.c b/arch/sparc64/solaris/ioctl.c
index 330743c..18352a4 100644
--- a/arch/sparc64/solaris/ioctl.c
+++ b/arch/sparc64/solaris/ioctl.c
@@ -686,7 +686,8 @@ static inline int solaris_i(unsigned int
int i = 0;

read_lock_bh(dev_base_lock);
-   for (d = dev_base; d; d = d-next) i++;
+   for_each_netdev(d)
+   i++;
read_unlock_bh(dev_base_lock);
 
if (put_user (i, (int __user *)A(arg)))
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 1a6aeac..01fbdd3 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -194,15 +194,15 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
sl = sl_tail = NULL;
 
read_lock(dev_base_lock);
-   for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) {
+   for_each_netdev(ifp) {
dev_hold(ifp);
if (!is_aoe_netif(ifp))
-   continue;
+   goto cont;
 
skb = new_skb(sizeof *h + sizeof *ch);
if (skb == NULL) {
printk(KERN_INFO aoe: skb alloc failure\n);
-   continue;
+   goto cont;
}
skb_put(skb, sizeof *h + sizeof *ch);
skb-dev = ifp;
@@ -221,6 +221,8 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 
skb-next = sl;
sl = skb;
+cont:
+   dev_put(ifp);
}
read_unlock(dev_base_lock);
 
diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c
index 2a299a0..ef32a5c 100644
--- a/drivers/net/wireless/strip.c
+++ b/drivers/net/wireless/strip.c
@@ -1971,8 +1971,7 @@ static struct net_device *get_strip_dev(
  sizeof(zero_address))) {
struct net_device *dev;
read_lock_bh(dev_base_lock);
-   dev = dev_base;
-   while (dev) {
+   for_each_netdev(dev) {
if (dev-type == strip_info-dev-type 
!memcmp(dev-dev_addr,
strip_info-true_dev_addr,
@@ -1983,7 +1982,6 @@ static struct net_device *get_strip_dev(
read_unlock_bh(dev_base_lock);
return (dev);
}
-   dev = dev-next;
}
read_unlock_bh(dev_base_lock);
}
diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 3df82fe..98be288 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -365,7 +365,7 @@ static __inline__ int led_get_net_activi
 * for reading should be OK */
read_lock(dev_base_lock);
rcu_read_lock();
-   for (dev = dev_base; dev; dev = dev-next) {
+   for_each_netdev(dev) {
struct net_device_stats *stats;
struct in_device *in_dev = __in_dev_get_rcu(dev);
if (!in_dev || !in_dev-ifa_list)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ac0c92b..782eb1d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -304,7 +304,7 @@ struct net_device
 
unsigned long   state;
 
-   struct net_device   *next;
+   struct list_headdev_list;

/* The device initialization function. Called only once. */
int (*init)(struct net_device *dev);
@@ -575,9 +575,31 @@ struct packet_type {
 #include linux/notifier.h
 
 extern struct net_device   loopback_dev;   /* The loopback 
*/
-extern struct net_device   *dev_base;  /* All devices 
*/
+extern struct list_head

Re: [PATCH] Rework dev_base via list_head (v2)

2007-05-03 Thread Pavel Emelianov
Patrick McHardy wrote:
 Pavel Emelianov wrote:
 Cleanup of dev_base list use, with the aim to simplify making
 device list per-namespace. In almost every occasion, use of
 dev_base variable and dev-next pointer could be easily replaced
 by  for_each_netdev loop. A few most complicated places were
 converted to using first_netdev()/next_netdev().

 Changes:
 * David's comment about loop termination;
 * Applies to todays netdev git tree.
 
 
 This seems to be missing fs/afs/netdevices.c.
 

Hm... I've cloned the git repo this morning but there's not such file...

$ ls fs/afs/netdevices.c
/bin/ls: fs/afs/netdevices.c: No such file or directory

Repo was cloned from
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Rework dev_base via list_head (v3)

2007-05-03 Thread Pavel Emelianov
Netdev tree was updated today soon after I sent a patch for it,
so this version applies to it and the fs/afs/netdevices.c file
is added. Hope I am not late yet again :)

Cleanup of dev_base list use, with the aim to simplify making
device list per-namespace. In almost every occasion, use of
dev_base variable and dev-next pointer could be easily replaced
by  for_each_netdev loop. A few most complicated places were
converted to using first_netdev()/next_netdev().

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
Acked-by: Kirill Korotaev [EMAIL PROTECTED]
Cc: Patrick McHardy [EMAIL PROTECTED]

---

diff --git a/arch/s390/appldata/appldata_net_sum.c 
b/arch/s390/appldata/appldata_net_sum.c
index a43f348..2180ac1 100644
--- a/arch/s390/appldata/appldata_net_sum.c
+++ b/arch/s390/appldata/appldata_net_sum.c
@@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo
tx_dropped = 0;
collisions = 0;
read_lock(dev_base_lock);
-   for (dev = dev_base; dev != NULL; dev = dev-next) {
+   for_each_netdev(dev) {
stats = dev-get_stats(dev);
rx_packets += stats-rx_packets;
tx_packets += stats-tx_packets;
diff --git a/arch/sparc64/solaris/ioctl.c b/arch/sparc64/solaris/ioctl.c
index 330743c..18352a4 100644
--- a/arch/sparc64/solaris/ioctl.c
+++ b/arch/sparc64/solaris/ioctl.c
@@ -686,7 +686,8 @@ static inline int solaris_i(unsigned int
int i = 0;

read_lock_bh(dev_base_lock);
-   for (d = dev_base; d; d = d-next) i++;
+   for_each_netdev(d)
+   i++;
read_unlock_bh(dev_base_lock);
 
if (put_user (i, (int __user *)A(arg)))
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 1a6aeac..01fbdd3 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -194,15 +194,15 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
sl = sl_tail = NULL;
 
read_lock(dev_base_lock);
-   for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) {
+   for_each_netdev(ifp) {
dev_hold(ifp);
if (!is_aoe_netif(ifp))
-   continue;
+   goto cont;
 
skb = new_skb(sizeof *h + sizeof *ch);
if (skb == NULL) {
printk(KERN_INFO aoe: skb alloc failure\n);
-   continue;
+   goto cont;
}
skb_put(skb, sizeof *h + sizeof *ch);
skb-dev = ifp;
@@ -221,6 +221,8 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 
skb-next = sl;
sl = skb;
+cont:
+   dev_put(ifp);
}
read_unlock(dev_base_lock);
 
diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c
index 2a299a0..ef32a5c 100644
--- a/drivers/net/wireless/strip.c
+++ b/drivers/net/wireless/strip.c
@@ -1971,8 +1971,7 @@ static struct net_device *get_strip_dev(
  sizeof(zero_address))) {
struct net_device *dev;
read_lock_bh(dev_base_lock);
-   dev = dev_base;
-   while (dev) {
+   for_each_netdev(dev) {
if (dev-type == strip_info-dev-type 
!memcmp(dev-dev_addr,
strip_info-true_dev_addr,
@@ -1983,7 +1982,6 @@ static struct net_device *get_strip_dev(
read_unlock_bh(dev_base_lock);
return (dev);
}
-   dev = dev-next;
}
read_unlock_bh(dev_base_lock);
}
diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index 3df82fe..98be288 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -365,7 +365,7 @@ static __inline__ int led_get_net_activi
 * for reading should be OK */
read_lock(dev_base_lock);
rcu_read_lock();
-   for (dev = dev_base; dev; dev = dev-next) {
+   for_each_netdev(dev) {
struct net_device_stats *stats;
struct in_device *in_dev = __in_dev_get_rcu(dev);
if (!in_dev || !in_dev-ifa_list)
diff --git a/fs/afs/netdevices.c b/fs/afs/netdevices.c
index ce08977..fc27d4b 100644
--- a/fs/afs/netdevices.c
+++ b/fs/afs/netdevices.c
@@ -47,7 +47,7 @@ int afs_get_ipv4_interfaces(struct afs_i
ASSERT(maxbufs  0);
 
rtnl_lock();
-   for (dev = dev_base; dev; dev = dev-next) {
+   for_each_netdev(dev) {
if (dev-type == ARPHRD_LOOPBACK  !wantloopback)
continue;
idev = __in_dev_get_rtnl(dev);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4428f1c..3044622 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h

[PATCH] Virtual ethernet device (tunnel)

2007-05-02 Thread Pavel Emelianov
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows to communicate between network namespaces but
it can be used as is as well.

Eric recently sent a similar driver called etun. This
implementation is closer to the OpenVZ one and it lacks
some unimportant features of etun driver (like ethtool_ops)
for simplicity.

The general difference from etun is that a netlink interface
is used to create and destroy the pairs. The patch for an
ip utility is also provided.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
Acked-By: Kirill Korotaev [EMAIL PROTECTED]
Acked-By: Dmitry Mishin [EMAIL PROTECTED]
Acked-By: Alexey Kuznetsov [EMAIL PROTECTED]

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index c3f9f59..445dbc7 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate Virtual ethernet device
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate General Instruments Surfboard 1000
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 33af833..2730b80 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..6105f99
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,387 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ */
+
+#include linux/init.h
+#include linux/if.h
+#include linux/netdevice.h
+#include linux/etherdevice.h
+#include net/dst.h
+#include net/xfrm.h
+#include net/genetlink.h
+#include net/veth.h
+
+struct veth_struct {
+   struct net_device   *peer;
+   struct net_device   *dev;
+
+   struct list_headlist;
+   struct net_device_stats *real_stats;
+
+   struct net_device_stats stats;
+};
+
+static LIST_HEAD(veth_list);
+
+static inline struct net_device_stats *veth_stats(struct veth_struct *veth,
+   int cpuid)
+{
+   return per_cpu_ptr(veth-real_stats, cpuid);
+}
+
+/*
+ * Device functions
+ */
+
+static int veth_open(struct net_device *dev)
+{
+   return 0;
+}
+
+static int veth_close(struct net_device *dev)
+{
+   return 0;
+}
+
+static void veth_destructor(struct net_device *dev)
+{
+   struct veth_struct *veth;
+
+   veth = dev-priv;
+   free_percpu(veth-real_stats);
+   free_netdev(dev);
+}
+
+static struct net_device_stats *veth_get_stats(struct net_device *dev)
+{
+   int i;
+   struct veth_struct *veth;
+   struct net_device_stats *stats;
+   struct net_device_stats *dev_stats;
+
+   veth = dev-priv;
+   stats = veth-stats;
+   memset(stats, 0, sizeof(struct net_device_stats));
+
+   for_each_possible_cpu (i) {
+   dev_stats = veth_stats(veth, i);
+   stats-rx_bytes   += dev_stats-rx_bytes;
+   stats-tx_bytes   += dev_stats-tx_bytes;
+   stats-rx_packets += dev_stats-rx_packets;
+   stats-tx_packets += dev_stats-tx_packets;
+   }
+
+   return stats;
+}
+
+static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+   struct net_device_stats *stats;
+   struct net_device *rcv = NULL;
+   struct veth_struct *veth;
+   int length, cpu;
+
+   skb_orphan(skb);
+
+   veth = dev-priv;
+   rcv = veth-peer;
+
+   cpu = smp_processor_id();
+   stats = veth_stats(veth, cpu);
+
+   if (!(rcv-flags  IFF_UP))
+   goto outf;
+
+   skb-dev = rcv;
+   skb-pkt_type = PACKET_HOST;
+   skb-protocol = eth_type_trans(skb, rcv);
+
+   dst_release(skb-dst);
+   skb-dst = NULL;
+
+   secpath_reset(skb);
+   nf_reset(skb);
+
+   length = skb-len;
+
+   stats-tx_bytes += length;
+   stats-tx_packets++;
+
+   stats = veth_stats(rcv-priv, cpu);
+   stats-rx_bytes += length;
+   stats-rx_packets++;
+
+   netif_rx(skb);
+   return 0;
+
+outf:
+   kfree_skb(skb);
+   stats-tx_dropped++;
+   return 0;
+}
+
+/*
+ * Setup / remove routines
+ */
+
+static int veth_init_dev(struct net_device *dev)
+{
+   struct veth_struct *veth;
+
+   dev-hard_start_xmit = veth_xmit;
+   dev-get_stats = veth_get_stats;
+   dev-open = veth_open;
+   dev-stop = veth_close;
+   dev-destructor = veth_destructor;
+
+   veth = dev

[PATCH] Make ip utility veth driver aware

2007-05-02 Thread Pavel Emelianov
The new command is called veth with the following syntax:
* ip veth add dev1 dev2
  creates interconnected pair of veth devices.
* ip veth del dev
  destroys the pair of veth devices, where dev is either
  dev1 or dev2 used to create the pair.

One question that is to be solved is whether or not to create
a hard-coded netlink family for veth driver. Without it the 
family resolution code has to be moved to general place in ip
utility (by now it is copy-paste-ed from one file to another
till final decision).

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

diff --git a/ip/Makefile b/ip/Makefile
index a749993..864e04e 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -1,7 +1,7 @@
 IPOBJ=ip.o ipaddress.o iproute.o iprule.o \
 rtm_map.o iptunnel.o ip6tunnel.o tunnel.o ipneigh.o ipntable.o iplink.o \
 ipmaddr.o ipmonitor.o ipmroute.o ipprefix.o \
-ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o
+ipxfrm.o xfrm_state.o xfrm_policy.o xfrm_monitor.o veth.o
 
 RTMONOBJ=rtmon.o
 
diff --git a/ip/ip.c b/ip/ip.c
index c084292..d3d3a8a 100644
--- a/ip/ip.c
+++ b/ip/ip.c
@@ -27,6 +27,7 @@
 #include SNAPSHOT.h
 #include utils.h
 #include ip_common.h
+#include veth.h
 
 int preferred_family = AF_UNSPEC;
 int show_stats = 0;
@@ -46,7 +47,7 @@ static void usage(void)
 Usage: ip [ OPTIONS ] OBJECT { COMMAND | help }\n
ip [ -force ] [-batch filename\n
 where  OBJECT := { link | addr | route | rule | neigh | ntable | tunnel |\n
-   maddr | mroute | monitor | xfrm }\n
+   maddr | mroute | monitor | xfrm | veth }\n
OPTIONS := { -V[ersion] | -s[tatistics] | -r[esolve] |\n
 -f[amily] { inet | inet6 | ipx | dnet | link } |\n
 -o[neline] | -t[imestamp] }\n);
@@ -76,6 +77,7 @@ static const struct cmd {
{ monitor,do_ipmonitor },
{ xfrm,   do_xfrm },
{ mroute, do_multiroute },
+   { veth,   do_veth },
{ help,   do_help },
{ 0 }
 };
diff --git a/ip/veth.c b/ip/veth.c
new file mode 100644
index 000..d4eecc8
--- /dev/null
+++ b/ip/veth.c
@@ -0,0 +1,196 @@
+/*
+ * veth.c ethernet tunnel
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Authors:Pavel Emelianov, [EMAIL PROTECTED]
+ *
+ */
+
+#include stdio.h
+#include string.h
+#include unistd.h
+#include sys/types.h
+#include sys/socket.h
+#include linux/genetlink.h
+
+#include utils.h
+#include veth.h
+
+#define GENLMSG_DATA(glh)   ((void *)(NLMSG_DATA(glh) + GENL_HDRLEN))
+#define NLA_DATA(na)((void *)((char*)(na) + NLA_HDRLEN))
+
+static int do_veth_help(void)
+{
+   fprintf(stderr, Usage: ip veth add DEVICE PEER_NAME\n);
+   fprintf(stderr,del DEVICE\n);
+   exit(-1);
+}
+
+static int genl_ctrl_resolve_family(const char *family)
+{
+   struct rtnl_handle rth;
+   struct nlmsghdr *nlh;
+   struct genlmsghdr *ghdr;
+   int ret = 0;
+   struct {
+   struct nlmsghdr n;
+   charbuf[4096];
+   } req;
+
+   memset(req, 0, sizeof(req));
+
+   nlh = req.n;
+   nlh-nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
+   nlh-nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+   nlh-nlmsg_type = GENL_ID_CTRL;
+
+   ghdr = NLMSG_DATA(req.n);
+   ghdr-cmd = CTRL_CMD_GETFAMILY;
+
+   if (rtnl_open_byproto(rth, 0, NETLINK_GENERIC)  0) {
+   fprintf(stderr, Cannot open generic netlink socket\n);
+   exit(1);
+   }
+
+   addattr_l(nlh, 128, CTRL_ATTR_FAMILY_NAME, family, strlen(family) + 1);
+
+   if (rtnl_talk(rth, nlh, 0, 0, nlh, NULL, NULL)  0) {
+   fprintf(stderr, Error talking to the kernel\n);
+   goto errout;
+   }
+
+   {
+   struct rtattr *tb[CTRL_ATTR_MAX + 1];
+   struct genlmsghdr *ghdr = NLMSG_DATA(nlh);
+   int len = nlh-nlmsg_len;
+   struct rtattr *attrs;
+
+   if (nlh-nlmsg_type !=  GENL_ID_CTRL) {
+   fprintf(stderr, Not a controller message, nlmsg_len=%d 

+   nlmsg_type=0x%x\n, nlh-nlmsg_len, 
nlh-nlmsg_type);
+   goto errout;
+   }
+
+   if (ghdr-cmd != CTRL_CMD_NEWFAMILY) {
+   fprintf(stderr, Unkown controller command %d\n, 
ghdr-cmd);
+   goto errout;
+   }
+
+   len -= NLMSG_LENGTH(GENL_HDRLEN);
+
+   if (len  0) {
+   fprintf(stderr, wrong controller message len %d\n, 
len);
+   return -1;
+   }
+
+   attrs = (struct rtattr *) ((char *) ghdr

Re: [Devel] [PATCH] Virtual ethernet device (tunnel)

2007-05-02 Thread Pavel Emelianov
Daniel Lezcano wrote:
 Pavel Emelianov wrote:
 Veth stands for Virtual ETHernet. It is a simple tunnel driver
 that works at the link layer and looks like a pair of ethernet
 devices interconnected with each other.

 Mainly it allows to communicate between network namespaces but
 it can be used as is as well.

 Eric recently sent a similar driver called etun. This
 implementation is closer to the OpenVZ one and it lacks
 some unimportant features of etun driver (like ethtool_ops)
 for simplicity.

 The general difference from etun is that a netlink interface
 is used to create and destroy the pairs. The patch for an
 ip utility is also provided.
 if etun and veth are similar, why didn't you put the netlink interface
 to the etun driver instead of sending a new driver ?

Alexey said, that he preferred the name veth to etun.
So an incremental patch would look too bad.

-- Daniel
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Rework dev_base via list_head

2007-05-02 Thread Pavel Emelianov
Cleanup of dev_base list use, with the aim to simplify making
device list per-namespace. In almost every occasion, use of
dev_base variable and dev-next pointer could be easily replaced
by  for_each_netdev loop. A few most complicated places were
converted to using first_netdev()/next_netdev().

Fits 2.6.21-rc7 tree.

Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
Acked-by: Kirill Korotaev [EMAIL PROTECTED]

---

diff --git a/arch/s390/appldata/appldata_net_sum.c 
b/arch/s390/appldata/appldata_net_sum.c
index f64b8c8..c271c9e 100644
--- a/arch/s390/appldata/appldata_net_sum.c
+++ b/arch/s390/appldata/appldata_net_sum.c
@@ -107,7 +107,7 @@ static void appldata_get_net_sum_data(vo
tx_dropped = 0;
collisions = 0;
read_lock(dev_base_lock);
-   for (dev = dev_base; dev != NULL; dev = dev-next) {
+   for_each_netdev(dev) {
if (dev-get_stats == NULL) {
continue;
}
diff --git a/arch/sparc64/solaris/ioctl.c b/arch/sparc64/solaris/ioctl.c
index 330743c..18352a4 100644
--- a/arch/sparc64/solaris/ioctl.c
+++ b/arch/sparc64/solaris/ioctl.c
@@ -686,7 +686,8 @@ static inline int solaris_i(unsigned int
int i = 0;

read_lock_bh(dev_base_lock);
-   for (d = dev_base; d; d = d-next) i++;
+   for_each_netdev(d)
+   i++;
read_unlock_bh(dev_base_lock);
 
if (put_user (i, (int __user *)A(arg)))
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index 8d17d8d..236ec10 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -193,15 +193,15 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
sl = sl_tail = NULL;
 
read_lock(dev_base_lock);
-   for (ifp = dev_base; ifp; dev_put(ifp), ifp = ifp-next) {
+   for_each_netdev(ifp) {
dev_hold(ifp);
if (!is_aoe_netif(ifp))
-   continue;
+   goto cont;
 
skb = new_skb(sizeof *h + sizeof *ch);
if (skb == NULL) {
printk(KERN_INFO aoe: skb alloc failure\n);
-   continue;
+   goto cont;
}
skb_put(skb, sizeof *h + sizeof *ch);
skb-dev = ifp;
@@ -220,6 +220,8 @@ aoecmd_cfg_pkts(ushort aoemajor, unsigne
 
skb-next = sl;
sl = skb;
+cont:
+   dev_put(ifp);
}
read_unlock(dev_base_lock);
 
diff --git a/drivers/net/wireless/strip.c b/drivers/net/wireless/strip.c
index f5ce1c6..200d131 100644
--- a/drivers/net/wireless/strip.c
+++ b/drivers/net/wireless/strip.c
@@ -1971,8 +1971,7 @@ static struct net_device *get_strip_dev(
  sizeof(zero_address))) {
struct net_device *dev;
read_lock_bh(dev_base_lock);
-   dev = dev_base;
-   while (dev) {
+   for_each_netdev(dev) {
if (dev-type == strip_info-dev-type 
!memcmp(dev-dev_addr,
strip_info-true_dev_addr,
@@ -1983,7 +1982,6 @@ static struct net_device *get_strip_dev(
read_unlock_bh(dev_base_lock);
return (dev);
}
-   dev = dev-next;
}
read_unlock_bh(dev_base_lock);
}
diff --git a/drivers/parisc/led.c b/drivers/parisc/led.c
index d190c05..5db0526 100644
--- a/drivers/parisc/led.c
+++ b/drivers/parisc/led.c
@@ -365,7 +365,7 @@ static __inline__ int led_get_net_activi
 * for reading should be OK */
read_lock(dev_base_lock);
rcu_read_lock();
-   for (dev = dev_base; dev; dev = dev-next) {
+   for_each_netdev(dev) {
struct net_device_stats *stats;
struct in_device *in_dev = __in_dev_get_rcu(dev);
if (!in_dev || !in_dev-ifa_list)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a52854..7038e43 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -302,7 +302,7 @@ struct net_device
 
unsigned long   state;
 
-   struct net_device   *next;
+   struct list_headdev_list;

/* The device initialization function. Called only once. */
int (*init)(struct net_device *dev);
@@ -569,9 +569,31 @@ struct packet_type {
 #include linux/notifier.h
 
 extern struct net_device   loopback_dev;   /* The loopback 
*/
-extern struct net_device   *dev_base;  /* All devices 
*/
+extern struct list_headdev_base_head;  /* All 
devices */
 extern rwlock_tdev_base_lock

Re: [BRIDGE] Unaligned access on IA64 when comparing ethernet addresses

2007-04-18 Thread Pavel Emelianov
[snip]

 --- linux-2.6.orig/net/bridge/br_private.h2007-04-17
 13:26:48.0 -0700 +++ linux-2.6/net/bridge/br_private.h
 2007-04-17 13:30:29.0 -0700 @@ -36,7 +36,7 @@
  {
   unsigned char   prio[2];
   unsigned char   addr[6];
 -};
 +} __attribute__((aligned(8)));

Why 8? Mustn't it be 16? Address is to be 2-bytes aligned...

  struct mac_addr
  {
 -
 To unsubscribe from this list: send the line unsubscribe netdev in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NETLINK] Don't attach callback to a going-away netlink socket

2007-04-18 Thread Pavel Emelianov
Sorry, I forgot to put netdev and David in Cc when I first sent it.

There is a race between netlink_dump_start() and netlink_release()
that can lead to the situation when a netlink socket with non-zero
callback is freed.

Here it is:

CPU1:   CPU2
netlink_release():  netlink_dump_start():

sk = netlink_lookup(); /* OK */

netlink_remove();

spin_lock(nlk-cb_lock);
if (nlk-cb) { /* false */
  ...
}
spin_unlock(nlk-cb_lock);

spin_lock(nlk-cb_lock);
if (nlk-cb) { /* false */
 ...
}
nlk-cb = cb;
spin_unlock(nlk-cb_lock);
...
sock_orphan(sk);
/*
 * proceed with releasing
 * the socket
 */

The proposal it to make sock_orphan before detaching the callback
in netlink_release() and to check for the sock to be SOCK_DEAD in
netlink_dump_start() before setting a new callback.

Signed-off-by: Denis Lunev [EMAIL PROTECTED]
Signed-off-by: Kirill Korotaev [EMAIL PROTECTED]
Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]
Acked-by: Patrick McHardy [EMAIL PROTECTED]

---

--- a/net/netlink/af_netlink.c  2004-10-25 12:12:23.0 +0400
+++ b/net/netlink/af_netlink.c  2004-10-28 16:26:12.0 +0400
@@ -255,6 +255,7 @@ static int netlink_release(struct socket
return 0;
 
netlink_remove(sk);
+   sock_orphan(sk);
nlk = nlk_sk(sk);
 
spin_lock(nlk-cb_lock);
@@ -269,7 +270,6 @@ static int netlink_release(struct socket
/* OK. Socket is unlinked, and, therefore,
   no new packets will arrive */
 
-   sock_orphan(sk);
sock-sk = NULL;
wake_up_interruptible_all(nlk-wait);
 
@@ -942,9 +942,9 @@ int netlink_dump_start(struct sock *ssk,
return -ECONNREFUSED;
}
nlk = nlk_sk(sk);
-   /* A dump is in progress... */
+   /* A dump or destruction is in progress... */
spin_lock(nlk-cb_lock);
-   if (nlk-cb) {
+   if (nlk-cb || sock_flag(sk, SOCK_DEAD)) {
spin_unlock(nlk-cb_lock);
netlink_destroy_callback(cb);
sock_put(sk);
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETLINK] Don't attach callback to a going-away netlink socket

2007-04-18 Thread Pavel Emelianov
Evgeniy Polyakov wrote:
 On Wed, Apr 18, 2007 at 12:16:18PM +0400, Pavel Emelianov ([EMAIL PROTECTED]) 
 wrote:
 Sorry, I forgot to put netdev and David in Cc when I first sent it.

 There is a race between netlink_dump_start() and netlink_release()
 that can lead to the situation when a netlink socket with non-zero
 callback is freed.
 
 Out of curiosity, why not to fix a netlink_dump_start() to remove
 callback in error path, since in 'no-error' path it removes it in

Error path is not relevant here. The problem is that we
keep a calback on a socket that is about to be freed.

 netlink_dump().
 
 And, btw, can release method be called while socket is being used, I
 thought about proper reference counters should prevent this, but not
 100% sure with RCU dereferencing of the descriptor.
 

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [BRIDGE] Unaligned access on IA64 when comparing ethernet addresses

2007-04-18 Thread Pavel Emelianov
David Miller wrote:
 From: Pavel Emelianov [EMAIL PROTECTED]
 Date: Wed, 18 Apr 2007 10:43:56 +0400
 
 [snip]

 --- linux-2.6.orig/net/bridge/br_private.h  2007-04-17
 13:26:48.0 -0700 +++ linux-2.6/net/bridge/br_private.h
 2007-04-17 13:30:29.0 -0700 @@ -36,7 +36,7 @@
  {
 unsigned char   prio[2];
 unsigned char   addr[6];
 -};
 +} __attribute__((aligned(8)));
 Why 8? Mustn't it be 16? Address is to be 2-bytes aligned...
 
 Actually it could be made 2, the aligned() attribute is
 in bytes, not bits.

Indeed :) My bad :( Thank you...

-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETLINK] Don't attach callback to a going-away netlink socket

2007-04-18 Thread Pavel Emelianov
Evgeniy Polyakov wrote:
 On Wed, Apr 18, 2007 at 12:32:40PM +0400, Pavel Emelianov ([EMAIL PROTECTED]) 
 wrote:
 Evgeniy Polyakov wrote:
 On Wed, Apr 18, 2007 at 12:16:18PM +0400, Pavel Emelianov ([EMAIL 
 PROTECTED]) wrote:
 Sorry, I forgot to put netdev and David in Cc when I first sent it.

 There is a race between netlink_dump_start() and netlink_release()
 that can lead to the situation when a netlink socket with non-zero
 callback is freed.
 Out of curiosity, why not to fix a netlink_dump_start() to remove
 callback in error path, since in 'no-error' path it removes it in
 Error path is not relevant here. The problem is that we
 keep a calback on a socket that is about to be freed.
 
 Yes, you are right, that it will not be freed in netlink_release(), 
 but it will be freed in netlink_dump() after it is processed (in no-error 
 path only though).
 

But error path will leak it. On success path we would have
a leaked packet in sk_write_queue, since we did't see it in
skb_queue_purge() while doing netlink_release().

Of course we can place the struts in code to handle the case
when we have a released socket with the attached callback, but
it is more correct (IMHO) not to allow to attach the callbacks
to dead sockets.
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[BRIDGE] Unaligned access on IA64 when comparing ethernet addresses

2007-04-17 Thread Pavel Emelianov
From: Evgeny Kravtsunov [EMAIL PROTECTED]

compare_ether_addr() implicitly requires that the addresses
passed are 2-bytes aligned in memory.

This is not true for br_stp_change_bridge_id() and
br_stp_recalculate_bridge_id() in which one of the addresses
is unsigned char *, and thus may not be 2-bytes aligned.

Signed-off-by: Evgeny Kravtsunov [EMAIL PROTECTED]
Signed-off-by: Kirill Korotaev [EMAIL PROTECTED]
Signed-off-by: Pavel Emelianov [EMAIL PROTECTED]

---

--- a/net/bridge/br_stp_if.c2006-09-20 07:42:06.0 +0400
+++ b/net/bridge/br_stp_if.c2007-04-13 12:28:08.0 +0400
@@ -124,7 +124,9 @@ void br_stp_disable_port(struct net_brid
 /* called under bridge lock */
 void br_stp_change_bridge_id(struct net_bridge *br, const unsigned char *addr)
 {
-   unsigned char oldaddr[6];
+   /* should be aligned on 2 bytes for compare_ether_addr() */
+   unsigned short oldaddr_aligned[ETH_ALEN  1];
+   unsigned char *oldaddr = (unsigned char *)oldaddr_aligned;
struct net_bridge_port *p;
int wasroot;
 
@@ -149,11 +151,14 @@ void br_stp_change_bridge_id(struct net_
br_become_root_bridge(br);
 }
 
-static const unsigned char br_mac_zero[6];
+/* should be aligned on 2 bytes for compare_ether_addr() */
+static const unsigned short br_mac_zero_aligned[ETH_ALEN  1];
 
 /* called under bridge lock */
 void br_stp_recalculate_bridge_id(struct net_bridge *br)
 {
+   const unsigned char *br_mac_zero =
+   (const unsigned char *)br_mac_zero_aligned;
const unsigned char *addr = br_mac_zero;
struct net_bridge_port *p;
 
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html