[Devel] Re: [ckrm-tech] [PATCH 00/10] Containers(V10): Generic Process Containers
> I suppose as a cleaner alternative we could
> add a container_subsys->inherit_defaults() handler, to be called at
> container_clone(), and for cpusets this would set cpus and mems to
> the parent values - sibling exclusive values. If that comes to nothing,
> then the attach_task() is still refused, and the unshare() or clone()
> fails, but this time with good reason.

Unfortunately, I haven't spent the time I should thinking about container cloning, namespaces and such. I don't know, for the workloads that matter to me, when, how or if this container cloning will be used.

I'm tempted to suggest the following.

First, I am assuming that the classic method of creating cpuset children will still work, such as the following (which can fail for certain combinations of exclusive cpus or mems):

    cd /dev/cpuset/foobar
    mkdir foochild
    cp cpus foochild
    cp mems foochild
    echo $$ > foochild/tasks

Second, given that, how about you fail the unshare() or clone() anytime that the instance to be cloned has any sibling cpusets with any exclusive flags set.

The exclusive property is not really on friendly terms with cloning.

Now if the above classic code must be encoded using cloning under the covers, then we've got problems, probably more problems than just this.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
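The second suggestion above (refuse the clone whenever a sibling cpuset has an exclusive flag set) can be sketched in plain userspace C. This is only an illustration under stated assumptions: struct toy_cpuset and toy_clone_allowed() are hypothetical names invented here, not kernel structures or functions, and the real check would have to run under the appropriate cpuset locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a cpuset (not the kernel's struct). */
struct toy_cpuset {
    bool cpu_exclusive;   /* models the cpu_exclusive flag */
    bool mem_exclusive;   /* models the mem_exclusive flag */
};

/*
 * Return true if a clone may proceed, i.e. none of the would-be
 * siblings has any exclusive flag set. An empty sibling list is fine.
 */
bool toy_clone_allowed(const struct toy_cpuset *siblings, size_t n)
{
    size_t i;

    assert(siblings != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (siblings[i].cpu_exclusive || siblings[i].mem_exclusive)
            return false;
    return true;
}
```

With this policy the clone fails early and predictably, instead of failing later inside attach_task() on an empty cpuset.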
[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers
Quoting Paul Jackson ([EMAIL PROTECTED]):
> > > I wasn't paying close enough attention to understand why you couldn't
> > > do it in two steps - make the container, and then populate it with
> > > resources.
> >
> > Sorry, please clarify - are you saying that now you do understand, or
> > that I should explain?
>
> Could you explain -- I still don't understand why you need this option.
> I still don't understand why you can't do it in two steps - make the
> container, then add cpu/mem separately.

Sure - the key is that the ns subsystem uses container_clone() to automatically create a new container (on sys_unshare() or clone(2) with certain flags) and move the current task into it.

Let's say we have done

    mount -t container -o ns,cpuset nsproxy /containers

and we, as task 875, happen to be in the topmost container:

    /containers/

Now we fork task 999 which does an unshare(CLONE_NEWNS), or we just clone(CLONE_NEWNS). This will create

    /containers/node_999

and move task 999 into that container. Except that when it tries attach_task() it is refused by cpuset. So the container_clone() fails, and in turn the sys_unshare() or clone() fails.

A login making use of the pam_namespace.so library would fail this way with the ns and cpuset subsystems composed.

We could special case this by having kernel/container.c:container_clone() check whether one of the subsystems is cpusets and, if so, setting the defaults for mems and cpus, but that is kind of ugly. I suppose as a cleaner alternative we could add a container_subsys->inherit_defaults() handler, to be called at container_clone(), and for cpusets this would set cpus and mems to the parent values - sibling exclusive values. If that comes to nothing, then the attach_task() is still refused, and the unshare() or clone() fails, but this time with good reason.

thanks,
-serge
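The "parent values - sibling exclusive values" default proposed above is a set difference. A minimal userspace sketch follows, assuming CPUs fit in a 64-bit mask; struct toy_cs and toy_inherit_cpus() are hypothetical names for illustration and are not part of the posted patches or of the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified cpuset: a CPU bitmask plus an exclusive flag. */
struct toy_cs {
    uint64_t cpus;        /* bit i set means CPU i belongs to this cpuset */
    int cpu_exclusive;    /* non-zero if the cpuset claims its CPUs exclusively */
};

/*
 * Default CPU mask for a newly cloned child: everything the parent has,
 * minus whatever exclusive siblings already claim. The result may be 0,
 * in which case attaching a task would still have to be refused.
 */
uint64_t toy_inherit_cpus(uint64_t parent_cpus,
                          const struct toy_cs *siblings, size_t n)
{
    uint64_t taken = 0;
    size_t i;

    assert(siblings != NULL || n == 0);
    for (i = 0; i < n; i++)
        if (siblings[i].cpu_exclusive)
            taken |= siblings[i].cpus;
    return parent_cpus & ~taken;
}
```

The empty result is exactly the "comes to nothing" case in the message: the clone still fails, but only when there is genuinely nothing left to hand out.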
[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers
> > I wasn't paying close enough attention to understand why you couldn't
> > do it in two steps - make the container, and then populate it with
> > resources.
>
> Sorry, please clarify - are you saying that now you do understand, or
> that I should explain?

Could you explain -- I still don't understand why you need this option. I still don't understand why you can't do it in two steps - make the container, then add cpu/mem separately.

-- 
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401
[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers
Quoting Paul Jackson ([EMAIL PROTECTED]):
> > Would it then make sense to just
> > default to (parent_set - sibling_exclusive_set) for a new sibling's
> > value?
>
> Which could well be empty, which in turn puts one back in the position
> of dealing with a newborn cpuset that is empty (of cpus or of memory),
> or else it introduces a new and odd constraint on when cpusets can be
> created (only when there are non-exclusive cpus and mems available.)
>
> > An option is fine with me, but without such an option at all, cpusets
> > could not be applied to namespaces...
>
> I wasn't paying close enough attention to understand why you couldn't
> do it in two steps - make the container, and then populate it with
> resources.

Sorry, please clarify - are you saying that now you do understand, or that I should explain?

> But if indeed that's not possible, then I guess we need some sort of
> option specifying whether to create kids empty, or inheriting.

Paul (uh, Menage :) should I do a patch for this or have you got it already?

thanks,
-serge
Re: [Devel] Re: [PATCH] Virtual ethernet tunnel
From: Daniel Lezcano <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 22:38:11 +0200

> Perhaps, a name like "epipe" or "npipe", which reflects what does the
> device, is more appropriate ?

'npipe' (Network PIPE) or 'epipe' (Ethernet PIPE) are fine with me.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
Re: [Devel] Re: [PATCH] Virtual ethernet tunnel
David Miller wrote:
> From: Pavel Emelianov <[EMAIL PROTECTED]>
> Date: Wed, 06 Jun 2007 19:11:38 +0400
>
> > Veth stands for Virtual ETHernet. It is a simple tunnel driver
> > that works at the link layer and looks like a pair of ethernet
> > devices interconnected with each other.
>
> I would suggest choosing a different name.
>
> 'veth' is also the name of the virtualized ethernet device
> found on IBM machines, driven by driver/net/ibmveth.[ch]

Eric Biederman proposed the name "etun", which stands for "Ethernet TUNnel".

The goal of the pair device is to pass network packets between namespaces. That reminds me of the well-known "pipe", which passes data between processes. All network devices have a name describing what they do: "eth" for ethernet, "dummy" for a device doing nothing, loopback, ...

Perhaps, a name like "epipe" or "npipe", which reflects what does the device, is more appropriate ?

  -- Daniel
[Devel] Re: [PATCH] Virtual ethernet tunnel
From: Pavel Emelianov <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:11:38 +0400

> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.

I would suggest choosing a different name.

'veth' is also the name of the virtualized ethernet device found on IBM machines, driven by driver/net/ibmveth.[ch]
[Devel] Re: [PATCH] Virtual ethernet tunnel
On Wed, 06 Jun 2007 19:11:38 +0400
Pavel Emelianov <[EMAIL PROTECTED]> wrote:

> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.
>
> Mainly it allows to communicate between network namespaces but
> it can be used as is as well.
>
> Eric recently sent a similar driver called etun. This
> implementation uses another interface - the RTM_NEWLINK
> message introduced by Patrick. The patch fits today netdev
> tree with Patrick's patches.
>
> The newlink callback is organized that way to make it easy
> to create the peer device in the separate namespace when we
> have them in kernel.
>
> The patch for an ip utility is also provided.
>
> Eric, since ethtool interface was from your patch, I add
> your Signed-off-by line.
>
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>
>
> ---
>
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 7d57f4a..7e144be 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -119,6 +119,12 @@ config TUN
>
>        If you don't know what to use this for, you don't need it.
>
> +config VETH
> +    tristate "Virtual ethernet device"
> +    ---help---
> +      The device is an ethernet tunnel. Devices are created in pairs. When
> +      one end receives the packet it appears on its pair and vice versa.
> +
>  config NET_SB1000
>      tristate "General Instruments Surfboard 1000"
>      depends on PNP
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index a77affa..4764119 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
>  obj-$(CONFIG_MACMACE) += macmace.o
>  obj-$(CONFIG_MAC89x0) += mac89x0.o
>  obj-$(CONFIG_TUN) += tun.o
> +obj-$(CONFIG_VETH) += veth.o
>  obj-$(CONFIG_NET_NETX) += netx-eth.o
>  obj-$(CONFIG_DL2K) += dl2k.o
>  obj-$(CONFIG_R8169) += r8169.o
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> new file mode 100644
> index 000..6746c91
> --- /dev/null
> +++ b/drivers/net/veth.c
> @@ -0,0 +1,391 @@
> +/*
> + *  drivers/net/veth.c
> + *
> + *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
> + *
> + */
> +
> +#include
> +#include
> +#include
> +#include
> +
> +#include
> +#include
> +#include
> +
> +#define DRV_NAME    "veth"
> +#define DRV_VERSION "1.0"
> +
> +struct veth_priv {
> +    struct net_device *peer;
> +    struct net_device *dev;
> +    struct list_head list;
> +    struct net_device_stats stats;
> +    unsigned ip_summed;
> +};
> +
> +static LIST_HEAD(veth_list);
> +
> +/*
> + * ethtool interface
> + */
> +
> +static struct {
> +    const char string[ETH_GSTRING_LEN];
> +} ethtool_stats_keys[] = {
> +    { "peer_ifindex" },
> +};

Seems like a good usage of sysfs attributes, rather than ethtool. Then you can get rid of all the useless ethtool for what is basically a virtual device.

> +/*
> + * xmit
> + */
> +
> +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +    struct net_device *rcv = NULL;
> +    struct veth_priv *priv, *rcv_priv;
> +    int length;
> +
> +    skb_orphan(skb);
> +
> +    priv = netdev_priv(dev);
> +    rcv = priv->peer;
> +    rcv_priv = netdev_priv(rcv);
> +
> +    if (!(rcv->flags & IFF_UP))
> +        goto outf;
> +
> +    skb->dev = rcv;
> +    skb->pkt_type = PACKET_HOST;
> +    skb->protocol = eth_type_trans(skb, rcv);
> +    if (dev->features & NETIF_F_NO_CSUM)
> +        skb->ip_summed = rcv_priv->ip_summed;
> +
> +    dst_release(skb->dst);
> +    skb->dst = NULL;
> +
> +    secpath_reset(skb);
> +    nf_reset(skb);
> +
> +    length = skb->len;
> +
> +    priv->stats.tx_bytes += length;
> +    priv->stats.tx_packets++;
> +
> +    rcv_priv->stats.rx_bytes += length;
> +    rcv_priv->stats.rx_packets++;

Per-cpu stats? This will cacheline thrash.

> +    netif_rx(skb);
> +    return 0;
> +
> +outf:
> +    kfree_skb(skb);
> +    priv->stats.tx_dropped++;
> +    return 0;
> +}

-- 
Stephen Hemminger <[EMAIL PROTECTED]>
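The per-cpu stats remark can be illustrated with a userspace sketch: each CPU updates only its own counter slot, and a reader sums the slots on demand. Everything here is an assumption made for illustration (the toy_ names, a fixed CPU count of 4, a 64-byte cache line); it does not use the kernel's per-cpu allocation API and is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_CPUS   4    /* assumption: fixed CPU count for the sketch */
#define TOY_CACHELINE 64   /* assumption: typical cache line size in bytes */

/* One slot per CPU, aligned so that two CPUs never share a cache line. */
struct toy_pcpu_stats {
    _Alignas(TOY_CACHELINE) unsigned long rx_packets;
    unsigned long rx_bytes;
};

struct toy_dev {
    struct toy_pcpu_stats pcpu[TOY_NR_CPUS];
};

/* Hot path: touch only the slot of the CPU handling the packet. */
void toy_rx_account(struct toy_dev *dev, unsigned int cpu, unsigned long len)
{
    assert(dev != NULL && cpu < TOY_NR_CPUS);
    dev->pcpu[cpu].rx_packets++;
    dev->pcpu[cpu].rx_bytes += len;
}

/* Slow path: fold all slots together when the totals are requested. */
void toy_rx_sum(const struct toy_dev *dev,
                unsigned long *packets, unsigned long *bytes)
{
    size_t i;

    assert(dev != NULL && packets != NULL && bytes != NULL);
    *packets = 0;
    *bytes = 0;
    for (i = 0; i < TOY_NR_CPUS; i++) {
        *packets += dev->pcpu[i].rx_packets;
        *bytes += dev->pcpu[i].rx_bytes;
    }
}
```

The trade-off is more memory per device and a slightly more expensive read in exchange for a transmit path that does not bounce a shared counter line between CPUs.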
[Devel] Re: [PATCH] Virtual ethernet tunnel
Pavel Emelianov wrote:
> +MODULE_DESCRIPTION("Virtual Ethernet Tunnel");
> +MODULE_LICENSE("GPL v2");

This seems to be missing MODULE_ALIAS_RTNL_LINK("veth");
[Devel] Re: [PATCH] Virtual ethernet tunnel
Pavel Emelianov wrote:
> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.
>
> Mainly it allows to communicate between network namespaces but
> it can be used as is as well.
>
> Eric recently sent a similar driver called etun. This
> implementation uses another interface - the RTM_NEWLINK
> message introduced by Patrick. The patch fits today netdev
> tree with Patrick's patches.
>
> The newlink callback is organized that way to make it easy
> to create the peer device in the separate namespace when we
> have them in kernel.
>
> +struct veth_priv {
> +    struct net_device *peer;
> +    struct net_device *dev;
> +    struct list_head list;
> +    struct net_device_stats stats;

You can use dev->stats instead.

> +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> +    struct net_device *rcv = NULL;
> +    struct veth_priv *priv, *rcv_priv;
> +    int length;
> +
> +    skb_orphan(skb);
> +
> +    priv = netdev_priv(dev);
> +    rcv = priv->peer;
> +    rcv_priv = netdev_priv(rcv);
> +
> +    if (!(rcv->flags & IFF_UP))
> +        goto outf;
> +
> +    skb->dev = rcv;

eth_type_trans already sets skb->dev.

> +    skb->pkt_type = PACKET_HOST;
> +    skb->protocol = eth_type_trans(skb, rcv);
> +    if (dev->features & NETIF_F_NO_CSUM)
> +        skb->ip_summed = rcv_priv->ip_summed;
> +
> +    dst_release(skb->dst);
> +    skb->dst = NULL;
> +
> +    secpath_reset(skb);
> +    nf_reset(skb);

Is skb->mark supposed to survive communication between different namespaces?

> +static const struct nla_policy veth_policy[VETH_INFO_MAX] = {
> +    [VETH_INFO_MAC]      = { .type = NLA_BINARY, .len = ETH_ALEN },
> +    [VETH_INFO_PEER]     = { .type = NLA_STRING },
> +    [VETH_INFO_PEER_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN },
> +};

The rtnl_link code looks fine. I don't like the VETH_INFO_MAC attribute very much though; we already have a generic device attribute for MAC addresses. Of course that only allows you to supply one MAC address, so I'm wondering what you think of allocating only a single device per newlink operation and binding them in a separate enslave operation?

> +enum {
> +    VETH_INFO_UNSPEC,
> +    VETH_INFO_MAC,
> +    VETH_INFO_PEER,
> +    VETH_INFO_PEER_MAC,
> +
> +    VETH_INFO_MAX
> +};

Please follow the

    #define VETH_INFO_MAX (__VETH_INFO_MAX - 1)

convention here.
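The requested convention makes VETH_INFO_MAX the highest valid attribute number rather than the number of enum entries, with tables indexed by attribute sized VETH_INFO_MAX + 1. A small sketch of how that could look follows; the attribute names are the ones from the patch, while toy_veth_len[] and toy_attr_len() are hypothetical stand-ins (not a real nla_policy) added purely for illustration.

```c
#include <assert.h>

enum {
    VETH_INFO_UNSPEC,
    VETH_INFO_MAC,
    VETH_INFO_PEER,
    VETH_INFO_PEER_MAC,

    __VETH_INFO_MAX   /* sentinel; the kernel convention uses this spelling */
};
#define VETH_INFO_MAX (__VETH_INFO_MAX - 1)

/* With the convention, MAX names the last real attribute. */
static_assert(VETH_INFO_MAX == VETH_INFO_PEER_MAC,
              "VETH_INFO_MAX must be the highest valid attribute");

/*
 * Hypothetical table of expected payload lengths, indexed by attribute
 * number; 0 means "no fixed length". Sized MAX + 1 so that the highest
 * attribute is a valid index.
 */
static const int toy_veth_len[VETH_INFO_MAX + 1] = {
    [VETH_INFO_MAC]      = 6,
    [VETH_INFO_PEER_MAC] = 6,
};

/* Return the expected length for attr, or -1 if attr is out of range. */
int toy_attr_len(int attr)
{
    if (attr <= VETH_INFO_UNSPEC || attr > VETH_INFO_MAX)
        return -1;
    return toy_veth_len[attr];
}
```

New attributes are then added above the sentinel, and every MAX + 1 sized table grows with them automatically.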
[Devel] Re: checkpointing and restoring processes
On Wed, 2007-06-06 at 13:37 +0200, Mark Pflueger wrote:
> hi everyone!
>
> i'm not subscribed to the list, so if you care to flame because of my noob
> question, just do it to the list, otherwise please cc me.
>
> i'm trying to write a checkpoint/restore module for processes and so have
> a basic version going already - problem is, when i restore the process,
> one of three things happens at random. first is, the process restored
> segfaults. second is, i get a kernel null pointer dereference and third
> is, i get a virtual address lookup error and a kernel crash. the trace
> back and the address always change.

Your patch definitely takes a simple, straightforward approach, which is good. But, there are a couple of things that need to get added.

For instance, when you make a copy of tsk->mm, what happens if that original task exits? It will drop its reference count and free that task, along with the mm. The new task will fault on its access to newtsk->mm because the mm has gone away.

Also, just setting tsk->pid is not enough to get the pid to show up in the system. It needs to make sure no other task has that pid as well as making entries in data structures like the pid allocation map.

In any case, it's nice to have other people interested in the same things! As Cedric suggested, please pop over to [EMAIL PROTECTED] There are at least two other efforts, besides ours, working toward the same goal, so you'll have lots of comrades there. ;)

-- Dave
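The mm lifetime problem described above comes down to sharing a pointer without taking a reference. A minimal userspace sketch of the usual get/put pattern follows; struct toy_mm and the toy_mm_* functions are hypothetical and use a plain int counter, whereas the kernel uses atomic reference counts on the real mm.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an address space object with a use count. */
struct toy_mm {
    int users;   /* number of tasks currently holding this mm */
};

/* Allocate an mm owned by one task. Returns NULL on allocation failure. */
struct toy_mm *toy_mm_alloc(void)
{
    struct toy_mm *mm = malloc(sizeof(*mm));

    if (mm)
        mm->users = 1;
    return mm;
}

/* A second task that wants to keep using mm must take its own reference. */
struct toy_mm *toy_mm_get(struct toy_mm *mm)
{
    assert(mm != NULL && mm->users > 0);
    mm->users++;
    return mm;
}

/*
 * Drop one reference. Returns 1 if this was the last user and the mm
 * was freed, 0 if other users remain.
 */
int toy_mm_put(struct toy_mm *mm)
{
    assert(mm != NULL && mm->users > 0);
    if (--mm->users == 0) {
        free(mm);
        return 1;
    }
    return 0;
}
```

If the restored task only copies the pointer, the original task's exit performs the final put and the copy is left dangling, which would be consistent with the random faults reported.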
[Devel] Re: [patch 1/5][RFC - ipv4/udp checkpoint/restart] : add lookup for unhashed inode
Serge E. Hallyn wrote:
> Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
> > The socket relies on the sockfs. In some cases, the socket are orphans and
> > it is not possible to access them via a file descriptor, this is the case for
> > example for timewait sockets. Hopefully, an inode is still usable to specify
> > a socket. This one can be retrieved from /proc/net/tcp for orphan sockets or
> > from a fstat.
> >
> > When a socket is created the socket inode is added to the sockfs.
> > Unfortunately, this one is not stored into the hashed inode list, so
> > I need a helper to browse the inode list contained in the superblock
> > of the sockfs.
> >
> > This is one solution, another solution is to stored the inode into
> > the hashed list when socket is created.
>
> I assume that would be unacceptable overhead on a very busy server.
>
> Walking all the inodes NUM_INODES(task_set) for a checkpoint could be a
> real bottleneck, but at least it's only at checkpoint time.
>
> Have you checked net-dev archives for discussions about not hashing
> these inodes? I suppose at some point you'll want to ask there what the
> preference is.

I didn't look at the netdev archive, but, sure, I will dig and ask on netdev@.

Thanks.

> But certainly for now this seems the right approach.
>
> > Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
>
> Acked-by: Serge E. Hallyn <[EMAIL PROTECTED]>
>
> (Or whatever tag they decide over on lkml that I should be using :)
>
> thanks,
> -serge
>
> PS - I won't be acking other patches bc I just haven't looked at
> netlink enough - so don't read anything more into that :)
>
> > ---
> >  fs/inode.c         | 29 +
> >  include/linux/fs.h |  1 +
> >  2 files changed, 30 insertions(+)
> >
> > Index: 2.6.20-cr/fs/inode.c
> > ===
> > --- 2.6.20-cr.orig/fs/inode.c
> > +++ 2.6.20-cr/fs/inode.c
> > @@ -877,6 +877,35 @@
> >
> >  EXPORT_SYMBOL(ilookup);
> >
> > +
> > +/**
> > + * ilookup_unhashed - search for an inode in the superblock
> > + * @sb:  super block of file system to search
> > + * @ino: inode number to search for
> > + *
> > + * The ilookup_unhashed browse the superblock inode list to find the inode.
> > + *
> > + * If the inode is found in the inode list stored in the superblock, the inode is
> > + * returned with an incremented reference count.
> > + *
> > + * Otherwise NULL is returned.
> > + */
> > +struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino)
> > +{
> > +    struct inode *inode = NULL;
> > +
> > +    spin_lock(&inode_lock);
> > +    list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
> > +        if (inode->i_ino == ino) {
> > +            __iget(inode);
> > +            break;
> > +        }
> > +    spin_unlock(&inode_lock);
> > +    return inode;
> > +
> > +}
> > +EXPORT_SYMBOL(ilookup_unhashed);
> > +
> >  /**
> >   * iget5_locked - obtain an inode from a mounted file system
> >   * @sb:  super block of file system
> > Index: 2.6.20-cr/include/linux/fs.h
> > ===
> > --- 2.6.20-cr.orig/include/linux/fs.h
> > +++ 2.6.20-cr/include/linux/fs.h
> > @@ -1657,6 +1657,7 @@
> >  extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
> >          int (*test)(struct inode *, void *), void *data);
> >  extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
> > +extern struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino);
> >
> >  extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
> >  extern struct inode * iget_locked(struct super_block *, unsigned long);
> >
> > --
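One detail of the quoted helper deserves a note: with the kernel's list_for_each_entry(), the cursor is not NULL after a full traversal without a match; it is computed from the list head itself. As far as can be told from reading the patch, ilookup_unhashed() would therefore return a bogus non-NULL pointer when the inode number is absent, despite the "Otherwise NULL is returned" comment. The userspace sketch below shows the usual remedy of keeping a separate result variable. The toy_ list and inode types are hypothetical miniatures written for this example, not the kernel's list.h or struct inode, and locking is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list with a sentinel head. */
struct toy_list {
    struct toy_list *next, *prev;
};

/* Hypothetical miniature inode: number, reference count, list linkage. */
struct toy_inode {
    unsigned long i_ino;
    int i_count;
    struct toy_list i_sb_list;
};

/* Recover the containing toy_inode from its embedded list node. */
#define toy_entry(ptr) \
    ((struct toy_inode *)((char *)(ptr) - offsetof(struct toy_inode, i_sb_list)))

void toy_list_init(struct toy_list *head)
{
    head->next = head->prev = head;
}

void toy_list_add(struct toy_list *node, struct toy_list *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/*
 * Walk the list looking for ino. The result lives in 'found', which stays
 * NULL unless a match is seen, so a miss cannot leak the loop cursor.
 */
struct toy_inode *toy_lookup(struct toy_list *head, unsigned long ino)
{
    struct toy_inode *found = NULL;
    struct toy_list *pos;

    assert(head != NULL);
    for (pos = head->next; pos != head; pos = pos->next) {
        struct toy_inode *inode = toy_entry(pos);

        if (inode->i_ino == ino) {
            inode->i_count++;   /* stands in for taking a reference */
            found = inode;
            break;
        }
    }
    return found;
}
```

The same shape carries over to the kernel helper: iterate with one variable, and assign a second, NULL-initialised variable only on a match.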
[Devel] Re: [PATCH] Module for ip utility to support veth device
Pavel Emelianov wrote:
> diff --git a/ip/iplink.c b/ip/iplink.c
> index 5170419..6975990 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
>               strlen(type));
>
>      lu = get_link_type(type);
> -    if (lu) {
> +    if (lu && argc) {
>          struct rtattr * data = NLMSG_TAIL(&req.n);
>          addattr_l(&req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);

I've folded this part into my iproute patch, thanks.
[Devel] [PATCH] Module for ip utility to support veth device
The usage is

    # ip link add [name] type veth [peer ] [mac ] [peer_mac ]

The Makefile is maybe not as beautiful as it could be. It is to be discussed.

One thing I noticed during testing is the following. When launching this with the link_veth.so module and not specifying any module-specific parameters, the kernel refuses to accept the packet when parsing the IFLA_LINKINFO. So the hunk for ip/iplink.c doesn't add an empty extra header if no extra data is expected.

Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>

---

diff --git a/ip/Makefile b/ip/Makefile
index 9a5bfe3..b46bce3 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o
 ALLOBJ=$(IPOBJ) $(RTMONOBJ)
 SCRIPTS=ifcfg rtpr routel routef
 TARGETS=ip rtmon
+LIBS=link_veth.so

-all: $(TARGETS) $(SCRIPTS)
+all: $(TARGETS) $(SCRIPTS) $(LIBS)

 ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL)

@@ -24,3 +25,6 @@ clean:

 LDLIBS += -ldl
 LDFLAGS+= -Wl,-export-dynamic
+
+%.so: %.c
+    $(CC) $(CFLAGS) -shared $< -o $@
diff --git a/ip/iplink.c b/ip/iplink.c
index 5170419..6975990 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
              strlen(type));

     lu = get_link_type(type);
-    if (lu) {
+    if (lu && argc) {
         struct rtattr * data = NLMSG_TAIL(&req.n);
         addattr_l(&req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);

diff --git a/ip/link_veth.c b/ip/link_veth.c
new file mode 100644
index 000..adfdef6
--- /dev/null
+++ b/ip/link_veth.c
@@ -0,0 +1,77 @@
+#include
+
+#include "utils.h"
+#include "ip_common.h"
+#include "veth.h"
+
+#define ETH_ALEN 6
+
+static void usage(void)
+{
+    printf("Usage: ip link add ... "
+            "[peer ] [mac ] [peer_mac ]\n");
+}
+
+static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
+        struct nlmsghdr *hdr)
+{
+    __u8 mac[ETH_ALEN];
+
+    for (; argc != 0; argv++, argc--) {
+        if (strcmp(*argv, "peer") == 0) {
+            argv++;
+            argc--;
+            if (argc == 0) {
+                usage();
+                return -1;
+            }
+
+            addattr_l(hdr, 1024, VETH_INFO_PEER,
+                    *argv, strlen(*argv));
+
+            continue;
+        }
+
+        if (strcmp(*argv, "mac") == 0) {
+            argv++;
+            argc--;
+            if (argc == 0) {
+                usage();
+                return -1;
+            }
+
+            if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+                return -1;
+
+            addattr_l(hdr, 1024, VETH_INFO_MAC,
+                    mac, ETH_ALEN);
+            continue;
+        }
+
+        if (strcmp(*argv, "peer_mac") == 0) {
+            argv++;
+            argc--;
+            if (argc == 0) {
+                usage();
+                return -1;
+            }
+
+            if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+                return -1;
+
+            addattr_l(hdr, 1024, VETH_INFO_PEER_MAC,
+                    mac, ETH_ALEN);
+            continue;
+        }
+
+        usage();
+        return -1;
+    }
+
+    return 0;
+}
+
+struct link_util veth_link_util = {
+    .id = "veth",
+    .parse_opt = veth_parse_opt,
+};
diff --git a/ip/veth.h b/ip/veth.h
new file mode 100644
index 000..74c8e1e
--- /dev/null
+++ b/ip/veth.h
@@ -0,0 +1,13 @@
+#ifndef __NET_VETH_H__
+#define __NET_VETH_H__
+
+enum {
+    VETH_INFO_UNSPEC,
+    VETH_INFO_MAC,
+    VETH_INFO_PEER,
+    VETH_INFO_PEER_MAC,
+
+    VETH_INFO_MAX
+};
+
+#endif
[Devel] [PATCH] Virtual ethernet tunnel
Veth stands for Virtual ETHernet. It is a simple tunnel driver that works at the link layer and looks like a pair of ethernet devices interconnected with each other.

Mainly it allows to communicate between network namespaces but it can be used as is as well.

Eric recently sent a similar driver called etun. This implementation uses another interface - the RTM_NEWLINK message introduced by Patrick. The patch fits today netdev tree with Patrick's patches.

The newlink callback is organized that way to make it easy to create the peer device in the separate namespace when we have them in kernel.

The patch for an ip utility is also provided.

Eric, since ethtool interface was from your patch, I add your Signed-off-by line.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7d57f4a..7e144be 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,12 @@ config TUN

       If you don't know what to use this for, you don't need it.

+config VETH
+    tristate "Virtual ethernet device"
+    ---help---
+      The device is an ethernet tunnel. Devices are created in pairs. When
+      one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
     tristate "General Instruments Surfboard 1000"
     depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a77affa..4764119 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..6746c91
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,391 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ */
+
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+
+#define DRV_NAME    "veth"
+#define DRV_VERSION "1.0"
+
+struct veth_priv {
+    struct net_device *peer;
+    struct net_device *dev;
+    struct list_head list;
+    struct net_device_stats stats;
+    unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+    const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+    { "peer_ifindex" },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+    cmd->supported   = 0;
+    cmd->advertising = 0;
+    cmd->speed       = SPEED_1;
+    cmd->duplex      = DUPLEX_FULL;
+    cmd->port        = PORT_TP;
+    cmd->phy_address = 0;
+    cmd->transceiver = XCVR_INTERNAL;
+    cmd->autoneg     = AUTONEG_DISABLE;
+    cmd->maxtxpkt    = 0;
+    cmd->maxrxpkt    = 0;
+    return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
+{
+    strcpy(info->driver, DRV_NAME);
+    strcpy(info->version, DRV_VERSION);
+    strcpy(info->fw_version, "N/A");
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+    switch(stringset) {
+    case ETH_SS_STATS:
+        memcpy(buf, &ethtool_stats_keys, sizeof(ethtool_stats_keys));
+        break;
+    }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+    return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+        struct ethtool_stats *stats, u64 *data)
+{
+    struct veth_priv *priv;
+
+    priv = netdev_priv(dev);
+    data[0] = priv->peer->ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+    struct veth_priv *priv;
+
+    priv = netdev_priv(dev);
+    return priv->ip_summed == CHECKSUM_UNNECESSARY;
+}
+
+static int veth_set_rx_csum(struct net_device *dev, u32 data)
+{
+    struct veth_priv *priv;
+
+    priv = netdev_priv(dev);
+    priv->ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE;
+    return 0;
+}
+
+static u32 veth_get_tx_csum(struct net_device *dev)
+{
+    return (dev->features & NETIF_F_NO_CSUM) != 0;
+}
+
+static int veth_set_tx_csum(struct net_device *dev, u32 data)
+{
+    if (data)
+        dev->features |= NETIF_F_NO_CSUM;
+    else
+        dev->features &= ~NETIF_F_NO_CSUM;
+    return 0;
+}
+
+static struct ethtool_ops veth_ethtool_ops = {
+    .get_settings = veth_get_settings,
+    .get_drvinfo  = veth_get_drvinfo,
+    .get_link     = ethtool_op_get_link,
+    .get_rx_csum  = veth_get_rx_csum,
+    .set_rx_csum  = vet
[Devel] Re: checkpointing and restoring processes
Quoting Cedric Le Goater ([EMAIL PROTECTED]):
> Mark Pflueger wrote:
> > hi everyone!
> >
> > i'm not subscribed to the list, so if you care to flame because of my noob
> > question, just do it to the list, otherwise please cc me.
>
> you should subscribe to [EMAIL PROTECTED] and send your ideas on that
> list. There's a BOF on that topic at OLS if you can attend.
>
> cheers,
>
> C.

Hi Mark,

Thanks for sending that patch. Ignoring code details for now, this is a good time to discuss checkpoint strategies.

It looks like you are writing task memory to userspace explicitly on demand. Dave Hansen is taking a different approach, using the swapfile to back up memory. Eventually we would enable a swapfile per container. Hopefully he can send a prototype out soon - I thought it was a really cool idea, although of course we'll have to see how it pans out in implementation :)

On the larger scale, there is the question of how we want to orchestrate the checkpoint.

    Do we want to have one syscall enable a checkpoint of a set of tasks, kicking out all the relevant information to userspace?

    Do we want userspace to orchestrate the checkpoint, asking for the tasks to be pulled off the runqueue, then polling for all the relevant information (through /proc/pid/fd, etc), then putting the tasks back on the runqueue?

    Same as the above, but using the container interface to make it more robust (i.e. pull all tasks off the runqueue using
        echo 1 > /containers/vserver1/job_1/suspend)
    against for instance tasks being forked while we're in a 'for p in $pids; suspend $pid'?

    Use the freezer code to freeze and initiate dump on a set of tasks or a container?

thanks,
-serge

> > i'm trying to write a checkpoint/restore module for processes and so have
> > a basic version going already - problem is, when i restore the process,
> > one of three things happens at random. first is, the process restored
> > segfaults. second is, i get a kernel null pointer dereference and third
> > is, i get a virtual address lookup error and a kernel crash. the trace
> > back and the address always change.
> >
> > the user space process is as simple as i could make it: (error checking
> > and debugging messages are left out)
> >
> >
> > void take_chkpt(void) {
> >     pid_t pid;
> >     char call_pid[10];
> >     char call_num[10];
> >
> >     chkptpid = getpid();
> >     snprintf(call_pid, 9, "%d", chkptpid);
> >     snprintf(call_num, 9, "%d", checkpointnum);
> >
> >     switch(pid = fork()) {
> >     case -1:
> >         fprintf(stderr, "Fork failed.\n");
> >         return;
> >         break;
> >     case 0: /* child process */
> >         if(!execl("child_take", call_pid, call_num, (char *)0))
> >             perror("execl: ");
> >         break;
> >     default: /* parent process */
> >         waitpid(pid, NULL, 0);
> >         break;
> >     }
> >
> >     return;
> > }
> >
> >
> > void restore_chkpts(void) {
> >     pid_t pid;
> >     char call_pid[10];
> >     char call_num[10];
> >
> >     ENTERFUN();
> >
> >     if(restore_retry) // do nothing on second call to restore
> >         return;
> >
> >     chkptpid = getpid();
> >     snprintf(call_pid, 9, "%d", chkptpid);
> >     snprintf(call_num, 9, "%d", checkpointnum);
> >
> >     switch(pid = fork()) {
> >     case -1:
> >         fprintf(stderr, "MP: Fork failed.\n");
> >         return;
> >         break;
> >     case 0: /* child process */
> >         if(!execl("child_restore", call_pid, call_num, (char *)0))
> >             perror("execl: ");
> >         break;
> >     default: /* parent process */
> >         INF(("Parent Process"));
> >         restore_retry=1;
> >         INF(("Wait for Child..."));
> >         waitpid(pid, NULL, 0);
> >         break;
> >     }
> >
> >     LEAVEFUN();
> >
> >     return;
> > }
> >
> > int main(int argc, char* argv[]) {
> >     take_chkpt();
> >     printf("Hello cruel world!\n");
> >     restore_chkpts();
> >     return 0;
> > }
> >
> > where child_take and child_restore do the following:
> >
> >
> > void child_take_chkpt(int chkptpid, int checkpointnum) {
> >     struct chkpt_ioctl chkptio;
> >     int dev_fd; // ioctl device file
> >     char chkptname[30];
> >
> >     if ((dev_fd = open(CHKPT_DEVICE, O_RDWR)) < 0) {
> >         perror("MP: Open device file");
> >         exit(EXIT_FAILURE);
> >     }
> >     chkptio.pid = chkptpid;
> >     snprintf(chkptname, 29, "/tmp/chkpt_%d_%d", chkptio.pid,
> >              checkpointnum);
> >     chkptio.file = creat(chkptname, 00755);
> >     sleep(1); // to go sure the parent process is in waitpid -- ugly,
> > but works
> >     kill(chkptio.pid, SIGSTOP);
> >     sleep(1);
> >     ioct
[Devel] Re: [patch 1/5][RFC - ipv4/udp checkpoint/restart] : add lookup for unhashed inode
Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]): > The socket relies on the sockfs. In some cases, the socket are orphans and > it is not possible to access them via a file descriptor, this is the case for > example for timewait sockets. Hopefully, an inode is still usable to specify > a socket. This one can be retrieved from /proc/net/tcp for orphan sockets or > from a fstat. > > When a socket is created the socket inode is added to the sockfs. > Unfortunatly, this one is not stored into the hashed inode list, so > I need a helper to browse the inode list contained in the superblock > of the sockfs. > > This is one solution, another solution is to stored the inode into > the hashed list when socket is created. I assume that would be unacceptable overhead on a very busy server. Walking all the inodes NUM_INODES(task_set) for a checkpoint could be a real bottleneck, but at least it's only at checkpoint time. Have you checked net-dev archives for discussions about not hashing these inodes? I suppose at some point you'll want to ask there what the preference is. But certainly for now this seems the right approach. > Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> Acked-by: Serge E. Hallyn <[EMAIL PROTECTED]> (Or whatever tag they decide over on lkml that I should be using :) thanks, -serge PS - I won't be acking other patches bc I just haven't looked at netlink enough - so don't read anything more into that :) > --- > fs/inode.c | 29 + > include/linux/fs.h |1 + > 2 files changed, 30 insertions(+) > > Index: 2.6.20-cr/fs/inode.c > === > --- 2.6.20-cr.orig/fs/inode.c > +++ 2.6.20-cr/fs/inode.c > @@ -877,6 +877,35 @@ > > EXPORT_SYMBOL(ilookup); > > + > +/** > + * ilookup_unhased - search for an inode in the superblock > + * @sb: super block of file system to search > + * @ino: inode number to search for > + * > + * The ilookup_unhashed browse the superblock inode list to find the inode. 
> + * > + * If the inode is found in the inode list stored in the superblock, the > inode is > + * with an incremented reference count. > + * > + * Otherwise NULL is returned. > + */ > +struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino) > +{ > + struct inode *inode = NULL; > + > + spin_lock(&inode_lock); > + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) > + if (inode->i_ino == ino) { > + __iget(inode); > + break; > + } > + spin_unlock(&inode_lock); > + return inode; > + > +} > +EXPORT_SYMBOL(ilookup_unhashed); > + > /** > * iget5_locked - obtain an inode from a mounted file system > * @sb: super block of file system > Index: 2.6.20-cr/include/linux/fs.h > === > --- 2.6.20-cr.orig/include/linux/fs.h > +++ 2.6.20-cr/include/linux/fs.h > @@ -1657,6 +1657,7 @@ > extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval, > int (*test)(struct inode *, void *), void *data); > extern struct inode *ilookup(struct super_block *sb, unsigned long ino); > +extern struct inode *ilookup_unhashed(struct super_block *sb, unsigned long > ino); > > extern struct inode * iget5_locked(struct super_block *, unsigned long, int > (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *); > extern struct inode * iget_locked(struct super_block *, unsigned long); > > -- > ___ > Containers mailing list > [EMAIL PROTECTED] > https://lists.linux-foundation.org/mailman/listinfo/containers ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: breakfast at ols?
Cedric Le Goater wrote:
> Serge E. Hallyn wrote:
>> Last year we all met for breakfast at OLS. Now we've all pretty much
>> all already met so maybe it's less exciting, but do people (who will be
>> at OLS) care to meet for breakfast on the thursday or friday?
>
> OK for me, if i can skip the pancakes with tons of cream and jam.

Ok for me too.
[Devel] Re: breakfast at ols?
Serge E. Hallyn wrote:
> Last year we all met for breakfast at OLS. Now we've all pretty much
> all already met so maybe it's less exciting, but do people (who will be
> at OLS) care to meet for breakfast on the thursday or friday?

OK for me, if i can skip the pancakes with tons of cream and jam.

C.
[Devel] Re: checkpointing and restoring processes
Mark Pflueger wrote: > hi everyone! > > i'm not subscribed to the list, so if you care to flame because of my noob > question, just do it to the list, otherwise please cc me. you should subscribe to [EMAIL PROTECTED] and send your ideas on that list. There's a BOF on that topic at OLS if you can attend. cheers, C. > i'm trying to write a checkpoint/restore module for processes and so have > a basic version going already - problem is, when i restore the process, > one of three things happens at random. first is, the process restored > segfaults. second is, i get a kernel null pointer dereference and third > is, i get a virtual address lookup error and a kernel crash. the trace > back and the address always change. > > the user space process is as simple as i could make it: (error checking > and debugging messages are left out) > > > void take_chkpt(void) { > pid_t pid; > char call_pid[10]; > char call_num[10]; > > chkptpid = getpid(); > snprintf(call_pid, 9, "%d", chkptpid); > snprintf(call_num, 9, "%d", checkpointnum); > > switch(pid = fork()) { > case -1: > fprintf(stderr, "Fork failed.\n"); > return; > break; > case 0: /* child process */ > if(!execl("child_take", call_pid, call_num, (char *)0)) > perror("execl: "); > break; > default: /* parent process */ > waitpid(pid, NULL, 0); > break; > } > > return; > } > > > void restore_chkpts(void) { > pid_t pid; > char call_pid[10]; > char call_num[10]; > > ENTERFUN(); > > if(restore_retry) // do nothing on second call to restore > return; > > chkptpid = getpid(); > snprintf(call_pid, 9, "%d", chkptpid); > snprintf(call_num, 9, "%d", checkpointnum); > > switch(pid = fork()) { > case -1: > fprintf(stderr, "MP: Fork failed.\n"); > return; > break; > case 0: /* child process */ > if(!execl("child_restore", call_pid, call_num, (char *)0)) > perror("execl: "); > break; > default: /* parent process */ > INF(("Parent Process")); > restore_retry=1; > INF(("Wait for Child...")); > waitpid(pid, NULL, 0); > break; > } > > 
LEAVEFUN(); > > return; > } > > int main(int argc, char* argv[]) { > take_chkpt(); > printf("Hello cruel world!\n"); > restore_chkpts(); > return 0; > } > > where child_take and child_restore do the following: > > > void child_take_chkpt(int chkptpid, int checkpointnum) { > struct chkpt_ioctl chkptio; > int dev_fd; // ioctl device file > char chkptname[30]; > > if ((dev_fd = open(CHKPT_DEVICE, O_RDWR)) < 0) { > perror("MP: Open device file"); > exit(EXIT_FAILURE); > } > chkptio.pid = chkptpid; > snprintf(chkptname, 29, "/tmp/chkpt_%d_%d", chkptio.pid, > checkpointnum); > chkptio.file = creat(chkptname, 00755); > sleep(1); // to go sure the parent process is in waitpid -- ugly, > but works > kill(chkptio.pid, SIGSTOP); > sleep(1); > ioctl(dev_fd, CHKPT_IOCTL_SAVE, (unsigned long)&chkptio); > close(dev_fd); > close(chkptio.file); > kill(chkptio.pid, SIGCONT); > exit(0); > } > > void child_restore_chkpts(int chkptpid, int checkpointnum) { > struct chkpt_ioctl chkptio; > int dev_fd; // ioctl device file > char chkptname[30]; > > snprintf(chkptname, 29, "/tmp/chkpt_%d_%d", chkptpid, > checkpointnum-1); > chkptio.file = open(chkptname, O_RDONLY); > chkptio.pid = chkptpid; > dev_fd = open(CHKPT_DEVICE, O_RDWR); > sleep(1); > kill(chkptpid, SIGSTOP); > sleep(1); > ioctl(dev_fd, CHKPT_IOCTL_RESTORE, (unsigned long)&chkptio); > close(chkptio.file); > close(dev_fd); > kill(chkptpid, SIGCONT); > exit(0); > } > > the header for the files is this: > > > enum { > CHKPT_IOCTL_SAVE, > CHKPT_IOCTL_RESTORE > }; > > struct chkpt_ioctl { > pid_t pid; // for fork tests > int file; > }; > > struct chkpt { > pid_t pid; // for fork tests > struct pt_regs regs; > unsigned int datasize; > unsigned int brksize; > unsigned int stacksize; > }; > > > and finally the kernel module: > > int chkpt_ioctl_handler(struct inode *i, struct file *f, > unsigned int cmd, unsigned long arg) { > struct chkpt_ioctl pmio, *u_pmio; > int ret = -1; > > u_pmio = (struct chkpt_ioctl *)arg; > > switch(cmd) { >
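A side note on the user-space driver quoted earlier in this message (take_chkpt() and restore_chkpts()), separate from the kernel crash itself. With execl(), the first argument after the path becomes the child's argv[0], so call_pid ends up as argv[0] rather than argv[1]. Also, execl() returns only on failure, and then returns -1, so the `if(!execl(...))` test never reaches perror() and the child falls back into the caller's code. The following is only a sketch of how the fork/exec/wait step could look, assuming the helpers read the pid and checkpoint number from argv[1] and argv[2] (their main() is not shown in the post); run_helper() is a hypothetical name, not something from the original code.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork, exec `path` with the target pid and the
 * checkpoint number as argv[1] and argv[2], then wait for it.
 * Returns the helper's exit status, or -1 if fork()/waitpid() failed
 * or the helper was killed by a signal.
 */
static int run_helper(const char *path, pid_t target, int checkpointnum)
{
	char call_pid[16], call_num[16];
	int status;
	pid_t pid;

	snprintf(call_pid, sizeof(call_pid), "%d", (int)target);
	snprintf(call_num, sizeof(call_num), "%d", checkpointnum);

	pid = fork();
	if (pid < 0) {
		perror("fork");
		return -1;
	}
	if (pid == 0) {
		/* The first argument after the path becomes argv[0]. */
		execl(path, path, call_pid, call_num, (char *)NULL);
		/* execl() only returns on failure; never fall back into the caller. */
		perror("execl");
		_exit(127);
	}
	if (waitpid(pid, &status, 0) < 0) {
		perror("waitpid");
		return -1;
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape, take_chkpt() would reduce to something like run_helper("child_take", getpid(), checkpointnum). It does not address the sleep()-based synchronisation with SIGSTOP, and it is not claimed to explain the oops on restore.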
[Devel] [patch 5/5][RFC - ipv4/udp checkpoint/restart] : c/r the udp part of the socket
From: Daniel Lezcano <[EMAIL PROTECTED]> This patch defines a set of netlink attributes to store/retrieve udp option and endpoints. The logic is to extend the netlink message attribute to take into account these new values. The ops of struct sock is extended with the dump/restore callbacks, so when a socket is asked to be checkpointed, the call will fail if the dump/restore is not implemented in the protocol. That allows to bring C/R functionnality for each protocol step by step. * At dump time : the local binding is retrieve from kernel_getname and distant binding is retrieve with kernel_getpeername. * At restore time : the local binding is set by the kernel_bind call and distant binding is set by kernel_connect If the local binding was done with an autobind, the userlock flags, will not be set, so the flag are resetted if they were not set during the dump. One point to be discussed is : should we C/R sendQ and recvQ knowing the protocol is not reliable ? Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/af_inet_cr.h |9 +++ include/linux/udp_cr.h | 26 + include/net/sock.h |6 +- net/ipv4/Makefile |2 net/ipv4/af_inet_cr.c | 23 net/ipv4/udp.c |6 +- net/ipv4/udp_cr.c | 119 + 7 files changed, 187 insertions(+), 4 deletions(-) Index: 2.6.20-cr/include/linux/af_inet_cr.h === --- 2.6.20-cr.orig/include/linux/af_inet_cr.h +++ 2.6.20-cr/include/linux/af_inet_cr.h @@ -90,6 +90,15 @@ AF_INET_CR_ATTR_IPOPT_MREQ, AF_INET_CR_ATTR_IPOPT_MULTICAST_IF, + /* udp options */ + AF_INET_CR_ATTR_UDPOPT_CORK, + + /* udp protocol */ + UDP_CR_ATTR_BIND, + UDP_CR_ATTR_BIND_ADDR_ULOCK, + UDP_CR_ATTR_BIND_PORT_ULOCK, + UDP_CR_ATTR_PEER, + AF_INET_CR_ATTR_MAX }; #endif Index: 2.6.20-cr/include/linux/udp_cr.h === --- /dev/null +++ 2.6.20-cr/include/linux/udp_cr.h @@ -0,0 +1,26 @@ +/* + * + * + */ +#ifndef _UDP_CR_H +#define _UDP_CR_H +#include +#include +#include + +#ifdef CONFIG_IP_CR +extern int udp_dump(struct socket *sock, struct sk_buff *skb); +extern int 
udp_restore(struct socket *sock, const struct genl_info *info); +#else +static inline int udp_dump(struct socket *sock, struct sk_buff *skb) +{ + return -ENOSYS; +} + +static inline int udp_restore(struct socket *sock, const struct genl_info *info) +{ + return -ENOSYS; +} +#endif /* CONFIG_IP_CR */ + +#endif Index: 2.6.20-cr/include/net/sock.h === --- 2.6.20-cr.orig/include/net/sock.h +++ 2.6.20-cr/include/net/sock.h @@ -55,6 +55,7 @@ #include #include #include +#include /* * This structure really needs to be cleaned up. @@ -554,7 +555,10 @@ void(*hash)(struct sock *sk); void(*unhash)(struct sock *sk); int (*get_port)(struct sock *sk, unsigned short snum); - +#ifdef CONFIG_IP_CR + int (*dump)(struct socket *sock, struct sk_buff *skb); + int (*restore)(struct socket *sock, const struct genl_info *info); +#endif /* Memory pressure */ void(*enter_memory_pressure)(void); atomic_t*memory_allocated; /* Current allocated memory. */ Index: 2.6.20-cr/net/ipv4/Makefile === --- 2.6.20-cr.orig/net/ipv4/Makefile +++ 2.6.20-cr/net/ipv4/Makefile @@ -53,4 +53,4 @@ obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o -obj-$(CONFIG_IP_CR) += af_inet_cr.o +obj-$(CONFIG_IP_CR) += af_inet_cr.o udp_cr.o Index: 2.6.20-cr/net/ipv4/af_inet_cr.c === --- 2.6.20-cr.orig/net/ipv4/af_inet_cr.c +++ 2.6.20-cr/net/ipv4/af_inet_cr.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include @@ -72,6 +73,14 @@ [AF_INET_CR_ATTR_IPOPT_OPTIONS] = { .len = sizeof(struct ip_options) + 40 }, + /* udp options */ + [AF_INET_CR_ATTR_UDPOPT_CORK] = { .type = NLA_FLAG }, + + /* udp endpoints */ + [UDP_CR_ATTR_BIND] = { .len = sizeof(struct sockaddr_in) }, + [UDP_CR_ATTR_PEER] = { .len = sizeof(struct sockaddr_in) }, + [UDP_CR_ATTR_BIND_ADDR_ULOCK] = { .type = NLA_FLAG }, + [UDP_CR_ATTR_BIND_PORT_ULOCK] = { .type = NLA_FLAG }, }; /* @@ -513,6 +522,9 @@ void *msg_head; int ret; + if (!sk->sk_prot->dump) + return -ENOSYS; + if (family != AF_INET) return -EINVAL; @@ 
-542,6 +554,10 @@ if (ret) goto out; +
[Devel] [patch 4/5][RFC - ipv4/udp checkpoint/restart] : c/r the inet options of the socket
From: Daniel Lezcano <[EMAIL PROTECTED]> This patch defines a set of netlink attributes to store/retrieve inet options. The logic is to extend the netlink message attribute to take into account these new values. The multicast list is browsed first and the netlink nested attribute is filled in the reverse order. That allows, when restoring the socket, to keep the initial order of the multicast list. Not really a big issue if the list are inverted, but that facilitate the test because the attribute will stay exactly, the same and comparison with initial socket and restored socket can be done with a simple "memcmp". Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/af_inet_cr.h | 19 net/ipv4/af_inet_cr.c | 205 - 2 files changed, 223 insertions(+), 1 deletion(-) Index: 2.6.20-cr/include/linux/af_inet_cr.h === --- 2.6.20-cr.orig/include/linux/af_inet_cr.h +++ 2.6.20-cr/include/linux/af_inet_cr.h @@ -71,6 +71,25 @@ AF_INET_CR_ATTR_SOCKOPT_SNDBUF_ULOCK, AF_INET_CR_ATTR_SOCKOPT_RCVBUF_ULOCK, + /* ip options */ + AF_INET_CR_ATTR_IPOPT_OPTIONS, + AF_INET_CR_ATTR_IPOPT_PKTINFO, + AF_INET_CR_ATTR_IPOPT_RECVTOS, + AF_INET_CR_ATTR_IPOPT_RECVTTL, + AF_INET_CR_ATTR_IPOPT_RECVOPTS, + AF_INET_CR_ATTR_IPOPT_RETOPTS, + AF_INET_CR_ATTR_IPOPT_TOS, + AF_INET_CR_ATTR_IPOPT_TTL, + AF_INET_CR_ATTR_IPOPT_HDRINCL, + AF_INET_CR_ATTR_IPOPT_RECVERR, + AF_INET_CR_ATTR_IPOPT_MTU_DISCOVER, + AF_INET_CR_ATTR_IPOPT_ROUTER_ALERT, + AF_INET_CR_ATTR_IPOPT_MULTICAST_TTL, + AF_INET_CR_ATTR_IPOPT_MULTICAST_LOOP, + AF_INET_CR_ATTR_IPOPT_MEMBERSHIP, + AF_INET_CR_ATTR_IPOPT_MREQ, + AF_INET_CR_ATTR_IPOPT_MULTICAST_IF, + AF_INET_CR_ATTR_MAX }; #endif Index: 2.6.20-cr/net/ipv4/af_inet_cr.c === --- 2.6.20-cr.orig/net/ipv4/af_inet_cr.c +++ 2.6.20-cr/net/ipv4/af_inet_cr.c @@ -13,8 +13,12 @@ #include #include #include +#include +#include #include +#include + /* * Netlink message policy definition */ @@ -44,6 +48,30 @@ [AF_INET_CR_ATTR_SOCKOPT_SNDTIMEO] = { .len = sizeof(struct timeval) }, 
[AF_INET_CR_ATTR_SOCKOPT_LINGER] = { .len = sizeof(struct linger) }, [AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE] = { .len = IFNAMSIZ }, + + /* ip options */ + [AF_INET_CR_ATTR_IPOPT_PKTINFO] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_RECVTOS] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_RECVTTL] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_RECVOPTS]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_RETOPTS] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_HDRINCL] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_RECVERR] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_ROUTER_ALERT]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_MULTICAST_LOOP] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_IPOPT_MULTICAST_IF]= { .type = NLA_U32 }, + [AF_INET_CR_ATTR_IPOPT_TOS] = { .type = NLA_U8 }, + [AF_INET_CR_ATTR_IPOPT_TTL] = { .type = NLA_U8 }, + [AF_INET_CR_ATTR_IPOPT_MTU_DISCOVER]= { .type = NLA_U8 }, + [AF_INET_CR_ATTR_IPOPT_MULTICAST_TTL] = { .type = NLA_U8 }, + [AF_INET_CR_ATTR_IPOPT_MEMBERSHIP] = { .type = NLA_NESTED }, + + [AF_INET_CR_ATTR_IPOPT_MREQ] = + { .len = sizeof(struct ip_mreqn) }, + + [AF_INET_CR_ATTR_IPOPT_OPTIONS] = + { .len = sizeof(struct ip_options) + 40 }, + }; /* @@ -77,6 +105,28 @@ { SO_BINDTODEVICE, AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE, 0, SET }, }; +/* + * ip options association with netlink attribute + */ +struct af_inet_cr_optattr ip_options[] = { + { IP_PKTINFO,AF_INET_CR_ATTR_IPOPT_PKTINFO,1, BOTH }, + { IP_RECVTOS,AF_INET_CR_ATTR_IPOPT_RECVTOS,1, BOTH }, + { IP_RECVTTL,AF_INET_CR_ATTR_IPOPT_RECVTTL,0, BOTH }, + { IP_RECVOPTS, AF_INET_CR_ATTR_IPOPT_RECVOPTS, 1, BOTH }, + { IP_RETOPTS,AF_INET_CR_ATTR_IPOPT_RETOPTS,1, BOTH }, + { IP_TOS,AF_INET_CR_ATTR_IPOPT_TOS,0, BOTH }, + { IP_TTL,AF_INET_CR_ATTR_IPOPT_TTL,0, BOTH }, + { IP_HDRINCL,AF_INET_CR_ATTR_IPOPT_HDRINCL,1, BOTH }, + { IP_RECVERR,AF_INET_CR_ATTR_IPOPT_RECVERR,1, BOTH }, + { IP_MTU_DISCOVER, AF_INET_CR_ATTR_IPOPT_MTU_DISCOVER, 0, BOTH }, + { IP_MULTICAST_TTL, 
AF_INET_CR_ATTR_IPOPT_MULTICAST_TTL, 1, BOTH }, + { IP_MULTICAST_LOOP, AF_INET_CR_ATTR_IPOPT_MULTICAST_LOOP, 0, BOTH }, + { IP_MULTICAST_IF, AF_INET_CR_ATTR_IPOPT_MULTICAST_IF,
[Devel] [patch 3/5][RFC - ipv4/udp checkpoint/restart] : c/r the socket information and options
From: Daniel Lezcano <[EMAIL PROTECTED]> This patch defines a set of netlink attributes to store/retrieve socket options. * At dump time, a netlink message specify the inode of the socket to be checkpointed. The socket is retrieved with the inode number. A new netlink message is built in order to store the socket information. The type, state and socket options are stored into it and the netlink message is transmitted to the requestor. * At restore time, the netlink message contains the type of the socket. A new socket is created, using this type and the attributes are browsed in order to use the values to restore the differents options. The choice of the C/R is to stick as much as possible to the user/kernel frontier. For this reason, the kernel_{set,get}sockopt are used. That allows to reduce code and delegate the different checks to the corresponding function. Unfortunatly, some get/set are not symetric, so some options can be retrieved but not set and vice-versa. For this reason, there are a few helpers, and the option definitions contains a GET|SET|BOTH flag. Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]> --- include/linux/af_inet_cr.h | 61 net/ipv4/af_inet_cr.c | 640 +++-- 2 files changed, 680 insertions(+), 21 deletions(-) Index: 2.6.20-cr/net/ipv4/af_inet_cr.c === --- 2.6.20-cr.orig/net/ipv4/af_inet_cr.c +++ 2.6.20-cr/net/ipv4/af_inet_cr.c @@ -12,36 +12,644 @@ #include #include #include +#include #include /* - * af_inet_cr_nldump : this function is called when a netlink message is received - * with AF_INET_CR_CMD_DUMP command. 
+ * Netlink message policy definition + */ +static struct nla_policy af_inet_cr_policy[AF_INET_CR_ATTR_MAX] = { + [AF_INET_CR_ATTR_INODE]= { .type = NLA_U32 }, + + [AF_INET_CR_ATTR_SOCK_STATE] = { .type = NLA_U32 }, + [AF_INET_CR_ATTR_SOCK_TYPE]= { .type = NLA_U32 }, + + [AF_INET_CR_ATTR_SOCKOPT_BROADCAST]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_DEBUG]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_DONTROUTE]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_KEEPALIVE]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_OOBINLINE]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_PASSCRED] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_REUSEADDR]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_TIMESTAMP]= { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_SNDBUF_ULOCK] = { .type = NLA_FLAG }, + [AF_INET_CR_ATTR_SOCKOPT_RCVBUF_ULOCK] = { .type = NLA_FLAG }, + + [AF_INET_CR_ATTR_SOCKOPT_RCVBUF] = { .type = NLA_U32 }, + [AF_INET_CR_ATTR_SOCKOPT_SNDBUF] = { .type = NLA_U32 }, + [AF_INET_CR_ATTR_SOCKOPT_PRIORITY] = { .type = NLA_U32 }, + [AF_INET_CR_ATTR_SOCKOPT_RCVLOWAT] = { .type = NLA_U32 }, + + [AF_INET_CR_ATTR_SOCKOPT_RCVTIMEO] = { .len = sizeof(struct timeval) }, + [AF_INET_CR_ATTR_SOCKOPT_SNDTIMEO] = { .len = sizeof(struct timeval) }, + [AF_INET_CR_ATTR_SOCKOPT_LINGER] = { .len = sizeof(struct linger) }, + [AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE] = { .len = IFNAMSIZ }, +}; + +/* + * Generic netlink family definition + */ +static struct genl_family af_inet_cr_family = { + .id = GENL_ID_GENERATE, + .name = "af_inet_cr", + .version= 0x1, + .maxattr= AF_INET_CR_ATTR_MAX - 1, +}; + +/* + * socket options association with netlink attribute + */ +struct af_inet_cr_optattr socket_options[] = { + { SO_BROADCAST,AF_INET_CR_ATTR_SOCKOPT_BROADCAST,0, BOTH }, + { SO_DEBUG,AF_INET_CR_ATTR_SOCKOPT_DEBUG,0, BOTH }, + { SO_DONTROUTE,AF_INET_CR_ATTR_SOCKOPT_DONTROUTE,0, BOTH }, + { SO_KEEPALIVE,AF_INET_CR_ATTR_SOCKOPT_KEEPALIVE,0, BOTH }, + { 
SO_OOBINLINE,AF_INET_CR_ATTR_SOCKOPT_OOBINLINE,0, BOTH }, + { SO_PRIORITY, AF_INET_CR_ATTR_SOCKOPT_PRIORITY, 0, BOTH }, + { SO_RCVLOWAT, AF_INET_CR_ATTR_SOCKOPT_RCVLOWAT, 0, BOTH }, + { SO_RCVBUF, AF_INET_CR_ATTR_SOCKOPT_RCVBUF, 0, GET }, + { SO_SNDBUF, AF_INET_CR_ATTR_SOCKOPT_SNDBUF, 0, GET }, + { SO_REUSEADDR,AF_INET_CR_ATTR_SOCKOPT_REUSEADDR,0, BOTH }, + { SO_TIMESTAMP,AF_INET_CR_ATTR_SOCKOPT_TIMESTAMP,0, BOTH }, + { SO_LINGER, AF_INET_CR_ATTR_SOCKOPT_LINGER, 0, BOTH }, + { SO_RCVTIMEO, AF_INET_CR_ATTR_SOCKOPT_RCVTIMEO, 0, BOTH }, + { SO_SNDTIMEO, AF_INET_CR_ATTR_SOCKOPT_SNDTIMEO, 0, BOTH }, + { SO_BINDTODEVICE, AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE, 0, SET }, +}; + + +/* + * socket_lookup : search for socket using the inode number + * + * @sb : superblock associated to
[Devel] [patch 2/5][RFC - ipv4/udp checkpoint/restart] : provide compilation option and genetlink framework
From: Daniel Lezcano <[EMAIL PROTECTED]>

This patchset provides the AF_INET C/R option in the makefile and a generic netlink framework for passing the socket messages.

It seems that we are encouraged to use netlink instead of /proc, /sysfs or ioctls: http://kerneltrap.org/node/6637

I found that there are a lot of advantages to using netlink:

 * the protocol is secure
 * the protocol describes the socket attributes and brings a small abstraction layer over the internal kernel structures/data types. This is a good way to implement upward compatibility (e.g. moving a socket to a kernel with a newer version)
 * the amount of socket information is variable. With netlink, we don't need to take care of the size of the data (e.g. just read data as long as there is something to read)
 * the netlink attributes can be dumped directly to disk and reused to restore the socket, or examined for specific processing (I don't have an example).

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 include/linux/af_inet_cr.h |   15 +
 net/ipv4/Kconfig           |    8 +++
 net/ipv4/Makefile          |    1
 net/ipv4/af_inet_cr.c      |  119 +
 4 files changed, 143 insertions(+)

Index: 2.6.20-cr/net/ipv4/Kconfig
===
--- 2.6.20-cr.orig/net/ipv4/Kconfig
+++ 2.6.20-cr/net/ipv4/Kconfig
@@ -1,6 +1,14 @@
 #
 # IP configuration
 #
+config IP_CR
+	tristate "IP: checkpoint/restart"
+	help
+	  The checkpoint/restart allows to dump the sockets states and
+	  the associated protocols internals to the userspace land.
+	  The data can be reused to recreate the socket in the same state.
+	  It's safe to say N.
+ config IP_MULTICAST bool "IP: multicasting" help Index: 2.6.20-cr/net/ipv4/Makefile === --- 2.6.20-cr.orig/net/ipv4/Makefile +++ 2.6.20-cr/net/ipv4/Makefile @@ -53,3 +53,4 @@ obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o +obj-$(CONFIG_IP_CR) += af_inet_cr.o Index: 2.6.20-cr/net/ipv4/af_inet_cr.c === --- /dev/null +++ 2.6.20-cr/net/ipv4/af_inet_cr.c @@ -0,0 +1,119 @@ +/* + * Copyright (C) 2007 IBM Corporation + * + * Author: Daniel Lezcano <[EMAIL PROTECTED]> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation, version 2 of the + * License. + */ + +#include +#include +#include +#include + +/* + * af_inet_cr_nldump : this function is called when a netlink message is received + * with AF_INET_CR_CMD_DUMP command. + * @skb : the netlink packet giving the restore command + * @info : the generic netlink message + */ +static int af_inet_cr_nldump(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + +/* + * af_inet_cr_nldump : this function is called when a netlink message is received + * with AF_INET_CR_CMD_RESTORE command. 
+ * @skb : the netlink packet giving the restore command + * @info : the generic netlink message + */ +static int af_inet_cr_nlrestore(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + +/* + * Netlink message policy definition + */ +static struct nla_policy af_inet_cr_policy[AF_INET_CR_ATTR_MAX] = { + [AF_INET_CR_ATTR_INODE] = { .type = NLA_U32 }, +}; + +/* + * Netlink dumping command configuration + */ +static struct genl_ops af_inet_cr_nldump_ops = { + .cmd = AF_INET_CR_CMD_DUMP, + .doit = af_inet_cr_nldump, + .policy = af_inet_cr_policy, +}; + +/* + * Netlink restore command configuration + */ +static struct genl_ops af_inet_cr_nlrestore_ops = { + .cmd = AF_INET_CR_CMD_RESTORE, + .doit = af_inet_cr_nlrestore, + .policy = af_inet_cr_policy, +}; + +/* + * Generic netlink family definition + */ +static struct genl_family af_inet_cr_family = { + .id = GENL_ID_GENERATE, + .name = "af_inet_cr", + .version= 0x1, + .maxattr= AF_INET_CR_ATTR_MAX - 1, +}; + +/* + * af_inet_cr_init : this function is called at initialization + * time. It register the generic netlink family associated with + * this module and hang different ops with it. + */ +static __init int af_inet_cr_init(void) +{ + int err; + + err = genl_register_family(&af_inet_cr_family); + if (err < 0) + goto out; + + err = genl_register_ops(&af_inet_cr_family, + &af_inet_cr_nldump_ops); + if (err < 0) + goto out_unregister_fam; + + err = genl_register_ops(&af_inet_cr_family, + &af_inet_cr_nlrestore_ops); + if (err < 0) + goto out_unregister_dump; + + return 0; +
[Devel] [patch 1/5][RFC - ipv4/udp checkpoint/restart] : add lookup for unhashed inode
The socket relies on the sockfs. In some cases sockets are orphans and cannot be accessed via a file descriptor; this is the case, for example, for timewait sockets. Fortunately, an inode is still usable to specify a socket. It can be retrieved from /proc/net/tcp for orphan sockets, or from an fstat.

When a socket is created, the socket inode is added to the sockfs. Unfortunately, it is not stored in the hashed inode list, so I need a helper to browse the inode list contained in the superblock of the sockfs.

This is one solution; another solution is to store the inode in the hashed list when the socket is created.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 fs/inode.c         |   29 +
 include/linux/fs.h |    1 +
 2 files changed, 30 insertions(+)

Index: 2.6.20-cr/fs/inode.c
===
--- 2.6.20-cr.orig/fs/inode.c
+++ 2.6.20-cr/fs/inode.c
@@ -877,6 +877,35 @@
 
 EXPORT_SYMBOL(ilookup);
 
+
+/**
+ * ilookup_unhased - search for an inode in the superblock
+ * @sb:	super block of file system to search
+ * @ino:	inode number to search for
+ *
+ * The ilookup_unhashed browse the superblock inode list to find the inode.
+ *
+ * If the inode is found in the inode list stored in the superblock, the inode is
+ * with an incremented reference count.
+ *
+ * Otherwise NULL is returned.
+ */
+struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino)
+{
+	struct inode *inode = NULL;
+
+	spin_lock(&inode_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+		if (inode->i_ino == ino) {
+			__iget(inode);
+			break;
+		}
+	spin_unlock(&inode_lock);
+	return inode;
+
+}
+EXPORT_SYMBOL(ilookup_unhashed);
+
 /**
  * iget5_locked - obtain an inode from a mounted file system
  * @sb:	super block of file system
Index: 2.6.20-cr/include/linux/fs.h
===
--- 2.6.20-cr.orig/include/linux/fs.h
+++ 2.6.20-cr/include/linux/fs.h
@@ -1657,6 +1657,7 @@
 extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data);
 extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
+extern struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino);
 
 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
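One detail in ilookup_unhashed() above may be worth a second look. The kernel-doc says NULL is returned when the inode is not found, but list_for_each_entry() overwrites its cursor on entry, and after a full traversal the cursor points at the container of the list head rather than at NULL. The following is a userspace sketch of that behaviour, not kernel code: the list macros are minimal stand-ins written for this example (using the GNU `__typeof__` extension), and struct fake_inode, lookup_cursor() and lookup_found() are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* Minimal userspace stand-ins for the kernel list helpers (illustrative only). */
struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

#define list_for_each_entry(pos, head, member)                            \
	for (pos = container_of((head)->next, __typeof__(*pos), member);  \
	     &pos->member != (head);                                      \
	     pos = container_of(pos->member.next, __typeof__(*pos), member))

static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* Hypothetical stand-in for struct inode: only what the lookup touches. */
struct fake_inode {
	unsigned long i_ino;
	int i_count;
	struct list_head i_sb_list;
};

/* Same shape as the posted loop: the loop cursor itself is returned. */
static struct fake_inode *lookup_cursor(struct list_head *s_inodes,
					unsigned long ino)
{
	struct fake_inode *inode = NULL;

	list_for_each_entry(inode, s_inodes, i_sb_list)
		if (inode->i_ino == ino) {
			inode->i_count++;
			break;
		}
	return inode;	/* on a miss: container of the list head, not NULL */
}

/* Variant with a separate result pointer that stays NULL on a miss. */
static struct fake_inode *lookup_found(struct list_head *s_inodes,
				       unsigned long ino)
{
	struct fake_inode *inode, *found = NULL;

	list_for_each_entry(inode, s_inodes, i_sb_list)
		if (inode->i_ino == ino) {
			inode->i_count++;
			found = inode;
			break;
		}
	return found;
}
```

If the same holds for the real macro in 2.6.20, as its definition in include/linux/list.h suggests, a caller that tests the return value of ilookup_unhashed() against NULL would not detect a miss. Keeping a separate `found` pointer, as in lookup_found(), is one possible way to make the code match its documentation.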
[Devel] [patch 0/5][RFC - ipv4/udp checkpoint/restart] dumping/restoring the IPV4/UDP sockets
Hi,

I would like to resurrect the discussion we had concerning socket checkpoint/restart. I began to look at how to checkpoint sockets, and came to the following conclusions:

Sockets can be checkpointed one by one from userspace. That provides a mechanism for an application that wants to checkpoint itself (I am thinking, for example, of the tcpcp project) and widens the scope of usability. Inside a container, only the sockets related to the container are visible, and therefore checkpointable, so socket checkpoint/restart will bring mobility to the container.

The socket relies on the socket fs, so we can use the inode number to identify the socket. Using that, we can checkpoint/restart a socket associated with a fd, because the inode number is easily retrieved with an fstat call, and we can also identify orphan sockets, whose inode numbers are visible in the /proc/net/tcp file.

The checkpoint/restart data can be transferred between kernel and userspace via generic netlink. It is a clean and secure way to define a message format with upward compatibility, i.e. moving the socket to an OS with a newer kernel version. The generic netlink message can either be used as raw data to be dumped directly to disk, or be modified from userspace for some specific purpose (I don't have examples).

The sockets are checkpointed/restored by sticking as closely as possible to the user/kernel frontier, in order to catch errors and bad values as early as possible. I am thinking, for example, of the kernel_setsockopt, kernel_connect, etc. family of functions; it is more reliable and secure.

The following patchset is an RFC for C/R of UDP sockets. It applies to 2.6.20.

One question is pending: should we dump/restore the send and receive queues, knowing that the protocol is not reliable?

Regards.

  -- Daniel
[Devel] Per container statistics (containerstats)
Hi, Andrew/Paul,

Here's the latest version of containerstats ported to v10. Could you please consider it for inclusion?

Changelog

1. Instead of parsing long container paths, use the dentry to match the container for which stats are required. The user space application opens the container directory and passes the file descriptor, which is used to determine the container for which stats are required. This approach was suggested by Paul Menage.

This patch is inspired by the discussion at http://lkml.org/lkml/2007/4/11/187 and implements per container statistics as suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263. The patch is on top of 2.6.21-mm1 with Paul's containers v9 patches (forward ported).

This patch implements per container statistics infrastructure and re-uses code from the taskstats interface. A new set of container operations is registered with commands and attributes. It should be very easy to *extend* per container statistics by adding members to the containerstats structure.

The current model for containerstats is a pull model; a push model (to post statistics on interesting events) should be very easy to add. Currently user space requests statistics by passing the container file descriptor. Statistics about the state of all the tasks in the container are returned to user space.

TODO's/NOTE:

This patch provides an infrastructure for implementing container statistics. Based on the needs of each controller, we can incrementally add more statistics, event based support for notification of statistics, and accumulation of taskstats into container statistics in the future.

Sample output

# ./containerstats -C /container/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0

# ./containerstats -C /container/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0

If the approach looks good, I'll enhance and post the user space utility for the same.

Feedback, comments, test results are always welcome!
Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 Documentation/accounting/containerstats.txt |   27 ++
 include/linux/Kbuild                        |    1
 include/linux/container.h                   |    8 +++
 include/linux/containerstats.h              |   70
 include/linux/delayacct.h                   |   11
 kernel/container.c                          |   63 +
 kernel/sched.c                              |    4 +
 kernel/taskstats.c                          |   66 ++
 8 files changed, 250 insertions(+)

diff -puN /dev/null Documentation/accounting/containerstats.txt
--- /dev/null	2007-06-01 20:42:04.0 +0530
+++ linux-2.6.22-rc2-mm1-balbir/Documentation/accounting/containerstats.txt	2007-06-06 17:16:54.0 +0530
@@ -0,0 +1,27 @@
+Containerstats is inspired by the discussion at
+http://lkml.org/lkml/2007/4/11/187 and implements per container statistics as
+suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
+
+Per container statistics infrastructure re-uses code from the taskstats
+interface. A new set of container operations are registered with commands
+and attributes specific to containers. It should be very easy to
+extend per container statistics, by adding members to the containerstats
+structure.
+
+The current model for containerstats is a pull, a push model (to post
+statistics on interesting events), should be very easy to add. Currently
+user space requests for statistics by passing the container path.
+Statistics about the state of all the tasks in the container is returned to
+user space.
+
+NOTE: We currently rely on delay accounting for extracting information
+about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
+information will not be available.
+
+To extract container statistics a utility very similar to getdelays.c
+has been developed, the sample output of the utility is shown below
+
+~/balbir/containerstats # ./containerstats -C "/container/a"
+sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
+~/balbir/containerstats # ./containerstats -C "/container"
+sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
diff -puN include/linux/container.h~containers-taskstats include/linux/container.h
--- linux-2.6.22-rc2-mm1/include/linux/container.h~containers-taskstats	2007-06-05 17:21:57.0 +0530
+++ linux-2.6.22-rc2-mm1-balbir/include/linux/container.h	2007-06-06 16:59:30.0 +0530
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include
 
 #ifdef CONFIG_CONTAINERS
 
@@ -21,6 +22,8 @@ extern void container_init_smp(void);
 extern void container_fork(struct task_struct *p);
 extern void container_fork_callbacks(struct task_struct *p);
 extern void container_exit(struct task_struct *p, int run_callbacks);
+extern int containerstats_build(struct containerstats *stats,
+					struct dentry *dentry);
 
 extern struct
Re: [Devel] breakfast at ols?
My apologies, but I won't be able to visit OLS this year :/
I think Kirill Kolyshkin and Denis Lunev will be glad to meet you again!
Hope Pavel Emelianov will be able to join as well.

Thanks,
Kirill

Serge E. Hallyn wrote:
> Last year we all met for breakfast at OLS. Now we've pretty much
> all already met so maybe it's less exciting, but do people (who will be
> at OLS) care to meet for breakfast on the Thursday or Friday?
>
> -serge

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: Pid namespaces approaches testing results
Cedric Le Goater wrote:
>> The flat model has many optimization ways in comparison with the multilevel
>> one. Like we can cache the pid value on structs and some other.
>>
>> Moreover having generic level nesting sounds reasonable. Having single level
>> nesting - too as all the namespace we have are single nested. But having the
>> 4 level nesting sounds strange... Why 4? Why not 5? What if I don't know how
>> many I will need exactly, but do know that it will be definitely more than 1?
>>
>> Moreover - I have shown that we can have 1% or less performance on generic
>> nesting model, why not keep it?
>
> did you send that patchset ? is it included in the one you sent ?

The patchset I sent earlier has changed slightly since then. The tests were
performed on the version I sent. Right now I'm waiting for your results to
make a final decision on whether to develop the flat model together with the
hierarchical one.

So what are we going to do? The options we have:

1. Make two models - hierarchical and flat. Maybe we'll see how to merge
   them later;
2. Optimize the hierarchical model to produce no performance hit on the
   first 2 levels (init and VS). I don't see a way to do this gracefully,
   but maybe it can be solved ... somehow. Anyway, if the latest patches
   from Suka do not produce any noticeable overhead, I am OK to go on with
   them;
3. Make the CONFIG_MAX_NS_DEPTH model. This is likely to be fast in the flat
   case, but I doubt that Andrew will like it :)

> sorry if i missed something :(
>
> C.

Thanks,
Pavel
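For readers trying to picture option 3, the following is a rough sketch of what a compile-time depth limit could look like: each task carries a fixed-size array of pid numbers, one per namespace level, so the flat case (init plus one VS) is a plain array index. None of this is taken from the patches under discussion. `MAX_NS_DEPTH_SKETCH`, `struct pid_sketch` and both helpers are hypothetical names, and the depth of 2 is an arbitrary value chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the proposed CONFIG_MAX_NS_DEPTH; the value 2 is arbitrary. */
#define MAX_NS_DEPTH_SKETCH 2

/* Hypothetical per-task pid record: one number per namespace level. */
struct pid_sketch {
	int level;			/* 0 = init namespace, 1 = VS, ... */
	int nr[MAX_NS_DEPTH_SKETCH];	/* nr[i] is the pid seen from level i */
};

/* Record the pid numbers for a task living at the given level.
 * Returns 0 on success, -1 if the level exceeds the compiled-in depth. */
static int pid_sketch_init(struct pid_sketch *p, int level, const int *nrs)
{
	int i;

	assert(p != NULL && nrs != NULL);
	if (level < 0 || level >= MAX_NS_DEPTH_SKETCH)
		return -1;
	p->level = level;
	for (i = 0; i < MAX_NS_DEPTH_SKETCH; i++)
		p->nr[i] = (i <= level) ? nrs[i] : 0;
	return 0;
}

/* Pid as seen from a namespace at ns_level; 0 means "not visible there". */
static int pid_sketch_nr(const struct pid_sketch *p, int ns_level)
{
	if (ns_level < 0 || ns_level > p->level)
		return 0;
	return p->nr[ns_level];
}
```

The trade-off raised in the thread shows up directly in the sketch: the lookup is cheap because the array size is fixed at build time, but a namespace nested deeper than the configured depth simply cannot be created, which is the "Why 4? Why not 5?" objection.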