[Devel] Re: [ckrm-tech] [PATCH 00/10] Containers(V10): Generic Process Containers

2007-06-06 Thread Paul Jackson
> I suppose as a cleaner alternative we could 
> add a container_subsys->inherit_defaults() handler, to be called at
> container_clone(), and for cpusets this would set cpus and mems to
> the parent values - sibling exclusive values.  If that comes to nothing,
> then the attach_task() is still refused, and the unshare() or clone()
> fails, but this time with good reason.

Unfortunately, I haven't spent the time I should thinking about
container cloning, namespaces and such.

I don't know, for the workloads that matter to me, when, how or
if this container cloning will be used.

I'm tempted to suggest the following.

First, I am assuming that the classic method of creating cpuset
children will still work, such as the following (which can fail
for certain combinations of exclusive cpus or mems):
cd /dev/cpuset/foobar
mkdir foochild
cp cpus foochild
cp mems foochild
echo $$ > foochild/tasks

Second, given that, how about you fail the unshare() or clone()
anytime that the instance to be cloned has any sibling cpusets
with any exclusive flags set.

The exclusive property is not really on friendly terms with cloning.

Now if the above classic code must be encoded using cloning under
the covers, then we've got problems, probably more problems than
just this.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers

2007-06-06 Thread Serge E. Hallyn
Quoting Paul Jackson ([EMAIL PROTECTED]):
> > > I wasn't paying close enough attention to understand why you couldn't
> > > do it in two steps - make the container, and then populate it with
> > > resources.
> > 
> > Sorry, please clarify - are you saying that now you do understand, or
> > that I should explain?
> 
> Could you explain -- I still don't understand why you need this option.
> I still don't understand why you can't do it in two steps - make the
> container, then add cpu/mem separately.

Sure - the key is that the ns subsystem uses container_clone() to
automatically create a new container (on sys_unshare() or clone(2)
with certain flags) and move the current task into it.  Let's say
we have done

mount -t container -o ns,cpuset nsproxy /containers

and we, as task 875, happen to be in the topmost container:

/containers/

Now we fork task 999 which does an unshare(CLONE_NEWNS), or we just
clone(CLONE_NEWNS).  This will create

/containers/node_999

and move task 999 into that container.  Except that when it tries
attach_task() it is refused by cpuset.  So the container_clone() fails,
and in turn the sys_unshare() or clone() fails.  A login making use
of the pam_namespace.so library would fail this way with the
ns and cpuset subsystems composed.

We could special case this by having
kernel/container.c:container_clone() check whether one of the subsystems
is cpusets and, if so, setting the defaults for mems and cpus, but
that is kind of ugly.  I suppose as a cleaner alternative we could 
add a container_subsys->inherit_defaults() handler, to be called at
container_clone(), and for cpusets this would set cpus and mems to
the parent values - sibling exclusive values.  If that comes to nothing,
then the attach_task() is still refused, and the unshare() or clone()
fails, but this time with good reason.
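
As a rough illustration, the "parent values - sibling exclusive values"
default could be computed as below. This is only a sketch: it assumes the
CPUs fit in a 64-bit mask, and `toy_cpuset` and `inherit_default_cpus()`
are hypothetical names, not part of the containers patch set (the real
code would work on cpumask_t and nodemask_t).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a cpuset: a CPU bitmask plus the cpu_exclusive flag. */
struct toy_cpuset {
    uint64_t cpus;
    int cpu_exclusive;
};

/*
 * Default CPUs for a newly cloned child: everything the parent has,
 * minus whatever exclusive siblings already claim.  A result of 0 means
 * the child is born empty, so attaching a task would still be refused.
 */
static uint64_t inherit_default_cpus(const struct toy_cpuset *parent,
                                     const struct toy_cpuset *siblings,
                                     size_t nr_siblings)
{
    uint64_t taken = 0;
    size_t i;

    assert(parent != NULL);
    assert(siblings != NULL || nr_siblings == 0);

    for (i = 0; i < nr_siblings; i++)
        if (siblings[i].cpu_exclusive)
            taken |= siblings[i].cpus;

    return parent->cpus & ~taken;
}
```

With a parent owning CPUs 0-3 and one exclusive sibling on CPUs 0-1, the
child would default to CPUs 2-3; if exclusive siblings cover all four, the
result is empty and the clone fails as described above.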

thanks,
-serge



[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers

2007-06-06 Thread Paul Jackson
> > I wasn't paying close enough attention to understand why you couldn't
> > do it in two steps - make the container, and then populate it with
> > resources.
> 
> Sorry, please clarify - are you saying that now you do understand, or
> that I should explain?

Could you explain -- I still don't understand why you need this option.
I still don't understand why you can't do it in two steps - make the
container, then add cpu/mem separately.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.925.600.0401



[Devel] Re: [PATCH 00/10] Containers(V10): Generic Process Containers

2007-06-06 Thread Serge E. Hallyn
Quoting Paul Jackson ([EMAIL PROTECTED]):
> > Would it then make sense to just
> > default to (parent_set - sibling_exclusive_set) for a new sibling's
> > value?
> 
> Which could well be empty, which in turn puts one back in the position
> of dealing with a newborn cpuset that is empty (of cpus or of memory),
> or else it introduces a new and odd constraint on when cpusets can be
> created (only when there are non-exclusive cpus and mems available.)
> 
> > An option is fine with me, but without such an option at all, cpusets
> > could not be applied to namespaces...
> 
> I wasn't paying close enough attention to understand why you couldn't
> do it in two steps - make the container, and then populate it with
> resources.

Sorry, please clarify - are you saying that now you do understand, or
that I should explain?

> But if indeed that's not possible, then I guess we need some sort of
> option specifying whether to create kids empty, or inheriting.

Paul (uh, Menage :) should I do a patch for this or have you got it
already?

thanks,
-serge



Re: [Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread David Miller
From: Daniel Lezcano <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 22:38:11 +0200

> Perhaps a name like "epipe" or "npipe", which reflects what the
> device does, would be more appropriate?

'npipe' (Network PIPE) or 'epipe' (Ethernet PIPE) are fine with me.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers



Re: [Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Daniel Lezcano

David Miller wrote:
> From: Pavel Emelianov <[EMAIL PROTECTED]>
> Date: Wed, 06 Jun 2007 19:11:38 +0400
>
> > Veth stands for Virtual ETHernet. It is a simple tunnel driver
> > that works at the link layer and looks like a pair of ethernet
> > devices interconnected with each other.
>
> I would suggest choosing a different name.
>
> 'veth' is also the name of the virtualized ethernet device
> found on IBM machines, driven by drivers/net/ibmveth.[ch]

Eric Biederman proposed the name "etun" which stands for "Ethernet TUNnel".

The goal of the pair device is to pass network packets between namespaces.
That reminds me of the well-known "pipe", which passes data between processes.

All network devices have a name describing what they do: "eth" for ethernet,
"dummy" for a device doing nothing, loopback, ...

Perhaps a name like "epipe" or "npipe", which reflects what the
device does, would be more appropriate?


-- Daniel




[Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread David Miller
From: Pavel Emelianov <[EMAIL PROTECTED]>
Date: Wed, 06 Jun 2007 19:11:38 +0400

> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.

I would suggest choosing a different name.

'veth' is also the name of the virtualized ethernet device
found on IBM machines, driven by drivers/net/ibmveth.[ch]



[Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Stephen Hemminger
On Wed, 06 Jun 2007 19:11:38 +0400
Pavel Emelianov <[EMAIL PROTECTED]> wrote:

> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.
> 
> Mainly it allows communication between network namespaces, but
> it can also be used as is.
> 
> Eric recently sent a similar driver called etun. This
> implementation uses another interface - the RTM_NEWLINK
> message introduced by Patrick. The patch applies to today's
> netdev tree with Patrick's patches.
> 
> The newlink callback is organized this way to make it easy
> to create the peer device in a separate namespace once
> namespaces are available in the kernel.
> 
> A patch for the ip utility is also provided.
> 
> Eric, since the ethtool interface came from your patch, I added
> your Signed-off-by line.
> 
> Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
> Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>
> 
> ---
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index 7d57f4a..7e144be 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -119,6 +119,12 @@ config TUN
>  
> If you don't know what to use this for, you don't need it.
>  
> +config VETH
> + tristate "Virtual ethernet device"
> + ---help---
> +   The device is an ethernet tunnel. Devices are created in pairs. When
> +   one end receives the packet it appears on its pair and vice versa.
> +
>  config NET_SB1000
>   tristate "General Instruments Surfboard 1000"
>   depends on PNP
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index a77affa..4764119 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
>  obj-$(CONFIG_MACMACE) += macmace.o
>  obj-$(CONFIG_MAC89x0) += mac89x0.o
>  obj-$(CONFIG_TUN) += tun.o
> +obj-$(CONFIG_VETH) += veth.o
>  obj-$(CONFIG_NET_NETX) += netx-eth.o
>  obj-$(CONFIG_DL2K) += dl2k.o
>  obj-$(CONFIG_R8169) += r8169.o
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> new file mode 100644
> index 000..6746c91
> --- /dev/null
> +++ b/drivers/net/veth.c
> @@ -0,0 +1,391 @@
> +/*
> + *  drivers/net/veth.c
> + *
> + *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
> + *
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +#define DRV_NAME "veth"
> +#define DRV_VERSION  "1.0"
> +
> +struct veth_priv {
> + struct net_device *peer;
> + struct net_device *dev;
> + struct list_head list;
> + struct net_device_stats stats;
> + unsigned ip_summed;
> +};
> +
> +static LIST_HEAD(veth_list);
> +
> +/*
> + * ethtool interface
> + */
> +
> +static struct {
> + const char string[ETH_GSTRING_LEN];
> +} ethtool_stats_keys[] = {
> + { "peer_ifindex" },
> +};

Seems like a good usage of sysfs attributes, rather than ethtool.

Then you can get rid of all the useless ethtool for what is
basically a virtual device.

> +/*
> + * xmit
> + */
> +
> +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct net_device *rcv = NULL;
> + struct veth_priv *priv, *rcv_priv;
> + int length;
> +
> + skb_orphan(skb);
> +
> + priv = netdev_priv(dev);
> + rcv = priv->peer;
> + rcv_priv = netdev_priv(rcv);
> +
> + if (!(rcv->flags & IFF_UP))
> + goto outf;
> +
> + skb->dev = rcv;
> + skb->pkt_type = PACKET_HOST;
> + skb->protocol = eth_type_trans(skb, rcv);
> + if (dev->features & NETIF_F_NO_CSUM)
> + skb->ip_summed = rcv_priv->ip_summed;
> +
> + dst_release(skb->dst);
> + skb->dst = NULL;
> +
> + secpath_reset(skb);
> + nf_reset(skb);
> +
> + length = skb->len;
> +
> + priv->stats.tx_bytes += length;
> + priv->stats.tx_packets++;
> +
> + rcv_priv->stats.rx_bytes += length;
> + rcv_priv->stats.rx_packets++;

Per-cpu stats? This will cacheline thrash.
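
A userspace sketch of what per-cpu stats would buy here, assuming a fixed
CPU count and a 64-byte cache line. `NR_TOY_CPUS`, `toy_stats`,
`toy_count_tx()` and `toy_sum_stats()` are illustrative names only, not
kernel APIs, and the alignment attribute is GCC/Clang syntax.

```c
#include <assert.h>
#include <stdint.h>

#define NR_TOY_CPUS   4
#define TOY_CACHELINE 64

/* One counter block per CPU, aligned so no two CPUs share a cache line. */
struct toy_stats {
    uint64_t tx_packets;
    uint64_t tx_bytes;
} __attribute__((aligned(TOY_CACHELINE)));

static struct toy_stats toy_pcpu_stats[NR_TOY_CPUS];

/* Fast path: each CPU only ever writes its own slot. */
static void toy_count_tx(unsigned int cpu, uint64_t len)
{
    assert(cpu < NR_TOY_CPUS);
    toy_pcpu_stats[cpu].tx_packets++;
    toy_pcpu_stats[cpu].tx_bytes += len;
}

/* Slow path: fold all slots together only when the stats are read. */
static struct toy_stats toy_sum_stats(void)
{
    struct toy_stats sum = { 0, 0 };
    unsigned int cpu;

    for (cpu = 0; cpu < NR_TOY_CPUS; cpu++) {
        sum.tx_packets += toy_pcpu_stats[cpu].tx_packets;
        sum.tx_bytes += toy_pcpu_stats[cpu].tx_bytes;
    }
    return sum;
}
```

The trade-off is that writers never contend on a shared line, while a
reader pays for a walk over all slots; that is usually acceptable because
stats are read far less often than packets are sent.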

> + netif_rx(skb);
> + return 0;
> +
> +outf:
> + kfree_skb(skb);
> + priv->stats.tx_dropped++;
> + return 0;
> +}



-- 
Stephen Hemminger <[EMAIL PROTECTED]>


[Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Patrick McHardy
Pavel Emelianov wrote:

> +MODULE_DESCRIPTION("Virtual Ethernet Tunnel");
> +MODULE_LICENSE("GPL v2");

This seems to be missing MODULE_ALIAS_RTNL_LINK("veth");


[Devel] Re: [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Patrick McHardy
Pavel Emelianov wrote:
> Veth stands for Virtual ETHernet. It is a simple tunnel driver
> that works at the link layer and looks like a pair of ethernet
> devices interconnected with each other.
> 
> Mainly it allows communication between network namespaces, but
> it can also be used as is.
> 
> Eric recently sent a similar driver called etun. This
> implementation uses another interface - the RTM_NEWLINK
> message introduced by Patrick. The patch applies to today's
> netdev tree with Patrick's patches.
> 
> The newlink callback is organized this way to make it easy
> to create the peer device in a separate namespace once
> namespaces are available in the kernel.
> 

> +struct veth_priv {
> + struct net_device *peer;
> + struct net_device *dev;
> + struct list_head list;
> + struct net_device_stats stats;


You can use dev->stats instead.

> +static int veth_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct net_device *rcv = NULL;
> + struct veth_priv *priv, *rcv_priv;
> + int length;
> +
> + skb_orphan(skb);
> +
> + priv = netdev_priv(dev);
> + rcv = priv->peer;
> + rcv_priv = netdev_priv(rcv);
> +
> + if (!(rcv->flags & IFF_UP))
> + goto outf;
> +
> + skb->dev = rcv;

eth_type_trans already sets skb->dev.

> + skb->pkt_type = PACKET_HOST;
> + skb->protocol = eth_type_trans(skb, rcv);
> + if (dev->features & NETIF_F_NO_CSUM)
> + skb->ip_summed = rcv_priv->ip_summed;
> +
> + dst_release(skb->dst);
> + skb->dst = NULL;
> +
> + secpath_reset(skb);
> + nf_reset(skb);


Is skb->mark supposed to survive communication between different
namespaces?

> +static const struct nla_policy veth_policy[VETH_INFO_MAX] = {
> + [VETH_INFO_MAC] = { .type = NLA_BINARY, .len = ETH_ALEN },
> + [VETH_INFO_PEER]= { .type = NLA_STRING },
> + [VETH_INFO_PEER_MAC]= { .type = NLA_BINARY, .len = ETH_ALEN },
> +};


The rtnl_link code looks fine. I don't like the VETH_INFO_MAC attribute
very much though, we already have a generic device attribute for MAC
addresses. Of course that only allows you to supply one MAC address, so
I'm wondering what you think of allocating only a single device per
newlink operation and binding them in a separate enslave operation?

> +enum {
> + VETH_INFO_UNSPEC,
> + VETH_INFO_MAC,
> + VETH_INFO_PEER,
> + VETH_INFO_PEER_MAC,
> +
> + VETH_INFO_MAX
> +};

Please follow the

#define VETH_INFO_MAX   (__VETH_INFO_MAX - 1)

convention here.
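
Spelled out, the convention could look like the sketch below. The enum
values are the ones from the patch; the `veth_info_names` table and the
`veth_info_name()` helper are hypothetical and only there to show how
VETH_INFO_MAX + 1 is then used to size per-attribute tables.

```c
#include <assert.h>
#include <stddef.h>

enum {
    VETH_INFO_UNSPEC,
    VETH_INFO_MAC,
    VETH_INFO_PEER,
    VETH_INFO_PEER_MAC,

    __VETH_INFO_MAX         /* one past the last real attribute */
};

/* VETH_INFO_MAX names the highest valid attribute, not a count. */
#define VETH_INFO_MAX   (__VETH_INFO_MAX - 1)

/* Tables indexed by attribute type get VETH_INFO_MAX + 1 slots. */
static const char *const veth_info_names[VETH_INFO_MAX + 1] = {
    [VETH_INFO_MAC]      = "mac",
    [VETH_INFO_PEER]     = "peer",
    [VETH_INFO_PEER_MAC] = "peer_mac",
};

/* Returns the attribute name, or NULL for UNSPEC and out-of-range types. */
static const char *veth_info_name(int type)
{
    if (type <= VETH_INFO_UNSPEC || type > VETH_INFO_MAX)
        return NULL;
    assert(veth_info_names[type] != NULL); /* every real attribute is named */
    return veth_info_names[type];
}
```

New attributes are then added just above __VETH_INFO_MAX, and
VETH_INFO_MAX follows along without being edited.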



[Devel] Re: checkpointing and restoring processes

2007-06-06 Thread Dave Hansen
On Wed, 2007-06-06 at 13:37 +0200, Mark Pflueger wrote:
> hi everyone!
> 
> i'm not subscribed to the list, so if you care to flame because of my noob 
> question, just do it to the list, otherwise please cc me.
> 
> i'm trying to write a checkpoint/restore module for processes and so have 
> a basic version going already - problem is, when i restore the process, 
> one of three things happens at random. first is, the process restored 
> segfaults. second is, i get a kernel null pointer dereference and third 
> is, i get a virtual address lookup error and a kernel crash. the trace 
> back and the address always change.

Your patch definitely takes a simple, straightforward approach, which is
good.  But, there are a couple of things that need to get added.

For instance, when you make a copy of tsk->mm, what happens if that
original task exits?  It will drop its reference count and free that
task, along with the mm.  The new task will fault on its access to
newtsk->mm because the mm has gone away.
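
The usual cure is for the copy to take its own reference. A toy userspace
model of that idea follows; `toy_mm`, `toy_mm_get()` and `toy_mm_put()`
are made-up names, and the counter here is a plain int, whereas the
kernel uses atomic counters on the mm (mm_users, dropped via mmput()).

```c
#include <assert.h>
#include <stdlib.h>

/* Number of toy_mm objects currently allocated, for illustration only. */
static int toy_live_mms;

struct toy_mm {
    int users;              /* how many tasks point at this mm */
};

/* Allocate an mm owned by one user; returns NULL on allocation failure. */
static struct toy_mm *toy_mm_alloc(void)
{
    struct toy_mm *mm = calloc(1, sizeof(*mm));

    if (mm != NULL) {
        mm->users = 1;
        toy_live_mms++;
    }
    return mm;
}

/* Take an extra reference before storing the pointer in another task. */
static struct toy_mm *toy_mm_get(struct toy_mm *mm)
{
    assert(mm != NULL && mm->users > 0);
    mm->users++;
    return mm;
}

/* Drop a reference; the last user to leave frees the mm. */
static void toy_mm_put(struct toy_mm *mm)
{
    assert(mm != NULL && mm->users > 0);
    if (--mm->users == 0) {
        free(mm);
        toy_live_mms--;
    }
}
```

If the restored task holds a reference taken this way, the original task
exiting only drops the count to one and the mm stays valid.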

Also, just setting tsk->pid is not enough to get the pid to show up in
the system.  It needs to make sure no other task has that pid as well as
making entries in data structures like the pid allocation map.  
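
To make the pid point concrete, here is a toy bitmap allocator. It is a
sketch only: `TOY_PID_MAX`, `toy_reserve_pid()` and `toy_release_pid()`
are invented for illustration, and the kernel's real pid map involves
more state and locking than this.

```c
#include <assert.h>
#include <limits.h>

#define TOY_PID_MAX        1024
#define TOY_BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)

/* One bit per pid; a set bit means the pid is in use. */
static unsigned long toy_pidmap[(TOY_PID_MAX + TOY_BITS_PER_LONG - 1) /
                                TOY_BITS_PER_LONG];

/*
 * Claim a specific pid for a restored task.
 * Returns 0 on success, -1 if the pid is out of range or already taken.
 */
static int toy_reserve_pid(int pid)
{
    unsigned long bit;

    if (pid <= 0 || pid >= TOY_PID_MAX)
        return -1;

    bit = 1UL << (pid % TOY_BITS_PER_LONG);
    if (toy_pidmap[pid / TOY_BITS_PER_LONG] & bit)
        return -1;              /* some other task owns this pid */

    toy_pidmap[pid / TOY_BITS_PER_LONG] |= bit;
    return 0;
}

/* Give the pid back when the task goes away. */
static void toy_release_pid(int pid)
{
    assert(pid > 0 && pid < TOY_PID_MAX);
    toy_pidmap[pid / TOY_BITS_PER_LONG] &= ~(1UL << (pid % TOY_BITS_PER_LONG));
}
```

A restore path would have to fail, or pick another strategy, when the
reserve step reports that the checkpointed pid is already taken.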

In any case, it's nice to have other people interested in the same
things!  As Cedric suggested, please pop over to
[EMAIL PROTECTED]  There are at least two other
efforts, besides ours working toward the same goal, so you'll have lots
of comrades there. ;)

-- Dave



[Devel] Re: [patch 1/5][RFC - ipv4/udp checkpoint/restart] : add lookup for unhashed inode

2007-06-06 Thread Daniel Lezcano

Serge E. Hallyn wrote:

Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
  

Sockets rely on sockfs. In some cases the sockets are orphans and cannot
be accessed via a file descriptor; this is the case, for example, for
timewait sockets. Fortunately, an inode can still be used to identify a
socket. It can be retrieved from /proc/net/tcp for orphan sockets, or
from fstat().

When a socket is created, the socket inode is added to the sockfs.
Unfortunately, it is not stored in the hashed inode list, so
I need a helper to browse the inode list contained in the superblock
of the sockfs.
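
A simplified userspace model of such a helper, assuming a plain
NULL-terminated list instead of the kernel's list_head and leaving out
locking entirely. `toy_inode`, `toy_super_block` and
`toy_ilookup_unhashed()` are illustrative names, not the code from the
patch.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
    unsigned long ino;          /* inode number */
    int count;                  /* reference count */
    struct toy_inode *next;     /* next inode of the same superblock */
};

struct toy_super_block {
    struct toy_inode *inodes;   /* every inode of this filesystem */
};

/*
 * Linear search of the superblock's inode list.  Returns the inode with
 * its reference count raised, or NULL if no inode has that number.
 */
static struct toy_inode *toy_ilookup_unhashed(struct toy_super_block *sb,
                                              unsigned long ino)
{
    struct toy_inode *cur, *found = NULL;

    assert(sb != NULL);

    for (cur = sb->inodes; cur != NULL; cur = cur->next) {
        if (cur->ino == ino) {
            cur->count++;
            found = cur;
            break;
        }
    }
    return found;
}
```

The result is tracked in a separate `found` pointer on purpose: in a
circular list such as the kernel's list_head, the loop cursor does not
end up NULL after an unsuccessful walk, so it cannot double as the
return value. The cost is linear in the number of inodes on the
superblock.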


This is one solution; another solution is to store the inode in
the hashed list when the socket is created.



I assume that would be unacceptable overhead on a very busy server.
Walking all the inodes NUM_INODES(task_set) for a checkpoint could
be a real bottleneck, but at least it's only at checkpoint time.

Have you checked net-dev archives for discussions about not hashing
these inodes?  I suppose at some point you'll want to ask there what the
preference is.
  
I haven't looked at the netdev archive, but, sure, I will dig and ask on
netdev@

Thanks.


But certainly for now this seems the right approach.

  

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>


Acked-by: Serge E. Hallyn <[EMAIL PROTECTED]>

(Or whatever tag they decide over on lkml that I should be using  :)

thanks,
-serge

PS - I won't be acking other patches bc I just haven't looked at
netlink enough - so don't read anything more into that :)

  

---
 fs/inode.c |   29 +
 include/linux/fs.h |1 +
 2 files changed, 30 insertions(+)

Index: 2.6.20-cr/fs/inode.c
===
--- 2.6.20-cr.orig/fs/inode.c
+++ 2.6.20-cr/fs/inode.c
@@ -877,6 +877,35 @@

 EXPORT_SYMBOL(ilookup);

+
+/**
+ * ilookup_unhashed - search for an inode in the superblock
+ * @sb:super block of file system to search
+ * @ino:   inode number to search for
+ *
+ * ilookup_unhashed() walks the superblock inode list to find the inode.
+ *
+ * If the inode is found in the inode list stored in the superblock, it is
+ * returned with an incremented reference count.
+ *
+ * Otherwise NULL is returned.
+ */
+struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino)
+{
+   struct inode *inode = NULL;
+
+   spin_lock(&inode_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+   if (inode->i_ino == ino) {
+   __iget(inode);
+   break;
+   }
+   spin_unlock(&inode_lock);
+   return inode;
+
+}
+EXPORT_SYMBOL(ilookup_unhashed);
+
 /**
  * iget5_locked - obtain an inode from a mounted file system
  * @sb:super block of file system
Index: 2.6.20-cr/include/linux/fs.h
===
--- 2.6.20-cr.orig/include/linux/fs.h
+++ 2.6.20-cr/include/linux/fs.h
@@ -1657,6 +1657,7 @@
 extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data);
 extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
+extern struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino);

 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);

--


[Devel] Re: [PATCH] Module for ip utility to support veth device

2007-06-06 Thread Patrick McHardy
Pavel Emelianov wrote:
> diff --git a/ip/iplink.c b/ip/iplink.c
> index 5170419..6975990 100644
> --- a/ip/iplink.c
> +++ b/ip/iplink.c
> @@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
>strlen(type));
>  
>   lu = get_link_type(type);
> - if (lu) {
> + if (lu && argc) {
>   struct rtattr * data = NLMSG_TAIL(&req.n);
>   addattr_l(&req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);


I've folded this part into my iproute patch, thanks.


[Devel] [PATCH] Module for ip utility to support veth device

2007-06-06 Thread Pavel Emelianov
The usage is
# ip link add [name] type veth [peer ] [mac ] [peer_mac ]

The Makefile is maybe not as beautiful as it could be. It
is to be discussed.

One thing I noticed during testing is the following. When ip is launched
with the link_veth.so module and no module-specific parameters are given,
the kernel refuses to accept the packet when parsing IFLA_LINKINFO. So
the hunk for ip/iplink.c avoids adding an empty extra header when no
extra data is expected.

Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>

---

diff --git a/ip/Makefile b/ip/Makefile
index 9a5bfe3..b46bce3 100644
--- a/ip/Makefile
+++ b/ip/Makefile
@@ -8,8 +8,9 @@ RTMONOBJ=rtmon.o
 ALLOBJ=$(IPOBJ) $(RTMONOBJ)
 SCRIPTS=ifcfg rtpr routel routef
 TARGETS=ip rtmon
+LIBS=link_veth.so
 
-all: $(TARGETS) $(SCRIPTS)
+all: $(TARGETS) $(SCRIPTS) $(LIBS)
 
 ip: $(IPOBJ) $(LIBNETLINK) $(LIBUTIL)
 
@@ -24,3 +25,6 @@ clean:
 
 LDLIBS += -ldl
 LDFLAGS+= -Wl,-export-dynamic
+
+%.so: %.c
+   $(CC) $(CFLAGS) -shared $< -o $@
diff --git a/ip/iplink.c b/ip/iplink.c
index 5170419..6975990 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -287,7 +287,7 @@ static int iplink_modify(int cmd, unsign
 strlen(type));
 
lu = get_link_type(type);
-   if (lu) {
+   if (lu && argc) {
struct rtattr * data = NLMSG_TAIL(&req.n);
addattr_l(&req.n, sizeof(req), IFLA_INFO_DATA, NULL, 0);
 
diff --git a/ip/link_veth.c b/ip/link_veth.c
new file mode 100644
index 000..adfdef6
--- /dev/null
+++ b/ip/link_veth.c
@@ -0,0 +1,77 @@
+#include 
+
+#include "utils.h"
+#include "ip_common.h"
+#include "veth.h"
+
+#define ETH_ALEN   6
+
+static void usage(void)
+{
+   printf("Usage: ip link add ... "
+   "[peer ] [mac ] [peer_mac ]\n");
+}
+
+static int veth_parse_opt(struct link_util *lu, int argc, char **argv,
+   struct nlmsghdr *hdr)
+{
+   __u8 mac[ETH_ALEN];
+
+   for (; argc != 0; argv++, argc--) {
+   if (strcmp(*argv, "peer") == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER,
+   *argv, strlen(*argv));
+
+   continue;
+   }
+
+   if (strcmp(*argv, "mac") == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   if (strcmp(*argv, "peer_mac") == 0) {
+   argv++;
+   argc--;
+   if (argc == 0) {
+   usage();
+   return -1;
+   }
+
+   if (hexstring_a2n(*argv, mac, sizeof(mac)) == NULL)
+   return -1;
+
+   addattr_l(hdr, 1024, VETH_INFO_PEER_MAC,
+   mac, ETH_ALEN);
+   continue;
+   }
+
+   usage();
+   return -1;
+   }
+
+   return 0;
+}
+
+struct link_util veth_link_util = {
+   .id = "veth",
+   .parse_opt = veth_parse_opt,
+};
diff --git a/ip/veth.h b/ip/veth.h
new file mode 100644
index 000..74c8e1e
--- /dev/null
+++ b/ip/veth.h
@@ -0,0 +1,13 @@
+#ifndef __NET_VETH_H__
+#define __NET_VETH_H__
+
+enum {
+   VETH_INFO_UNSPEC,
+   VETH_INFO_MAC,
+   VETH_INFO_PEER,
+   VETH_INFO_PEER_MAC,
+
+   VETH_INFO_MAX
+};
+
+#endif


[Devel] [PATCH] Virtual ethernet tunnel

2007-06-06 Thread Pavel Emelianov
Veth stands for Virtual ETHernet. It is a simple tunnel driver
that works at the link layer and looks like a pair of ethernet
devices interconnected with each other.

Mainly it allows communication between network namespaces, but
it can also be used as is.

Eric recently sent a similar driver called etun. This
implementation uses another interface - the RTM_NEWLINK
message introduced by Patrick. The patch applies to today's
netdev tree with Patrick's patches.

The newlink callback is organized this way to make it easy
to create the peer device in a separate namespace once
namespaces are available in the kernel.

A patch for the ip utility is also provided.

Eric, since the ethtool interface came from your patch, I added
your Signed-off-by line.

Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]>
Signed-off-by: Pavel Emelianov <[EMAIL PROTECTED]>

---

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 7d57f4a..7e144be 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -119,6 +119,12 @@ config TUN
 
  If you don't know what to use this for, you don't need it.
 
+config VETH
+   tristate "Virtual ethernet device"
+   ---help---
+ The device is an ethernet tunnel. Devices are created in pairs. When
+ one end receives the packet it appears on its pair and vice versa.
+
 config NET_SB1000
tristate "General Instruments Surfboard 1000"
depends on PNP
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index a77affa..4764119 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -185,6 +185,7 @@ obj-$(CONFIG_MACSONIC) += macsonic.o
 obj-$(CONFIG_MACMACE) += macmace.o
 obj-$(CONFIG_MAC89x0) += mac89x0.o
 obj-$(CONFIG_TUN) += tun.o
+obj-$(CONFIG_VETH) += veth.o
 obj-$(CONFIG_NET_NETX) += netx-eth.o
 obj-$(CONFIG_DL2K) += dl2k.o
 obj-$(CONFIG_R8169) += r8169.o
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
new file mode 100644
index 000..6746c91
--- /dev/null
+++ b/drivers/net/veth.c
@@ -0,0 +1,391 @@
+/*
+ *  drivers/net/veth.c
+ *
+ *  Copyright (C) 2007 OpenVZ http://openvz.org, SWsoft Inc
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+#define DRV_NAME   "veth"
+#define DRV_VERSION"1.0"
+
+struct veth_priv {
+   struct net_device *peer;
+   struct net_device *dev;
+   struct list_head list;
+   struct net_device_stats stats;
+   unsigned ip_summed;
+};
+
+static LIST_HEAD(veth_list);
+
+/*
+ * ethtool interface
+ */
+
+static struct {
+   const char string[ETH_GSTRING_LEN];
+} ethtool_stats_keys[] = {
+   { "peer_ifindex" },
+};
+
+static int veth_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+{
+   cmd->supported  = 0;
+   cmd->advertising= 0;
+   cmd->speed  = SPEED_1;
+   cmd->duplex = DUPLEX_FULL;
+   cmd->port   = PORT_TP;
+   cmd->phy_address= 0;
+   cmd->transceiver= XCVR_INTERNAL;
+   cmd->autoneg= AUTONEG_DISABLE;
+   cmd->maxtxpkt   = 0;
+   cmd->maxrxpkt   = 0;
+   return 0;
+}
+
+static void veth_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info)
+{
+   strcpy(info->driver, DRV_NAME);
+   strcpy(info->version, DRV_VERSION);
+   strcpy(info->fw_version, "N/A");
+}
+
+static void veth_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+   switch(stringset) {
+   case ETH_SS_STATS:
+   memcpy(buf, &ethtool_stats_keys, sizeof(ethtool_stats_keys));
+   break;
+   }
+}
+
+static int veth_get_stats_count(struct net_device *dev)
+{
+   return ARRAY_SIZE(ethtool_stats_keys);
+}
+
+static void veth_get_ethtool_stats(struct net_device *dev,
+   struct ethtool_stats *stats, u64 *data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   data[0] = priv->peer->ifindex;
+}
+
+static u32 veth_get_rx_csum(struct net_device *dev)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   return priv->ip_summed == CHECKSUM_UNNECESSARY;
+}
+
+static int veth_set_rx_csum(struct net_device *dev, u32 data)
+{
+   struct veth_priv *priv;
+
+   priv = netdev_priv(dev);
+   priv->ip_summed = data ? CHECKSUM_UNNECESSARY : CHECKSUM_NONE;
+   return 0;
+}
+
+static u32 veth_get_tx_csum(struct net_device *dev)
+{
+   return (dev->features & NETIF_F_NO_CSUM) != 0;
+}
+
+static int veth_set_tx_csum(struct net_device *dev, u32 data)
+{
+   if (data)
+   dev->features |= NETIF_F_NO_CSUM;
+   else
+   dev->features &= ~NETIF_F_NO_CSUM;
+   return 0;
+}
+
+static struct ethtool_ops veth_ethtool_ops = {
+   .get_settings   = veth_get_settings,
+   .get_drvinfo= veth_get_drvinfo,
+   .get_link   = ethtool_op_get_link,
+   .get_rx_csum= veth_get_rx_csum,
+   .set_rx_csum= vet

[Devel] Re: checkpointing and restoring processes

2007-06-06 Thread Serge E. Hallyn
Quoting Cedric Le Goater ([EMAIL PROTECTED]):
> Mark Pflueger wrote:
> > hi everyone!
> > 
> > i'm not subscribed to the list, so if you care to flame because of my noob 
> > question, just do it to the list, otherwise please cc me.
> 
> you should subscribe to [EMAIL PROTECTED] and send your ideas on that
> list. There's a BOF on that topic at OLS if you can attend.
> 
> cheers,
> 
> C.

Hi Mark,

Thanks for sending that patch.  Ignoring code details for now, this is a
good time to discuss checkpoint strategies.

It looks like you are writing task memory to userspace explicitly on
demand.  Dave Hansen is taking a different approach, using the swapfile
to back up memory.  Eventually we would enable a swapfile per container.
Hopefully he can send a prototype out soon - I thought it was a really
cool idea, although of course we'll have to see how it pans out in
implementation :)

On the larger scale, there is the question of how we want to orchestrate
the checkpoint.  Do we want to have one syscall enable a checkpoint of a
set of tasks, kicking out all the relevant information to userspace?

Do we want userspace to orchestrate the checkpoint, asking for the tasks
to be pulled off the runqueue, then polling for all the relevant
information (through /proc/pid/fd, etc), then putting the tasks back on
the runqueue?

Same as the above, but using the container interface to make it more
robust (i.e.  pull all tasks off the runqueue using
echo 1 > /containers/vserver1/job_1/suspend) against for instance tasks
being forked while we're in a 'for p in $pids; suspend $pid'?

Use the freezer code to freeze and initiate dump on a set of tasks or
a container?


thanks,
-serge

> > i'm trying to write a checkpoint/restore module for processes and so have 
> > a basic version going already - problem is, when i restore the process, 
> > one of three things happens at random. first is, the process restored 
> > segfaults. second is, i get a kernel null pointer dereference and third 
> > is, i get a virtual address lookup error and a kernel crash. the trace 
> > back and the address always change.
> > 
> > the user space process is as simple as i could make it: (error checking 
> > and debugging messages are left out)
> > 
> > 
> > void take_chkpt(void) {
> > pid_t pid;
> > char call_pid[10];
> > char call_num[10];
> > 
> > chkptpid = getpid();
> > snprintf(call_pid, 9, "%d", chkptpid);
> > snprintf(call_num, 9, "%d", checkpointnum);
> > 
> > switch(pid = fork()) {
> > case -1:
> > fprintf(stderr, "Fork failed.\n");
> > return;
> > break;
> > case  0:   /* child process */
> > if(!execl("child_take", call_pid, call_num, (char *)0))
> > perror("execl: ");
> > break;
> > default:   /* parent process */
> > waitpid(pid, NULL, 0);
> > break;
> > }
> > 
> > return;
> > }
> > 
> > 
> > void restore_chkpts(void) {
> > pid_t pid;
> > char call_pid[10];
> > char call_num[10];
> > 
> > ENTERFUN();
> > 
> > if(restore_retry) // do nothing on second call to restore
> > return;
> > 
> > chkptpid = getpid();
> > snprintf(call_pid, 9, "%d", chkptpid);
> > snprintf(call_num, 9, "%d", checkpointnum);
> > 
> > switch(pid = fork()) {
> > case -1:
> > fprintf(stderr, "MP: Fork failed.\n");
> > return;
> > break;
> > case  0:   /* child process */
> > if(!execl("child_restore", call_pid, call_num, (char *)0))
> > perror("execl: ");
> > break;
> > default:   /* parent process */
> > INF(("Parent Process"));
> > restore_retry=1;
> > INF(("Wait for Child..."));
> > waitpid(pid, NULL, 0);
> > break;
> > }
> > 
> > LEAVEFUN();
> > 
> > return;
> > }
> > 
> > int main(int argc, char* argv[]) {
> > take_chkpt();
> > printf("Hello cruel world!\n");
> > restore_chkpts();
> > return 0;
> > }
> > 
> > where child_take and child_restore do the following:
> > 
> > 
> > void child_take_chkpt(int chkptpid, int checkpointnum) {
> > struct chkpt_ioctl chkptio;
> > int dev_fd; // ioctl device file
> > char chkptname[30];
> > 
> > if ((dev_fd = open(CHKPT_DEVICE, O_RDWR)) < 0) {
> > perror("MP: Open device file");
> > exit(EXIT_FAILURE);
> > }
> > chkptio.pid = chkptpid;
> > snprintf(chkptname, 29, "/tmp/chkpt_%d_%d", chkptio.pid, 
> > checkpointnum);
> > chkptio.file = creat(chkptname, 00755);
> > sleep(1); // to go sure the parent process is in waitpid -- ugly, 
> > but works
> > kill(chkptio.pid, SIGSTOP);
> > sleep(1);
> > ioct

[Devel] Re: [patch 1/5][RFC - ipv4/udp checkpoint/restart] : add lookup for unhashed inode

2007-06-06 Thread Serge E. Hallyn
Quoting [EMAIL PROTECTED] ([EMAIL PROTECTED]):
> A socket relies on sockfs. In some cases sockets are orphans and cannot
> be accessed via a file descriptor; this is the case, for example, for
> timewait sockets. Fortunately, an inode can still be used to identify
> a socket. Its number can be retrieved from /proc/net/tcp for orphan sockets
> or from an fstat call.
> 
> When a socket is created, its inode is added to sockfs.
> Unfortunately, it is not stored in the hashed inode list, so
> I need a helper to browse the inode list contained in the superblock
> of sockfs.
> 
> This is one solution; another is to store the inode in
> the hashed list when the socket is created.

I assume that would be unacceptable overhead on a very busy server.
Walking all the inodes NUM_INODES(task_set) for a checkpoint could
be a real bottleneck, but at least it's only at checkpoint time.

Have you checked net-dev archives for discussions about not hashing
these inodes?  I suppose at some point you'll want to ask there what the
preference is.

But certainly for now this seems the right approach.

> Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
Acked-by: Serge E. Hallyn <[EMAIL PROTECTED]>

(Or whatever tag they decide over on lkml that I should be using  :)

thanks,
-serge

PS - I won't be acking other patches bc I just haven't looked at
netlink enough - so don't read anything more into that :)

> ---
>  fs/inode.c |   29 +
>  include/linux/fs.h |1 +
>  2 files changed, 30 insertions(+)
> 
> Index: 2.6.20-cr/fs/inode.c
> ===
> --- 2.6.20-cr.orig/fs/inode.c
> +++ 2.6.20-cr/fs/inode.c
> @@ -877,6 +877,35 @@
> 
>  EXPORT_SYMBOL(ilookup);
> 
> +
> +/**
> + * ilookup_unhashed - search for an inode in the superblock
> + * @sb:  super block of file system to search
> + * @ino: inode number to search for
> + *
> + * ilookup_unhashed browses the superblock inode list to find the inode.
> + *
> + * If the inode is found in the inode list stored in the superblock, the inode is
> + * returned with an incremented reference count.
> + *
> + * Otherwise NULL is returned.
> + */
> +struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino)
> +{
> + struct inode *inode = NULL;
> +
> + spin_lock(&inode_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
> + if (inode->i_ino == ino) {
> + __iget(inode);
> + break;
> + }
> + spin_unlock(&inode_lock);
> + return inode;
> +
> +}
> +EXPORT_SYMBOL(ilookup_unhashed);
> +
>  /**
>   * iget5_locked - obtain an inode from a mounted file system
>   * @sb:  super block of file system
> Index: 2.6.20-cr/include/linux/fs.h
> ===
> --- 2.6.20-cr.orig/include/linux/fs.h
> +++ 2.6.20-cr/include/linux/fs.h
> @@ -1657,6 +1657,7 @@
>  extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
>   int (*test)(struct inode *, void *), void *data);
>  extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
> +extern struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino);
> 
>  extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
>  extern struct inode * iget_locked(struct super_block *, unsigned long);
> 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers



[Devel] Re: breakfast at ols?

2007-06-06 Thread Daniel Lezcano

Cedric Le Goater wrote:
> Serge E. Hallyn wrote:
> > Last year we all met for breakfast at OLS.  Now we've all pretty much
> > all already met so maybe it's less exciting, but do people (who will be
> > at OLS) care to meet for breakfast on the thursday or friday?
> 
> OK for me, if i can skip the pancakes with tons of cream and jam.

Ok for me too.


[Devel] Re: breakfast at ols?

2007-06-06 Thread Cedric Le Goater
Serge E. Hallyn wrote:
> Last year we all met for breakfast at OLS.  Now we've all pretty much
> all already met so maybe it's less exciting, but do people (who will be
> at OLS) care to meet for breakfast on the thursday or friday?

OK for me, if i can skip the pancakes with tons of cream and jam. 

C.   


[Devel] Re: checkpointing and restoring processes

2007-06-06 Thread Cedric Le Goater
Mark Pflueger wrote:
> hi everyone!
> 
> i'm not subscribed to the list, so if you care to flame because of my noob 
> question, just do it to the list, otherwise please cc me.

you should subscribe to [EMAIL PROTECTED] and send your ideas on that
list. There's a BOF on that topic at OLS if you can attend.

cheers,

C.

> i'm trying to write a checkpoint/restore module for processes and so have 
> a basic version going already - problem is, when i restore the process, 
> one of three things happens at random. first is, the process restored 
> segfaults. second is, i get a kernel null pointer dereference and third 
> is, i get a virtual address lookup error and a kernel crash. the trace 
> back and the address always change.
> 
> the user space process is as simple as i could make it: (error checking 
> and debugging messages are left out)
> 
> 
> void take_chkpt(void) {
> pid_t pid;
> char call_pid[10];
> char call_num[10];
> 
> chkptpid = getpid();
> snprintf(call_pid, 9, "%d", chkptpid);
> snprintf(call_num, 9, "%d", checkpointnum);
> 
>   switch(pid = fork()) {
>   case -1:
> fprintf(stderr, "Fork failed.\n");
> return;
> break;
>   case  0:   /* child process */
> if(!execl("child_take", call_pid, call_num, (char *)0))
> perror("execl: ");
> break;
>   default:   /* parent process */
> waitpid(pid, NULL, 0);
> break;
>   }
> 
> return;
> }
> 
> 
> void restore_chkpts(void) {
> pid_t pid;
> char call_pid[10];
> char call_num[10];
> 
>   ENTERFUN();
> 
> if(restore_retry) // do nothing on second call to restore
> return;
> 
> chkptpid = getpid();
> snprintf(call_pid, 9, "%d", chkptpid);
> snprintf(call_num, 9, "%d", checkpointnum);
> 
>   switch(pid = fork()) {
>   case -1:
> fprintf(stderr, "MP: Fork failed.\n");
> return;
> break;
>   case  0:   /* child process */
> if(!execl("child_restore", call_pid, call_num, (char *)0))
> perror("execl: ");
> break;
>   default:   /* parent process */
> INF(("Parent Process"));
> restore_retry=1;
> INF(("Wait for Child..."));
> waitpid(pid, NULL, 0);
> break;
>   }
> 
>   LEAVEFUN();
> 
> return;
> }
> 
> int main(int argc, char* argv[]) {
>   take_chkpt();
>   printf("Hello cruel world!\n");
>   restore_chkpts();
>   return 0;
> }
> 
> where child_take and child_restore do the following:
> 
> 
> void child_take_chkpt(int chkptpid, int checkpointnum) {
> struct chkpt_ioctl chkptio;
> int dev_fd; // ioctl device file
> char chkptname[30];
> 
> if ((dev_fd = open(CHKPT_DEVICE, O_RDWR)) < 0) {
> perror("MP: Open device file");
> exit(EXIT_FAILURE);
> }
> chkptio.pid = chkptpid;
> snprintf(chkptname, 29, "/tmp/chkpt_%d_%d", chkptio.pid, 
> checkpointnum);
> chkptio.file = creat(chkptname, 00755);
> sleep(1); // to go sure the parent process is in waitpid -- ugly, 
> but works
> kill(chkptio.pid, SIGSTOP);
> sleep(1);
> ioctl(dev_fd, CHKPT_IOCTL_SAVE, (unsigned long)&chkptio);
> close(dev_fd);
> close(chkptio.file);
> kill(chkptio.pid, SIGCONT);
> exit(0);
> }
> 
> void child_restore_chkpts(int chkptpid, int checkpointnum) {
> struct chkpt_ioctl chkptio;
> int dev_fd; // ioctl device file
> char chkptname[30];
> 
> snprintf(chkptname, 29, "/tmp/chkpt_%d_%d", chkptpid, 
> checkpointnum-1);
> chkptio.file = open(chkptname, O_RDONLY);
> chkptio.pid = chkptpid;
> dev_fd = open(CHKPT_DEVICE, O_RDWR);
> sleep(1);
> kill(chkptpid, SIGSTOP);
> sleep(1);
> ioctl(dev_fd, CHKPT_IOCTL_RESTORE, (unsigned long)&chkptio);
> close(chkptio.file);
> close(dev_fd);
> kill(chkptpid, SIGCONT);
> exit(0);
> }
> 
> the header for the files is this:
> 
> 
> enum {
> CHKPT_IOCTL_SAVE,
> CHKPT_IOCTL_RESTORE
> };
> 
> struct chkpt_ioctl {
> pid_t pid; // for fork tests
> int file;
> };
> 
> struct chkpt {
> pid_t pid; // for fork tests
> struct pt_regs regs;
> unsigned int datasize;
> unsigned int brksize;
> unsigned int stacksize;
> };
> 
> 
> and finally the kernel module:
> 
> int chkpt_ioctl_handler(struct inode *i, struct file *f,
>  unsigned int cmd, unsigned long arg) {
> struct chkpt_ioctl pmio, *u_pmio;
> int ret = -1;
> 
> u_pmio = (struct chkpt_ioctl *)arg;
> 
> switch(cmd) {
> 
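
A note on the quoted userspace helpers: execl() only returns when it
fails, and then it returns -1, so the test "if(!execl(...))" never
reaches perror(). The first argument after the path is also argv[0] of
the new program, so call_pid lands in argv[0] rather than argv[1]. A
minimal sketch of the fork/exec/wait pattern with both points handled,
assuming the helper reads its arguments from argv[1] and argv[2]
(spawn_helper is a made-up name):

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Run `path` with two string arguments and wait for it to finish.
 * Returns the child's exit status, or -1 if fork/waitpid failed or
 * the child did not exit normally.
 */
int spawn_helper(const char *path, const char *arg1, const char *arg2)
{
    int status;
    pid_t pid;

    assert(path != NULL);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        /* argv[0] is the program name; real arguments follow. */
        execl(path, path, arg1, arg2, (char *)NULL);
        /* Reached only if execl() failed. */
        perror("execl");
        _exit(127);
    }
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this shape, a failed exec shows up as exit status 127 in the parent
instead of the child silently falling through into the parent's code path.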

[Devel] [patch 5/5][RFC - ipv4/udp checkpoint/restart] : c/r the udp part of the socket

2007-06-06 Thread dlezcano
From: Daniel Lezcano <[EMAIL PROTECTED]>

This patch defines a set of netlink attributes to store/retrieve UDP
options and endpoints. The logic is to extend the netlink message attributes
to take these new values into account.
The ops of struct sock are extended with dump/restore callbacks, so when
a socket is asked to be checkpointed, the call will fail if dump/restore
is not implemented in the protocol. That allows C/R functionality to be
brought to each protocol step by step.

 * At dump time: the local binding is retrieved with kernel_getname and
   the distant binding is retrieved with kernel_getpeername.

 * At restore time: the local binding is set by the kernel_bind call and
   the distant binding is set by kernel_connect.
   If the local binding was done with an autobind, the userlock flags will
   not be set, so the flags are reset if they were not set during the dump.
 
One point to be discussed: should we C/R sendQ and recvQ, knowing the
protocol is not reliable?
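
The "fail if the protocol does not implement dump/restore" rule can be
illustrated outside the kernel. The following is only a userspace sketch
with made-up toy_* types, not the code from this patch; the dumped state
is reduced to a port number to keep it short.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_sock;

/* Per-protocol operations; dump/restore are optional. */
struct toy_proto {
    const char *name;
    int (*dump)(struct toy_sock *sk, char *buf, size_t len);
    int (*restore)(struct toy_sock *sk, const char *buf, size_t len);
};

struct toy_sock {
    const struct toy_proto *prot;
    unsigned short port;
};

static int toy_udp_dump(struct toy_sock *sk, char *buf, size_t len)
{
    if (len < sizeof(sk->port))
        return -EINVAL;
    memcpy(buf, &sk->port, sizeof(sk->port));
    return (int)sizeof(sk->port);   /* bytes written */
}

static int toy_udp_restore(struct toy_sock *sk, const char *buf, size_t len)
{
    if (len < sizeof(sk->port))
        return -EINVAL;
    memcpy(&sk->port, buf, sizeof(sk->port));
    return 0;
}

/* "udp" implements both callbacks, "tcp" does not (yet). */
const struct toy_proto toy_udp = { "udp", toy_udp_dump, toy_udp_restore };
const struct toy_proto toy_tcp = { "tcp", NULL, NULL };

/* Generic entry points: refuse protocols without the callback. */
int toy_sock_dump(struct toy_sock *sk, char *buf, size_t len)
{
    assert(sk != NULL && sk->prot != NULL);
    if (!sk->prot->dump)
        return -ENOSYS;
    return sk->prot->dump(sk, buf, len);
}

int toy_sock_restore(struct toy_sock *sk, const char *buf, size_t len)
{
    assert(sk != NULL && sk->prot != NULL);
    if (!sk->prot->restore)
        return -ENOSYS;
    return sk->prot->restore(sk, buf, len);
}
```

Adding support for another protocol then only means filling in its two
callbacks; the generic entry points stay unchanged.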

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 include/linux/af_inet_cr.h |9 +++
 include/linux/udp_cr.h |   26 +
 include/net/sock.h |6 +-
 net/ipv4/Makefile  |2 
 net/ipv4/af_inet_cr.c  |   23 
 net/ipv4/udp.c |6 +-
 net/ipv4/udp_cr.c  |  119 +
 7 files changed, 187 insertions(+), 4 deletions(-)

Index: 2.6.20-cr/include/linux/af_inet_cr.h
===
--- 2.6.20-cr.orig/include/linux/af_inet_cr.h
+++ 2.6.20-cr/include/linux/af_inet_cr.h
@@ -90,6 +90,15 @@
AF_INET_CR_ATTR_IPOPT_MREQ,
AF_INET_CR_ATTR_IPOPT_MULTICAST_IF,
 
+   /* udp options */
+   AF_INET_CR_ATTR_UDPOPT_CORK,
+
+   /* udp protocol */
+   UDP_CR_ATTR_BIND,
+   UDP_CR_ATTR_BIND_ADDR_ULOCK,
+   UDP_CR_ATTR_BIND_PORT_ULOCK,
+   UDP_CR_ATTR_PEER,
+
AF_INET_CR_ATTR_MAX
 };
 #endif
Index: 2.6.20-cr/include/linux/udp_cr.h
===
--- /dev/null
+++ 2.6.20-cr/include/linux/udp_cr.h
@@ -0,0 +1,26 @@
+/*
+ *
+ *
+ */
+#ifndef _UDP_CR_H
+#define _UDP_CR_H
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_IP_CR
+extern int udp_dump(struct socket *sock, struct sk_buff *skb);
+extern int udp_restore(struct socket *sock, const struct genl_info *info);
+#else
+static inline int udp_dump(struct socket *sock, struct sk_buff *skb)
+{
+   return -ENOSYS;
+}
+
+static inline int udp_restore(struct socket *sock, const struct genl_info *info)
+{
+   return -ENOSYS;
+}
+#endif /* CONFIG_IP_CR */
+
+#endif
Index: 2.6.20-cr/include/net/sock.h
===
--- 2.6.20-cr.orig/include/net/sock.h
+++ 2.6.20-cr/include/net/sock.h
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * This structure really needs to be cleaned up.
@@ -554,7 +555,10 @@
void(*hash)(struct sock *sk);
void(*unhash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short snum);
-
+#ifdef CONFIG_IP_CR
+   int (*dump)(struct socket *sock, struct sk_buff *skb);
+   int (*restore)(struct socket *sock, const struct genl_info *info);
+#endif
/* Memory pressure */
void(*enter_memory_pressure)(void);
atomic_t*memory_allocated;  /* Current allocated memory. */
Index: 2.6.20-cr/net/ipv4/Makefile
===
--- 2.6.20-cr.orig/net/ipv4/Makefile
+++ 2.6.20-cr/net/ipv4/Makefile
@@ -53,4 +53,4 @@
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
  xfrm4_output.o
-obj-$(CONFIG_IP_CR) += af_inet_cr.o
+obj-$(CONFIG_IP_CR) += af_inet_cr.o udp_cr.o
Index: 2.6.20-cr/net/ipv4/af_inet_cr.c
===
--- 2.6.20-cr.orig/net/ipv4/af_inet_cr.c
+++ 2.6.20-cr/net/ipv4/af_inet_cr.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -72,6 +73,14 @@
[AF_INET_CR_ATTR_IPOPT_OPTIONS] =
{ .len = sizeof(struct ip_options) + 40 },
 
+   /* udp options */
+   [AF_INET_CR_ATTR_UDPOPT_CORK] = { .type = NLA_FLAG },
+
+   /* udp endpoints */
+   [UDP_CR_ATTR_BIND] = { .len =  sizeof(struct sockaddr_in) },
+   [UDP_CR_ATTR_PEER] = { .len =  sizeof(struct sockaddr_in) },
+   [UDP_CR_ATTR_BIND_ADDR_ULOCK] = { .type = NLA_FLAG },
+   [UDP_CR_ATTR_BIND_PORT_ULOCK] = { .type = NLA_FLAG },
 };
 
 /*
@@ -513,6 +522,9 @@
void *msg_head;
int ret;
 
+   if (!sk->sk_prot->dump)
+   return -ENOSYS;
+
if (family != AF_INET)
return -EINVAL;
 
@@ -542,6 +554,10 @@
if (ret)
goto out;
 
+ 

[Devel] [patch 4/5][RFC - ipv4/udp checkpoint/restart] : c/r the inet options of the socket

2007-06-06 Thread dlezcano
From: Daniel Lezcano <[EMAIL PROTECTED]>

This patch defines a set of netlink attributes to store/retrieve inet
options. The logic is to extend the netlink message attributes to take
these new values into account.

The multicast list is browsed first and the netlink nested attribute is filled
in the reverse order. That allows the initial order of the multicast list to
be kept when restoring the socket. It is not really a big issue if the lists
are inverted, but keeping the order facilitates testing because the attributes
stay exactly the same, and the initial socket and the restored socket can be
compared with a simple "memcmp".
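
Why the reverse order matters can be shown with a toy list, assuming (as
this description implies) that each new membership is prepended to the
socket's list. The names below are made up and the group is a plain
integer instead of a struct ip_mreqn.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct member {
    unsigned int group;     /* stand-in for a multicast group */
    struct member *next;
};

/* Joining a group prepends it to the list. */
int member_join(struct member **head, unsigned int group)
{
    struct member *m;

    assert(head != NULL);
    m = malloc(sizeof(*m));
    if (!m)
        return -1;
    m->group = group;
    m->next = *head;
    *head = m;
    return 0;
}

/*
 * Dump the list into out[], filling the array from the back so that
 * out[0] holds the oldest member.  Returns the number of entries, or
 * -1 if out[] is too small.
 */
int members_dump(const struct member *head, unsigned int *out, size_t max)
{
    const struct member *m;
    size_t n = 0, i;

    for (m = head; m; m = m->next)
        n++;
    if (n > max)
        return -1;
    i = n;
    for (m = head; m; m = m->next)
        out[--i] = m->group;
    return (int)n;
}

/* Restore by re-joining in dump order; prepending undoes the reversal. */
int members_restore(struct member **head, const unsigned int *in, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (member_join(head, in[i]) < 0)
            return -1;
    return 0;
}

void members_free(struct member *head)
{
    while (head) {
        struct member *next = head->next;
        free(head);
        head = next;
    }
}
```

Dumping the restored list again yields the same array as the first dump,
which is the property that makes a byte-wise comparison usable as a test.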


Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 include/linux/af_inet_cr.h |   19 
 net/ipv4/af_inet_cr.c  |  205 -
 2 files changed, 223 insertions(+), 1 deletion(-)

Index: 2.6.20-cr/include/linux/af_inet_cr.h
===
--- 2.6.20-cr.orig/include/linux/af_inet_cr.h
+++ 2.6.20-cr/include/linux/af_inet_cr.h
@@ -71,6 +71,25 @@
AF_INET_CR_ATTR_SOCKOPT_SNDBUF_ULOCK,
AF_INET_CR_ATTR_SOCKOPT_RCVBUF_ULOCK,
 
+   /* ip options */
+   AF_INET_CR_ATTR_IPOPT_OPTIONS,
+   AF_INET_CR_ATTR_IPOPT_PKTINFO,
+   AF_INET_CR_ATTR_IPOPT_RECVTOS,
+   AF_INET_CR_ATTR_IPOPT_RECVTTL,
+   AF_INET_CR_ATTR_IPOPT_RECVOPTS,
+   AF_INET_CR_ATTR_IPOPT_RETOPTS,
+   AF_INET_CR_ATTR_IPOPT_TOS,
+   AF_INET_CR_ATTR_IPOPT_TTL,
+   AF_INET_CR_ATTR_IPOPT_HDRINCL,
+   AF_INET_CR_ATTR_IPOPT_RECVERR,
+   AF_INET_CR_ATTR_IPOPT_MTU_DISCOVER,
+   AF_INET_CR_ATTR_IPOPT_ROUTER_ALERT,
+   AF_INET_CR_ATTR_IPOPT_MULTICAST_TTL,
+   AF_INET_CR_ATTR_IPOPT_MULTICAST_LOOP,
+   AF_INET_CR_ATTR_IPOPT_MEMBERSHIP,
+   AF_INET_CR_ATTR_IPOPT_MREQ,
+   AF_INET_CR_ATTR_IPOPT_MULTICAST_IF,
+
AF_INET_CR_ATTR_MAX
 };
 #endif
Index: 2.6.20-cr/net/ipv4/af_inet_cr.c
===
--- 2.6.20-cr.orig/net/ipv4/af_inet_cr.c
+++ 2.6.20-cr/net/ipv4/af_inet_cr.c
@@ -13,8 +13,12 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 
+#include 
+
 /*
  * Netlink message policy definition
  */
@@ -44,6 +48,30 @@
[AF_INET_CR_ATTR_SOCKOPT_SNDTIMEO] = { .len = sizeof(struct timeval) },
[AF_INET_CR_ATTR_SOCKOPT_LINGER]   = { .len = sizeof(struct linger)  },
[AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE]  = { .len  = IFNAMSIZ },
+
+   /* ip options */
+   [AF_INET_CR_ATTR_IPOPT_PKTINFO] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_RECVTOS] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_RECVTTL] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_RECVOPTS]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_RETOPTS] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_HDRINCL] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_RECVERR] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_ROUTER_ALERT]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_MULTICAST_LOOP]  = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_IPOPT_MULTICAST_IF]= { .type = NLA_U32 },
+   [AF_INET_CR_ATTR_IPOPT_TOS] = { .type = NLA_U8 },
+   [AF_INET_CR_ATTR_IPOPT_TTL] = { .type = NLA_U8 },
+   [AF_INET_CR_ATTR_IPOPT_MTU_DISCOVER]= { .type = NLA_U8 },
+   [AF_INET_CR_ATTR_IPOPT_MULTICAST_TTL]   = { .type = NLA_U8 },
+   [AF_INET_CR_ATTR_IPOPT_MEMBERSHIP]  = { .type = NLA_NESTED },
+
+   [AF_INET_CR_ATTR_IPOPT_MREQ] =
+   { .len =  sizeof(struct ip_mreqn) },
+
+   [AF_INET_CR_ATTR_IPOPT_OPTIONS] =
+   { .len = sizeof(struct ip_options) + 40 },
+
 };
 
 /*
@@ -77,6 +105,28 @@
{ SO_BINDTODEVICE, AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE, 0, SET  },
 };
 
+/*
+ * ip options association with netlink attribute
+ */
+struct af_inet_cr_optattr ip_options[] = {
+   { IP_PKTINFO,AF_INET_CR_ATTR_IPOPT_PKTINFO,1, BOTH },
+   { IP_RECVTOS,AF_INET_CR_ATTR_IPOPT_RECVTOS,1, BOTH },
+   { IP_RECVTTL,AF_INET_CR_ATTR_IPOPT_RECVTTL,0, BOTH },
+   { IP_RECVOPTS,   AF_INET_CR_ATTR_IPOPT_RECVOPTS,   1, BOTH },
+   { IP_RETOPTS,AF_INET_CR_ATTR_IPOPT_RETOPTS,1, BOTH },
+   { IP_TOS,AF_INET_CR_ATTR_IPOPT_TOS,0, BOTH },
+   { IP_TTL,AF_INET_CR_ATTR_IPOPT_TTL,0, BOTH },
+   { IP_HDRINCL,AF_INET_CR_ATTR_IPOPT_HDRINCL,1, BOTH },
+   { IP_RECVERR,AF_INET_CR_ATTR_IPOPT_RECVERR,1, BOTH },
+   { IP_MTU_DISCOVER,   AF_INET_CR_ATTR_IPOPT_MTU_DISCOVER,   0, BOTH },
+   { IP_MULTICAST_TTL,  AF_INET_CR_ATTR_IPOPT_MULTICAST_TTL,  1, BOTH },
+   { IP_MULTICAST_LOOP, AF_INET_CR_ATTR_IPOPT_MULTICAST_LOOP, 0, BOTH },
+   { IP_MULTICAST_IF,   AF_INET_CR_ATTR_IPOPT_MULTICAST_IF, 

[Devel] [patch 3/5][RFC - ipv4/udp checkpoint/restart] : c/r the socket information and options

2007-06-06 Thread dlezcano
From: Daniel Lezcano <[EMAIL PROTECTED]>

This patch defines a set of netlink attributes to store/retrieve socket
options.

 * At dump time, a netlink message specifies the inode of the socket to
   be checkpointed. The socket is retrieved with the inode number. A
   new netlink message is built in order to store the socket information.
   The type, state and socket options are stored into it and the netlink
   message is transmitted to the requestor.

 * At restore time, the netlink message contains the type of the socket.
   A new socket is created using this type, and the attributes are browsed
   in order to use their values to restore the different options.

The choice made for C/R is to stick as closely as possible to the user/kernel
frontier. For this reason, kernel_{set,get}sockopt are used. That reduces
the amount of code and delegates the different checks to the corresponding
function. Unfortunately, some get/set pairs are not symmetric, so some options
can be retrieved but not set and vice versa. For this reason, there are a few
helpers, and the option definitions contain a GET|SET|BOTH flag.
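
The table-driven idea can be tried from userspace with plain
getsockopt()/setsockopt(). This is only a sketch with made-up names
(opt_desc, dump_opts, restore_opts), restricted to integer-valued
options. SO_RCVBUF is marked GET-only here to mirror the table in the
patch; on Linux the value read back is not the value that was set, which
is one example of a get/set pair that is not symmetric.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

enum opt_dir { OPT_GET = 1, OPT_SET = 2, OPT_BOTH = OPT_GET | OPT_SET };

struct opt_desc {
    int optname;    /* SOL_SOCKET level option */
    int dir;        /* which directions are usable */
};

static const struct opt_desc int_opts[] = {
    { SO_BROADCAST, OPT_BOTH },
    { SO_REUSEADDR, OPT_BOTH },
    { SO_RCVBUF,    OPT_GET  },
};
#define N_INT_OPTS (sizeof(int_opts) / sizeof(int_opts[0]))

/* Read every gettable option of fd into vals[N_INT_OPTS]. */
int dump_opts(int fd, int *vals)
{
    size_t i;

    assert(vals != NULL);
    for (i = 0; i < N_INT_OPTS; i++) {
        socklen_t len = sizeof(vals[i]);

        vals[i] = 0;
        if (!(int_opts[i].dir & OPT_GET))
            continue;
        if (getsockopt(fd, SOL_SOCKET, int_opts[i].optname,
                       &vals[i], &len) < 0)
            return -1;
    }
    return 0;
}

/* Apply dumped values to fd, skipping options that are not symmetric. */
int restore_opts(int fd, const int *vals)
{
    size_t i;

    assert(vals != NULL);
    for (i = 0; i < N_INT_OPTS; i++) {
        if (int_opts[i].dir != OPT_BOTH)
            continue;
        if (setsockopt(fd, SOL_SOCKET, int_opts[i].optname,
                       &vals[i], sizeof(vals[i])) < 0)
            return -1;
    }
    return 0;
}
```

The options left out by restore_opts() are the ones that would need a
dedicated helper, as the description says.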


Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 include/linux/af_inet_cr.h |   61 
 net/ipv4/af_inet_cr.c  |  640 +++--
 2 files changed, 680 insertions(+), 21 deletions(-)

Index: 2.6.20-cr/net/ipv4/af_inet_cr.c
===
--- 2.6.20-cr.orig/net/ipv4/af_inet_cr.c
+++ 2.6.20-cr/net/ipv4/af_inet_cr.c
@@ -12,36 +12,644 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /*
- * af_inet_cr_nldump : this function is called when a netlink message is received
- * with AF_INET_CR_CMD_DUMP command.
+ * Netlink message policy definition
+ */
+static struct nla_policy af_inet_cr_policy[AF_INET_CR_ATTR_MAX] = {
+   [AF_INET_CR_ATTR_INODE]= { .type = NLA_U32 },
+
+   [AF_INET_CR_ATTR_SOCK_STATE]   = { .type = NLA_U32 },
+   [AF_INET_CR_ATTR_SOCK_TYPE]= { .type = NLA_U32 },
+
+   [AF_INET_CR_ATTR_SOCKOPT_BROADCAST]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_DEBUG]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_DONTROUTE]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_KEEPALIVE]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_OOBINLINE]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_PASSCRED] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_REUSEADDR]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_TIMESTAMP]= { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_SNDBUF_ULOCK] = { .type = NLA_FLAG },
+   [AF_INET_CR_ATTR_SOCKOPT_RCVBUF_ULOCK] = { .type = NLA_FLAG },
+
+   [AF_INET_CR_ATTR_SOCKOPT_RCVBUF]   = { .type = NLA_U32  },
+   [AF_INET_CR_ATTR_SOCKOPT_SNDBUF]   = { .type = NLA_U32  },
+   [AF_INET_CR_ATTR_SOCKOPT_PRIORITY] = { .type = NLA_U32  },
+   [AF_INET_CR_ATTR_SOCKOPT_RCVLOWAT] = { .type = NLA_U32  },
+
+   [AF_INET_CR_ATTR_SOCKOPT_RCVTIMEO] = { .len = sizeof(struct timeval) },
+   [AF_INET_CR_ATTR_SOCKOPT_SNDTIMEO] = { .len = sizeof(struct timeval) },
+   [AF_INET_CR_ATTR_SOCKOPT_LINGER]   = { .len = sizeof(struct linger)  },
+   [AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE]  = { .len  = IFNAMSIZ },
+};
+
+/*
+ * Generic netlink family definition
+ */
+static struct genl_family af_inet_cr_family = {
+   .id = GENL_ID_GENERATE,
+   .name   = "af_inet_cr",
+   .version= 0x1,
+   .maxattr= AF_INET_CR_ATTR_MAX - 1,
+};
+
+/*
+ * socket options association with netlink attribute
+ */
+struct af_inet_cr_optattr socket_options[] = {
+   { SO_BROADCAST,AF_INET_CR_ATTR_SOCKOPT_BROADCAST,0, BOTH },
+   { SO_DEBUG,AF_INET_CR_ATTR_SOCKOPT_DEBUG,0, BOTH },
+   { SO_DONTROUTE,AF_INET_CR_ATTR_SOCKOPT_DONTROUTE,0, BOTH },
+   { SO_KEEPALIVE,AF_INET_CR_ATTR_SOCKOPT_KEEPALIVE,0, BOTH },
+   { SO_OOBINLINE,AF_INET_CR_ATTR_SOCKOPT_OOBINLINE,0, BOTH },
+   { SO_PRIORITY, AF_INET_CR_ATTR_SOCKOPT_PRIORITY, 0, BOTH },
+   { SO_RCVLOWAT, AF_INET_CR_ATTR_SOCKOPT_RCVLOWAT, 0, BOTH },
+   { SO_RCVBUF,   AF_INET_CR_ATTR_SOCKOPT_RCVBUF,   0, GET  },
+   { SO_SNDBUF,   AF_INET_CR_ATTR_SOCKOPT_SNDBUF,   0, GET  },
+   { SO_REUSEADDR,AF_INET_CR_ATTR_SOCKOPT_REUSEADDR,0, BOTH },
+   { SO_TIMESTAMP,AF_INET_CR_ATTR_SOCKOPT_TIMESTAMP,0, BOTH },
+   { SO_LINGER,   AF_INET_CR_ATTR_SOCKOPT_LINGER,   0, BOTH },
+   { SO_RCVTIMEO, AF_INET_CR_ATTR_SOCKOPT_RCVTIMEO, 0, BOTH },
+   { SO_SNDTIMEO, AF_INET_CR_ATTR_SOCKOPT_SNDTIMEO, 0, BOTH },
+   { SO_BINDTODEVICE, AF_INET_CR_ATTR_SOCKOPT_BINDTODEVICE, 0, SET  },
+};
+
+
+/*
+ * socket_lookup : search for socket using the inode number
+ *
+ * @sb : superblock associated to

[Devel] [patch 2/5][RFC - ipv4/udp checkpoint/restart] : provide compilation option and genetlink framework

2007-06-06 Thread dlezcano
From: Daniel Lezcano <[EMAIL PROTECTED]>

This patchset provides the AF_INET C/R option in the makefile and a generic
netlink framework for passing the socket messages.

It seems that we are encouraged to use netlink instead of /proc, sysfs or
ioctls:
http://kerneltrap.org/node/6637

I found that there are a lot of advantages to using netlink:
 * the protocol is secure
 * the protocol describes the socket attributes and naturally brings a
   small abstraction layer over the internal kernel structures/data types.
   This is a good way to implement upward compatibility (e.g. moving a
   socket to a kernel with a higher version)
 * the amount of socket information is variable. With netlink, we don't
   need to take care of the size of the data (e.g. just read data while
   there is something to read)
 * the netlink attributes can be directly dumped to disk and reused to
   restore the socket, or examined for specific processing (I don't have
   an example).

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 include/linux/af_inet_cr.h |   15 +
 net/ipv4/Kconfig   |8 +++
 net/ipv4/Makefile  |1 
 net/ipv4/af_inet_cr.c  |  119 +
 4 files changed, 143 insertions(+)

Index: 2.6.20-cr/net/ipv4/Kconfig
===
--- 2.6.20-cr.orig/net/ipv4/Kconfig
+++ 2.6.20-cr/net/ipv4/Kconfig
@@ -1,6 +1,14 @@
 #
 # IP configuration
 #
+config IP_CR
+   tristate "IP: checkpoint/restart"
+   help
+ Checkpoint/restart allows dumping the socket states and
+  the associated protocol internals to userspace.
+  The data can be reused to recreate the socket in the same state.
+  It's safe to say N.
+
 config IP_MULTICAST
bool "IP: multicasting"
help
Index: 2.6.20-cr/net/ipv4/Makefile
===
--- 2.6.20-cr.orig/net/ipv4/Makefile
+++ 2.6.20-cr/net/ipv4/Makefile
@@ -53,3 +53,4 @@
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
  xfrm4_output.o
+obj-$(CONFIG_IP_CR) += af_inet_cr.o
Index: 2.6.20-cr/net/ipv4/af_inet_cr.c
===
--- /dev/null
+++ 2.6.20-cr/net/ipv4/af_inet_cr.c
@@ -0,0 +1,119 @@
+/*
+ *  Copyright (C) 2007 IBM Corporation
+ *
+ *  Author: Daniel Lezcano <[EMAIL PROTECTED]>
+ *
+ *  This program is free software; you can redistribute it and/or
+ *  modify it under the terms of the GNU General Public License as
+ *  published by the Free Software Foundation, version 2 of the
+ *  License.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * af_inet_cr_nldump : this function is called when a netlink message is received
+ * with AF_INET_CR_CMD_DUMP command.
+ * @skb  : the netlink packet giving the dump command
+ * @info : the generic netlink message
+ */
+static int af_inet_cr_nldump(struct sk_buff *skb, struct genl_info *info)
+{
+   return 0;
+}
+
+/*
+ * af_inet_cr_nlrestore : this function is called when a netlink message is received
+ * with AF_INET_CR_CMD_RESTORE command.
+ * @skb  : the netlink packet giving the restore command
+ * @info : the generic netlink message
+ */
+static int af_inet_cr_nlrestore(struct sk_buff *skb, struct genl_info *info)
+{
+   return 0;
+}
+
+/*
+ * Netlink message policy definition
+ */
+static struct nla_policy af_inet_cr_policy[AF_INET_CR_ATTR_MAX] = {
+   [AF_INET_CR_ATTR_INODE] = { .type = NLA_U32 },
+};
+
+/*
+ * Netlink dumping command configuration
+ */
+static struct genl_ops af_inet_cr_nldump_ops = {
+   .cmd = AF_INET_CR_CMD_DUMP,
+   .doit = af_inet_cr_nldump,
+   .policy = af_inet_cr_policy,
+};
+
+/*
+ * Netlink restore command configuration
+ */
+static struct genl_ops af_inet_cr_nlrestore_ops = {
+   .cmd = AF_INET_CR_CMD_RESTORE,
+   .doit = af_inet_cr_nlrestore,
+   .policy = af_inet_cr_policy,
+};
+
+/*
+ * Generic netlink family definition
+ */
+static struct genl_family af_inet_cr_family = {
+   .id = GENL_ID_GENERATE,
+   .name   = "af_inet_cr",
+   .version= 0x1,
+   .maxattr= AF_INET_CR_ATTR_MAX - 1,
+};
+
+/*
+ * af_inet_cr_init : this function is called at initialization
+ * time. It registers the generic netlink family associated with
+ * this module and attaches the different ops to it.
+ */
+static __init int af_inet_cr_init(void)
+{
+   int err;
+
+   err = genl_register_family(&af_inet_cr_family);
+   if (err < 0)
+   goto out;
+
+   err = genl_register_ops(&af_inet_cr_family,
+   &af_inet_cr_nldump_ops);
+   if (err < 0)
+   goto out_unregister_fam;
+
+   err = genl_register_ops(&af_inet_cr_family,
+   &af_inet_cr_nlrestore_ops);
+   if (err < 0)
+   goto out_unregister_dump;
+
+   return 0;
+

[Devel] [patch 1/5][RFC - ipv4/udp checkpoint/restart] : add lookup for unhashed inode

2007-06-06 Thread dlezcano
A socket relies on sockfs. In some cases sockets are orphans and cannot
be accessed via a file descriptor; this is the case, for example, for
timewait sockets. Fortunately, an inode can still be used to identify
a socket. Its number can be retrieved from /proc/net/tcp for orphan sockets
or from an fstat call.

When a socket is created, its inode is added to sockfs.
Unfortunately, it is not stored in the hashed inode list, so
I need a helper to browse the inode list contained in the superblock
of sockfs.

This is one solution; another is to store the inode in
the hashed list when the socket is created.

Signed-off-by: Daniel Lezcano <[EMAIL PROTECTED]>
---
 fs/inode.c |   29 +
 include/linux/fs.h |1 +
 2 files changed, 30 insertions(+)

Index: 2.6.20-cr/fs/inode.c
===
--- 2.6.20-cr.orig/fs/inode.c
+++ 2.6.20-cr/fs/inode.c
@@ -877,6 +877,35 @@
 
 EXPORT_SYMBOL(ilookup);
 
+
+/**
+ * ilookup_unhashed - search for an inode in the superblock
+ * @sb:  super block of file system to search
+ * @ino: inode number to search for
+ *
+ * ilookup_unhashed browses the superblock inode list to find the inode.
+ *
+ * If the inode is found in the inode list stored in the superblock, the inode is
+ * returned with an incremented reference count.
+ *
+ * Otherwise NULL is returned.
+ */
+struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino)
+{
+   struct inode *inode = NULL;
+
+   spin_lock(&inode_lock);
+   list_for_each_entry(inode, &sb->s_inodes, i_sb_list)
+   if (inode->i_ino == ino) {
+   __iget(inode);
+   break;
+   }
+   spin_unlock(&inode_lock);
+   return inode;
+
+}
+EXPORT_SYMBOL(ilookup_unhashed);
+
 /**
  * iget5_locked - obtain an inode from a mounted file system
  * @sb:super block of file system
Index: 2.6.20-cr/include/linux/fs.h
===
--- 2.6.20-cr.orig/include/linux/fs.h
+++ 2.6.20-cr/include/linux/fs.h
@@ -1657,6 +1657,7 @@
 extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data);
 extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
+extern struct inode *ilookup_unhashed(struct super_block *sb, unsigned long ino);
 
 extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
 extern struct inode * iget_locked(struct super_block *, unsigned long);
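
A note on the not-found path of ilookup_unhashed(): as far as I can tell,
when list_for_each_entry() runs to completion without hitting the break,
the cursor is left pointing at the container computed from the list head,
not at NULL, so the function as posted would not return NULL for a
missing inode. A userspace sketch of the same walk with an explicit NULL
result follows; the toy_* types and list helpers are simplified stand-ins,
not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's intrusive list helpers. */
struct list_head {
    struct list_head *next, *prev;
};

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_inode {
    unsigned long i_ino;
    struct list_head i_sb_list;
};

void toy_list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

void toy_list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/*
 * Walk the list and return the entry whose number matches, or NULL
 * once the walk is back at the list head.
 */
struct toy_inode *toy_lookup(struct list_head *head, unsigned long ino)
{
    struct list_head *pos;

    assert(head != NULL);
    for (pos = head->next; pos != head; pos = pos->next) {
        struct toy_inode *inode =
            toy_container_of(pos, struct toy_inode, i_sb_list);

        if (inode->i_ino == ino)
            return inode;   /* the kernel version would __iget() here */
    }
    return NULL;
}
```

In the kernel function the equivalent change would be to keep a separate
result pointer that is only assigned when a match is found.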



[Devel] [patch 0/5][RFC - ipv4/udp checkpoint/restart] dumping/restoring the IPV4/UDP sockets

2007-06-06 Thread dlezcano
Hi,

I would like to resurrect the discussion we had about socket 
checkpoint/restart.

I began looking at how to checkpoint sockets. Here is what I came up with:

Sockets can be checkpointed one by one from userspace. 
That provides a mechanism for applications that want to checkpoint 
themselves (I am thinking, for example, of the tcpcp project) and 
widens the scope of usability.

Inside a container, only the sockets belonging to that container are 
visible, and therefore checkpointable, so socket checkpoint/restart 
will bring mobility to the container.

Sockets rely on the socket fs, so we can use the inode number to identify a 
socket. With that, we can checkpoint/restart a socket associated with a fd, 
because the inode number is easily retrieved with an fstat call, and we can 
also identify orphan sockets. The inode numbers of orphan sockets are visible 
in the /proc/net/tcp file.
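
As an illustration of the fstat step described above, here is a minimal user-space sketch. The helper name socket_inode is hypothetical and is not part of the patchset; it only uses the standard fstat() call and the S_ISSOCK() macro.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Return the inode number backing a socket descriptor.
 * Returns 0 if fstat() fails or if fd does not refer to a socket.
 */
unsigned long socket_inode(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0)
        return 0;
    if (!S_ISSOCK(st.st_mode))
        return 0;
    return (unsigned long)st.st_ino;
}
```

Two descriptors that refer to the same socket (for example after dup()) report the same inode number, which is what makes the number usable as an identifier.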

The checkpoint/restart data can be transferred between kernel and userspace via 
generic netlink. It is a clean and secure way to define a message format 
with upward compatibility, i.e. moving the socket to an OS with a newer 
kernel version. The generic netlink message can either be used as raw data to 
be dumped directly to disk or be modified from userspace for some specific 
purpose (I don't have examples).
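
To make the compatibility argument concrete, here is a minimal sketch of a type-length-value layout in the style of netlink attributes. It is not the message format of the patchset: the attribute types, the sk_attr_hdr struct and the sk_put_u32/sk_get_u32 helpers are all hypothetical. The point is that a reader can skip attribute types it does not know and still find the ones it does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute types for a dumped socket (illustration only). */
enum { SK_ATTR_LOCAL_PORT = 1, SK_ATTR_RCVBUF = 2 };

/* Attribute header: total length (header + payload, unpadded) and type. */
struct sk_attr_hdr {
    uint16_t len;
    uint16_t type;
};

/* Attributes start on 4-byte boundaries, as netlink attributes do. */
#define SK_ALIGN(n) (((size_t)(n) + 3) & ~(size_t)3)

/* Append one u32 attribute at 'off'. Returns the new offset, 0 if no room. */
size_t sk_put_u32(uint8_t *buf, size_t off, size_t cap,
                  uint16_t type, uint32_t val)
{
    struct sk_attr_hdr h;
    size_t need = SK_ALIGN(sizeof(h) + sizeof(val));

    if (off + need > cap)
        return 0;
    h.len = (uint16_t)(sizeof(h) + sizeof(val));
    h.type = type;
    memcpy(buf + off, &h, sizeof(h));
    memcpy(buf + off + sizeof(h), &val, sizeof(val));
    return off + need;
}

/* Find a u32 attribute by type, skipping unknown ones. Returns 1 if found. */
int sk_get_u32(const uint8_t *buf, size_t len, uint16_t type, uint32_t *out)
{
    size_t off = 0;

    while (off + sizeof(struct sk_attr_hdr) <= len) {
        struct sk_attr_hdr h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.len < sizeof(h) || off + h.len > len)
            return 0;   /* malformed or truncated attribute */
        if (h.type == type && h.len == sizeof(h) + sizeof(*out)) {
            memcpy(out, buf + off + sizeof(h), sizeof(*out));
            return 1;
        }
        off += SK_ALIGN(h.len);   /* not the one we want: skip it */
    }
    return 0;
}
```

With this kind of layout, a dump that contains extra attributes can still be read by code that only knows a subset of the types, and a missing attribute is reported as "not found" so the caller can fall back to a default.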

The sockets are checkpointed/restored by sticking as closely as possible to 
the user/kernel boundary, in order to catch errors and bad values as early as 
possible. I am thinking, for example, of the kernel_setsockopt, 
kernel_connect, etc. family of functions; using them is more reliable and 
secure.

The following patchset is an RFC for C/R of UDP sockets. It applies to 2.6.20.

One question is pending: should we dump/restore the send and receive queues, 
knowing that the protocol is not reliable?

Regards.

  -- Daniel


[Devel] Per container statistics (containerstats)

2007-06-06 Thread Balbir Singh
Hi, Andrew/Paul,

Here's the latest version of containerstats ported to v10. Could you
please consider it for inclusion?

Changelog

1. Instead of parsing long container paths, use the dentry to match the
   container for which stats are required. The user space application
   opens the container directory and passes the file descriptor, which
   is used to determine the container for which stats are required.
   This approach was suggested by Paul Menage.
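
The user space half of this change can be sketched as follows. This is an assumption about how a utility might obtain the descriptor, not code from the patch: the helper name open_container_dir is hypothetical, and the netlink request that carries the descriptor to the kernel is left out because its attribute names are not shown in this mail.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open a container directory read-only and return its descriptor.
 * O_DIRECTORY makes open() fail if 'path' is not a directory.
 * Returns -1 on error, with errno set by open().
 */
int open_container_dir(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_DIRECTORY);
}
```

A caller would pass something like "/container/a", hand the returned descriptor to the kernel in the stats request, and close() it afterwards.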

This patch is inspired by the discussion at http://lkml.org/lkml/2007/4/11/187
and implements per container statistics as suggested by Andrew Morton
in http://lkml.org/lkml/2007/4/11/263. The patch is on top of 2.6.21-mm1
with Paul's containers v9 patches (forward ported)

This patch implements per container statistics infrastructure and re-uses
code from the taskstats interface. A new set of container operations is
registered with commands and attributes. It should be very easy to
*extend* per container statistics by adding members to the containerstats
structure.

The current model for containerstats is a pull model. A push model (to post
statistics on interesting events) should be very easy to add. Currently
user space requests statistics by passing the container file descriptor.
Statistics about the state of all the tasks in the container are returned to
user space.

TODO's/NOTE:

This patch provides an infrastructure for implementing container statistics.
Based on the needs of each controller, we can incrementally add more statistics,
event based support for notification of statistics, and accumulation of
taskstats into container statistics in the future.

Sample output

# ./containerstats -C /container/a
sleeping 2, blocked 0, running 1, stopped 0, uninterruptible 0

# ./containerstats -C /container/
sleeping 154, blocked 0, running 0, stopped 0, uninterruptible 0
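
The per-state counts in the sample output above could be accumulated along these lines. This is only a sketch: struct cstats_sketch, the CS_* states and the helper functions are hypothetical and are not taken from include/linux/containerstats.h. The counter names simply mirror the five fields printed by the utility.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical task states, one per column of the sample output. */
enum cs_state {
    CS_SLEEPING, CS_BLOCKED, CS_RUNNING, CS_STOPPED, CS_UNINTERRUPTIBLE
};

/* Hypothetical per-container counters. */
struct cstats_sketch {
    unsigned long long nr_sleeping;
    unsigned long long nr_blocked;
    unsigned long long nr_running;
    unsigned long long nr_stopped;
    unsigned long long nr_uninterruptible;
};

/* Zero all counters before walking the tasks of a container. */
void cstats_reset(struct cstats_sketch *s)
{
    memset(s, 0, sizeof(*s));
}

/* Account one task in the given state. */
void cstats_account(struct cstats_sketch *s, enum cs_state state)
{
    switch (state) {
    case CS_SLEEPING:        s->nr_sleeping++;        break;
    case CS_BLOCKED:         s->nr_blocked++;         break;
    case CS_RUNNING:         s->nr_running++;         break;
    case CS_STOPPED:         s->nr_stopped++;         break;
    case CS_UNINTERRUPTIBLE: s->nr_uninterruptible++; break;
    }
}

/* Format the counters in the same layout as the sample output. */
int cstats_format(const struct cstats_sketch *s, char *buf, size_t len)
{
    return snprintf(buf, len,
                    "sleeping %llu, blocked %llu, running %llu, "
                    "stopped %llu, uninterruptible %llu",
                    s->nr_sleeping, s->nr_blocked, s->nr_running,
                    s->nr_stopped, s->nr_uninterruptible);
}
```

Adding a new statistic in this sketch means adding one member and one case, which is the kind of extension the description has in mind.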

If the approach looks good, I'll enhance and post the user space utility as
well.

Feedback, comments, test results are always welcome!



Signed-off-by: Balbir Singh <[EMAIL PROTECTED]>
---

 Documentation/accounting/containerstats.txt |   27 ++
 include/linux/Kbuild                        |    1 
 include/linux/container.h                   |    8 +++
 include/linux/containerstats.h              |   70 
 include/linux/delayacct.h                   |   11 
 kernel/container.c                          |   63 +
 kernel/sched.c                              |    4 +
 kernel/taskstats.c                          |   66 ++
 8 files changed, 250 insertions(+)

diff -puN /dev/null Documentation/accounting/containerstats.txt
--- /dev/null   2007-06-01 20:42:04.0 +0530
+++ linux-2.6.22-rc2-mm1-balbir/Documentation/accounting/containerstats.txt 2007-06-06 17:16:54.0 +0530
@@ -0,0 +1,27 @@
+Containerstats is inspired by the discussion at
+http://lkml.org/lkml/2007/4/11/187 and implements per container statistics as
+suggested by Andrew Morton in http://lkml.org/lkml/2007/4/11/263.
+
+Per container statistics infrastructure re-uses code from the taskstats
+interface. A new set of container operations is registered with commands
+and attributes specific to containers. It should be very easy to
+extend per container statistics by adding members to the containerstats
+structure.
+
+The current model for containerstats is a pull model. A push model (to post
+statistics on interesting events) should be very easy to add. Currently
+user space requests statistics by passing the container file descriptor.
+Statistics about the state of all the tasks in the container are returned to
+user space.
+
+NOTE: We currently rely on delay accounting for extracting information
+about tasks blocked on I/O. If CONFIG_TASK_DELAY_ACCT is disabled, this
+information will not be available.
+
+To extract container statistics, a utility very similar to getdelays.c
+has been developed. Sample output of the utility is shown below:
+
+~/balbir/containerstats # ./containerstats  -C "/container/a"
+sleeping 1, blocked 0, running 1, stopped 0, uninterruptible 0
+~/balbir/containerstats # ./containerstats  -C "/container"
+sleeping 155, blocked 0, running 1, stopped 0, uninterruptible 2
diff -puN include/linux/container.h~containers-taskstats include/linux/container.h
--- linux-2.6.22-rc2-mm1/include/linux/container.h~containers-taskstats 2007-06-05 17:21:57.0 +0530
+++ linux-2.6.22-rc2-mm1-balbir/include/linux/container.h   2007-06-06 16:59:30.0 +0530
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_CONTAINERS
 
@@ -21,6 +22,8 @@ extern void container_init_smp(void);
 extern void container_fork(struct task_struct *p);
 extern void container_fork_callbacks(struct task_struct *p);
 extern void container_exit(struct task_struct *p, int run_callbacks);
+extern int containerstats_build(struct containerstats *stats,
+   struct dentry *dentry);
 
 extern struct 

Re: [Devel] breakfast at ols?

2007-06-06 Thread Kirill Korotaev
My apologies, but I won't be able to visit OLS this year :/
I think Kirill Kolyshkin and Denis Lunev will be glad to meet you again!
Hope Pavel Emelianov will be able to join as well.

Thanks,
Kirill


Serge E. Hallyn wrote:
> Last year we all met for breakfast at OLS.  Now we've pretty much all
> already met, so maybe it's less exciting, but do people (who will be
> at OLS) care to meet for breakfast on the Thursday or Friday?
> 
> -serge


[Devel] Re: Pid namespaces approaches testing results

2007-06-06 Thread Pavel Emelianov
Cedric Le Goater wrote:
>> The flat model offers many more optimization opportunities than the
>> multilevel one. For example, we can cache the pid value on structs.
>>
>> Moreover, generic level nesting sounds reasonable. Single level nesting
>> does too, as all the namespaces we have are single nested. But 4 level
>> nesting sounds strange... Why 4? Why not 5? What if I don't know exactly how
>> many I will need, but do know that it will definitely be more than 1?
>>
>> Moreover, I have shown that we can have a performance hit of 1% or less with
>> the generic nesting model, so why not keep it?
> 
> did you send that patchset ? is it included in the one you sent ?

The patchset I sent earlier changed slightly. The tests were performed
on the version I sent. Right now I'm waiting for your results to make
a final decision whether or not to develop the flat model together with
the hierarchical one.

So what are we going to do? The options we have:
1. Make two models - hierarchical and flat. Maybe we'll see how to merge
   them later;
2. Optimize the hierarchical model to produce no performance hit on the
   first 2 levels (init and VS). I don't see a way to do this
   gracefully, but maybe this can be solved ... somehow. Anyway, if
   the latest patches from Suka do not produce any noticeable overhead,
   I am OK to go on with them;
3. Make the CONFIG_MAX_NS_DEPTH model. This is likely to be fast in the
   flat case, but I doubt whether Andrew will like it :)
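
For readers who have not seen the patches, option 3 can be pictured roughly as below. This is one possible reading of a compile-time depth limit, not code from any posted patchset: struct pid_sketch, pid_nr_at and the value 4 for MAX_NS_DEPTH are all hypothetical.

```c
#include <assert.h>

/* Stands in for a CONFIG_MAX_NS_DEPTH setting; 4 is only an example. */
#define MAX_NS_DEPTH 4

/*
 * Hypothetical pid with one number per namespace level, stored in a
 * fixed-size array instead of a dynamically sized list.
 */
struct pid_sketch {
    int level;               /* depth of the owning namespace, 0 = init */
    int nrs[MAX_NS_DEPTH];   /* nrs[i] = pid value as seen from level i */
};

/*
 * Return the pid value visible from 'ns_level', or 0 if the pid is not
 * visible there (the level is deeper than the pid's own namespace).
 */
int pid_nr_at(const struct pid_sketch *pid, int ns_level)
{
    assert(pid->level >= 0 && pid->level < MAX_NS_DEPTH);
    if (ns_level < 0 || ns_level > pid->level)
        return 0;
    return pid->nrs[ns_level];
}
```

With MAX_NS_DEPTH set to 2 this degenerates to the flat case (one global value plus one in-namespace value), which is where the expected speed comes from; the cost is that the limit has to be chosen at build time.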

> sorry if i missed something :( 
> 
> C.
> 

Thanks,
Pavel