[Devel] Re: How do containers tie to multiple IP's on a NIC?

2010-07-04 Thread Daniel Lezcano
On 07/04/2010 05:40 AM, Whit Blauvelt wrote:
 Hi,

 In the containerless world, I often have multiple IPs assigned to a NIC. The
 scant documentation I can find on running containers only ever speaks of
 single IP assignment schemes. Can I have for example a box with a single NIC
 with 8 IPs assigned to it, where the host gets one IP, or perhaps
 alternately can see all 8 to run iptables across, but each of the containers
 can see only whichever IP or IPs are assigned to it?


Which container userspace tool are you using? libvirt? liblxc? unshare --net?

Thanks
   -- Daniel


___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: How do containers tie to multiple IP's on a NIC?

2010-07-04 Thread Daniel Lezcano
On 07/04/2010 09:18 PM, Whit Blauvelt wrote:
 On Sun, Jul 04, 2010 at 06:51:34PM +0200, Daniel Lezcano wrote:


 Which container userspace tool are you using? libvirt? liblxc?
 unshare --net?
  
 Which one do you recommend, considering what I'm trying to do with multiple
 IPs on a NIC? I haven't committed to one yet. Which utility do you expect
 future development will favor most? I'll be happy to use any tool which gets
 the job done, preferably one that has a future.


Well ... please don't take what I'm going to suggest as preaching for
my own parish ;) (not sure that's a correct expression - it's a direct
translation from French)

I would recommend using the lxc tools, preferably version 0.7.1.
These tools let you do what you are asking for, that is, assign several
IP addresses to the same virtual NIC.

They are available at:

http://lxc.sourceforge.net/download/lxc/lxc-0.7.1.tar.gz

An older version is probably already available in your distro.

As a quick start:

Write a configuration file (e.g. lxc.conf):

lxc.network.type=macvlan
lxc.network.link=eth0
lxc.network.flags=up
lxc.network.ipv4=1.2.3.4/24
lxc.network.ipv4=192.168.1.123/24
lxc.network.ipv4=10.0.0.23
lxc.network.ipv4=172.2.1.3

And then:

lxc-execute -n foo -f lxc.conf /bin/bash

In that shell you should have a new network namespace with one interface
and several IP addresses.
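
As a quick sanity check (assuming the iproute2 tools are installed
inside the container; the interface name inside the container depends
on your configuration):

ip -4 addr show    # all four addresses should appear on the interface
ip -4 route        # along with the connected routes for the /24s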

You can create much more complex configurations, but I'll let you check
whether these tools fit your needs.

Thanks
   -- Daniel



[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone

2010-07-04 Thread Matt Helsley
On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote:
 On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
 suka...@linux.vnet.ibm.com wrote:
 
  +struct clone_args {
  +   u64 clone_flags_high;
  +   u64 child_stack_base;
  +   u64 child_stack_size;
  +   u64 parent_tid_ptr;
  +   u64 child_tid_ptr;
  +   u32 nr_pids;
  +   u32 reserved0;
  +};
  +
  +
  +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
  +   pid_t * __user pids)
 
 I don't see why cargs_size is needed for expansion if you have flags.

I think it's cleaner this way. The alternative you seem to be hinting
at is using a flag bit to indicate an expansion of the parameters, but
that would only be able to signal one expansion before we'd have to
start using bits in the args structure itself. Using those extra bits is
quite gross -- we'd have to copy the initial portion of the struct, decode
the bit(s) describing the size, and then copy the rest. Also, do we have
any bits left in flags_low? I thought those were all used up...

Or perhaps I wasn't able to anticipate the details of your suggestion
and you had something else in mind?

The way Suka has it we just directly encode the size of struct clone_args
as a parameter and get it over with.
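
For illustration, a kernel-side sketch of that pattern (not the actual
eclone code; assume struct kclone_args is the kernel's current, possibly
larger, version of struct clone_args):

/*
 * Size-as-version sketch (illustrative, not the real eclone code).
 * Old callers pass a smaller cargs_size; newly appended fields are
 * zero-filled, so no flag bits are needed to describe the layout.
 */
static int fetch_clone_args(struct kclone_args *kargs,
                            const void __user *uargs, int cargs_size)
{
        memset(kargs, 0, sizeof(*kargs));
        if (cargs_size < 0 || cargs_size > sizeof(*kargs))
                return -EINVAL;         /* caller newer than this kernel */
        if (copy_from_user(kargs, uargs, cargs_size))
                return -EFAULT;
        return 0;
}

A later kernel can append fields to the struct and distinguish old from
new callers purely by cargs_size.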

 
  +   The order of pids in @pids is oldest in pids[0] to youngest pid
  +   namespace in pids[nr_pids-1]. If the number of pids specified in the
  +   @pids list is fewer than the nesting level of the process, the pids
  +   are applied from youngest namespace. I.e if the process is nested in
  +   a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
  +   are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
  +   have a pid of '0' (the kernel will assign a pid in those 
  namespaces).
 
 That feels backwards. I'd have guessed pids[0] is how the
 process sees itself. You'd truncate the array to reduce nesting
 level rather than pointing into it.

The only difference between what you ask for and what eclone does is the
order of the pids. The array is truncatable as you requested. Knowing nr_pids
is sufficient -- there's no need for something pointing into it.

ISTR it was ordered this way to avoid odd-looking loops or extra copying in
the kernel pid code. Suka may recall the reasoning better than I do -- I'd
have to dig through list archives to be certain.

 
  +   On failure, eclone() returns -1 and sets 'errno' to one of following
  +   values (the child process is not created).
 
 Careful here: do you intend to document the system call itself,
 or an expected glibc wrapper that doesn't exist yet?
 
  +   EPERM   Caller does not have the CAP_SYS_ADMIN privilege needed to
  +   specify the pids in this call (if pids are not specifed
  +   CAP_SYS_ADMIN is not required).
 
 It seems appropriate to let PID 1 in any PID namespace be
 able to assign PIDs in its own namespace and in any
 child namespaces.

I disagree. The way you describe it, more than one pid 1 could be
involved thus the pid assignment could conflict. Especially in the case
of container checkpoint/restart where one or more of those pid 1 tasks
is not aware that it's being restarted. It gets even worse if you don't
assume that the same container software is being used in nested
containers.

 
  +   EINVAL  The child_stack_size field is not 0 (on architectures that
 +   pass in a stack pointer in ->child_stack field).
 
 need to change this
 
  +int $0x80\n\t/* Linux/i386 system call */
  +testl %0,%0\n\t  /* check return value */
  +jne 1f\n\t   /* jump if parent */
  +
  +popl %%esi\n\t   /* get subthread function */
  +call *%%esi\n\t  /* start subthread function */
  +movl %2,%0\n\t
  +int $0x80\n  /* exit system call: exit subthread 
  */
 ...
  +/*
  + * Allocate a stack for the clone-child and arrange to have the child
  + * execute @child_fn with @child_arg as the argument.
  + */
 ...
  +   *--stack = child_arg;
  +   *--stack = child_fn;
 ...
  +static int do_clone(int (*child_fn)(void *), void *child_arg,
  +   unsigned int flags_low, int nr_pids, pid_t *pids_list)
 
 There needs to be a way to pass child_fn and child_arg
 via the kernel. Besides being required for kernel-managed
 stacks, it's normally a saner interface. Stack setup would
 be much like the stack setup for signal handlers. Imagine

I'm inclined to say this is a bad idea.

I didn't think we had kernel-managed stacks in mainline. The most we
have, to my knowledge, is the sigaltstack support and kernel threads.

I don't see how being able to pass in child_fn and child_arg to the
kernel improves the sanity of the interface. If anything it will make
eclone even more exotic -- now at the end of the syscall we'll need to
mess with the registers/stack of the child much like when we're invoking
a signal handler. That just adds more arch-specific code than is
necessary.

Userspace wrappers are perfectly capable of invoking the child function
and passing the arguments. Furthermore, passing those arguments requires
expanding the argument structure or putting even greater pressure on
registers (which, as you pointed out below, is an issue for vfork).
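
To make that concrete, a rough sketch of such a userspace wrapper
(illustrative only: eclone_raw and STACK_SIZE are invented for the
sketch; the stack pushes mirror the test code quoted above, and
child_stack_size is left 0 per the i386 rule):

#include <stdlib.h>
#include <sys/types.h>

typedef unsigned long long u64;   /* stand-ins for the kernel types */
typedef unsigned int u32;

/* struct clone_args as defined in the patch (see above) */
struct clone_args {
        u64 clone_flags_high;
        u64 child_stack_base;
        u64 child_stack_size;
        u64 parent_tid_ptr;
        u64 child_tid_ptr;
        u32 nr_pids;
        u32 reserved0;
};

#define STACK_SIZE (64 * 1024)    /* made-up size for the sketch */

/* Hypothetical raw stub, implemented in per-arch asm like the i386
 * snippet quoted above; it pops child_fn/child_arg in the child. */
extern int eclone_raw(unsigned int flags_low, struct clone_args *cargs,
                      int cargs_size, pid_t *pids);

static int do_clone(int (*child_fn)(void *), void *child_arg,
                    unsigned int flags_low, int nr_pids, pid_t *pids)
{
        struct clone_args cargs = { 0 };
        unsigned long *sp;
        char *stack = malloc(STACK_SIZE);

        if (!stack)
                return -1;

        /* Stack grows down on i386: push the argument, then the
         * function, exactly as the quoted test code does. */
        sp = (unsigned long *)(stack + STACK_SIZE);
        *--sp = (unsigned long)child_arg;
        *--sp = (unsigned long)child_fn;

        cargs.child_stack_base = (u64)(unsigned long)sp;
        cargs.child_stack_size = 0;   /* 0 on i386, per the EINVAL rule */
        cargs.nr_pids = nr_pids;

        return eclone_raw(flags_low, &cargs, sizeof(cargs), pids);
}

The kernel only ever sees a stack pointer; everything about invoking
child_fn stays in the wrapper's per-arch stub.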

[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone

2010-07-04 Thread H. Peter Anvin
On 07/04/2010 04:39 PM, Matt Helsley wrote:

 1. can you implement it for i386 (register starved) using eclone?
 
 That's a very good question. I'm going to punt on a direct answer for
 now. Instead,  I wonder if it's even worth enabling vfork through eclone.
 vfork is rarely used, is supported by the old clone syscall, and any
 old code adapted to use eclone for vfork would need significant
 changes because of vfork's specialness. (A consequence of the way vfork
 borrows page tables and must avoid clobbering parent's registers..)
 

vfork is its own system call for a reason.  We used to do it with
sys_clone, and it turned out to be a mess.  Doing it in a separate
system call -- even though the internals are largely the same -- is cleaner.

-hpa
-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.



[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone

2010-07-04 Thread Oren Laadan


Matt Helsley wrote:
 On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote:
 On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
 suka...@linux.vnet.ibm.com wrote:


[...]

 +   The order of pids in @pids is oldest in pids[0] to youngest pid
 +   namespace in pids[nr_pids-1]. If the number of pids specified in the
 +   @pids list is fewer than the nesting level of the process, the pids
 +   are applied from youngest namespace. I.e if the process is nested in
 +   a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
 +   are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
 +   have a pid of '0' (the kernel will assign a pid in those 
 namespaces).
 That feels backwards. I'd have guessed pids[0] is how the
 process sees itself. You'd truncate the array to reduce nesting
 level rather than pointing into it.
 
 The only difference between what you ask for and what eclone does is the
 order of the pids. The array is truncatable as you requested. Knowing nr_pids
 is sufficient -- there's no need for something pointing into it.
 
 ISTR it was ordered this way to avoid odd-looking loops or extra copying in
 the kernel pid code. Suka may recall the reasoning better than I do -- I'd
 have to dig through list archives to be certain.

This is what I remember, too: the idea was to order the pids the way
the kernel - and (many of?) us humans - view them: top-down. IOW, this
corresponds to the hierarchical nature of the pid-namespace hierarchy:
ancestors come first, followed by descendants.
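
To make the ordering concrete, a small sketch using the level-6 example
from the patch documentation (the pid values are invented):

/* A process nested in a level-6 pid namespace, pinning pids in the
 * 3 youngest levels and letting the kernel assign levels 0-3. */
pid_t pids[] = {
        1200,   /* level 4 */
        77,     /* level 5 */
        2,      /* level 6 -- youngest; how the child sees itself */
};
int nr_pids = 3;        /* truncate the array to pin fewer levels */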

 
 +   On failure, eclone() returns -1 and sets 'errno' to one of following
 +   values (the child process is not created).
 Careful here: do you intend to document the system call itself,
 or an expected glibc wrapper that doesn't exist yet?

 +   EPERM   Caller does not have the CAP_SYS_ADMIN privilege needed to
 +   specify the pids in this call (if pids are not specifed
 +   CAP_SYS_ADMIN is not required).
 It seems appropriate to let PID 1 in any PID namespace be
 able to assign PIDs in its own namespace and in any
 child namespaces.
 
 I disagree. The way you describe it, more than one pid 1 could be
 involved thus the pid assignment could conflict. Especially in the case
 of container checkpoint/restart where one or more of those pid 1 tasks
 is not aware that it's being restarted. It gets even worse if you don't
 assume that the same container software is being used in nested
 containers.

I assume that by "any child namespace" you mean any descendant
namespace?

I'm with Matt here. In particular, I make the distinction between
where a process lives and where it is visible. A process is
visible in all the pid-namespaces up its ancestry. However, it
lives only in the bottom-most pid-namespace. For example, if it
calls kill(2), it will affect another process in _that_ namespace.

It follows that trying to set pids in pid-namespaces _below_ you
simply doesn't make sense (beyond the CLONE_NEWPID case).

Finally, there have been objections before to allowing pid-selection
by a non-privileged process. If such functionality is deemed desirable
later on, it can easily be added. However, let's keep that discussion
separate from the current one.

[...]

 +static int do_clone(int (*child_fn)(void *), void *child_arg,
 +   unsigned int flags_low, int nr_pids, pid_t *pids_list)
 There needs to be a way to pass child_fn and child_arg
 via the kernel. Besides being required for kernel-managed
 stacks, it's normally a saner interface. Stack setup would
 be much like the stack setup for signal handlers. Imagine
 
 I'm inclined to say this is a bad idea.
 
 I didn't think we had kernel-managed stacks in mainline. The most we
 have, to my knowledge, is the sigaltstack support and kernel threads.
 
 I don't see how being able to pass in child_fn and child_arg to the
 kernel improves the sanity of the interface. If anything it will make
 eclone even more exotic -- now at the end of the syscall we'll
 need to mess with the registers/stack of the child much like when we're
 invoking a signal handler. That just adds more arch-specific code than is
 necessary.
 
 Userspace wrappers are perfectly capable of invoking the child function
 and passing the arguments. Furthermore, passing those arguments requires
 expanding the argument structure or putting even greater pressure on
 registers (which, as you pointed out below, is an issue for vfork).
 
 using this for a vfork-like interface that didn't have painful
 interactions with the compiler.

Pardon my ignorance - what sort of painful interactions?


 Speaking of vfork

 1. can you implement it for i386 (register starved) using eclone?
 
 That's a very good question. I'm going to punt on a direct answer for
 now. Instead,  I wonder if it's even worth enabling vfork through eclone.
 vfork is rarely used, is supported by the old clone syscall, and any
 old code adapted to use eclone for vfork would need significant
 changes because of vfork's specialness. (A consequence of the way vfork
 borrows page tables and must avoid clobbering parent's registers..)