[Devel] Re: How do containers tie to multiple IP's on a NIC?
On 07/04/2010 05:40 AM, Whit Blauvelt wrote:

Hi,

In the containerless world, I often have multiple IPs assigned to a NIC. The scant documentation I can find on running containers only ever speaks of single IP assignment schemes. Can I have for example a box with a single NIC with 8 IPs assigned to it, where the host gets one IP, or perhaps alternately can see all 8 to run iptables across, but each of the containers can see only whichever IP or IPs are assigned to it?

What container userspace command are you using? libvirt? liblxc? unshare --net?

Thanks
-- Daniel

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: How do containers tie to multiple IP's on a NIC?
On 07/04/2010 09:18 PM, Whit Blauvelt wrote:

On Sun, Jul 04, 2010 at 06:51:34PM +0200, Daniel Lezcano wrote:

What container userspace command are you using? libvirt? liblxc? unshare --net?

Which one do you recommend, considering what I'm trying to do with multiple IPs on a NIC? I haven't committed to one yet. Which utility do you expect future development will favor most? I'll be happy to use any tool which gets the job done, preferably one that has a future.

Well ... please don't consider what I will suggest as preaching for its parish ;) (not sure it is a correct expression. It is a direct translation from French)

I would recommend using the lxc tools, preferably the 0.7.1 version. These tools let you do what you want, that is, assign several IP addresses to the same virtual NIC. They are available at:

http://lxc.sourceforge.net/download/lxc/lxc-0.7.1.tar.gz

An older version is certainly available in your distro.

As a quick start, write a configuration file (e.g. lxc.conf):

    lxc.network.type=macvlan
    lxc.network.link=eth0
    lxc.network.flags=up
    lxc.network.ipv4=1.2.3.4/24
    lxc.network.ipv4=192.168.1.123/24
    lxc.network.ipv4=10.0.0.23
    lxc.network.ipv4=172.2.1.3

And then:

    lxc-execute -n foo -f lxc.conf /bin/bash

In your shell you should have a new network with one interface and several IP addresses.

You can create much more complex configurations, but I'll let you check whether these tools fit your needs.

Thanks
-- Daniel
[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone
On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote:

On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:

+struct clone_args {
+    u64 clone_flags_high;
+    u64 child_stack_base;
+    u64 child_stack_size;
+    u64 parent_tid_ptr;
+    u64 child_tid_ptr;
+    u32 nr_pids;
+    u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+    pid_t * __user pids)

I don't see why cargs_size is needed for expansion if you have flags.

I think it's cleaner this way. The alternative you seem to be hinting at is:

If we used a flag bit to indicate an expansion of the parameters then it would only be able to specify one expansion before we'd have to start using bits in the args structure itself. Using those extra bits is quite gross -- we'd have to copy the initial portion of the struct, decode the bit(s) describing the size, and then copy the rest. Also, do we have any bits left in flags_low? I thought those were all used up...

Or perhaps I wasn't able to anticipate the details of your suggestion and you had something else in mind?

The way Suka has it we just directly encode the size of struct clone_args as a parameter and get it over with.

+ The order of pids in @pids is oldest in pids[0] to youngest pid
+ namespace in pids[nr_pids-1]. If the number of pids specified in the
+ @pids list is fewer than the nesting level of the process, the pids
+ are applied from youngest namespace. I.e if the process is nested in
+ a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+ are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+ have a pid of '0' (the kernel will assign a pid in those namespaces).

That feels backwards. I'd have guessed pids[0] is how the process sees itself. You'd truncate the array to reduce nesting level rather than pointing into it.

The only difference between what you ask for and what eclone does is the order of the pids. The array is truncatable as you requested.
Knowing nr_pids is sufficient -- there's no need for something pointing into it. ISTR it was ordered this way to avoid odd-looking loops or extra copying in the kernel pid code. Suka may recall the reasoning better than I do -- I'd have to dig through list archives to be certain.

+ On failure, eclone() returns -1 and sets 'errno' to one of following
+ values (the child process is not created).

Careful here: do you intend to document the system call itself, or an expected glibc wrapper that doesn't exist yet?

+ EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
+ specify the pids in this call (if pids are not specified
+ CAP_SYS_ADMIN is not required).

It seems appropriate to let PID 1 in any PID namespace be able to assign PIDs in its own namespace and in any child namespaces.

I disagree. The way you describe it, more than one pid 1 could be involved thus the pid assignment could conflict. Especially in the case of container checkpoint/restart where one or more of those pid 1 tasks is not aware that it's being restarted. It gets even worse if you don't assume that the same container software is being used in nested containers.

+ EINVAL The child_stack_size field is not 0 (on architectures that
+ pass in a stack pointer in -child_stack field).

need to change this

+int $0x80\n\t/* Linux/i386 system call */
+testl %0,%0\n\t /* check return value */
+jne 1f\n\t /* jump if parent */
+
+popl %%esi\n\t /* get subthread function */
+call *%%esi\n\t /* start subthread function */
+movl %2,%0\n\t
+int $0x80\n /* exit system call: exit subthread */
...
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
...
+ *--stack = child_arg;
+ *--stack = child_fn;
...
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+    unsigned int flags_low, int nr_pids, pid_t *pids_list)

There needs to be a way to pass child_fn and child_arg via the kernel.
Besides being required for kernel-managed stacks, it's normally a saner interface. Stack setup would be much like the stack setup for signal handlers. Imagine

I'm inclined to say this is a bad idea. I didn't think we had kernel-managed stacks in mainline. The most we have, to my knowledge, is the sigaltstack support and kernel threads.

I don't see how being able to pass in child_fn and child_arg to the kernel improves the sanity of the interface. If anything it will make eclone even more exotic -- now at the end of the syscall we'll need
[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone
On 07/04/2010 04:39 PM, Matt Helsley wrote:

1. can you implement it for i386 (register starved) using eclone?

That's a very good question. I'm going to punt on a direct answer for now. Instead, I wonder if it's even worth enabling vfork through eclone. vfork is rarely used, is supported by the old clone syscall, and any old code adapted to use eclone for vfork would need significant changes because of vfork's specialness. (A consequence of the way vfork borrows page tables and must avoid clobbering parent's registers..)

vfork is its own system call for a reason. We used to do it with sys_clone, and it turned out to be a mess. Doing it in a separate system call -- even though the internals are largely the same -- is cleaner.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
[Devel] Re: [PATCH 11/11][v15]: Document sys_eclone
Matt Helsley wrote:

On Sat, Jul 03, 2010 at 07:41:30PM -0400, Albert Cahalan wrote:

On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu suka...@linux.vnet.ibm.com wrote:

[...]

+ The order of pids in @pids is oldest in pids[0] to youngest pid
+ namespace in pids[nr_pids-1]. If the number of pids specified in the
+ @pids list is fewer than the nesting level of the process, the pids
+ are applied from youngest namespace. I.e if the process is nested in
+ a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
+ are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
+ have a pid of '0' (the kernel will assign a pid in those namespaces).

That feels backwards. I'd have guessed pids[0] is how the process sees itself. You'd truncate the array to reduce nesting level rather than pointing into it.

The only difference between what you ask for and what eclone does is the order of the pids. The array is truncatable as you requested. Knowing nr_pids is sufficient -- there's no need for something pointing into it. ISTR it was ordered this way to avoid odd-looking loops or extra copying in the kernel pid code. Suka may recall the reasoning better than I do -- I'd have to dig through list archives to be certain.

This is what I remember, too: the idea was to order the pids in the order that the kernel - and (many of?) us humans - view them: top-down. IOW, this corresponds to the hierarchical nature of the pid-namespace hierarchy: ancestors come first, followed by descendants.

+ On failure, eclone() returns -1 and sets 'errno' to one of following
+ values (the child process is not created).

Careful here: do you intend to document the system call itself, or an expected glibc wrapper that doesn't exist yet?

+ EPERM Caller does not have the CAP_SYS_ADMIN privilege needed to
+ specify the pids in this call (if pids are not specified
+ CAP_SYS_ADMIN is not required).
It seems appropriate to let PID 1 in any PID namespace be able to assign PIDs in its own namespace and in any child namespaces.

I disagree. The way you describe it, more than one pid 1 could be involved thus the pid assignment could conflict. Especially in the case of container checkpoint/restart where one or more of those pid 1 tasks is not aware that it's being restarted. It gets even worse if you don't assume that the same container software is being used in nested containers.

I assume that by "any child namespace" you mean "any descendant namespace"?

I'm with Matt here. In particular, I make the distinction between where a process lives and where it is visible. A process is visible in all the pid-namespaces up its ancestry. However, it lives only in the bottom-most pid-namespace. For example, if it calls kill(2), it will affect another process in _that_ namespace.

It follows that trying to set pids in pid-namespaces _below_ you simply doesn't make sense (beyond the CLONE_NEWPID case).

Finally, there have been objections before to allowing pid selection by non-privileged processes. If such functionality is deemed desirable later on, it can easily be added. However, let's separate that discussion from the current discussion.

[...]

+static int do_clone(int (*child_fn)(void *), void *child_arg,
+    unsigned int flags_low, int nr_pids, pid_t *pids_list)

There needs to be a way to pass child_fn and child_arg via the kernel. Besides being required for kernel-managed stacks, it's normally a saner interface. Stack setup would be much like the stack setup for signal handlers. Imagine

I'm inclined to say this is a bad idea. I didn't think we had kernel-managed stacks in mainline. The most we have, to my knowledge, is the sigaltstack support and kernel threads.

I don't see how being able to pass in child_fn and child_arg to the kernel improves the sanity of the interface.
If anything it will make eclone even more exotic -- now at the end of the syscall we'll need to mess with the registers/stack of the child much like when we're invoking a signal handler. That just adds more arch-specific code than is necessary. Userspace wrappers are perfectly capable of invoking the child function and passing the arguments. Furthermore, passing those arguments requires expanding the argument structure or putting even greater pressure on registers (which, as you pointed out below, is an issue for vfork).

using this for a vfork-like interface that didn't have painful interactions with the compiler.

Pardon my ignorance - what sort of painful interactions?

Speaking of vfork

1. can you implement it for i386 (register starved) using eclone?

That's a very good question. I'm going to punt on a direct answer for now. Instead, I wonder if it's even worth enabling vfork through eclone. vfork is rarely used, is supported by the old clone syscall, and any old code adapted to