[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd
On Thu, 2010-09-23 at 16:58 +0200, David Lamparter wrote:
> migrating route table entries makes no sense because
> a) they refer to devices and configuration that does not exist in the
>    target namespace; they only make sense within their netns context
> b) they are purely virtual and you get the same result from deleting
>    and recreating them.
>
> Network devices are special because they may have something attached
> to them, be it hardware or some daemon.

Routes functionally reside on top of netdevices, point to nexthop
neighbors across these netdevices, etc. The underlying assumption is that
you take care of that dependency when migrating.
We are talking about FIB entries here, not the route cache; moving a few
pointers within the kernel is a hell of a lot faster than recreating a
subset of BGP entries from user space.

Eric, I didn't follow the exposed-races argument: why would it involve
more than just some basic locking, held only while you change the
struct net pointer to the new namespace for these sub-subsystems?

cheers,
jamal
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd
On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:
> No. While you sure could associate routes with devices, they don't
> *functionally* reside on top of network devices. They reside on top of
> the entire IP configuration,

I think I am not clearly making my point. There are data dependencies;
if you were to move routes, you'd need everything that routes depend on.
IOW, if I was to draw a functional graph, routes would appear on top of
netdevs (I don't care what other functional blocks you put in between or
sideways to them).

> and in case of BGP they even reside on top of your set of peerings and
> their data. Even if you could move routes together with a network
> device, the result would be utter nonsense.

You could argue that moving a netdevice where some of its fundamental
properties, such as an ifindex, change is utter nonsense. But you can work
around it.

> The routes depend on your BGP view, and if your set of interfaces (and
> peers) changes, your routes will change. Your bgpd will, either way,
> need to set up new peerings and redo best path evaluations.

Worst case scenario, yes.
I am beginning to get a feeling we are trying to achieve different
goals, maybe? Why are you even migrating netdevs?

> (On an unrelated note, how often are you planning to move stuff between
> namespaces? I don't expect to be moving stuff except on configuration
> events...)

Triggering on config events is useful, and it is likely the only
possibility if you assume the other namespace is remote. But if I could
send a single command to migrate several things in the kernel (in my case
to recover state to a different ns), then that is much simpler and uses
the least resources (memory, cpu, bandwidth). I admit it is very hard to
do in most cases where the underlying dependencies are evolving, and there
synchronizing via user space is the best approach. The route table example
I pointed to is simple.
Besides that: dynamic state created in the kernel that doesn't have to be
recreated by the next arriving 100K packets helps to improve recovery.

cheers,
jamal
[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd
On 09/23/2010 10:51 AM, Eric W. Biederman wrote:
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
>
> Signed-off-by: Eric W. Biederman <ebied...@xmission.com>
> ---

[ ... ]

> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	file = NULL;
> +	net = ERR_PTR(-EINVAL);
> +	file = proc_ns_fget(fd);
> +	if (!fd)
> +		goto out;
> +		return ERR_PTR(-EINVAL);
> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops != &netns_operations)
> +		goto out;

Is this check necessary here? proc_ns_fget checks
"file->f_op != &ns_file_operations", no?
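The question above turns on whether the f_op test inside proc_ns_fget already implies the ns_ops test. The following is only a toy userspace model, not kernel code. It assumes that ns_file_operations is shared by every namespace type while each type carries its own ns_ops table; the quoted patch does not show that layout, so treat it as an assumption. The struct definitions, other_file_operations, utsns_operations, and the helper get_net_ns_by_file are hypothetical stand-ins. Under that assumption, the first check only establishes "this is some namespace file", and the second is still needed to reject a descriptor for a different namespace type.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for kernel types. Names mirror the thread; bodies are invented. */
struct file_operations { int unused; };
struct proc_ns_operations { const char *name; };

struct file {
    const struct file_operations *f_op;
    const struct proc_ns_operations *ns_ops; /* in the patch this hangs off proc_inode */
    void *ns;                                /* the namespace object itself */
};

/* Assumed: one file_operations table for all namespace files ... */
const struct file_operations ns_file_operations = { 0 };
const struct file_operations other_file_operations = { 0 };
/* ... and one ns_ops table per namespace type. */
const struct proc_ns_operations netns_operations = { "net" };
const struct proc_ns_operations utsns_operations = { "uts" };

/* First check: is this an open namespace file of any type? */
const struct file *proc_ns_fget(const struct file *file)
{
    if (file == NULL || file->f_op != &ns_file_operations)
        return NULL;
    return file;
}

/* Second check: is it specifically a network namespace? */
void *get_net_ns_by_file(const struct file *file)
{
    const struct file *f = proc_ns_fget(file);

    if (f == NULL)
        return NULL;
    if (f->ns_ops != &netns_operations)
        return NULL; /* a namespace file, but of the wrong type */
    return f->ns;
}
```

In this model a UTS namespace file passes proc_ns_fget but is rejected by get_net_ns_by_file, which is the case the second check exists for.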
[Devel] Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
> Eric W. Biederman wrote:
>> Introduce file for manipulating namespaces and related syscalls.
>> files:
>> /proc/self/ns/<nstype>
>>
>> syscalls:
>> int setns(unsigned long nstype, int fd);
>> socketat(int nsfd, int family, int type, int protocol);
>
> How does security work? Are there different kinds of fd that give (say)
> pin-the-namespace permission, socketat permission, and setns permission?

AFAICS, socketat, setns and "set netns by fd" only accept an fd from
/proc/<pid>/ns/<ns>.

setns does:

    file = proc_ns_fget(fd);
    if (IS_ERR(file))
        return PTR_ERR(file);

proc_ns_fget checks "if (file->f_op != &ns_file_operations)"

socketat and get_net_ns_by_fd:

    net = get_net_ns_by_fd(fd);

this one calls proc_ns_fget.

We have the guarantee here: the fd results from an open of the file
with the right permissions.

Another way to pin the namespace would be to mount --bind
/proc/<pid>/ns/<ns>, but we have to be root to do that ...
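For readers who want to try the flow described above from user space, here is a minimal sketch. Note that the interface eventually merged into Linux takes the arguments in the opposite order from the proposal quoted above, setns(int fd, int nstype), and that is the glibc call used here. The helper names ns_path, ns_open and join_netns are hypothetical and exist only for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/<nstype>" in buf. Returns 0, or -1 if buf is too small. */
int ns_path(char *buf, size_t len, pid_t pid, const char *nstype)
{
    int n;

    assert(buf != NULL && nstype != NULL);
    n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, nstype);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open the namespace file of a process. Ordinary /proc permission checks
 * apply to the open. Returns the fd, or -1 on error. */
int ns_open(pid_t pid, const char *nstype)
{
    char path[64];

    if (ns_path(path, sizeof(path), pid, nstype) < 0)
        return -1;
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Move the calling thread into the network namespace of pid.
 * Requires CAP_SYS_ADMIN, so this fails with EPERM for ordinary users. */
int join_netns(pid_t pid)
{
    int fd = ns_open(pid, "net");
    int rc;

    if (fd < 0)
        return -1;
    rc = setns(fd, CLONE_NEWNET); /* merged argument order: fd first */
    close(fd);
    return rc;
}
```

A caller that only wants to hold a reference to a namespace can stop after ns_open and keep the returned descriptor open; join_netns is the privileged step.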
[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd
On Fri, 2010-09-24 at 16:09 +0200, David Lamparter wrote:
> I understood your point. What I'm saying is that that functional graph
> you're describing is too simplistic to be a workable model. Your graph
> allows for what you're trying to do, yes. But your graph is not
> modeling the reality.

How about we put this specific point to rest by agreeing to
disagree? ;->

> Err... I'm migrating netdevs to assign them to namespaces to allow them
> to use them? Setup, basically. Either way a device move only happens as
> result of some administrative action; be it creating a new namespace or
> changing the physical/logical network setup.

Ok, different need. You have a much more basic requirement than I do.

> wtf is a remote namespace?

A namespace that is remotely located on another machine/hardware ;->

> Can you please describe your application that requires moving possibly
> several network devices together with their routes to a different
> namespace?

Scaling and availability are the driving requirements.

cheers,
jamal
[Devel] Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors
Daniel Lezcano <daniel.lezc...@free.fr> writes:

> On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
>> Eric W. Biederman wrote:
>>> Introduce file for manipulating namespaces and related syscalls.
>>> files:
>>> /proc/self/ns/<nstype>
>>>
>>> syscalls:
>>> int setns(unsigned long nstype, int fd);
>>> socketat(int nsfd, int family, int type, int protocol);
>>
>> How does security work? Are there different kinds of fd that give (say)
>> pin-the-namespace permission, socketat permission, and setns permission?
>
> AFAICS, socketat, setns and "set netns by fd" only accept an fd from
> /proc/<pid>/ns/<ns>.
>
> setns does:
>
>     file = proc_ns_fget(fd);
>     if (IS_ERR(file))
>         return PTR_ERR(file);
>
> proc_ns_fget checks "if (file->f_op != &ns_file_operations)"
>
> socketat and get_net_ns_by_fd:
>
>     net = get_net_ns_by_fd(fd);
>
> this one calls proc_ns_fget.
>
> We have the guarantee here: the fd results from an open of the file
> with the right permissions.

In particular the default /proc permissions say you have to be the owner
of the process (or root) to access the file. If you are the owner of
the process with a namespace (or root) you already have permission to
access and manipulate the namespace.

Additionally setns, like unshare, requires CAP_SYS_ADMIN (aka root magic).

> Another way to pin the namespace would be to mount --bind
> /proc/<pid>/ns/<ns>, but we have to be root to do that ...

Simply keeping the process running pins the namespace. That requires
no new permissions.

Similarly socketat. It is possible to use unix domain sockets to
implement it today without any kernel changes. It is just an
unnecessary pain to run a server process to pin a namespace or to serve
up file descriptors in other network namespaces.

The primary change of this patchset is the ability to do everything with
file descriptors, and with the mount namespace. That moves everything
from a bizarre, hard to understand and manipulate interface to one where
things can be done much more easily and cheaply, resulting in a much
more powerful and usable interface.

Eric
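The unix domain socket approach mentioned above relies on passing an open file descriptor between processes with an SCM_RIGHTS control message. A rough sketch of that mechanism follows, as one plausible reading of what a server handing out sockets from another namespace would do. The helper names send_fd and recv_fd are hypothetical; the socket calls are the standard sendmsg/recvmsg interface.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send fd over the unix domain socket sock. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(sock >= 0 && fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from sock. Returns the new fd, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    assert(sock >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

A helper process living in the target namespace could create a socket there and hand it to a client with send_fd. The cost Eric points out is that such a process has to keep running for as long as anyone needs it.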
[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting
On Fri, 24 Sep 2010 11:10:15 +0200 Michael Holzheu <holz...@linux.vnet.ibm.com> wrote:

> Hello Andrew,
>
> On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > > GOALS OF THIS PATCH SET
> > > ---
> > > The intention of this patch set is to provide better support for tools like
> > > top. The goal is to:
> > >
> > > * provide a task snapshot mechanism where we can get a consistent view of
> > >   all running tasks.
> > > * provide a transport mechanism that does not require a lot of system calls
> > >   and that allows implementing low CPU overhead task monitoring.
> > > * provide microsecond CPU time granularity.
> >
> > This is a big change! If this is done right then we're heading in the
> > direction of deprecating the longstanding way in which userspace
> > observes the state of Linux processes and we're recommending that the
> > whole world migrate to taskstats. I think?
>
> Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> performance) an alternative could be helpful.

And it can be harmful. More kernel code to maintain and test, more
userspace code to develop, maintain, etc. Less user testing than if
there was a single interface.

> > I worry that there's a dependency on CONFIG_NET? If so then that's a
> > big problem because in N years time, 99% of the world will be using
> > taskstats, but a few embedded losers will be stuck using (and having to
> > support) the old tools.
>
> Sure, but if we could add the /proc/taskstats approach, this dependency
> would not be there.

So why do we need to present the same info over netlink?

If the info is available via procfs then userspace code should use that
and not netlink, because that userspace code would also be applicable
to CONFIG_NET=n systems.

> > Does this have the potential to save us from the CONFIG_NET=n problem?
>
> Yes

Let's say that when it's all tested ;)

> Are PIDs over all namespaces unique?

Nope. The same pid can be present in different namespaces at the same
time.
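As background for the granularity goal discussed above: the long-standing procfs interface reports per-task CPU time in /proc/<pid>/stat as utime and stime, counted in clock ticks rather than microseconds. The sketch below shows how a top-like tool might read those two fields. It illustrates the existing procfs route only, not the proposed taskstats interface, and the helper names parse_stat_cpu_ticks and ticks_to_usec are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract utime and stime (fields 14 and 15, in clock ticks) from one
 * /proc/<pid>/stat line. Returns 0 on success, -1 on a malformed line. */
int parse_stat_cpu_ticks(const char *line, unsigned long *utime,
                         unsigned long *stime)
{
    const char *p;

    assert(line != NULL && utime != NULL && stime != NULL);
    /* comm (field 2) may contain spaces and parentheses,
     * so resume parsing after the last ')'. */
    p = strrchr(line, ')');
    if (p == NULL)
        return -1;
    /* Skip state (field 3) and fields 4 to 13, then read fields 14 and 15. */
    if (sscanf(p + 1,
               " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               utime, stime) != 2)
        return -1;
    return 0;
}

/* Convert clock ticks to microseconds given the tick rate in Hz. */
unsigned long long ticks_to_usec(unsigned long ticks, long hz)
{
    assert(hz > 0);
    return (unsigned long long)ticks * 1000000ULL / (unsigned long long)hz;
}
```

The tick rate would normally come from sysconf(_SC_CLK_TCK). At 100 ticks per second each tick represents 10 ms, so values read this way cannot resolve anything finer than that.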