[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread jamal
On Thu, 2010-09-23 at 16:58 +0200, David Lamparter wrote:

> migrating route table entries makes no sense because
> a) they refer to devices and configuration that does not exist in the
>    target namespace; they only make sense within their netns context
> b) they are purely virtual and you get the same result from deleting and
>    recreating them.
>
> Network devices are special because they may have something attached to
> them, be it hardware or some daemon.

Routes functionally reside on top of netdevices, point to nexthop
neighbors across these netdevices, etc. The underlying assumption is that
you take care of that dependency when migrating.
We are talking about FIB entries here, not the route cache; moving a few
pointers within the kernel is a hell of a lot faster than recreating a
subset of BGP entries from user space.

Eric, I didn't follow the exposed-races argument: why would it involve
more than just some basic locking only while you change the struct net
pointer to the new namespace for these sub-subsystems?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread jamal
On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:

> No. While you sure could associate routes with devices, they don't
> *functionally* reside on top of network devices. They reside on top of
> the entire IP configuration,

I think I am not making my point clearly. There are data dependencies:
if you were to move routes, you'd need everything that routes depend on.
IOW, if I were to draw a functional graph, routes would appear on top
of netdevs (I don't care what other functional blocks you put in between
or sideways to them).

> and in case of BGP they even reside on top
> of your set of peerings and their data.
> Even if you could move routes together with a network device, the
> result would be utter nonsense.

You could argue that moving a netdevice where some of its fundamental
properties such as an ifindex change is utter nonsense. But you can
work around it.

> The routes depend on your BGP view, and
> if your set of interfaces (and peers) changes, your routes will change.
> Your bgpd will, either way, need to set up new peerings and redo best
> path evaluations.

Worst case scenario, yes. I am beginning to get a feeling we are trying 
to achieve different goals maybe? Why are you even migrating netdevs?

> (On an unrelated note, how often are you planning to move stuff between
> namespaces? I don't expect to be moving stuff except on configuration
> events...)

Triggering on config events is useful, and it is likely the only
possibility if you assume the other namespace is remote. But if I could
send a single command to migrate several things in the kernel (in my
case to recover state to a different ns), then that is much simpler and
uses the least resources (memory, cpu, bandwidth). I admit it is very
hard to do in most cases where the underlying dependencies are evolving,
and there synchronizing via user space is the best approach. The route
table example I pointed to is simple.
Besides that: dynamic state created in the kernel that doesn't have to be
recreated by the next arriving 100K packets helps to improve recovery.

cheers,
jamal



[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread Daniel Lezcano
On 09/23/2010 10:51 AM, Eric W. Biederman wrote:

> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
>
> Signed-off-by: Eric W. Biederman <ebied...@xmission.com>
> ---

[ ... ]

> +struct net *get_net_ns_by_fd(int fd)
> +{
> +	struct proc_inode *ei;
> +	struct file *file;
> +	struct net *net;
> +
> +	file = NULL;
> +	net = ERR_PTR(-EINVAL);
> +	file = proc_ns_fget(fd);
> +	if (!fd)
> +		goto out;
> +		return ERR_PTR(-EINVAL);
> +
> +	ei = PROC_I(file->f_dentry->d_inode);
> +	if (ei->ns_ops != &netns_operations)
> +		goto out;

Is this check necessary here? proc_ns_fget checks
file->f_op != &ns_file_operations, no?
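
For readers following the question above, here is a minimal userspace model of the two comparisons. It assumes (the thread does not confirm this) that proc_ns_fget only establishes that the fd is *some* namespace file, in which case the ns_ops comparison would be what narrows it to a network namespace. Every type below is a simplified stand-in, and `other_file_operations`, `ipcns_operations`, `model_file`, `model_is_ns_file` and `model_get_net_ns` are hypothetical names used only for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel objects named in the patch.
 * These are NOT the kernel definitions, only a model of the checks. */
struct file_operations { int unused; };
struct proc_ns_operations { const char *name; };

const struct file_operations ns_file_operations = { 0 };
const struct file_operations other_file_operations = { 0 };  /* hypothetical: any non-namespace file */

const struct proc_ns_operations netns_operations = { "net" };
const struct proc_ns_operations ipcns_operations = { "ipc" }; /* hypothetical second namespace type */

struct model_file {
    const struct file_operations *f_op;
    const struct proc_ns_operations *ns_ops;
};

/* Stage 1 (models the f_op test attributed to proc_ns_fget):
 * is this a namespace file at all? */
int model_is_ns_file(const struct model_file *f)
{
    return f != NULL && f->f_op == &ns_file_operations;
}

/* Stage 2 (models the ns_ops test in get_net_ns_by_fd):
 * is it specifically a network namespace file? */
int model_get_net_ns(const struct model_file *f)
{
    if (!model_is_ns_file(f))
        return -EINVAL;
    if (f->ns_ops != &netns_operations)
        return -EINVAL;
    return 0;
}
```

Under that assumption, a file that passes stage 1 but refers to a different namespace type is only rejected by stage 2, so the two checks would not be redundant.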


[Devel] Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors

2010-09-24 Thread Daniel Lezcano
On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
> Eric W. Biederman wrote:
> > Introduce file for manipulating namespaces and related syscalls.
> > files:
> > /proc/self/ns/<nstype>
> >
> > syscalls:
> > int setns(unsigned long nstype, int fd);
> > socketat(int nsfd, int family, int type, int protocol);
>
> How does security work?  Are there different kinds of fd that give (say)
> pin-the-namespace permission, socketat permission, and setns permission?

AFAICS, socketat, setns and set netns by fd only accept an fd from
/proc/<pid>/ns/<ns>.

setns does:

    file = proc_ns_fget(fd);
    if (IS_ERR(file))
        return PTR_ERR(file);

proc_ns_fget checks if (file->f_op != &ns_file_operations)


socketat and get_net_ns_by_fd:

    net = get_net_ns_by_fd(fd);

this one calls proc_ns_fget.

So we have the guarantee here that the fd results from an open of the
file with the right permissions.

Another way to pin the namespace would be to mount --bind
/proc/<pid>/ns/<ns>, but we have to be root to do that ...
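
As a rough userspace illustration of where such an fd comes from, the sketch below builds the per-process namespace path and opens it read-only. It assumes the /proc file layout proposed in this thread; `ns_path` and `open_ns_fd` are hypothetical helper names, not part of the patchset.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/ns/<nstype>" into buf.
 * Returns 0 on success, -1 if the path does not fit. */
int ns_path(char *buf, size_t len, pid_t pid, const char *nstype)
{
    int n = snprintf(buf, len, "/proc/%ld/ns/%s", (long)pid, nstype);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Open the namespace file read-only. The open() is subject to the usual
 * /proc permission checks. Returns the fd, or -1 on error (errno set by
 * open(), or the path was too long). */
int open_ns_fd(pid_t pid, const char *nstype)
{
    char path[64];

    if (ns_path(path, sizeof(path), pid, nstype) != 0)
        return -1;
    return open(path, O_RDONLY);
}
```

A caller would do something like `int fd = open_ns_fd(getpid(), "net");` and then hand that fd to setns or socketat as proposed; whether the open succeeds depends on the kernel providing these files.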


[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread jamal
On Fri, 2010-09-24 at 16:09 +0200, David Lamparter wrote:

> I understood your point. What I'm saying is that that functional graph
> you're describing is too simplistic to be a workable model. Your graph
> allows for what you're trying to do, yes. But your graph is not modeling
> the reality.

How about we put this specific point to rest by agreeing to
disagree? ;-)

> Err... I'm migrating netdevs to assign them to namespaces to allow them
> to use them? Setup, basically. Either way a device move only happens as
> result of some administrative action; be it creating a new namespace or
> changing the physical/logical network setup.

Ok, different need. You have a much more basic requirement than I do.

> wtf is a remote namespace?

A namespace that is remotely located on another machine/hardware ;-)

> Can you please describe your application that requires moving possibly
> several network devices together with their routes to a different
> namespace?

Scaling and availability are the driving requirements.

cheers,
jamal



[Devel] Re: [ABI REVIEW][PATCH 0/8] Namespace file descriptors

2010-09-24 Thread Eric W. Biederman
Daniel Lezcano <daniel.lezc...@free.fr> writes:

> On 09/24/2010 03:02 PM, Andrew Lutomirski wrote:
> > Eric W. Biederman wrote:
> > > Introduce file for manipulating namespaces and related syscalls.
> > > files:
> > > /proc/self/ns/<nstype>
> > >
> > > syscalls:
> > > int setns(unsigned long nstype, int fd);
> > > socketat(int nsfd, int family, int type, int protocol);
> >
> > How does security work?  Are there different kinds of fd that give (say)
> > pin-the-namespace permission, socketat permission, and setns permission?
>
> AFAICS, socketat, setns and set netns by fd only accept fd from
> /proc/<pid>/ns/<ns>.
>
> setns does :
>
>     file = proc_ns_fget(fd);
>     if (IS_ERR(file))
>         return PTR_ERR(file);
>
> proc_ns_fget checks if (file->f_op != &ns_file_operations)
>
> socketat and get_net_ns_by_fd:
>
>     net = get_net_ns_by_fd(fd);
>
> this one calls proc_ns_fget.
>
> We have the guarantee here, the fd is resulting from an open of the file with
> the right permissions.

In particular the default /proc permissions say you have to be the owner
of the process (or root) to access the file.  If you are the owner of
the process with a namespace (or root) you already have permission to
access and manipulate the namespace.

Additionally, setns, like unshare, requires CAP_SYS_ADMIN (aka root magic).

> Another way to pin the namespace, would be to mount --bind /proc/<pid>/ns/<ns>
> but we have to be root to do that ...

Simply keeping the process running pins the namespace. That requires no
new permissions.

Similarly socketat.  It is possible to use unix domain sockets to
implement it today without any kernel changes.  It is just an
unnecessary pain to run a server process to pin a namespace or to serve
up file descriptors in other network namespaces.
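
The unix-domain-socket route mentioned above presumably relies on SCM_RIGHTS descriptor passing: a helper process living in the target network namespace creates the socket there and hands the fd to the requester. A minimal sketch of just the passing step follows; `send_fd` and `recv_fd` are hypothetical helper names, and the surrounding server process is not shown.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control-message buffer sized and aligned for exactly one descriptor. */
union fd_cmsg {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send one file descriptor over a connected unix domain socket.
 * Returns 0 on success, -1 on failure. */
int send_fd(int sock, int fd)
{
    char byte = 0;   /* one byte of ordinary data carries the control message */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor sent by send_fd().
 * Returns the new descriptor, or -1 on failure. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor refers to the same open file as the sender's, so a socket created inside one network namespace keeps operating there even when used by a process in another. The cost is exactly the extra server process that the paragraph above calls an unnecessary pain.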

The primary change of this patchset is the ability to do everything
with file descriptors, and with the mount namespace.  That moves
everything from a bizarre, hard to understand and manipulate interface
to one where things can be done much more easily and cheaply,
resulting in a much more powerful and usable interface.

Eric



[Devel] Re: [RFC][PATCH 00/10] taskstats: Enhancements for precise accounting

2010-09-24 Thread Andrew Morton
On Fri, 24 Sep 2010 11:10:15 +0200
Michael Holzheu <holz...@linux.vnet.ibm.com> wrote:

> Hello Andrew,
>
> On Thu, 2010-09-23 at 13:11 -0700, Andrew Morton wrote:
> > > GOALS OF THIS PATCH SET
> > > ---
> > > The intention of this patch set is to provide better support for tools
> > > like top. The goal is to:
> > >
> > > * provide a task snapshot mechanism where we can get a consistent view of
> > >   all running tasks.
> > > * provide a transport mechanism that does not require a lot of system
> > >   calls and that allows implementing low CPU overhead task monitoring.
> > > * provide microsecond CPU time granularity.
> >
> > This is a big change!  If this is done right then we're heading in the
> > direction of deprecating the longstanding way in which userspace
> > observes the state of Linux processes and we're recommending that the
> > whole world migrate to taskstats.  I think?
>
> Or it can be used as alternative. Since procfs has its drawbacks (e.g.
> performance) an alternative could be helpful.

And it can be harmful.  More kernel code to maintain and test, more
userspace code to develop, maintain, etc.  Less user testing than if
there was a single interface.

 
> > I worry that there's a dependency on CONFIG_NET?  If so then that's a
> > big problem because in N years time, 99% of the world will be using
> > taskstats, but a few embedded losers will be stuck using (and having to
> > support) the old tools.
>
> Sure, but if we could add the /proc/taskstats approach, this dependency
> would not be there.

So why do we need to present the same info over netlink?

If the info is available via procfs then userspace code should use that
and not netlink, because that userspace code would also be applicable
to CONFIG_NET=n systems.
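
For comparison, this is roughly what the existing procfs route looks like from userspace today: a sketch that pulls the user and system CPU time out of one /proc/<pid>/stat line. It is not part of the proposed /proc/taskstats interface; `parse_stat_cpu` is a hypothetical helper name, and the values it returns are in clock ticks (see sysconf(_SC_CLK_TCK)), which is much coarser than the microsecond granularity the patch set is aiming for.

```c
#include <stdio.h>
#include <string.h>

/* Extract utime and stime (fields 14 and 15, in clock ticks) from the
 * contents of /proc/<pid>/stat. Returns 0 on success, -1 on parse failure. */
int parse_stat_cpu(const char *line, unsigned long *utime, unsigned long *stime)
{
    /* Field 2 (comm) is parenthesised and may itself contain spaces or
     * ')', so resume parsing after the LAST closing parenthesis. */
    const char *p = strrchr(line, ')');

    if (p == NULL)
        return -1;

    /* Skip state (3), ppid..tpgid (4-8), flags (9) and the four fault
     * counters (10-13); then read utime (14) and stime (15). */
    if (sscanf(p + 1,
               " %*c %*d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
               utime, stime) != 2)
        return -1;
    return 0;
}
```

Nothing here depends on netlink, so the same code would also build and run on a CONFIG_NET=n system, which is the point being made above.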

 
> > Does this have the potential to save us from the CONFIG_NET=n problem?
>
> Yes

Let's say that when it's all tested ;)

> Are PIDs over all namespaces unique?

Nope.  The same pid can be present in different namespaces at the same
time.
