[Devel] Re: netns patches WAS( Re: [PATCH 8/8] net: Implement socketat.

2010-10-26 Thread jamal
Eric,

Ping?
If you are too busy to push these in maybe have
someone clueful like Daniel help out submitting? I think it
should probably be reasonable to leave out the sockeat
patch initially if it is deemed controversial..

cheers,
jamal

On Fri, 2010-10-15 at 08:30 -0400, jamal wrote:
> Eric et al,
> 
> Did these patches make it in? I was looking at
> two Davem net trees and i dont see them.
> 
> cheers,
> jamal
> 


___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] netns patches WAS( Re: [PATCH 8/8] net: Implement socketat.

2010-10-15 Thread jamal
Eric et al,

Did these patches make it in? I was looking at
two Davem net trees and i dont see them.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 8/8] net: Implement socketat.

2010-10-03 Thread jamal
Hi Daniel,

Thanks for clarifying this ..

On Sat, 2010-10-02 at 23:13 +0200, Daniel Lezcano wrote:
> Just to clarify this point. You enter the namespace, create the socket
> and go back to the initial namespace (or create a new one). Further 
> operations can be made against this fd because it is the network 
> namespace stored in the sock struct which is used, not the current 
> process network namespace which is used at the socket creation only.
> 
> We can actually already do that by unsharing and then create a
> socket. 
> This socket will pin the namespace and can be used as a control socket
> for the namespace (assuming the socket domain will be ok for all the 
> operations).
>
> Jamal, I don't know what kind of application you want to use but if I 
> assume you want to create a process controlling 1024 netns, 

At the moment i am looking at 8K on a Nehalem with lots of RAM. They
will mostly be created at startup but some could be created afterwards.
Each will have its own netdevs etc. also created at startup (and some
other config that may happen later). 
Because startup time may accumulate, it is clearly important to me
to pick whatever scheme that reduces the number of calls...

> let's try to identificate what happen with setns and with socketat :
> 
> With setns:
> 
>  * open /proc/self/ns/net (1)
>  * unshare the netns
>  * open /proc/self/ns/net (2)
>  * setns (1)
>  * create a virtual network device
>  * move the virtual device to (2) (using the set netns by fd)
>  * unshare the netns
>  ...
> 
> With socketat:
> 
>  * open a socket (1)
>  * unshare the netns
>  * open a netlink with socketat(1) => (2)
>  * create a virtual device using (2) (at this point it is
> init_net_ns)
>  * move the virtual device to the current netns (using the set
> netns 
> by pid)
>  * open a socket (3)
>  * unshare the netns
>  ...
> 
> We have the same number of file descriptors kept opened. Except, with 
> setns we can bind mount the directory somewhere, that will pin the 
> namespace and then we can close the /proc/self/ns/net file descriptors
> and reopen them later.
> 

Ok, so a wrapper such as: create_socket_on(namespaceid)
will have generally less system calls with socketat()

> If your application has to do a lot of specific network processing, 
> during its life cycle, in different namespaces, the socketat syscall 
> will be better because it will reduce the number of syscalls but at
> the cost of keeping the file descriptors opened (potentially a big
> number). Otherwise, setns should fit your needs.

Makes sense. 

One thing still confuses me...
The app control point is in namespace0. I still want to be able to
"boot" namespaces first and maybe a few seconds later do a socketat()...
and create devices, tcp sockets etc. I suspect create_ns(namespace-name)
would involve:
 * open /proc/self/ns/net (namespace-name)
 * unshare the netns
Is this correct?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread jamal
On Fri, 2010-09-24 at 16:09 +0200, David Lamparter wrote:

> I understood your point. What I'm saying is that that functional graph
> you're describing is too simplistic do be a workable model. Your graph
> allows for what you're trying to do, yes. But your graph is not modeling
> the reality.

How about we put this specific point to rest by agreeing to
disagree? ;->

> Err... I'm migrating netdevs to assign them to namespaces to allow them
> to use them? Setup, basically. Either way a device move only happens as
> result of some administrative action; be it creating a new namespace or
> changing the physical/logical network setup.
> 

Ok, different need. You have a much more basic requirement than i do.

> wtf is a "remote" namespace?
> 

A namespace that is remotely located on another machine/hardware ;->

> Can you please describe your application that requires moving possibly
> several network devices together with "their" routes to a different
> namespace?

scaling and availability are the driving requirements.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread jamal
On Fri, 2010-09-24 at 14:57 +0200, David Lamparter wrote:

> No. While you sure could associate routes with devices, they don't
> *functionally* reside on top of network devices. They reside on top of
> the entire IP configuration, 

I think i am not clearly making my point. There are data dependencies;
If you were to move routes, youd need everything that routes depend on.
IOW, if i was to draw a functional graph, routes would appear on top
of netdevs (I dont care what other functional blocks you put in between
or sideways to them).

> and in case of BGP they even reside on top
> of your set of peerings and their data.
> Even if you could "move" routes together with a network device, the
> result would be utter nonsense. 

You could argue that moving a netdevice where some of its fundamental
properties such as an ifindex change is utter nonsense. But you can
work around it.

> The routes depend on your BGP view, and
> if your set of interfaces (and peers) changes, your routes will change.
> Your bgpd will, either way, need to set up new peerings and redo best
> path evaluations.

Worst case scenario, yes. I am beginning to get a feeling we are trying 
to achieve different goals maybe? Why are you even migrating netdevs?

> (On an unrelated note, how often are you planning to move stuff between
> namespaces? I don't expect to be moving stuff except on configuration
> events...)

Triggering on config events is useful and it is likely the only
possibility if you assumed the other namespace is remote. But if could
send a single command to migrate several things in the kernel (in my
case to recover state to a different ns), then that is much simpler and
uses the least resources (memory, cpu, bandwidth). I admit it is very
hard to do in most cases where the underlying dependencies are evolving
and synchronizing via user space is the best approach. The example
of route table i pointed to is simple.
Besides that: dynamic state created in the kernel that doesnt have to be
recreated by the next arriving 100K packets helps to improve recovery.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-24 Thread jamal
On Thu, 2010-09-23 at 16:58 +0200, David Lamparter wrote:

> migrating route table entries makes no sense because
> a) they refer to devices and configuration that does not exist in the
>target namespace; they only make sense within their netns context
> b) they are purely virtual and you get the same result from deleting and
>recreating them.
> 
> Network devices are special because they may have something attached to
> them, be it hardware or some daemon.

Routes functionally reside on top of netdevices, point to nexthop
neighbors across these netdevices etc. Underlying assumption is you take
care of that dependency when migrating.
We are talking about FIB entries here not the route cache; moving a few
pointers within the kernel is a hell lot faster than recreating a subset
of BGP entries from user space. 

Eric, I didnt follow the exposed-races arguement: Why would it involve
more than just some basic locking only while you change the struct net
pointer to the new namespace for these sub-subsystems?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 8/8] net: Implement socketat.

2010-09-23 Thread jamal
On Thu, 2010-09-23 at 15:53 +0400, Pavel Emelyanov wrote:

> Why does it matter? You told, that the usage scenario was to
> add routes to container. If I do 2 syscalls instead of 1, is
> it THAT worse?
> 

Anything to do with socket IO that requires namespace awareness
applies for usage; it could be tcp/udp/etc socket. If it doesnt
make any difference performance wise using one scheme vs other
to write/read heavy messages then i dont see an issue and socketat
is redundant.

If i was to pick blindly - I would say whatever approach with
less syscalls is better even if just a "slow" path one time
thing. I could create a scenario which would make it bad
to have more syscalls.

But theres also the simplicity aspect in doing:
fdx = socketat namespace foo
use fdx for read/write/poll into foo without any wrapper code.
Vs
enter foo
fdx = socket ..
read/write fdx
leave foo.

> Just like it used to before the enter.
> 

So if i enter foo, get a fdx, leave foo i can use it in
ns0 as if it was in ns0?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 8/8] net: Implement socketat.

2010-09-23 Thread jamal
On Thu, 2010-09-23 at 15:33 +0400, Pavel Emelyanov wrote:

> This particular usecase is unneeded once you have the "enter" ability.

Is that cheaper from a syscall count/cost?
i.e do I have to enter every time i want to write/read this fd?
How does poll/select work in that enter scenario?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 7/8] net: Allow setting the network namespace by fd

2010-09-23 Thread jamal
On Thu, 2010-09-23 at 01:51 -0700, Eric W. Biederman wrote:
> Take advantage of the new abstraction and allow network devices
> to be placed in any network namespace that we have a fd to talk
> about.
> 

So ... why just netdevice? could you allow migration of other
net "items" eg a route table since they are all tagged by
netns?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 8/8] net: Implement socketat.

2010-09-23 Thread jamal
On Thu, 2010-09-23 at 12:56 +0400, Pavel Emelyanov wrote:
> On 09/23/2010 12:51 PM, Eric W. Biederman wrote:
> > 
> > Add a system call for creating sockets in a specified network namespace.
> 
> What for?

I can see many uses if my understanding is correct..
ex, from mother namespace:
fdx = open socket at namespace blah
from mother namespace, read/write/poll fdx 
(eg add route with netlink socket)

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)

2010-05-03 Thread jamal
On Mon, 2010-05-03 at 07:21 -0700, Dan Smith wrote:

> The benefits of doing what we can in userspace are well-understood and
> arguing for doing so where it makes sense is, of course, a good idea.
> 
> However, it seems to me that the rtnl interface provides us a
> reasonable layer of isolation between us and such changes.  Am I
> wrong?  

I may not have made my point earlier:
Let me give you an example by looking at your migration attributes..

-
+__be32 inet4_len;  /* mask length (bits)*/
+__u32  inet4_met;  /* metric*/
+__be32 inet4_dst;  /* route address */
+__be32 inet4_gwy;  /* gateway address   */
-

At some point i had a discussion with some folks on netdev where it
seemed valueable to add a fwmark to the route. If such is made, I dont
see what the motivation for whoever is codifying to add it to your
attributes so you can migrate the fwmark. One good motivation is to make
sure the main route code fails to compile if your attributes dont get
modified - this could happen if you re-use the same data structures as
the kernel etc.

> The rtnl messages appear to be rather generic and timeless,
> and in most cases have a significant amount of flexibility with
> respect to allowing advanced attributes to be ignored (which implies
> taking the default).

True - but you still need to worry about compat issues etc i.e when you
migrate to a remote kernel they better have the same features and kernel
config.. I am assuming this is not hard to impose on an admin.
Doing things in user space allows for doing more interesting things like
negotiating on capabilities etc

> In many other areas of C/R we're not so lucky and don't have a
> well-defined interface for dumping that information out of the
> kernel...

Maybe the answer is to start by formalizing that, not sure.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] [RFC] C/R: inet4 and inet6 unicast routes (v2)

2010-04-30 Thread jamal
On Fri, 2010-04-30 at 14:24 -0700, Dan Smith wrote:

> 
> I'm sure it would be doable.  However, checkpointing the routes that
> way would:
> 
> (a) Be inconsistent with how we checkpoint all the other resources,
> including the other network resources we handle from the kernel
> with rtnl

My 2c:
The problem as i see it (with all net structures not just routes - i was
equally pessimistic when i saw those other net structure
checkpoint/restore changes) is you are faced with a herculean
high-maintainance effort...
You have a separate piece of code which populates structures that _you_
maintain for attributes that are defined elsewhere by other people.
Nobody adding a new attribute that is very important to route
restoration for example is likely to change your code. Unless you tie
the two together (so changing one forces the coder to change the other).
And once people deploy kernels it is hard to change. Historically (for
pragmatic reasons) such rich interfaces sit in user space - much easier
to update user space.
 
cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/4] [RFC] Add sock_create_kern_net()

2010-04-28 Thread jamal

On Fri, 2010-04-23 at 07:55 -0700, Dan Smith wrote:
> This helper allows kernel routines to create a socket in a given netns,
> instead of forcing it to the initial or current one.
> 
> I know this seems like it's violating the netns boundary.  The intended
> use (as in the following patches) is specifically when talking to RTNETLINK
> in another netns for the purposes of creating or examining resources there.
> It is expected that this will be used for that sort of transient socket
> creation only.  In other words:
> 
>   s = sock_create_kern_net(AF_NETLINK, ..., other_netns, ...);
>   rtnl_talk(s);
>   close(s);
> 

CCing Eric B. and Daniel with whom i have had this discussion before.

So ... how does user space know what "other_netns" is?

Also note Eric's recent patches introduced another way of opening a
socket in a different namespace - are you using those in the abstraction
to find what netns is?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-02-23 Thread jamal
On Tue, 2010-02-23 at 12:00 -0800, Eric W. Biederman wrote:

> That point of the mount to hold a persistent reference to the
> namespace without using a process.
> 
> The point of the of the to be written set_ns call is to change
> the default network namespace of the process such that all future
> open/bind/socket calls happen in the referenced network namespace.

Ok, i like it ;-> Patches RSN? Let me if you want someone to test..

> The are a few stray places like sysfs where it is the mount point
> not current->nsproxy->net_ns that will determine what we see.

Is sysfs considered "usable enough" for namespaces?

> Attributes of the specific namespace?

Well, example what is being un/shared etc. 

cheers,
jamal


___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-02-23 Thread jamal
Added Daniel to the discussion..

On Tue, 2010-02-23 at 06:07 -0800, Eric W. Biederman wrote:
> jamal  writes:

> > Does the point after sys_setns(fd) allow me to do io inside
> > ns ? Can i do open() and get a fd from ns ?
> 
> Yes.  My intention is that current->nsproxy->net_ns be changed.
> We can already change it in unshare so this is feasible.

I like it if it makes it as easy as it sounds;-> With lxc,
i essentially have to create a proxy process inside the
namespace that i use unix domain to open fds inside the ns.
Do i still need that?

> > The only problem that i see is events are not as nice. I take it i am 
> > going to get something like an inotify when a new namespace is created?
> 
> Yes.  Inotify would at the very least see that mkdir.  You could also
> use poll on /proc/mounts to see the set of mounts change.

It is not as nice but livable. I suppose attributes of the specific
namespace are retrieved somewhere there as well..

> > Is it not just a naming convention that you are dealing with?
> > Example in your scheme above a nested namespace shows up as:
> > /var/run/netns//, no?
> 
> No.  More like:
> 
> For the outer namespace:
> /var/run/netns/
> 
> For the inner namespace:
> /some/random/fs/path/to/a/chroot/var/run/netns/
> 
> For a doubly nested scenario:
> /some/random/fs/path/to/a/chroot/some/other/random/fs/path/to/another/chroot/var/run/netns/
> 
> Since I would be using mount namespaces instead of chroot it is not
> strictly required that the fs paths nest at all.

Ok.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-02-23 Thread jamal
On Mon, 2010-02-22 at 15:17 -0800, Eric W. Biederman wrote:

> What I am thinking is:
> 
> "ip ns  route add blah" is:
> fd = open("/var/run/netns/");
> sys_setns(fd);  /* Like unshare but takes an existing namespace */
> /* Then the rest of the existing ip command */

The other two below make some sense; For the above:
Does the point after sys_setns(fd) allow me to do io inside
ns ? Can i do open() and get a fd from ns ?

> "ip ns list" is:
> dfd = open("/var/run/netns", O_DIRECTORY);
> getdents(dfd, buf, count);
> 
> "ip ns new " is:
> unshare(CLONE_NEWNS);
> fd = nsfd(NETNS);
> mkdir("/var/run/netns/");
> mount("none", "/var/run/netns/", "ns", 0, fd);
> 
> Using unix domain names means that which namespaces you see is under
> control of userspace.  Which allows for nested containers (something I
> use today), and ultimately container migration.

The only problem that i see is events are not as nice. I take it i am 
going to get something like an inotify when a new namespace is created?

> Using genetlink userspace doesn't result in a nestable implementation
> unless I introduce yet another namespace, ugh.

Is it not just a naming convention that you are dealing with?
Example in your scheme above a nested namespace shows up as:
/var/run/netns//, no?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-02-22 Thread jamal
On Mon, 2010-02-22 at 12:46 -0800, Eric W. Biederman wrote:
> jamal  writes:

> 
> This is one of the long standing issues that we have always known
> we needed to solve, but have not taken the time to do it.  Now that
> the need is more real it looks about time to solve this one.
> 
> There are currently two problems.
> 1) A process is needed to hold a reference to the network namespace.
> 2) We use pids which are an awkward way of talking about network
>namespaces.
> 
> The solution I have been playing with involves.
> - Using a file descriptor to refer to a network namespace.
> - Using a trivial virtual filesystem to persistently hold onto
>   a namespace without the need of a process.
> - Have a convention of mounting the fs at something like
>   /var/run/netns/
> 

I didnt quiet follow how i could use the above to do:
"ip ns  route add blah" from namespace0.

I tend to think in packets and wires instead of files;
How about just allowing a "control" channel from which
i could discover the namespace?
Example, assuming i have the right permissions:
1) listen to async events example on a multicast bus when
a namespace is created or destroyed. Provide me a little more info on
the created namespace such as its pid, name(?), types of namespace, etc
2) send a query to dump existing namespace or query by name, id etc.
I get the same details as above.

using genetlink should provide you with sufficient ability to do this.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-01-15 Thread jamal
On Fri, 2010-01-15 at 11:15 +0100, Patrick McHardy wrote:
> jamal wrote:

> > b) dynamic protocols (routing, IKE etc): how do you do that without 
> > making both sides understand what is going on?
> 
> In case of IPsec the outer addresses are different, its only the
> selectors which will have similar addresses. A keying deamon should
> have no trouble with this. The ifindex would be needed in the
> selectors though to make sure each policy is used for the correct
> traffic.

you need to have user space knowledgeable of the mapping between an
ifindex and a zone. It may work with perhaps that info explicitly in
config with tunnel mode/ESP.

> A routing daemon is unrealistic to be used in this scenario, at
> least a single one for all the overlapping networks.

I think in general, it would be hard to deal with anything that requires
dynamic control where one or more peers have to discover each other once
you have IP overlap. You will have to change those user space apps.

In any case, for what you seem to intend this for, i think it works.

> > Agreed. But the major ones like iproute2 etc could be taught. We have
> > namespaces in the kernel already, over a period of time I think changing
> > the user space tools would a sensible evolution.
> 
> Yes, that might be useful in any case. But I don't think it would
> even work for iproute or other standalone programs, a process can't
> associate to an existing namespace except through clone(). So it
> needs to run as child of a process already associated with the
> namespace.

The mechanics are not there, yet. But if i had sufficient permission,
and was able to find the namespaces when i ask and/or get events when it
is created it should be an issue of sending it a message.
The current approach to say migrate a veth via iproute2 requires we 
know the pid of the target namespace. Thats a usability issue.
I tried to muck with namespaces and if you use a library like lxc
you can do it - but it is a hack as it stands today (and merging
iproute2 with lxc is questionable).


>  (X + 152) bytes,
> > correct?
> > What is the typical sizeof X?
> 
> No, to give some correct number. Assuming a conntrack table of
> 10MB (large, but reasonable depending on the number of connections)
> we get an overhead of:
> 
> namespaces: 150 * 10MB memory use
> "zones": 152 bytes increased code size

That is substantial if you are doing an embedded device.
But otherwise, RAM is so cheap that i would take usability
any day for an extra $5.

BTW, I think the zones approach will still use more than 10MB
in this case given it encompasses all "zones" whereas namespace only
does it for a single mapped "zone".

> Both approaches additionally need one extra connection tracking
> entry of ~300 bytes per connection that is actually handled twice.

Ok, so computation is not a differentiator.

> That will go away once I add a target for classification. 

Makes sense 
On a side note: I wouldnt mind seeing some field in struct
netdev for some general purpose grouping/IDing which could be
set from user space. 

cheers,
jamal



___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-01-15 Thread jamal
On Thu, 2010-01-14 at 10:32 -0800, Ben Greear wrote:

> For small or simple cases, this may be true..but there is a lot of work
> to make a complex user-space app that manages arbitrary amounts of interfaces
> routing tables in an arbitrary amount of network namespaces.  With the 
> contrack-zones
> approach, user-space apps do not require any significant changes, and you do 
> not
> need the rest of the namespace overhead to accomplish the task.

I think for your use case what you state is true. In the general case,
it is not. 
Note: I am not arguing against the patch - just that it is not the
generic scenario solution compared to namespaces.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-01-14 Thread jamal
On Thu, 2010-01-14 at 16:37 +0100, Patrick McHardy wrote:
> jamal wrote:

> > Agreed that this would be a main driver of such a feature.
> > Which means that you need zones (or whatever noun other people use) to
> > work on not just netfilter, but also routing, ipsec etc.
> 
> Routing already works fine. I believe IPsec should also work already,
> but I haven't tried it.

maybe further discussion  would clarify this point..

> The zone is set based on some other criteria (in this case the
> incoming device).

If you are using a netdev as a reference point, then I take it 
if you add vlans should be possible to do multiple zones on a single
physical netdev? Or is there some other way to satisfy that?

>  The packets make one pass through the stack
> to a veth device and are SNATed in POSTROUTING to non-clashing
> addresses. 

Ok - makes sense. 
i.e NAT would work; and policy routing as well as arp would be fine.
Also it looks to be sufficiently useful to fit a specific use case you
are interested in.
But back to my question on routing, ipsec etc (and you may not be
interested in solving this problem, but it is what i was getting to
earlier). Lets take for example: 
a) network tables like SAD/SPD tables: how you would separate those on a
per-zone basis? i.e 10.0.0.1/zone1 could use different
policy/association than 10.0.0.1/zone2
b) dynamic protocols (routing, IKE etc): how do you do that without 
making both sides understand what is going on?

> > This is a valid concern against the namespace approach. Existing tools
> > of course could be taught to know about namespaces - and one could
> > argue that if you can resolve the overlap IP address issue, then you
> > _have to_ modify user space anyways.
> 
> I don't think thats true. 

Refer to my statements above for an example.

> In any case its completely impractical
> to modify every userspace tool that does something with networking
> and potentially make complex configuration changes to have all
> those namespaces interact nicely. 

Agreed. But the major ones like iproute2 etc could be taught. We have
namespaces in the kernel already, over a period of time I think changing
the user space tools would a sensible evolution.

> Currently they are simply not
> very well suited for virtualizing selected parts of networking.

My contention is that it is a lot less headache to just virtualize 
all the network stack and then use what you want than it is to go and
selectively changing the network objects.
Note: if i wanted today i could run racoon on every namespace 
unchanged and it would work or i could modify racoon to understand
namespaces...

> I'm not sure whether there is a typical user for overlapping
> networks :) I know of setups with ~150 overlapping networks.
> 
> The number of conntracks per zone doesn't matter since the
> table is shared between all zones. network namespaces would
> allocate 150 tables, each of the same size, which might be
> quite large.

Thats what i was looking for ..
So the difference, to pick the 150 zones example so as to put a number
around it, is namespaces will consume 150.X bytes (where X is the
overhead of a conntrack table) and you approach will be (X + 152) bytes,
correct?
What is the typical sizeof X?

> > You may also wanna look as a metric at code complexity/maintainability
> > of this scheme vs namespace (which adds zero changes to the kernel).
> 
> There's not a lot of complexity, its basically passing a numeric
> identifier around in a few spots and comparing it. Something like
> TOS handling in the routing code.

I think the challenge is whether zones will have to encroach on other
net stack objects or not. You are already touching structure netdev...
A digression: TOS is different really - it has network level semantic. This 
would be more like mark or in some cases ifindex (i.e local semantics)
 
> > BTW, why not use skb->mark instead of creating a new semantic construct?
> 
> Because people are already using it for different purposes.

tru dat - it only gives you one semantical axis and you need an
additional dimension in your case (namespace have that resolved via
struct net).

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: RFC: netfilter: nf_conntrack: add support for "conntrack zones"

2010-01-14 Thread jamal

Ive had an equivalent discussion with B Greear (CCed) at one point on
something similar, curious if you solve things differently - couldnt
tell from the patch if you address it.
Comments inline:

On Thu, 2010-01-14 at 15:05 +0100, Patrick McHardy wrote:
> The attached largish patch adds support for "conntrack zones",
> which are virtual conntrack tables that can be used to seperate
> connections from different zones, allowing to handle multiple
> connections with equal identities in conntrack and NAT.
>
> A zone is simply a numerical identifier associated with a network
> device that is incorporated into the various hashes and used to
> distinguish entries in addition to the connection tuples. Additionally
> it is used to seperate conntrack defragmentation queues. An iptables
> target for the raw table could be used alternatively to the network
> device for assigning conntrack entries to zones.
>
>
> This is mainly useful when connecting multiple private networks using
> the same addresses (which unfortunately happens occasionally) 

Agreed that this would be a main driver of such a feature.
Which means that you need zones (or whatever noun other people use) to
work on not just netfilter, but also routing, ipsec etc.
As a digression: this is trivial to solve with network namespaces. 

> to pass
> the packets through a set of veth devices and SNAT each network to a
> unique address, after which they can pass through the "main" zone and
> be handled like regular non-clashing packets and/or have NAT applied a
> second time based f.i. on the outgoing interface.
> 

The fundamental question i have is:
how you deal with overlapping addresses?
i.e zone1 uses 10.0.0.1 and zone2 uses 10.0.0.1 but they are for
different NAT users/endpoints.

> Something like this, with multiple tunl and veth devices, each pair
> using a unique zone:
> 
>   
>  |
>   PREROUTING
>  |
>   FORWARD
>  |
>   POSTROUTING: SNAT to unique network
>  |
>   
>   
>  |
>   PREROUTING
>  |
>   FORWARD
>  |
>   POSTROUTING: SNAT to eth0 address
>  |
>   
> 
> As probably everyone has noticed, this is quite similar to what you
> can do using network namespaces. The main reason for not using
> network namespaces is that its an all-or-nothing approach, you can't
> virtualize just connection tracking. 

Unless there is a clever approach for overlapping IP addresses (my
question above), i dont see a way around essentially virtualizing the
whole stack which clone(CLONE_NEWNET) provides..

> Beside the difficulties in
> managing different namespaces from f.i. an IKE or PPP daemon running
> in the initial namespace, 

This is a valid concern against the namespace approach. Existing tools
of course could be taught to know about namespaces - and one could
argue that if you can resolve the overlap IP address issue, then you
_have to_ modify user space anyways.

> network namespaces have a quite large
> overhead, especially when used with a large conntrack table.

Elaboration needed.
You said the size in 64 bit increases to 152B per conntrack i think?
Do you have a hand-wave figure we can use as a metric to elaborate this
point? What would a typical user of this feature have in number of
"zones" and how many contracks per zone? Actually we could also look
at extremes (huge number vs low numbers)...

You may also wanna look as a metric at code complexity/maintainability
of this scheme vs namespace (which adds zero changes to the kernel).
I am pretty sure you will soon be "zoning" on other pieces of the net
stack ;->

> I'm not too fond of this partial feature duplication myself, but I
> couldn't think of a better way to do this without the downsides of
> using namespaces. Having partially shared network namespaces would
> be great, but it doesn't seem to fit in the design very well.
> I'm open for any better suggestion :)

My opinions above.

BTW, why not use skb->mark instead of creating a new semantic construct?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] 10K containers

2009-09-02 Thread jamal
Olla,

Has anybody benchmarked in the range of 10K containers for virtual
routing or even virtual servers with traffic? If you have, could you
share your experience on memory, cpu and throughput?

I have been pointed to:
https://lists.linux-foundation.org/pipermail/containers/2009-January/015315.html

which focuses on creation phase.
Dan, was "slow" measured by how long it took to create? What was the
kernel? I am running  2.6.31-rc7 - do the sysfs issues still exist?

I have also been pointed to:
http://lxc.sourceforge.net/network/benchs.php
Demonstrates throughput - seems to stay constant with the case of etun
hitting  2+ times the CPU use!
I am taking it that etun is pretty close to veth performance-wise, so
the results there may still be valid today.

Daniel - maybe you run these tests; is there a profile that points out
where the bottleneck is?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-06-10 Thread jamal
Apologies for the latency.

On Wed, 2009-06-03 at 11:27 +, Jarek Poplawski wrote:

> After the second look I have some questions:
> - if it's really aimed to skip checking by ->get() tp's before they're
> configured in ->change(), maybe instead of using tp_c to check this it
> would be simpler to generally skip calling ->get() for newly created
> tp's?
> - otherwise the current method probably needs adding a tp_c check for
> NULL in u32_destroy()?
> - it seems this method would also need adding a 'handle' lookup to
> the u32_change(); otherwise its 'handle' parameter isn't controlled
> for for uniqueness, unless I miss something?

Lets just ignore the need for these changes since the patch fixes them
for now. I would still like to make the changes i suggested but later
with more thought put into them.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-06-02 Thread jamal
On Tue, 2009-06-02 at 06:26 +, Jarek Poplawski wrote:

> The patch #2 is obviously worse and fixes less (of course it still
> needs testing for Minoru's case), but I'm 100% confident it can't
> introduce any regression (neither take 1 nor 2), which is much harder
> to say about patch #1, considering various "rude" configs we could
> miss (but we could give them some time to show off). But, as I've
> written before, I'm really (really) OK with your decision.

Thanks for the courtesy.
I note Dave already swallowed Minoru's patch; so lets move from there.
Yes, there's a possibility of a regression - I (and so are you) are only
recently evolved humans; we are not perfect... yet ;->
So i would agree with Minoru testing your patch as plan B in case the
applied one starts causing trouble.
BTW, ok - here's a quick untested, uncompiled fix to the u32 classifier
to fix the first rock (which you already worked around in your changes
to the included patch). No rush to submit for now..

On the second rock you threw so violently, after some reflection, I
think it is ok to send a replace twice with different priorities.
The second one will be added and the old not deleted, but if the user
has chosen the correct priority, then things will work out just fine.
And if they want they have to explicitly delete the one they dont want.
It is also not illegal to do a "replace" for installing instead of
"add".

So the only other things left to do from this exercise are (no rush in
any of them):
a) remove all "buckets" from underneath other classifiers
b) get consistency across all classifiers in usage of setup API

If you want to do this - go ahead; else i plan on tackling it probably
when stable 2.6.31 kicks in.

cheers,
jamal
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index 07372f6..5ad0b98 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -249,6 +249,9 @@ static unsigned long u32_get(struct tcf_proto *tp, u32 handle)
 	struct tc_u_hnode *ht;
 	struct tc_u_common *tp_c = tp->data;
 
+	if (!tp_c)
+		return 0;
+
 	if (TC_U32_HTID(handle) == TC_U32_ROOT)
 		ht = tp->root;
 	else
@@ -311,7 +314,6 @@ static int u32_init(struct tcf_proto *tp)
 	root_ht->tp_c = tp_c;
 
 	tp->root = root_ht;
-	tp->data = tp_c;
 	return 0;
 }
 
@@ -524,7 +526,7 @@ static int u32_change(struct tcf_proto *tp, unsigned long base, u32 handle,
 		  struct nlattr **tca,
 		  unsigned long *arg)
 {
-	struct tc_u_common *tp_c = tp->data;
+	struct tc_u_common *tp_c = tp->root->tp_c;
 	struct tc_u_hnode *ht;
 	struct tc_u_knode *n;
 	struct tc_u32_sel *s;
@@ -540,6 +542,7 @@ static int u32_change(struct tcf_proto *tp, unsigned long base, u32 handle,
 	if (err < 0)
 		return err;
 
+	tp->data = tp_c;
 	if ((n = (struct tc_u_knode*)*arg) != NULL) {
 		if (TC_U32_KEY(n->handle) == 0)
 			return -EINVAL;
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-06-02 Thread jamal
On Tue, 2009-06-02 at 15:59 +0900, Minoru Usui wrote:

> If we have to test Jarek's patch #2, I'll test it tomorrow.
> What do you think Jamal?

Yes please - even if it is for reasons of givein Jarek some peace of
mind;->
If something goes seriously wrong with other classifiers because of the
patch, and Jareks patch works, we can back out the current patch and fix
it with that approach.

cheers,
jamal


___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-06-01 Thread jamal
On Mon, 2009-06-01 at 22:49 +0200, Jarek Poplawski wrote:

> 
> Actually, I'd insist with the old rock and handling that other rude
> u32 case, at least until it's fixed in place. So I attach my version
> of your patch (additionally I removed a pair of braces because of
> checkpatch warning).

Sure, it doesnt complicate anything - Minoru, this is the version to go
for.

> Alas, I still think we don't need to change so much in -stable to
> fix the cls_cgroup oops, so I attach a patch which I think is
> enough for -stable and probably -net too. It could be "reverted"
> in -net-next just after applying cls_api patch. Of course, treat
> it only as my humble proposal, and feel free to recommend to David
> your version, no problem (really).


My view is the same - that the second patch doesnt fix the root
cause; and it is not that complicated to fix the root cause. So I humbly
disagree with you.

The issue is that a classifer (cls_group in this example) is being
misconfigured. It rejects the config - but the tp has already been
added. 
It then tries to use the tp in the fast and fails. 

If you look as closely as you did with the patch i posted, youd find
ways to construct similar hostile misconfigs for other classifiers.
You just need to create the scenario where the attributes will fail to
validate. 

I actually suspect the most common scenario for such a failure is not
that head is null (I doubt in Minoru case that allocation will fail);
rather it is some reference to head->something.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-06-01 Thread jamal
On Mon, 2009-06-01 at 06:06 +, Jarek Poplawski wrote:

> 
> But how about that (of course extremely rude) case "tc filter replace"
> is run with a new prio?


Jarek, sir, handyman extraordinaire, handsome devil, and lover of
kittens I humbly opine that we need to handle that case. 
How about going back to your original idea of defining tp_created? With
apologies to Minoru (he must be thinking we are lunatics by now), how
does the attached changed patch look to you?

Before you throw another rock, there is another issue which will be
caused by this rude misconfig:
"replace" really means "get rid of the old and add this new one".
But for the last 50 years we do not "get rid of the old". I cant think
of a clean way to do it sans shaving one of the kittens. One simple
thing to do is to printk a warning when detecting this error. I think
one needs to draw a line where bad config affects your life - in this
case i dont think it is worth big changes..

cheers,
jamal
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0759f32..08d98e8 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -135,6 +135,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 	unsigned long cl;
 	unsigned long fh;
 	int err;
+	int tp_created = 0;
 
 	if (net != &init_net)
 		return -EINVAL;
@@ -266,10 +267,7 @@ replay:
 			goto errout;
 		}
 
-		spin_lock_bh(root_lock);
-		tp->next = *back;
-		*back = tp;
-		spin_unlock_bh(root_lock);
+		tp_created = 1;
 
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
 		goto errout;
@@ -314,8 +312,19 @@ replay:
 	}
 
 	err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
-	if (err == 0)
+	if (err == 0) {
+		if (tp_created) {
+			spin_lock_bh(root_lock);
+			tp->next = *back;
+			*back = tp;
+			spin_unlock_bh(root_lock);
+		}
 		tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
+	} else {
+		if (tp_created) {
+			tcf_destroy(tp);
+		}
+	}
 
 errout:
 	if (cl)
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-31 Thread jamal
On Sun, 2009-05-31 at 21:55 +0200, Jarek Poplawski wrote:

> But then, there could be "tc filter replace" with only this
> (n->nlmsg_type == RTM_NEWTFILTER && (n->nlmsg_flags&(NLM_F_CREATE)))
> which can't get here with a newly created tp, I guess.

Right - because in that case tp would already exist and would
be found in the lookup.

> I'm almost sure these commands (right or wrong) can trigger such a 
> leak:
> 
> $ sudo tc qdisc add dev lo root handle 1: htb
> $ sudo tc filter add dev lo proto ip pref 1 handle 800::1 u32 match
u32 0 0 classid 1:1
> $ sudo tc filter add dev lo proto ip pref 2 handle 800::1 u32 match
u32 0 0 classid 1:1
> RTNETLINK answers: File exists

You are good - A tip of my hat and a wag of my finger at you;->
Ok, now i looked a lot closer at all 8 classifiers
and u32 is the only one that can fault. It just needs to be fixed.
I will wait until Minoru's patch and then i will submit a fix for it.
Of course the commands from user space are a little rude ;->

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-31 Thread jamal
On Sun, 2009-05-31 at 09:24 -0400, jamal wrote:

> This would imply the classifier is buggy. I will stare at the different
> classifier - and if any exhibits such traits it needs to be fixed

I couldnt find any case where this is possible in the current code. If
you have a specific example, we need to fix it.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-31 Thread jamal
On Sun, 2009-05-31 at 10:12 +0200, Jarek Poplawski wrote:

> If somebody runs tc add filter with a new priority but existing handle
> a newly created (and not linked now) tc would be handled with goto
> errout and would leak, I guess.

This would imply the classifier is buggy. I will stare at the different
classifier - and if any exhibits such traits it needs to be fixed

> The change (in the two spots) is:
> > +   if (n->nlmsg_type == RTM_NEWTFILTER &&
> > +   (n->nlmsg_flags&(NLM_F_CREATE|NLM_F_EXCL))) {
> > 
> 
> Sorry, but I don't think this change is enough; tc filter replace
> with only this (n->nlmsg_type == RTM_NEWTFILTER &&
> (n->nlmsg_flags&(NLM_F_CREATE))) can get here with an "old" tp
> and will relink it or destroy depending on the ->change() return.
> 

Excellent point - there could be buggy user space apps that will do
that. Minoru change the check to:

+   if (n->nlmsg_type == RTM_NEWTFILTER &&
+   (n->nlmsg_flags&NLM_F_CREATE &&
+n->nlmsg_flags&NLM_F_EXCL)) {

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-31 Thread jamal
On Sun, 2009-05-31 at 09:55 +0200, Jarek Poplawski wrote:

> Sure, after fixing it properly we should get rid of unneeded checks.

> > > Anyway, it's worked for other classifiers like this for some time...
> > 
> > Would you agree that it is/was a bandaid?
> > Or maybe you have some other fear that this may break something else and
> > prefer the workaround instead?
> 
> If somebody decided to do it this way instead of the "proper" fix then
> it looks to me more like a bandaid "by design". 

I think your and my definition of "proper" are at odds here ;->
My logic says there's a causality effect and you always fix the cause
in such a situation. 
Here's the anology of our conversation (some captured above) as i see it
centred around a bug of leaky pipe which just messed up the carpet
overnight in some room:
 
handyman Jamal: fix the pipe so it doesnt leak.
handyman Jarek: put a little bucket below the dripping spot
handyman Jamal: fixing this pipe is cheap - it is on warranty too
handyman Jarek: I fixed two other rooms by putting buckets
handyman Jamal: Proper: fix the pipe - you dont need the buckets forever
handyman Jarek: Proper fix is to put the bucket below any drip

Of course i am exagerating to make a point on our logical approaches:
leaking pipes dont quiet match software bugs i.e.
a dripping pipe will have the short term sense of emergency of needing
buckets but in this case the cost of the damage and time is the same
if you put a bucket or fixed the pipe.

> And, yes, I have some
> fear we could break some strange configs, which could depend on this
> wrong but working design.

To continue our discussion
handyman Jamal: What is your fear about fixing the pipe?
handyman Jarek: Someone may have plants which depend on the drips;
so if you fix it his plants wont get water anymore.

I hope the above doesnt offend you - it is meant in good spirit.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sun, 2009-05-31 at 07:38 +0900, Minoru Usui wrote:

> I agree.
> I think cls_cgroup should check head(tp->root) whether NULL or not
> like other classifiers, too.
> 

I dont think this is necessary if the adding/linking unconfigured tp
doesnt happen on failed config.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sun, 2009-05-31 at 07:22 +0900, Minoru Usui wrote:

> I'll do it after understanding above code.

Current patch has two places with:

+   if (n->nlmsg_type == RTM_NEWTFILTER &&
+   (n->nlmsg_flags&NLM_F_CREATE)) {

The change (in the two spots) is:
+   if (n->nlmsg_type == RTM_NEWTFILTER &&
+   (n->nlmsg_flags&(NLM_F_CREATE|NLM_F_EXCL))) {

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sat, 2009-05-30 at 16:00 +0200, Jarek Poplawski wrote:
> On Sat, May 30, 2009 at 09:31:37AM -0400, jamal wrote:
> > What were you suggesting to change in cls_group to avoid this oops?
> 
> I think checking the head (tp->root) for NULL like in cls_fw or
> cls_route should work.

IMHO that is a workaround for the tp linking bug
[IOW, there is no need to check for tp->root == NULL (in the fast path)
if such an illegal tp was never linked to begin with (on the slow
path)].

So those classifiers you point to need to be fixed afterwards (but 
not -stable material).
My thinking of fixing them to do proper init/get as well later.

> Anyway, it's worked for other classifiers like this for some time...

Would you agree that it is/was a bandaid?
Or maybe you have some other fear that this may break something else and
prefer the workaround instead?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sat, 2009-05-30 at 15:20 +0200, Jarek Poplawski wrote:

> Yes it oopses in cls_cgroup_classify(). Here is the first report:
> http://permalink.gmane.org/gmane.linux.network/128593
> 

Oopsing in classify is after the fact though. It should not have been
linked to begin with (because wrong config was passed from user space).
As far as cls_group is concerned that is an illegitimate config - thats
why it failed it.

What were you suggesting to change in cls_group to avoid this oops?

> We add/link unconfigured tp, but it could be destroyed later, so I
> wouldn't call this a memory leak.

Indeed - call it "We add/link unconfigured tp".

Anyways, a nice sun just came out over here and i am off to run 
some chores. If you respond you will hear from me in a few hours.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sat, 2009-05-30 at 14:45 +0200, Jarek Poplawski wrote:
> On Sat, May 30, 2009 at 08:31:23AM -0400, jamal wrote:

> > Remeber, you could have NLM_F_EXCL|NLM_F_CREATE to indicate "create this
> > thing if it doesnt exist; if it exists  it is an error"
> > If it doesnt exist we will enter that (tp == NULL) path
> > also fh will be 0 ==> So you will never enter the code
> > path you are refering to.
> > If it exists (i.e you found it) and you enter the code path you refer
> > to, then you surely dont want to destroy it if NLM_F_EXCL is set.
> 
> I mean we don't want to link it again or destroy after ->change() err
> if we run replace (n->nlmsg_type == RTM_NEWTFILTER &&
> (n->nlmsg_flags&NLM_F_CREATE)).

excellent point: an additional flag is needed then
n->nlmsg_flags& (NLM_F_CREATE|NLM_F_EXCL).
Minoru, please add this change in the patch before testing...

> > The ops is caused by the code fixed in the patch - did i miss something?
> 
> IMHO it could be fixed "old way" in cls_group code too.

Is the code oopsing in cls_group? It didnt seem to be so to me, and
yes cls_group is quarky in its practise (but thats a separate issue)
and even if it did oops in cls_group - this change above is a memory
leak and needs to be fixed in -stable. 

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sat, 2009-05-30 at 14:07 +0200, Jarek Poplawski wrote:
> On Sat, May 30, 2009 at 07:56:34AM -0400, jamal wrote:

> > tp_created is the check
> > n->nlmsg_type == RTM_NEWTFILTER && n->nlmsg_flags&NLM_F_CREATE
> > replace will be
> > n->nlmsg_type == RTM_NEWTFILTER && n->nlmsg_flags & NLM_F_EXCL
> 
> Hmm... Probably I miss something, but I've just seen this prink during
> tc filter replace with:
> 
> err = tp->ops->change();
> if (n->nlmsg_type == RTM_NEWTFILTER && (n->nlmsg_flags&NLM_F_CREATE))
>   printk(...);

That sounds right. 
Remeber, you could have NLM_F_EXCL|NLM_F_CREATE to indicate "create this
thing if it doesnt exist; if it exists  it is an error"
If it doesnt exist we will enter that (tp == NULL) path
also fh will be 0 ==> So you will never enter the code
path you are refering to.
If it exists (i.e you found it) and you enter the code path you refer
to, then you surely dont want to destroy it if NLM_F_EXCL is set.

> > I think they are two separate issues.
> > The fact that we dont destroy an allocated tp on failure is an issue
> > regardless of what cls_group does. In the case of Minoru's issue
> > it is because he is misconfiguring cls_group.
> 
> Sure, but we don't want people to get oops in such a case, I guess.
> 

The ops is caused by the code fixed in the patch - did i miss something?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sat, 2009-05-30 at 13:45 +0200, Jarek Poplawski wrote:

> > > >   }
> > > >
> > > >   err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
> > > > - if (err == 0)
> > > > + if (err == 0) {
> > > > + if (n->nlmsg_type == RTM_NEWTFILTER &&
> > > > + (n->nlmsg_flags&NLM_F_CREATE)) {
> 
> Since "tc filter replace" uses this type and flag too without creating
> tp, this check is not enough. I guess we could simply use a variable
> like tp_created etc. 

It will be superfluos. 
tp_created is the check
n->nlmsg_type == RTM_NEWTFILTER && n->nlmsg_flags&NLM_F_CREATE
replace will be
n->nlmsg_type == RTM_NEWTFILTER && n->nlmsg_flags & NLM_F_EXCL

> Anyway, changing this place looks tricky to me,
> so maybe it would be safer to do a separate cls_cgroup fix just for
> -stable, and this one patch for -next only?

I think they are two separate issues.
The fact that we dont destroy an allocated tp on failure is an issue
regardless of what cls_group does. In the case of Minoru's issue
it is because he is misconfiguring cls_group.

cheers,
jamal



___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-30 Thread jamal
On Sat, 2009-05-30 at 00:59 +0200, Jarek Poplawski wrote:
  
> ...
> } else {
> switch (n->nlmsg_type) {
> case RTM_NEWTFILTER:
> err = -EEXIST;
> if (n->nlmsg_flags & NLM_F_EXCL)
> goto errout;
> break;
> 
> Probably this case needs tcf_destroy() too.

No - that if stmnt will fail if this is a new filter being
created.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-29 Thread jamal
Hi Minoru,

I hate to do this to you after i made suggestions on how to
make the changes (as an adult i hate it when people do it to
me;->)
But sometimes code helps. So what i meant is the attached patch.
I havent even compiled it yet.
If it works, please submit it and add yourself as the author
(and a signed-off from me). Then we can revisit the init()
issue in cls_group..
You should also cc tgraf in your cls_grp config questions.

cheers,
jamal

On Fri, 2009-05-29 at 09:46 -0400, jamal wrote:
> 
> This is incorrect. tp may already exist and you dont want to destroy
> for failure to change its parameters. You also dont want to reattach
> an existing tp because it succeeded in parameter change. 
> So the soln is to check if this is a new tp and then do what you did
> above...
> Did that make sense?
> 
> cheers,
> jamal
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0759f32..8760a48 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -266,11 +266,6 @@ replay:
 			goto errout;
 		}
 
-		spin_lock_bh(root_lock);
-		tp->next = *back;
-		*back = tp;
-		spin_unlock_bh(root_lock);
-
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
 		goto errout;
 
@@ -314,8 +309,21 @@ replay:
 	}
 
 	err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
-	if (err == 0)
+	if (err == 0) {
+		if (n->nlmsg_type == RTM_NEWTFILTER &&
+		(n->nlmsg_flags&NLM_F_CREATE)) {
+			spin_lock_bh(root_lock);
+			tp->next = *back;
+			*back = tp;
+			spin_unlock_bh(root_lock);
+		}
 		tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
+	} else {
+		if (n->nlmsg_type == RTM_NEWTFILTER &&
+		(n->nlmsg_flags&NLM_F_CREATE)) {
+			tcf_destroy(tp);
+		}
+	}
 
 errout:
 	if (cl)
___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-29 Thread jamal
Hi there,

Sorry - cant help you on earlier question in the thread on
syntax of cls_group config (but we can revisit after resolving
this). You should always copy the maintainer if you want quick
answers (for cls_group case Thomas Graf).

On your patch:
I think you have found a real issue (I have a strong feeling
it has everything to do with your config process)

Comments below:

On Fri, 2009-05-29 at 14:18 +0900, Minoru Usui wrote:
> Hi, 


> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index 0759f32..756148b 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -266,11 +266,6 @@ replay:
>   goto errout;
>   }
>  
> - spin_lock_bh(root_lock);
> - tp->next = *back;
> - *back = tp;
> - spin_unlock_bh(root_lock);
> -
>   } else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind))
>   goto errout;
>  
> @@ -314,8 +309,17 @@ replay:
>   }
>  
>   err = tp->ops->change(tp, cl, t->tcm_handle, tca, &fh);
> - if (err == 0)
> - tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);
> + if (err) {
> + tcf_destroy(tp);
> + goto errout;
> + }
> +
> + spin_lock_bh(root_lock);
> + tp->next = *back;
> + *back = tp;
> + spin_unlock_bh(root_lock);
> +
> + tfilter_notify(skb, n, tp, fh, RTM_NEWTFILTER);

This is incorrect. tp may already exist and you dont want to destroy
for failure to change its parameters. You also dont want to reattach
an existing tp because it succeeded in parameter change. 
So the soln is to check if this is a new tp and then do what you did
above...
Did that make sense?

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [BUG] net_cls: Panic occured when net_cls subsystem use

2009-05-29 Thread jamal
Thanks for the CC Jarek.

On Fri, 2009-05-29 at 10:22 +, Jarek Poplawski wrote:

> > > I didn't verify this too much, so I might be wrong, but it looks like
> > > cls_cgroup_classify() does things a bit different than others (doesn't
> > > check the 'head' etc. for NULL), so maybe you should consider fixing
> > > it instead? (Btw., the tc classifier maintainer added to Cc).
> > 
> > OOPS! Others mostly don't check this either, so my suggestion was wrong.
> 
> Hmm... Or maybe I wasn't so wrong; it seems classifiers which
> don't assign the head during init do this check for NULL later.

It is good practise to initialize things in init() and use attributes
in change() to set them. 
I will read the thread and check out the code path then respond
more intelligently.

cheers,
jamal

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH][PKTGEN] Fix double unlock of xfrm_state->lock

2007-11-19 Thread jamal
On Mon, 2007-19-11 at 12:47 +0300, Pavel Emelyanov wrote:
> The pktgen_output_ipsec() function can unlock this lock twice
> due to merged error and plain paths. Remove one of the calls
> to spin_unlock.

Good catch.
Acked-by: Jamal Hadi Salim <[EMAIL PROTECTED]>

cheers,
jamal

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH] Virtual ethernet device (tunnel)

2007-05-02 Thread jamal
On Wed, 2007-02-05 at 14:34 +0200, Patrick McHardy wrote:

> Thats a lot better than using sysfs, but I think it would be
> preferrable to use rtnetlink instead of genetlink for network
> configuration.

or you can just hold rtnl while using genl.
I do agree it would be easier to just use rtnetlink ...

cheers,
jamal


___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel