Re: [PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread david

On Mon, 17 Dec 2007, Ludovico Gardenghi wrote:


> On Mon, Dec 17, 2007 at 04:10:19AM -0800, [EMAIL PROTECTED] wrote:


>> if you are talking network connections between virtual systems, then the
>> existing tap interfaces would seem to do everything you are looking for. you
>> can add them to bridges, route between them, filter traffic between them
>> (at whatever layer you want with netfilter), use multicast, etc. as you
>> would any real interface.
>>
>> if, however, you are talking about non-network communications (your example
>> of sending raw video frames across the interface), and want multiple
>> processes to receive them, this sounds like exactly the thing that splice
>> was designed to do: distribute data to multiple recipients simultaneously
>> and efficiently.


> I'll try to explain.
>
> Our first interest was to be able to interconnect virtual, real, and partial
> virtual machines. We developed VDE for this: a user-level L2 switch.
> Specific as it may be, it's quite popular as a simple but flexible tool.
> It can interconnect UML, Qemu, UMView, slirp, anything that can be
> connected to a tap interface, and so on.
>
> So, you say, it's a networking issue and we could live with tun/tap.
> There's a major point here: at present, dealing with tun/tap, bridges, and
> routing is quite difficult if you are a *regular* user with *no*
> capabilities at all. You have tun/tap persistence and association to a
> specific user (or group, recently), at most. That's good - we don't want
> regular users to mess with global networking rules and settings.
>
> Think of a bunch of heterogeneous virtual machines and partial virtual
> machines (i.e. VMs where only a subset of the system calls may be
> virtualized, depending on the parameters - that's the case of View-OS)
> that must be interconnected and that may or may not have a connection to a
> real network interface (maybe via a tunnel towards a different machine).
> There's no need for administrator intervention here. Why should a user
> have to ask root to create lots of tap interfaces for him, add them to a
> bridge and set up filtering/routing rules? What would the interface list
> look like when different users asked for the same thing at the same time?
>
> You could define a specific interconnecting bus, but we already have one:
> ethernet. VDE helps here, as it allows regular users to build distributed
> ethernet networks.
>
> VDE works fine, but at present it often becomes a bottleneck because of the
> high number of user processes involved and the user-kernel-user switches
> needed to transfer a single ethernet frame. Moving the core inside the
> kernel would limit this problem and result in faster communication, still
> with no need for root intervention or global namespace messing. (We are
> considering whether something can be done with containers or similar
> structures, both for networking and partial virtualization, but that's
> another topic.)


so it sounds like the real issue you are trying to deal with is that only 
root is allowed to make changes to the networking configuration, and you 
want to allow non-root users to make changes.


in doing this you started by duplicating the kernel networking 
functionality into userspace (your userspace L2 switch) and are running 
into performance problems, so you are trying to push this into the kernel 
to reduce context switches.


besides your approach I see two other options on their way into the 
kernel.


1. no changes, run your switch in a VM and your users (with their group 
permissions) connect their VM interfaces to the interfaces of the VM 
running the switch/filtering. this allows them 'root' inside the VM where 
they can make all these changes.


this may have the same performance problems as your current userspace 
switch.


2. networking virtualization. there is work being done to be able to have 
what would be essentially multiple networking stacks on a machine to allow 
a VM/container to control some things without having to go through the 
tun/tap interface. This would allow a user to change the filtering rules 
without the changes being global.
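
that namespace approach can be sketched roughly as below, assuming the
in-progress work lands with an unshare()/clone() flag along the lines of
CLONE_NEWNET (the flag used by the namespace patches). note that creating
the namespace itself still needs CAP_SYS_ADMIN, which ties in with the
point after this one:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef CLONE_NEWNET
#define CLONE_NEWNET 0x40000000   /* value used by the namespace patches */
#endif

int main(void)
{
    /* give this process its own network stack */
    if (unshare(CLONE_NEWNET) < 0) {
        perror("unshare(CLONE_NEWNET)");
        return 1;
    }
    /* from here on, interfaces, routes and netfilter rules created by this
     * process (or its children) do not show up in the host's global
     * configuration */
    printf("pid %d now has a private network stack\n", getpid());
    return 0;
}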


however, note that if the VMs are more than just a test-bed and actually 
need to talk to the outside world, at some point they will need to connect 
to the real interfaces, and making that connection should still require 
superuser privileges on the master kernel.


besides, using the standard networking stack has the advantage that if 
you end up needing to spread your VMs across multiple machines the 
support is already there, whereas adding a new IPC mechanism will require 
figuring out how to extend that mechanism across machines.


It also doesn't require the applications to be coded specifically for your 
mechanism. they just use standard networking APIs and the virtual 
connections happen for them.



> So we started thinking about how to use existing kernel structures, and we
> concluded that:
>
>  - no existing kernel structures appeared to be optimal for this work;
>  - if we had to design a new structure, it would have been more useful if
>    we tried to be as general as we could.

Re: [PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread Ludovico Gardenghi
On Mon, Dec 17, 2007 at 04:10:19AM -0800, [EMAIL PROTECTED] wrote:

> if you are talking network connections between virtual systems, then the 
> existing tap interfaces would seem to do everything you are looking for. you 
> can add them to bridges, route between them, filter traffic between them 
> (at whatever layer you want with netfilter), use multicast, etc. as you 
> would any real interface.
>
> if, however, you are talking about non-network communications (your example 
> of sending raw video frames across the interface), and want multiple 
> processes to receive them, this sounds like exactly the thing that splice 
> was designed to do: distribute data to multiple recipients simultaneously 
> and efficiently.

I'll try to explain.

Our first interest was to be able to interconnect virtual, real, and partial
virtual machines. We developed VDE for this: a user-level L2 switch.
Specific as it may be, it's quite popular as a simple but flexible tool.
It can interconnect UML, Qemu, UMView, slirp, anything that can be
connected to a tap interface, and so on.

So, you say, it's a networking issue and we could live with tun/tap.
There's a major point here: at present, dealing with tun/tap, bridges, and
routing is quite difficult if you are a *regular* user with *no*
capabilities at all. You have tun/tap persistence and association to a
specific user (or group, recently), at most. That's good - we don't want
regular users to mess with global networking rules and settings.
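
For reference, this is roughly all a plain user gets with tun/tap today: a
persistent tap device tied to one user (or group). A minimal sketch using
the real TUNSET* ioctls, error handling trimmed; everything beyond this -
bridging, routing, filtering - still needs root:

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <net/if.h>
#include <linux/if_tun.h>
#include <unistd.h>

int make_persistent_tap(const char *name, uid_t owner, gid_t group)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;

    memset(&ifr, 0, sizeof(ifr));
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);

    if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
        ioctl(fd, TUNSETOWNER, owner) < 0 ||   /* per-user association      */
        ioctl(fd, TUNSETGROUP, group) < 0 ||   /* per-group (recent kernels)*/
        ioctl(fd, TUNSETPERSIST, 1) < 0) {     /* keep it after close()     */
        close(fd);
        return -1;
    }
    return fd;   /* root still has to bridge/route it by hand */
}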

Think of a bunch of heterogeneous virtual machines and partial virtual
machines (i.e. VMs where only a subset of the system calls may be
virtualized, depending on the parameters - that's the case of View-OS)
that must be interconnected and that may or may not have a connection to a
real network interface (maybe via a tunnel towards a different machine).
There's no need for administrator intervention here. Why should a user
have to ask root to create lots of tap interfaces for him, add them to a
bridge and set up filtering/routing rules? What would the interface list
look like when different users asked for the same thing at the same time?

You could define a specific interconnecting bus, but we already have one:
ethernet. VDE helps here, as it allows regular users to build distributed
ethernet networks.

VDE works fine, but at present it often becomes a bottleneck because of the
high number of user processes involved and the user-kernel-user switches
needed to transfer a single ethernet frame. Moving the core inside the
kernel would limit this problem and result in faster communication, still
with no need for root intervention or global namespace messing. (We are
considering whether something can be done with containers or similar
structures, both for networking and partial virtualization, but that's
another topic.)

So we started thinking about how to use existing kernel structures, and we
concluded that:

 - no existing kernel structures appeared to be optimal for this work;
 - if we had to design a new structure, it would have been more useful if
   we tried to be as general as we could.

At present we're still focused on networking and other applications are
just examples, but we thought that adding a general, extensible multipoint
IPC family is better than adding a solution specific to our current
problem.

Perhaps people with experience in other fields can tell us whether there
are other problems that could be solved, optimized, or simply made simpler
with IPN. Maybe our proposal is not the best in terms of interface and
semantics, but we feel that it may fill an "empty space" among the
available IPC mechanisms with a quite simple but powerful approach.

> for a new family to be valuable, you need to show what it does that isn't 
> available in existing families.

Is it "more acceptable" to add a new address family or to add features to
existing ones? (My question is purely informative; I don't want to sound
sarcastic or anything.) For instance, someone proposed "let's just add
access control to the netlink family". That seems like tough work.

You proposed splice, others have proposed multicast or netlink. If I have
understood correctly, splice helps copy data to different destinations very
quickly. But it needs a userspace program that receives the data, iterates
over the fds and splices the data "out", calling a syscall for each
destination. Syscalls may have become very fast, but we still notice
slowdowns due to the reasons I've explained before.
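
(To make that cost concrete, here is a minimal sketch of such a fan-out
loop - illustrative names only, built on the real tee(2)/splice(2) calls:
one syscall per destination per frame.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>

/* src_rd: read end of the source pipe; dst_wr[]: write ends of the n
 * destination pipes.  Returns 0 on success, -1 on error. */
int fan_out_frame(int src_rd, const int *dst_wr, int n, size_t frame_len)
{
    /* tee() duplicates the pipe contents without consuming them ... */
    for (int i = 0; i < n - 1; i++)
        if (tee(src_rd, dst_wr[i], frame_len, 0) < 0)
            return -1;

    /* ... and a final splice() moves the data to the last destination,
     * consuming it from the source pipe.  Either way, delivering one
     * frame to N readers costs N syscalls in the hub process. */
    if (splice(src_rd, NULL, dst_wr[n - 1], NULL, frame_len, 0) < 0)
        return -1;
    return 0;
}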

--- (the following is not related to IPN but I wanted to answer this too)

> I'm not familiar enough with ptrace vs utrace to know this argument. but I 
> haven't heard any of the virtualization people complaining about the 
> existing interfaces. They seem to have been happily using them for a 
> number of years.

ptrace has a number of drawbacks that have been partially addressed by
adding flags and parameters for "cheating" and obtaining better
performance. It's *slow* exp

Re: [PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread david

On Mon, 17 Dec 2007, Ludovico Gardenghi wrote:


> On Mon, Dec 17, 2007 at 03:31:48AM -0800, [EMAIL PROTECTED] wrote:


>> wouldn't it be better to just add the ability for multiple writers to send
>> to the same pipe, and then have all of them splice into the output of that
>> pipe? this would give the same data-agnostic communication that you are
>> looking for, and (with the minor detail that software would have to filter
>> out the messages it sent itself) would appear to meet all the goals you are
>> looking at, using existing kernel features that are designed to be very
>> high performance.


> Being able to define both filtering policies (think of a virtual ethernet
> layer-2 switch, for instance: we have situations where dozens or hundreds
> of virtual cables are connected to the same switch, and it would be much,
> much slower to wake every user process for each single non-broadcast
> ethernet frame and send it useless data) and delivery guarantees (lossless
> vs. best-effort delivery) is not a minor detail in our opinion.
>
> We might have added a layer-2 virtual ethernet switch at kernel level, but
> it seemed too specific. With a minor effort we have split the "dumb" bus
> (IPN) from the ability to process specific structured data with specific
> policies (sub-modules such as kvde_switch).


it seems like you are mixing your use cases and arguing reasons for one 
when answering questions about another.


if you are talking network connections between virtual systems, then the 
existing tap interfaces would seem to do everything you are looking for. 
you can add them to bridges, route between them, filter traffic between 
them (at whatever layer you want with netfilter), use multicast, etc. as 
you would any real interface.


if, however, you are talking about non-network communications (your 
example of sending raw video frames across the interface), and want 
multiple processes to receive them, this sounds like exactly the thing 
that splice was designed to do: distribute data to multiple recipients 
simultaneously and efficiently.


I think you need to separate out these two use cases (and any others you 
are advocating this for) and argue each one on its own.



> We could surely adapt existing features (AF_UNIX, or pipes), but they have
> well-established interfaces and semantics, and we think it is better to
> add a new family. This avoids breaking what already exists and leaves more
> freedom in defining the new family according to our needs.


for a new family to be valuable, you need to show what it does that isn't 
available in existing families.



> As for ptrace vs. utrace: ptrace was designed for debugging; trying to
> bend it to fit virtualization is likely to end up in an intricate
> interface and implementation. utrace was designed in a much more general
> way. You can implement ptrace on top of utrace, but you can also use
> utrace for virtualization in a cleaner, simpler and more efficient way.
> Why not?


I'm not familiar enough with ptrace vs utrace to know this argument. but I 
haven't heard any of the virtualization people complaining about the 
existing interfaces. They seem to have been happily using them for a 
number of years.


David Lang


Re: [PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread Ludovico Gardenghi
On Mon, Dec 17, 2007 at 03:31:48AM -0800, [EMAIL PROTECTED] wrote:

> wouldn't it be better to just add the ability for multiple writers to send 
> to the same pipe, and then have all of them splice into the output of that 
> pipe? this would give the same data-agnostic communication that you are 
> looking for, and (with the minor detail that software would have to filter 
> out the messages it sent itself) would appear to meet all the goals you are 
> looking at, using existing kernel features that are designed to be very 
> high performance.

Being able to define both filtering policies (think of a virtual ethernet
layer-2 switch, for instance: we have situations where dozens or hundreds
of virtual cables are connected to the same switch, and it would be much,
much slower to wake every user process for each single non-broadcast
ethernet frame and send it useless data) and delivery guarantees (lossless
vs. best-effort delivery) is not a minor detail in our opinion.

We might have added a layer-2 virtual ethernet switch at kernel level, but
it seemed too specific. With a minor effort we have split the "dumb" bus
(IPN) from the ability to process specific structured data with specific
policies (sub-modules such as kvde_switch).

We could surely adapt existing features (AF_UNIX, or pipes), but they have
well-established interfaces and semantics, and we think it is better to
add a new family. This avoids breaking what already exists and leaves more
freedom in defining the new family according to our needs.

As for ptrace vs. utrace: ptrace was designed for debugging; trying to
bend it to fit virtualization is likely to end up in an intricate
interface and implementation. utrace was designed in a much more general
way. You can implement ptrace on top of utrace, but you can also use
utrace for virtualization in a cleaner, simpler and more efficient way.
Why not?
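
(To illustrate why bending ptrace to full system-call virtualization is
clunky, here is a rough sketch of the standard interception loop - not the
View-OS code, just the canonical pattern: the tracee is stopped and the
tracer woken twice for every single system call, once at entry and once at
exit.)

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* 'child' has already called PTRACE_TRACEME and exec'd */
void trace_syscalls(pid_t child)
{
    int status;
    waitpid(child, &status, 0);                    /* initial stop */

    while (1) {
        /* run until the next syscall ENTRY ... */
        if (ptrace(PTRACE_SYSCALL, child, 0, 0) < 0)
            break;
        if (waitpid(child, &status, 0) < 0 || WIFEXITED(status))
            break;

        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, 0, &regs);   /* inspect/rewrite here */

        /* ... and again until the syscall EXIT: two stops (and two full
         * round trips to the tracer) per system call of the tracee */
        if (ptrace(PTRACE_SYSCALL, child, 0, 0) < 0)
            break;
        if (waitpid(child, &status, 0) < 0 || WIFEXITED(status))
            break;
    }
}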

Ludovico
-- 
<[EMAIL PROTECTED]>#acheronte (irc.freenode.net) ICQ: 64483080
GPG ID: 07F89BB8  Jabber: [EMAIL PROTECTED] Yahoo: gardenghelle
-- This is signature nr. 3556


Re: [PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread david

On Mon, 17 Dec 2007, Renzo Davoli wrote:


> Inter Process Networking (PATCH):
>
> 1. WHAT IS IPN?
> ---
>
> IPN is a new address family designed for one-to-many, many-to-many and
> peer-to-peer communication among processes.
> Berkeley sockets were designed for client-server or point-to-point
> communication; AF_UNIX does not support multicast/broadcast. AF_IPN
> does, in a simple, efficient but extensible way.
> IPN is an Inter Process Communication paradigm where all the processes
> appear as if they were connected by a networking bus.
>
> On IPN, processes can interoperate using real networking protocols
> (e.g. ethernet) but also using application-defined protocols (maybe
> just sending ASCII strings, video or audio frames, etc.).
> IPN provides networking (in the broadest sense you can imagine) to
> the processes. Processes can be ethernet nodes, run their own TCP/IP
> stacks if they like (e.g. virtual machines), mount ATA-over-Ethernet
> disks, and so on.
>
> IPN networks can be interconnected with real networks, and IPN networks
> running on different computers can interoperate (they can be connected by
> virtual cables).
>
> IPN is part of the Virtual Square Project (vde, lwipv6, view-os,
> umview/kmview, see wiki.virtualsquare.org).


other than the fact that this is bi-directional, how is this better than 
using pipes and splice?


wouldn't it be better to just add the ability for multiple writers to send 
to the same pipe, and then have all of them splice into the output of that 
pipe? this would give the same data-agnostic communication that you are 
looking for, and (with the minor detail that software would have to filter 
out the messages it sent itself) would appear to meet all the goals you are 
looking at, using existing kernel features that are designed to be very 
high performance.


David Lang


[PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread Renzo Davoli
Inter Process Networking (PATCH):

This patch adds a new address family for inter process communication.
AF_IPN: inter process networking, i.e. multipoint,
multicast/broadcast communication among processes (and networks).

Contents of this document:

1. What is IPN?
2. Why IPN?
2.1 Why IPN instead of IP Multicast?
2.2 Why IPN instead of AF_NETLINK?
3. How?

We've read all the comments in the previous thread about IPN and we've
tried to answer.

1. WHAT IS IPN?
---

IPN is a new address family designed for one-to-many, many-to-many and 
peer-to-peer communication among processes.
Berkeley sockets were designed for client-server or point-to-point
communication; AF_UNIX does not support multicast/broadcast. AF_IPN
does, in a simple, efficient but extensible way.
IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.

On IPN, processes can interoperate using real networking protocols 
(e.g. ethernet) but also using application-defined protocols (maybe 
just sending ASCII strings, video or audio frames, etc.).
IPN provides networking (in the broadest sense you can imagine) to
the processes. Processes can be ethernet nodes, run their own TCP/IP
stacks if they like (e.g. virtual machines), mount ATA-over-Ethernet
disks, and so on.

IPN networks can be interconnected with real networks, and IPN networks
running on different computers can interoperate (they can be connected by
virtual cables).

IPN is part of the Virtual Square Project (vde, lwipv6, view-os, 
umview/kmview, see wiki.virtualsquare.org).
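
(A purely hypothetical usage sketch follows, only to make the "processes on
a bus named by the filesystem" idea concrete.  The real AF_IPN constants,
socket type, address layout and ioctls are defined by the patch and are not
reproduced here, so everything below - the placeholder AF_IPN value,
SOCK_RAW, the AF_UNIX-style pathname address - is an assumption for
illustration only.)

#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#ifndef AF_IPN
#define AF_IPN 34                 /* placeholder value, not from the patch */
#endif

int join_ipn_bus(const char *path)
{
    int fd = socket(AF_IPN, SOCK_RAW, 0);        /* type/protocol assumed */
    if (fd < 0)
        return -1;

    struct sockaddr_un addr;                     /* address layout assumed */
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_IPN;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    /* the filesystem permissions of 'path' decide who may join the bus */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    /* from here on, send() hands a message to every other member of the
     * bus (subject to the delivery policy of the IPN protocol in use), and
     * recv() returns whole messages sent by the others */
    return fd;
}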

2. WHY IPN?
---
Many applications can benefit from IPN.
First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.
IPN can be useful for applications where one or more processes feed their
data (*any kind* of data, not only networking-related messages) to several
consuming processes (which may join the stream at run time). IPN sockets
can also be connected to tap (tun/tap)-like interfaces or to real
interfaces (like "brctl addif").
There are specific ioctls to define a tap interface or grab an existing
one.

Several existing services could be implemented (and often could have
extended features) on top of IPN:
 - kernel Ethernet bridging
 - TUN/TAP
 - MACVLAN

IPN could be used (IMHO) to provide multicast services to processes.
Audio or video frames could be multiplexed so that multiple applications
can use them. I think that something like Jack could be implemented on top
of IPN. Something like a VideoJack could provide video frames to several
applications: e.g. the same image from a camera could be viewed by xawtv,
recorded, and sent to a streaming service.
IPN sockets can be used wherever there is the idea of a broadcast channel,
i.e. where processes can "join (and leave) the information flow" at
runtime. IPN can be seen as "publish and subscribe".
Different delivery policies can be defined as IPN protocols (loaded as
submodules of ipn.ko).
For instance, an ethernet switch is a policy (kvde_switch.ko: packets are
delivered unicast if the MAC address is already in the switching hash
table). We are designing an extended switch, full of interesting features
like our userland vde_switch (with VLAN/FST/management, etc.), and a
layer-3 switch, but other policies can be defined to implement the
specific requirements of other services. I feel that there are no limits
to creativity about multicast services for processes. Userspace services
(like vde) do exist, but IPN provides faster, unified support.
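
(As an illustration of what such a policy decides per frame - a sketch, not
the kvde_switch.ko code: learn which port each source MAC was seen on,
deliver known unicast destinations to a single port, and flood only
broadcast/multicast or unknown destinations.)

#include <stdint.h>
#include <string.h>

#define HASH_SIZE 4096

struct mac_entry {
    uint8_t mac[6];
    int     port;
    int     used;
};

static struct mac_entry table[HASH_SIZE];

static unsigned hash_mac(const uint8_t *mac)
{
    unsigned h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % HASH_SIZE;
}

/* Returns the single destination port for a known unicast MAC,
 * or -1 meaning "flood to every port". */
int switch_policy(const uint8_t *frame, int in_port)
{
    const uint8_t *dst = frame;        /* destination MAC: bytes 0-5  */
    const uint8_t *src = frame + 6;    /* source MAC:      bytes 6-11 */

    /* learn: remember which port this source lives behind */
    struct mac_entry *s = &table[hash_mac(src)];
    memcpy(s->mac, src, 6);
    s->port = in_port;
    s->used = 1;

    if (dst[0] & 1)                    /* group bit: broadcast/multicast */
        return -1;

    struct mac_entry *d = &table[hash_mac(dst)];
    if (d->used && memcmp(d->mac, dst, 6) == 0)
        return d->port;                /* wake only one endpoint */
    return -1;                         /* unknown destination: flood */
}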

2.1 Why IPN instead of IP Multicast?

 - IPN seems to be faster than IP Multicast (see my message to LKML
   of Dec 06).
 - IPN provides file system permissions to access the communication medium,
   and it uses the file system for naming.
 - IPN does not need any tunneling or packet encapsulation; it works as a
   layer-1 virtual network (for comparison, a minimal IP multicast receiver
   is sketched after this list).
 - IPN protocols (implemented by kernel submodules) provide forwarding
   policies: the set of recipients for each message is computed from the
   contents of the message itself.
   Ethernet virtual switches or other routing rules for any kind of data
   can be implemented as IPN protocols.
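
(For comparison, joining an IPv4 multicast group looks roughly like this -
standard sockets API, illustrative group and port: membership is expressed
through group addresses, ports and setsockopt() rather than through a
filesystem name and its permissions.)

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int join_ip_multicast(const char *group, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto err;

    /* membership is per group address + interface */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(group);   /* e.g. "239.1.2.3" */
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0)
        goto err;

    return fd;          /* recvfrom() now yields the group's UDP datagrams */
err:
    close(fd);
    return -1;
}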

2.2 Why IPN instead of AF_NETLINK?
--
 - Netlink was designed for user-to-kernel communication.
 - Netlink is missing many features needed to provide services similar to
   IPN.
 - Currently multicast seems to be allowed for root only. Access control
   would have to be added from scratch.
 - The netlink interface for user processes is not very immediate (libnl
   was developed as a higher-level solution to that).
 - Netlink already seems to suffer from "overpopulation":
   NETLINK_GENERIC has been added for "simplified netlink usage" but it
   adds yet another header and rules to be followed.
 - Netlink is quite rigid about message delivery guarantees: unicast
   implies lossless co