Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-06-07 Thread Renzo Davoli
Hi Roman,

On Thu, Jun 06, 2019 at 10:11:57PM +0200, Roman Penyaev wrote:
> Hi Renzo,
> On 2019-06-03 17:00, Renzo Davoli wrote:
> > Please, have a look of the README.md page here:
> > https://github.com/virtualsquare/vuos
> Is that similar to what user-mode linux does?  I mean the principle.

let us write this proportion:
user-mode-linux / umvu = linux / namespace

In a comparison between user-mode linux and umvu,
the way to get the system call requests is the same (ptrace)
but the goal is different.

user-mode linux catches all the system calls and none of them is forwarded to
the real kernel: it uses a linux kernel compiled as a process to give processes
the illusion of living on another machine.

umvu catches all the system calls and then decides whether each syscall must
be forwarded to the kernel (maybe modified) or entirely processed at user
level by the hypervisor (by means of specific plug-in modules like vufuse
for file systems, vudev for devices and so on).
While the "illusion" of user-mode linux is a global illusion, the "illusion"
provided by umvu is limited and configurable. After a "mount" of a filesystem
using vufuse, the file system tree is the same *except for* the subtree of the
mountpoint. The illusion is limited to that subtree, as only the system call
requests for paths inside the subtree are processed by umvu and its modules.

It is similar to a namespace implemented at user level.
With respect to namespaces:
* umvu does not change the attack surface of the kernel (it is just a
  virtualization -- a.k.a. illusion -- provided by a user process to other
  user processes)
* umvu can provide features not currently supported by the kernel (e.g. a
  file system organization unavailable as kernel code, networking stacks at
  user level etc.)
* ...

umvu is an implementation of vuos concepts using ptrace. In the future it may
be possible to reimplement the same idea of partial virtual machines using
other syscall tracing/filtering tools.
> 
> > I am not trying to port some tools to use user-space implemented
> > stacks or device drivers/emulators, I am seeking a general-purpose
> > approach.
> 
> You still intercept *each* syscall, why not do the same for epoll_wait()
> and replace events with the correct value?  It seems you already do
> something similar in vu_wrap_poll.c: wo_epoll_wait(), right?
>
> Don't get me wrong, I really want to understand whether everything really
> looks so bad without the proposed change. It seems not, because the whole
> principle is based on interception of each syscall, thus one more, one
> less: it does not become cleaner and especially does not look like the
> general-purpose solution which you seek.  I may be wrong.
Your comments are precious. Thank you: I see that you have browsed my code
to get a better view of the problem.

umvu is a modular tool. The executable of umvu is a dispatcher between the
system call requests coming from the user processes and modules (loaded at
run time as dynamic plug-in libraries)

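+-----------------+     +------+     +----------------------------------+
|processes running|<--->| umvu |<--->| module (e.g. vufuse/vudev/vunet) |
|  "inside" umvu  |     +------+     +----------------------------------+
+-----------------+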

Each module "registers" its "responsibilities" with umvu.
It can register:
* a pathname (it will receive the syscall requests for that subtree)
* an address_family (all the syscalls for sockets of that AF)
* major/minor numbers of a char or block device
* a system call number
* 
(each module can register more items)
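
As an illustration only, the pathname case of this registration can be
modelled as a longest-prefix lookup. The sketch below is not taken from the
umvu sources: struct module, in_subtree and dispatch_path are hypothetical
names, and a real dispatcher would also have to handle address families,
device numbers and system call numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical module descriptor: NOT the actual umvu data structure. */
struct module {
    const char *name;
    const char *path_prefix;   /* subtree this module is responsible for */
};

/* Return 1 if path lies inside the subtree rooted at prefix, else 0. */
static int in_subtree(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);

    if (strncmp(path, prefix, n) != 0)
        return 0;
    if (n > 0 && prefix[n - 1] == '/')
        return 1;               /* prefix such as "/" already ends a component */
    /* "/mnt" must match "/mnt" and "/mnt/x" but not "/mntx" */
    return path[n] == '\0' || path[n] == '/';
}

/*
 * Pick the module with the longest matching prefix.
 * A NULL result stands for "no module: forward the request to the kernel".
 */
static const struct module *dispatch_path(const struct module *mods,
                                          size_t nmods, const char *path)
{
    const struct module *best = NULL;
    size_t best_len = 0;

    assert(path != NULL);
    for (size_t i = 0; i < nmods; i++) {
        size_t len = strlen(mods[i].path_prefix);

        if (in_subtree(path, mods[i].path_prefix) &&
            (best == NULL || len > best_len)) {
            best = &mods[i];
            best_len = len;
        }
    }
    return best;
}
```

With two registered prefixes, say "/mnt" and "/dev/hda", a request for
"/mnt/file" would go to the first module, while "/etc/passwd" would match
no module and be left to the real kernel.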

The problem is not in the dialogue between umvu and the user processes
(<---> on the left in the diagram above) but between umvu and its modules
(<---> on the right).
(wi_epoll_wait, wd_epoll_wait, wo_epoll_wait are the three wrappers used
 respectively before, during and after epoll_wait in the dialogue on the left
 with the user processes).

When a user process generates a "read" syscall request and umvu discovers that
the fd is managed by vufuse, it forwards to vufuse a "read" request
having the same signature as the "read" system call (plus a trailing fdprivate
arg for syscalls using a fd. This arg can be used to speed up virtualization
but can be safely ignored).

If for the same "read" request the file descriptor is managed by vunet,
it is forwarded to vunet (actually it is converted to "recvmsg": if fd is a
socket, recvmsg manages all of read/recv/recvfrom/recvmsg; umvu tends to
simplify the API by unifying similar system calls).
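
To make the unification concrete, here is a minimal sketch of how a plain
"read" on a socket can be expressed through recvmsg with a single iovec and
no flags. The helper name read_via_recvmsg is made up for this example; it
is not the code umvu uses.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * read(2)-like call on a socket, expressed as recvmsg(2).
 * Hypothetical helper for illustration, not part of umvu.
 */
static ssize_t read_via_recvmsg(int fd, void *buf, size_t count)
{
    struct iovec iov;
    struct msghdr msg;

    assert(buf != NULL);
    iov.iov_base = buf;
    iov.iov_len = count;

    memset(&msg, 0, sizeof(msg));   /* no peer address, no control data */
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return recvmsg(fd, &msg, 0);    /* flags == 0 behaves like read on a socket */
}
```

recv and recvfrom map in the same way: recv adds its flags argument, and
recvfrom additionally fills msg_name/msg_namelen.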

But what about poll/epoll/ppoll/select/pselect?
umvu takes care of all the system call requests but it needs a clean way to ask
modules some feedback when the expected events happen.

I think the cl

Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-06-03 Thread Renzo Davoli
Hi Roman,

I am sorry for the delay in my answer, but I needed to set up a minimal
tutorial to show what I am working on and why I need a feature like the
one I am proposing.

Please, have a look at the README.md page here:
https://github.com/virtualsquare/vuos
(everything can be downloaded and tested)

On Fri, May 31, 2019 at 01:48:39PM +0200, Roman Penyaev wrote:
> Since each such a stack has a set of read/write/etc functions you always
> can extend you stack with another call which returns you event mask,
> specifying what exactly you have to do, e.g.:
> 
> nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
> for (n = 0; n < nfds; ++n) {
>     struct sock *sock;
>
>     sock = events[n].data.ptr;
>     events = sock->get_events(sock, &events[n]);
>
>     if (events & EPOLLIN)
>         sock->read(sock);
>     if (events & EPOLLOUT)
>         sock->write(sock);
> }
> 
> 
> With such a virtual table you can mix all userspace stacks and even
> with normal sockets, for which 'get_events' function can be declared as
> 
> static poll_t kernel_sock_get_events(struct sock *sock,
>                                      struct epoll_event *ev)
> {
>     return ev->events;
> }
> 
> Do I miss something?

I am not trying to port some tools to use user-space implemented stacks or
device drivers/emulators, I am seeking a general-purpose approach.

I think that the example in the section of the README "mount a user-level
networking stack" explains the situation.

The submodule vunetvdestack uses a namespace to define a networking stack
connected to a VDE network (see https://github.com/rd235/vdeplug4).

The API is clean (as can be seen at the end of the file
vunet_modules/vunetvdestack.c).
All the methods but "socket" are directly mapped to their system call
counterparts:

struct vunet_operations vunet_ops = {
  .socket = vdestack_socket,
  .bind = bind,
  .connect = connect,
  .listen = listen,
  .accept4 = accept4,

  .epoll_ctl = epoll_ctl,
...
}

(the elegance of the API can be seen also in vunet_modules/vunetreal.c: a
 38-line module implementing a gateway to the real networking of the hosting
 machine)

Unfortunately I cannot use the same clean interface to support stacks
implemented as user libraries, like lwip/lwipv6/picotcp, because I cannot
generate EPOLL events...

Byzantine workarounds can be implemented, based on data structures exchanged
in the data.ptr field of epoll_event that must be decoded by the hypervisor to
retrieve the missing information about the event... but it would be a pity ;-)

The same problem arises in umdev modules: virtual devices should generate the
same EPOLL events as their real counterparts.

I feel that the ability to generate/synthesize EPOLL events could be useful for
many projects.
(In my first message I included some URLs of people looking for this feature,
 retrieved by some queries on a web search engine)

Implementations may vary, as well as the kernel API to support such a feature.
As I said, my proposal has a minimal impact on the code: it does not require
the definition of new syscalls, it simply enhances the features of eventfd.

> 
> Eventually you come up with such a lock to protect your tcp or whatever
> state machine.  Or you have a real example where read and write paths
> can work completely independently?

Actually the umvu hypervisor uses concurrent tracing of concurrent processes.
We have named this technique "guardian angels": each process/thread running in
the partial virtual machine has a corresponding thread in the hypervisor.
So if a process uses two threads to manage a network connection (say a TCP
stream), the two guardian angels replicate their requests towards the
networking module.

So I am looking for a general solution, not for a pattern to port some projects
(and I cannot use two different approaches for event-driven and multi-threaded
 implementations as I have to support both).

If you reached this point...  Thank you for your patience.
I am more than pleased to receive further comments or proposals.

renzo


Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-05-31 Thread Renzo Davoli
Hi Roman,

On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
> On 2019-05-27 15:36, Renzo Davoli wrote:
> > Unfortunately this approach cannot be applied to
> > poll/select/ppoll/pselect/epoll.
> 
> If you have to override other system calls, what is the problem with
> overriding the poll family?  It will add, let's say, 50 extra lines of
> complexity to your userspace code.  All you need is to be woken up by *any*
> event and check one mask variable, in order to understand what you need to
> do: read or write, basically exactly what you do in your eventfd
> modification, but only in userspace.

This approach would not scale. If I want to use both a (user-space) network
stack and an (emulated) device (or more stacks and devices), which (overridden)
poll would I use?

The poll of the first stack is not able to deal with the third device.

> 
> 
> > > Why can it not be less than 64?
> > This is the implementation of 'write'. The 64 bits include the 'command'
> > EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS (in the
> > most significant 32 bits) and the set of events (in the lowest 32 bits).
> 
> Do you really need add/del/mod semantics?  Userspace still has to keep the
> mask somewhere, so you can have one simple command, which does:
>     ctx->count = events;
> in the kernel, so no masks and no games with bits are needed.  That will
> simplify the API.

It is true, at the price of having more complex code in user space.
Other system calls could have been implemented as "set the value"; instead
they have ADD/DEL modification flags.
I mean for example sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK), or even
epoll_ctl.
While poll requires the program to keep the struct pollfd array stored
somewhere, epoll is more powerful and flexible as different file descriptors
can be added and deleted by different modules/components.

If I have two threads implementing the send and receive paths of a socket in a
user-space network stack implementation, the epoll pending bitmap is shared, so
I have to create critical sections like the following one any time I need to
set or reset a bit:
    pthread_mutex_lock(mylock);
    events |= EPOLLIN;
    write(efd, &events, sizeof(events));
    pthread_mutex_unlock(mylock);
Using add/del semantics, locking is not required, as the send path thread deals
with EPOLLOUT while its sibling receive thread uses EPOLLIN or EPOLLPRI.

I would prefer the add/del/mod semantics, but if this is generally perceived as
an unnecessary complexity in the kernel code I can update my patch.

Thank you Roman,

renzo


[PATCH v3 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-05-31 Thread Renzo Davoli
eceives an event (or a set of events)
it prints it and disarms it.
The following shell session shows a sample run of the program:
timeout...
timeout...
GOT event 1
timeout...
GOT event 1
timeout...
GOT event 3
timeout...
GOT event 2
timeout...
GOT event 4
timeout...
GOT event 10

Program source:
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>             /* Definition of uint64_t */

#ifndef EFD_VPOLL
#define EFD_VPOLL (1 << 1)
#define EFD_VPOLL_ADDEVENTS (1ULL << 32)
#define EFD_VPOLL_DELEVENTS (2ULL << 32)
#define EFD_VPOLL_MODEVENTS (3ULL << 32)
#endif

#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)

/* Write one request (command | event bitmap) to the EFD_VPOLL eventfd. */
static void vpoll_ctl(int fd, uint64_t request) {
	ssize_t s;
	s = write(fd, &request, sizeof(request));
	if (s != sizeof(uint64_t))
		handle_error("write");
}

int
main(int argc, char *argv[])
{
	int efd, epollfd;
	struct epoll_event ev;
	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP |
		EPOLLPRI;
	ev.data.u64 = 0;

	efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
	if (efd == -1)
		handle_error("eventfd");
	epollfd = epoll_create1(EPOLL_CLOEXEC);
	if (epollfd == -1)	/* check the descriptor just created */
		handle_error("epoll_create1");
	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
		handle_error("epoll_ctl");

	switch (fork()) {
	case 0:
		/* child: generate events at user level */
		sleep(3);
		vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
		sleep(2);
		vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
		sleep(2);
		vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
		sleep(2);
		vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
		sleep(2);
		vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
		sleep(2);
		vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
		exit(EXIT_SUCCESS);
	default:
		/* parent: wait for the events, print and disarm them */
		while (1) {
			int nfds;
			nfds = epoll_wait(epollfd, &ev, 1, 1000);
			if (nfds < 0)
				handle_error("epoll_wait");
			else if (nfds == 0)
				printf("timeout...\n");
			else {
				printf("GOT event %x\n", ev.events);
				vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
				if (ev.events & EPOLLHUP)
					break;
			}
		}
		break;	/* do not fall through into the fork error path */
	case -1:
		handle_error("fork");
	}
	close(epollfd);
	close(efd);
	return 0;
}

Signed-off-by: Renzo Davoli 
Reported-by: kbuild test robot 

---
 fs/eventfd.c   | 116 +++--
 include/linux/eventfd.h|   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 117 insertions(+), 8 deletions(-)

Changes in v2:
 - Fix size of EFD_VPOLL_*EVENTS constants for 32 bit architectures

Changes in v3:
 - Fix sparse warnings and wrong arg of wake_up_locked_poll in 
eventfd_vpoll_write

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..6cdb1b854341 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -24,18 +24,32 @@
 #include 
 #include 
 
+#define EPOLLALLMASK64 ((__force __u64)EPOLLALLMASK)
+
 static DEFINE_IDA(eventfd_ida);
 
 struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
/*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
+* If the EFD_VPOLL flag was NOT set at eventfd creation:
+*   Every time that a write(2) is performed on an eventfd, the
+*   value of the __u64 being written is added to "count" and a
+*   wakeup is performed on "wqh". A read(2) will return the "count"
+*   value to userspace, and will reset "count" to zero (or decrement
+*   "count" by 1 if the flag EFD_SEMAPHORE has been set). Th

Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-05-27 Thread Renzo Davoli
On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> > This patch implements an extension of eventfd to define file descriptors 
> > whose I/O events can be generated at user level. These file descriptors
> > trigger notifications for [p]select/[p]poll/epoll.
> > 
> > This feature is useful for user-level implementations of network stacks
> > or virtual device drivers as libraries.
> 
> How can this be used to create a "virtual device driver"?  Do you have
> any examples of this new interface being used anywhere?

Networking programs use system calls implementing the Berkeley sockets API:
socket, accept, connect, listen, recv*, send* etc.  Programs dealing with a
device use system calls like open, read, write, ioctl etc.

When somebody wants to write a library able to behave like a network stack (say
lwipv6, picotcp) or a device, they can implement functions like my_socket,
my_accept, my_open or my_ioctl, as drop-in replacements for their system
call counterparts.  (It is also possible to use dynamic library magic to
rename/divert the system call requests to use their 'virtual'
implementation provided by the library: socket maps to my_socket, recv
to my_recv etc).

In this way portability and compatibility are easier, using a well known API
instead of inventing new ones.

Unfortunately this approach cannot be applied to
poll/select/ppoll/pselect/epoll.  These system calls can refer at the same time
to file descriptors created by 'real' system calls like socket, open,
signalfd... and to file descriptors returned by my_open, your_socket.

> 
> Also, meta-comment, you should provide some sort of test to kselftests
> for your new feature so that it can actually be tested, as well as a man
> page update (separately).
Sure. I'll do it ASAP, let me collect suggestions first.

> 
> > Development and porting of code often requires finding a way to wait for
> > I/O events both coming from file descriptors and generated by user-level
> > code (e.g. user-implemented net stacks or drivers).  While it is possible
> > to provide partial support (e.g. using pipes or socketpairs), a clean and
> > complete solution is still missing (as far as I have seen); e.g. I have
> > not seen any clean way to generate EPOLLPRI, EPOLLERR, etc.
> 
> What's wrong with pipes or sockets for stuff like this?  Why is epoll
> required?
Example:
suppose there is an application waiting for a TCP OOB message. It uses poll to
wait for POLLPRI and then reads the message (e.g. by 'recv').
If I want to port that application to use a network stack implemented as a
library, I have to rewrite the code around 'poll', as it is not possible to
receive a POLLPRI.
From a pipe I can only receive a POLLIN; I have to encode any further
information in an external data structure.
Using EFD_VPOLL the solution is straightforward: the function my_socket (used
in place of socket to create a file descriptor behaving as a 'real' socket)
returns a file descriptor created by eventfd/EFD_VPOLL, so the poll system call
can be left unmodified in the code. When the OOB message is available the
library can trigger an EPOLLPRI and the message can be received using my_recv.
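
The limitation of the pipe workaround can be checked with a few lines of
standard C. This is only a sketch (poll_pipe_revents is a name chosen for the
example): it writes one byte into a pipe and reports which of the requested
events poll returns for the read end.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Make a pipe readable and return the revents that poll(2) reports for its
 * read end when asked for the events in "wanted"; returns -1 on failure.
 */
static short poll_pipe_revents(short wanted)
{
    int p[2];
    struct pollfd pfd;
    short revents = -1;

    assert(wanted != 0);
    if (pipe(p) == -1)
        return -1;
    if (write(p[1], "x", 1) == 1) {
        pfd.fd = p[0];
        pfd.events = wanted;
        pfd.revents = 0;
        if (poll(&pfd, 1, 0) == 1)   /* timeout 0: just sample the state */
            revents = pfd.revents;
    }
    close(p[0]);
    close(p[1]);
    return revents;
}
```

Asking for POLLIN | POLLPRI on a Linux pipe yields POLLIN only: whatever the
library wanted to signal (urgent data, an error, a hangup of the virtual
socket) has to travel through some side channel.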

> 
...omitted...
> > 
> > Signed-off-by: Renzo Davoli 
> > ---
> >  fs/eventfd.c   | 115 +++--
> >  include/linux/eventfd.h|   7 +-
> >  include/uapi/linux/eventpoll.h |   2 +
> >  3 files changed, 116 insertions(+), 8 deletions(-)
> > 
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index 8aa0ea8c55e8..f83b7d02307e 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -3,6 +3,7 @@
> >   *  fs/eventfd.c
> >   *
> >   *  Copyright (C) 2007  Davide Libenzi 
> > + *  EFD_VPOLL support: 2019 Renzo Davoli 
> 
> No need for this line, that's what the git history shows.
okay

> 
> >   *
> >   */
> >  
> > @@ -30,12 +31,24 @@ struct eventfd_ctx {
> > struct kref kref;
> > wait_queue_head_t wqh;
> > /*
> > -* Every time that a write(2) is performed on an eventfd, the
> > -* value of the __u64 being written is added to "count" and a
> > -* wakeup is performed on "wqh". A read(2) will return the "count"
> > -* value to userspace, and will reset "count" to zero. The kernel
> > -* side eventfd_signal() also, adds to the "count" counter and
> > -* issue a wakeup.
> > +* If the EFD_VPOLL flag was NOT set at eventfd creation:
> > +*   Every time that a write(2) is performed on an eventfd, the
> > +*   value of the __u64 being written is added

[PATCH v2 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-05-26 Thread Renzo Davoli
			else {
				printf("GOT event %x\n", ev.events);
				vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
				if (ev.events & EPOLLHUP)
					break;
			}
		}
		break;	/* do not fall through into the fork error path */
	case -1:
		handle_error("fork");
	}
	close(epollfd);
	close(efd);
	return 0;
}

Signed-off-by: Renzo Davoli 
---
 fs/eventfd.c   | 115 +++--
 include/linux/eventfd.h|   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 116 insertions(+), 8 deletions(-)

Changes in v2:
 - Fix size of EFD_VPOLL_*EVENTS constants for 32 bit architectures

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..f83b7d02307e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -3,6 +3,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  EFD_VPOLL support: 2019 Renzo Davoli 
  *
  */
 
@@ -30,12 +31,24 @@ struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
/*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
+* If the EFD_VPOLL flag was NOT set at eventfd creation:
+*   Every time that a write(2) is performed on an eventfd, the
+*   value of the __u64 being written is added to "count" and a
+*   wakeup is performed on "wqh". A read(2) will return the "count"
+*   value to userspace, and will reset "count" to zero (or decrement
+*   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+*   side eventfd_signal() also, adds to the "count" counter and
+*   issue a wakeup.
+*
+* If the EFD_VPOLL flag was set at eventfd creation:
+*   count is the set of pending EPOLL events.
+*   read(2) returns the current value of count.
+*   The argument of write(2) is an 8-byte integer:
+*   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+*   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+*   events to be added, deleted to the current set of pending events.
+*   (i.e. which bits of "count" must be set or reset).
+*   EFD_VPOLL_MODEVENTS redefines the set of pending events.
 */
__u64 count;
unsigned int flags;
@@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
return res;
 }
 
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	__poll_t events = 0;
+	u64 count;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	count = READ_ONCE(ctx->count);
+
+	events = (count & EPOLLALLMASK);
+
+	return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt = 0;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	res = sizeof(ucnt);
+	ucnt = READ_ONCE(ctx->count);
+	if (put_user(ucnt, (__u64 __user *)buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	__u32 events;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+		return -EFAULT;
+	spin_lock_irq(&ctx->wqh.lock);
+
+	events = ucnt & EPOLLALLMASK;
+	res = sizeof(ucnt);
+	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
+	case EFD_VPOLL_ADDEVENTS:
+		ctx->count |= events;
+		break;
+	case EFD_VPOLL_DELEVENTS:
+		ctx->count &= ~(events);
+		break;
+	case EFD_VPOLL_MODEVENTS:
+		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+	/* wake up waiting threads */
+	if (res >= 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, res);
+
+   

[PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events

2019-05-26 Thread Renzo Davoli
			else {
				printf("GOT event %x\n", ev.events);
				vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
				if (ev.events & EPOLLHUP)
					break;
			}
		}
		break;	/* do not fall through into the fork error path */
	case -1:
		handle_error("fork");
	}
	close(epollfd);
	close(efd);
	return 0;
}

Signed-off-by: Renzo Davoli 
---
 fs/eventfd.c   | 115 +++--
 include/linux/eventfd.h|   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..f83b7d02307e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -3,6 +3,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi 
+ *  EFD_VPOLL support: 2019 Renzo Davoli 
  *
  */
 
@@ -30,12 +31,24 @@ struct eventfd_ctx {
struct kref kref;
wait_queue_head_t wqh;
/*
-* Every time that a write(2) is performed on an eventfd, the
-* value of the __u64 being written is added to "count" and a
-* wakeup is performed on "wqh". A read(2) will return the "count"
-* value to userspace, and will reset "count" to zero. The kernel
-* side eventfd_signal() also, adds to the "count" counter and
-* issue a wakeup.
+* If the EFD_VPOLL flag was NOT set at eventfd creation:
+*   Every time that a write(2) is performed on an eventfd, the
+*   value of the __u64 being written is added to "count" and a
+*   wakeup is performed on "wqh". A read(2) will return the "count"
+*   value to userspace, and will reset "count" to zero (or decrement
+*   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+*   side eventfd_signal() also, adds to the "count" counter and
+*   issue a wakeup.
+*
+* If the EFD_VPOLL flag was set at eventfd creation:
+*   count is the set of pending EPOLL events.
+*   read(2) returns the current value of count.
+*   The argument of write(2) is an 8-byte integer:
+*   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+*   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+*   events to be added, deleted to the current set of pending events.
+*   (i.e. which bits of "count" must be set or reset).
+*   EFD_VPOLL_MODEVENTS redefines the set of pending events.
 */
__u64 count;
unsigned int flags;
@@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
return res;
 }
 
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	__poll_t events = 0;
+	u64 count;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	count = READ_ONCE(ctx->count);
+
+	events = (count & EPOLLALLMASK);
+
+	return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt = 0;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	res = sizeof(ucnt);
+	ucnt = READ_ONCE(ctx->count);
+	if (put_user(ucnt, (__u64 __user *)buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	__u32 events;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+		return -EFAULT;
+	spin_lock_irq(&ctx->wqh.lock);
+
+	events = ucnt & EPOLLALLMASK;
+	res = sizeof(ucnt);
+	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
+	case EFD_VPOLL_ADDEVENTS:
+		ctx->count |= events;
+		break;
+	case EFD_VPOLL_DELEVENTS:
+		ctx->count &= ~(events);
+		break;
+	case EFD_VPOLL_MODEVENTS:
+		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+	/* wake up waiting threads */
+	if (res >= 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, res);
+
+	spin_unlock_irq(&ctx->wqh.lock);
+
+	return res;
+
+}
+
 #ifdef

[PATCH 0/1] IPN: Inter Process Networking

2007-12-17 Thread Renzo Davoli
Inter Process Networking (PATCH):

This patch adds a new address family for inter process communication.
AF_IPN: inter process networking, i.e. multipoint,
multicast/broadcast communication among processes (and networks).

Contents of this document:

1. What is IPN?
2. Why IPN?
2.1 Why IPN instead of IP Multicast?
2.2 Why IPN instead of AF_NETLINK?
3. How?

We've read all the comments in the previous thread about IPN and we've
tried to answer.

1. WHAT IS IPN?
---

IPN is a new address family designed for one-to-many, many-to-many and 
peer-to-peer communication among processes.
Berkeley sockets have been designed for client-server or point-to-point
communication; AF_UNIX does not support multicast/broadcast. AF_IPN
does, in a simple, efficient but extensible way.
IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.

On IPN, processes can interoperate using real networking protocols
(e.g. ethernet) but also using application defined protocols (maybe
just sending ascii strings, video or audio frames, etc).
IPN provides networking (in the broadest definition you can imagine) to
the processes. Processes can be ethernet nodes, run their own TCP-IP stacks
if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.

IPN networks can be interconnected with real networks or IPN networks
running on different computers can interoperate (can be connected by
virtual cables).

IPN is part of the Virtual Square Project (vde, lwipv6, view-os, 
umview/kmview, see wiki.virtualsquare.org).

2. WHY IPN?
---
Many applications can benefit from IPN.
First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.
IPN can be useful for applications where one or some processes feed their
data (*any kind* of data, not only networking-related messages) to several
consuming processes (maybe joining the stream at run time). IPN sockets
can be also connected to tap (tuntap) like interfaces or to real interfaces
(like "brctl addif").
There are specific ioctls to define a tap interface or grab an existing
one.

Several existing services could be implemented (and often could have extended
features) on top of IPN:
 - kernel Ethernet bridging
 - TUN/TAP
 - MACVLAN

IPN could be used (IMHO) to provide multicast services to processes.
Audio frames or video frames could be multiplexed such that multiple
applications can use them. I think that something like Jack can be
implemented on top of IPN. Something like a VideoJack can
provide video frames to several applications: e.g. the same image from a
camera can be viewed by xawtv, recorded and sent to a streaming service.
IPN sockets can be used wherever there is the idea of a broadcasting channel,
i.e. where processes can "join (and leave) the information flow" at
runtime. IPN can be seen as "publish and subscribe".
Different delivery policies can be defined as IPN protocols (loaded
as submodules of ipn.ko).
For instance, an ethernet switch is a policy (kvde_switch.ko: packets
are unicast delivered if the MAC address is already in the switching
hash table). We are designing an extended switch, full of interesting
features like our userland vde_switch (with vlan/fst/management etc.),
and a layer3 switch, but other policies can be defined to implement the
specific requirements of other services. I feel that there are no limits
to creativity about multicast services for processes.  Userspace
services (like vde) do exist, but IPN provides faster and unified
support.

2.1 Why IPN instead of IP Multicast?

 - IPN seems to be faster than IP Multicast. (see my message to LKML
   of Dec 06).
 - IPN provides file system permission to access the communication medium,
   and it uses the file system for naming.
 - IPN does not need any tunneling or packet encapsulation, it works as a
   layer 1 virtual network.
 - IPN protocols (implemented by kernel submodules) provide forwarding
   policies: the set of recipients for each message is computed from the
   contents of the message itself.
   Ethernet virtual switches or other routing rules for any kind of data
   can be implemented as IPN protocols.

2.2 Why IPN instead of AF_NETLINK?
--
 - Netlink has been designed for user to kernel communication.
 - Netlink has many missing features to provide services similar to IPN.
 - Currently multicast seems to be allowed for root only. Access control
   should be added completely.
 - Netlink interface for user processes is not very immediate (libnl has
   been developed as a higher level solution to that).
 - Netlink already seems to suffer from "overpopulation":
   NETLINK_GENERIC has been added for "simplified netlink usage" but it
   adds yet another header and rules to be followed.
 - Netlink is quite rigid as for message delivery guarantees: unicast
   implies lossless 

[PATCH] misc driver: eliminate 256 minor limit & deprecated call register_chrdev

2007-12-16 Thread Renzo Davoli
I already posted this patch on September 9th but nobody cared.

Is anybody interested in knowing that there is an old limit of 256 on
misc device minors, that we are running out of minor numbers,
and that there is a deprecated call in this code?

drivers/char/misc.c: the deprecated call is 
register_chrdev and it limits the number of minors to 256.

I propose this patch, which eliminates both problems. With this patch
misc allocates the entire major 10.
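
To make the size of the change concrete, here is a minimal userspace sketch of the device-number arithmetic involved. The macros are copied from the kernel's include/linux/kdev_t.h; the two helper functions are hypothetical names introduced only for this illustration and are not part of the patch.

```c
#include <stdio.h>

/* Userspace copies of the kernel's dev_t helpers (include/linux/kdev_t.h). */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))

#define MISC_MAJOR 10 /* major number owned by the misc driver */

/* Count handed to register_chrdev_region() in the patch (hypothetical helper). */
unsigned int region_minor_count(void)
{
    return MINORMASK;
}

/* Print the range requested by the patch next to the old 256-minor limit. */
void print_misc_region(void)
{
    unsigned int first = MKDEV(MISC_MAJOR, 0);
    unsigned int last = first + region_minor_count() - 1;

    printf("old limit: 256 minors\n");
    printf("requested: major %u, minors %u..%u\n",
           MAJOR(first), MINOR(first), MINOR(last));
}
```

Note that the count passed in the patch is MINORMASK, i.e. 2^20 - 1, so the region as written covers minors 0 through 1048574.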

This patch was designed for a previous version of the kernel code
(2.6.22?). I have tested it today and it applies to 2.6.24-rc5 with a -12
line offset.

renzo

Signed-off-by: Renzo Davoli <[EMAIL PROTECTED]>

--- a/drivers/char/misc.c   2007-08-05 16:56:59.0 +0200
+++ b/drivers/char/misc.c   2007-09-06 11:07:51.0 +0200
@@ -56,6 +56,8 @@
 static LIST_HEAD(misc_list);
 static DEFINE_MUTEX(misc_mtx);
 
+static struct cdev misc_cdev;
+
 /*
  * Assigned numbers, used for dynamic minors
  */
@@ -273,6 +275,31 @@
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
+static int misc_register_chrdev(void)
+{
+   dev_t from=MKDEV(MISC_MAJOR,0);
+   int rv;
+   int err = -ENOMEM;
+   char *s;
+
+   if ((rv=register_chrdev_region(from,MINORMASK,"misc")) != 0)
+       return rv;
+
+   cdev_init(&misc_cdev, &misc_fops);
+   misc_cdev.owner=misc_fops.owner;
+   kobject_set_name(&misc_cdev.kobj, "%s", "misc");
+   for (s = strchr(kobject_name(&misc_cdev.kobj),'/'); s; s = strchr(s, '/'))
+       *s = '!';
+   err = cdev_add(&misc_cdev, from, MINORMASK);
+   if (err)
+       goto out;
+   return 0;
+out:
+   kobject_put(&misc_cdev.kobj);
+   unregister_chrdev_region(from,MINORMASK);
+   return err;
+}
+
 static int __init misc_init(void)
 {
 #ifdef CONFIG_PROC_FS
@@ -286,7 +313,7 @@
    if (IS_ERR(misc_class))
        return PTR_ERR(misc_class);
 
-   if (register_chrdev(MISC_MAJOR,"misc",&misc_fops)) {
+   if (misc_register_chrdev()) {
        printk("unable to get major %d for misc devices\n",
               MISC_MAJOR);
        class_destroy(misc_class);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


AF_IPN: Inter Process Networking, try these...

2007-12-07 Thread Renzo Davoli
Andi, David,

I disagree. If you suspect we would be better off using IP multicast, I think
your suspicions are not supported.
Try the following exercises, please. Can you provide better solutions
without IPN?

renzo

Exercise #1.
I am a user (NOT ROOT), I like kvm, qemu etc. I want an efficient network
between my VMs.

My solution:
I create an IPN socket, with protocol IPN_VDESWITCH, and all the VMs can
communicate.

Your solution:
- I am condemned by two kernel developers to run the switch in userland.
- I beg the sysadm to give me some pre-allocated taps connected together
by a kernel bridge.
- I create a multicast socket limited to this host (TTL=0) and I use it
like a hub. It cannot switch the packets.

Exercise #2.
I am a sysadm (maybe a lab administrator). I want my users (not root)
of the group "vmenabled" to run their VMs connected to a network.
I have hundreds of users in vmenabled (say, students).

My solution:
I create an IPN socket, with protocol IPN_VDESWITCH, connected to a virtual
interface, say ipn0. I give the socket permission 760, owner
root:vmenabled.

Your solution:
- I am condemned by two kernel developers to run the switch in userland.
- I create a multicast socket connected to a tap and then I define iptables
filters to prevent unauthorized users from joining the net.
- I create hundreds of preallocated tap interfaces, at least one per user.

Exercise #3.
I am a user (NOT ROOT) and I have a heavy stream of *very private data* 
generated by some processes that must be received by several processes.
I am looking for an efficient solution.
Data can be ASCII strings, or a binary stream.
It is not a "networking" issue, it is just IPC.

My solution.
I create an IPN socket with permission 700, IPN_BROADCAST protocol. All
the processes connect to the socket either for writing or for reading (or both).

Your solution:
- I am condemned by two kernel developers to use inefficient userland
solutions like named pipes, tee, or a user daemon among AF_UNIX sockets.
- If I use multicast, others can read the stream.
(security by obscurity? the attacker does not know the address?)
- I use a multicast socket with SSL (it sounds funny to use encryption
  to talk with myself, exposing the stream to crypto attack).


Re: New Address Family: Inter Process Networking (IPN)

2007-12-06 Thread Renzo Davoli
I have done some raw tests.
(you can read the code here: http://www.cs.unibo.it/~renzo/rawperftest/)

The programs are quite simple. The sender sends "Hello World" as fast as it
can, while the receiver prints time() for every 1 million messages
received.

On my laptop, tests on 2000 "Hello World" packets, 

One receiver:
multicast   244,000 msg/sec
IPN         333,000 msg/sec  (36% faster)

Two receivers:
multicast   174,000 msg/sec
IPN         250,000 msg/sec  (43% faster)
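
The percentages above follow from the message rates. A minimal sketch of the arithmetic, assuming the figures are truncated rather than rounded (percent_faster is a hypothetical helper, not part of the test programs):

```c
/* Percentage by which `faster` exceeds `base`, truncated to an integer. */
unsigned int percent_faster(unsigned int base, unsigned int faster)
{
    return (faster - base) * 100u / base;
}
```

For the two-receiver case the exact value is about 43.7, which truncates to the 43 reported above.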

Apart from this, how could I implement policies over a multicast socket,
e.g. how does a Kernel VDE_switch work on multicast sockets?

If I send an ethernet packet over a multicast socket it can emulate just a
hub (although it seems quite unnatural to me to have TCP-UDP
over IP over Ethernet over UDP over IP - okay, we can skip the Ethernet
on localhost, long ethernet frames get fragmented, but... details).

On a multicast socket you cannot use policies. I mean that an IPN network (or
bus or group) can have a policy that reads some info in the packet to
decide the set of recipients.
For a vde_switch it is the destination MAC address, when found in the
MAC hash table, that selects the recipient port. For midi communication it
could be the channel number.
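
The switch policy described above (unicast when the destination MAC is in the hash table, delivery to everybody otherwise) can be sketched as follows. This is a toy model written for illustration only; it is not the kvde_switch code, and the table size, hash function and function names are all assumptions.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 256  /* hypothetical table size */
#define PORT_FLOOD (-1) /* destination unknown: deliver to every port */

struct mac_entry {
    uint8_t mac[6];
    int port;
    int used;
};

static struct mac_entry mac_table[TABLE_SIZE];

/* Toy hash over the six bytes of a MAC address. */
static unsigned int mac_hash(const uint8_t mac[6])
{
    unsigned int h = 0;
    for (int i = 0; i < 6; i++)
        h = h * 31 + mac[i];
    return h % TABLE_SIZE;
}

/* Remember that a frame with source address `mac` arrived on `port`. */
void switch_learn(const uint8_t mac[6], int port)
{
    struct mac_entry *e = &mac_table[mac_hash(mac)];

    memcpy(e->mac, mac, 6); /* a colliding entry is simply overwritten */
    e->port = port;
    e->used = 1;
}

/* Return the single recipient port for `dst`, or PORT_FLOOD if unknown. */
int switch_lookup(const uint8_t dst[6])
{
    const struct mac_entry *e = &mac_table[mac_hash(dst)];

    if (e->used && memcmp(e->mac, dst, 6) == 0)
        return e->port;
    return PORT_FLOOD;
}
```

A policy for another kind of traffic would replace the lookup key: for MIDI, for example, the key could be the channel number instead of the MAC address.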

When the switching fabric is moved to userland, the performance figures are
quite different.

renzo



Re: New Address Family: Inter Process Networking (IPN)

2007-12-06 Thread Renzo Davoli
Some more explanations trying to describe what IPN is and what it is
useful for.  We are writing the complete patch.

Summary:
* IPN is for inter-process communication. It is *not* directly related
  to TCP-IP or Ethernet.
* IPN itself is a *level 1* virtual physical network.
* IPN services (like AF_UNIX) do not require root privileges.
* TAP and GRAB are just extra features for IPN delivering Ethernet frames.


* IPN is for inter-process communication. It is *not* directly related 
to TCP-IP or Ethernet.

If you want you can call it Inter Process Bus Communication.  It is an
extension of AF_UNIX.  Comments saying that some services can be
implemented by using TCP-IP multicast protocols are unrelated to IPN.
All AF_UNIX services could be implemented as TCP-IP services on
127.0.0.1. Do we abolish AF_UNIX, then?  The problem is that to use
TCP-IP, you'd need to wrap the packets with TCP or UDP, IP and Ethernet
headers, and the stack would waste time managing useless protocols.  If you
just want to send strings to a set of local processes, TCP-IP is an
overkill solution.  Even X-Window uses AF_UNIX sockets to talk with
local clients; it is a performance issue... I think Chris is right.

* IPN itself is a *level 1* virtual physical network.

Like any physical network you can run higher level protocols on it, thus
Ethernet, and then TCP-IP can be services you can run on IPN, but there
can be IPN networks running neither TCP-IP nor Ethernet.

* IPN services (like AF_UNIX) do not require root privileges.

There are many communication services where the user needs broadcast or
p2p among user processes.  If a user (not root) wants to run several
User-Mode Linux, Qemu, Kvm VMs, the only way to have them connected
together is our Virtual Distributed Ethernet.  (For this reason VDE
exists in almost all the distros, it has been ported to other OSs, and
is already supported in the Linux Kernel for User-Mode Linux).  VDE is a
userland daemon, hence it requires two context switches to deliver a
packet: VM1 -> K -> Daemon -> K -> VM2. Kvde running on IPN requires just one:
VM1 -> K -> VM2.  I think D-Bus can use IPN, too. The same cut in
context switches applies.  May I speculate that there will be a significant
increase in performance?  *nix systems are multiuser. It means that there do
exist people who need to set up services without root access.  And even
if you have root access, the less you need to work as root, the safer
your system is.

* TAP and GRAB are just extra features for IPN delivering Ethernet frames.

Some IPN networks do use Ethernet as Data-Link protocol.  It is useful
to provide means to connect the IPN socket to a virtual (TAP) interface
or to a real (GRAB) interface.  I know that a lot of people use tap
interfaces and the kernel bridge to connect Virtual Machines.  The
access can be restricted to some users or processes by iptables, but it
is not as simple as a chmod on the socket.  A lot of people also use tunctl
to define a priori tap interfaces for users.  They must define as many
tuntap interfaces as the number of VMs the users may want; each user has
his/her own taps.  Some users define a userland VDE switch to
interconnect their VMs.  IPN itself could use a userland process to
define a standard TAP interface and lose time and cpu cycles
moving packets from tap to ipn and vice versa.  IPN is already kernel code,
so all those context switches and cpu cycles can be saved by
accessing the tap or grabbed interface directly from the kernel.  (TAP
and GRAB obviously require CAP_NET_ADMIN).  Using IPN with TAP you can
define one single TAP interface connected to an IPN socket. Several VMs
can use that IPN socket; in this way the VMs are connected by a (virtual
ethernet) network which includes the TAP interface.  The access control
to the network (and then to the TAP) is done by setting the permissions
of the socket.  Tunctl is *not* able to create a tap where all the users
belonging to a group can start their VMs. IPN can.


Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Renzo Davoli
> In the meanwhile we would be grateful if the community could kindly ask to the
> questions above.
Obviously I meant:
In the meanwhile we would be grateful if the community could kindly *answer*
to the questions above

sorry (it is early morning here, it happens ;-)

renzo


Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Renzo Davoli
On Wed, Dec 05, 2007 at 04:55:52PM -0500, Stephen Hemminger wrote:
> On Wed, 5 Dec 2007 17:40:55 +0100
> [EMAIL PROTECTED] (Renzo Davoli) wrote:
> > 0- (Constructive) comments.
> > 1- The "official" assignment of an Address Family.
> > 2- Another "grabbing hook" for interfaces (like the ones already
> > We are studying some way to register/deregister grabbing services,
> > I feel this would be the cleanest way. 
> 
> Post complete source code for kernel part to [EMAIL PROTECTED]
I'll do it as soon as possible.
> If you want the hooks, you need to include the full source code for inclusion
> in mainline. All the Documentation/SubmittingPatches rules apply;
> you can't just ask for "facilitators" and expect to keep your stuff out of 
> tree.
I am sorry if I was misunderstood.
I did not want any "facilitator", nor did I want to keep my code outside
the kernel; on the contrary.
It is perfectly okay for me to provide the entire code for inclusion.
The purposes of my message were the following:
- I wanted to introduce the idea and tell the linux kernel community
  that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF?
Is there an "AF registry" somewhere? (like the device major/minor
registry or the well-known port assignment for TCP-IP).
- Hook: we have two different options. We can add another grabbing
inline function like those used by the bridge and macvlan, or we can
design a grabbing service registration facility. Which one is preferable?
The former is simpler, the latter is more elegant but it requires some
changes in the kernel bridge code.
So the choice is between the former (less invasive, safer, inelegant) and
the latter (more invasive, less safe, elegant).

We need a bit of time to stabilize the code: thoroughly testing the existing
features and implementing some more ideas we have on it.
In the meanwhile we would be grateful if the community could kindly ask to the
questions above.

renzo


Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Renzo Davoli
On Thu, Dec 06, 2007 at 12:39:22AM +0100, Andi Kleen wrote:
> [EMAIL PROTECTED] (Renzo Davoli) writes:
> 
> > Berkeley socket have been designed for client server or point to point
> > communication. All existing Address Families implement this idea.
> Netlink is multicast/broadcast by default for once. And BC/MC certainly
> works for IPv[46] and a couple of other protocols too.
> 
> > IPN is an Inter Process Communication paradigm where all the processes
> > appear as they were connected by a networking bus.
> 
> Sounds like netlink. See also RFC 3549

RFC 3549 says:
"This document describes Linux Netlink, which is used in Linux both as
   an intra-kernel messaging system as well as between kernel and user
   space."

We know AF_NETLINK; our user-space stack lwipv6 supports it.

AF_IPN is different. 
AF_IPN is the broadcast and peer-to-peer extension of AF_UNIX.
It supports communication among *user* processes. 

Example:

Qemu, User-Mode Linux, Kvm and our umview machines can use IPN as an
Ethernet Hub and communicate among themselves, with the hosting computer
and with the world through a tap-like interface.

You can also grab an interface (say eth1) and use eth0 for your hosting
computer and eth1 for the IPN network of virtual machines.

If you load the kvde_switch submodule, IPN can be a virtual Ethernet switch.

This example is already working using the svn versions of ipn and
vdeplug.

Another Example:

You have a continuous stream of data packets generated by a process,
and you want to send this data to many processes.
Maybe the set of processes is not known in advance, and you want to send the
data to any interested process. Some kind of publish and subscribe
communication service (among unix processes, not on TCP-IP).
Without IPN you need a server. With IPN the sender creates the socket,
connects to it and feeds it with data packets. All the interested
receivers connect to it and start reading. That's all.

I hope that this message can give a better understanding of what IPN is.

renzo


New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Renzo Davoli
Inter Process Networking: 
a kernel module (and some simple kernel patches) to provide 
AF_IPN: a new address family for process networking, i.e. multipoint,
multicast/broadcast communication among processes (and networks).

WHAT IS IT?
---
Berkeley sockets have been designed for client-server or point-to-point
communication. All existing Address Families implement this idea.
IPN is a new address family designed for one-to-many, many-to-many and
peer-to-peer communication among processes.
IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.
On IPN, processes can interoperate using real networking protocols
(e.g. ethernet) but also using application-defined protocols (maybe
just sending ascii strings, video or audio frames, etc).
IPN provides networking (in the broadest definition you can imagine) to
the processes. Processes can be ethernet nodes, run their own TCP-IP stacks
if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.
IPN networks can be interconnected with real networks, and IPN networks
running on different computers can interoperate (they can be connected by
virtual cables).
IPN is part of the Virtual Square Project (vde, lwipv6, view-os, 
umview/kmview, see wiki.virtualsquare.org).

WHY?

Many applications can benefit from IPN.
First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.
IPN can be useful for applications where one or some processes feed their 
data to several consuming processes (maybe joining the stream at run time).
IPN sockets can be also connected to tap (tuntap) like interfaces or
to real interfaces (like "brctl addif").
There are specific ioctls to define a tap interface or grab an existing
one.
Several existing services could be implemented (and often could have extended
features) on the top of IPN:
- kernel bridge
- tuntap
- macvlan
IPN could be used (IMHO) to provide multicast services to processes.
Audio frames or video frames could be multiplexed such that multiple
applications can use them. I think that something like Jack can be
implemented on the top of IPN. Something like a VideoJack can
provide video frames to several applications: e.g. the same image from a
camera can be viewed by xawtv, recorded and sent to a streaming service.
IPN sockets can be used wherever there is the idea of a broadcasting channel,
i.e. where processes can "join (and leave) the information flow" at
runtime.
Different delivery policies can be defined as IPN protocols (loaded
as submodules of ipn.ko).
E.g. the ethernet switch is a policy (kvde_switch.ko: packets are unicast
delivered if the MAC address is already in the switching hash table).
We are designing an extended switch, full of interesting features like
our userland vde_switch (with vlan/fst/management etc.), and a layer3
switch, but other policies can be defined to implement the specific
requirements of other services. I feel that there are no limits to creativity
about multicast services for processes.
Userspace services (like vde or jack) do exist, but IPN provides
faster and unified support.
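
To make the idea of a delivery policy more concrete, here is a minimal
user-space sketch of the switch rule mentioned above: deliver to one port
when the destination MAC is already in the table, otherwise deliver to
everybody. All the names (mac_table, learn, lookup, TABLE_SIZE, FLOOD) are
invented for this example and are not taken from kvde_switch.ko; a colliding
entry simply overwrites the previous one.

```c
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 256          /* hypothetical size, must be a power of two */
#define FLOOD      (-1)         /* means "deliver to every port" */

struct mac_entry {
	uint8_t mac[6];
	int port;
	int used;
};

struct mac_entry mac_table[TABLE_SIZE];

/* Simple hash of the six address bytes, reduced to a table index. */
unsigned mac_hash(const uint8_t *mac)
{
	unsigned h = 0;
	for (int i = 0; i < 6; i++)
		h = h * 31 + mac[i];
	return h & (TABLE_SIZE - 1);
}

/* Remember that frames from this source address arrive on this port. */
void learn(const uint8_t *src, int port)
{
	struct mac_entry *e = &mac_table[mac_hash(src)];
	memcpy(e->mac, src, 6);
	e->port = port;
	e->used = 1;
}

/* Return the port of a known destination, FLOOD otherwise. */
int lookup(const uint8_t *dst)
{
	const struct mac_entry *e = &mac_table[mac_hash(dst)];
	if (e->used && memcmp(e->mac, dst, 6) == 0)
		return e->port;
	return FLOOD;
}
```

A policy module would call something like learn() on every incoming frame
and lookup() to choose between unicast and broadcast delivery.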

HOW?

The complete specifications for IPN can be found here:
http://wiki.virtualsquare.org/index.php/IPN

bind() creates the socket (if it does not already exist). When bind() succeeds,
the process has the right to manage the "network".
No data can be sent or received if the socket is not connected
(only get/setsockopt and ioctl work on bound unconnected sockets).

connect() is used to join the network. When the socket is connected it
is possible to send/receive data. If the socket is already bound there is
no need to specify the address again (you can use NULL, or specify the same
address).
connect() can also be used without bind(). In this case the process sends and
receives data but it cannot manage the network (and the socket
address specification is required).
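
A sketch of the connect()-without-bind() case could look like the following.
The helper name ipn_join is hypothetical, and the address family and protocol
are taken as parameters (AF_IPN and IPN_BROADCAST on a kernel carrying the IPN
module) because the standard headers do not define them. On a kernel without
IPN the socket() call simply fails and the function returns -1.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Join the network named by path without managing it.
 * Returns a connected descriptor, or -1 with errno set.
 */
int ipn_join(int family, int protocol, const char *path)
{
	struct sockaddr_un sun;
	int s;

	if (strlen(path) >= sizeof(sun.sun_path)) {
		errno = ENAMETOOLONG;
		return -1;
	}
	memset(&sun, 0, sizeof(sun));
	sun.sun_family = family;
	strcpy(sun.sun_path, path);

	s = socket(family, SOCK_RAW, protocol);
	if (s < 0)
		return -1;
	/* No bind(): this process only sends and receives data. */
	if (connect(s, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
		int saved = errno;      /* close() may overwrite errno */
		close(s);
		errno = saved;
		return -1;
	}
	return s;
}
```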

listen() and accept() are for servers, thus they do not exist for IPN.

Examples:
1- Peer-to-Peer Communication:
Several processes run the same code:

  struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
  int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST); 
  err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
  err=connect(s,NULL,0);

In this case all the messages sent by each process get received by all the
other processes (IPN_BROADCAST). 
The processes need to be able to receive data when there are pending packets, 
e.g. by using poll/select and event driven programming or multithreading.

2- (One or) Some senders/many receivers
The sender runs the following code:

  struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
  int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST);
  err=shutdown(s,SHUT_RD);
  err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
  err=connect(s,NULL,0);

The receivers do not need to define the network, thus they skip the bind():

  struct sockaddr_un 

Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Renzo Davoli
On Wed, Dec 05, 2007 at 04:55:52PM -0500, Stephen Hemminger wrote:
> On Wed, 5 Dec 2007 17:40:55 +0100
> [EMAIL PROTECTED] (Renzo Davoli) wrote:
> > 0- (Constructive) comments.
> > 1- The official assignment of an Address Family.
> > 2- Another grabbing hook for interfaces (like the ones already
> > We are studying some way to register/deregister grabbing services,
> > I feel this would be the cleanest way. 
> 
> Post complete source code for kernel part to [EMAIL PROTECTED]
I'll do it as soon as possible.
> If you want the hooks, you need to include the full source code for inclusion
> in mainline. All the Documentation/SubmittingPatches rules apply;
> you can't just ask for facilitators and expect to keep your stuff out of 
> tree.
I am sorry if I was misunderstood.
I did not want any facilitator, nor did I want to keep my code outside
the kernel; on the contrary.
It is perfectly okay for me to provide the entire code for inclusion.
The purposes of my message were the following:
- I wanted to introduce the idea and tell the linux kernel community
  that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF?
  Is there an AF registry somewhere (like the device major/minor
  registry or the well-known port assignment for TCP-IP)?
- Hook: we have two different options. We can add another grabbing
  inline function like those used by the bridge and macvlan, or we can
  design a grabbing service registration facility. Which one is preferable?
  The former is simpler, the latter is more elegant but it requires some
  changes in the kernel bridge code.
  So the choice is between the former (less invasive, safer, inelegant) and
  the latter (more invasive, less safe, elegant).
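
To give an idea of what the second option means, here is a small user-space
sketch of a registration facility: hooks are registered and deregistered at
run time, and each frame is offered to them in turn. Every name here
(grab_fn, grab_register, grab_deregister, grab_dispatch, MAX_GRABBERS) is
invented for illustration; this is not existing kernel code, and a kernel
version would also need proper locking.

```c
#include <stddef.h>

/* Hypothetical hook type: returns nonzero when it consumed the frame. */
typedef int (*grab_fn)(void *frame, size_t len);

#define MAX_GRABBERS 8          /* arbitrary limit for the sketch */

grab_fn grabbers[MAX_GRABBERS];

/* Add a hook; returns 0 on success, -1 if fn is NULL or the table is full. */
int grab_register(grab_fn fn)
{
	if (fn == NULL)
		return -1;
	for (int i = 0; i < MAX_GRABBERS; i++) {
		if (grabbers[i] == NULL) {
			grabbers[i] = fn;
			return 0;
		}
	}
	return -1;
}

/* Remove a hook; returns 0 on success, -1 if it was not registered. */
int grab_deregister(grab_fn fn)
{
	if (fn == NULL)
		return -1;
	for (int i = 0; i < MAX_GRABBERS; i++) {
		if (grabbers[i] == fn) {
			grabbers[i] = NULL;
			return 0;
		}
	}
	return -1;
}

/* Offer a frame to each hook in turn; stop at the first taker. */
int grab_dispatch(void *frame, size_t len)
{
	for (int i = 0; i < MAX_GRABBERS; i++)
		if (grabbers[i] && grabbers[i](frame, len))
			return 1;
	return 0;
}

/* Example hook that takes every frame. */
int grab_everything(void *frame, size_t len)
{
	(void)frame;
	(void)len;
	return 1;
}
```

With the first option, by contrast, each grabbing service gets its own
inline check compiled into the receive path.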

We need a bit of time to stabilize the code: deeply testing the existing
features and implementing some more ideas we have on it.
In the meanwhile we would be grateful if the community could kindly ask to the
questions above.

renzo


Re: New Address Family: Inter Process Networking (IPN)

2007-12-05 Thread Renzo Davoli
> In the meanwhile we would be grateful if the community could kindly ask to the
> questions above.
Obviously I meant:
In the meanwhile we would be grateful if the community could kindly *answer*
to the questions above

sorry (it is early morning here, it happens ;-)

renzo


[PATCH] drivers/char/misc.c: deprecated call register_chrdev, and 256 minor limit eliminated

2007-09-06 Thread Renzo Davoli
Dear Kernel Developers,

I have seen that drivers/char/misc.c still uses the deprecated
register_chrdev call, and this limits the number of minors to 256.

I recently asked for a misc minor number assignment for View-OS/kmview
and Torben Mathiasen told me about this issue.

I propose this patch, which eliminates both problems. With this patch
misc allocates the entire major 10 and avoids the deprecated call.
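
The arithmetic behind the limit can be checked in user space with the sketch
below. The macros copy the kernel definitions from include/linux/kdev_t.h and
include/linux/major.h; the helper minor_in_region is a name made up for this
example.

```c
/* Same values as include/linux/kdev_t.h and include/linux/major.h */
#define MINORBITS  20
#define MINORMASK  ((1U << MINORBITS) - 1)
#define MAJOR(dev) ((unsigned int)((dev) >> MINORBITS))
#define MINOR(dev) ((unsigned int)((dev) & MINORMASK))
#define MKDEV(ma, mi) (((ma) << MINORBITS) | (mi))
#define MISC_MAJOR 10

/* Nonzero if minor lies in a region of count minors starting at minor 0. */
int minor_in_region(unsigned int minor, unsigned int count)
{
	return minor < count;
}
```

For instance minor_in_region(300, 256) is 0, while
minor_in_region(300, MINORMASK) is 1: a minor such as 300 is out of reach
with the 256 minors of register_chrdev, but falls inside a region of
MINORMASK (1048575) minors.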

These are just my 2 (euro) cents, I hope it can be useful.

renzo
-- 

Renzo Davoli            | Dept. of Computer Science
(NIC rd235, HAM IZ4DJE) | University of Bologna
Tel. +39 051 2094501    | Mura Anteo Zamboni, 7
Fax. +39 051 2094510    | I-40127 Bologna  ITALY
Key fingerprint = A019 17E2 5562 06F6 77BB  2E93 1A01 F646 30EA B487

--- drivers/char/misc.orig.c	2007-08-05 16:56:59.0 +0200
+++ drivers/char/misc.c 2007-09-06 11:07:51.0 +0200
@@ -56,6 +56,8 @@
 static LIST_HEAD(misc_list);
 static DEFINE_MUTEX(misc_mtx);
 
+static struct cdev misc_cdev;
+
 /*
  * Assigned numbers, used for dynamic minors
  */
@@ -273,6 +275,31 @@
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
+static int misc_register_chrdev(void)
+{
+	dev_t from=MKDEV(MISC_MAJOR,0);
+	int rv;
+	int err = -ENOMEM;
+	char *s;
+
+	if ((rv=register_chrdev_region(from,MINORMASK,"misc")) != 0)
+		return rv;
+
+	cdev_init(&misc_cdev, &misc_fops);
+	misc_cdev.owner=misc_fops.owner;
+	kobject_set_name(&misc_cdev.kobj, "%s", "misc");
+	for (s = strchr(kobject_name(&misc_cdev.kobj),'/'); s; s = strchr(s, '/'))
+		*s = '!';
+	err = cdev_add(&misc_cdev, from, MINORMASK);
+	if (err)
+		goto out;
+	return 0;
+out:
+	kobject_put(&misc_cdev.kobj);
+	unregister_chrdev_region(from,MINORMASK);
+	return err;
+}
+
 static int __init misc_init(void)
 {
 #ifdef CONFIG_PROC_FS
@@ -286,7 +313,7 @@
 	if (IS_ERR(misc_class))
 		return PTR_ERR(misc_class);
 
-	if (register_chrdev(MISC_MAJOR,"misc",&misc_fops)) {
+	if (misc_register_chrdev()) {
 		printk("unable to get major %d for misc devices\n",
 		       MISC_MAJOR);
 		class_destroy(misc_class);


Re: [PATCH] drivers/char/misc.c: deprecated call register_chrdev, and 256 minor limit eliminated

2007-09-06 Thread Renzo Davoli
On Thu, Sep 06, 2007 at 11:38:44AM +0200, Renzo Davoli wrote:
Dear Kernel Developers,

I have seen that drivers/char/misc.c still uses the deprecated
register_chrdev call, and this limits the number of minors to 256.

I recently asked for a misc minor number assignment for View-OS/kmview
and Torben Mathiasen told me about this issue.

I propose this patch, which eliminates both problems. With this patch
misc allocates the entire major 10 and avoids the deprecated call.

These are just my 2 (euro) cents, I hope it can be useful.

renzo

Signed-off-by: Renzo Davoli <[EMAIL PROTECTED]>

-- 
====
Renzo Davoli            | Dept. of Computer Science
(NIC rd235, HAM IZ4DJE) | University of Bologna
Tel. +39 051 2094501    | Mura Anteo Zamboni, 7
Fax. +39 051 2094510    | I-40127 Bologna  ITALY
Key fingerprint = A019 17E2 5562 06F6 77BB  2E93 1A01 F646 30EA B487


--- drivers/char/misc.orig.c	2007-08-05 16:56:59.0 +0200
+++ drivers/char/misc.c 2007-09-06 11:07:51.0 +0200
@@ -56,6 +56,8 @@
 static LIST_HEAD(misc_list);
 static DEFINE_MUTEX(misc_mtx);
 
+static struct cdev misc_cdev;
+
 /*
  * Assigned numbers, used for dynamic minors
  */
@@ -273,6 +275,31 @@
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
+static int misc_register_chrdev(void)
+{
+	dev_t from=MKDEV(MISC_MAJOR,0);
+	int rv;
+	int err = -ENOMEM;
+	char *s;
+
+	if ((rv=register_chrdev_region(from,MINORMASK,"misc")) != 0)
+		return rv;
+
+	cdev_init(&misc_cdev, &misc_fops);
+	misc_cdev.owner=misc_fops.owner;
+	kobject_set_name(&misc_cdev.kobj, "%s", "misc");
+	for (s = strchr(kobject_name(&misc_cdev.kobj),'/'); s; s = strchr(s, '/'))
+		*s = '!';
+	err = cdev_add(&misc_cdev, from, MINORMASK);
+	if (err)
+		goto out;
+	return 0;
+out:
+	kobject_put(&misc_cdev.kobj);
+	unregister_chrdev_region(from,MINORMASK);
+	return err;
+}
+
 static int __init misc_init(void)
 {
 #ifdef CONFIG_PROC_FS
@@ -286,7 +313,7 @@
 	if (IS_ERR(misc_class))
 		return PTR_ERR(misc_class);
 
-	if (register_chrdev(MISC_MAJOR,"misc",&misc_fops)) {
+	if (misc_register_chrdev()) {
 		printk("unable to get major %d for misc devices\n",
 		       MISC_MAJOR);
 		class_destroy(misc_class);



