Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
Hi Roman,

On Thu, Jun 06, 2019 at 10:11:57PM +0200, Roman Penyaev wrote:
> Hi Renzo,
> On 2019-06-03 17:00, Renzo Davoli wrote:
> > Please, have a look of the README.md page here:
> > https://github.com/virtualsquare/vuos
> Is that similar to what user-mode linux does?  I mean the principle.

Let us write this proportion:

    user-mode-linux / umvu = linux / namespace

In a comparison between user-mode linux and umvu, the way to get the
system call requests is the same (ptrace) but the goal is different.

user-mode linux catches all the system calls and forwards none of them
to the real kernel: it uses a linux kernel compiled as a process to give
processes the illusion of living in another machine.

umvu catches all the system calls and then decides whether each syscall
must be forwarded to the kernel (maybe modified) or entirely processed
at user level by the hypervisor (by means of specific plug-in modules
like vufuse for file systems, vudev for devices and so on).

While the "illusion" of user-mode linux is a global illusion, the
"illusion" provided by umvu is limited and configurable.

After a "mount" of a filesystem using vufuse, the file system tree is
the same *except for* the subtree of the mountpoint.  The illusion is
limited to that subtree, as only the system call requests for paths
inside the subtree are processed by umvu and its modules.

It is similar to a namespace implemented at user level.

w.r.t. namespaces:
* umvu does not change the attack surface of the kernel
  (it is just a virtualization -- a.k.a. illusion -- provided by a user
  process to other user processes)
* umvu can provide features not currently supported by the kernel
  (e.g. a file system organization unavailable as kernel code,
  networking stacks at user level etc.)
* ...

umvu is an implementation of vuos concepts using ptrace.  In the future
it may be possible to reimplement the same idea of partial virtual
machines using other syscall tracing/filtering tools.
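The per-subtree decision described above can be sketched as a prefix
test on the path.  This is only an illustration: path_in_subtree,
dispatch_path and enum vu_target are hypothetical names that do not
come from the umvu sources, and the sketch assumes absolute, already
normalized paths and a mountpoint written without a trailing slash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the umvu sources): true when `path` is
 * the mountpoint itself or lies inside its subtree.  Assumes absolute,
 * normalized paths and a mountpoint without a trailing '/'. */
bool path_in_subtree(const char *path, const char *mountpoint)
{
	size_t len;

	assert(path != NULL && mountpoint != NULL);
	len = strlen(mountpoint);
	/* the root directory contains every absolute path */
	if (len == 1 && mountpoint[0] == '/')
		return path[0] == '/';
	if (strncmp(path, mountpoint, len) != 0)
		return false;
	/* accept an exact match or a '/' right after the prefix, so that
	 * "/mnt/imgfoo" is not mistaken for something below "/mnt/img" */
	return path[len] == '\0' || path[len] == '/';
}

/* Hypothetical outcome of the dispatch: process the request in a
 * module or forward it to the kernel. */
enum vu_target { VU_TO_KERNEL, VU_TO_MODULE };

enum vu_target dispatch_path(const char *path, const char *mountpoint)
{
	return path_in_subtree(path, mountpoint) ? VU_TO_MODULE : VU_TO_KERNEL;
}
```

With a mountpoint of "/mnt/img", a request for "/mnt/img/etc/passwd"
would go to the module, while "/etc/passwd" and "/mnt/imgfoo" would be
forwarded to the kernel.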
>
> > I am not trying to port some tools to use user-space implemented
> > stacks or device drivers/emulators, I am seeking to a general
> > purpose approach.
>
> You still intersect *each* syscall, why not to do the same for
> epoll_wait() and replace events with correct value?  Seems you do
> something similar already in a vu_wrap_poll.c: wo_epoll_wait(), right?
>
> Don't get me wrong, I really want to understand whether everything
> really looks so bad without proposed change.  It seems not, because
> the whole principle is based on intersection of each syscall, thus one
> more one less - it does not become more clean and especially does not
> look like a generic purpose solution, which you seek.  I may be wrong.

Your comments are precious.  Thank you: I see that you have browsed my
code to get a better view of the problem.

umvu is a modular tool.  The executable of umvu is a dispatcher between
the system call requests coming from the user processes and the modules
(loaded at run time as dynamic plug-in libraries).

+-----------------+     +------+     +----------------------------------+
|processes running|<--->| umvu |<--->| module (e.g. vufuse/vudev/vunet) |
| "inside" umvu   |     +------+     +----------------------------------+
+-----------------+

Each module "registers" its "responsibilities" with umvu.  It can
register:
* a pathname (it will receive the syscall requests for that subtree)
* an address_family (all the syscalls for sockets of that AF)
* major/minor numbers of a char or block device
* a system call number
* (each module can register more items)

The problem is not in the dialogue between umvu and the user processes
(<---> on the left in the diagram above) but between umvu and its
modules (<---> on the right).
(wi_epoll_wait, wd_epoll_wait, wo_epoll_wait are the three wrappers used
respectively before, during and after epoll_wait in the dialogue on the
left with the user processes).
When a user process generates a "read" syscall request and umvu
discovers that the fd is managed by vufuse, it forwards to vufuse a
"read" request having the same signature as the "read" system call
(plus a trailing fdprivate arg for syscalls using a fd; this arg can be
used to speed up virtualization but can be safely ignored).

If, for the same "read" request, the file descriptor is managed by
vunet, the request is forwarded to vunet (actually it is converted to
"recvmsg": if fd is a socket, recvmsg manages all of
read/recv/recvfrom/recvmsg; umvu tends to simplify the API by unifying
similar system calls).

But what about poll/epoll/ppoll/select/pselect?

umvu takes care of all the system call requests but it needs a clean way
to ask modules for some feedback when the expected events happen.

I think the cl
Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
Hi Roman,

I am sorry for the delay in my answer, but I needed to set up a minimal
tutorial to show what I am working on and why I need a feature like the
one I am proposing.

Please, have a look at the README.md page here:
https://github.com/virtualsquare/vuos
(everything can be downloaded and tested)

On Fri, May 31, 2019 at 01:48:39PM +0200, Roman Penyaev wrote:
> Since each such a stack has a set of read/write/etc functions you always
> can extend you stack with another call which returns you event mask,
> specifying what exactly you have to do, e.g.:
>
>     nfds = epoll_wait(epollfd, events, MAX_EVENTS, -1);
>     for (n = 0; n < nfds; ++n) {
>         struct sock *sock;
>
>         sock = events[n].data.ptr;
>         events = sock->get_events(sock, &events[n]);
>
>         if (events & EPOLLIN)
>             sock->read(sock);
>         if (events & EPOLLOUT)
>             sock->write(sock);
>     }
>
> With such a virtual table you can mix all userspace stacks and even
> with normal sockets, for which 'get_events' function can be declared as
>
>     static poll_t kernel_sock_get_events(struct sock *sock,
>                                          struct epoll_event *ev)
>     {
>         return ev->events;
>     }
>
> Do I miss something?

I am not trying to port some tools to use user-space implemented stacks
or device drivers/emulators; I am seeking a general purpose approach.

I think that the example in the section of the README "mount a
user-level networking stack" explains the situation.

The submodule vunetvdestack uses a namespace to define a networking
stack connected to a VDE network (see https://github.com/rd235/vdeplug4).

The API is clean (as can be seen at the end of the file
vunet_modules/vunetvdestack.c).  All the methods but "socket" are
directly mapped to their system call counterparts:

    struct vunet_operations vunet_ops = {
        .socket = vdestack_socket,
        .bind = bind,
        .connect = connect,
        .listen = listen,
        .accept4 = accept4,
        .epoll_ctl = epoll_ctl,
        ...
    }

(the elegance of the API can be seen also in vunet_modules/vunetreal.c:
a 38-line module implementing a gateway to the real networking of the
hosting machine)

Unfortunately I cannot use the same clean interface to support
user-library implemented stacks like lwip/lwipv6/picotcp because I
cannot generate EPOLL events...

Byzantine workarounds can be implemented, based on data structures
exchanged in the data.ptr field of epoll_event that must be decoded by
the hypervisor to retrieve the missing information about the event...
but it would be a pity ;-)

The same problem arises in umdev modules: virtual devices should
generate the same EPOLL events as their real counterparts.

I feel that the ability to generate/synthesize EPOLL events could be
useful for many projects.  (In my first message I included some URLs of
people seeking this feature, retrieved by some queries on a web search
engine.)

Implementations may vary, as well as the kernel API to support such a
feature.  As I said, my proposal has a minimal impact on the code: it
does not require the definition of new syscalls, it simply enhances the
features of eventfd.

> Eventually you come up with such a lock to protect your tcp or whatever
> state machine.  Or you have a real example where read and write paths
> can work completely independently?

Actually the umvu hypervisor uses concurrent tracing of concurrent
processes.  We have named this technique "guardian angels": each
process/thread running in the partial virtual machine has a
corresponding thread in the hypervisor.

So if a process uses two threads to manage a network connection (say a
TCP stream), the two guardian angels replicate their requests towards
the networking module.

So I am looking for a general solution, not for a pattern to port some
projects.  (And I cannot use two different approaches for event driven
and multi-threaded implementations, as I have to support both.)

If you reached this point... Thank you for your patience.
I am more than pleased to receive further comments or proposals.

	renzo
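As an illustration of the "methods mapped directly to system calls"
style mentioned in this message, here is a minimal sketch.  It does not
reproduce the real struct vunet_operations: struct demo_net_ops,
demo_real_ops and demo_register are hypothetical names, and only a few
methods are shown.  It works because the file descriptor comes from the
kernel; a library-implemented stack has no such descriptor able to
raise events, which is the gap discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical, reduced operations table (NOT the real struct
 * vunet_operations): each member has the signature of the system call
 * it stands for. */
struct demo_net_ops {
	int (*socket)(int domain, int type, int protocol);
	int (*listen)(int fd, int backlog);
	int (*epoll_ctl)(int epfd, int op, int fd, struct epoll_event *ev);
	int (*close)(int fd);
};

/* A gateway to the host kernel needs no glue code: the C library
 * wrappers of the system calls are used directly as methods. */
const struct demo_net_ops demo_real_ops = {
	.socket = socket,
	.listen = listen,
	.epoll_ctl = epoll_ctl,
	.close = close,
};

/* Creates a socket through the table and registers it for EPOLLIN on
 * the epoll instance epfd.  Returns 0 on success, -1 on failure. */
int demo_register(const struct demo_net_ops *ops, int epfd)
{
	struct epoll_event ev = { .events = EPOLLIN };
	int fd, ret;

	assert(ops != NULL);
	fd = ops->socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd == -1)
		return -1;
	ret = ops->epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
	ops->close(fd);	/* closing the fd also drops it from epfd */
	return ret;
}
```

A caller would create an epoll instance with epoll_create1(0) and pass
it to demo_register(&demo_real_ops, epfd).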
Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
Hi Roman,

On Fri, May 31, 2019 at 11:34:08AM +0200, Roman Penyaev wrote:
> On 2019-05-27 15:36, Renzo Davoli wrote:
> > Unfortunately this approach cannot be applied to
> > poll/select/ppoll/pselect/epoll.
>
> If you have to override other systemcalls, what is the problem to
> override poll family?  It will add, let's say, 50 extra code lines
> complexity to your userspace code.  All you need is to be woken up by
> *any* event and check one mask variable, in order to understand what
> you need to do: read or write, basically exactly what you do in your
> eventfd modification, but only in userspace.

This approach would not scale.  If I want to use both a (user-space)
network stack and an (emulated) device (or more stacks and devices),
which (overridden) poll would I use?

The poll of the first stack is not able to deal with the third device.

> > > Why can it not be less than 64?
> > This is the implementation of 'write'. The 64 bits include the
> > 'command' EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or
> > EFD_VPOLL_MODEVENTS (in the most significant 32 bits) and the set of
> > events (in the lowest 32 bits).
>
> Do you really need add/del/mod semantics?  Userspace still has to keep
> mask somewhere, so you can have one simple command, which does:
>     ctx->count = events;
> in kernel, so no masks and this games with bits are needed.  That will
> simplify API.

It is true, at the price of having more complex code in user space.
Other system calls could have been implemented as "set the value";
instead they have ADD/DEL modification flags.  I mean for example
sigprocmask (SIG_BLOCK, SIG_UNBLOCK, SIG_SETMASK), or even epoll_ctl.
While poll requires the program to keep the struct pollfd array stored
somewhere, epoll is more powerful and flexible, as different file
descriptors can be added and deleted by different modules/components.
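The 64-bit request word and the three commands can be modelled in
plain user space as follows.  This is a sketch, not the kernel code:
vpoll_model_write is a hypothetical name, the EFD_VPOLL_* values are
the ones used by the example program of the patch, and
VPOLL_MODEL_EVMASK assumes that the events occupy the lowest 32 bits,
as described above (the patch itself masks with EPOLLALLMASK).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Command values as defined in the example program of the patch */
#ifndef EFD_VPOLL_ADDEVENTS
#define EFD_VPOLL_ADDEVENTS (1ULL << 32)
#define EFD_VPOLL_DELEVENTS (2ULL << 32)
#define EFD_VPOLL_MODEVENTS (3ULL << 32)
#endif

/* Assumption of this model: events live in the lowest 32 bits */
#define VPOLL_MODEL_EVMASK 0xffffffffULL

/* User-space model of what a write on an EFD_VPOLL eventfd does to the
 * set of pending events.  Returns 0 and updates *pending, or returns -1
 * (leaving *pending untouched) when the command is unknown. */
int vpoll_model_write(uint64_t *pending, uint64_t request)
{
	uint64_t events = request & VPOLL_MODEL_EVMASK;

	assert(pending != NULL);
	switch (request & ~VPOLL_MODEL_EVMASK) {
	case EFD_VPOLL_ADDEVENTS:	/* set the given bits */
		*pending |= events;
		return 0;
	case EFD_VPOLL_DELEVENTS:	/* clear the given bits */
		*pending &= ~events;
		return 0;
	case EFD_VPOLL_MODEVENTS:	/* replace the whole event set */
		*pending = (*pending & ~VPOLL_MODEL_EVMASK) | events;
		return 0;
	default:
		return -1;
	}
}
```

For instance, starting from an empty set, EFD_VPOLL_ADDEVENTS | EPOLLIN
followed by EFD_VPOLL_ADDEVENTS | EPOLLPRI leaves both bits pending,
and a later EFD_VPOLL_DELEVENTS | EPOLLIN leaves only EPOLLPRI.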
If I have two threads implementing the send and receive paths of a
socket in a user-space network stack implementation, the epoll pending
bitmap is shared, so I have to create critical sections like the
following one any time I need to set or reset a bit:

    pthread_mutex_lock(mylock);
    events |= EPOLLIN;
    write(efd, &events, sizeof(events));
    pthread_mutex_unlock(mylock);

Using add/del semantics, locking is not required, as the send path
thread deals with EPOLLOUT while its sibling receive thread uses
EPOLLIN or EPOLLPRI.

I would prefer the add/del/mod semantics, but if this is generally
perceived as an unnecessary complexity in the kernel code I can update
my patch.

Thank you Roman,

	renzo
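A user-space analogy of the add/del argument, as a sketch only: C11
atomics stand in for the locked update that the kernel would perform on
the pending bitmap, and struct pending_events with its helpers are
hypothetical names.  Each path touches only its own bits through a
single read-modify-write operation, so the caller needs no mutex and no
private copy of the whole mask.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>

/* Stand-in for the pending-event bitmap kept on the kernel side */
struct pending_events {
	_Atomic uint32_t bits;
};

/* ADD semantics: set only the given bits, atomically */
void pending_add(struct pending_events *p, uint32_t events)
{
	assert(p != NULL);
	atomic_fetch_or(&p->bits, events);
}

/* DEL semantics: clear only the given bits, atomically */
void pending_del(struct pending_events *p, uint32_t events)
{
	assert(p != NULL);
	atomic_fetch_and(&p->bits, ~events);
}

/* Current set of pending events */
uint32_t pending_get(struct pending_events *p)
{
	assert(p != NULL);
	return atomic_load(&p->bits);
}
```

A receive path would call pending_add(&p, EPOLLIN) and
pending_del(&p, EPOLLIN), while the send path would use EPOLLOUT in the
same way; neither has to know the bits owned by the other.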
[PATCH v3 1/1] eventfd new tag EFD_VPOLL: generate epoll events
receives an event (or a set of events), it prints it and disarms it.
The following shell session shows a sample run of the program:

    timeout...
    timeout...
    GOT event 1
    timeout...
    GOT event 1
    timeout...
    GOT event 3
    timeout...
    GOT event 2
    timeout...
    GOT event 4
    timeout...
    GOT event 10

Program source:

    #include <sys/eventfd.h>
    #include <sys/epoll.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <stdint.h>             /* Definition of uint64_t */

    #ifndef EFD_VPOLL
    #define EFD_VPOLL (1 << 1)
    #define EFD_VPOLL_ADDEVENTS (1ULL << 32)
    #define EFD_VPOLL_DELEVENTS (2ULL << 32)
    #define EFD_VPOLL_MODEVENTS (3ULL << 32)
    #endif

    #define handle_error(msg) \
        do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /* Send one 64-bit request (command | events) to the vpoll eventfd */
    static void vpoll_ctl(int fd, uint64_t request)
    {
        ssize_t s;

        s = write(fd, &request, sizeof(request));
        if (s != sizeof(uint64_t))
            handle_error("write");
    }

    int
    main(int argc, char *argv[])
    {
        int efd, epollfd;
        struct epoll_event ev;

        ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT |
                    EPOLLHUP | EPOLLPRI;
        ev.data.u64 = 0;

        efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
        if (efd == -1)
            handle_error("eventfd");
        epollfd = epoll_create1(EPOLL_CLOEXEC);
        if (epollfd == -1)
            handle_error("epoll_create1");
        if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1)
            handle_error("epoll_ctl");

        switch (fork()) {
        case 0:
            /* child: generate events at two-second intervals */
            sleep(3);
            vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
            sleep(2);
            vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
            sleep(2);
            vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
            sleep(2);
            vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
            sleep(2);
            vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
            sleep(2);
            vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
            exit(EXIT_SUCCESS);
        default:
            /* parent: print each event, then disarm it */
            while (1) {
                int nfds;

                nfds = epoll_wait(epollfd, &ev, 1, 1000);
                if (nfds < 0)
                    handle_error("epoll_wait");
                else if (nfds == 0)
                    printf("timeout...\n");
                else {
                    printf("GOT event %x\n", ev.events);
                    vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
                    if (ev.events & EPOLLHUP)
                        break;
                }
            }
            break;  /* do not fall through into the fork error case */
        case -1:
            handle_error("fork");
        }
        close(epollfd);
        close(efd);
        return 0;
    }

Signed-off-by: Renzo Davoli
Reported-by: kbuild test robot
---
 fs/eventfd.c                   | 116 +++--
 include/linux/eventfd.h        |   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 117 insertions(+), 8 deletions(-)

Changes in v2:
- Fix size of EFD_VPOLL_*EVENTS constants for 32 bit architectures

Changes in v3:
- Fix sparse warnings and wrong arg of wake_up_locked_poll in
  eventfd_vpoll_write

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..6cdb1b854341 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -24,18 +24,32 @@
 #include
 #include
 
+#define EPOLLALLMASK64 ((__force __u64)EPOLLALLMASK)
+
 static DEFINE_IDA(eventfd_ida);
 
 struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
+	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
+	 *  Every time that a write(2) is performed on an eventfd, the
+	 *  value of the __u64 being written is added to "count" and a
+	 *  wakeup is performed on "wqh". A read(2) will return the "count"
+	 *  value to userspace, and will reset "count" to zero (or decrement
+	 *  "count" by 1 if the flag EFD_SEMAPHORE has been set). Th
Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
On Mon, May 27, 2019 at 09:33:32AM +0200, Greg KH wrote:
> On Sun, May 26, 2019 at 04:25:21PM +0200, Renzo Davoli wrote:
> > This patch implements an extension of eventfd to define file descriptors
> > whose I/O events can be generated at user level. These file descriptors
> > trigger notifications for [p]select/[p]poll/epoll.
> >
> > This feature is useful for user-level implementations of network stacks
> > or virtual device drivers as libraries.
>
> How can this be used to create a "virtual device driver"?  Do you have
> any examples of this new interface being used anywhere?

Networking programs use system calls implementing the Berkeley sockets
API: socket, accept, connect, listen, recv*, send* etc.  Programs
dealing with a device use system calls like open, read, write, ioctl
etc.

When somebody wants to write a library able to behave like a network
stack (say lwipv6, picotcp) or a device, they can implement functions
like my_socket, my_accept, my_open or my_ioctl, as drop-in replacements
of their system call counterparts.  (It is also possible to use dynamic
library magic to rename/divert the system call requests to use their
'virtual' implementation provided by the library: socket maps to
my_socket, recv to my_recv etc.)

In this way portability and compatibility are easier, using a well known
API instead of inventing new ones.

Unfortunately this approach cannot be applied to
poll/select/ppoll/pselect/epoll.  These system calls can refer at the
same time to file descriptors created by 'real' system calls like
socket, open, signalfd... and to file descriptors returned by my_open,
your_socket.

> Also, meta-comment, you should provide some sort of test to kselftests
> for your new feature so that it can actually be tested, as well as a man
> page update (separately).

Sure.  I'll do it ASAP, let me collect suggestions first.
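To make the limitation concrete, here is a small sketch using a pipe as
the stand-in file descriptor that a library could hand out so that
poll() accepts it.  pipe_revents_after_write is a hypothetical helper
written for this illustration.  The only readiness that writing a byte
into the pipe can signal on the read end is POLLIN; there is no way to
make the same descriptor report POLLPRI.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Writes one byte into a fresh pipe and polls its read end for both
 * POLLIN and POLLPRI.  Returns the reported revents, or -1 on error. */
int pipe_revents_after_write(void)
{
	int fds[2];
	struct pollfd pfd;
	char byte = 'x';
	int ret = -1;

	if (pipe(fds) == -1)
		return -1;
	if (write(fds[1], &byte, 1) == 1) {
		pfd.fd = fds[0];
		pfd.events = POLLIN | POLLPRI;
		pfd.revents = 0;
		/* timeout 0: just sample the current state */
		if (poll(&pfd, 1, 0) == 1)
			ret = pfd.revents;
	}
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

The returned mask has POLLIN set and POLLPRI clear, so any further
information about the event has to travel outside the poll interface.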
>
> > Development and porting of code often requires to find the way to
> > wait for I/O events both coming from file descriptors and generated
> > by user-level code (e.g. user-implemented net stacks or drivers).
> > While it is possible to provide a partial support (e.g. using pipes
> > or socketpairs), a clean and complete solution is still missing (as
> > far as I have seen); e.g. I have not seen any clean way to generate
> > EPOLLPRI, EPOLLERR, etc.
>
> What's wrong with pipes or sockets for stuff like this?  Why is epoll
> required?

Example: suppose there is an application waiting for a TCP OOB message.
It uses poll to wait for POLLPRI and then reads the message (e.g. by
'recv').  If I want to port that application to use a network stack
implemented as a library, I have to rewrite the code around 'poll', as
it is not possible to receive a POLLPRI.  From a pipe I can receive only
a POLLIN; I have to encode any further information in an external data
structure.

Using EFD_VPOLL the solution is straightforward: the function my_socket
(used in place of socket to create a file descriptor behaving as a
'real' socket) returns a file descriptor created by eventfd/EFD_VPOLL,
so the poll system call can be left unmodified in the code.  When the
OOB message is available the library can trigger an EPOLLPRI and the
message can be received using my_recv.

> ...omissis...
> >
> > Signed-off-by: Renzo Davoli
> > ---
> >  fs/eventfd.c                   | 115 +++--
> >  include/linux/eventfd.h        |   7 +-
> >  include/uapi/linux/eventpoll.h |   2 +
> >  3 files changed, 116 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index 8aa0ea8c55e8..f83b7d02307e 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -3,6 +3,7 @@
> >   *  fs/eventfd.c
> >   *
> >   *  Copyright (C) 2007  Davide Libenzi
> > + *  EFD_VPOLL support: 2019 Renzo Davoli
>
> No need for this line, that's what the git history shows.
okay

> >   *
> >   */
> >
> > @@ -30,12 +31,24 @@ struct eventfd_ctx {
> >  	struct kref kref;
> >  	wait_queue_head_t wqh;
> >  	/*
> > -	 * Every time that a write(2) is performed on an eventfd, the
> > -	 * value of the __u64 being written is added to "count" and a
> > -	 * wakeup is performed on "wqh". A read(2) will return the "count"
> > -	 * value to userspace, and will reset "count" to zero. The kernel
> > -	 * side eventfd_signal() also, adds to the "count" counter and
> > -	 * issue a wakeup.
> > +	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
> > +	 *  Every time that a write(2) is performed on an eventfd, the
> > +	 *  value of the __u64 being written is added
[PATCH v2 1/1] eventfd new tag EFD_VPOLL: generate epoll events
                else {
                    printf("GOT event %x\n", ev.events);
                    vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
                    if (ev.events & EPOLLHUP)
                        break;
                }
            }
            break;  /* do not fall through into the fork error case */
        case -1:
            handle_error("fork");
        }
        close(epollfd);
        close(efd);
        return 0;
    }

Signed-off-by: Renzo Davoli
---
 fs/eventfd.c                   | 115 +++--
 include/linux/eventfd.h        |   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 116 insertions(+), 8 deletions(-)

Changes in v2:
- Fix size of EFD_VPOLL_*EVENTS constants for 32 bit architectures

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..f83b7d02307e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -3,6 +3,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi
+ *  EFD_VPOLL support: 2019 Renzo Davoli
  *
  */
 
@@ -30,12 +31,24 @@ struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
+	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
+	 *  Every time that a write(2) is performed on an eventfd, the
+	 *  value of the __u64 being written is added to "count" and a
+	 *  wakeup is performed on "wqh". A read(2) will return the "count"
+	 *  value to userspace, and will reset "count" to zero (or decrement
+	 *  "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+	 *  side eventfd_signal() also, adds to the "count" counter and
+	 *  issue a wakeup.
+	 *
+	 * If the EFD_VPOLL flag was set at eventfd creation:
+	 *  count is the set of pending EPOLL events.
+	 *  read(2) returns the current value of count.
+	 *  The argument of write(2) is an 8-byte integer:
+	 *  it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+	 *  EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+	 *  events to be added, deleted to the current set of pending events.
+	 *  (i.e. which bits of "count" must be set or reset).
+	 *  EFD_VPOLL_MODEVENTS redefines the set of pending events.
 	 */
 	__u64 count;
 	unsigned int flags;
@@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	__poll_t events = 0;
+	u64 count;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	count = READ_ONCE(ctx->count);
+
+	events = (count & EPOLLALLMASK);
+
+	return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt = 0;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	res = sizeof(ucnt);
+	ucnt = READ_ONCE(ctx->count);
+	if (put_user(ucnt, (__u64 __user *)buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	__u32 events;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+		return -EFAULT;
+	spin_lock_irq(&ctx->wqh.lock);
+
+	events = ucnt & EPOLLALLMASK;
+	res = sizeof(ucnt);
+	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
+	case EFD_VPOLL_ADDEVENTS:
+		ctx->count |= events;
+		break;
+	case EFD_VPOLL_DELEVENTS:
+		ctx->count &= ~(events);
+		break;
+	case EFD_VPOLL_MODEVENTS:
+		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+	/* wake up waiting threads */
+	if (res >= 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, res);
+
[PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
                else {
                    printf("GOT event %x\n", ev.events);
                    vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
                    if (ev.events & EPOLLHUP)
                        break;
                }
            }
            break;  /* do not fall through into the fork error case */
        case -1:
            handle_error("fork");
        }
        close(epollfd);
        close(efd);
        return 0;
    }

Signed-off-by: Renzo Davoli
---
 fs/eventfd.c                   | 115 +++--
 include/linux/eventfd.h        |   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..f83b7d02307e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -3,6 +3,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi
+ *  EFD_VPOLL support: 2019 Renzo Davoli
  *
  */
 
@@ -30,12 +31,24 @@ struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
+	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
+	 *  Every time that a write(2) is performed on an eventfd, the
+	 *  value of the __u64 being written is added to "count" and a
+	 *  wakeup is performed on "wqh". A read(2) will return the "count"
+	 *  value to userspace, and will reset "count" to zero (or decrement
+	 *  "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+	 *  side eventfd_signal() also, adds to the "count" counter and
+	 *  issue a wakeup.
+	 *
+	 * If the EFD_VPOLL flag was set at eventfd creation:
+	 *  count is the set of pending EPOLL events.
+	 *  read(2) returns the current value of count.
+	 *  The argument of write(2) is an 8-byte integer:
+	 *  it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+	 *  EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+	 *  events to be added, deleted to the current set of pending events.
+	 *  (i.e. which bits of "count" must be set or reset).
+	 *  EFD_VPOLL_MODEVENTS redefines the set of pending events.
 	 */
 	__u64 count;
 	unsigned int flags;
@@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	__poll_t events = 0;
+	u64 count;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	count = READ_ONCE(ctx->count);
+
+	events = (count & EPOLLALLMASK);
+
+	return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt = 0;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	res = sizeof(ucnt);
+	ucnt = READ_ONCE(ctx->count);
+	if (put_user(ucnt, (__u64 __user *)buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	__u32 events;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+		return -EFAULT;
+	spin_lock_irq(&ctx->wqh.lock);
+
+	events = ucnt & EPOLLALLMASK;
+	res = sizeof(ucnt);
+	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
+	case EFD_VPOLL_ADDEVENTS:
+		ctx->count |= events;
+		break;
+	case EFD_VPOLL_DELEVENTS:
+		ctx->count &= ~(events);
+		break;
+	case EFD_VPOLL_MODEVENTS:
+		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+	/* wake up waiting threads */
+	if (res >= 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, res);
+
+	spin_unlock_irq(&ctx->wqh.lock);
+
+	return res;
+
+}
+
 #ifdef
[PATCH 0/1] IPN: Inter Process Networking
Inter Process Networking (PATCH):

This patch adds a new address family for inter process communication.
AF_IPN: inter process networking, i.e. multipoint, multicast/broadcast
communication among processes (and networks).

Contents of this document:
1. What is IPN?
2. Why IPN?
2.1 Why IPN instead of IP Multicast?
2.2 Why IPN instead of AF_NETLINK?
3. How?

We've read all the comments in the previous thread about IPN and we've
tried to answer.

1. WHAT IS IPN?
---------------

IPN is a new address family designed for one-to-many, many-to-many and
peer-to-peer communication among processes.

Berkeley sockets have been designed for client-server or point-to-point
communication; AF_UNIX does not support multicast/broadcast.  AF_IPN
does, in a simple, efficient but extensible way.

IPN is an Inter Process Communication paradigm where all the processes
appear as if they were connected by a networking bus.  On IPN, processes
can interoperate using real networking protocols (e.g. ethernet) but
also using application defined protocols (maybe just sending ascii
strings, video or audio frames, etc).

IPN provides networking (in the broadest definition you can imagine) to
the processes.  Processes can be ethernet nodes, run their own TCP-IP
stacks if they like (e.g. virtual machines), mount ATAonEthernet disks,
etc.

IPN networks can be interconnected with real networks, and IPN networks
running on different computers can interoperate (can be connected by
virtual cables).

IPN is part of the Virtual Square Project (vde, lwipv6, view-os,
umview/kmview, see wiki.virtualsquare.org).

2. WHY IPN?
-----------

Many applications can benefit from IPN.

First of all VDE (Virtual Distributed Ethernet): one service of IPN is a
kernel implementation of VDE.

IPN can be useful for applications where one or some processes feed
their data (*any kind* of data, not only networking-related messages) to
several consuming processes (maybe joining the stream at run time).

IPN sockets can also be connected to tap (tuntap) like interfaces or to
real interfaces (like "brctl addif").  There are specific ioctls to
define a tap interface or grab an existing one.

Several existing services could be implemented (and often could have
extended features) on top of IPN:
- kernel Ethernet bridging
- TUN/TAP
- MACVLAN

IPN could be used (IMHO) to provide multicast services to processes.
Audio frames or video frames could be multiplexed such that multiple
applications can use them.  I think that something like Jack can be
implemented on top of IPN.  Something like a VideoJack can provide video
frames to several applications: e.g. the same image from a camera can be
viewed by xawtv, recorded and sent to a streaming service.

IPN sockets can be used wherever there is the idea of a broadcasting
channel, i.e. where processes can "join (and leave) the information
flow" at runtime.  IPN can be seen as "publish and subscribe".

Different delivery policies can be defined as IPN protocols (loaded as
submodules of ipn.ko).  For instance, an ethernet switch is a policy
(kvde_switch.ko: packets are unicast delivered if the MAC address is
already in the switching hash table).  We are designing an extended
switch, full of interesting features like our userland vde_switch (with
vlan/fst/management etc.), and a layer3 switch, but other policies can
be defined to implement the specific requirements of other services.  I
feel that there are no limits to creativity about multicast services for
processes.

Userspace services (like vde) do exist, but IPN provides a faster and
unified support.

2.1 Why IPN instead of IP Multicast?
------------------------------------

- IPN seems to be faster than IP Multicast (see my message to LKML of
  Dec 06).
- IPN provides file system permission to access the communication
  medium, and it uses the file system for naming.
- IPN does not need any tunneling or packet encapsulation; it works as a
  layer 1 virtual network.
- IPN protocols (implemented by kernel submodules) provide forwarding
  policies: the set of recipients for each message is computed from the
  contents of the message itself.  Ethernet virtual switches or other
  routing rules for any kind of data can be implemented as IPN
  protocols.

2.2 Why IPN instead of AF_NETLINK?
----------------------------------

- Netlink has been designed for user to kernel communication.
- Netlink has many missing features to provide services similar to IPN.
- Currently multicast seems to be allowed for root only.  Access control
  should be added completely.
- The Netlink interface for user processes is not very immediate (libnl
  has been developed as a higher level solution to that).
- Netlink already seems to suffer from "overpopulation": NETLINK_GENERIC
  has been added for "simplified netlink usage" but it adds yet another
  header and rules to be followed.
- Netlink is quite rigid as for message delivery guarantees: unicast
  implies lossless
[PATCH] misc driver: eliminate 256 minor limit & deprecated call register_chrdev
I already posted this patch on September 9th but nobody cared. Is anybody interested in knowing that there is an old limit of 256 minors for misc devices, that we are running out of minor numbers, and that there is a deprecated call in this code?

drivers/char/misc.c: the deprecated call is register_chrdev, and it limits the number of minors to 256. I propose this patch, which eliminates both problems. With this patch misc allocates the entire major 10.

This patch was designed for a previous version of the kernel code (2.6.22?). I have tested it today and it applies to 2.6.24-rc5 with a -12 line offset.

renzo

Signed-off-by: Renzo Davoli <[EMAIL PROTECTED]>

--- a/drivers/char/misc.c 2007-08-05 16:56:59.0 +0200
+++ b/drivers/char/misc.c 2007-09-06 11:07:51.0 +0200
@@ -56,6 +56,8 @@
 static LIST_HEAD(misc_list);
 static DEFINE_MUTEX(misc_mtx);
 
+static struct cdev misc_cdev;
+
 /*
  * Assigned numbers, used for dynamic minors
  */
@@ -273,6 +275,31 @@
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
+static int misc_register_chrdev(void)
+{
+	dev_t from=MKDEV(MISC_MAJOR,0);
+	int rv;
+	int err = -ENOMEM;
+	char *s;
+
+	if ((rv=register_chrdev_region(from,MINORMASK,"misc")) != 0)
+		return rv;
+
+	cdev_init(&misc_cdev, &misc_fops);
+	misc_cdev.owner=misc_fops.owner;
+	kobject_set_name(&misc_cdev.kobj, "%s", "misc");
+	for (s = strchr(kobject_name(&misc_cdev.kobj),'/'); s; s = strchr(s, '/'))
+		*s = '!';
+	err = cdev_add(&misc_cdev, from, MINORMASK);
+	if (err)
+		goto out;
+	return 0;
+out:
+	kobject_put(&misc_cdev.kobj);
+	unregister_chrdev_region(from,MINORMASK);
+	return err;
+}
+
 static int __init misc_init(void)
 {
 #ifdef CONFIG_PROC_FS
@@ -286,7 +313,7 @@
 	if (IS_ERR(misc_class))
 		return PTR_ERR(misc_class);
 
-	if (register_chrdev(MISC_MAJOR,"misc",&misc_fops)) {
+	if (misc_register_chrdev()) {
 		printk("unable to get major %d for misc devices\n",
 		       MISC_MAJOR);
 		class_destroy(misc_class);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
AF_IPN: Inter Process Networking, try these...
Andi, David,

I disagree. If you suspect we would be better off using IP multicast, I think your suspicions are not supported. Try the following exercises, please. Can you provide better solutions without IPN?

renzo

Exercise #1. I am a user (NOT ROOT), I like kvm, qemu etc. I want an efficient network between my VMs.
My solution: I create an IPN socket with protocol IPN_VDESWITCH, and all the VMs can communicate.
Your solution:
- I am condemned by two kernel developers to run the switch in userland.
- I beg the sysadm to give me some pre-allocated taps connected together by a kernel bridge.
- I create a multicast socket limited to this host (TTL=0) and I use it like a hub. It cannot switch the packets.

Exercise #2. I am a sysadm (maybe a lab administrator). I want my users (not root) of the group "vmenabled" to run their VMs connected to a network. I have hundreds of users in vmenabled (say, students).
My solution: I create an IPN socket with protocol IPN_VDESWITCH, connected to a virtual interface, say ipn0. I give the socket permission 760, owner root:vmenabled.
Your solution:
- I am condemned by two kernel developers to run the switch in userland.
- I create a multicast socket connected to a tap and then I define iptables filters to prevent unauthorized users from joining the net.
- I create hundreds of preallocated tap interfaces, at least one per user.

Exercise #3. I am a user (NOT ROOT) and I have a heavy stream of *very private data* generated by some processes that must be received by several processes. I am looking for an efficient solution. Data can be ASCII strings or a binary stream. It is not a "networking" issue, it is just IPC.
My solution: I create an IPN socket with permission 700 and the IPN_BROADCAST protocol. All the processes connect to the socket either for writing or for reading (or both).
Your solution:
- I am condemned by two kernel developers to use inefficient userland solutions like named pipes, tee, or a user daemon among AF_UNIX sockets.
- If I use multicast, others can read the stream. (Security by obscurity? The attacker does not know the address?)
- I use a multicast socket with SSL (it sounds funny to use encryption to talk with myself, exposing the stream to crypto attacks).
Re: New Address Family: Inter Process Networking (IPN)
I have done some raw tests (you can read the code here: http://www.cs.unibo.it/~renzo/rawperftest/).

The programs are quite simple. The sender sends "Hello World" as fast as it can, while the receiver prints time() for each 1 million messages received.

On my laptop, tests on 2000 "Hello World" packets:
One receiver:
multicast 244,000 msg/sec
IPN 333,000 msg/sec (36% faster)
Two receivers:
multicast 174,000 msg/sec
IPN 250,000 msg/sec (43% faster)

Apart from this, how could I implement policies over a multicast socket, e.g. how would a kernel VDE_switch work on multicast sockets? If I send an Ethernet packet over a multicast socket, it can emulate just a hub. (It also seems quite unnatural to me to have TCP-UDP over IP over Ethernet over UDP over IP. Okay, we can skip the Ethernet on localhost, long Ethernet frames get fragmented, but... details.)

On a multicast socket you cannot use policies. I mean that an IPN network (or bus, or group) can have a policy that reads some info in the packet to decide the set of recipients. For a vde_switch it is the destination MAC address: when it is found in the MAC hash table, it selects the recipient port. For MIDI communication it could be the channel number.

If the switching fabric is moved to userland, the performance figures are quite different.

renzo
Re: New Address Family: Inter Process Networking (IPN)
Some more explanations, trying to describe what IPN is and what it is useful for. We are writing the complete patch.

Summary:
* IPN is for inter-process communication. It is *not* directly related to TCP-IP or Ethernet.
* IPN itself is a *level 1* virtual physical network.
* IPN services (like AF_UNIX) do not require root privileges.
* TAP and GRAB are just extra features for IPN delivering Ethernet frames.

* IPN is for inter-process communication. It is *not* directly related to TCP-IP or Ethernet.

If you want, you can call it Inter Process Bus Communication. It is an extension of AF_UNIX. Comments saying that some services can be implemented by using TCP-IP multicast protocols are unrelated to IPN. All AF_UNIX services could be implemented as TCP-IP services on 127.0.0.1. Do we abolish AF_UNIX, then? The problem is that to use TCP-IP you would need to wrap the packets with TCP or UDP, IP and Ethernet headers, and the stack would lose time managing useless protocols. If you just want to send strings to a set of local processes, TCP-IP is overkill. Even X-Window uses AF_UNIX sockets to talk with local clients; it is a performance issue... I think Chris is right.

* IPN itself is a *level 1* virtual physical network.

Like any physical network, you can run higher-level protocols on it. Thus Ethernet, and then TCP-IP, can be services you run on IPN, but there can be IPN networks running neither TCP-IP nor Ethernet.

* IPN services (like AF_UNIX) do not require root privileges.

There are many communication services where the user needs broadcast or p2p among user processes. If a user (not root) wants to run several User-Mode Linux, Qemu or Kvm VMs, the only way to have them connected together is our Virtual Distributed Ethernet. (For this reason VDE exists in almost all the distros, it has been ported to other OSs, and it is already supported in the Linux kernel for User-Mode Linux.)

VDE is a userland daemon, hence it requires two context switches to deliver a packet: VM1 -> K -> Daemon -> K -> VM2. Kvde running on IPN needs just one: VM1 -> K -> VM2. I think D-Bus can use IPN, too. The same cut in context switches applies. May I speculate that there will be a significant increase in performance?

*nix systems are multiuser. It means that there do exist people who need to set up services without root access. And even if you have root access, the less you need to work as root, the safer your system is.

* TAP and GRAB are just extra features for IPN delivering Ethernet frames.

Some IPN networks do use Ethernet as their data-link protocol. It is useful to provide means to connect the IPN socket to a virtual (TAP) interface or to a real (GRAB) interface.

I know that a lot of people use tap interfaces and the kernel bridge to connect virtual machines. The access can be restricted to some users or processes by iptables, but it is not as simple as a chmod on the socket. A lot of people also use tunctl to define tap interfaces for users a priori. They must define as many tuntap interfaces as the number of VMs the users may want, and each user has his/her own taps. Some users define a userland VDE switch to interconnect their VMs.

IPN itself could use a userland process to define a standard TAP interface and lose its time and its CPU cycles moving packets from tap to ipn and vice versa. But IPN is already kernel code, so all those context switches and CPU cycles can be saved by accessing the tap or grabbed interface directly from the kernel. (TAP and GRAB obviously require CAP_NET_ADMIN.)

Using IPN with TAP you can define one single TAP interface connected to an IPN socket. Several VMs can use that IPN socket; in this way the VMs are connected by a (virtual Ethernet) network which includes the TAP interface. Access control to the network (and then to the TAP) is done by setting the permissions of the socket.

Tunctl is *not* able to create a tap where all the users belonging to a group can start their VMs. IPN can.
Re: New Address Family: Inter Process Networking (IPN)
> In the meanwhile we would be grateful if the community could kindly ask to the
> questions above.

Obviously I meant: In the meanwhile we would be grateful if the community could kindly *answer* the questions above.

Sorry (it is early morning here, it happens ;-)

renzo
Re: New Address Family: Inter Process Networking (IPN)
On Wed, Dec 05, 2007 at 04:55:52PM -0500, Stephen Hemminger wrote:
> On Wed, 5 Dec 2007 17:40:55 +0100
> [EMAIL PROTECTED] (Renzo Davoli) wrote:
> > 0- (Constructive) comments.
> > 1- The "official" assignment of an Address Family.
> > 2- Another "grabbing hook" for interfaces (like the ones already
> > We are studying some way to register/deregister grabbing services,
> > I feel this would be the cleanest way.
>
> Post complete source code for kernel part to [EMAIL PROTECTED]

I'll do it as soon as possible.

> If you want the hooks, you need to include the full source code for inclusion
> in mainline. All the Documentation/SubmittingPatches rules apply;
> you can't just ask for "facilitators" and expect to keep your stuff out of
> tree.

I am sorry if I was misunderstood. I did not want any "facilitator", nor did I want to keep my code outside the kernel; on the contrary. It is perfectly okay for me to provide the entire code for inclusion.

The purposes of my message were the following:
- I wanted to introduce the idea and tell the linux kernel community that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF? Is there an "AF registry" somewhere (like the device major/minor registry or the well-known port assignment for TCP-IP)?
- Hook: we have two different options. We can add another grabbing inline function like those used by the bridge and macvlan, or we can design a grabbing service registration facility. Which one is preferable? The former is simpler; the latter is more elegant but it requires some changes in the kernel bridge code. So the choice is between the former (less invasive, safer, inelegant) and the latter (more invasive, less safe, elegant).

We need a bit of time to stabilize the code: deeply testing the existing features and implementing some more ideas we have on it. In the meanwhile we would be grateful if the community could kindly ask to the questions above.

renzo
Re: New Address Family: Inter Process Networking (IPN)
On Thu, Dec 06, 2007 at 12:39:22AM +0100, Andi Kleen wrote:
> [EMAIL PROTECTED] (Renzo Davoli) writes:
>
> > Berkeley socket have been designed for client server or point to point
> > communication. All existing Address Families implement this idea.
> Netlink is multicast/broadcast by default for once. And BC/MC certainly
> works for IPv[46] and a couple of other protocols too.
>
> > IPN is an Inter Process Communication paradigm where all the processes
> > appear as they were connected by a networking bus.
>
> Sounds like netlink. See also RFC 3549

RFC 3549 says: "This document describes Linux Netlink, which is used in Linux both as an intra-kernel messaging system as well as between kernel and user space."

We know AF_NETLINK; our user-space stack lwipv6 supports it.

AF_IPN is different. AF_IPN is the broadcast and peer-to-peer extension of AF_UNIX. It supports communication among *user* processes.

Example: Qemu, User-Mode Linux, Kvm and our umview machines can use IPN as an Ethernet hub and communicate among themselves, with the hosting computer and with the world through a tap-like interface. You can also grab an interface (say eth1) and use eth0 for your hosting computer and eth1 for the IPN network of virtual machines. If you load the kvde_switch submodule, IPN can be a virtual Ethernet switch. This example is already working using the svn versions of ipn and vdeplug.

Another example: you have a continuous stream of data packets generated by a process, and you want to send this data to many processes. Maybe the set of processes is not known in advance, and you want to send the data to any interested process: some kind of publish communication service (among unix processes, not on TCP-IP). Without IPN you need a server. With IPN the sender creates the socket, connects to it and feeds it with data packets. All the interested receivers connect to it and start reading. That's all.

I hope that this message gives a better understanding of what IPN is.

renzo
New Address Family: Inter Process Networking (IPN)
Inter Process Networking: a kernel module (and some simple kernel patches) to provide AF_IPN: a new address family for process networking, i.e. multipoint, multicast/broadcast communication among processes (and networks). WHAT IS IT? --- Berkeley socket have been designed for client server or point to point communication. All existing Address Families implement this idea. IPN is a new address family designed for one-to-many, many-to-many and peer-to-peer communication among processes. IPN is an Inter Process Communication paradigm where all the processes appear as they were connected by a networking bus. On IPN, processes can interoperate using real networking protocols (e.g. ethernet) but also using application defined protocols (maybe just sending ascii strings, video or audio frames, etc). IPN provides networking (in the broaden definition you can imagine) to the processes. Processes can be ethernet nodes, run their own TCP-IP stacks if they like (e.g. virtual machines), mount ATAonEthernet disks, etc.etc. IPN networks can be interconnected with real networks or IPN networks running on different computers can interoperate (can be connected by virtual cables). IPN is part of the Virtual Square Project (vde, lwipv6, view-os, umview/kmview, see wiki.virtualsquare.org). WHY? Many applications can benefit from IPN. First of all VDE (Virtual Distributed Ethernet): one service of IPN is a kernel implementation of VDE. IPN can be useful for applications where one or some processes feed their data to several consuming processes (maybe joining the stream at run time). IPN sockets can be also connected to tap (tuntap) like interfaces or to real interfaces (like "brctl addif"). There are specific ioctls to define a tap interface or grab an existing one. Several existing services could be implemented (and often could have extended features) on the top of IPN: - kernel bridge - tuntap - macvlan IPN could be used (IMHO) to provide multicast services to processes. 
Audio frames or video frames could be multiplexed such that multiple applications can use them. I think that something like Jack can be implemented on the top of IPN. Something like a VideoJack can provide video frames to several applications: e.g. the same image from a camera can be viewed by xawtv, recorded and sent to a streaming service. IPN sockets can be used wherever there is the idea of broadcasting channel i.e. where processes can "join (and leave) the information flow" at runtime. Different delivery policies can be defined as IPN protocols (loaded as submodules of ipn.ko). e.g. ethernet switch is a policy (kvde_switch.ko: packets are unicast delivered if the MAC address is already in the switching hash table), we are designing an extendended switch, full of interesting features like our userland vde_switch (with vlan/fst/manamement etc..), and a layer3 switch, but other policies can be defined to implement the specific requirements of other services. I feel that there is no limits to creativity about multicast services for processes. Userspace services (like vde or jack) do exist, but IPN provides a faster and unified support. HOW? The complete specifications for IPN can be found here: http://wiki.virtualsquare.org/index.php/IPN bind() creates the socket (if it does not already exist). When bind() succeeds, the process has the right to manage the "network". No data is received or can be send if the socket is not connected (only get/setsockopt and ioctl work on bound unconnected sockets). connect() is used to join the network. When the socket is connected it is possible to send/receive data. If the socket is already bound it is useless to specify the socket again (you can use NULL, or specify the same address). connect() can be also used without bind(). In this case the process sends and receives data but it cannot manage the network (in this case the socket address specification is required). 
listen() and accept() are for servers, thus they do not exist for IPN.

Examples:

1- Peer-to-Peer Communication:
Several processes run the same code:

    struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
    int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST);
    /* bind() creates the network (if needed) and gives the right to manage it */
    err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
    /* already bound: no address needed to join */
    err=connect(s,NULL,0);

In this case all the messages sent by each process are received by all the other processes (IPN_BROADCAST). The processes need to be able to receive data when there are pending packets, e.g. by using poll/select and event-driven programming, or multithreading.

2- (One or) Some senders/many receivers
The sender runs the following code:

    struct sockaddr_un sun={.sun_family=AF_IPN,.sun_path="/tmp/sockipn"};
    int s=socket(AF_IPN,SOCK_RAW,IPN_BROADCAST);
    /* the sender never reads from the network */
    err=shutdown(s,SHUT_RD);
    err=bind(s,(struct sockaddr *)&sun,sizeof(sun));
    err=connect(s,NULL,0);

The receivers do not need to define the network, thus they skip the bind():

    struct sockaddr_un
Re: New Address Family: Inter Process Networking (IPN)
On Thu, Dec 06, 2007 at 12:39:22AM +0100, Andi Kleen wrote:
> [EMAIL PROTECTED] (Renzo Davoli) writes:
>> Berkeley socket have been designed for client server or point to point communication. All existing Address Families implement this idea.
> Netlink is multicast/broadcast by default for once. And BC/MC certainly works for IPv[46] and a couple of other protocols too.
>> IPN is an Inter Process Communication paradigm where all the processes appear as they were connected by a networking bus.
> Sounds like netlink. See also RFC 3549

RFC 3549 says: "This document describes Linux Netlink, which is used in Linux both as an intra-kernel messaging system as well as between kernel and user space."

We know AF_NETLINK; our user-space stack lwipv6 supports it. AF_IPN is different: it is the broadcast and peer-to-peer extension of AF_UNIX. It supports communication among *user* processes.

Example: Qemu, User-Mode Linux, Kvm and our umview machines can use IPN as an Ethernet hub and communicate among themselves, with the hosting computer and with the world through a tap-like interface. You can also grab an interface (say eth1) and use eth0 for your hosting computer and eth1 for the IPN network of virtual machines. If you load the kvde_switch submodule, IPN can be a virtual Ethernet switch. This example is already working using the svn versions of ipn and vdeplug.

Another example: you have a continuous stream of data packets generated by a process, and you want to send this data to many processes. Maybe the set of processes is not known in advance and you want to send the data to any interested process: some kind of publish/subscribe communication service (among unix processes, not on TCP/IP). Without IPN you need a server. With IPN the sender creates the socket, connects to it and feeds it with data packets. All the interested receivers connect to it and start reading. That's all.

I hope that this message gives a better understanding of what IPN is.
renzo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
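The receiver side of the publish/subscribe case described above could look roughly like the following. This is only a sketch under stated assumptions: AF_IPN has no assigned number, so the values given here for AF_IPN and IPN_BROADCAST are placeholders, and the helper names ipn_make_addr() and ipn_subscribe() are invented for illustration. It follows the rule from the original announcement that a process which skips bind() must pass the socket address to connect().

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Placeholders: AF_IPN has no assigned number and IPN_BROADCAST comes from
 * the out-of-tree IPN headers. Both values below are hypothetical. */
#ifndef AF_IPN
#define AF_IPN 13
#endif
#ifndef IPN_BROADCAST
#define IPN_BROADCAST 1
#endif

/* Fill an IPN socket address (hypothetical helper).
 * Returns 0 on success, -1 if the path does not fit in sun_path. */
int ipn_make_addr(const char *path, struct sockaddr_un *out)
{
	assert(path != NULL && out != NULL);
	if (strlen(path) >= sizeof(out->sun_path))
		return -1;
	memset(out, 0, sizeof(*out));
	out->sun_family = AF_IPN;
	strcpy(out->sun_path, path);
	return 0;
}

/* Join an existing IPN network without bind(): the process can exchange
 * data but cannot manage the network, and the address must be passed to
 * connect(). Returns the socket descriptor, or -1 on error. */
int ipn_subscribe(const char *path)
{
	struct sockaddr_un sun;
	int s;

	if (ipn_make_addr(path, &sun) < 0)
		return -1;
	s = socket(AF_IPN, SOCK_RAW, IPN_BROADCAST);
	if (s < 0)
		return -1;	/* expected on kernels without the IPN module */
	if (connect(s, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
		close(s);
		return -1;
	}
	return s;
}
```

A subscriber would then call ipn_subscribe("/tmp/sockipn") and read() from the returned descriptor, typically inside a poll/select loop. On a kernel without the IPN module, socket() is expected to fail and the helper returns -1.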
Re: New Address Family: Inter Process Networking (IPN)
On Wed, Dec 05, 2007 at 04:55:52PM -0500, Stephen Hemminger wrote:
> On Wed, 5 Dec 2007 17:40:55 +0100 [EMAIL PROTECTED] (Renzo Davoli) wrote:
>> 0- (Constructive) comments.
>> 1- The official assignment of an Address Family.
>> 2- Another grabbing hook for interfaces (like the ones already
>> We are studying some way to register/deregister grabbing services, I feel this would be the cleanest way.
> Post complete source code for kernel part to [EMAIL PROTECTED]

I'll do it as soon as possible.

> If you want the hooks, you need to include the full source code for inclusion in mainline. All the Documentation/SubmittingPatches rules apply; you can't just ask for facilitators and expect to keep your stuff out of tree.

I am sorry if I was misunderstood. I did not want any facilitator, nor did I want to keep my code outside the kernel; on the contrary, it is perfectly okay for me to provide the entire code for inclusion.

The purposes of my message were the following:
- I wanted to introduce the idea and tell the linux kernel community that a team is working on it.
- Address family: is it okay to send a patch that adds a new AF? Is there an AF registry somewhere (like the device major/minor registry or the well-known port assignment for TCP/IP)?
- Hook: we have two different options. We can add another grabbing inline function like those used by the bridge and macvlan, or we can design a grabbing service registration facility. Which one is preferable? The former is simpler; the latter is more elegant but it requires some changes in the kernel bridge code. So the choice is between the former (less invasive, safer, inelegant) and the latter (more invasive, less safe, elegant).

We need a bit of time to stabilize the code: deeply testing the existing features and implementing some more ideas we have on it. In the meanwhile we would be grateful if the community could kindly ask to the questions above.
renzo
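To make the second hook option more concrete, here is a userspace toy model of what a "grabbing service registration facility" might look like. It is not kernel code and not the IPN implementation: the table size, the grab_fn type and all function names are hypothetical, chosen only to illustrate the register/deregister idea against a fixed inline hook.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GRABBERS 4	/* arbitrary size for this toy model */

/* A grabber returns 1 if it consumed the frame, 0 to pass it on. */
typedef int (*grab_fn)(const void *frame, size_t len);

static grab_fn grabbers[MAX_GRABBERS];

/* Register a grabbing service. Returns 0, or -1 if the table is full. */
int grab_register(grab_fn fn)
{
	assert(fn != NULL);
	for (size_t i = 0; i < MAX_GRABBERS; i++) {
		if (grabbers[i] == NULL) {
			grabbers[i] = fn;
			return 0;
		}
	}
	return -1;
}

/* Deregister a grabbing service. Returns 0, or -1 if it was not found. */
int grab_unregister(grab_fn fn)
{
	for (size_t i = 0; i < MAX_GRABBERS; i++) {
		if (grabbers[i] == fn) {
			grabbers[i] = NULL;
			return 0;
		}
	}
	return -1;
}

/* Offer the frame to each registered grabber in slot order.
 * Returns 1 as soon as one of them consumes it, 0 otherwise. */
int grab_dispatch(const void *frame, size_t len)
{
	for (size_t i = 0; i < MAX_GRABBERS; i++) {
		if (grabbers[i] != NULL && grabbers[i](frame, len))
			return 1;
	}
	return 0;
}

/* Sample grabber used only for demonstration: takes any non-empty frame. */
int grab_nonempty(const void *frame, size_t len)
{
	(void)frame;
	return len > 0;
}
```

In this model the bridge, macvlan and IPN would each register a grabber instead of each adding its own inline test to the receive path. That is the "more elegant, more invasive" trade-off mentioned above, since existing users would have to be converted to the registry.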
Re: New Address Family: Inter Process Networking (IPN)
> In the meanwhile we would be grateful if the community could kindly ask to the questions above.

Obviously I meant: In the meanwhile we would be grateful if the community could kindly *answer* the questions above.

Sorry (it is early morning here, it happens ;-)

renzo
[PATCH] drivers/char/misc.c: deprecated call register_chrdev, and 256 minor limit eliminated
Dear Kernel Developers,

I have seen that drivers/char/misc.c still uses the deprecated register_chrdev call, which limits the number of minors to 256. I recently asked for a misc minor number assignment for View-OS/kmview and Torben Mathiasen told me about this issue.

I propose this patch, which eliminates both problems. With this patch misc allocates the entire major 10 and avoids the deprecated call.

These are just my 2 (euro) cents, I hope it can be useful.

renzo

--
Renzo Davoli            | Dept. of Computer Science
(NIC rd235, HAM IZ4DJE) | University of Bologna
Tel. +39 051 2094501    | Mura Anteo Zamboni, 7
Fax. +39 051 2094510    | I-40127 Bologna ITALY
Key fingerprint = A019 17E2 5562 06F6 77BB 2E93 1A01 F646 30EA B487

--- drivers/char/misc.orig.c	2007-08-05 16:56:59.0 +0200
+++ drivers/char/misc.c	2007-09-06 11:07:51.0 +0200
@@ -56,6 +56,8 @@
 static LIST_HEAD(misc_list);
 static DEFINE_MUTEX(misc_mtx);
 
+static struct cdev misc_cdev;
+
 /*
  * Assigned numbers, used for dynamic minors
  */
@@ -273,6 +275,31 @@
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
+static int misc_register_chrdev(void)
+{
+	dev_t from=MKDEV(MISC_MAJOR,0);
+	int rv;
+	int err = -ENOMEM;
+	char *s;
+
+	if ((rv=register_chrdev_region(from,MINORMASK,"misc")) != 0)
+		return rv;
+
+	cdev_init(&misc_cdev, &misc_fops);
+	misc_cdev.owner=misc_fops.owner;
+	kobject_set_name(&misc_cdev.kobj, "%s", "misc");
+	for (s = strchr(kobject_name(&misc_cdev.kobj),'/'); s; s = strchr(s, '/'))
+		*s = '!';
+	err = cdev_add(&misc_cdev, from, MINORMASK);
+	if (err)
+		goto out;
+	return 0;
+out:
+	kobject_put(&misc_cdev.kobj);
+	unregister_chrdev_region(from,MINORMASK);
+	return err;
+}
+
 static int __init misc_init(void)
 {
 #ifdef CONFIG_PROC_FS
@@ -286,7 +313,7 @@
 	if (IS_ERR(misc_class))
 		return PTR_ERR(misc_class);
 
-	if (register_chrdev(MISC_MAJOR,"misc",&misc_fops)) {
+	if (misc_register_chrdev()) {
 		printk("unable to get major %d for misc devices\n",
 		       MISC_MAJOR);
 		class_destroy(misc_class);
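The arithmetic behind the 256-minor limit and the range requested by the patch can be checked with a small userspace sketch. The constants mirror the kernel's MINORBITS/MINORMASK/MISC_MAJOR definitions; the function names (misc_mkdev, dev_major, dev_minor, last_minor) are made up for this illustration and are not kernel APIs.

```c
#include <assert.h>

/* Values as defined in the kernel headers (kdev_t.h, major.h). */
#define MINORBITS 20
#define MINORMASK ((1U << MINORBITS) - 1)
#define MISC_MAJOR 10

/* Number of minors covered by the old register_chrdev() call. */
#define OLD_MINOR_COUNT 256U

/* Kernel-style device number: major in the high bits, minor in the low 20. */
unsigned int misc_mkdev(unsigned int major, unsigned int minor)
{
	assert(minor <= MINORMASK);
	return (major << MINORBITS) | minor;
}

unsigned int dev_major(unsigned int dev)
{
	return dev >> MINORBITS;
}

unsigned int dev_minor(unsigned int dev)
{
	return dev & MINORMASK;
}

/* Highest minor covered when 'count' minors are registered from minor 0. */
unsigned int last_minor(unsigned int count)
{
	return count - 1;
}
```

With these definitions, last_minor(OLD_MINOR_COUNT) is 255, while last_minor(MINORMASK) is 1048574. Passing MINORMASK as the count, as the patch does, therefore covers minors 0 through 1048574, which is one short of the full 20-bit minor space.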
Re: [PATCH] drivers/char/misc.c: deprecated call register_chrdev, and 256 minor limit eliminated
On Thu, Sep 06, 2007 at 11:38:44AM +0200, Renzo Davoli wrote:
> Dear Kernel Developers,
>
> I have seen that drivers/char/misc.c still uses the deprecated register_chrdev call, which limits the number of minors to 256. I recently asked for a misc minor number assignment for View-OS/kmview and Torben Mathiasen told me about this issue.
>
> I propose this patch, which eliminates both problems. With this patch misc allocates the entire major 10 and avoids the deprecated call.
>
> These are just my 2 (euro) cents, I hope it can be useful.
>
> renzo

Signed-off-by: Renzo Davoli <[EMAIL PROTECTED]>

--- drivers/char/misc.orig.c	2007-08-05 16:56:59.0 +0200
+++ drivers/char/misc.c	2007-09-06 11:07:51.0 +0200
@@ -56,6 +56,8 @@
 static LIST_HEAD(misc_list);
 static DEFINE_MUTEX(misc_mtx);
 
+static struct cdev misc_cdev;
+
 /*
  * Assigned numbers, used for dynamic minors
  */
@@ -273,6 +275,31 @@
 EXPORT_SYMBOL(misc_register);
 EXPORT_SYMBOL(misc_deregister);
 
+static int misc_register_chrdev(void)
+{
+	dev_t from=MKDEV(MISC_MAJOR,0);
+	int rv;
+	int err = -ENOMEM;
+	char *s;
+
+	if ((rv=register_chrdev_region(from,MINORMASK,"misc")) != 0)
+		return rv;
+
+	cdev_init(&misc_cdev, &misc_fops);
+	misc_cdev.owner=misc_fops.owner;
+	kobject_set_name(&misc_cdev.kobj, "%s", "misc");
+	for (s = strchr(kobject_name(&misc_cdev.kobj),'/'); s; s = strchr(s, '/'))
+		*s = '!';
+	err = cdev_add(&misc_cdev, from, MINORMASK);
+	if (err)
+		goto out;
+	return 0;
+out:
+	kobject_put(&misc_cdev.kobj);
+	unregister_chrdev_region(from,MINORMASK);
+	return err;
+}
+
 static int __init misc_init(void)
 {
 #ifdef CONFIG_PROC_FS
@@ -286,7 +313,7 @@
 	if (IS_ERR(misc_class))
 		return PTR_ERR(misc_class);
 
-	if (register_chrdev(MISC_MAJOR,"misc",&misc_fops)) {
+	if (misc_register_chrdev()) {
 		printk("unable to get major %d for misc devices\n",
 		       MISC_MAJOR);
 		class_destroy(misc_class);

--
Renzo Davoli            | Dept. of Computer Science
(NIC rd235, HAM IZ4DJE) | University of Bologna
Tel. +39 051 2094501    | Mura Anteo Zamboni, 7
Fax. +39 051 2094510    | I-40127 Bologna ITALY
Key fingerprint = A019 17E2 5562 06F6 77BB 2E93 1A01 F646 30EA B487