Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 00:09, Johann Borck wrote: Regarding mukevent I'm thinking of a event-type specific struct, that is filled by the originating code, and placed into a per-event-type ring buffer (which requires modification of kevent_wait). I'd personally worry about an implementation that used a per-event-type ring buffer, because you're still left having to hack around starvation issues in user-space. It is of course possible under the current model for anyone who wants per-event-type ring buffers to have them - just make separate kevent sets. I haven't thought this through all the way yet, but why not have variable length event structures and have the kernel fill in a next pointer in each one? This could even be used to keep backwards binary compatibility while adding additional fields to the structures over time, though no space would be wasted on modern programs. You still end up with a question of what to do in case of overflow, but I'm thinking the thing to do in that case might be to start pushing overflow events onto a linked list which can be written back into the ring buffer when space becomes available. The appropriate behavior would be to throw new events on the linked list if the linked list had any events, so that things are delivered in order, but write to the mapped buffer directly otherwise. Deciding when to do that is tricky, and I haven't thought through the implications fully when I say this, but what about activating a bottom half when more space becomes available, and let that drain overflowed events back into the mapped buffer? Or perhaps the time to do it would be in the next blocking wait, when the queue emptied? I think it is very important to avoid any limits that can not be adjusted on the fly at run-time by CAP_SYS_ADMIN or what have you. 
Doing it this way may have other problems I've ignored but at least the big one - compile-time capacity limits in the year 2006 - would be largely avoided :P Nothing real solid yet, just some electrical storms in the grey matter... Thanks, Chase - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 07:10:14AM +0200, Johann Borck ([EMAIL PROTECTED]) wrote: Ulrich Drepper wrote: Evgeniy Polyakov wrote: Existing design does not allow overflow. And I've pointed out a number of times that this is not practical at best. There are event sources which can create events which cannot be coalesced into one single event as it would be required with your design. Signals are one example, specifically realtime signals. If we do not want the design to be limited from the start this approach has to be thought over. So zap mmap() support completely, since it is not usable at all. We wont discuss on it. Initial implementation did not have it. But I was requested to do it, and it is ready now. No one likes it, but no one provides an alternative implementation. We are stuck. We need the mapped ring buffer. The current design (before it was removed) was broken but this does not mean it shouldn't be implemented. We just need more time to figure out how to implement it correctly. Considering the if at all and if then how of ring buffer implemetation I'd like to throw in some ideas I had when reading the discussion and respective code. If I understood Ulrich Drepper right, his notion of a generic event handling interface is, that it has to be flexible enough to transport additional info from origin to userspace, and to support queuing of events from the same origin, so that additional per-event-occurrence data doesn't get lost, which would happen when coalescing multiple events into one until delivery. From what I read he says ring buffer is broken because of insufficient space for additional data (mukevent) and the limited number of events that can be put into ring buffer. Another argument is missing notification of userspace about dropped events in case ring buffer limit is reached. (is that right?) I can add such notification, but its existense _is_ the broken design. 
Once that condition has happened, all new events will disappear from the mapped buffer (although they are still accessible through the usual queue). While writing this I have come up with an idea for how to improve the handling of the mapped buffer's size - we can give it a limited size, and when it is full, some bit will be set in the shared area and obviously no new events can be added there, but when the user commits some events from that buffer (i.e. tells the kernel that the appropriate kevents can be freed or requeued according to their flags), new ready events from the ready queue can be copied into the mapped buffer. It still does not solve (and I do insist that it is broken behaviour) the case when the kernel is going to generate an infinite number of events for one requested by userspace (as in the case of generating a new 'data_has_arrived' event each time a new byte has been received). Userspace events are only marked as ready, they are not generated - it is a high-performance _feature_ of the new design, not some kind of a bug.

I see no reason why kevent couldn't be modified to fit (all) these needs. While modifying the server-example and writing a client using kevent I came across the coalescing problem: there were more incoming connections than accept events, and I had to work around that.

Btw, the accept() issue is exactly the same as with usual poll() - repeated insertion of the same kevent will fire immediately, which requires the event to be one-shot. One of the initial implementations contained the number of sockets ready for accept as one of the returned parameters, though.

In this case the pure number of coalesced events would suffice, while it wouldn't for the example of RT-signals that Ulrich Drepper gave. So whether coalescing can be done at all or is impossible depends on the type of event. The same goes for additional data delivered with the events. There might be no panacea for all possible scenarios with one fixed design.
Either performance suffers for 'lightweight' events which don't need additional data and/or coalescing is not problematic and/or ring buffer, or kevent is not usable for other types of events. Why not treat different things differently, and let the (kernel-)user decide. I don't know if I got all this right, but if, then ring buffer is needed especially for cases where coalescing is not possible and additional data has to be delivered for each triggered notification (so the pure number of events is not enough; other reasons? performance? ). To me it doesn't make sense to have kevent fill memory and use processor-time if buffer is not used at all, which is the case when using kevent_getevents. So here are my Ideas: Make usage of ring buffer optional, if not required for specific event-type it might be chosen by userspace-code. Make limit of events in ring buffer optional and controllable from userspace. It is of course possible, main problem is that existing design of the mapped buffer is not sufficient, and there are no other propositions except that 'it
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 12:59:47AM -0500, Chase Venters ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 00:09, Johann Borck wrote: Regarding mukevent I'm thinking of a event-type specific struct, that is filled by the originating code, and placed into a per-event-type ring buffer (which requires modification of kevent_wait). I'd personally worry about an implementation that used a per-event-type ring buffer, because you're still left having to hack around starvation issues in user-space. It is of course possible under the current model for anyone who wants per-event-type ring buffers to have them - just make separate kevent sets. I haven't thought this through all the way yet, but why not have variable length event structures and have the kernel fill in a next pointer in each one? This could even be used to keep backwards binary compatibility while Why do we want variable size structures in mmap ring buffer? adding additional fields to the structures over time, though no space would be wasted on modern programs. You still end up with a question of what to do in case of overflow, but I'm thinking the thing to do in that case might be to start pushing overflow events onto a linked list which can be written back into the ring buffer when space becomes available. The appropriate behavior would be to throw new events on the linked list if the linked list had any events, so that things are delivered in order, but write to the mapped buffer directly otherwise. I think in a similar way. Kevent actually do not require such list, since it has already queue of the ready events. -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 05:42, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 12:59:47AM -0500, Chase Venters ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 00:09, Johann Borck wrote: Regarding mukevent I'm thinking of a event-type specific struct, that is filled by the originating code, and placed into a per-event-type ring buffer (which requires modification of kevent_wait). I'd personally worry about an implementation that used a per-event-type ring buffer, because you're still left having to hack around starvation issues in user-space. It is of course possible under the current model for anyone who wants per-event-type ring buffers to have them - just make separate kevent sets. I haven't thought this through all the way yet, but why not have variable length event structures and have the kernel fill in a next pointer in each one? This could even be used to keep backwards binary compatibility while Why do we want variable size structures in mmap ring buffer? Flexibility primarily. So when we all decide to add a new event type six months from now, or add more information to an existing one, we don't run the risk that the existing mukevent isn't big enough. adding additional fields to the structures over time, though no space would be wasted on modern programs. You still end up with a question of what to do in case of overflow, but I'm thinking the thing to do in that case might be to start pushing overflow events onto a linked list which can be written back into the ring buffer when space becomes available. The appropriate behavior would be to throw new events on the linked list if the linked list had any events, so that things are delivered in order, but write to the mapped buffer directly otherwise. I think in a similar way. Kevent actually do not require such list, since it has already queue of the ready events. The current event types coalesce if there are multiple events, correct? 
It sounds like there may be other event types where coalescing multiple events is not the correct approach. Thanks, Chase
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote:

I can add such notification, but its existence _is_ the broken design. Once that condition has happened, all new events will disappear from the mapped buffer (although they are still accessible through the usual queue). While writing this I have come up with an idea for how to improve the handling of the mapped buffer's size - we can give it a limited size, and when it is full, some bit will be set in the shared area and obviously no new events can be added there, but when the user commits some events from that buffer (i.e. tells the kernel that the appropriate kevents can be freed or requeued according to their flags), new ready events from the ready queue can be copied into the mapped buffer. It still does not solve (and I do insist that it is broken behaviour) the case when the kernel is going to generate an infinite number of events for one requested by userspace (as in the case of generating a new 'data_has_arrived' event each time a new byte has been received).

Behavior is not broken. It's quite useful and works 99.% of time. I was trying to suggest this to you, but you missed my point. You don't want to use a bit, but a full 32-bit sequence counter. A program may handle XXX.XXX handles, but use 'only' a 4096-entry ring buffer.

The user program keeps a local copy of a special word named 'ring_buffer_full_counter'. Each time the kernel cannot queue an event in the ring buffer, it increases ring_buffer_was_full_counter (exported to the user app in the mmap view). When the user application notices that the kernel has changed ring_buffer_was_full_counter, it does a full scan of all file handles (preferably using poll() to get all relevant info in one syscall):

    do {
        if (read_event_from_mmap()) { handle_event(fd); continue; }
        /* ring buffer is empty, check if we missed some events */
        if (unlikely(mmap->ring_buffer_full_counter != my_ring_buffer_full_counter)) {
            my_ring_buffer_full_counter = mmap->ring_buffer_full_counter;
            /* slow PATH */
            /* can use a big poll() for example, or just a loop without poll() */
            for_all_file_desc_do() {
                check if some event/data is waiting on THIS fd
            }
        } else
            syscall_wait_for_one_available_kevent(queue)
    }

This is how a program can recover. If the ring buffer has a reasonable size, this kind of event should not happen very frequently. If it does (because events continue to fill the ring buffer during recovery and might hit FULL again), maybe a smart program is able to resize the ring buffer and start using it after yet another recovery pass. If not, we don't care, because a big poll() gives us many ready file descriptors in one syscall, and maybe this is much better than kevent/epoll when XX.XXX events are ready.

Eric
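The counter-based recovery described above can be modelled in plain C without any kernel support. The following is only a sketch under stated assumptions: `struct shared_ring`, `ring_push()`, `ring_pop()` and `ring_missed_events()` are hypothetical names invented for illustration, the "kernel" side is an ordinary function, and the ring is deliberately tiny. None of this is the kevent API.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 4u  /* tiny on purpose, so that overflow is easy to trigger */

/* Hypothetical layout of the area shared between producer and consumer. */
struct shared_ring {
    uint32_t head;          /* next slot the consumer reads */
    uint32_t tail;          /* next slot the producer writes */
    uint32_t full_counter;  /* bumped each time an event could not be stored */
    int      fds[RING_SIZE];
};

/* Producer side (the kernel, in the proposal). Returns false on overflow
 * and records the overflow in the sequence counter. */
static bool ring_push(struct shared_ring *r, int fd)
{
    if (r->tail - r->head == RING_SIZE) {
        r->full_counter++;
        return false;
    }
    r->fds[r->tail % RING_SIZE] = fd;
    r->tail++;
    return true;
}

/* Consumer side: fast path, no syscall involved. */
static bool ring_pop(struct shared_ring *r, int *fd)
{
    if (r->head == r->tail)
        return false;
    *fd = r->fds[r->head % RING_SIZE];
    r->head++;
    return true;
}

/* Consumer side: compares the shared counter with the local copy.
 * Returns true when at least one event was dropped since the last check,
 * which is the signal to run the slow path (for example one big poll()). */
static bool ring_missed_events(const struct shared_ring *r, uint32_t *seen)
{
    if (r->full_counter == *seen)
        return false;
    *seen = r->full_counter;
    return true;
}
```

A consumer loop would call `ring_pop()` until it returns false, then `ring_missed_events()`, and only then block. The point of a counter over a single bit is that an overflow occurring during the slow path still changes the value, so it is noticed on the next check.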
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 08:12:04AM -0500, Chase Venters ([EMAIL PROTECTED]) wrote: Regarding mukevent I'm thinking of a event-type specific struct, that is filled by the originating code, and placed into a per-event-type ring buffer (which requires modification of kevent_wait). I'd personally worry about an implementation that used a per-event-type ring buffer, because you're still left having to hack around starvation issues in user-space. It is of course possible under the current model for anyone who wants per-event-type ring buffers to have them - just make separate kevent sets. I haven't thought this through all the way yet, but why not have variable length event structures and have the kernel fill in a next pointer in each one? This could even be used to keep backwards binary compatibility while Why do we want variable size structures in mmap ring buffer? Flexibility primarily. So when we all decide to add a new event type six months from now, or add more information to an existing one, we don't run the risk that the existing mukevent isn't big enough. Do we need such flexibility, when we have unique id attached to each event? User can store any information in own buffers, which are indexed by that id. adding additional fields to the structures over time, though no space would be wasted on modern programs. You still end up with a question of what to do in case of overflow, but I'm thinking the thing to do in that case might be to start pushing overflow events onto a linked list which can be written back into the ring buffer when space becomes available. The appropriate behavior would be to throw new events on the linked list if the linked list had any events, so that things are delivered in order, but write to the mapped buffer directly otherwise. I think in a similar way. Kevent actually do not require such list, since it has already queue of the ready events. The current event types coalesce if there are multiple events, correct? 
It sounds like there may be other event types where coalescing multiple events is not the correct approach.

There is no event coalescing; I think it is even incorrect to say that something is being coalesced in kevents. There is a 'new' (which is the well-forgotten old) approach - the user _asks_ the kernel about some information, and the kernel says when it is ready. The kernel does not say: part of the info is ready, part of the info is ready and so on; it just marks the user's request as ready - that means there could have been zillions of events, each of which could mark the _same_ userspace request as ready, and exactly what the user requested is transferred back. Thus it is very fast and is the correct way to deal with the problem of pipes of different diameters. The kernel does not generate events - only the user creates requests, which are marked as ready.

I made that decision to remove _any_ kind of possible overflow from the kernel side - if the user was scheduled away, or has insufficient space or a bad mood - and to not introduce any kind of ugly priorities (a higher one could fill the whole pipe while a lower one could not even send a single event). Instead the kernel does just what it was requested to do, and it can provide some hints on how that process happened (for example how many sockets are ready for accept(), or how many bytes are in the receiving queue).

And that approach does solve the problem of the cases where it looks logical to _generate_ an event - for example the inotify case, where a new event is _generated_ each time the requested condition happens. For example, when new files are created in a directory, a queue overflow is possible if many files were created (btw, a watch for each file in the kernel tree takes about 2gb of kernel mem), so userspace must rescan the whole directory to check for missed files. So why is it needed at all to generate info about the first two or ten files? Instead, userspace asks the kernel to notify it when the directory has changed or some new files were created, and kernelspace will answer when the directory has changed or new files were created (with a hint giving their number).

Likely the request for generation of events in the kernel is a workaround for some other problems, which in the long term will hit us with new troubles - queue length and overflows.

Thanks, Chase

--
Evgeniy Polyakov
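The "requests are marked ready, events are not generated" model from this message can be illustrated with a small C sketch. It is an assumption-laden toy: `struct request`, `request_mark_ready()` and `request_collect()` are hypothetical names, not kevent structures or calls. It only shows why the kernel-side state stays bounded no matter how often the condition fires.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace request; field names are illustrative only. */
struct request {
    uint32_t id;     /* unique id chosen by the user */
    bool     ready;  /* set once, however many times the condition fires */
    uint32_t hint;   /* e.g. bytes queued, or sockets waiting for accept() */
};

/* Called for every occurrence of the watched condition.
 * Returns true only when the request would have to be linked into the
 * ready queue, i.e. on the first occurrence since the last collection.
 * Later occurrences only update the hint; nothing new is queued. */
static bool request_mark_ready(struct request *req, uint32_t amount)
{
    bool newly_ready = !req->ready;

    req->ready = true;
    req->hint += amount;
    return newly_ready;
}

/* Called when userspace picks the request up: hands back the accumulated
 * hint and re-arms the request. */
static uint32_t request_collect(struct request *req)
{
    uint32_t hint = req->hint;

    req->ready = false;
    req->hint = 0;
    return hint;
}
```

Three arrivals of 100, 50 and 1 bytes leave one ready request with a hint of 151 rather than three queued events. This is also the limitation raised elsewhere in the thread: per-occurrence data (as with realtime signals) cannot be carried by a single accumulated hint.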
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 03:19:36PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote: I can add such notification, but its existense _is_ the broken design. After such condition happend, all new events will dissapear (although they are still accessible through usual queue) from mapped buffer. While writing this I have come to the idea on how to imrove the case of the size of mapped buffer - we can make it with limited size, and when it is full, some bit will be set in the shared area and obviously no new events can be added there, but when user commits some events from that buffer (i.e. says to kernel that appropriate kevents can be freed or requeued according to theirs flags), new ready events from ready queue can be copied into mapped buffer. It still does not solve (and I do insist that it is broken behaviour) the case when kernel is going to generate infinite number of events for one requested by userspace (as in case of generating new 'data_has_arrived' event when new byte has been received). Behavior is not broken. It's quite usefull and works 99.% of time. I was trying to suggest you but you missed my point. You dont want to use a bit, but a full sequence counter, 32bits. A program may handle XXX.XXX handles, but use a 4096 entries ring buffer 'only'. The user program keeps a local copy of a special word named 'ring_buffer_full_counter' Each time the kernel cannot queue an event in the ring buffer, it increase the ring_buffer_was_full_counter (exported to user app in the mmap view) When the user application notice the kernel changed ring_buffer_was_full_counter it does a full scan of all file handles (preferably using poll() to get all relevant info in one syscall) : I.e. to scan the rest of the xxx.xxx events? 
do { if (read_event_from_mmap()) {handle_event(fd); continue;} /* ring buffer is empty, check if we missed some events */ if (unlikely(mmap-ring_buffer_full_counter != my_ring_buffer_full_counter)) { my_ring_buffer_full_counter = mmap-ring_buffer_full_counter; /* slow PATH */ /* can use a big poll() for example, or just a loop without poll() */ for_all_file_desc_do() { check if some event/data is waiting on THIS fd } /* } else syscall_wait_for_one_available_kevent(queue) } This is how a program can recover. If ring buffer has a reasonable size, this kind of event should not happen very frequently. If it does (because events continue to fill ring_buffer during recovery and might hit FULL again), maybe a smart program is able to resize the ring_buffer, and start using it after yet another recovery pass. If not, we dont care, because a big poll() give us many ready file-descriptors in one syscall, and maybe this is much better than kevent/epoll when XX.XXX events are ready. What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer? Eric -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 15:42, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 03:19:36PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 12:39, Evgeniy Polyakov wrote: I can add such notification, but its existense _is_ the broken design. After such condition happend, all new events will dissapear (although they are still accessible through usual queue) from mapped buffer. While writing this I have come to the idea on how to imrove the case of the size of mapped buffer - we can make it with limited size, and when it is full, some bit will be set in the shared area and obviously no new events can be added there, but when user commits some events from that buffer (i.e. says to kernel that appropriate kevents can be freed or requeued according to theirs flags), new ready events from ready queue can be copied into mapped buffer. It still does not solve (and I do insist that it is broken behaviour) the case when kernel is going to generate infinite number of events for one requested by userspace (as in case of generating new 'data_has_arrived' event when new byte has been received). Behavior is not broken. It's quite usefull and works 99.% of time. I was trying to suggest you but you missed my point. You dont want to use a bit, but a full sequence counter, 32bits. A program may handle XXX.XXX handles, but use a 4096 entries ring buffer 'only'. The user program keeps a local copy of a special word named 'ring_buffer_full_counter' Each time the kernel cannot queue an event in the ring buffer, it increase the ring_buffer_was_full_counter (exported to user app in the mmap view) When the user application notice the kernel changed ring_buffer_was_full_counter it does a full scan of all file handles (preferably using poll() to get all relevant info in one syscall) : I.e. to scan the rest of the xxx.xxx events? 
do { if (read_event_from_mmap()) {handle_event(fd); continue;} /* ring buffer is empty, check if we missed some events */ if (unlikely(mmap-ring_buffer_full_counter != my_ring_buffer_full_counter)) { my_ring_buffer_full_counter = mmap-ring_buffer_full_counter; /* slow PATH */ /* can use a big poll() for example, or just a loop without poll() */ for_all_file_desc_do() { check if some event/data is waiting on THIS fd } /* } else syscall_wait_for_one_available_kevent(queue) } This is how a program can recover. If ring buffer has a reasonable size, this kind of event should not happen very frequently. If it does (because events continue to fill ring_buffer during recovery and might hit FULL again), maybe a smart program is able to resize the ring_buffer, and start using it after yet another recovery pass. If not, we dont care, because a big poll() give us many ready file-descriptors in one syscall, and maybe this is much better than kevent/epoll when XX.XXX events are ready. What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer? Then, user might receive 'false events', exactly like poll()/select()/epoll() can do sometime. IE a 'ready' indication while there is no current event available on a particular fd / event_source. This should be safe, since those programs already ignore read() returns -EAGAIN and other similar things. Programmer prefers to receive two 'event available' indications than ZERO (and be stuck for infinite time). Of course, hot path (normal cases) should return one 'event' only. In order words, being ultra fast 99.99 % of the time, but being able to block forever once in a while is not an option. 
Eric
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote:

What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer?

Then, user might receive 'false events', exactly like poll()/select()/epoll() can do sometimes, i.e. a 'ready' indication while there is no current event available on a particular fd / event_source.

Only if the user simultaneously uses both interfaces and removes an event from the queue while its copy is in the mapped buffer, but in that case it's the user's problem (and if we do want, we can store the pointer/index of the ring buffer entry, so that when an event is removed from the ready queue (using kevent_get_events()), the appropriate entry in the ring buffer will be updated to show that it is no longer valid).

This should be safe, since those programs already ignore read() returning -EAGAIN and other similar things. Programmer prefers to receive two 'event available' indications than ZERO (and be stuck for infinite time). Of course, hot path (normal cases) should return one 'event' only. In other words, being ultra fast 99.99 % of the time, but being able to block forever once in a while is not an option.

Have I missed something? It looks like the only problematic situation is the one described above, when the user simultaneously uses both interfaces.

Eric

--
Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer? Then, user might receive 'false events', exactly like poll()/select()/epoll() can do sometime. IE a 'ready' indication while there is no current event available on a particular fd / event_source. Only if user simultaneously uses oth interfaces and remove even from the queue when it's copy was in mapped buffer, but in that case it's user's problem (and if we do want, we can store pointer/index of the ring buffer entry, so when event is removed from the ready queue (using kevent_get_events()), appropriate entry in the ring buffer will be updated to show that it is no longer valid. This should be safe, since those programs already ignore read() returns -EAGAIN and other similar things. Programmer prefers to receive two 'event available' indications than ZERO (and be stuck for infinite time). Of course, hot path (normal cases) should return one 'event' only. In order words, being ultra fast 99.99 % of the time, but being able to block forever once in a while is not an option. Have I missed something? It looks like the only problematic situation is described above when user simultaneously uses both interfaces. In my point of view, user of the 'mmaped ring buffer' should be prepared to use both interfaces. Or else you are forced to presize the ring buffer to insane limits. That is : - Most of the time, we expect consuming events via mmaped ring buffer and no syscalls. - In case we notice a 'mmaped ring buffer overflow', syscalls to get/consume events that could not be stored in mmaped buffer (but queued by kevent subsystem). 
If not stored by the kevent subsystem (memory failure?), revert to poll() to fetch all 'missed fds' in one go. Go back to normal mode.
- In case of an empty ring buffer (or no mmap support at all, because this app doesn't expect a lot of events per time unit, or because kevent doesn't have mmap support): be able to syscall and wait for an event.

Eric
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 04:25:00PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer? Then, user might receive 'false events', exactly like poll()/select()/epoll() can do sometime. IE a 'ready' indication while there is no current event available on a particular fd / event_source. Only if user simultaneously uses oth interfaces and remove even from the queue when it's copy was in mapped buffer, but in that case it's user's problem (and if we do want, we can store pointer/index of the ring buffer entry, so when event is removed from the ready queue (using kevent_get_events()), appropriate entry in the ring buffer will be updated to show that it is no longer valid. This should be safe, since those programs already ignore read() returns -EAGAIN and other similar things. Programmer prefers to receive two 'event available' indications than ZERO (and be stuck for infinite time). Of course, hot path (normal cases) should return one 'event' only. In order words, being ultra fast 99.99 % of the time, but being able to block forever once in a while is not an option. Have I missed something? It looks like the only problematic situation is described above when user simultaneously uses both interfaces. In my point of view, user of the 'mmaped ring buffer' should be prepared to use both interfaces. Or else you are forced to presize the ring buffer to insane limits. That is : - Most of the time, we expect consuming events via mmaped ring buffer and no syscalls. 
- In case we notice a 'mmaped ring buffer overflow', syscalls to get/consume events that could not be stored in mmaped buffer (but queued by kevent subsystem). If not stored by kevent subsystem (memory failure ?), revert to poll() to fetch all 'missed fds' in one row. Go back to normal mode. kevent uses smaller amount of memory than epoll() per event, so it is very unlikely that it will be impossible to store new event there and epoll() will succeed. The same can be applied to poll(), which allocates the whole table in syscall. - In case of empty ring buffer (or no mmap support at all, because this app doesnt expect lot of events per time unit, or because kevent dont have mmap support) : Be able to syscall and wait for an event. So the most complex case is when user is going to use both interfaces, and it's steps when mapped ring buffer has overflow. In that case user can either read and mark some events as ready in ring buffer (the latter is being done through special syscall), so kevent core will put there new ready events. User can also get events using usual syscall, in that case events in ring buffer must be updated - and actually I implemented mapped buffer in the way which allows to remove events from the queue - queue is a FIFO, and the first entry to be obtained through syscall is _always_ the first entry in the ring buffer. So when user reads event through syscall (no matter if we are in overflow case or not), even being read is easily accessible in the ring buffer. So I propose following design for ring buffer (quite simple): kernelspace maintains two indexes - to the first and the last events in the ring buffer (and maximum size of the buffer of course). When new event is marked as ready, some info is being copied into ring buffer and index of the last entry is increased. 
When an event is read through the syscall, it is _guaranteed_ that the event is at the position pointed to by the index of the first element; that index is then increased (thus opening a new slot in the buffer). If the index of the last entry reaches (with possible wrapping) the index of the first entry, overflow has happened. In this case no new events can be copied into the ring buffer, so they are only placed into the ready queue (accessible through the kevent_get_events() syscall). When the user calls kevent_get_events(), it obtains the first element (pointed to by the index of the first element in the ring buffer), and if there is a ready event that has not been placed into the ring buffer, it is copied there (with the appropriate update of the last index and a new overflow check). When userspace calls kevent_wait(num), it marks the first $num elements (counting from the index of the first element) as ready, so they can be removed (or requeued) and replaced by pending ready events. Does it sound like clawing over the glass or much better? Eric
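The two-index design described above can be sketched in plain user-space C. This is only an illustration under stated assumptions, not the kevent implementation: every name here (sketch_ring, sketch_mark_ready, sketch_get_event, sketch_commit) is hypothetical, the extra `used` counter is added to tell a full ring from an empty one, and the overflow list holds only the events that did not fit in the ring, whereas the proposal keeps them in the kevent ready queue.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE   4    /* hypothetical: tiny so overflow is easy to trigger */
#define PENDING_MAX 16   /* hypothetical bound on the overflow list */

struct sketch_event { int id; };

struct sketch_ring {
    struct sketch_event slot[RING_SIZE];
    unsigned first;      /* index of the oldest event in the ring */
    unsigned last;       /* index where the next event will be written */
    unsigned used;       /* occupied slots; used == RING_SIZE means overflow */
    struct sketch_event pending[PENDING_MAX];  /* ready events not in the ring */
    unsigned pend_head, pend_count;
};

static int pend_push(struct sketch_ring *r, struct sketch_event ev)
{
    if (r->pend_count == PENDING_MAX)
        return -1;
    r->pending[(r->pend_head + r->pend_count) % PENDING_MAX] = ev;
    r->pend_count++;
    return 0;
}

static void ring_put(struct sketch_ring *r, struct sketch_event ev)
{
    r->slot[r->last] = ev;
    r->last = (r->last + 1) % RING_SIZE;
    r->used++;
}

/* Move pending events into the ring, oldest first, while slots are free. */
static void ring_refill(struct sketch_ring *r)
{
    while (r->used < RING_SIZE && r->pend_count > 0) {
        ring_put(r, r->pending[r->pend_head]);
        r->pend_head = (r->pend_head + 1) % PENDING_MAX;
        r->pend_count--;
    }
}

/* An event became ready: write it to the ring, or queue it on overflow.
 * Anything already pending forces queuing, so delivery order is kept. */
int sketch_mark_ready(struct sketch_ring *r, struct sketch_event ev)
{
    if (r->pend_count > 0 || r->used == RING_SIZE)
        return pend_push(r, ev);
    ring_put(r, ev);
    return 0;
}

/* Syscall-style read: always returns the entry at `first`, then refills. */
int sketch_get_event(struct sketch_ring *r, struct sketch_event *out)
{
    if (r->used == 0)
        return -1;
    *out = r->slot[r->first];
    r->first = (r->first + 1) % RING_SIZE;
    r->used--;
    ring_refill(r);
    return 0;
}

/* Commit of `num` consumed entries, in the spirit of kevent_wait(num). */
unsigned sketch_commit(struct sketch_ring *r, unsigned num)
{
    if (num > r->used)
        num = r->used;
    r->first = (r->first + num) % RING_SIZE;
    r->used -= num;
    ring_refill(r);
    return num;
}
```

In this sketch both the syscall-style read and the commit advance the same `first` index, which is what keeps the two interfaces consistent with each other.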
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 17:09, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 04:25:00PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer? Then, user might receive 'false events', exactly like poll()/select()/epoll() can do sometime. IE a 'ready' indication while there is no current event available on a particular fd / event_source. Only if user simultaneously uses oth interfaces and remove even from the queue when it's copy was in mapped buffer, but in that case it's user's problem (and if we do want, we can store pointer/index of the ring buffer entry, so when event is removed from the ready queue (using kevent_get_events()), appropriate entry in the ring buffer will be updated to show that it is no longer valid. This should be safe, since those programs already ignore read() returns -EAGAIN and other similar things. Programmer prefers to receive two 'event available' indications than ZERO (and be stuck for infinite time). Of course, hot path (normal cases) should return one 'event' only. In order words, being ultra fast 99.99 % of the time, but being able to block forever once in a while is not an option. Have I missed something? It looks like the only problematic situation is described above when user simultaneously uses both interfaces. In my point of view, user of the 'mmaped ring buffer' should be prepared to use both interfaces. Or else you are forced to presize the ring buffer to insane limits. That is : - Most of the time, we expect consuming events via mmaped ring buffer and no syscalls. 
- In case we notice a 'mmaped ring buffer overflow', syscalls to get/consume events that could not be stored in mmaped buffer (but queued by kevent subsystem). If not stored by kevent subsystem (memory failure ?), revert to poll() to fetch all 'missed fds' in one row. Go back to normal mode. kevent uses smaller amount of memory than epoll() per event, so it is very unlikely that it will be impossible to store new event there and epoll() will succeed. The same can be applied to poll(), which allocates the whole table in syscall. - In case of empty ring buffer (or no mmap support at all, because this app doesnt expect lot of events per time unit, or because kevent dont have mmap support) : Be able to syscall and wait for an event. So the most complex case is when user is going to use both interfaces, and it's steps when mapped ring buffer has overflow. In that case user can either read and mark some events as ready in ring buffer (the latter is being done through special syscall), so kevent core will put there new ready events. User can also get events using usual syscall, in that case events in ring buffer must be updated - and actually I implemented mapped buffer in the way which allows to remove events from the queue - queue is a FIFO, and the first entry to be obtained through syscall is _always_ the first entry in the ring buffer. So when user reads event through syscall (no matter if we are in overflow case or not), even being read is easily accessible in the ring buffer. So I propose following design for ring buffer (quite simple): kernelspace maintains two indexes - to the first and the last events in the ring buffer (and maximum size of the buffer of course). When new event is marked as ready, some info is being copied into ring buffer and index of the last entry is increased. 
When event is being read through syscall it is _guaranteed_ that that event will be at the position pointed by the index of the first element, that index is then increased (thus opening new slot in the buffer). If index of the last entry reaches (with possible wrapping) index of the first entry, that means that overflow has happend. In this case no new events can be copied into ring buffer, so they are only placed into ready queue (accessible through syscall kevent_get_events()). When user calls kevent_get_events() it will obtain the first element (pointed by index of the first element in the ring buffer), and if there is ready event, which is not placed into the ring buffer, it is copied (with appropriate update of the last index and new overflow condition). Well, I'm not sure its good to do this 'move one event from ready list to slot X', one by one, because this event will likely be flushed out of processor cache (because we will have to consume 4096 events before reaching this one). I think its better to batch
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 16:25, Eric Dumazet wrote: On Tuesday 17 October 2006 16:07, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 03:52:34PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: What about the case, which I described in other e-mail, when in case of the full ring buffer, no new events are written there, and when userspace commits (i.e. marks as ready to be freed or requeued by kernel) some events, new ones will be copied from ready queue into the buffer? Then, user might receive 'false events', exactly like poll()/select()/epoll() can do sometime. IE a 'ready' indication while there is no current event available on a particular fd / event_source. Only if user simultaneously uses oth interfaces and remove even from the queue when it's copy was in mapped buffer, but in that case it's user's problem (and if we do want, we can store pointer/index of the ring buffer entry, so when event is removed from the ready queue (using kevent_get_events()), appropriate entry in the ring buffer will be updated to show that it is no longer valid. This should be safe, since those programs already ignore read() returns -EAGAIN and other similar things. Programmer prefers to receive two 'event available' indications than ZERO (and be stuck for infinite time). Of course, hot path (normal cases) should return one 'event' only. In order words, being ultra fast 99.99 % of the time, but being able to block forever once in a while is not an option. Have I missed something? It looks like the only problematic situation is described above when user simultaneously uses both interfaces. In my point of view, user of the 'mmaped ring buffer' should be prepared to use both interfaces. Or else you are forced to presize the ring buffer to insane limits. I don't see why overflow couldn't be handle by a syscall telling the kernel that the buffer is ready for new events. 
As mentioned, most of the time overflow should not happen, and if it does, the syscall cost should be amortized nicely over the number of events. That is: - Most of the time, we expect to consume events via the mmaped ring buffer with no syscalls. - In case we notice a 'mmaped ring buffer overflow', use syscalls to get/consume events that could not be stored in the mmaped buffer (but were queued by the kevent subsystem). If they were not stored by the kevent subsystem (memory failure?), revert to poll() to fetch all 'missed fds' in one pass, then go back to normal mode. - In case of an empty ring buffer (or no mmap support at all, because this app doesn't expect a lot of events per time unit, or because kevent doesn't have mmap support): be able to make a syscall and wait for an event. As I see it there are two main problems with a mmapped ring buffer (correct me if I'm wrong): 1. Overflow. 2. Handling multiple kernel events that need only one user event, i.e. multiple packets arriving at the same socket. The user should see only one IN event at the time he is ready to handle it. In an earlier post I suggested a scheme that solves these issues. It was based on the assumption that kernel and user-space share index variables and can read/update them atomically without much overhead. A system call would be required only when the buffer is empty or full. Hans Henrik Happe
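The second problem (several kernel-side arrivals on one socket that should surface as a single user-visible IN event) can be illustrated with a per-source flag. This is a hypothetical user-space sketch, not code from any posted patch; the names `source`, `source_signal` and `source_consume` are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* One event source, e.g. a socket. `queued` records whether an IN event
 * for this source is already waiting to be consumed by the user. */
struct source {
    int  fd;
    bool queued;
};

/* Called for every arrival. Returns true only when a new user-visible
 * event must be generated; further arrivals are coalesced into it. */
bool source_signal(struct source *s)
{
    if (s->queued)
        return false;   /* an event is already outstanding */
    s->queued = true;
    return true;
}

/* Called when the user has handled the event; the next arrival will
 * generate a fresh event. */
void source_consume(struct source *s)
{
    s->queued = false;
}
```

This kind of coalescing suits level-style readiness (data is available), but it would drop information for sources where each occurrence carries its own payload.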
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 05:32:28PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: So the most complex case is when user is going to use both interfaces, and it's steps when mapped ring buffer has overflow. In that case user can either read and mark some events as ready in ring buffer (the latter is being done through special syscall), so kevent core will put there new ready events. User can also get events using usual syscall, in that case events in ring buffer must be updated - and actually I implemented mapped buffer in the way which allows to remove events from the queue - queue is a FIFO, and the first entry to be obtained through syscall is _always_ the first entry in the ring buffer. So when user reads event through syscall (no matter if we are in overflow case or not), even being read is easily accessible in the ring buffer. So I propose following design for ring buffer (quite simple): kernelspace maintains two indexes - to the first and the last events in the ring buffer (and maximum size of the buffer of course). When new event is marked as ready, some info is being copied into ring buffer and index of the last entry is increased. When event is being read through syscall it is _guaranteed_ that that event will be at the position pointed by the index of the first element, that index is then increased (thus opening new slot in the buffer). If index of the last entry reaches (with possible wrapping) index of the first entry, that means that overflow has happend. In this case no new events can be copied into ring buffer, so they are only placed into ready queue (accessible through syscall kevent_get_events()). When user calls kevent_get_events() it will obtain the first element (pointed by index of the first element in the ring buffer), and if there is ready event, which is not placed into the ring buffer, it is copied (with appropriate update of the last index and new overflow condition). 
Well, I'm not sure it's good to do this 'move one event from ready list to slot X' one by one, because this event will likely be flushed out of the processor cache (because we will have to consume 4096 events before reaching this one). I think it's better to batch this kind of 'push XX events' later, XX being small enough not to waste CPU cache, and when the ring buffer is empty again. Ok, that's possible. The mmap buffer is good for latency and minimal synchronization between the user thread and the kernel producer. But once we hit an 'overflow', it is better to revert to a mode feeding XX events per syscall, to be sure it fits in CPU caches: the user thread will do the copy from kernel memory to user memory, and this thread will shortly use those events in user land. User can do both - either get events through syscall, or get them from mapped ring buffer when it is refilled. BTW, maintaining coherency on the mmap buffer is expensive: once an event is copied to the mmap buffer, the kernel has to issue a smp_mb() before updating the index, so that a user thread won't start to consume an event with random values because its CPU sees the update to the index before the updates to the data. There will be some tricks with barriers indeed. Once all the queue is flushed in an efficient way, we can switch to mmap mode again. Eric Ok, there is one apologist for mmap buffer implementation, who forced me to create first implementation, which was dropped due to absence of remote mental reading abilities. Ulrich, does above approach sound good for you? I actually do not want to reimplement something, that will be pointed to with words 'no matter what you say, it is broken and I do not want it' again :). -- Evgeniy Polyakov
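The ordering requirement Eric describes (data must be visible before the index that publishes it) can be shown with a user-space analogue. This sketch uses C11 release/acquire atomics rather than the kernel's smp_mb(), and the names `shared_ring`, `produce` and `consume` are hypothetical. It has a single producer and a single consumer, and it omits the overflow check on purpose to keep the barrier pairing visible.

```c
#include <assert.h>
#include <stdatomic.h>

#define SLOTS 8   /* hypothetical ring size */

struct shared_ring {
    int         data[SLOTS];
    atomic_uint last;        /* publish index, written only by the producer */
};

/* Producer: write the payload first, then publish the new index with
 * release ordering so the payload store cannot be reordered after it. */
void produce(struct shared_ring *r, int value)
{
    unsigned idx = atomic_load_explicit(&r->last, memory_order_relaxed);
    r->data[idx % SLOTS] = value;
    atomic_store_explicit(&r->last, idx + 1, memory_order_release);
}

/* Consumer: read the index with acquire ordering, so any slot below it
 * is guaranteed to hold fully written data. Returns 1 if an event was
 * consumed, 0 if the ring is empty. `first` is the consumer's own index. */
int consume(struct shared_ring *r, unsigned *first, int *out)
{
    unsigned last = atomic_load_explicit(&r->last, memory_order_acquire);
    if (*first == last)
        return 0;
    *out = r->data[*first % SLOTS];
    (*first)++;
    return 1;
}
```

Without the release/acquire pair the consumer could observe the new index while still reading stale slot contents, which is exactly the 'event with random values' case mentioned above.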
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote: Ok, there is one apologist for mmap buffer implementation, who forced me to create first implementation, which was dropped due to absence of remote mental reading abilities. Ulrich, does above approach sound good for you? I actually do not want to reimplement something, that will be pointed to with words 'no matter what you say, it is broken and I do not want it' again :). In my humble opinion, you should first write a 'real application', to show how the mmap buffer and kevent syscalls would be used (fast path and slow/recovery paths). I am sure it would be easier for everybody to agree on the API *before* you start coding a *lot* of hard (kernel) stuff: it would certainly save your mental CPU cycles (and ours too :) ). This 'real application' could be the event loop of a simple HTTP server, or a basic 'echo all' server. Adding the bits about timer events and signals should be done too. Eric
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote: Ok, there is one apologist for mmap buffer implementation, who forced me to create first implementation, which was dropped due to absence of remote mental reading abilities. Ulrich, does above approach sound good for you? I actually do not want to reimplement something, that will be pointed to with words 'no matter what you say, it is broken and I do not want it' again :). In my humble opinion, you should first write a 'real application', to show how the mmap buffer and kevent syscalls would be used (fast path and slow/recovery paths). I am sure it would be easier for everybody to agree on the API *before* you start coding a *lot* of hard (kernel) stuff: it would certainly save your mental CPU cycles (and ours too :) ). This 'real application' could be the event loop of a simple HTTP server, or a basic 'echo all' server. Adding the bits about timer events and signals should be done too. I wrote one with the previous ring buffer implementation - it used timers and echoed when they fired; it was even described in detail in one of the lwn.net articles. I'm not going to waste others and my time implementing feature requests without at least _some_ feedback from those who asked them. In case when person, originally requested some feature, does not answer and there are other opinions, only they will be get into account of course. Eric -- Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
On Tuesday 17 October 2006 18:35, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote: Ok, there is one apologist for mmap buffer implementation, who forced me to create first implementation, which was dropped due to absence of remote mental reading abilities. Ulrich, does above approach sound good for you? I actually do not want to reimplement something, that will be pointed to with words 'no matter what you say, it is broken and I do not want it' again :). In my humble opinion, you should first write a 'real application', to show how the mmap buffer and kevent syscalls would be used (fast path and slow/recovery paths). I am sure it would be easier for everybody to agree on the API *before* you start coding a *lot* of hard (kernel) stuff: it would certainly save your mental CPU cycles (and ours too :) ). This 'real application' could be the event loop of a simple HTTP server, or a basic 'echo all' server. Adding the bits about timer events and signals should be done too. I wrote one with the previous ring buffer implementation - it used timers and echoed when they fired; it was even described in detail in one of the lwn.net articles. I'm not going to waste others and my time implementing feature requests without at least _some_ feedback from those who asked them. In case when person, originally requested some feature, does not answer and there are other opinions, only they will be get into account of course. I am not sure I understand what you wrote; English is not our native language. I think many people gave you feedback. I feel that all feedback on this mailing list is constructive. Many posts/patches on this list are never commented on at all. Eric
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 17, 2006 at 06:45:54PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 18:35, Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 06:26:04PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Tuesday 17 October 2006 18:01, Evgeniy Polyakov wrote: Ok, there is one apologist for mmap buffer implementation, who forced me to create first implementation, which was dropped due to absense of remote mental reading abilities. Ulrich, does above approach sound good for you? I actually do not want to reimplement something, that will be pointed to with words 'no matter what you say, it is broken and I do not want it' again :). In my humble opinion, you should first write a 'real application', to show how the mmap buffer and kevent syscalls would be used (fast path and slow/recovery paths). I am sure it would be easier for everybody to agree on the API *before* you start coding a *lot* of hard (kernel) stuff : It would certainly save your mental CPU cycles (and ours too :) ) This 'real application' could be the event loop of a simple HTTP server, or a basic 'echo all' server. Adding the bits about timers events and signals should be done too. I wrote one with previous ring buffer implementation - it used timers and echoed when they fired, it was even described in details in one of the lwn.net articles. I'm not going to waste others and my time implementing feature requests without at least _some_ feedback from those who asked them. In case when person, originally requested some feature, does not answer and there are other opinions, only they will be get into account of course. I am not sure I understand what you wrote, English is not our native language. I think many people gave you feedbacks. I feel that all feedback on this mailing list is constructive. Many posts/patches on this list are never commented at all. And I do greatly appreciate feedback from those people! 
But I do not understand why I never got feedback on the initial design and implementation (after which I created, as far as I recall, at least 10 releases) from Ulrich, who first asked for such a feature. So right now I'm waiting for his opinion on that problem, even if it will be 'it sucks' again; at least in that case I will not waste people's time. Ulrich, could you please comment on the design notes sent a couple of mails above? Eric -- Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
Evgeniy Polyakov wrote: On Tue, Oct 17, 2006 at 06:45:54PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: I am not sure I understand what you wrote; English is not our native language. I think many people gave you feedback. I feel that all feedback on this mailing list is constructive. Many posts/patches on this list are never commented on at all. And I do greatly appreciate feedback from those people! But I do not understand why I never got feedback on the initial design and implementation (after which I created, as far as I recall, at least 10 releases) from Ulrich, who first asked for such a feature. So right now I'm waiting for his opinion on that problem, even if it will be 'it sucks' again; at least in that case I will not waste people's time. Ulrich, could you please comment on the design notes sent a couple of mails above? Ulrich is a very busy man. We have to live with that. <rant_mode> For example, I *complained* one day that each glibc fopen()/fread()/fclose() pass does a mmap()/munmap() to obtain a single 4KB of memory, without any cache mechanism. This badly hurts performance of multi-threaded programs, as we know mmap()/munmap() has to down_write(&mm->mmap_sem); and play VM games. So to avoid this, I manually call setvbuf() in my own programs, to provide a suitable buffer to glibc, because of its suboptimal default allocation, a vestige of an old epoch... </rant_mode> Eric
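The setvbuf() workaround mentioned in the rant looks roughly like this. It is a minimal sketch using only standard C calls; the helper name `open_with_own_buffer` is hypothetical, and whether supplying a buffer actually avoids the mmap()/munmap() pair depends on the C library version in use.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Open a stream and hand stdio a caller-owned buffer, so the library
 * does not need to allocate one itself. The buffer must stay valid
 * until the stream is closed, and setvbuf() must be called before any
 * other operation on the stream. */
FILE *open_with_own_buffer(const char *path, const char *mode,
                           char *buf, size_t size)
{
    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return NULL;
    if (setvbuf(fp, buf, _IOFBF, size) != 0) {  /* fully buffered */
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

In a multi-threaded program each thread would typically pass its own buffer (for example a per-thread static array), since a stdio buffer must not be shared between open streams.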
Re: [take19 1/4] kevent: Core files.
On Sun, Oct 15, 2006 at 04:22:45PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote: Evgeniy Polyakov wrote: Existing design does not allow overflow. And I've pointed out a number of times that this is not practical at best. There are event sources which can create events which cannot be coalesced into one single event as it would be required with your design. Signals are one example, specifically realtime signals. If we do not want the design to be limited from the start this approach has to be thought over. The whole idea of mmap buffer seems to be broken, since those who asked for creation do not like existing design and do not show theirs... According to signals and possibility to overflow in existing ring buffer implementation. You seems to not checked the code - each event can be marked as ready only one time, which means only one copy and so on. It was done _specially_. And it is not limitation, but new approach. Queue of the same signals or any other events has fundamental flawness (as any other ring buffer implementation, which has queue size) - it's size of the queue and extremely bad case of the overflow. So, the same event may not be ready several times. Any design which allows to create infinite number of events generated for the same case is broken, since consumer can be in situation, when it can not handle that flow. That is why poll() returns only POLLIN when data is ready in network stack, but is not trying to generate some kind of a signal for each byte/packet/MTU/MSS received. RT signals have design problems, and I will not repeate the same error with similar limits in kevent. So zap mmap() support completely, since it is not usable at all. We wont discuss on it. Initial implementation did not have it. But I was requested to do it, and it is ready now. No one likes it, but no one provides an alternative implementation. We are stuck. We need the mapped ring buffer. 
The current design (before it was removed) was broken but this does not mean it shouldn't be implemented. We just need more time to figure out how to implement it correctly. In the latest patchset it was removed. I'm waiting for your code. Mmap implementation can be added separately, since it does not affect kevent core. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ -- Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
Evgeniy Polyakov wrote: The whole idea of mmap buffer seems to be broken, since those who asked for creation do not like existing design and do not show theirs... What kind of argumentation is that? Because my attempt to implement it doesn't work and nobody right away has a better suggestion this means the idea is broken. Nonsense. It just means that time should be spend on thinking about this. You cut all this short by rushing out your attempt without any discussions. Unfortunately nobody else really looked at the approach so it lingered around for some weeks. Well, now it is clear that it is not the right approach and we can start thinking about it again. You seems to not checked the code - each event can be marked as ready only one time, which means only one copy and so on. It was done _specially_. And it is not limitation, but new approach. I know that it is done deliberately and I tell you that this is wrong and unacceptable. Realtime signals are one event which need to have more than one event queued. This is no description of what you have implemented, it's a description of the reality of realtime signals. RT signals are queued. They carry a data value (the sigval_t object) which can be unique for each signal delivery. Coalescing the signal events therefore leads to information loss. Therefore, at the very least for signal we need to have the ability to queue more than one event for each event source. Not having this functionality means that signals and likely other types of events cannot be implemented using kevent queues. Queue of the same signals or any other events has fundamental flawness (as any other ring buffer implementation, which has queue size) - it's size of the queue and extremely bad case of the overflow. Of course there are additional problems. Overflows need to be handled. But this is nothing which is unsolvable. So, the same event may not be ready several times. 
Any design which allows to create infinite number of events generated for the same case is broken, since consumer can be in situation, when it can not handle that flow. That's complete nonsense. Again, for RT signals it is very reasonable and not broken to have multiple outstanding signals. That is why poll() returns only POLLIN when data is ready in network stack, but is not trying to generate some kind of a signal for each byte/packet/MTU/MSS received. It makes no sense to drag poll() into this discussion. poll() is a very limited interface. The new event handling is supposed to be the opposite, namely, usable for all kinds of events. Arguing that because poll() does it like this just means you don't see what big step is needed to get to the goal of a unified event handling. The shackles of poll() must be left behind. RT signals have design problems, and I will not repeate the same error with similar limits in kevent. I don't know what to say. You claim to be the source of all wisdom is OS design. Maybe you should design your own OS, from ground up. I wonder how many people would like that since all your arguments are squarely geared towards optimizing the implementation. But: the implementation is irrelevant without users. The functionality users (= programmers) want and need is what must drive the implementation. And RT signals are definitely heavily used and liked by programmers. You have to accept that you try to modify an OS which has that functionality regardless of how much you hate it and want to fight it. Mmap implementation can be added separately, since it does not affect kevent core. That I doubt very much and it is why I would not want the kevent stuff go into any released kernel until that detail is resolved. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. 
➧ 444 Castro St ➧ Mountain View, CA ❖
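Ulrich's point that realtime signals are queued, and that each delivery carries its own sigval, can be checked from user space with standard POSIX calls. The sketch below is illustrative only: the function name `queue_and_drain` is hypothetical, it assumes a single-threaded process, and it leaves SIGRTMIN blocked afterwards.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Queue `n` instances of SIGRTMIN to ourselves, each with a different
 * payload, then drain them synchronously. Returns how many were
 * received, or -1 on error. */
int queue_and_drain(const int *values, int n, int *received)
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);

    /* Block the signal so the instances stay pending instead of being
     * delivered to a handler. */
    if (sigprocmask(SIG_BLOCK, &set, NULL) != 0)
        return -1;

    for (int i = 0; i < n; i++) {
        union sigval sv = { .sival_int = values[i] };
        if (sigqueue(getpid(), SIGRTMIN, sv) != 0)
            return -1;
    }

    /* Zero timeout: poll for pending instances without sleeping. */
    struct timespec zero = { 0, 0 };
    siginfo_t info;
    int got = 0;
    while (got < n && sigtimedwait(&set, &info, &zero) == SIGRTMIN)
        received[got++] = info.si_value.sival_int;
    return got;
}
```

Each queued instance comes back separately with its own value, so coalescing them into one 'signal is ready' event would lose the payloads, which is the information loss described in the message above.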
Re: [take19 1/4] kevent: Core files.
On Mon, Oct 16, 2006 at 03:16:15AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote: Evgeniy Polyakov wrote: The whole idea of mmap buffer seems to be broken, since those who asked for creation do not like existing design and do not show theirs... What kind of argumentation is that? Because my attempt to implement it doesn't work and nobody right away has a better suggestion this means the idea is broken. Nonsense. Ok, let's reformulate: My attempt works, but nobody around likes it, I remove it and wait until some other implement it. It just means that time should be spend on thinking about this. You cut all this short by rushing out your attempt without any discussions. Unfortunately nobody else really looked at the approach so it lingered around for some weeks. Well, now it is clear that it is not the right approach and we can start thinking about it again. I talked about it in the last 13 releases of the kevent, and _noone_ said at least some comments. And now I get - 'it is broken, it does not work, there are problems, we do not want it' and the like. I tried hardly to show that it does work and problems shown can not happen, but noone still hears me. Since I think it is not that interface which is 100% required for correct functionality, I removed it. When there are better suggestions and implementation we can return to them of course. You seems to not checked the code - each event can be marked as ready only one time, which means only one copy and so on. It was done _specially_. And it is not limitation, but new approach. I know that it is done deliberately and I tell you that this is wrong and unacceptable. Realtime signals are one event which need to have more than one event queued. This is no description of what you have implemented, it's a description of the reality of realtime signals. RT signals are queued. They carry a data value (the sigval_t object) which can be unique for each signal delivery. 
Coalescing the signal events therefore leads to information loss. Therefore, at the very least for signal we need to have the ability to queue more than one event for each event source. Not having this functionality means that signals and likely other types of events cannot be implemented using kevent queues. Well, my point about rt-signals is that they do not deserve to be resurrected, but it is only my point :) In case it is still used, each signal setup should create event - many signals means many events, each signal can be sent with different parameters - each event should correspond to one unique case. Queue of the same signals or any other events has fundamental flawness (as any other ring buffer implementation, which has queue size) - it's size of the queue and extremely bad case of the overflow. Of course there are additional problems. Overflows need to be handled. But this is nothing which is unsolvable. I strongly disagree that having design which allows overflows is acceptible - do we really want rt-signals queue overflow problems in new place? Instead some complex allocation scheme can be created. So, the same event may not be ready several times. Any design which allows to create infinite number of events generated for the same case is broken, since consumer can be in situation, when it can not handle that flow. That's complete nonsense. Again, for RT signals it is very reasonable and not broken to have multiple outstanding signals. The same signal with different payload is acceptible, but when number of them increases ulimit and they are started to be forgotten - that's what I call broken design. That is why poll() returns only POLLIN when data is ready in network stack, but is not trying to generate some kind of a signal for each byte/packet/MTU/MSS received. It makes no sense to drag poll() into this discussion. poll() is a very limited interface. The new event handling is supposed to be the opposite, namely, usable for all kinds of events. 
Arguing that because poll() does it like this just means you don't see what big step is needed to get to the goal of a unified event handling. The shackles of poll() must be left behind. Kevent is that subsystem, and for now it works quite good. RT signals have design problems, and I will not repeate the same error with similar limits in kevent. I don't know what to say. You claim to be the source of all wisdom is OS design. Maybe you should design your own OS, from ground up. I wonder how many people would like that since all your arguments are squarely geared towards optimizing the implementation. But: the implementation is irrelevant without users. The functionality users (= programmers) want and need is what must drive the implementation. And RT signals are definitely heavily used and liked by programmers. You have to accept that you try to modify an OS which has that functionality regardless of how
Re: [take19 1/4] kevent: Core files.
Ulrich Drepper wrote: Evgeniy Polyakov wrote: Existing design does not allow overflow. And I've pointed out a number of times that this is not practical at best. There are event sources which can create events which cannot be coalesced into one single event as it would be required with your design. Signals are one example, specifically realtime signals. If we do not want the design to be limited from the start this approach has to be thought over. So zap mmap() support completely, since it is not usable at all. We wont discuss on it. Initial implementation did not have it. But I was requested to do it, and it is ready now. No one likes it, but no one provides an alternative implementation. We are stuck. We need the mapped ring buffer. The current design (before it was removed) was broken but this does not mean it shouldn't be implemented. We just need more time to figure out how to implement it correctly. Considering the if at all and if then how of ring buffer implemetation I'd like to throw in some ideas I had when reading the discussion and respective code. If I understood Ulrich Drepper right, his notion of a generic event handling interface is, that it has to be flexible enough to transport additional info from origin to userspace, and to support queuing of events from the same origin, so that additional per-event-occurrence data doesn't get lost, which would happen when coalescing multiple events into one until delivery. From what I read he says ring buffer is broken because of insufficient space for additional data (mukevent) and the limited number of events that can be put into ring buffer. Another argument is missing notification of userspace about dropped events in case ring buffer limit is reached. (is that right?) I see no reason why kevent couldn't be modified to fit (all) these needs. 
While modifying the server-example and writing a client using kevent I came across the coalescing problem, there were more incoming connections than accept events, and I had to work around that. In this case the pure number of coalesced events would suffice, while it wouldn't for the example of RT-signals that Ulrich Drepper gave. So if coalescing can be done at all or if it is impossible depends on the type of event. The same goes for additional data delivered with the events. There might be no panacea for all possible scenarios with one fixed design. Either performance suffers for 'lightweight' events which don't need additional data and/or coalescing is not problematic and/or ring buffer, or kevent is not usable for other types of events. Why not treat different things differently, and let the (kernel-)user decide. I don't know if I got all this right, but if, then ring buffer is needed especially for cases where coalescing is not possible and additional data has to be delivered for each triggered notification (so the pure number of events is not enough; other reasons? performance? ). To me it doesn't make sense to have kevent fill memory and use processor-time if buffer is not used at all, which is the case when using kevent_getevents. So here are my Ideas: Make usage of ring buffer optional, if not required for specific event-type it might be chosen by userspace-code. Make limit of events in ring buffer optional and controllable from userspace. Regarding mukevent I'm thinking of a event-type specific struct, that is filled by the originating code, and placed into a per-event-type ring buffer (which requires modification of kevent_wait). To my limited understanding it seems that alternative or modified versions of kevent_storage_ready, (__)kevent_requeue and kevent_user_ring_add_event could return a void pointer to the position in buffer, and all kevent has to know about is the size of the struct. 
If coalescing doesn't hurt for a specific event type, it might just be modified to notify userspace about the number of coalesced events. Make it depend on the type of event. I know this doesn't address all objections that have been made, and Evgeniy, big sorry for this being just talk again, and maybe not even applicable for some reasons I do not see, but maybe it's worth consideration. I'll gladly try to put that into code and see where it leads. I think kevent is great, and if things can be done to increase its genericity without sacrificing performance, why not. Sorry for the length of the post and the repetitions, Johann
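Johann's last suggestion, reporting how many occurrences were folded into one notification, can be sketched in a few lines of C. This is a minimal userspace illustration, not kevent code: `struct demo_event`, `demo_event_fire()` and `demo_event_consume()` are hypothetical names, and the layout is not that of `struct ukevent`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event record; NOT the real struct ukevent layout. */
struct demo_event {
	uint32_t id;        /* which source, e.g. a listening socket */
	uint32_t ready;     /* 1 while queued as ready, 0 otherwise */
	uint32_t coalesced; /* occurrences folded into this notification */
};

/*
 * Called by the event source each time the condition fires.
 * Returns 1 if the event was newly marked ready, 0 if the occurrence
 * was coalesced into a notification that is already pending.
 */
static int demo_event_fire(struct demo_event *ev)
{
	assert(ev->coalesced != UINT32_MAX); /* sketch: no saturation handling */
	ev->coalesced++;
	if (ev->ready)
		return 0;
	ev->ready = 1;
	return 1;
}

/*
 * Called by the consumer. Returns how many occurrences were folded
 * together and re-arms the event.
 */
static uint32_t demo_event_consume(struct demo_event *ev)
{
	uint32_t n = ev->coalesced;

	ev->coalesced = 0;
	ev->ready = 0;
	return n;
}
```

A bare count like this would be enough for the accept() case described above, where the consumer only needs to know how many connections are waiting. It does not carry a per-occurrence payload, so it would not cover the realtime-signal case raised earlier in the thread.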
Re: [take19 1/4] kevent: Core files.
Evgeniy Polyakov wrote:
> Existing design does not allow overflow.

And I've pointed out a number of times that this is not practical at best. There are event sources which can create events which cannot be coalesced into one single event as it would be required with your design. Signals are one example, specifically realtime signals. If we do not want the design to be limited from the start this approach has to be thought over.

>> So zap mmap() support completely, since it is not usable at all. We wont discuss on it.
> Initial implementation did not have it. But I was requested to do it, and it is ready now. No one likes it, but no one provides an alternative implementation. We are stuck.

We need the mapped ring buffer. The current design (before it was removed) was broken but this does not mean it shouldn't be implemented. We just need more time to figure out how to implement it correctly.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take19 1/4] kevent: Core files.
On Wed, Oct 04, 2006 at 10:57:32AM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote: On 10/3/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c These are simple programs which by themselves have problems. For instance, I consider a very bad idea to hardcode the size of the ring buffer. Specifying macros in the header file counts as hardcoding. Systems grow over time and so will the demand of connections. I have no problem with the kernel hardcoding the value internally (or having a /proc entry to select it) but programs should be able to dynamically learn about the value so they don't have to be recompiled. Well, it is possible to create /sys/proc entry for that, and even now userspace can grow mapping ring until it is forbiden by kernel, which means limit is reached. Actually the whole idea with global limit of kevents does not sound very good to me, but it is required to remove overflow in mapped buffer. But more problematic is that I don't see how the interfaces can be efficiently used in multi-threaded (or multi-process) programs. How would multiple threads using the same kevent queue and running in the same kevent_get_events() loop work out? How do they guarantee that each request is only handled once? kqueue_dequeue_ready() is atomic and this function removes kevent from ready queue so other thread can not get it. From what I see now this means a second data structure is needed to keep track of the state of each entry. But even then, how do we even recognized used ring buffer entries? For instance, assume two threads. Both call get_events, one event is reported, both threads are woken up (which is another thing to consider, more later). One thread uses ring buffer entry, the other goes back to sleep in get_events. Now, how does the kernel know when the other thread is done working on the ring buffer entry? 
There might be lots of entries coming in overflowing the entire buffer. Heck, you don't even need two threads for this scenario. Are you talking about mapped buffer or syscall interface? The former has special syscall kevent_wait(), which reports number of 'processed' events and first processed number, so kernel can remove all appropriate events. The latter is described above - kqueue_dequeue_ready() is atomic, so that event will be removed from the ready queue and optionally from the whole kevent tree. It is possible to work with both interfaces at the same time, since mapped buffer contains a copy of the event, which is potentially freed and processed by other thread. Actually I do not like idea of mapped ring anyway, since if application uses a lot of events, it will batch them into big chunks, so syscall overhead is negligible, if application uses small number of events, syscalls will be rare and will not hurt performance. When I was thinking about this (and discussing it in Ottawa) I was always assuming that we have a status field in the ring buffer entry which lets the userlevel code indicate whether the entry is free again or not. This requires a writable mapping, yes, and potentially causes cache line ping-pong. I think Zach mentioned he has some ideas about this. As far as I can see, there are no other ideas on how to implement ring buffer, so I did it like I wanted. It has some limitation indeed, but since I do not see any other code, how can I say what is better or worse? As for the multiple thread wakeup, I mentioned this before. We have to avoid the trampling herd problem. We cannot wakeup all waiters. But we also cannot assume that, without protocols, waking up just one for each available entry is sufficient. So the first question is: what is the current policy? It is a good practice to _not_ share the same queue between a lot of threads. Currently all waiters are awakened. AIO was removed from patchset by request of Cristoph. 
Timers, network AIO, fs AIO, socket nortifications and poll/select events work well with existing structures. Well, excuse me if I don't take your word for it. I agree, the AIO code should not be submitted along with this. The same for any other code using the event handling. But we need to check whether the interface is generic enough to accomodate them in a way which actually makes sense. Again, think highly threaded processes or multiple processes sharing the same event queue. You missed the point. I implemented _all_ above and it does work. Although it was removed from submission patchset. You can find all patches on kevent homepage, they were posted to lkml@ and netdev@ too many times to miss them. It is even possible to create variable sized kevents - each kevent contain pointer to user's data, which can be considered as pointer to additional area (it's size kernel implementation for given kevent type can determine from other parameters or use
Re: [take19 1/4] kevent: Core files.
On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote: Well, it is possible to create /sys/proc entry for that, and even now userspace can grow mapping ring until it is forbiden by kernel, which means limit is reached. No need for yet another /sys/proc entry. Right now, I (for example) may have a use for Generic event handling, but for a program that needs XXX.XXX handles, and about XX.XXX events per second. Right now, this program uses epoll, and reaches no limit at all, once you pass the ulimit -n, and other kernel wide tunes of course, not related to epoll. With your current kevent, I cannot switch to it, because of hardcoded limits. I may be wrong, but what is currently missing for me is : - No hardcoded limit on the max number of events. (A process that can open XXX.XXX files should be allowed to open a kevent queue with at least XXX.XXX events). Right now thats not clear what happens IF the current limit is reached. - In order to avoid touching the whole ring buffer, it might be good to be able to reset the indexes to the beginning when ring buffer is empty. (So if the user land is responsive enough to consume events, only first pages of the mapping would be used : that saves L1/L2 cpu caches) A plus would be - A working/usable mmap ring buffer implementation, but I think its not mandatory. System calls are not that expensive, especially if you can batch XX events per syscall (like epoll). Nice thing with a ring buffer is that we touch less cache lines than say epoll that have lot of linked structures. About mmap, I think you might want a hybrid thing : One writable page where userland can write its index, (and hold one or more futex shared by kernel) (with appropriate thread locking in case multiple threads want to dequeue events). In fast path, no syscalls are needed to maintain this user index. XXX readonly pages (for user, but r/w for kernel), where kernel write its own index, and events of course. 
Using separate cache lines avoids false sharing: the kernel can update its own index and events without having to pay the price of cache line ping-pong. It could use the futex infrastructure to wake up 'only' one thread instead of all threads waiting for an event. Eric
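The split layout Eric describes, with a consumer-owned index on one page and a producer-owned index plus events on others, can be sketched in portable C11. This is a single-producer, single-consumer userspace model under stated assumptions, not kevent code: all `demo_*` names, the 64-byte cache line and the ring size are hypothetical, and the futex-based wakeup is left out entirely.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u  /* hypothetical; must be a power of two */
#define DEMO_CACHELINE 64  /* assumed cache line size */

static_assert((DEMO_RING_SIZE & (DEMO_RING_SIZE - 1)) == 0,
	      "ring size must be a power of two");

struct demo_event {
	uint32_t id;
	uint32_t data;
};

/* Stands in for the pages only the kernel writes (read-only for the user). */
struct demo_kernel_area {
	alignas(DEMO_CACHELINE) _Atomic uint32_t head; /* next slot to fill */
	alignas(DEMO_CACHELINE) struct demo_event ring[DEMO_RING_SIZE];
};

/* Stands in for the single page the user writes. */
struct demo_user_area {
	alignas(DEMO_CACHELINE) _Atomic uint32_t tail; /* next slot to read */
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int demo_produce(struct demo_kernel_area *k, struct demo_user_area *u,
			const struct demo_event *ev)
{
	uint32_t head = atomic_load_explicit(&k->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&u->tail, memory_order_acquire);

	if (head - tail == DEMO_RING_SIZE)
		return -1;
	k->ring[head % DEMO_RING_SIZE] = *ev;
	/* Publish the slot only after its contents are written. */
	atomic_store_explicit(&k->head, head + 1, memory_order_release);
	return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int demo_consume(struct demo_kernel_area *k, struct demo_user_area *u,
			struct demo_event *out)
{
	uint32_t tail = atomic_load_explicit(&u->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&k->head, memory_order_acquire);

	if (tail == head)
		return -1;
	*out = k->ring[tail % DEMO_RING_SIZE];
	/* Hand the slot back to the producer without any system call. */
	atomic_store_explicit(&u->tail, tail + 1, memory_order_release);
	return 0;
}
```

Each side writes only its own index, so neither cache line bounces on every update. Several consumer threads would still need a lock or a compare-and-swap on `tail`, which is the "appropriate thread locking" caveat in the message above.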
Re: [take19 1/4] kevent: Core files.
On Thursday 05 October 2006 12:55, Evgeniy Polyakov wrote:
> On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote:
>> What is missing or not obvious is: if events are skipped because of overflows, what happens? Connections stuck forever? Hope that everything will restore itself? Is the kernel able to SIGNAL this problem to user land?
> Existing code does not overflow by design, but can consume a lot of memory. I talked about the case when there will be some limit on the number of entries put into the mapped buffer.

You still don't answer my question. Please answer the question. Recap: you have a max of events queued. A network message comes and the kernel wants to add another event. It cannot, because the limit is reached. How does the user program know that this problem was hit?

> It is the same. What if the ring buffer was grown up to 3 entries, and is now empty, and we need to put 4 entries there? Grow it again? It can be done, easily, but it looks like a workaround, not a solution. And it is highly unlikely that, in a situation when there are a lot of events, the ring can be empty.

I don't speak of re-allocation of the ring buffer. I don't mind allocating a big enough buffer at startup. Say you have allocated a ring buffer of 1024*1024 entries. Then you queue 100 events per second, and dequeue them immediately. No need to blindly use all 1024*1024 slots in the ring buffer, doing index = (index+1)%(1024*1024)

> epoll() does not have mmap. The problem is not about how many events can be put into the kernel, but how many of them can be put into the mapped buffer. There is no problem if mmap is turned off.

So zap mmap() support completely, since it is not usable at all. We won't discuss it.
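Eric's point about not marching through all 1024*1024 slots can be illustrated with a ring that rewinds both indices whenever it drains. This is a single-threaded sketch only: `struct demo_ring`, its functions and the `max_slot` bookkeeping field are hypothetical, and a ring shared between kernel and userspace would need the rewind to be synchronized between both sides.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS 1024u /* hypothetical ring size, power of two */

static_assert((DEMO_SLOTS & (DEMO_SLOTS - 1)) == 0,
	      "ring size must be a power of two");

struct demo_ring {
	uint32_t head;     /* next slot the producer fills */
	uint32_t tail;     /* next slot the consumer reads */
	uint32_t max_slot; /* highest slot ever written (demo bookkeeping) */
	uint32_t slots[DEMO_SLOTS];
};

/* Returns 0 on success, -1 if the ring is full. */
static int demo_push(struct demo_ring *r, uint32_t value)
{
	uint32_t slot;

	if (r->head - r->tail == DEMO_SLOTS)
		return -1;
	slot = r->head % DEMO_SLOTS;
	r->slots[slot] = value;
	if (slot > r->max_slot)
		r->max_slot = slot;
	r->head++;
	return 0;
}

/* Returns 0 on success, -1 if the ring is empty. */
static int demo_pop(struct demo_ring *r, uint32_t *value)
{
	if (r->head == r->tail)
		return -1;
	*value = r->slots[r->tail % DEMO_SLOTS];
	r->tail++;
	/* Drained: rewind so the next event reuses slot 0 (still warm in cache). */
	if (r->tail == r->head)
		r->head = r->tail = 0;
	return 0;
}
```

With a responsive consumer only the first few slots are ever touched. When the consumer falls behind, the ring simply behaves like an ordinary wrapping ring, so the rewind is an optimization and does not change the capacity.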
Re: [take19 1/4] kevent: Core files.
On Thu, Oct 05, 2006 at 02:09:31PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote:
> On Thursday 05 October 2006 12:55, Evgeniy Polyakov wrote:
>> On Thu, Oct 05, 2006 at 12:45:03PM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote:
>>> What is missing or not obvious is: if events are skipped because of overflows, what happens? Connections stuck forever? Hope that everything will restore itself? Is the kernel able to SIGNAL this problem to user land?
>> Existing code does not overflow by design, but can consume a lot of memory. I talked about the case when there will be some limit on the number of entries put into the mapped buffer.
> You still don't answer my question. Please answer the question. Recap: you have a max of events queued. A network message comes and the kernel wants to add another event. It cannot, because the limit is reached. How does the user program know that this problem was hit?

Existing design does not allow overflow. If an event was added into the queue (like a user-requested notification when new data has arrived), it is guaranteed that there will be a place to put that event into the mapped buffer when it is ready. If the user wants to add another event (for example, after accept() the user wants to add another socket with a request for notification about data arrival on that socket), it can fail though. This limit is introduced only because of the mmap buffer.

>> It is the same. What if the ring buffer was grown up to 3 entries, and is now empty, and we need to put 4 entries there? Grow it again? It can be done, easily, but it looks like a workaround, not a solution. And it is highly unlikely that, in a situation when there are a lot of events, the ring can be empty.
> I don't speak of re-allocation of the ring buffer. I don't mind allocating a big enough buffer at startup. Say you have allocated a ring buffer of 1024*1024 entries. Then you queue 100 events per second, and dequeue them immediately. No need to blindly use all 1024*1024 slots in the ring buffer, doing index = (index+1)%(1024*1024)

But what if they are not dequeued immediately? What if the rate is high and, while one tries to dequeue, the system adds more events?

>> epoll() does not have mmap. The problem is not about how many events can be put into the kernel, but how many of them can be put into the mapped buffer. There is no problem if mmap is turned off.
> So zap mmap() support completely, since it is not usable at all. We won't discuss it.

Initial implementation did not have it. But I was requested to do it, and it is ready now. No one likes it, but no one provides an alternative implementation. We are stuck.

--
Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
On Thursday 05 October 2006 12:21, Evgeniy Polyakov wrote: On Thu, Oct 05, 2006 at 11:56:24AM +0200, Eric Dumazet ([EMAIL PROTECTED]) wrote: On Thursday 05 October 2006 10:57, Evgeniy Polyakov wrote: Well, it is possible to create /sys/proc entry for that, and even now userspace can grow mapping ring until it is forbiden by kernel, which means limit is reached. No need for yet another /sys/proc entry. Right now, I (for example) may have a use for Generic event handling, but for a program that needs XXX.XXX handles, and about XX.XXX events per second. Right now, this program uses epoll, and reaches no limit at all, once you pass the ulimit -n, and other kernel wide tunes of course, not related to epoll. With your current kevent, I cannot switch to it, because of hardcoded limits. I may be wrong, but what is currently missing for me is : - No hardcoded limit on the max number of events. (A process that can open XXX.XXX files should be allowed to open a kevent queue with at least XXX.XXX events). Right now thats not clear what happens IF the current limit is reached. This forces to overflows in fixed sized memory mapped buffer. If we remove memory mapped buffer or will allow to have overflows (and thus skipped entries) keven can easily scale to that limits (tested with xx.xxx events though). - In order to avoid touching the whole ring buffer, it might be good to be able to reset the indexes to the beginning when ring buffer is empty. (So if the user land is responsive enough to consume events, only first pages of the mapping would be used : that saves L1/L2 cpu caches) And what happens when there are 3 empty at the beginning and \we need to put there 4 ready events? Couldn't there be 3 areas in the mmap buffer: - Unused: entries that the kernel can alloc from. - Alloced: entries alloced by kernel but not yet used by user. Kernel can update these if new events requires that. - Consumed: entries that the user are processing. 
The user takes a set of alloced entries and makes them consumed. Then it processes the events, after which it makes them unused. If there are no unused entries and the kernel needs some, it has to wait for free entries. The user has to notify the kernel when unused entries become available. It could set a flag in the mmap'ed area to avoid unnecessary wakeups. There are some details with indexing and wakeup notification that I have left out, but I hope my idea is clear. I could give a more detailed description if requested. Also, I'm a user-level programmer so I might not get the whole picture. Hans Henrik Happe
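One way to picture the three areas is a ring with three monotonically increasing indices, where the regions fall out of the differences between them. The sketch below is a single-threaded userspace model under stated assumptions: the `demo_*` names and the tiny ring size are hypothetical, and the wakeup flag and any kernel/user synchronization are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS 4u /* hypothetical ring size, power of two */

static_assert((DEMO_SLOTS & (DEMO_SLOTS - 1)) == 0,
	      "ring size must be a power of two");

/*
 * Regions, in ring order:
 *   [free_idx, take_idx)   consumed: the user is processing these
 *   [take_idx, alloc_idx)  alloced:  filled by the kernel, not yet taken
 *   everything else        unused:   the kernel may allocate from here
 */
struct demo_ring3 {
	uint32_t alloc_idx; /* kernel: next slot to allocate */
	uint32_t take_idx;  /* user: first alloced slot not yet taken */
	uint32_t free_idx;  /* user: first consumed slot not yet released */
	uint32_t slots[DEMO_SLOTS];
};

static uint32_t demo_unused(const struct demo_ring3 *r)
{
	return DEMO_SLOTS - (r->alloc_idx - r->free_idx);
}

/* Kernel side: unused -> alloced. Returns -1 when it would have to wait. */
static int demo_kernel_alloc(struct demo_ring3 *r, uint32_t event)
{
	if (demo_unused(r) == 0)
		return -1;
	r->slots[r->alloc_idx % DEMO_SLOTS] = event;
	r->alloc_idx++;
	return 0;
}

/*
 * User side: alloced -> consumed. Takes up to max entries, stores the
 * slot number of the first one in *first and returns how many were taken.
 */
static uint32_t demo_user_take(struct demo_ring3 *r, uint32_t max,
			       uint32_t *first)
{
	uint32_t avail = r->alloc_idx - r->take_idx;
	uint32_t n = avail < max ? avail : max;

	*first = r->take_idx % DEMO_SLOTS;
	r->take_idx += n;
	return n;
}

/* User side: consumed -> unused, after the events have been processed. */
static void demo_user_release(struct demo_ring3 *r, uint32_t n)
{
	assert(n <= r->take_idx - r->free_idx);
	r->free_idx += n;
}
```

Entries that are taken but not yet released still occupy slots, which is why the kernel side has to wait (here: gets -1) until the user releases them.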
Re: [take19 1/4] kevent: Core files.
On Thu, Oct 05, 2006 at 04:01:19PM +0200, Hans Henrik Happe ([EMAIL PROTECTED]) wrote:
>> And what happens when there are 3 empty at the beginning and we need to put there 4 ready events?
> Couldn't there be 3 areas in the mmap buffer:
> - Unused: entries that the kernel can alloc from.
> - Alloced: entries alloced by kernel but not yet used by user. Kernel can update these if new events require that.
> - Consumed: entries that the user is processing.
>
> The user takes a set of alloced entries and makes them consumed. Then it processes the events, after which it makes them unused. If there are no unused entries and the kernel needs some, it has to wait for free entries. The user has to notify the kernel when unused entries become available. It could set a flag in the mmap'ed area to avoid unnecessary wakeups. There are some details with indexing and wakeup notification that I have left out, but I hope my idea is clear. I could give a more detailed description if requested. Also, I'm a user-level programmer so I might not get the whole picture.

This looks good on a picture, but how can you put it into page-based storage without major and complex shared structures, which should be properly locked between kernelspace and userspace?

> Hans Henrik Happe

--
Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
On 9/20/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
> This patch includes core kevent files: [...]

I tried to look at the example programs before and failed. I tried again. Where can I find up-to-date example code?

Some other points:

- I really would prefer not to rush all this into the upstream kernel. The main problem is that the ring buffer interface is a shared data structure. These are always tricky. We need to find the right combination between size (as small as possible) and supporting all the interfaces.

- so far only the timer and aio notification is speced out. What about the rest? Are we sure all aspects can be expressed? I am not yet.

- we need an interface to add an event from userlevel. I.e., we need to be able to synthesize events. There are events (like, for instance the async DNS functionality) which come from userlevel code. I would very much prefer we look at the other events before setting the data structures in stone.
Re: [take19 1/4] kevent: Core files.
On Tue, Oct 03, 2006 at 11:34:02PM -0700, Ulrich Drepper ([EMAIL PROTECTED]) wrote: On 9/20/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: This patch includes core kevent files: [...] I tried to look at the example programs before and failed. I tried again. Where can I find up-to-date example code? http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c Structures were not changed from the beginning of kevent project. Some other points: - I really would prefer not to rush all this into the upstream kernel. The main problem is that the ring buffer interface is a shared data structure. These are always tricky. We need to find the right combination between size (as small as possible) and supporting all the interfaces. mmap interface itself is in question, since it allows to create dos since there are no rlimits for pinned memory. - so far only the timer and aio notification is speced out. What about the rest? Are we sure all aspects can be expressed? I am not yet. AIO was removed from patchset by request of Cristoph. Timers, network AIO, fs AIO, socket nortifications and poll/select events work well with existing structures. - we need an interface to add an event from userlevel. I.e., we need to be able to synthesize events. There are events (like, for instance the async DNS functionality) which come from userlevel code. I would very much prefer we look at the other events before setting the data structures in stone. Signals and userspace events (hello solaris) easily fits into existing structures. It is even possible to create variable sized kevents - each kevent contain pointer to user's data, which can be considered as pointer to additional area (it's size kernel implementation for given kevent type can determine from other parameters or use predefined one and fetch additional data in -enqueue() callback). 
-- Evgeniy Polyakov
Re: [take19 1/4] kevent: Core files.
On 10/3/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c These are simple programs which by themselves have problems. For instance, I consider a very bad idea to hardcode the size of the ring buffer. Specifying macros in the header file counts as hardcoding. Systems grow over time and so will the demand of connections. I have no problem with the kernel hardcoding the value internally (or having a /proc entry to select it) but programs should be able to dynamically learn about the value so they don't have to be recompiled. But more problematic is that I don't see how the interfaces can be efficiently used in multi-threaded (or multi-process) programs. How would multiple threads using the same kevent queue and running in the same kevent_get_events() loop work out? How do they guarantee that each request is only handled once? From what I see now this means a second data structure is needed to keep track of the state of each entry. But even then, how do we even recognized used ring buffer entries? For instance, assume two threads. Both call get_events, one event is reported, both threads are woken up (which is another thing to consider, more later). One thread uses ring buffer entry, the other goes back to sleep in get_events. Now, how does the kernel know when the other thread is done working on the ring buffer entry? There might be lots of entries coming in overflowing the entire buffer. Heck, you don't even need two threads for this scenario. When I was thinking about this (and discussing it in Ottawa) I was always assuming that we have a status field in the ring buffer entry which lets the userlevel code indicate whether the entry is free again or not. This requires a writable mapping, yes, and potentially causes cache line ping-pong. I think Zach mentioned he has some ideas about this. As for the multiple thread wakeup, I mentioned this before. 
We have to avoid the trampling herd problem. We cannot wake up all waiters. But we also cannot assume that, without protocols, waking up just one for each available entry is sufficient. So the first question is: what is the current policy?

> AIO was removed from patchset by request of Cristoph. Timers, network AIO, fs AIO, socket notifications and poll/select events work well with existing structures.

Well, excuse me if I don't take your word for it. I agree, the AIO code should not be submitted along with this. The same for any other code using the event handling. But we need to check whether the interface is generic enough to accommodate them in a way which actually makes sense. Again, think highly threaded processes or multiple processes sharing the same event queue.

> It is even possible to create variable sized kevents - each kevent contains a pointer to user's data, which can be considered as a pointer to an additional area (its size the kernel implementation for a given kevent type can determine from other parameters, or use a predefined one and fetch additional data in the ->enqueue() callback).

That sounds interesting and certainly helps with securing the interface for the future. But if there is anything we can do to avoid unnecessary costs we should do it, even if this means investigating all this further.
[take19 1/4] kevent: Core files.
Core files. This patch includes core kevent files: - userspace controlling - kernelspace interfaces - initialization - notification state machines Signed-off-by: Evgeniy Polyakov [EMAIL PROTECTED] diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index dd63d47..c10698e 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -317,3 +317,6 @@ ENTRY(sys_call_table) .long sys_tee /* 315 */ .long sys_vmsplice .long sys_move_pages + .long sys_kevent_get_events + .long sys_kevent_ctl + .long sys_kevent_wait /* 320 */ diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 5d4a7d1..a06b76f 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -710,7 +710,10 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages + .quad sys_kevent_get_events + .quad sys_kevent_ctl + .quad sys_kevent_wait /* 320 */ ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index fc1c8dd..68072b5 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -323,10 +323,13 @@ #define __NR_sync_file_range 314 #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages317 +#define __NR_kevent_get_events 318 +#define __NR_kevent_ctl319 +#define __NR_kevent_wait 320 #ifdef __KERNEL__ -#define NR_syscalls 318 +#define NR_syscalls 321 /* * user-visible error numbers are in the range -1 - -128: see diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 94387c9..ee907ad 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,16 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) 
+#define __NR_kevent_ctl		281
+__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl)
+#define __NR_kevent_wait	282
+__SYSCALL(__NR_kevent_wait, sys_kevent_wait)
 
 #ifdef __KERNEL__
 
-#define __NR_syscall_max __NR_move_pages
+#define __NR_syscall_max __NR_kevent_wait
 
 #ifndef __NO_STUBS
 
diff --git a/include/linux/kevent.h b/include/linux/kevent.h
new file mode 100644
index 000..24ced10
--- /dev/null
+++ b/include/linux/kevent.h
@@ -0,0 +1,195 @@
+/*
+ * 2006 Copyright (c) Evgeniy Polyakov [EMAIL PROTECTED]
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __KEVENT_H
+#define __KEVENT_H
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/mutex.h>
+#include <linux/wait.h>
+#include <linux/net.h>
+#include <linux/rcupdate.h>
+#include <linux/kevent_storage.h>
+#include <linux/ukevent.h>
+
+#define KEVENT_MIN_BUFFS_ALLOC	3
+
+struct kevent;
+struct kevent_storage;
+typedef int (* kevent_callback_t)(struct kevent *);
+
+/* @callback is called each time new event has been caught. */
+/* @enqueue is called each time new event is queued. */
+/* @dequeue is called each time event is dequeued. */
+
+struct kevent_callbacks {
+	kevent_callback_t	callback, enqueue, dequeue;
+};
+
+#define KEVENT_READY		0x1
+#define KEVENT_STORAGE		0x2
+#define KEVENT_USER		0x4
+
+struct kevent
+{
+	/* Used for kevent freeing.*/
+	struct rcu_head		rcu_head;
+	struct ukevent		event;
+	/* This lock protects ukevent manipulations, e.g. ret_flags changes. */
+	spinlock_t		ulock;
+
+	/* Entry of user's tree. */
+	struct rb_node		kevent_node;
+	/* Entry of origin's queue. */
+	struct list_head	storage_entry;
+	/* Entry of user's ready. */
+	struct list_head	ready_entry;
+
+	u32			flags;
+
+	/* User who requested this kevent. */
+