Re: Linux's implementation of poll() not scalable?
Mike Jagdis wrote: > This patch firstly extends the wait queue mechanism > to allow an arbitrary action to be performed. Then I rewrote > the select/poll implementation to use event queueing to avoid > rescanning descriptors that had not changed - and restructured > the loops to be rather more efficient. This approach doesn't > need any changes to driver poll routines, it doesn't need > backwards mapping struct files. ... > Performance graphs and the lmbench derived test programs I > used are at http://www.purplet.demon.co.uk/linux/select/ ... > Oh, and I updated this patch for 2.4.0-test9. I can't wait to run my benchmark on it... hope I can get to it soon. BTW, can you update that web page to also point to your patch? - Dan
RE: Linux's implementation of poll() not scalable?
Here's something I did last year and then put on ice, partly through lack of time and partly because I thought I'd pick it up for 2.5. All this talk of event queues misses one thing: we already have an event queue mechanism. They're called wait queues. The only problem is that the only on-event action possible is to wake the process (assuming it was asleep in the first place). This patch firstly extends the wait queue mechanism to allow an arbitrary action to be performed. Then I rewrote the select/poll implementation to use event queueing to avoid rescanning descriptors that had not changed - and restructured the loops to be rather more efficient. This approach doesn't need any changes to driver poll routines, it doesn't need backwards mapping struct files. It should be fairly easy to implement a /dev/poll mechanism using this, although I haven't yet. Yes, the change to wait queues has a slight cost, but it isn't great and the main part of it only happens if you actually sleep. Performance graphs and the lmbench derived test programs I used are at http://www.purplet.demon.co.uk/linux/select/ (bounce in and out of the index page 'cos the next and prev buttons aren't wired up :-) ) Oh, and I updated this patch for 2.4.0-test9. Comments and opinions are, as always, welcome :-). Mike select.patch
Readiness vs. completion (was: Re: Linux's implementation of poll() not scalable?)
John Gardiner Myers <[EMAIL PROTECTED]> wrote: > Your proposed interface suffers from most of the same problems as the > other Unix event interfaces I've seen. Key among the problems are > inherent race conditions when the interface is used by multithreaded > applications. > > The "stickiness" of the event binding can cause race conditions where > two threads simultaneously attempt to handle the same event. For > example, consider a socket that becomes writeable, delivering a writable > event to one of the multiple threads calling get_events(). While the > callback for that event is running, it writes to the socket, causing the > socket to become non-writable and then writeable again. That in turn > can cause another writable event to be delivered to some other thread. > ... > In the async I/O library I work with, this problem is addressed by > having delivery of the event atomically remove the binding. If the > event needs to be bound after the callback is finished, then it is the > responsibility of the callback to rebind the event. IMHO you're describing a situation where a 'completion notification event' (as with aio) would be more appropriate than a 'readiness notification event' (as with poll). With completion notification, one naturally expects 'edge triggered', 'one shot' behavior from the notification system, with no event coalescing, and there is no need to remove or reestablish bindings. > There are three performance issues that need to be addressed by the > implementation of get_events(). One is that events should preferably be given > to threads on the same CPU as the one that bound the event. That allows the > event's context to remain in the CPU's cache. > > Two is that of threads on a given CPU, events should wake threads in > LIFO order. This allows excess worker threads to be paged out. > > Three is that the event queue should limit the number of worker threads > contending with each other for CPU. If an event is triggered while > there are enough worker threads in runnable state, it is better to wait > for one of those threads to block before delivering the event. That describes NT's 'completion port / thread pooling' scheme, I think (which incidentally is a 'completion notification'-based rather than a 'readiness notification'-based scheme). I suspect readiness notification using edge triggering is a strange beast, not often seen in the wild, and hard to define precisely. I'm going to risk generalizing, and categorizing the existing base of application software into two groups. Would it be going too far to say the following: Readiness notification, like that provided by traditional poll(), fits naturally with level-triggered events with event coalescing, and a large body of traditional Unix software exists that uses this paradigm. Completion notification, like that provided by aio and NT's networking, fits naturally with edge-triggered events with no event coalescing, and a large body of win32 software exists that uses this paradigm. And, come to think of it, network programmers usually can be categorized into the same two groups :-) Each style of programming is an acquired taste. IMHO if Linux is to be maximally popular with software developers (desirable if we want to boost the number of apps available for Linux), it would help to cater to both flavors of network programming. So I'd like to see both a high-performance level-triggered readiness notification API with event coalescing, and a high-performance edge-triggered completion API with no event coalescing.
With luck, they'll be the same API, but with slightly different flag values. - Dan
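To make the remove-on-delivery discipline described above concrete, here is a minimal sketch; the aio_* names and the one-shot bind semantics are assumptions standing in for whatever the library actually calls them, not a real API:

    /*
     * Sketch of "delivery atomically removes the binding": the callback
     * must rebind for whatever it wants next.  aio_bind() and its types
     * are hypothetical stand-ins, not a real library interface.
     */
    #include <unistd.h>

    struct aio_event;                       /* opaque library handle */
    typedef void (*aio_cb)(struct aio_event *ev, int fd, void *arg);

    #define AIO_READ  1
    #define AIO_WRITE 2

    /* Assumed: delivers at most one event, then the binding is gone. */
    extern int aio_bind(int fd, int mask, aio_cb cb, void *arg);

    static void on_writable(struct aio_event *ev, int fd, void *arg);

    static void on_readable(struct aio_event *ev, int fd, void *arg)
    {
        char cmd[512];
        ssize_t n = read(fd, cmd, sizeof cmd);  /* read one command */
        if (n <= 0)
            return;                             /* EOF/error: drop fd */
        /* ... parse command, build response ... */
        /* The connection now wants a WRITE event, not another READ, so
         * the callback rebinds accordingly; no stale binding remains. */
        aio_bind(fd, AIO_WRITE, on_writable, arg);
    }

    static void on_writable(struct aio_event *ev, int fd, void *arg)
    {
        /* ... write the response, then rebind for the next command ... */
        aio_bind(fd, AIO_READ, on_readable, arg);
    }

Because at most one thread holds the binding at a time, the two-threads-one-event race described above cannot arise.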
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > So sticky arrays of events are good, while queues are bad. Let's take that > as one of the fundamentals. Please let's not. There is nothing fundamentally different between an event queue of size N and an interest set of size N. Your proposed interface suffers from most of the same problems as the other Unix event interfaces I've seen. Key among the problems are inherent race conditions when the interface is used by multithreaded applications. The "stickiness" of the event binding can cause race conditions where two threads simultaneously attempt to handle the same event. For example, consider a socket that becomes writeable, delivering a writable event to one of the multiple threads calling get_events(). While the callback for that event is running, it writes to the socket, causing the socket to become non-writable and then writeable again. That in turn can cause another writable event to be delivered to some other thread. In the async I/O library I work with, this problem is addressed by having delivery of the event atomically remove the binding. If the event needs to be bound after the callback is finished, then it is the responsibility of the callback to rebind the event. Typically, a server implements a command/response protocol, where the read callback reads a command, writes the response, and repeats. It is quite likely that after the read callback completes, the connection is interested in a write event, not another read event. Cancellation of event bindings is another area prone to race conditions. A thread canceling an event binding frequently needs to know whether or not that event has been delivered to some other thread. If it has, the canceling thread has to synchronize with the other thread before destroying any resources associated with the event. Multiple event queues are needed by libraries, as different libraries can have different thread pools. The threads in those different pools can have different characteristics, such as stack size, priority, CPU affinity, signal state, etc. There are three performance issues that need to be addressed by the implementation of get_events(). One is that events should preferably be given to threads on the same CPU as the one that bound the event. That allows the event's context to remain in the CPU's cache. Two is that of threads on a given CPU, events should wake threads in LIFO order. This allows excess worker threads to be paged out. Three is that the event queue should limit the number of worker threads contending with each other for CPU. If an event is triggered while there are enough worker threads in runnable state, it is better to wait for one of those threads to block before delivering the event.
RE: Linux's implementation of poll() not scalable?
> It doesn't practically matter how efficient the X server is when > you aren't busy, after all. A simple polling scheme (i.e. not using poll() or select(), just looping through all fd's trying nonblocking reads) is perfectly efficient when the server is 100% busy, and perfectly inefficient when there is nothing to do. I'm not saying that your statements are wrong--in your example, X is calling select() which is not wasting as much time as a hard-polling loop--but it's wrong to say that high-load efficiency is the primary concern. I would be horrified if X took a significant portion of the CPU time when many clients were connected, but none were actually doing anything. chris
Re: Linux's implementation of poll() not scalable?
In article <[EMAIL PROTECTED]> you write: >Linus Torvalds wrote: >> I'd much rather have an event interface that is documented to be edge- >> triggered and is really _lightweight_, than have another interface that >> starts out with some piggy features. > >Agreed (except for that 'edge-triggered' part), but I don't think >'level-triggered' implies piggy. I haven't benchmarked whether >kqueue() slows down the networking layer of FreeBSD yet; do you >suspect maintaining the level-triggered structures actually is >a bottleneck for them? I really don't think it's a bottleneck. At the moment, all events are maintained on a linked list. To dequeue an event, we simply: 1. take the event on the front of the list. 2. validate event. (call filter function) 3. copy event into return array. 4. put event back on end of list. If the `EV_ONESHOT' flag is set, we skip steps 2 & 4, and destroy the event after it is returned to the user. (we want to wait only once for this particular event) If the `EV_CLEAR' flag is set, we skip step 4. (pure edge-triggered delivery) Step 4 is pretty simple, just re-insertion back onto the queue. If you eliminate Step 2, then you have a `correctness' issue, where the application must deal with stale events. The validation function is equally lightweight and doesn't (IMO) cause a performance problem. >> ... the "re-bind()" approach works very simply, and means that the >> overhead of testing whether the event is still active is not a generic >> thing that _always_ has to be done, but something where the application >> can basically give the kernel the information that "this time we're >> leaving the event possibly half-done, please re-test now". > >Hmm. I don't like the extra system call, though. Any chance you'd be >willing to make get_events() take a vector of bind requests, so we can >avoid the system call overhead of re-binding? (Or is that too close >to kqueue for you?) IMO, I'd think that the calls should be orthogonal. If the "get_events()" call returns an array, why shouldn't the "bind_request()" call as well? Otherwise you're only amortizing the system calls in one direction. >And are you sure apps will always know whether they need to rebind? >Sometimes you're faced with a protocol stack which may or may not >read the requests fully, and which you aren't allowed to change. >It'd be nice to still have a high-performance interface that can deal with >that situation. Agreed. -- Jonathan
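For illustration only, the four dequeue steps above might be shaped like this in C; every name here is invented, and this is not FreeBSD's kqueue source:

    /* Invented names throughout; just the shape of the algorithm. */
    struct kevent { int ident; int filter; int flags; };
    struct knote  { struct kevent event; int flags; };
    struct kq;

    #define EV_ONESHOT 0x01
    #define EV_CLEAR   0x02

    extern struct knote *queue_pop(struct kq *kq);           /* assumed */
    extern void queue_push(struct kq *kq, struct knote *kn); /* helpers */
    extern int  filter_still_active(struct knote *kn);
    extern void knote_destroy(struct knote *kn);

    int kqueue_scan(struct kq *kq, struct kevent *out, int maxev)
    {
        int n = 0;
        struct knote *kn;

        while (n < maxev && (kn = queue_pop(kq)) != NULL) {   /* step 1 */
            if (!(kn->flags & EV_ONESHOT) &&
                !filter_still_active(kn))                     /* step 2 */
                continue;      /* stale: drop from queue, keep binding */
            out[n++] = kn->event;                             /* step 3 */
            if (kn->flags & EV_ONESHOT)
                knote_destroy(kn);     /* wait only once, then destroy */
            else if (!(kn->flags & EV_CLEAR))
                queue_push(kq, kn);    /* step 4: back on end of list */
            /* EV_CLEAR skips step 4: pure edge-triggered delivery */
        }
        return n;
    }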
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > However, we also need to remember what got us to this discussion in the > first place. One of the reasons why poll() is such a piggy interface is > exactly because it tries to be "nice" to the programmer. poll() is a piggy interface because it is O(n) in polled file descriptors, not because of its level-triggered nature. > I'd much rather have an event interface that is documented to be edge- > triggered and is really _lightweight_, than have another interface that > starts out with some piggy features. Agreed (except for that 'edge-triggered' part), but I don't think 'level-triggered' implies piggy. I haven't benchmarked whether kqueue() slows down the networking layer of FreeBSD yet; do you suspect maintaining the level-triggered structures actually is a bottleneck for them? > I do understand that level to some degree is "nicer", but everybody pretty > much agrees that apart from requiring more care, edge-triggered certainly > does everything the level-triggered interfaces do. Agreed. > For example, if you want to only partially handle an event (one of the > more valid arguments I've seen, although I do not agree with it actually > being all that common or important a thing to do), the edge-triggered > interface _does_ allow for that. It's fairly easy, in fact: you just > re-bind the thing. > > (My suggested "bind()" interface would be to just always add a newly bound > fd to the event queue, but I would not object to a "one-time test for > active" kind of "bind()" interface either - which would basically allow > for a poll() interface - and the existing Linux internal "poll()" > implementation actually already allows for exactly this so it wouldn't > even add any complexity). > ... the "re-bind()" approach works very simply, and means that the > overhead of testing whether the event is still active is not a generic > thing that _always_ has to be done, but something where the application > can basically give the kernel the information that "this time we're > leaving the event possibly half-done, please re-test now". Hmm. I don't like the extra system call, though. Any chance you'd be willing to make get_events() take a vector of bind requests, so we can avoid the system call overhead of re-binding? (Or is that too close to kqueue for you?) And are you sure apps will always know whether they need to rebind? Sometimes you're faced with a protocol stack which may or may not read the requests fully, and which you aren't allowed to change. It'd be nice to still have a high-performance interface that can deal with that situation. - Dan
Re: Linux's implementation of poll() not scalable?
Jim Gettys wrote: > So I want an interface in which I can get as many events as possible > at once, and one in which the events themselves can have appropriate > aggregation behavior. It isn't quite clear to me if the proposed interface > would have this property. I believe get_event, /dev/poll, and kqueue all share this property. e.g. none of them will present multiple POLLIN events per fd per call. Is that what you meant? - Dan
Re: Linux's implementation of poll() not scalable?
Note that there is another aspect to the efficiency / performance of the select/poll style of interfaces not immediately obvious, but which occurs as a result of how some (streaming/batching) protocols work. An X server does not call select all that often (probably one of the two most frequently used programs that care; though I believe getting the Apache case right is more important). X is such a streaming protocol: it is a feature that I don't have to do extra reads or system calls to deal with more data arriving from a client. An X server doesn't want one event generated for each incoming TCP segment: it merely needs to know there is data available on a file descriptor as a binary condition. I really don't want to have to do one operation per segment; this is less efficient than the current situation. Similarly, it is a feature that with one call I can find out that there is work to do on multiple file descriptors. In short, the X server does a select, and then loops across all the file descriptors with work to do before doing another select: the system call overhead gets amortized across multiple clients and buffers received from the client. As the server gets busier, it is more and more likely that there is more than one client with work to do, and/or multiple TCP segments have arrived to process (in the common single-client-is-busy case). So we make the system call less and less often as a fraction of work done. This has the happy consequence that the select-caused overhead DROPS as a fraction of total work as the X server gets busier, and X is most efficient at the point in time you care the most: when you have the most work to do. The system call is returning more information each time it is called, and some of that information is aggregated as well (additional data arriving). It doesn't practically matter how efficient the X server is when you aren't busy, after all. This aggregation property is therefore important, and there needs to be some way to achieve this, IMHO. Web servers often have similar behavior, though since most current HTTP clients don't implement streaming behavior, the benefit is currently much lower (would that HTTP clients start driving HTTP servers the way the HTTP/1.1 protocol allows... Sigh...). Right now, scaling to large numbers of descriptors is most urgent for big web servers. So I want an interface in which I can get as many events as possible at once, and one in which the events themselves can have appropriate aggregation behavior. It isn't quite clear to me if the proposed interface would have this property. As I said in early talks about X: "X is an exercise in avoiding system calls" - Jim Gettys -- Jim Gettys Technology and Corporate Development Compaq Computer Corporation [EMAIL PROTECTED]
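The amortization Jim describes is the classic select-then-sweep loop; a sketch (helper names hypothetical, not X server code):

    /* Sketch of the select-then-sweep pattern; not actual X server code. */
    #include <stddef.h>
    #include <sys/select.h>

    extern fd_set interest;                   /* all client fds, kept current */
    extern int    maxfd;
    extern void   process_all_input(int fd);  /* hypothetical: drains every
                                                 buffered request on fd */

    void server_loop(void)
    {
        for (;;) {
            fd_set ready = interest;          /* select() scribbles on its arg */
            if (select(maxfd + 1, &ready, NULL, NULL, NULL) <= 0)
                continue;
            /* One system call, then sweep every ready client: the cost of
             * select() is amortized over all clients and over all the
             * segments that arrived while we were working. */
            for (int fd = 0; fd <= maxfd; fd++)
                if (FD_ISSET(fd, &ready))
                    process_all_input(fd);
        }
    }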
Re: Linux's implementation of poll() not scalable?
On Thu, 26 Oct 2000, Dan Kegel wrote: > > With level-triggered interfaces like poll(), your chances > of writing a correctly functioning program are higher because you > can throw events away (on purpose or accidentally) with no consequences; > the next time around the loop, poll() will happily tell you the current > state of all the fd's. Agreed. However, we also need to remember what got us to this discussion in the first place. One of the reasons why poll() is such a piggy interface is exactly because it tries to be "nice" to the programmer. I'd much rather have an event interface that is documented to be edge- triggered and is really _lightweight_, than have another interface that starts out with some piggy features. I do understand that level to some degree is "nicer", but everybody pretty much agrees that apart from requiring more care, edge-triggered certainly does everything the level-triggered interfaces do. For example, if you want to only partially handle an event (one of the more valid arguments I've seen, although I do not agree with it actually being all that common or important a thing to do), the edge-triggered interface _does_ allow for that. It's fairly easy, in fact: you just re-bind the thing. (My suggested "bind()" interface would be to just always add a newly bound fd to the event queue, but I would not object to a "one-time test for active" kind of "bind()" interface either - which would basically allow for a poll() interface - and the existing Linux internal "poll()" implementation actually already allows for exactly this so it wouldn't even add any complexity). > With edge-triggered interfaces, the programmer must be much more careful > to avoid ever dropping a single event; if he does, he's screwed. > This gives rise to complicated code in apps to remember the current > state of fd's in those cases where the programmer has to drop events. No, the "re-bind()" approach works very simply, and means that the overhead of testing whether the event is still active is not a generic thing that _always_ has to be done, but something where the application can basically give the kernel the information that "this time we're leaving the event possibly half-done, please re-test now". Advantage: performance again. The common case doesn't take the performance hit. Linus
Re: Linux's implementation of poll() not scalable?
"Eric W. Biederman" wrote: > > Dan Kegel <[EMAIL PROTECTED]> writes: > > It's harder to write correct programs that use edge-triggered events. > > Huh? The race between when an event is reported, and when you take action > on it effectively means all events are edge triggered. Nope. With any of these interfaces, whether level or edge, the app must treat all the events as hints, and be prepared for them to be wrong. That's why code that uses poll() tends to use nonblocking sockets. No matter what you do, you can't get away from that. Consider accepting a connection. Poll or whatever tells you a listening socket is ready for you to call accept(). Before you do, the other end sends an RST. Consequence: app must be prepared for a readiness event to be wrong. cf. http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0616.html This is orthogonal to the question of whether edge or level triggering is easier to write code for, I think. > So making the interface clearly edge triggered seems to be a win for > correctness. With level-triggered interfaces like poll(), your chances of writing a correctly functioning program are higher because you can throw events away (on purpose or accidentally) with no consequences; the next time around the loop, poll() will happily tell you the current state of all the fd's. With edge-triggered interfaces, the programmer must be much more careful to avoid ever dropping a single event; if he does, he's screwed. This gives rise to complicated code in apps to remember the current state of fd's in those cases where the programmer has to drop events. And there are many cases like that; see http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0529.html and http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0592.html for examples. Better to let apps ask the kernel for the current state of each socket if they want, is what I say. This reduces the amount of code and effort needed to write many apps, *and makes migrating legacy apps to high-performance interfaces easier*. For new apps that are willing to maintain the state themselves, by all means, provide an edge-oriented interface, too. It's probably better if your code is manly enough to handle it. - Dan
Re: Linux's implementation of poll() not scalable?
Dan Kegel <[EMAIL PROTECTED]> writes: > It's harder to write correct programs that use edge-triggered events. Huh? The race between when an event is reported, and when you take action on it effectively means all events are edge triggered. So making the interface clearly edge triggered seems to be a win for correctness. Eric
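Either way, the race both posts describe is handled the same way in practice: act non-blockingly and treat readiness as a hint. A sketch of the defensive accept() Dan mentions (handle_new_connection is a hypothetical helper):

    /* Readiness is only a hint: if the peer sent an RST between the
     * readiness report and our accept(), we just see an error and move
     * on.  Assumes listen_fd was set O_NONBLOCK.  Sketch only. */
    #include <errno.h>
    #include <stddef.h>
    #include <sys/socket.h>

    extern void handle_new_connection(int fd);   /* hypothetical */

    void on_listen_ready(int listen_fd)
    {
        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);
            if (conn >= 0) {
                handle_new_connection(conn);
                continue;                   /* maybe more in the backlog */
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;                      /* hint was stale, or drained */
            if (errno == ECONNABORTED || errno == EINTR)
                continue;                   /* aborted connection: harmless */
            break;                          /* real error: caller decides */
        }
    }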
Re: Linux's implementation of poll() not scalable?
Helge Hafting wrote: > > With poll(), it was *not a bug* for the user code to drop events; with > > your proposed interface, it *is a bug* for the user code to drop events. > > I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted > > incorrectly that your interface "has the same semantics as poll from > > the event perspective". > > Don't worry - existing interfaces will be around for existing > applications. I believe Linus is trying to engineer a new and better > interface that will make a difference - (server) programs written to use > the new stuff will show better performance and use up less kernel time. > > The semantics will be different, so only new programs will benefit. > However, the benefits may be so great that people will want to rewrite > programs to get them. It's harder to write correct programs that use edge-triggered events. Level-triggered events are inherently easier to use. If our high-performance interface only supports edge-triggered events, people who write high performance apps may well migrate to FreeBSD, where there is a lovely high performance interface that offers either level or edge triggered events, whichever you prefer. > Seems Linus doesn't want to do this no matter what you come up with. You > are free to try of course... Or free to switch to FreeBSD, if that's the only OS which has a high-performance poll replacement that I can get my existing codebase to work with. - Dan
Re: Linux's implementation of poll() not scalable?
On Tue, Oct 24, 2000 at 04:12:38PM -0700, Dan Kegel wrote: > With poll(), it was *not a bug* for the user code to drop events; with > your proposed interface, it *is a bug* for the user code to drop events. > I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted > incorrectly that your interface "has the same semantics as poll from > the event perspective". I missed this because I've never written anything that drops or forgets events and didn't think about it. Most programs will read() until EOF is returned and write() until EAGAIN is returned with non-blocking sockets. Is there any reason to ignore events other than to slow down response to some events in favor of others? I don't see why this is a problem as this interface _isn't_ replacing select or poll, so it shouldn't matter for existing programs that aren't converted to use the new interface. In any case, I think I would prefer that the kernel be optimized for the common case and leave any strange processing up to userspace so that the majority of programs which don't need this special case can run as fast as possible. Besides, it wouldn't be difficult for a program to stack up a list of events, even in the same structure as it would get from the kernel, so that it can process them later. At least then this data would be in swappable memory. Heck, even from an efficiency perspective, it would be faster for userspace to store the data as it wouldn't keep getting it returned from a syscall each time... Am I missing something else? Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. ]
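The read-until-EAGAIN discipline Simon mentions, as a sketch (consume() is a hypothetical callback):

    /* Drain a non-blocking fd completely; with this discipline a dropped
     * or duplicated readiness hint is harmless, because the next pass
     * reads whatever is actually buffered.  Sketch only. */
    #include <errno.h>
    #include <unistd.h>

    extern void consume(const char *buf, ssize_t n);   /* hypothetical */

    int drain_fd(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0)  { consume(buf, n); continue; }
            if (n == 0)  return 0;                     /* EOF */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 1;                              /* fully drained */
            if (errno == EINTR) continue;
            return -1;                                 /* real error */
        }
    }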
Re: Linux's implementation of poll() not scalable?
[EMAIL PROTECTED] (Linus Torvalds) wrote on 23.10.00 in <[EMAIL PROTECTED]>: > actually inform about the events. The way to do this simply is to limit it > in very clear ways, the most notable one being simply that there is only > one event queue per process (or rather, per "struct files_struct" - so > threads would automatically share the event queue). This keeps the While the rest looks fine, I suspect this one is a show-stopper. IMO, the ability to wait for separate event sets in separate threads is a MUST. In a single-threaded program, it would probably still be useful in a number of situations. Not that it's all that difficult to expand your model to do multiple queues. MfG Kai
Re: Linux's implementation of poll() not scalable?
> > > > On Tue, 24 Oct 2000, Evan Jeffrey wrote: > > > > > Multiple event queues are bad, because it completely breaks the notion of > > > event-driven programming. How do you want to listen to them all? You can't. > > > You can only listen to one event queue at a time - unless you create some > > > > You can listen to one event queue per thread. > > Oh, I agree. > > And I think something like CLONE_EVENTS would be fine - and decide > yourself what kind of threads you want (do you want indistinguishable > "anonymous" threads like apache, or do you want threads that handle > separate event queues). Or you might have a mixture of the two - for web > serving the "accept()" event list might be a separate thing with a few > threads doing that, while "worker threads" handle the actual IO.. All this is fine; but why force people to break up their apps into threads? [ from an earlier message ] > Trust me. Multiple event queues are idiotic. Anybody who _designs_ for > multiple event queues has some serious problems. So, are you now saying that "serious problems" are no longer to be had, provided that an app is chopped into threads? If you have one queue per process, multiple queues could be implemented in a straightforward manner. Add a "queue" fd type and a system call to create one. Allow any queuable fd (such as a socket) to be queued on the one implicit queue and also to be associated with zero or one explicitly-created queues. This would satisfy library or other code that wanted for whatever reason to have visibility into all fds with pending events. It would also give apps the ability to break up work amongst disjoint portions without forcing the use of multiple threads. This seems reasonable to implement. Or am I missing something? gid
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > The point where they disagree is when the event gets removed from the event > queue. For edge triggered, this one is trivial: when a get_events() thing > happens and moves it into user land. This is basically a one-liner, and it > is local to get_events() and needs absolutely no help from anybody else. > So obviously event removal is _very_ simple for edge-triggered events - > the INTACK basically removes the event (and also re-arms the trigger > logic: which is different from most interrupt controllers, so the analogy > falls down here). And IMHO here's a problem. The events are no longer events. They are just hints saying: after the previous get_events() something has happened. You don't know if you've already handled this event. There's no synchronization between what the app does and the triggering of 'hints'. For example your waitpid-loop: you get the event, start the waitpid-loop. While processing, another process dies. You handle it too (still in the loop). But a new 'hint' has already been registered. So on the next get_event you'll be notified again. I just hope every event-generator has a WNOHANG flag... It could even be possible that you are unable to perform some actions without triggering hints despite the fact that the conditions will already be gone before the next get_event. May generate a lot of bogus hints. At least the current semantic of for example "POLL_IN on fd was signaled so I may read without blocking" gets lost. Maybe (I don't know, kernel-wise) it makes sense to check in the kernel if the events to be returned to userspace are still valid. The user space has to do it anyway... But that way you get a more level-based design ;) Another thing: while toying with cooperative userspace multithreading I found it much more versatile to have a req_type/req_data tuple in the request structure (ie READ/, TIMEOUT/, WAKEUP/). Ciao, ET.
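The WNOHANG loop ET is alluding to, sketched (reap() is a hypothetical bookkeeping helper):

    /* Because "child exited" is only a hint, the handler reaps with
     * WNOHANG until nothing is left; a later redundant hint then finds
     * no children to reap and is harmless.  Sketch only. */
    #include <errno.h>
    #include <sys/wait.h>

    extern void reap(pid_t pid, int status);   /* hypothetical */

    void on_child_event(void)
    {
        int status;
        pid_t pid;

        while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
            reap(pid, status);
        /* pid == 0: children remain but none exited yet.
         * pid < 0 && errno == ECHILD: no children at all.
         * Either way, a stale hint costs one cheap syscall. */
    }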
Re: Linux's implementation of poll() not scalable?
Linus Torvalds <[EMAIL PROTECTED]> writes: > bind_event(sock, POLLIN, NULL, accept_fn); [...] > (In fact, you might as well move the event array completely inside > "get_event()", because nobody would be supposed to look at the raw array > any more. So the "get_event()" interface would be even simpler: just a > timeout, nothing more. Talk about simple programming. It might be good to have bind_event return any previous accept_fn that's there, in case a library or something wants to chain a handler rather than clobbering what's there. Or maybe have a way to query, although this seems fraught with danger if another thread binds between a query and a bind. - M -- Mark "Monty" Montague | [EMAIL PROTECTED] | I don't do Windows(tm) I'm dubious about any company whose assets can be destroyed by rm -rf http://www.gg.caltech.edu/~monty/monty.shtml X-PGP-Fingerprint: E4 EA 6D B1 82 46 DB A1 B0 FF 60 B9 F9 5D 5C F7
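If bind_event() did return the previous handler, chaining might look like this; the whole interface here is Linus's proposal plus Mark's suggested return value, so everything is hypothetical:

    /* Hypothetical: bind_event_fn() is assumed to install a handler and
     * return whatever handler was previously bound (Mark's suggestion).
     * Per-fd bookkeeping is omitted; one saved link for brevity. */
    struct event;
    typedef void (*event_fn_t)(struct event *ev);

    extern event_fn_t bind_event_fn(int fd, unsigned long mask,
                                    event_fn_t fn);      /* assumed */

    static event_fn_t next_handler;          /* saved link in the chain */

    static void library_handler(struct event *ev)
    {
        /* ... the library's own processing ... */
        if (next_handler)
            next_handler(ev);                /* pass the event along */
    }

    void library_attach(int fd, unsigned long mask)
    {
        next_handler = bind_event_fn(fd, mask, library_handler);
    }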
Re: Linux's implementation of poll() not scalable?
Followup to: <[EMAIL PROTECTED]> By author: Linus Torvalds <[EMAIL PROTECTED]> In newsgroup: linux.dev.kernel > > Oh, I agree. > > And I think something like CLONE_EVENTS would be fine - and decide > yourself what kind of threads you want (do you want indistinguishable > "anonymous" threads like apache, or do you want threads that handle > separate event queues). Or you might have a mixture of the two - for web > serving the "accept()" event list might be a separate thing with a few > threads doing that, while "worker threads" handle the actual IO.. > > But if you want to have some event queue ID, I just wonder how you'd set > it up sanely without tons of complications.. > How about handling the event queue ID as an event mask internally? -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > But user code currently written for poll() has the luxury of dropping > > events because poll() will happily report on the current readiness of > > the socket every time. /dev/poll is level-triggered because it's trying > > to make conversion of poll()-based code easy. With your scheme, > > whatever user code is receiving the edges better darn well do something > > about them, because it's only going to get them once. > > Oh, I agree. I'm not saying that my approach magically fixes bugs in user > space ;) With poll(), it was *not a bug* for the user code to drop events; with your proposed interface, it *is a bug* for the user code to drop events. I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted incorrectly that your interface "has the same semantics as poll from the event perspective". [ Linus explains why the level-triggered style is difficult ] > - the "proactively remove events when the thing that triggered them goes > away" approach means that each anti-event (like a read that empties the > buffer) needs to undo its events, but it also needs to be careful that > it doesn't undo combined events, and it needs to be very careful about > races (new packet coming in), so the proactive remove actually ends up > being less than trivial - and in a performance-critical section. That's the approach I was thinking of. It should be about as difficult as continuously maintaining an integer for each fd which matches what poll() with a zero timeout on that fd would return. (The difference is that as the bits of the integer change, you would move the fd's record in and out of the ready list.) How hard is that? > Implement it, and see. I bet you'll find that it gets really interesting > when you have concurrent accesses to the same file descriptor etc. Sure, but lots of the kernel is really interesting. - Dan
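The bookkeeping Dan has in mind might be shaped like this (a sketch; the ready_list_* helpers are assumed, and the locking a real version needs is omitted):

    /* Sketch: keep a per-fd mask equal to what poll() with a zero
     * timeout would report, and move the record on/off the ready list
     * as the mask changes.  Helper names assumed; locking omitted. */
    struct fd_state {
        unsigned long level;       /* current poll()-style readiness bits */
        int on_ready_list;
    };

    extern void ready_list_add(struct fd_state *f);   /* assumed helpers */
    extern void ready_list_del(struct fd_state *f);

    /* Called on every state change: data arrival sets bits, a read that
     * empties the buffer clears them, and so on. */
    void update_level(struct fd_state *f, unsigned long new_level)
    {
        f->level = new_level;
        if (new_level && !f->on_ready_list) {
            ready_list_add(f);
            f->on_ready_list = 1;
        } else if (!new_level && f->on_ready_list) {
            ready_list_del(f);
            f->on_ready_list = 0;
        }
    }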
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Evan Jeffrey wrote: > > > Multiple event queues are bad, because it completely breaks the notion of > > event-driven programming. How do you want to listen to them all? You can't. > > You can only listen to one event queue at a time - unless you create some > > You can listen to one event queue per thread. Oh, I agree. And I think something like CLONE_EVENTS would be fine - and decide yourself what kind of threads you want (do you want indistinguishable "anonymous" threads like apache, or do you want threads that handle separate event queues). Or you might have a mixture of the two - for web serving the "accept()" event list might be a separate thing with a few threads doing that, while "worker threads" handle the actual IO.. But if you want to have some event queue ID, I just wonder how you'd set it up sanely without tons of complications.. Linus
Re: Linux's implementation of poll() not scalable?
> Multiple event queues are bad, because it completely breaks the notion of > event-driven programming. How do you want to listen to them all? You can't. > You can only listen to one event queue at a time - unless you create some You can listen to one event queue per thread. Maybe in the case people always bring up (Apache handling 1 concurrent connections), threads are indistinguishable, but for many applications different threads or thread groups will be assigned to different tasks. An ORB ought to be allowed to clone off an event handler thread that doesn't have to worry about receiving X events, which might be really slow to handle. (Or vice versa, handling a mouse_move event shouldn't have to wait for a CORBA upcall that recalculates a spreadsheet). Also, what happens if someone does a clone with CLONE_FILES but no CLONE_VM? Now the event handlers and opaque data point to things that may not even be valid in this VM. I see a couple of ways to do this that don't give in much from the "One event handler list per process" idea. Add a CLONE_EVENTS flag to clone. This is rather nasty, since it violates orthogonality (since a struct event references both VM and fd objects), and makes it difficult for pthreads users to choose whether to share event handlers. The second is to tag events with a threadid, and allow add_event to specify a flag saying "only deliver to this thread" and get_events to say "wait for this thread's events, or any thread's events". The third is to just declare "event handlers are shared by any processes sharing both a VM and FILES". This seems rather silly (since it would still have to be independent in the task_struct, but with no user space control over it), and doesn't help the poor SOB who has to handle it. > ... unless you create some > kind of "hierarchy" of event queues, where you have one event queue that > you wait on that contains the _other_ event queues, so that you can tell > which one has anything pending.. Agreed. The only reason to do this is if your event queue sucks as bad as poll() and you have to do so for efficiency reasons... (Like people do with 10 threads each handling 1/10 of the fds). Evan --- Fear is the mind killer. -- Frank Herbert, "Dune"
Re: Linux's implementation of poll() not scalable?
> Shouldn't there also be a way to add non-filedescriptor based events > into this, such as "child exited" or "signal caught" or shm things? Waiting on pthreads condition variables, POSIX message queues, and semaphores (as well as fd's) at the same time would *rock*... Unifying all these "waitable objects" would be tremendously helpful to fully exploit the "library transparency" advantage that Linus brought up. Some libraries might want to wait on things that are not file descriptors... Regards, Dan
Re: Linux's implementation of poll() not scalable?
In article <[EMAIL PROTECTED]>, Linus Torvalds <[EMAIL PROTECTED]> wrote: > struct event { > unsigned long id; /* file descriptor ID the event is on */ > unsigned long event; /* bitmask of active events */ > }; > int bind_event(int fd, struct event *event); Shouldn't there also be a way to add non-filedescriptor based events into this, such as "child exited" or "signal caught" or shm things? Or should there simply be ways to make those events available through filedescriptors anyway? Then you'd get /dev/events - open it, tell it what kind of event you want to bind to the fd, pass it to bind_event, read a struct event_info from it when there's something to read. Just a random thought. Ignore if not appropriate. Mike. -- Q: How many ears do Trekkies have? A: Three; the left ear, the right ear, and the final front ear.
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > * it doesn't add extra syscalls > > Sure it does. > > What do you think ioctl's are? As I explained a few lines down from where you stopped quoting (and probably stopped reading), the ioctl() use is just an artifact of Solaris's icky implementation. It could and should use read(). Not an extra syscall. > Sure, you can have multiple "poll lists" by opening /dev/poll multiple > times, and you seem to see this as a great feature. I'm telling you that > multiple event queues are a horrible disaster, and it's a perfect example > of what an engineer with the "more is better" mindset comes up with. It has nothing to do with how exciting of a feature it is. It's a matter of "hey, if you implement it with /dev/poll, you get multiple queues FOR FREE". Those sorts of things are an indication that you've got the implementation right. The fact that in your scheme multiple queues would be extra work is just residue from the fact that you've got the idea wrong. And how do you propose dealing with multi-threaded programs? Obviously for SMP you'd want multiple threads to be able to grab events, so I guess the "one great queue" will be shared among all the threads. But, suppose I want to dedicate a thread to coping with a specific set of events. Hmm... guess you'll need a CLONE_EVENTQUEUE flag to support that. With /dev/poll none of this foolishness is needed - you just open two of them and let the threads do what they want. > Multiple event queues are bad, because it completely breaks the notion of > event-driven programming. How do you want to listen to them all? You can't. Explained in detail in my mail. > You can only listen to one event queue at a time - unless you create some > kind of "hierarchy" of event queues, where you have one event queue that > you wait on that contains the _other_ event queues, so that you can tell > which one has anything pending.. No, event queues of event queues are messy in implementation. There's nothing fundamentally wrong with them, but you need to make sure the graph stays acyclic, which is just an implementation hassle. Not enough gain for too much code (unless someone has a beautiful algorithm). However, it's perfectly reasonable to use select() or poll() on two event queues. As I explained in my email, this is needed if you're already using a library that does its own event queue based on poll or select. > Basically, tastes differ. Everything you seem to see as an advantage of > /dev/poll, I see as a horrible kludge. Well, everything I see indicates that /dev/poll would be the easier of the two interfaces to actually implement. All the added flexibility is just gravy. -Mitch
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Abramo Bagnara wrote: > Linus Torvalds wrote: > > > > > > struct event { > > int fd; > > unsigned long mask; > > void *opaque; > > void (*event_fn)(int fd, unsigned long mask, void *opaque); > > My experience says that: > > unsigned long rmask; > void (*event_fn)(struct event *event); > > is a far better solution (more type safe, more descriptive). You're probably right. It certainly makes it easier to add fields later on if we'd want to extend the ID part without having to change old users.. Linus
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Dan Kegel wrote: > > But user code currently written for poll() has the luxury of dropping > events because poll() will happily report on the current readiness of > the socket every time. /dev/poll is level-triggered because it's trying > to make conversion of poll()-based code easy. With your scheme, > whatever user code is receiving the edges better darn well do something > about them, because it's only going to get them once. Oh, I agree. I'm not saying that my approach magically fixes bugs in user space ;) > > The BSD kevent paper goes on about "level and edge triggered" and it > > becomes a big thing for them, and they selected level-triggered events as > > if it made any difference. And sure - it _does_ make a difference, but the > > only difference is in how hard it is to implement, and level-triggered is > > noticeably harder. > > I don't see why edge triggered is that much harder. All it adds is > a layer which receives the edges and moves fds back and forth between > a 'ready' list and a 'not ready' list. Easy as pie. Not true. For example, if you're truly level-triggered, and you have a socket that gets data, the event moves onto the event queue. So far so fine: both edge and level agree about this one. The point where they disagree is when the event gets removed from the event queue. For edge triggered, this one is trivial: when a get_events() thing happens and moves it into user land. This is basically a one-liner, and it is local to get_events() and needs absolutely no help from anybody else. So obviously event removal is _very_ simple for edge-triggered events - the INTACK basically removes the event (and also re-arms the trigger logic: which is different from most interrupt controllers, so the analogy falls down here). For level, the thing is not as easy at ALL. Suddenly removal becomes a big issue, and needs help from the actual driver. You can do it two ways: calling down to the driver when you remove (to see if the event should be dismissed or not once it has been read) or have the driver pro-actively remove the event whenever a read() happens (or whatever it is that undoes the event). Both are actually fairly hard. Much harder than they sound. For different reasons. - the callback approach at get_events() time sounds trivial, but actually has two problems: cache footprint for "get_events()" grows a _lot_ (because the events are likely to be spread out a lot if there are a lot of them pending, so you don't get a nice tight inner loop at all), and you get "double events" - by the time the event first happens, it will still be active, so we cannot actually remove it at that time (there is still data to be read - and the event doesn't go away until we read it) so we'll get the event _again_, and on the next get_events() it will notice that the event was bogus, and remove it (and we can optimize it away from reporting it to user land at that point, so only the kernel needs to look at it twice and do two callbacks) - the "proactively remove events when the thing that triggered them goes away" approach means that each anti-event (like a read that empties the buffer) needs to undo its events, but it also needs to be careful that it doesn't undo combined events, and it needs to be very careful about races (new packet coming in), so the proactive remove actually ends up being less than trivial - and in a performance-critical section. Now, compare that to a one-liner that just does a "list_del(&event->list)" as it copies over the event to user mode. 
Wouldn't you say that the edge-triggered version is simpler? > > The reason "edge-triggered" (ie only when an event changes) is preferable > > is that it's MUCH simpler, and means that the "get_events()" system call > > doesn't need to actually understand what the events mean at all. > > Not much understanding is required on the part of the edge-to-level filter. Implement it, and see. I bet you'll find that it gets really interesting when you have concurrent accesses to the same file descriptor etc. Linus
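For comparison, the edge-triggered copy-out Linus describes would be roughly this (a kernel-flavored sketch modeled on the 2.4 list API; struct event_entry and the locking a real version needs are invented/omitted):

    /* Kernel-flavored sketch, not actual kernel source.  The removal
     * really is the single list_del(); the bind record stays allocated
     * and its trigger logic re-arms on the next state change. */
    #include <linux/list.h>
    #include <asm/uaccess.h>        /* copy_to_user, 2.4-era location */

    struct event { unsigned long id; unsigned long event; };

    struct event_entry {            /* hypothetical per-bind record */
        struct list_head list;
        struct event event;
    };

    /* Locking against concurrent queuers omitted for brevity. */
    int get_events_core(struct list_head *queue, struct event *ubuf, int maxnr)
    {
        int n = 0;

        while (n < maxnr && !list_empty(queue)) {
            struct event_entry *ev =
                list_entry(queue->next, struct event_entry, list);

            list_del(&ev->list);    /* the entire removal story */
            if (copy_to_user(&ubuf[n], &ev->event, sizeof(ev->event)))
                return n ? n : -EFAULT;
            n++;
        }
        return n;
    }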
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > * Do you get an event whenever an fd is ready for something, or > > only when its readiness changes? (Presumably whenever an fd is ready for > something?) > > Only when its readiness changes - probably with the addition that it would > simplify things that a new event always (unconditionally) gets added to > the event queue. > > Note that there are no races in this implementation - there is no way for > events to get lost, even trivially seen in SMP/threaded environments. > That's important. But user code currently written for poll() has the luxury of dropping events because poll() will happily report on the current readiness of the socket every time. /dev/poll is level-triggered because it's trying to make conversion of poll()-based code easy. With your scheme, whatever user code is receiving the edges better darn well do something about them, because it's only going to get them once. > The BSD kevent paper goes on about "level and edge triggered" and it > becomes a big thing for them, and they selected level-triggered events as > if it made any difference. And sure - it _does_ make a difference, but the > only difference is in how hard it is to implement, and level-triggered is > noticeably harder. I don't see why edge triggered is that much harder. All it adds is a layer which receives the edges and moves fds back and forth between a 'ready' list and a 'not ready' list. Easy as pie. > The reason "edge-triggered" (ie only when an event changes) is preferable > is that it's MUCH simpler, and means that the "get_events()" system call > doesn't need to actually understand what the events mean at all. Not much understanding is required on the part of the edge-to-level filter. That said, the edge-to-level filter can be in userspace, and the kernel part can be just edge-triggered -- as long as the kernel delivers the downward edges as well as the upward edges. We could do that by adding NOTPOLLIN, NOTPOLLOUT, etc., i.e. events that are delivered when a socket becomes UN-ready for reading. - Dan
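Dan's userspace edge-to-level filter could be as small as this sketch; NOTPOLLIN here is the hypothetical falling-edge event he proposes, and the ready_set_* helpers are assumed:

    /* Userspace edge-to-level filter: consume both edges, maintain a
     * ready set, and let a level-style wrapper report whatever is
     * currently in the set.  NOTPOLLIN is hypothetical.  Sketch only. */
    #include <poll.h>

    #define NOTPOLLIN 0x10000       /* hypothetical "became unreadable" */

    extern void ready_set_add(int fd);   /* assumed set helpers */
    extern void ready_set_del(int fd);

    void on_edge(int fd, unsigned long mask)
    {
        if (mask & POLLIN)
            ready_set_add(fd);      /* rising edge: fd now readable */
        if (mask & NOTPOLLIN)
            ready_set_del(fd);      /* falling edge: no longer readable */
    }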
Re: Linux's implementation of poll() not scalable?
There is only one thing I don't understand about this... why can't we re-implement the poll() implementation of Linux instead of introducing another system call? If I understood Linus correctly, what he is saying is that the bind_event system call is needed to give the kernel a hint that the application is interested in a certain event associated with a file descriptor. If the kernel kept such an event queue per process anyway (as soon as the process opened the file/socket)... then the poll implementation would be exactly like the get_events system call. What is wrong with this? Thanks Lee On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > What is your favourite interface then? > > I suspect a good interface that can easily be done efficiently would > basically be something where the user _does_ do the equivalent of a > read-only mmap() of poll entries - and explicit and controlled > "add_entry()" and "remove_entry()" controls, so that the kernel can > maintain the cache without playing tricks. Actually, forget the mmap, it's not needed. Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have. First, let's see what is so nice about "select()" and "poll()". They do have one _huge_ advantage, which is why you want to fall back on poll() once the RT signal interface stops working. What is that? Basically, RT signals or any kind of event queue has a major fundamental queuing theory problem: if you have events happening really quickly, the events pile up, and queuing theory tells you that as you start having queueing problems, your latency increases, which in turn tends to mean that later events are even more likely to queue up, and you end up in a nasty meltdown scenario where your queues get longer and longer. This is why RT signals suck so badly as a generic interface - clearly we cannot keep sending RT signals forever, because we'd run out of memory just keeping the signal queue information around. Neither poll() nor select() have this problem: they don't get more expensive as you have more and more events - their expense is the number of file descriptors, not the number of events per se. In fact, both poll() and select() tend to perform _better_ when you have pending events, as they are both amenable to optimizations when there is no need for waiting, and scanning the arrays can use early-out semantics. So sticky arrays of events are good, while queues are bad. Let's take that as one of the fundamentals. So why do people still like RT signals? They do have one advantage, which is that you do NOT have that silly array traversal when there is nothing to do. Basically, the RT signals kind of approach is really good for the cases where select() and poll() suck: no need to traverse mostly empty and non-changing arrays all the time. It boils down to one very simple rule: dense arrays of sticky status information is good. So let's design a good interface for a dense array. Basically, the perfect interface for events would be struct event { unsigned long id; /* file descriptor ID the event is on */ unsigned long event; /* bitmask of active events */ }; int get_events(struct event * event_array, int maxnr, struct timeval *tmout); where you say "I want an array of pending events, and I have an array you can fill with up to 'maxnr' events - and if you have no events for me, please sleep until you get one, or until 'tmout'". The above looks like a _really_ simple interface to me. 
Much simpler than either select() or poll(). Agreed? Now, we still need to inform the kernel of what kind of events we want, ie the "binding" of events. The most straightforward way to do that is to just do a simple "bind_event()" system call: int bind_event(int fd, struct event *event); which basically says: I'm interested in the events in "event" on the file descriptor "fd". The way to stop being interested in events is to just set the event bitmask to zero. Now, the perfect interface would be the above. Nothing more. Nothing fancy, nothing complicated. Only really simple stuff. Remember the old rule: "keep it simple, stupid". The really nice part of the above is that it's trivial to implement. It's about 50 lines of code, plus some simple logic to various drivers etc to actually inform about the events. The way to do this simply is to limit it in very clear ways, the most notable one being simply that there is only one event queue per process (or rather, per "struct files_struct" - so threads would automatically share the event queue). This keeps the implementation simple, but it's also what keeps the interfaces simple: no queue ID's to pass around etc. Implementation is straightforward: the event queue basically consists of - a queue head in "struct files_struct", initially empty. - doing a "bind_event()" basically adds a fasync entry to the file structure, but rather than cause a signal, it just looks a
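For a sense of how the proposal would be used, here is a sketch of an application event loop on top of it; bind_event() and get_events() are the proposed calls from the message above and do not actually exist, and handle() is a hypothetical dispatcher:

    /* Usage sketch only: the two syscalls are Linus's proposal, not a
     * real interface.  handle() is a hypothetical dispatcher. */
    #include <poll.h>
    #include <stddef.h>
    #include <sys/time.h>

    struct event {
        unsigned long id;       /* file descriptor ID the event is on */
        unsigned long event;    /* bitmask of active events */
    };

    extern int bind_event(int fd, struct event *event);
    extern int get_events(struct event *event_array, int maxnr,
                          struct timeval *tmout);

    extern void handle(unsigned long fd, unsigned long mask);

    void event_loop(int listen_fd)
    {
        struct event ev = { listen_fd, POLLIN };
        bind_event(listen_fd, &ev);     /* declare interest exactly once */

        struct event batch[64];
        for (;;) {
            /* Sleep until something happens; at most one event per bound
             * fd, so 'batch' bounds memory use, not the packet rate. */
            int n = get_events(batch, 64, NULL);
            for (int i = 0; i < n; i++)
                handle(batch[i].id, batch[i].event);
        }
    }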
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> struct event {
>         int fd;
>         unsigned long mask;
>         void *opaque;
>         void (*event_fn)(int fd, unsigned long mask, void *opaque);

My experience says that:

        unsigned long rmask;
        void (*event_fn)(struct event *event);

is a far better solution (more type safe, more descriptive).

-- Abramo Bagnara mailto:[EMAIL PROTECTED] Opera Unica Via Emilia Interna, 140 Phone: +39.0546.656023 48014 Castel Bolognese (RA) - Italy Fax: +39.0546.656023 ALSA project is http://www.alsa-project.org sponsored by SuSE Linux http://www.suse.com

It sounds good!
Re: Linux's implementation of poll() not scalable?
On Tue, Oct 24, 2000 at 10:03:04AM -0700, Linus Torvalds wrote:
> Basically, with get_events(), there is a maximum of one event per "bind".
> And the memory for that is statically allocated at bind_event() time.
>...
> But you'd be doing so in a controlled manner: the memory use wouldn't go
> up just because there is a sudden influx of 50000 packets. So it scales
> with load by virtue of simply not _caring_ about the load - it only cares
> about the number of fd's you're waiting on.

Nice. I like this. It would be easy for existing userspace code to start using this interface as it has the same semantics as select/poll from the event perspective. But it would make things even easier, as the bind would follow the life of the descriptor and thus wouldn't need to be "requeued" before every get_events call, so that part of userspace code could just be ripped out^W^W disabled and kept only for portability. In most of the daemons I have written, I've ended up using memcpy() to keep a non-scribbled-over copy of the fdsets around so I don't have to walk data structures and requeue fds on every loop for select()...nasty.

Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. ]
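For reference, the select() idiom Simon describes looks roughly like this; master_set and max_fd are hypothetical bookkeeping kept up to date as connections come and go, and it is exactly this per-loop copying that a persistent in-kernel binding would eliminate:

	#include <string.h>
	#include <sys/select.h>

	static fd_set master_set;   /* the non-scribbled-over copy */
	static int max_fd;

	int wait_for_io(struct timeval *tmout)
	{
		fd_set ready;
		/* select() scribbles over its arguments, so hand it a scratch copy */
		memcpy(&ready, &master_set, sizeof(ready));
		return select(max_fd + 1, &ready, NULL, NULL, tmout);
	}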
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Simon Kirby wrote:
>
> However, isn't there already something like this, albeit maybe without
> the ability to return multiple events at a time? When discussing
> select/poll on IRC a while ago with sct, sct said:
>
> Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
> SIGIO as usual.

[ siginfo and RT signals, but using "sigwaitinfo()" instead of actual signal handlers ]

The problem with RT signals (whether you use sigwaitinfo() or signal handling overhead) is that they don't scale to lots of events. The memory pressure for keeping track of millions of siginfo structures is more than any system can handle - so all RT signals implementations will have to eventually fall back on something else. Which makes for really hard to debug bugs, and complex programming (you essentially end up having to duplicate all of the event handling, and the "fallback" case has the added difficulty that you won't even see it except under heavy load).

> Also, what is different in your above interface that prevents it from
> being able to queue up too many events?

Basically, with get_events(), there is a maximum of one event per "bind". And the memory for that is statically allocated at bind_event() time.

> I guess the structure is only
> sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
> FDs on x86, but I don't see how the other method would be significantly
> different. The kernel would have to store the queued events still,
> surely...

Oh, the bind information would be more like 32 bytes per bind, so assuming you have 20000 fd's you're waiting for, you'd certainly be using a lot of memory. Still less than the quite complex SIGIO structures, though. But you'd be doing so in a controlled manner: the memory use wouldn't go up just because there is a sudden influx of 50000 packets. So it scales with load by virtue of simply not _caring_ about the load - it only cares about the number of fd's you're waiting on. (Obviously _other_ parts of the system will have to keep up with tons of new packets, but that's why protocols like TCP have flow control, and that is something that is done independently of the event notification, so regardless of _what_ kind of event model you have you have to handle the simple load of TCP ;).

Linus
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 10:39:36PM -0700, Linus Torvalds wrote:
> Actually, forget the mmap, it's not needed.
>
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.
>...
> Basically, the perfect interface for events would be
>
>         struct event {
>                 unsigned long id;    /* file descriptor ID the event is on */
>                 unsigned long event; /* bitmask of active events */
>         };
>
>         int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

I like. :)

However, isn't there already something like this, albeit maybe without the ability to return multiple events at a time? When discussing select/poll on IRC a while ago with sct, sct said:

	Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for SIGIO as usual.
	Simon: Then fcntl(fd, F_SETSIG, rtsignum)
	Simon: And you'll get a signal queue which passes you the fd of each SIGIO in turn.
	sct: easy :)
	Simon: You don't even need the overhead of a signal handler: instead of select(),
	you just do "sigwaitinfo(&siginfo, timeout)" and it will do a select-style IO
	wait, returning the fd in the siginfo when it's available.

(Captured from IRC on Nov 12th, 1998.)

Or does this method still have the overhead of encapsulating the events into signals within the kernel?

Also, what is different in your above interface that prevents it from being able to queue up too many events? I guess the structure is only sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000 FDs on x86, but I don't see how the other method would be significantly different. The kernel would have to store the queued events still, surely...

Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. ]
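The sct recipe above, sketched out; handle_fd() is a hypothetical application handler, F_SETSIG is the Linux-specific fcntl, and note that the call that actually accepts a timeout is sigtimedwait(), not sigwaitinfo():

	#define _GNU_SOURCE             /* F_SETSIG is Linux-specific */
	#include <fcntl.h>
	#include <signal.h>
	#include <unistd.h>

	extern void handle_fd(int fd, long band);   /* hypothetical application handler */

	void setup(int fd, sigset_t *set)
	{
		sigemptyset(set);
		sigaddset(set, SIGRTMIN);
		sigaddset(set, SIGIO);              /* plain SIGIO arrives when the RT queue overflows */
		sigprocmask(SIG_BLOCK, set, NULL);  /* block them: we collect synchronously instead */

		fcntl(fd, F_SETOWN, getpid());
		fcntl(fd, F_SETSIG, SIGRTMIN);      /* queue SIGRTMIN with siginfo, not plain SIGIO */
		fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
	}

	void event_loop(sigset_t *set)
	{
		for (;;) {
			siginfo_t info;
			struct timespec tmout = { 1, 0 };
			int sig = sigtimedwait(set, &info, &tmout);
			if (sig == SIGIO) {
				/* RT queue overflowed: fall back to one big poll() here */
				continue;
			}
			if (sig > 0)
				handle_fd(info.si_fd, info.si_band);
		}
	}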
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Mitchell Blank Jr wrote:
> I think everyone should take a timeout and look at Solaris 8's /dev/poll
> interface. This discussion is reinventing the wheel, the lever, and the
> inclined plane.
>
> http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123
>
> I think this is a lot cleaner than your approach:

You must be kidding. /dev/poll is the interface from _hell_. It's absolutely disgusting.

> * it doesn't add extra syscalls

Sure it does. What do you think ioctl's are? Every single ioctl number is an extra system call. It's just indirected inside the device driver instead of at the normal system call entry point. The advantage? It's noticeably slower than a real system call. And the final reason why /dev/poll should just go away: not only is the interface ugly, but it's hard to use too. It's really hard to mix different users of /dev/poll - you'll have to create a whole new event structure around it. Sure, you can have multiple "poll lists" by opening /dev/poll multiple times, and you seem to see this as a great feature. I'm telling you that multiple event queues are a horrible disaster, and it's a perfect example of what an engineer with the "more is better" mindset comes up with. Multiple event queues are bad, because it completely breaks the notion of event-driven programming. How do you want to listen to them all? You can't. You can only listen to one event queue at a time - unless you create some kind of "hierarchy" of event queues, where you have one event queue that you wait on that contains the _other_ event queues, so that you can tell which one has anything pending.. Trust me. Multiple event queues are idiotic. Anybody who _designs_ for multiple event queues has some serious problems. Basically, tastes differ. Everything you seem to see as an advantage of /dev/poll, I see as a horrible kludge.

Linus
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Dan Kegel wrote:
> Linus Torvalds wrote:
> > Basically, the main loop would boil down to
> >         for (;;) {
> >                 static struct event ev_list[MAXEV];
> >                 get_event(ev_list, MAXEV, &tmout);
> >                 .. timeout handling here ..
> >         }
> >
> > because get_event() would end up doing all the user-mode calls too (so
> > "get_event()" is no longer a system call: it's a system call + a for-loop
> > to call all the ID handler functions that were associated with the events
> > that triggered).
>
> Occurs to me that this stuff will be used from other languages
> than C, which might prefer to do their own event dispatching,
> and would want the system to treat event_fn as another opaque quantity.

Yes and no. The KERNEL would treat it as just another opaque quantity: after all, it would never ever touch the thing other than get it as part of "bind_event()" and return it as part of "get_events()". So as far as the kernel is concerned, both "event_fn" and "opaque" are just random information. However, it is very important to have every common user agree about the meaning of the ID's in user space: the whole point of encapsulating "event_fn" as a function pointer is that you can have different parts of the same program use the "bind_event()" interface, and they won't step on each other's toes. So if you program in Fortran, for example, and you expect to link against the C library, and the C library uses events for something, then you'd better make sure that the fortran interfaces end up using the event_fn thing in the same way as the C one..

> So I don't object to a function that dispatches events, e.g.
>         int get_event(&tmout);
> as long as there's also
>         int get_events(ev_list, MAXEV, &tmout)
> that treats event_fn as an opaque pointer and *does not* dispatch the events.

That's a user-mode issue, and obviously, if you don't care about compatibility with anything else you can do this. The kernel won't know, or care. But I think it is unlikely that you'd ever use the raw events in any other way - most environments under UNIX tend to have a C library component somewhere, because the low-level stuff is written in C and uses the common code that way even when the programmer isn't aware of it.

Linus
Re: Linux's implementation of poll() not scalable?
[ Moving on to practical matters ]

On Tue, 24 Oct 2000, Dan Kegel wrote:
>
> Might be good to pick more unique names than 'struct event' and 'get_event'.
> People probably use those already.

I agree. I would never implement them under those names, but it's easier to talk about "event" than about something very specific.

> Hiding ev_list is probably ok. However,
> http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
> suggests that knowing how many events are pending is a useful measure
> of server load, and that if more than a certain number of events
> are pending, web servers should reject new connections. Thus it might
> be handy to make the return value of get_event be the number of events
> gotten.

Note that my "get_event()" function actually did return exactly that: it returned the number of events, even though the actual event array handling was hidden inside the library function.

> > So now you'd start everything off (assuming the same kind of "listen to
> > everything and react to it" server as in my previous example) by just
> > setting
> >
> >         bind_event(sock, POLLIN, NULL, accept_fn);
>
> A couple questions:
>
> * how do you unbind? (By calling bind_event with NULL as the accept_fn?)

If done that way, I'd prefer something where an event _mask_ of 0 means that the event obviously no longer exists, as it can no longer trigger. The accept_fn would be part of the identifier, and as such zeroing it out would lose the identification. But I'd personally probably prefer to be more explicit, and have a separate function for it. Note that whether it's a separate function or just making "bind_event()" really be "modify_or_bind_or_remove_event()" is pretty much not anything I care about - it gets to be so low-level details that it would be something that is probably best tried out and something that I don't have any religious issues with. I care about high-level interfaces, because _those_ are the ones that screw me years down the line. For example, select() (and later bind()) was always an interface that did _not_ fit into the UNIX way of doing things AT ALL. It really stands out among all the IO interfaces as being very different. And it's one of the nastier ones to maintain.

> * how do you change a mask? (By calling bind_event with a new mask?)

Same as above. Either bind with a new mask (zero being remove), or separate call.

> * Is it ok to do these things while in an event_fn? (Yes?)

Oh, absolutely. The kernel doesn't even _know_ when something is in an event function - the kernel only sees "change the event" and "get the list of events".

> * Do you get an event whenever an fd is ready for something, or
>   only when its readiness changes? (Presumably whenever an fd is ready for
>   something?)

Only when its readiness changes - probably with the addition that it would simplify things that a new event always (unconditionally) gets added to the event queue. Note that there are no races in this implementation - there is no way for events to get lost, even trivially seen in SMP/threaded environments. That's important. The BSD kevent paper goes on about "level and edge triggered" and it becomes a big thing for them, and they selected level-triggered events as if it made any difference. And sure - it _does_ make a difference, but the only difference is in how hard it is to implement, and level-triggered is noticeably harder. The bind_event()/get_events() interface is edge-triggered, and works fundamentally the same way as SIGCHLD.
If you don't keep up, you get SIGCHLD only once, even if ten children died on you in rapid succession. And it's edge-triggered: that doesn't mean that you lose events (like the BSD paper implies), it only means that you have to reap your children with a loop:

	while ((err = waitpid(..)) >= 0) {
		.. get all children ..
	}

The reason "edge-triggered" (ie only when an event changes) is preferable is that it's MUCH simpler, and means that the "get_events()" system call doesn't need to actually understand what the events mean at all. It will just take the events off the list. If anything triggers after that, it will be put back on the list again, but you can guarantee that you don't get any lost events simply by the simple fact that - by the time get_events() returns to user mode (ie before the action functions have been called), any new event that happens even before we have had to react to the old events will cause the event to be queued up again. What does this mean? It basically means that if you continue to get events on a file descriptor during your event handler function, the event will have moved back to the event list, and next time you _will_ get an event again.

Linus
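Applying the same reap-in-a-loop rule to an edge-triggered readable event on a non-blocking socket might look like this; consume(), handle_eof() and handle_error() are hypothetical application handlers:

	#include <errno.h>
	#include <unistd.h>

	extern void consume(const char *buf, ssize_t n);  /* hypothetical */
	extern void handle_eof(int fd);                   /* hypothetical */
	extern void handle_error(int fd);                 /* hypothetical */

	void on_readable(int fd)
	{
		char buf[4096];
		for (;;) {
			ssize_t n = read(fd, buf, sizeof(buf));
			if (n > 0) {
				consume(buf, n);
			} else if (n == 0) {
				handle_eof(fd);     /* peer closed */
				return;
			} else if (errno == EAGAIN) {
				return;             /* drained: safe to wait for the next edge */
			} else if (errno != EINTR) {
				handle_error(fd);
				return;
			}
		}
	}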
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.

I think everyone should take a timeout and look at Solaris 8's /dev/poll interface. This discussion is reinventing the wheel, the lever, and the inclined plane. http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123 I think this is a lot cleaner than your approach:

* it doesn't add extra syscalls

* it handles multiple event queues, and does so without ugliness. all the per-queue state is held in the /dev/poll's "struct file" instance

* in your method you have a per-process queue - but under what clone() conditions is it shared among siblings? Here the user has a choice - they can share the open("/dev/poll") file descriptor or not using any of the existing means. Granted, they also would probably want to arrange to share the fd's being polled in order to make this useful.

* No new fields in task_struct

A few simple improvements can be made to the Sun model, though:

* The fact that the fd of /dev/poll can't be used for poll(2) or select(2) is ugly. Sure, you probably don't want an open instance of /dev/poll being added to another /dev/poll, but being able to call "select" on them would be really useful:

1. Imagine a server that has to process connections from both high-priority and low-priority clients - and that requests from the high-priority ones always take precedence. With this interface you could easily have two open instances of /dev/poll and then call select(2) on them. This ability just falls out naturally from the interface.

2. Some libraries are internally driven by select(2) loops (I think Xlib is like this, IIRC). If you have a lot of fd's you want to watch, this means you must take the hit of calling select(2) on all of them. If you could just pass in a fd for /dev/poll, problem solved.

* I think the fact that you add events via write(2) but retrieve them via ioctl(2) is an ugly asymmetry. From what I can see, the entire motivation for using ioctl as opposed to read(2) is to allow the user to specify a timeout. If you could use poll(2)/select(2) on /dev/poll this issue would be moot (see above)

* It would be nice if the interface were flexible enough to report items _other_ than "activity on fd" in the future. Maybe SYSV IPC? itimers? directory update notification? It seems that every time UNIX adds a mechanism of waiting for events, we spoil it by not making it flexible enough to wait on everything you might want. Let's not repeat the mistake with a new interface.

* The "struct pollfd" and "struct dvpoll" should also include a 64-bit opaque quantity supplied by userland which is returned with each event on that fd. This would save the program from having to look up which connection context goes with each fd - the kernel could just give you the pointer to the structure. Not having this capability isn't a burden for programs dealing with a small number of fd's (since they can just have a lookup table) but if you potentially have 1000's of connections it may be undesirable to allocate an array for them all.

The linux patch of /dev/poll implements mmap'ing of the in-kernel poll table... I don't think this is a good idea. First, the user just wants to be able to add events and dequeue them - both linear operations. Second, why should the kernel be forced to maintain a fixed-size linear list of events we're looking for... this seems mindlessly inefficient.
Why not just pull a "struct pollfd" out of slab each time a new event is listened for?

My unresolved concerns:

* Is this substantially better than the already existing rtsig+poll solution? Enough to make implementing it worth while?

* How do we quickly do the "struct file" -> "struct pollfd" conversion each time an event happens? Especially if there are multiple /dev/poll instances open in the current process. Probably each "struct file" would need a pointer to the instance of /dev/poll which would have some b-tree variant (or maybe a hash table). The limitation would be that a single fd couldn't be polled for events by two different /dev/poll instances, even for different events. This is probably good for sanity's sake anyway.

-Mitch
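For concreteness, the usage pattern of the Sun interface being debated looks roughly like this, assuming the <sys/devpoll.h> definitions from the man page cited above (error handling omitted):

	#include <sys/devpoll.h>
	#include <sys/ioctl.h>
	#include <fcntl.h>
	#include <poll.h>
	#include <unistd.h>

	static int dp;

	void dp_init(void)
	{
		dp = open("/dev/poll", O_RDWR);
	}

	void dp_watch(int fd, short events)      /* events = POLLREMOVE drops the fd */
	{
		struct pollfd pfd = { fd, events, 0 };
		write(dp, &pfd, sizeof(pfd));        /* registrations go in via write(2)... */
	}

	int dp_wait(struct pollfd *out, int max, int timeout_ms)
	{
		/* ...but events come out via ioctl(2): the asymmetry complained
		   about above; returns the number of ready fds */
		struct dvpoll dvp = { out, max, timeout_ms };
		return ioctl(dp, DP_POLL, &dvp);
	}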
Re: Linux's implementation of poll() not scalable?
Thus spake Linus Torvalds ([EMAIL PROTECTED]): > I disagree. > Let's just face it, poll() is a bad interface scalability-wise. Is that a reason to implement it badly? Felix
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 09:06:11PM -0700, Linus Torvalds wrote:
>
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> >
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array)
>
> That's _not_ how poll() works at all.

But how /dev/poll would work. Sorry for not marking it clearly.

> We don't _have_ a poll table in the kernel, and no way to mmap it. The
> poll() tables gets created dynamically based on the stuff that the user
> has set up in the table. And the user can obviously change the fd's etc in
> the table directly, so in order for the caching to work you need to do
> various games with page table dirty or writable bits, or at least test
> dynamically whether the poll table is the same as it was before.

Actually, assuming the user does not duplicate fds in the poll array, it would always work out the lazy way (and when he duplicates fds I would say the behaviour is undefined -- no spec says in what order poll must process fds in the supplied poll table). You either find the correct fd at the offset you cached, and if not, the user has changed the table and you need to recompute all cached offsets. No need to play pte tricks.

> Sure, it's doable, and apparently Solaris does something like this. But
> what _is_ the overhead of the Solaris code for small number of fd's? I bet
> it really is quite noticeable. I also suspect it is very optimized toward
> an unchanging poll-table.

For a small number of fds you use the fast_poll/fast_select that I implemented in the patch I sent to you.

> > What is your favourite interface then ?
>
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

I think when you declare the case of duplicated fd entries in a poll array as undefined, you don't need the controlled add/remove_entry. [and from a quick glimpse at Single Unix it does not say anything about the order of fd processing in poll, so the spec is definitely flexible enough for that]

-Andi
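A user-space model of the lazy scheme Andi sketches, reduced to a single cached (fd, offset) pair; in the real thing a miss would trigger recomputing all cached offsets, and the names here are purely illustrative:

	#include <poll.h>

	struct cached { int fd; int offset; };

	static int find_entry(struct pollfd *ufds, int nfds, struct cached *c)
	{
		if (c->offset < nfds && ufds[c->offset].fd == c->fd)
			return c->offset;           /* table untouched: the cheap path */
		for (int i = 0; i < nfds; i++)  /* user rearranged the table: rescan */
			if (ufds[i].fd == c->fd)
				return c->offset = i;
		return -1;                      /* fd gone; the undefined-duplicates rule
		                                   means the first match is always fine */
	}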
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> Basically, the main loop would boil down to
>         for (;;) {
>                 static struct event ev_list[MAXEV];
>                 get_event(ev_list, MAXEV, &tmout);
>                 .. timeout handling here ..
>         }
>
> because get_event() would end up doing all the user-mode calls too (so
> "get_event()" is no longer a system call: it's a system call + a for-loop
> to call all the ID handler functions that were associated with the events
> that triggered).

Occurs to me that this stuff will be used from other languages than C, which might prefer to do their own event dispatching, and would want the system to treat event_fn as another opaque quantity. So I don't object to a function that dispatches events, e.g.

	int get_event(&tmout);

as long as there's also

	int get_events(ev_list, MAXEV, &tmout)

that treats event_fn as an opaque pointer and *does not* dispatch the events. These probably both want to be C library functions (rather than get_events being only a system call) on the chance that some operating systems will implement Linux emulation at the shared library level rather than the syscall level.

- Dan
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> Basically, the main loop would boil down to
>
>         for (;;) {
>                 static struct event ev_list[MAXEV];
>                 get_event(ev_list, MAXEV, &tmout);
>                 .. timeout handling here ..
>         }
>
> because get_event() would end up doing all the user-mode calls too (so
> "get_event()" is no longer a system call: it's a system call + a for-loop
> to call all the ID handler functions that were associated with the events
> that triggered).
>
> So the "struct event" would just be:
>
>         struct event {
>                 int fd;
>                 unsigned long mask;
>                 void *opaque;
>                 void (*event_fn)(int fd, unsigned long mask, void *opaque);
>         };
>
> and there's no need for separate event queues, because the separate event
> queues have been completely subsumed by the fact that every single event
> has a separate event function.

OK, guess I buy the one-queue-to-bind-them-all argument. Might be good to pick more unique names than 'struct event' and 'get_event'. People probably use those already. Hiding ev_list is probably ok. However, http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf suggests that knowing how many events are pending is a useful measure of server load, and that if more than a certain number of events are pending, web servers should reject new connections. Thus it might be handy to make the return value of get_event be the number of events gotten.

> So now you'd start everything off (assuming the same kind of "listen to
> everything and react to it" server as in my previous example) by just
> setting
>
>         bind_event(sock, POLLIN, NULL, accept_fn);

A couple questions:

* how do you unbind? (By calling bind_event with NULL as the accept_fn?)
* how do you change a mask? (By calling bind_event with a new mask?)
* Is it ok to do these things while in an event_fn? (Yes?)
* Do you get an event whenever an fd is ready for something, or only when its readiness changes? (Presumably whenever an fd is ready for something?)

> (This is also the ideal event programming interface - signals get racy and
> hard to handle, while in the above example you can trivially just be
> single-threaded. Which doesn't mean that you CANNOT be multi-threaded if
> you want to: you multi-thread things by just having multiple threads that
> all call "get_event()" on their own).

I for one will be glad to not have to think of the races caused by the delay between signal queueing and signal pickup.

- Dan
Re: Linux's implementation of poll() not scalable?
At 10:39 PM 23/10/2000 -0700, Linus Torvalds wrote:
>First, let's see what is so nice about "select()" and "poll()". They do
>have one _huge_ advantage, which is why you want to fall back on poll()
>once the RT signal interface stops working. What is that?

RT methods are bad if they consume too many resources. SIGIO is a good example of this - the current overhead of passing events to user-space incurs both a spinlock and a memory copy of 512 bytes for each event. while it removes the requirement to "walk lists", the signal semantics in the kernel and the overhead of memory copies to userspace negate its performance a fair bit. that isn't to say that all "event-driven" methods are bad. in the past year, i've done many experiments at making SIGIO more efficient. some of these experiments include --

[1] 'aggregate' events. that is, if you've registered a POLL_IN, no need to register another POLL_IN. this was marginally successful, but ultimately still didn't scale.

[2] create a new interface for event delivery. for [2] i settled on a 16-byte structure sufficient to pass all of the relevant information:

	typedef struct zerocopy_buf {
		int       fd;
		short int cmd;
	#define ZEROCOPY_VALID_BUFFER 0xe1e2
		short int valid_buffer;
		void      *buf;     /* skbuff */
	#ifdef __KERNEL__
		volatile
	#endif
		struct zerocopy_buf *next;
	} zerocopy_buf_t;

so, we get down to 16 bytes per-event; these are allocated ... coupled with this was an interface whereby user-space could view kernel-space (via read-only mmap). in my case, this allowed for user-space to be able to read the above chain of zerocopy_buf events with no kernel-to-user memory copies. an ioctl on a character driver could ask the kernel to give it the head of the chain of the current zerocopy_buf structure. a similar ioctl() call allows it to pass a chain of instructions to the kernel (adding/removing events from notification) and other housekeeping. since user-space had read-only visibility into kernel memory address-space, one could then pick up skbuff's in userspace without the overhead of copies. ... and so-on. the above is a bit of a simplification of what goes on. using flip-buffers of queues, one can use this in multiple processes and be SMP-safe without the requirements for spinlocks or semaphores in the "fast path". solving the "walk the list of fd's" and "incur the overhead of memory copies" problems, tied in with network hardware capable of handling scatter/gather DMA and IP and TCP checksum calculations, i've more than doubled the performance of an existing application which depended on poll()-type behaviour. while i agree that it isn't necessarily a 'generic' interface, and won't necessarily appeal to everyone as the cure-all, the techniques used have removed two significant bottlenecks to high-network-performance i/o on tens-of-thousands of TCP sockets for an application we've been working on.

cheers, lincoln.
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Dan Kegel wrote:
>
> kqueue lets you associate an arbitrary integer with each event
> specification; the integer is returned along with the event.
> This is very handy for, say, passing the 'this' pointer of the
> object that should handle the event. Yes, you can simulate it
> with an array indexed by fd, but I like having it included.

I agree. I think the ID part is the most important part of the interfaces, because you definitely want to re-use the same functions over and over again - and you want to have some way other than just the raw fd to identify where the actual _buffers_ for that IO are, and what stage of the state machine we're in for that fd etc etc. Also, it's potentially important to have different "id"s for even the same fd - in some cases you want to have the same event handle both the read and the write part on an fd, but in other cases it might make more conceptual sense to separate out the read handling from the write handling, and instead of using "mask = POLLIN | POLLOUT", you'd just have two separate events, one with POLLIN and one with POLLOUT. This was what my "unsigned long id" was, but that is much too hard to use. See my expanded suggestion of it just a moment ago. (And yes, I'm sure you can do all this with kevents. I just abhor the syntax of those things).

Linus
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Dan Kegel wrote:
>
> http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html
> describes the FreeBSD kqueue interface for events:

I've actually read the BSD kevent stuff, and I think it's classic over-design. It's not easy to see what it's all about, and the whole tuple crap is just silly. Looks much too complicated. I don't believe in the "library" argument at all, and I think multiple event queues completely detract from the whole point of being simple to use and implement.

Now, I agree that my bind_event()/get_event() has limitations, and the biggest one is probably the event "id". It needs to be better, and it needs to have more structure. The "id" really should be something that not only contains the "fd", but also contains the actor function to be called, along with some opaque data for that function. In fact, if you take my example server, and move the "handle[id]()" call _into_ get_events() (and make the "handle[id]()" function pointer a part of the ID of the event), then the library argument goes away too: it doesn't matter _who_ calls the get_event() function, because the end result is always going to be the same regardless of whether it is called from within a library or from a main loop: it's going to call the function handler associated with the ID that triggered. Basically, the main loop would boil down to

	for (;;) {
		static struct event ev_list[MAXEV];
		get_event(ev_list, MAXEV, &tmout);
		.. timeout handling here ..
	}

because get_event() would end up doing all the user-mode calls too (so "get_event()" is no longer a system call: it's a system call + a for-loop to call all the ID handler functions that were associated with the events that triggered). So the "struct event" would just be:

	struct event {
		int fd;
		unsigned long mask;
		void *opaque;
		void (*event_fn)(int fd, unsigned long mask, void *opaque);
	};

and there's no need for separate event queues, because the separate event queues have been completely subsumed by the fact that every single event has a separate event function. So now you'd start everything off (assuming the same kind of "listen to everything and react to it" server as in my previous example) by just setting

	bind_event(sock, POLLIN, NULL, accept_fn);

which basically creates the event inside the kernel, and will pass it to the "__get_event()" system call through the event array, so the get_event() library function basically looks like

	int get_event(struct event *array, int maxevents, struct timeval *tv)
	{
		int nr = __get_event(array, maxevents, tv);
		int i;
		for (i = 0; i < nr; i++) {
			array->event_fn(array->fd, array->mask, array->opaque);
			array++;
		}
		return nr;
	}

and tell me why you'd want to have multiple event queues any more? (In fact, you might as well move the event array completely inside "get_event()", because nobody would be supposed to look at the raw array any more. So the "get_event()" interface would be even simpler: just a timeout, nothing more. Talk about simple programming.)

(This is also the ideal event programming interface - signals get racy and hard to handle, while in the above example you can trivially just be single-threaded. Which doesn't mean that you CANNOT be multi-threaded if you want to: you multi-thread things by just having multiple threads that all call "get_event()" on their own).

Linus
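To see how the pieces compose, here is a rough sketch of such a server written against the proposed calls; bind_event() and get_event() do not exist, their shapes are taken from the description above, and setup_listener() plus the per-connection state are made up for illustration:

	#include <poll.h>
	#include <stdlib.h>
	#include <sys/socket.h>
	#include <sys/time.h>

	/* proposed interface, not a real syscall */
	extern int bind_event(int fd, unsigned long mask, void *opaque,
	                      void (*fn)(int, unsigned long, void *));
	extern int get_event(struct timeval *tmout);  /* the dispatching wrapper above */
	extern int setup_listener(void);              /* hypothetical helper */

	static void read_fn(int fd, unsigned long mask, void *opaque)
	{
		/* opaque points at this connection's buffers / state machine */
	}

	static void accept_fn(int fd, unsigned long mask, void *opaque)
	{
		int conn = accept(fd, NULL, NULL);
		/* a library could bind its own events here without stepping
		   on ours: the handler travels with the binding */
		bind_event(conn, POLLIN, malloc(1024) /* per-connection state */, read_fn);
	}

	int main(void)
	{
		bind_event(setup_listener(), POLLIN, NULL, accept_fn);
		for (;;)
			get_event(NULL);    /* all work happens in the handlers */
	}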
Re: Linux's implementation of poll() not scalable?
Followup to: <[EMAIL PROTECTED]>
By author: Dave Zarzycki <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
>
> Maybe I'm missing something, but why do you separate out fd from the event
> structure. Why not just "int bind_event(struct event *event)"
>
> The only thing I might have done differently is allow for multiple event
> queues per process, and the ability to add event queues to event
> queues. But these features might not be all that useful in real life.

This could be useful if it doesn't screw up the implementation too much. Pretty much, what Linus is saying is the following: select()/poll() have one sizable cost, and that is to set up and destroy the set of events we want to trigger on. We would like to amortize this cost by making it persistent. It would definitely be useful for the user to have more than one such "event set" installed at any one time, so that you can call different wait_for_event() [or whatever] as appropriate. However, if that means we're doing lots of constructing and deconstructing in kernel space, then we probably didn't gain much. The other things I think we'd really like in a new interface is an interface where you can explicitly avoid the "storming hordes" problem -- if N servers are waiting for the same event, it should be at least possible to tell the kernel to only wake up one (arbitrarily chosen) of them, rather than all. Finally, it would be somewhat nice to have a unified interface for synchronous and asynchronous notification. This should be quite easily doable by adding a call event_notify(event_set, signal) that causes real-time signal "signal" to be raised (presumably with the event_set as the argument), when the specified event_set triggers.

-hpa

-- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > [kqueue is] Pretty similar to yours, with the following additions: > > Your proposal seems to only have one stream of available events per > process. kqueue() returns a handle to an event queue, and kevent() > takes that handle as a first parameter. > > [kqueue] uses a single call to both bind (or unbind) an array of fd's > and pick up new events. Probably reduces overhead. > > [kqueue] allows you to watch not just sockets, but also plain files > or directories. (Hey, haven't we heard people talk about letting apps > do that under Linux? What interface does that use?) I forgot to mention: kqueue lets you associate an arbitrary integer with each event specification; the integer is returned along with the event. This is very handy for, say, passing the 'this' pointer of the object that should handle the event. Yes, you can simulate it with an array indexed by fd, but I like having it included. - Dan
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote:
> where you say "I want an array of pending events, and I have an array you
> can fill with up to 'maxnr' events - and if you have no events for me,
> please sleep until you get one, or until 'tmout'".
>
> The above looks like a _really_ simple interface to me. Much simpler than
> either select() or poll(). Agreed?

Totally. I've been wanting an API like this in user-space for a long time. One question about "int bind_event(int fd, struct event *event)" Maybe I'm missing something, but why do you separate out fd from the event structure. Why not just "int bind_event(struct event *event)" The only thing I might have done differently is allow for multiple event queues per process, and the ability to add event queues to event queues. But these features might not be all that useful in real life. Hmmm... It might be cute if you could do something like this:

	int num_of_resulting_events;
	int r, q = open("/dev/eventq", O_RDWR);
	struct event interested_events[1024];
	struct event events_that_happened[1024];

	/* fill up the interested_events array */

	write(q, interested_events, sizeof(interested_events));
	r = read(q, events_that_happened, sizeof(events_that_happened));
	num_of_resulting_events = r / sizeof(struct event);

davez

-- Dave Zarzycki http://thor.sbay.org/~dave/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote:
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.
...
> It boils down to one very simple rule: dense arrays of sticky status
> information is good. So let's design a good interface for a dense array.
>
> Basically, the perfect interface for events would be
>
>         struct event {
>                 unsigned long id;    /* file descriptor ID the event is on */
>                 unsigned long event; /* bitmask of active events */
>         };
>
>         int get_events(struct event * event_array, int maxnr, struct timeval *tmout);
>
>         int bind_event(int fd, struct event *event);

http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html describes the FreeBSD kqueue interface for events:

	struct kevent {
		uintptr_t ident;   /* identifier for this event */
		short     filter;  /* filter for event */
		u_short   flags;   /* action flags for kqueue */
		u_int     fflags;  /* filter flag value */
		intptr_t  data;    /* filter data value */
		void      *udata;  /* opaque user data identifier */
	};

	int kqueue(void);

	int kevent(int kq, const struct kevent *changelist, int nchanges,
	           struct kevent *eventlist, int nevents,
	           const struct timespec *timeout);

Pretty similar to yours, with the following additions: Your proposal seems to only have one stream of available events per process. kqueue() returns a handle to an event queue, and kevent() takes that handle as a first parameter. Their proposal uses a single call to both bind (or unbind) an array of fd's and pick up new events. Probably reduces overhead. Their proposal allows you to watch not just sockets, but also plain files or directories. (Hey, haven't we heard people talk about letting apps do that under Linux? What interface does that use?)

> The really nice part of the above is that it's trivial to implement. It's
> about 50 lines of code, plus some simple logic to various drivers etc to
> actually inform about the events. The way to do this simply is to limit it
> in very clear ways, the most notable one being simply that there is only
> one event queue per process (or rather, per "struct files_struct" - so
> threads would automatically share the event queue). This keeps the
> implementation simple, but it's also what keeps the interfaces simple: no
> queue ID's to pass around etc.

I dislike having only one event queue per process. Makes it hard to use in a library. ("But wait, *I* was going to use bind_event()!")

> Advantage: everything is O(1), except for "get_event()" which is O(n)
> where 'n' is the number of active events that it fills in.

This is a big win, and is exactly the payoff that Solaris gets with /dev/poll.

> Example "server":

See http://www.monkeys.com/kqueue/echo.c for a very similar example server for kqueue() / kevent().

> You get the idea. Very simple, and looks like it should perform quite
> admirably. With none of the complexities of signal handling, and none of
> the downsides of select() or poll(). Both of the new system calls would be
> on the order of 20-30 lines of code (along with the infrastructure to
> extend the fasync stuff to also be able to handle events)

Go, Linus, go! Let's do it! But don't go *too* simple. I really would like a few of the things kqueue has.

> Yes, I'm probably missing something.
> But this is the kind of thing I think we should look at (and notice that
> you can _emulate_ this with poll(), so you can use this kind of interface
> even on systems that wouldn't support these kinds of events natively).

- Dan

p.s. my collection of notes on kqueue is at http://www.kegel.com/c10k.html#nb.kqueue
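For comparison, the corresponding kqueue calls look roughly like this, per the FreeBSD man page quoted above; conn_state is the per-connection pointer riding in the udata field Dan likes:

	#include <sys/types.h>
	#include <sys/event.h>
	#include <sys/time.h>

	void watch_read(int kq, int fd, void *conn_state)
	{
		struct kevent change;
		EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, conn_state);
		kevent(kq, &change, 1, NULL, 0, NULL);  /* submit change, collect nothing */
	}

	int collect(int kq, struct kevent *out, int max)
	{
		struct timespec tmout = { 1, 0 };
		/* a single kevent() call can both submit a changelist and pick up
		   events; out[i].udata carries each connection's pointer back */
		return kevent(kq, NULL, 0, out, max, &tmout);
	}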
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Jordan Mendelson wrote:
> What you describe is exactly what the /dev/poll interface patch from the
> Linux scalability project does.
>
> It creates a special device which you can open up and write
> add/remove/modify entries you wish to be notified of using the standard
> struct pollfd. Removing entries is done by setting the events in a
> struct written to the device to POLLREMOVE.

And that's an ugly crap.

* you add struct {} into the kernel API. _Always_ a bad idea.
* you either create yet another example of "every open() gives a new instance" kind of device or you got to introduce a broker process.
* no easy way to check the current set.
* no fscking way to use that from scripts/etc.

> You can optionally mmap() memory which the notifications are written to.
> Two ioctl() calls are provided for the initial allocation and also to
> force it to check all items in your poll() list.

* useless use of ioctl() award

> Solaris has this same interface minus the mmap()'ed memory.

Oh, yes. Solaris. Great example of good taste...
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > What is your favourite interface then ?
>
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed. Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do have one _huge_ advantage, which is why you want to fall back on poll() once the RT signal interface stops working. What is that? Basically, RT signals or any kind of event queue has a major fundamental queuing theory problem: if you have events happening really quickly, the events pile up, and queuing theory tells you that as you start having queueing problems, your latency increases, which in turn tends to mean that later events are even more likely to queue up, and you end up in a nasty meltdown scenario where your queues get longer and longer. This is why RT signals suck so badly as a generic interface - clearly we cannot keep sending RT signals forever, because we'd run out of memory just keeping the signal queue information around. Neither poll() nor select() have this problem: they don't get more expensive as you have more and more events - their expense is the number of file descriptors, not the number of events per se. In fact, both poll() and select() tend to perform _better_ when you have pending events, as they are both amenable to optimizations when there is no need for waiting, and scanning the arrays can use early-out semantics. So sticky arrays of events are good, while queues are bad. Let's take that as one of the fundamentals. So why do people still like RT signals? They do have one advantage, which is that you do NOT have that silly array traversal when there is nothing to do. Basically, the RT signals kind of approach is really good for the cases where select() and poll() suck: no need to traverse mostly empty and non-changing arrays all the time. It boils down to one very simple rule: dense arrays of sticky status information is good. So let's design a good interface for a dense array. Basically, the perfect interface for events would be

	struct event {
		unsigned long id;    /* file descriptor ID the event is on */
		unsigned long event; /* bitmask of active events */
	};

	int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you can fill with up to 'maxnr' events - and if you have no events for me, please sleep until you get one, or until 'tmout'". The above looks like a _really_ simple interface to me. Much simpler than either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie the "binding" of events. The most straightforward way to do that is to just do a simple "bind_event()" system call:

	int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file descriptor "fd". The way to stop being interested in events is to just set the event bitmask to zero. Now, the perfect interface would be the above. Nothing more. Nothing fancy, nothing complicated. Only really simple stuff.
Remember the old rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's about 50 lines of code, plus some simple logic to various drivers etc to actually inform about the events. The way to do this simply is to limit it in very clear ways, the most notable one being simply that there is only one event queue per process (or rather, per "struct files_struct" - so threads would automatically share the event queue). This keeps the implementation simple, but it's also what keeps the interfaces simple: no queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

- a queue head in "struct files_struct", initially empty.

- doing a "bind_event()" basically adds a fasync entry to the file structure, but rather than cause a signal, it just looks at whether the fasync entry is already linked into the event queue, and if it isn't (and the event is one of the ones in the event bitmask), it adds itself to the event queue.

- get_event() just traverses the event queue and fills in the array, removing them from the event queue. End of story. If the event queue is empty, it trivially sees that in a single line of code (+ timeout handling)

Advantage: everything is O(1), except for "get_event()" which is O(n) where 'n' is the number of active events that it fills in.
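A minimal sketch of a server loop against the two proposed calls, to make the shape concrete; get_events() and bind_event() are the proposal itself, while MAXEV, listen_socket() and handle() are illustrative stand-ins:

	#include <poll.h>
	#include <sys/time.h>

	#define MAXEV 128

	struct event {
		unsigned long id;    /* file descriptor ID the event is on */
		unsigned long event; /* bitmask of active events */
	};

	/* proposed interface, not real syscalls */
	extern int get_events(struct event *event_array, int maxnr, struct timeval *tmout);
	extern int bind_event(int fd, struct event *event);

	extern int listen_socket(void);                            /* hypothetical */
	extern void handle(unsigned long id, unsigned long mask);  /* hypothetical dispatch */

	int main(void)
	{
		int sock = listen_socket();
		struct event ev = { sock, POLLIN };
		bind_event(sock, &ev);   /* sticky: never needs requeueing */

		for (;;) {
			struct event ev_list[MAXEV];
			int i, n = get_events(ev_list, MAXEV, NULL);  /* NULL: wait forever */
			for (i = 0; i < n; i++)
				handle(ev_list[i].id, ev_list[i].event);
		}
	}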
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
>
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> >
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array)
>
> The problem with poll() as-is is that the user doesn't really tell the
> kernel explicitly when it is changing the table..

What you describe is exactly what the /dev/poll interface patch from the Linux scalability project does. It creates a special device which you can open up and write add/remove/modify entries you wish to be notified of using the standard struct pollfd. Removing entries is done by setting the events in a struct written to the device to POLLREMOVE. You can optionally mmap() memory which the notifications are written to. Two ioctl() calls are provided for the initial allocation and also to force it to check all items in your poll() list. Solaris has this same interface minus the mmap()'ed memory.

Jordan
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote:
>
> I don't see the problem. You have the poll table allocated in the kernel,
> the drivers directly change it and the user mmaps it (I was not proposing
> to let poll make a kiobuf out of the passed array)

That's _not_ how poll() works at all. We don't _have_ a poll table in the kernel, and no way to mmap it. The poll() tables gets created dynamically based on the stuff that the user has set up in the table. And the user can obviously change the fd's etc in the table directly, so in order for the caching to work you need to do various games with page table dirty or writable bits, or at least test dynamically whether the poll table is the same as it was before. Sure, it's doable, and apparently Solaris does something like this. But what _is_ the overhead of the Solaris code for small number of fd's? I bet it really is quite noticeable. I also suspect it is very optimized toward an unchanging poll-table.

> What is your favourite interface then ?

I suspect a good interface that can easily be done efficiently would basically be something where the user _does_ do the equivalent of a read-only mmap() of poll entries - and explicit and controlled "add_entry()" and "remove_entry()" controls, so that the kernel can maintain the cache without playing tricks. Basically, something like a user interface to something that looks like the linux poll_table_page structures, with the difference being that it doesn't have to be created and torn down all the time because the user would explicitly ask for "add this fd" and "remove this fd" from the table. The problem with poll() as-is is that the user doesn't really tell the kernel explicitly when it is changing the table..

Linus
Re: Linux's implementation of poll() not scalable?
Nick Piggin ([EMAIL PROTECTED]) wrote:
> > I'm trying to write a server that handles 10000 clients. On 2.4.x,
> > the RT signal queue stuff looks like the way to achieve that.
>
> I would suggest you try multiple polling threads. Not only will you get
> better SMP scalability, if you have say 16 threads, each one only has to
> handle ~ 600 fds.

Good point. My code is already able to use multiple network threads, and I have done what you suggest in the past. But I'm interested in pushing the state of the art here, and want to see if Linux can handle it with just a single network thread. (My server has enough non-network threads to keep multiple CPUs busy, don't worry :-)

- Dan
Re: Linux's implementation of poll() not scalable?
David Schwartz wrote:
> > I'm trying to write a server that handles 10000 clients. On 2.4.x,
> > the RT signal queue stuff looks like the way to achieve that.
> > Unfortunately, when the RT signal queue overflows, the consensus seems
> > to be that you fall back to a big poll(). And even though the RT signal
> > queue [almost] never overflows, it certainly can, and servers have to be
> > able to handle it.
>
> Don't let that bother you. In the case where you get a hit on a significant
> fraction of the descriptors you are polling on, poll is very efficient. The
> inefficiency comes when you have to wade through 10,000 uninteresting file
> descriptors to find the one interesting one. If the poll set is rich in
> ready descriptors, there is little advantage to signal queues over poll
> itself.
>
> In fact, if you assume the percentage of ready file descriptors (as opposed
> to the number of file descriptors) is constant, then poll is just as
> scalable (theoretically) as any other method. Under both schemes, with twice
> as many file descriptors you have to do twice as much work.

Yep, I've made similar arguments myself. It's just that seeing poll() take 14 milliseconds to return on a 650 MHz system is a little daunting. I'll report again when I have results for RT signal stuff and different percentages of idle sockets (probably 0, 1, and 10).

- Dan
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote:
>
> Jordan Mendelson ([EMAIL PROTECTED]) wrote:
> > An implementation of /dev/poll for Linux already exists and has shown to
> > be more scalable than using RT signals under my tests. A patch for 2.2.x
> > and 2.4.x should be available at the Linux Scalability Project @
> > http://www.citi.umich.edu/projects/linux-scalability/ in the patches
> > section.
>
> If you'll look at the page I linked to in my original post,
> http://www.kegel.com/dkftpbench/Poller_bench.html
> you'll see that I also benchmarked /dev/poll.

The Linux /dev/poll implementation has a few "non-standard" features such as the ability to mmap() the poll structure memory to eliminate a memory copy:

	int dpoll_fd;
	unsigned char *dpoll;
	struct pollfd *mmap_dpoll;

	dpoll_fd = open("/dev/poll", O_RDWR, 0);
	ioctl(dpoll_fd, DP_ALLOC, 10000);

	dpoll = mmap(0, DP_MMAP_SIZE(10000), PROT_WRITE|PROT_READ, MAP_SHARED, dpoll_fd, 0);
	mmap_dpoll = (struct pollfd *)dpoll;

Use this memory when reading and write() to add/remove and see if you get any boost in performance. Also, I believe there is a hash table associated with /dev/poll in the kernel patch which might slow down your performance tests when it's first growing to resize itself.

Jordan
RE: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 10,000 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. > Unfortunately, when the RT signal queue overflows, the consensus seems > to be that you fall back to a big poll(). And even though the RT signal > queue [almost] never overflows, it certainly can, and servers have to be > able to handle it. Don't let that bother you. In the case where you get a hit on a significant fraction of the descriptors you are polling on, poll is very efficient. The inefficiency comes when you have to wade through 10,000 uninteresting file descriptors to find the one interesting one. If the poll set is rich in ready descriptors, there is little advantage to signal queues over poll itself. In fact, if you assume the percentage of ready file descriptors (as opposed to the number of file descriptors) is constant, then poll is just as scalable (theoretically) as any other method. Under both schemes, with twice as many file descriptors you have to do twice as much work. DS
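[A toy cost model (my notation, not David's) makes the constant-fraction argument precise: if one poll() over $n$ descriptors costs $t(n) = a + cn$ and a fixed fraction $p$ of them are ready, each call retires $pn$ events, so the cost per serviced event is

\[ \frac{t(n)}{pn} = \frac{a + cn}{pn} = \frac{a}{pn} + \frac{c}{p} \longrightarrow \frac{c}{p} \qquad (n \to \infty), \]

a constant: doubling the descriptors doubles the scan, but also doubles the events retired per call.]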
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 06:42:39PM -0700, Linus Torvalds wrote: > > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > Also with the poll table mmap'ed via /dev/poll and the optimizations I > > described poll could be made quite nice (I know that queued SIGIO exists, > > but it has its drawbacks too and often you need to fall back to poll anyway) > > The problem is that your proposed optimizations would be horrible: the > poll events themselves happen when some other process is running, so you'd > have to do some serious VM hacking and consider the poll table to be some > kind of direct-IO thing etc. Ugh. I don't see the problem. You have the poll table allocated in the kernel, the drivers directly change it, and the user mmaps it (I was not proposing to let poll make a kiobuf out of the passed array). > > And the file->fd mapping is not a simple mapping, it's a 1:m mapping from > file -> potentially many [process,fd] pairs. Yes, but in 95% of all cases it is a 1:1 mapping, so you could just fall back to the old slow method when that isn't the case. > > Doing the caching across multiple poll-calls is even worse, you'd have to > cache where in user space people had the array etc. Not pretty. The file -> fdnum reverse table does not depend on where the array sits in user space. On every poll call you can walk the poll table and fix the pointer from file to poll entry. That can be done cheaply during copyin for normal poll. mmap'ed /dev/poll could probably use a lazy method (cache the pointers, and verify and walk again if the verify fails). > I'm sure it can be speeded up, I'm just not sure it's really worth it if > you'd instead just have a better interface that wouldn't need this crap, > and consider poll() as just a compatibility layer. What is your favourite interface then? /dev/poll is nice in that it can be relatively easily hacked into older programs. -Andi
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > Also with the poll table mmap'ed via /dev/poll and the optimizations I > described poll could be made quite nice (I know that queued SIGIO exists, > but it has its drawbacks too and often you need to fall back to poll anyway) The problem is that your proposed optimizations would be horrible: the poll events themselves happen when some other process is running, so you'd have to do some serious VM hacking and consider the poll table to be some kind of direct-IO thing etc. Ugh. And the file->fd mapping is not a simple mapping, it's a 1:m mapping from file -> potentially many [process,fd] pairs. Doing the caching across multiple poll-calls is even worse, you'd have to cache where in user space people had the array etc. Not pretty. I'm sure it can be speeded up, I'm just not sure it's really worth it if you'd instead just have a better interface that wouldn't need this crap, and consider poll() as just a compatibility layer. Linus
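[A concrete userspace illustration of the 1:m point (my example, not Linus's): after dup() and fork(), one kernel struct file is reachable from several [process, fd] pairs, so a wakeup on the file alone does not identify a unique fd to report.]

    #include <unistd.h>

    int main(void)
    {
        int a = dup(0);      /* [parent,0] and [parent,a]: one struct file */

        if (fork() == 0) {
            /* now also [child,0] and [child,a] -- four [process,fd]
               pairs all sharing the same struct file */
            _exit(0);
        }
        return 0;
    }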
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 06:16:24PM -0700, Linus Torvalds wrote: > > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > It would be possible to setup a file -> fdnum reverse table (possibly cached > > over poll calls, I think Solaris does that) and let the async events directly > > change the bits in the output buffer in O(1). > > I disagree. > > Let's just face it, poll() is a bad interface scalability-wise. It is, but Linux poll is far from being a good implementation even within the limitations of the interface, as the comparison with Solaris shows. It is inherently O(n), but the constant factor in Linux could certainly be improved. Also with the poll table mmap'ed via /dev/poll and the optimizations I described, poll could be made quite nice (I know that queued SIGIO exists, but it has its drawbacks too, and often you need to fall back to poll anyway). > > > Also the current 2.4 poll is very wasteful both in memory and cycles > > for small numbers of fd. > > Yes, we could go back to the "optimize the case of n < 8" thing. Which > should be particularly easy with the new "poll_table_page" setup: the > thing is more abstracted than it used to be in 2.2. I did it for n < 40. It also puts the wait queues on the stack, so it is even more efficient than 2.0. I attached the patch for your reference; it is against test3 or so (but IIRC select/poll has not changed after that). I'm not proposing it for inclusion right now. It does only the easy parts, of course; I think with some VFS extensions it could be made a lot better. BTW, the Linux select can get a lower constant factor too, by simply testing whole words of the FD_SET for being zero and skipping them quickly (that makes it a lot cheaper when you have bigger holes in the select fd space). The patch does that too. -Andi

--- linux/fs/select.c	Mon Jul 24 07:39:44 2000
+++ linux-work/fs/select.c	Sat Aug 5 17:23:18 2000
@@ -12,6 +12,9 @@
  * 24 January 2000
  *     Changed sys_poll()/do_poll() to use PAGE_SIZE chunk-based allocation
  *     of fds to overcome nfds < 16390 descriptors limit (Tigran Aivazian).
+ *
+ * July 2000
+ *     Add fast poll/select (Andi Kleen)
  */
 
 #include <linux/slab.h>
@@ -24,6 +27,8 @@
 #define ROUND_UP(x,y) (((x)+(y)-1)/(y))
 #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)
 
+#define __MAX_POLL_TABLE_ENTRIES ((PAGE_SIZE - sizeof (struct poll_table_page)) / \
+		sizeof (struct poll_table_entry))
+
 struct poll_table_entry {
 	struct file * filp;
 	wait_queue_t wait;
@@ -32,42 +37,43 @@
 
 struct poll_table_page {
 	struct poll_table_page * next;
-	struct poll_table_entry * entry;
+	int nr, max;
 	struct poll_table_entry entries[0];
 };
 
-#define POLL_TABLE_FULL(table) \
-	((unsigned long)((table)->entry+1) > PAGE_SIZE + (unsigned long)(table))
-
 /*
- * Ok, Peter made a complicated, but straightforward multiple_wait() function.
- * I have rewritten this, taking some shortcuts: This code may not be easy to
- * follow, but it should be free of race-conditions, and it's practical. If you
- * understand what I'm doing here, then you understand how the linux
- * sleep/wakeup mechanism works.
- *
- * Two very simple procedures, poll_wait() and poll_freewait() make all the
- * work. poll_wait() is an inline-function defined in <linux/poll.h>,
- * as all select/poll functions have to call it to add an entry to the
- * poll table.
+ * Tune fast poll/select. Limit is the kernel stack.
  */
+#define FAST_SELECT_LIMIT 40
+#define FAST_POLL_LIMIT 40
+
+#define FAST_SELECT_BYTES ((FAST_SELECT_LIMIT + 7) / 8)
+#define FAST_SELECT_ULONG \
+	((FAST_SELECT_BYTES + sizeof(unsigned long) - 1) / sizeof(unsigned long))
+
+static inline void do_freewait(struct poll_table_page *p)
+{
+	struct poll_table_entry * entry;
+	entry = p->entries + p->nr;
+	while (p->nr > 0) {
+		p->nr--;
+		entry--;
+		remove_wait_queue(entry->wait_address, &entry->wait);
+		fput(entry->filp);
+	}
+}
 
 void poll_freewait(poll_table* pt)
 {
+	struct poll_table_page *old;
 	struct poll_table_page * p = pt->table;
 	while (p) {
-		struct poll_table_entry * entry;
-		struct poll_table_page *old;
-
-		entry = p->entry;
-		do {
-			entry--;
-			remove_wait_queue(entry->wait_address,&entry->wait);
-			fput(entry->filp);
-		} while (entry > p->entries);
 		old = p;
+		do_freewait(p);
 		p = p->next;
-		free_page((unsigned long) old);
+		if (old->max == __MAX_POLL_TABLE_ENTRIES) {
+			free_page((unsigned long) old);
+		}
 	}
 }
 
@@ -75,7 +81,7 @@
 {
 	struct poll_table_page *table = p->table;
 
-	if (!table || POLL_TABLE_FULL(table)) {
+	if (!table ||
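[The patch is truncated here in the archive. As an aside, an illustrative userspace rendering of the FD_SET word-skipping trick Andi describes above; the patch does the equivalent inside the kernel. The cast assumes glibc's layout of fd_set as an array of longs, and for_each_set_fd is an invented name:]

    #include <sys/select.h>
    #include <limits.h>

    #define LONG_BITS (sizeof(unsigned long) * CHAR_BIT)

    /* Call fn(fd) for every fd set in 'set', looking at individual bits
     * only inside non-zero words. */
    static void for_each_set_fd(fd_set *set, int maxfd, void (*fn)(int))
    {
        unsigned long *words = (unsigned long *) set;
        int nwords = (maxfd + LONG_BITS - 1) / LONG_BITS;
        int w, b;

        for (w = 0; w < nwords; w++) {
            unsigned long word = words[w];
            if (word == 0)
                continue;                 /* skip 32/64 idle fds at once */
            for (b = 0; b < (int) LONG_BITS; b++)
                if (word & (1UL << b))
                    fn(w * (int) LONG_BITS + b);
        }
    }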
Re: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 10,000 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. I would suggest you try multiple polling threads. Not only will you get better SMP scalability, if you have say 16 threads, each one only has to handle ~ 600 fds. Nick.
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > It would be possible to setup a file -> fdnum reverse table (possibly cached > over poll calls, I think Solaris does that) and let the async events directly > change the bits in the output buffer in O(1). I disagree. Let's just face it, poll() is a bad interface scalability-wise. If you want to do efficient events, you should have some other approach, like an event queue. Yes, I know it's a dirty word because NT uses it, but let's face it, poll() was a hack to make it easier to do something select()-like with the same machinery as select. > Also the current 2.4 poll is very wasteful both in memory and cycles > for small numbers of fd. Yes, we could go back to the "optimize the case of n < 8" thing. Which should be particularly easy with the new "poll_table_page" setup: the thing is more abstracted than it used to be in 2.2. Linus
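[To make "some other approach, like an event queue" concrete, a purely hypothetical sketch of such an interface; none of these evq_* calls exist anywhere. The shape is roughly that of FreeBSD's kqueue, which was already shipping at the time, and of the epoll API Linux grew later:]

    /* Register interest once; afterwards dequeue only the fds that
     * changed, so waiting is O(ready), not O(interest set). */
    struct ev {
        int fd;             /* which descriptor changed */
        short events;       /* POLLIN/POLLOUT-style mask that became true */
    };

    int evq_create(void);                       /* -> a queue descriptor   */
    int evq_add(int qfd, int fd, short events); /* register interest, once */
    int evq_del(int qfd, int fd);               /* drop interest           */
    int evq_wait(int qfd, struct ev *out,       /* fills only changed fds  */
                 int max, int timeout_ms);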
Re: Linux's implementation of poll() not scalable?
Jordan Mendelson ([EMAIL PROTECTED]) wrote: > An implementation of /dev/poll for Linux already exists and has been shown to > be more scalable than using RT signals under my tests. A patch for 2.2.x > and 2.4.x should be available at the Linux Scalability Project @ > http://www.citi.umich.edu/projects/linux-scalability/ in the patches > section. If you'll look at the page I linked to in my original post, http://www.kegel.com/dkftpbench/Poller_bench.html you'll see that I also benchmarked /dev/poll. When finding 1 active fd among 10,000, the Solaris implementation creamed the living snot out of the Linux one, even though the Solaris hardware was 5 or so times slower (see the lmbench results on that page). Here are the numbers (times in microseconds):

On a 167MHz sun4u Sparc Ultra-1 running SunOS 5.7 (Solaris 7) Generic_106541-11:

  pipes       100   1000   10000
  select      151      -       -
  poll        470    676    3742
  /dev/poll    61     70      92

On an idle 650 MHz dual Pentium III running Red Hat Linux 6.2 with kernel 2.2.14smp plus the /dev/poll patch:

  pipes       100   1000   10000
  select       28      -       -
  poll         23    890   11333
  /dev/poll    19    146    4264

On the same machine as above, but with vanilla kernel 2.4.0-test10-pre4 smp:

  pipes       100   1000   10000
  select       52      -       -
  poll         49   1184   14660

> It works fairly well, but I was actually somewhat disappointed to find > that it wasn't the primary cause for the system CPU suckage for my > particular system. Granted, when you only have to poll a few times per > second, the overhead of standard poll() just isn't that bad. If you have to poll 100 or fewer sockets, Linux's poll is quite good. If you have to poll 1000 or so sockets, Linux's /dev/poll works well (I wish it were available for the current kernels). But if you have to poll 10,000 or so sockets on Linux, neither standard poll nor /dev/poll as currently implemented performs adequately. I suspect that RT signals on Linux will do much better for N=10,000 than the current /dev/poll, and hope to benchmark that soon. - Dan
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 11:14:35AM -0700, Linus Torvalds wrote: > [ Small treatise on "scalability" included. People obviously do not > understand what "scalability" really means. ] > > In article <[EMAIL PROTECTED]>, > Dan Kegel <[EMAIL PROTECTED]> wrote: > >I ran a benchmark to see how long a call to poll() takes > >as you increase the number of idle fd's it has to wade through. > >I used socketpair() to generate the fd's. > > > >Under Solaris 7, when the number of idle sockets was increased from > >100 to 10,000, the time to check for active sockets with poll() > >increased by a factor of only 6.5. That's a sublinear increase in time, > >pretty spiffy. > > Yeah. It's pretty spiffy. > > Basically, poll() is _fundamentally_ an O(n) interface. There is no way > to avoid it - you have an array, and there simply is _no_ known > algorithm to scan an array in faster than O(n) time. Sorry. One problem in Linux is that it scans multiple times. At least 4 times currently: copyin, setup of wait queues, ask for results, copyout. copyin and copyout are relatively cheap (and could be made even cheaper with /dev/poll); the problem is the two other passes, which involve a lot of function pointer calls that generally cause pipeline stalls in modern CPUs. It would be possible to set up a file -> fdnum reverse table (possibly cached over poll calls, I think Solaris does that) and let the async events directly change the bits in the output buffer in O(1). This would save one pass. It may also be possible to cache the wait queue setup over polls; this would make poll much cheaper in terms of cache lines used. Also the current 2.4 poll is very wasteful both in memory and cycles for small numbers of fd. I did some experiments with a poll fast path for small n by falling back to the 2.0 stack allocation method, and it decreased latency dramatically (by more than 30%). It also saves a lot of memory, because all the daemons only polling a few fds wouldn't use two additional pages on top of their stack page. [patches are available, I just didn't want to submit them during code freeze] -Andi
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > > Linus Torvalds wrote: > > Dan Kegel wrote: > >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ] > >> [ With only one active fd and N idle ones, poll's execution time scales > >> [ as 6N on Solaris, but as 300N on Linux. ] > > > > Basically, poll() is _fundamentally_ an O(n) interface. There is no way > > to avoid it - you have an array, and there simply is _no_ known > > algorithm to scan an array in faster than O(n) time. Sorry. > > ... > > Under Linux, I'm personally more worried about the performance of X etc, > > and small poll()'s are actually common. So I would argue that the > > Solaris scalability is going the wrong way. But as performance really > > depends on the load, and maybe that 10,000 entry load is what you > > consider "real life", you are of course free to disagree (and you'd be > > equally right ;) > The way I'm implementing RT signal support is by writing a userspace > wrapper to make it look like an OO version of poll(), more or less, > with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's. > When and if I get that working, I may move it into the kernel as an > implementation of /dev/poll -- and then I won't need to worry about > the RT signal queue overflowing anymore, and I won't care how scalable > poll() is. An implementation of /dev/poll for Linux already exists and has been shown to be more scalable than using RT signals under my tests. A patch for 2.2.x and 2.4.x should be available at the Linux Scalability Project @ http://www.citi.umich.edu/projects/linux-scalability/ in the patches section. It works fairly well, but I was actually somewhat disappointed to find that it wasn't the primary cause for the system CPU suckage for my particular system. Granted, when you only have to poll a few times per second, the overhead of standard poll() just isn't that bad. Jordan
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > Dan Kegel wrote: >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ] >> [ With only one active fd and N idle ones, poll's execution time scales >> [ as 6N on Solaris, but as 300N on Linux. ] > > Basically, poll() is _fundamentally_ an O(n) interface. There is no way > to avoid it - you have an array, and there simply is _no_ known > algorithm to scan an array in faster than O(n) time. Sorry. > ... > Under Linux, I'm personally more worried about the performance of X etc, > and small poll()'s are actually common. So I would argue that the > Solaris scalability is going the wrong way. But as performance really > depends on the load, and maybe that 10,000 entry load is what you > consider "real life", you are of course free to disagree (and you'd be > equally right ;) I don't think I was being as clueless as you feared. Solaris' poll() somehow manages to skip inactive fd's very efficiently. But I'm happy to agree that small poll()'s are very important. I'd prefer to never use big poll()'s. The RT signal stuff scales much better. I'm trying to write a server that handles 10,000 clients. On 2.4.x, the RT signal queue stuff looks like the way to achieve that. Unfortunately, when the RT signal queue overflows, the consensus seems to be that you fall back to a big poll(). And even though the RT signal queue [almost] never overflows, it certainly can, and servers have to be able to handle it. The way I'm implementing RT signal support is by writing a userspace wrapper to make it look like an OO version of poll(), more or less, with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's. When and if I get that working, I may move it into the kernel as an implementation of /dev/poll -- and then I won't need to worry about the RT signal queue overflowing anymore, and I won't care how scalable poll() is. - Dan
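[A minimal sketch of the RT-signal machinery Dan is wrapping, using the documented 2.4-era fcntl interface (F_SETOWN / F_SETSIG / O_ASYNC); handle_event and poll_everything are placeholders for the application's work and for the big-poll() fallback:]

    #define _GNU_SOURCE         /* F_SETSIG is Linux-specific */
    #include <fcntl.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    #define EVSIG (SIGRTMIN + 1)

    /* Put one socket into queued-signal, non-blocking async mode. */
    static void arm_fd(int fd)
    {
        int fl = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETOWN, getpid());        /* route the signal to us     */
        fcntl(fd, F_SETSIG, EVSIG);           /* queue an RT signal, with   */
                                              /* si_fd/si_band filled in    */
        fcntl(fd, F_SETFL, fl | O_NONBLOCK | O_ASYNC);
    }

    static void event_loop(void)
    {
        sigset_t sigs;
        siginfo_t info;
        struct timespec ts = { 1, 0 };

        sigemptyset(&sigs);
        sigaddset(&sigs, EVSIG);
        sigaddset(&sigs, SIGIO);              /* plain SIGIO = queue overflowed */
        sigprocmask(SIG_BLOCK, &sigs, NULL);  /* collect them synchronously    */

        for (;;) {
            int sig = sigtimedwait(&sigs, &info, &ts);
            if (sig == EVSIG) {
                /* info.si_fd names the socket, info.si_band is a poll mask */
                /* handle_event(info.si_fd, info.si_band);  -- placeholder  */
            } else if (sig == SIGIO) {
                /* The RT queue overflowed and events were lost; the only
                   safe recovery is one big poll() over everything. */
                /* poll_everything();  -- placeholder */
            }
        }
    }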
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Tobias Ringstrom wrote: > On 23 Oct 2000, Linus Torvalds wrote: > > > Either > > > > (a) Solaris has solved the faster-than-light problem, and Sun engineers > > should get a Nobel prize in physics or something. > > > > (b) Solaris "scales" by being optimized for 10,000 entries, and not > > speeding up sufficiently for a small number of entries. > > > > You make the call. > > You will probably get the 6.5 factor because you have some (big) constant > setup time. For example > > t = 1700 + n > > This gives you an increase of 6.5 from 100 to 10,000, although it is of > course O(n). No magic there... :-) Indeed. NOTE! I'm not saying that this is necessarily bad. It may well be that the setup time means that the Solaris code actually _does_ perform really well for the 10,000 entry case. I hope nobody took my rant against "scalability" to mean that I consider the Solaris case to necessarily be bad engineering. My rant was more about people often thinking that "scalability == performance", which it isn't. It may be that Solaris simply wants to do 10,000 entries really fast and that they actually do really well, but it is clear that if so they have a huge performance hit for the small cases, and people should realize that. It's basically a matter of making the proper trade-offs. I think that the "few file descriptors" case is actually arguably a very important one. Sun (who has long since given up on the desktop) probably have a different tradeoff, and maybe their big constant setup-time is worth it. Linus
Re: Linux's implementation of poll() not scalable?
On 23 Oct 2000, Linus Torvalds wrote: > Either > > (a) Solaris has solved the faster-than-light problem, and Sun engineers > should get a Nobel prize in physics or something. > > (b) Solaris "scales" by being optimized for 10,000 entries, and not > speeding up sufficiently for a small number of entries. > > You make the call. You will probably get the 6.5 factor because you have some (big) constant setup time. For example

t = 1700 + n

This gives you an increase of 6.5 from 100 to 10,000 (since (1700 + 10000) / (1700 + 100) = 11700 / 1800 = 6.5), although it is of course O(n). No magic there... :-) /Tobias
Re: Linux's implementation of poll() not scalable?
Hi, If you have similar machines (in terms of machine configuration) for both your Solaris and Linux tests... could you tell us what the difference in total time for 100 and 10,000 was? I.e., don't compare Solaris with 100 descriptors vs. Solaris with 10,000 descriptors, but rather Linux 100 descriptors vs. Solaris 100 descriptors AND Linux 10,000 descriptors vs. Solaris 10,000 descriptors. That would be useful information... I think. Thanks Lyle
RE: Linux's implementation of poll() not scalable?
> Under Solaris 7, when the number of idle sockets was increased from > 100 to 10,000, the time to check for active sockets with poll() > increased by a factor of only 6.5. That's a sublinear increase in time, > pretty spiffy. Under Solaris 7, when the number of idle sockets was decreased from 10,000 to 100, the time to check for active sockets with poll() decreased by a factor of only 6.5. Shouldn't it be more like 100? Sounds like Solaris' poll implementation is horribly broken for the most common case. DS
Re: Linux's implementation of poll() not scalable?
[ Small treatise on "scalability" included. People obviously do not understand what "scalability" really means. ] In article <[EMAIL PROTECTED]>, Dan Kegel <[EMAIL PROTECTED]> wrote: >I ran a benchmark to see how long a call to poll() takes >as you increase the number of idle fd's it has to wade through. >I used socketpair() to generate the fd's. > >Under Solaris 7, when the number of idle sockets was increased from >100 to 10,000, the time to check for active sockets with poll() >increased by a factor of only 6.5. That's a sublinear increase in time, >pretty spiffy. Yeah. It's pretty spiffy. Basically, poll() is _fundamentally_ an O(n) interface. There is no way to avoid it - you have an array, and there simply is _no_ known algorithm to scan an array in faster than O(n) time. Sorry. (Yeah, you could parallelize it. I know, I know. Put one CPU on each entry, and you can get it down to O(1). Somehow I doubt Solaris does that. In fact, I'll bet you a dollar that it doesn't). So what does this mean? Either (a) Solaris has solved the faster-than-light problem, and Sun engineers should get a Nobel prize in physics or something. (b) Solaris "scales" by being optimized for 10,000 entries, and not speeding up sufficiently for a small number of entries. You make the call. Basically, for poll(), perfect scalability is that poll() scales by a factor of 100 when you go from 100 to 10,000 entries. Anybody who does NOT scale by a factor of 100 is not scaling right - and claiming that 6.5 is a "good" scale factor only shows that you've bought into marketing hype. In short, a 6.5 scale factor STINKS. The only thing it means is that Solaris is slow as hell on the 100 descriptor case. >Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets >was increased from 100 to 10,000, the time to check for active sockets >with poll() increased by a factor of 493 [or 300, respectively]. So, what you're showing is that Linux actually is _closer_ to the perfect scaling (Linux is off by a factor of 5, while Solaris is off by a factor of 15 from the perfect scaling line, and scales down really badly). Now, that factor of 5 (or 3, for 2.4.0) is still bad. I'd love to see Linux scale perfectly (which in this case means that 10,000 fd's should take exactly 100 times as long to poll() as 100 entries take). But I suspect that there are a few things going on, one of the main ones probably being that the kernel data working set for 100 entries fits in the cache or something like that. >Please, somebody point out my mistake. Linux can't be this bad! I suspect we could improve Linux in this area, but I hope that I pointed out the most fundamental mistake you did, which was thinking that "scalability" equals "speed". It doesn't. Scalability really means that the effort to handle a problem grows reasonably with the hardness of the problem. And _deviations_ from that are indications of something being wrong. Some people think that super-linear improvements in scalability are signs of "goodness". They aren't. For example, the classical reason for super-linear SMP improvement (with number of CPU's) that people get so excited about really means that something is wrong on the low end. Often the "wrongness" is lack of cache - some problems will scale better than perfectly simply because with multiple CPU's you have more cache. The "wrongness" is often also selecting the wrong algorithm: something that "scales well" by just being horribly slow for the small case, and being "less bad" for the big cases. In the end, the notion of "scalability" is meaningless. The only meaningful thing is how quickly something happens for the load you have. That's something called "performance", and unlike "scalability", it actually has real-life meaning. Under Linux, I'm personally more worried about the performance of X etc, and small poll()'s are actually common. So I would argue that the Solaris scalability is going the wrong way. But as performance really depends on the load, and maybe that 10,000 entry load is what you consider "real life", you are of course free to disagree (and you'd be equally right ;) Linus
Linux's implementation of poll() not scalable?
I ran a benchmark to see how long a call to poll() takes as you increase the number of idle fd's it has to wade through. I used socketpair() to generate the fd's. Under Solaris 7, when the number of idle sockets was increased from 100 to 10,000, the time to check for active sockets with poll() increased by a factor of only 6.5. That's a sublinear increase in time, pretty spiffy. Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets was increased from 100 to 10,000, the time to check for active sockets with poll() increased by a factor of 493 [or 300, respectively]. This is terribly, horribly bad scaling behavior, since the time required appears to increase three times faster than the number of sockets. Detailed measurements and the source code used to make them are available at http://www.kegel.com/dkftpbench/Poller_bench.html Please, somebody point out my mistake. Linux can't be this bad! - Dan
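[A minimal reconstruction of the kind of harness the post describes; the real one is Poller_bench at the URL above. This version creates N idle socketpairs, makes exactly one fd active, and times a single poll(); it assumes the fd limit (ulimit -n) has been raised well above 2N:]

    #include <sys/socket.h>
    #include <sys/time.h>
    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 10000;
        struct pollfd *pfds = malloc(n * sizeof(*pfds));
        int *peers = malloc(n * sizeof(*peers));
        struct timeval t0, t1;
        long us;
        int i, nready, sv[2];

        for (i = 0; i < n; i++) {
            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
                perror("socketpair");   /* probably the fd limit */
                return 1;
            }
            pfds[i].fd = sv[0];
            pfds[i].events = POLLIN;
            peers[i] = sv[1];           /* keep the peer open so sv[0] stays idle */
        }
        write(peers[n - 1], "x", 1);    /* exactly one active descriptor */

        gettimeofday(&t0, NULL);
        nready = poll(pfds, n, 0);
        gettimeofday(&t1, NULL);

        us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("poll() over %d fds (%d ready): %ld us\n", n, nready, us);
        return 0;
    }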