Re: Linux's implementation of poll() not scalable?
Mike Jagdis wrote: > This patch firstly extends the wait queue mechanism > to allow an arbitrary action to be performed. Then I rewrote > the select/poll implementation to use event queueing to avoid > rescanning descriptors that had not changed - and restructured > the loops to be rather more efficient. This approach doesn't > need any changes to driver poll routines, it doesn't need > backwards mapping struct files. ... > Performance graphs and the lmbench derived test programs I > used are at http://www.purplet.demon.co.uk/linux/select/ ... > Oh, and I updated this patch for 2.4.0-test9. I can't wait to run my benchmark on it... hope I can get to it soon. BTW, can you update that web page to also point to your patch? - Dan
RE: Linux's implementation of poll() not scalable?
Here's something I did last year and then put on ice, partly through lack of time and partly because I thought I'd pick it up for 2.5. All this talk of event queues misses one thing: we already have an event queue mechanism. They're called wait queues. The only problem is that the only on-event action possible is to wake the process (assuming it was asleep in the first place). This patch firstly extends the wait queue mechanism to allow an arbitrary action to be performed. Then I rewrote the select/poll implementation to use event queueing to avoid rescanning descriptors that had not changed - and restructured the loops to be rather more efficient. This approach doesn't need any changes to driver poll routines, it doesn't need backwards mapping struct files. It should be fairly easy to implement a /dev/poll mechanism using this, although I haven't yet. Yes, the change to wait queues has a slight cost, but it isn't great and the main part of it only happens if you actually sleep. Performance graphs and the lmbench derived test programs I used are at http://www.purplet.demon.co.uk/linux/select/ (bounce in and out of the index page 'cos the next and prev buttons aren't wired up :-) ) Oh, and I updated this patch for 2.4.0-test9. Comments and opinions are, as always, welcome :-). Mike select.patch
Readiness vs. completion (was: Re: Linux's implementation of poll() not scalable?)
John Gardiner Myers <[EMAIL PROTECTED]> wrote: > Your proposed interface suffers from most of the same problems as the > other Unix event interfaces I've seen. Key among the problems are > inherent race conditions when the interface is used by multithreaded > applications. > > The "stickiness" of the event binding can cause race conditions where > two threads simultaneously attempt to handle the same event. For > example, consider a socket that becomes writeable, delivering a writable > event to one of the multiple threads calling get_events(). While the > callback for that event is running, it writes to the socket, causing the > socket to become non-writable and then writeable again. That in turn > can cause another writable event to be delivered to some other thread. > ... > In the async I/O library I work with, this problem is addressed by > having delivery of the event atomically remove the binding. If the > event needs to be bound after the callback is finished, then it is the > responsibility of the callback to rebind the event. IMHO you're describing a situation where a 'completion notification event' (as with aio) would be more appropriate than a 'readiness notification event' (as with poll). With completion notification, one naturally expects 'edge triggered', 'one shot' behavior from the notification system, with no event coalescing, and there is no need to remove or reestablish bindings. > There are three performance issues that need to be addressed by the > implementation of get_events(). One is that events should preferably be given > to threads on the same CPU as the one that bound the event. That allows the > event's context to remain in the CPU's cache. > > Two is that of threads on a given CPU, events should wake threads in > LIFO order. This allows excess worker threads to be paged out. > > Three is that the event queue should limit the number of worker threads > contending with each other for CPU. If an event is triggered while > there are enough worker threads in runnable state, it is better to wait > for one of those threads to block before delivering the event. That describes NT's 'completion port / thread pooling' scheme, I think (which incidentally is a 'completion notification'-based rather than a 'readiness notification'-based scheme). I suspect readiness notification using edge triggering is a strange beast, not often seen in the wild, and hard to define precisely. I'm going to risk generalizing, and categorizing the existing base of application software into two groups. Would it be going too far to say the following: Readiness notification, like that provided by traditional poll(), fits naturally with level-triggered events with event coalescing, and a large body of traditional Unix software exists that uses this paradigm. Completion notification, like that provided by aio and NT's networking, fits naturally with edge-triggered events with no event coalescing, and a large body of win32 software exists that uses this paradigm. And, come to think of it, network programmers usually can be categorized into the same two groups :-) Each style of programming is an acquired taste. IMHO if Linux is to be maximally popular with software developers (desirable if we want to boost the number of apps available for Linux), it would help to cater to both flavors of network programming. So I'd like to see both a high-performance level-triggered readiness notification API with event coalescing, and a high-performance edge-triggered completion API with no event coalescing.
With luck, they'll be the same API, but with slightly different flag values. - Dan
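To make the remove-on-delivery discipline described above concrete, here is a minimal sketch; the aio_* names and the one-shot bind semantics are assumptions standing in for whatever the library actually calls them, not a real API:

    /*
     * Sketch of "delivery atomically removes the binding": the callback
     * must rebind for whatever it wants next.  aio_bind() and its types
     * are hypothetical stand-ins, not a real library interface.
     */
    #include <unistd.h>

    struct aio_event;                       /* opaque library handle */
    typedef void (*aio_cb)(struct aio_event *ev, int fd, void *arg);

    #define AIO_READ  1
    #define AIO_WRITE 2

    /* Assumed: delivers at most one event, then the binding is gone. */
    extern int aio_bind(int fd, int mask, aio_cb cb, void *arg);

    static void on_writable(struct aio_event *ev, int fd, void *arg);

    static void on_readable(struct aio_event *ev, int fd, void *arg)
    {
        char cmd[512];
        ssize_t n = read(fd, cmd, sizeof cmd);  /* read one command */
        if (n <= 0)
            return;                             /* EOF/error: drop fd */
        /* ... parse command, build response ... */
        /* The connection now wants a WRITE event, not another READ, so
         * the callback rebinds accordingly; no stale binding remains. */
        aio_bind(fd, AIO_WRITE, on_writable, arg);
    }

    static void on_writable(struct aio_event *ev, int fd, void *arg)
    {
        /* ... write the response, then rebind for the next command ... */
        aio_bind(fd, AIO_READ, on_readable, arg);
    }

Because at most one thread holds the binding at a time, the two-threads-one-event race described above cannot arise.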
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > So sticky arrays of events are good, while queues are bad. Let's take that > as one of the fundamentals. Please let's not. There is nothing fundamentally different between an event queue of size N and an interest set of size N. Your proposed interface suffers from most of the same problems as the other Unix event interfaces I've seen. Key among the problems are inherent race conditions when the interface is used by multithreaded applications. The "stickiness" of the event binding can cause race conditions where two threads simultaneously attempt to handle the same event. For example, consider a socket that becomes writeable, delivering a writable event to one of the multiple threads calling get_events(). While the callback for that event is running, it writes to the socket, causing the socket to become non-writable and then writeable again. That in turn can cause another writable event to be delivered to some other thread. In the async I/O library I work with, this problem is addressed by having delivery of the event atomically remove the binding. If the event needs to be bound after the callback is finished, then it is the responsibility of the callback to rebind the event. Typically, a server implements a command/response protocol, where the read callback reads a command, writes the response, and repeats. It is quite likely that after the read callback completes, the connection is interested in a write event, not another read event. Cancellation of event bindings is another area prone to race conditions. A thread canceling an event binding frequently needs to know whether or not that event has been delivered to some other thread. If it has, the canceling thread has to synchronize with the other thread before destroying any resources associated with the event. Multiple event queues are needed by libraries, as different libraries can have different thread pools. The threads in those different pools can have different characteristics, such as stack size, priority, CPU affinity, signal state, etc. There are three performance issues that need to be addressed by the implementation of get_events(). One is that events should preferably be given to threads on the same CPU as the one that bound the event. That allows the event's context to remain in the CPU's cache. Two is that of threads on a given CPU, events should wake threads in LIFO order. This allows excess worker threads to be paged out. Three is that the event queue should limit the number of worker threads contending with each other for CPU. If an event is triggered while there are enough worker threads in runnable state, it is better to wait for one of those threads to block before delivering the event.
RE: Linux's implementation of poll() not scalable?
> It doesn't practically matter how efficient the X server is when > you aren't busy, after all. A simple polling scheme (i.e. not using poll() or select(), just looping through all fd's trying nonblocking reads) is perfectly efficient when the server is 100% busy, and perfectly inefficient when there is nothing to do. I'm not saying that your statements are wrong--in your example, X is calling select() which is not wasting as much time as a hard-polling loop--but it's wrong to say that high-load efficiency is the primary concern. I would be horrified if X took a significant portion of the CPU time when many clients were connected, but none were actually doing anything. chris
Re: Linux's implementation of poll() not scalable?
In article <[EMAIL PROTECTED]> you write: >Linus Torvalds wrote: >> I'd much rather have an event interface that is documented to be edge- >> triggered and is really _lightweight_, than have another interface that >> starts out with some piggy features. > >Agreed (except for that 'edge-triggered' part), but I don't think >'level-triggered' implies piggy. I haven't benchmarked whether >kqueue() slows down the networking layer of FreeBSD yet; do you >suspect maintaining the level-triggered structures actually is >a bottleneck for them? I really don't think it's a bottleneck. At the moment, all events are maintained on a linked list. To dequeue an event, we simply: 1. take the event on the front of the list. 2. validate event. (call filter function) 3. copy event into return array. 4. put event back on end of list. If the `EV_ONESHOT' flag is set, we skip steps 2 & 4, and destroy the event after it is returned to the user. (we want to wait only once for this particular event) If the `EV_CLEAR' flag is set, we skip step 4. (pure edge-triggered delivery) Step 4 is pretty simple, just re-insertion back onto the queue. If you eliminate Step 2, then you have a `correctness' issue, where the application must deal with stale events. The validation function is equally lightweight and doesn't (IMO) cause a performance problem. >> ... the "re-bind()" approach works very simply, and means that the >> overhead of testing whether the event is still active is not a generic >> thing that _always_ has to be done, but something where the application >> can basically give the kernel the information that "this time we're >> leaving the event possibly half-done, please re-test now". > >Hmm. I don't like the extra system call, though. Any chance you'd be >willing to make get_events() take a vector of bind requests, so we can >avoid the system call overhead of re-binding? (Or is that too close >to kqueue for you?) IMO, I'd think that the calls should be orthogonal. If the "get_events()" call returns an array, why shouldn't the "bind_request()" call as well? Otherwise you're only amortizing the system calls in one direction. >And are you sure apps will always know whether they need to rebind? >Sometimes you're faced with a protocol stack which may or may not >read the requests fully, and which you aren't allowed to change. >It'd be nice to still have a high-performance interface that can deal with >that situation. Agreed. -- Jonathan
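For illustration only, the four dequeue steps above might be shaped like this in C; every name here is invented, and this is not FreeBSD's kqueue source:

    /* Invented names throughout; just the shape of the algorithm. */
    struct kevent { int ident; int filter; int flags; };
    struct knote  { struct kevent event; int flags; };
    struct kq;

    #define EV_ONESHOT 0x01
    #define EV_CLEAR   0x02

    extern struct knote *queue_pop(struct kq *kq);           /* assumed */
    extern void queue_push(struct kq *kq, struct knote *kn); /* helpers */
    extern int  filter_still_active(struct knote *kn);
    extern void knote_destroy(struct knote *kn);

    int kqueue_scan(struct kq *kq, struct kevent *out, int maxev)
    {
        int n = 0;
        struct knote *kn;

        while (n < maxev && (kn = queue_pop(kq)) != NULL) {   /* step 1 */
            if (!(kn->flags & EV_ONESHOT) &&
                !filter_still_active(kn))                     /* step 2 */
                continue;      /* stale: drop from queue, keep binding */
            out[n++] = kn->event;                             /* step 3 */
            if (kn->flags & EV_ONESHOT)
                knote_destroy(kn);     /* wait only once, then destroy */
            else if (!(kn->flags & EV_CLEAR))
                queue_push(kq, kn);    /* step 4: back on end of list */
            /* EV_CLEAR skips step 4: pure edge-triggered delivery */
        }
        return n;
    }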
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > However, we also need to remember what got us to this discussion in the > first place. One of the reasons why poll() is such a piggy interface is > exactly because it tries to be "nice" to the programmer. poll() is a piggy interface because it is O(n) in polled file descriptors, not because of its level-triggered nature. > I'd much rather have an event interface that is documented to be edge- > triggered and is really _lightweight_, than have another interface that > starts out with some piggy features. Agreed (except for that 'edge-triggered' part), but I don't think 'level-triggered' implies piggy. I haven't benchmarked whether kqueue() slows down the networking layer of FreeBSD yet; do you suspect maintaining the level-triggered structures actually is a bottleneck for them? > I do understand that level to some degree is "nicer", but everybody pretty > much agrees that apart from requiring more care, edge-triggered certainly > does everything the level-triggered interfaces do. Agreed. > For example, if you want to only partially handle an event (one of the > more valid arguments I've seen, although I do not agree with it actually > being all that common or important a thing to do), the edge-triggered > interface _does_ allow for that. It's fairly easy, in fact: you just > re-bind the thing. > > (My suggested "bind()" interface would be to just always add a newly bound > fd to the event queue, but I would not object to a "one-time test for > active" kind of "bind()" interface either - which would basically allow > for a poll() interface - and the existing Linux internal "poll()" > implementation actually already allows for exactly this so it wouldn't > even add any complexity). > ... the "re-bind()" approach works very simply, and means that the > overhead of testing whether the event is still active is not a generic > thing that _always_ has to be done, but something where the application > can basically give the kernel the information that "this time we're > leaving the event possibly half-done, please re-test now". Hmm. I don't like the extra system call, though. Any chance you'd be willing to make get_events() take a vector of bind requests, so we can avoid the system call overhead of re-binding? (Or is that too close to kqueue for you?) And are you sure apps will always know whether they need to rebind? Sometimes you're faced with a protocol stack which may or may not read the requests fully, and which you aren't allowed to change. It'd be nice to still have a high-performance interface that can deal with that situation. - Dan
Re: Linux's implementation of poll() not scalable?
Jim Gettys wrote: > So I want an interface in which I can get as many events as possible > at once, and one in which the events themselves can have appropriate > aggregation behavior. It isn't quite clear to me if the proposed interface > would have this property. I believe get_event, /dev/poll, and kqueue all share this property. e.g. none of them will present multiple POLLIN events per fd per call. Is that what you meant? - Dan
Re: Linux's implementation of poll() not scalable?
Note that there is another aspect to the efficiency / performance of the select/poll style of interfaces not immediately obvious, but which occurs as a result of how some (streaming/batching) protocols work. An X server does not call select all that often (probably one of the two most frequently used programs that care; though I believe getting the Apache case right is more important). X is such a streaming protocol: it is a feature that I don't have to do extra reads or system calls to deal with more data arriving from a client. An X server doesn't want one event generated for each incoming TCP segment: it merely needs to know there is data available on a file descriptor as a binary condition. I really don't want to have to do one operation per segment; this is less efficient than the current situation. Similarly, it is a feature that with one call I can find out that there is work to do on multiple file descriptors. In short, the X server does a select, and then loops across all the file descriptors with work to do before doing another select: the system call overhead gets amortized across multiple clients and buffers received from the client. As the server gets busier, it is more and more likely that there is more than one client with work to do, and/or multiple TCP segments have arrived to process (in the common single-client-is-busy case). So we make the system call less and less often as a fraction of work done. This has the happy consequence that the select-caused overhead DROPS as a fraction of total work as the X server gets busier, and X is most efficient at the point in time you care the most: when you have the most work to do. The system call is returning more information each time it is called, and some of that information is aggregated as well (additional data arriving). It doesn't practically matter how efficient the X server is when you aren't busy, after all. This aggregation property is therefore important, and there needs to be some way to achieve this, IMHO. Web servers often have similar behavior, though since most current HTTP clients don't implement streaming behavior, the benefit is currently much lower (would that HTTP clients start driving HTTP servers the way the HTTP/1.1 protocol allows... Sigh...). Right now, scaling to large numbers of descriptors is most urgent for big web servers. So I want an interface in which I can get as many events as possible at once, and one in which the events themselves can have appropriate aggregation behavior. It isn't quite clear to me if the proposed interface would have this property. As I said in early talks about X: "X is an exercise in avoiding system calls" - Jim Gettys -- Jim Gettys Technology and Corporate Development Compaq Computer Corporation [EMAIL PROTECTED]
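The amortization Jim describes is the classic select-then-sweep loop; a sketch (helper names hypothetical, not X server code):

    /* Sketch of the select-then-sweep pattern; not actual X server code. */
    #include <stddef.h>
    #include <sys/select.h>

    extern fd_set interest;                   /* all client fds, kept current */
    extern int    maxfd;
    extern void   process_all_input(int fd);  /* hypothetical: drains every
                                                 buffered request on fd */

    void server_loop(void)
    {
        for (;;) {
            fd_set ready = interest;          /* select() scribbles on its arg */
            if (select(maxfd + 1, &ready, NULL, NULL, NULL) <= 0)
                continue;
            /* One system call, then sweep every ready client: the cost of
             * select() is amortized over all clients and over all the
             * segments that arrived while we were working. */
            for (int fd = 0; fd <= maxfd; fd++)
                if (FD_ISSET(fd, &ready))
                    process_all_input(fd);
        }
    }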
Re: Linux's implementation of poll() not scalable?
On Thu, 26 Oct 2000, Dan Kegel wrote: > > With level-triggered interfaces like poll(), your chances > of writing a correctly functioning program are higher because you > can throw events away (on purpose or accidentally) with no consequences; > the next time around the loop, poll() will happily tell you the current > state of all the fd's. Agreed. However, we also need to remember what got us to this discussion in the first place. One of the reasons why poll() is such a piggy interface is exactly because it tries to be "nice" to the programmer. I'd much rather have an event interface that is documented to be edge- triggered and is really _lightweight_, than have another interface that starts out with some piggy features. I do understand that level to some degree is "nicer", but everybody pretty much agrees that apart from requiring more care, edge-triggered certainly does everything the level-triggered interfaces do. For example, if you want to only partially handle an event (one of the more valid arguments I've seen, although I do not agree with it actually being all that common or important a thing to do), the edge-triggered interface _does_ allow for that. It's fairly easy, in fact: you just re-bind the thing. (My suggested "bind()" interface would be to just always add a newly bound fd to the event queue, but I would not object to a "one-time test for active" kind of "bind()" interface either - which would basically allow for a poll() interface - and the existing Linux internal "poll()" implementation actually already allows for exactly this so it wouldn't even add any complexity). > With edge-triggered interfaces, the programmer must be much more careful > to avoid ever dropping a single event; if he does, he's screwed. > This gives rise to complicated code in apps to remember the current > state of fd's in those cases where the programmer has to drop events. No, the "re-bind()" approach works very simply, and means that the overhead of testing whether the event is still active is not a generic thing that _always_ has to be done, but something where the application can basically give the kernel the information that "this time we're leaving the event possibly half-done, please re-test now". Advantage: performance again. The common case doesn't take the performance hit. Linus
Re: Linux's implementation of poll() not scalable?
"Eric W. Biederman" wrote: > > Dan Kegel <[EMAIL PROTECTED]> writes: > > It's harder to write correct programs that use edge-triggered events. > > Huh? The race between when an event is reported, and when you take action > on it effectively means all events are edge triggered. Nope. With any of these interfaces, whether level or edge, the app must treat all the events as hints, and be prepared for them to be wrong. That's why code that uses poll() tends to use nonblocking sockets. No matter what you do, you can't get away from that. Consider accepting a connection. Poll or whatever tells you a listening socket is ready for you to call accept(). Before you do, the other end sends an RST. Consequence: app must be prepared for a readiness event to be wrong. cf. http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0616.html This is orthogonal to the question of whether edge or level triggering is easier to write code for, I think. > So making the interface clearly edge triggered seems to be a win for > correctness. With level-triggered interfaces like poll(), your chances of writing a correctly functioning program are higher because you can throw events away (on purpose or accidentally) with no consequences; the next time around the loop, poll() will happily tell you the current state of all the fd's. With edge-triggered interfaces, the programmer must be much more careful to avoid ever dropping a single event; if he does, he's screwed. This gives rise to complicated code in apps to remember the current state of fd's in those cases where the programmer has to drop events. And there are many cases like that; see http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0529.html and http://boudicca.tux.org/hypermail/linux-kernel/2000week44/0592.html for examples. Better to let apps ask the kernel for the current state of each socket if they want, is what I say. This reduces the amount of code and effort needed to write many apps, *and makes migrating legacy apps to high-performance interfaces easier*. For new apps that are willing to maintain the state themselves, by all means, provide an edge-oriented interface, too. It's probably better if your code is manly enough to handle it. - Dan
Re: Linux's implementation of poll() not scalable?
Dan Kegel <[EMAIL PROTECTED]> writes: > It's harder to write correct programs that use edge-triggered events. Huh? The race between when an event is reported, and when you take action on it effectively means all events are edge triggered. So making the interface clearly edge triggered seems to be a win for correctness. Eric
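Either way, the race both posts describe is handled the same way in practice: act non-blockingly and treat readiness as a hint. A sketch of the defensive accept() Dan mentions (handle_new_connection is a hypothetical helper):

    /* Readiness is only a hint: if the peer sent an RST between the
     * readiness report and our accept(), we just see an error and move
     * on.  Assumes listen_fd was set O_NONBLOCK.  Sketch only. */
    #include <errno.h>
    #include <stddef.h>
    #include <sys/socket.h>

    extern void handle_new_connection(int fd);   /* hypothetical */

    void on_listen_ready(int listen_fd)
    {
        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);
            if (conn >= 0) {
                handle_new_connection(conn);
                continue;                   /* maybe more in the backlog */
            }
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                break;                      /* hint was stale, or drained */
            if (errno == ECONNABORTED || errno == EINTR)
                continue;                   /* aborted connection: harmless */
            break;                          /* real error: caller decides */
        }
    }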
Re: Linux's implementation of poll() not scalable?
Helge Hafting wrote: > > With poll(), it was *not a bug* for the user code to drop events; with > > your proposed interface, it *is a bug* for the user code to drop events. > > I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted > > incorrectly that your interface "has the same semantics as poll from > > the event perspective". > > Don't worry - existing interfaces will be around for existing > applications. I believe Linus is trying to engineer a new and better > interface that will make a difference - (server) programs written to use > the new stuff will show better performance and use up less kernel time. > > The semantics will be different, so only new programs will benefit. > However, the benefits may be so great that people will want to rewrite > programs to get them. It's harder to write correct programs that use edge-triggered events. Level-triggered events are inherently easier to use. If our high-performance interface only supports edge-triggered events, people who write high performance apps may well migrate to FreeBSD, where there is a lovely high performance interface that offers either level or edge triggered events, whichever you prefer. > Seems Linus doesn't want to do this no matter what you come up with. You > are free to try of course... Or free to switch to FreeBSD, if that's the only OS which has a high-performance poll replacement that I can get my existing codebase to work with. - Dan
Re: Linux's implementation of poll() not scalable?
On Tue, Oct 24, 2000 at 04:12:38PM -0700, Dan Kegel wrote: > With poll(), it was *not a bug* for the user code to drop events; with > your proposed interface, it *is a bug* for the user code to drop events. > I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted > incorrectly that your interface "has the same semantics as poll from > the event perspective". I missed this because I've never written anything that drops or forgets events and didn't think about it. Most programs will read() until EOF is returned and write() until EAGAIN is returned with non-blocking sockets. Is there any reason to ignore events other than to slow down response to some events in favor of others? I don't see why this is a problem as this interface _isn't_ replacing select or poll, so it shouldn't matter for existing programs that aren't converted to use the new interface. In any case, I think I would prefer that the kernel be optimized for the common case and leave any strange processing up to userspace so that the majority of programs which don't need this special case can run as fast as possible. Besides, it wouldn't be difficult for a program to stack up a list of events, even in the same structure as it would get from the kernel, so that it can process them later. At least then this data would be in swappable memory. Heck, even from an efficiency perspective, it would be faster for userspace to store the data as it wouldn't keep getting it returned from a syscall each time... Am I missing something else? Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. ]
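The read-until-EAGAIN discipline Simon mentions, as a sketch (consume() is a hypothetical callback):

    /* Drain a non-blocking fd completely; with this discipline a dropped
     * or duplicated readiness hint is harmless, because the next pass
     * reads whatever is actually buffered.  Sketch only. */
    #include <errno.h>
    #include <unistd.h>

    extern void consume(const char *buf, ssize_t n);   /* hypothetical */

    int drain_fd(int fd)
    {
        char buf[4096];
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n > 0)  { consume(buf, n); continue; }
            if (n == 0)  return 0;                     /* EOF */
            if (errno == EAGAIN || errno == EWOULDBLOCK)
                return 1;                              /* fully drained */
            if (errno == EINTR) continue;
            return -1;                                 /* real error */
        }
    }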
Re: Linux's implementation of poll() not scalable?
[EMAIL PROTECTED] (Linus Torvalds) wrote on 23.10.00 in <[EMAIL PROTECTED]>: > actually inform about the events. The way to do this simply is to limit it > in very clear ways, the most notable one being simply that there is only > one event queue per process (or rather, per "struct files_struct" - so > threads would automatically share the event queue). This keeps the While the rest looks fine, I suspect this one is a show-stopper. IMO, the ability to wait for separate event sets in separate threads is a MUST. In a single-threaded program, it would probably still be useful in a number of situations. Not that it's all that difficult to expand your model to do multiple queues. MfG Kai
Re: Linux's implementation of poll() not scalable?
> > > > On Tue, 24 Oct 2000, Evan Jeffrey wrote: > > > > > Multiple event queues are bad, because it completely breaks the notion of > > > event-driven programming. How do you want to listen to them all? You can't. > > > You can only listen to one event queue at a time - unless you create some > > > > You can listen to one event queue per thread. > > Oh, I agree. > > And I think something like CLONE_EVENTS would be fine - and decide > yourself what kind of threads you want (do you want indistinguishable > "anonymous" threads like apache, or do you want threads that handle > separate event queues). Or you might have a mixture of the two - for web > serving the "accept()" event list might be a separate thing with a few > threads doing that, while "worker threads" handle the actual IO.. All this is fine; but why force people to break up their apps into threads? [ from an earlier message ] > Trust me. Multiple event queues are idiotic. Anybody who _designs_ for > multiple event queues has some serious problems. So, are you now saying that "serious problems" are no longer to be had, provided that an app is chopped into threads? If you have one queue per process, multiple queues could be implemented in a straightforward manner. Add a "queue" fd type and a system call to create one. Allow any queuable fd (such as a socket) to be queued on the one implicit queue and also to be associated with zero or one explicitly-created queues. This would satisfy library or other code that wanted for whatever reason to have visibility into all fds with pending events. It would also give apps the ability to break up work amongst disjoint portions without forcing the use of multiple threads. This seems reasonable to implement. Or am I missing something? gid
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > The point where they disagree is when the event gets removed from the event > queue. For edge triggered, this one is trivial: when a get_events() thing > happens and moves it into user land. This is basically a one-liner, and it > is local to get_events() and needs absolutely no help from anybody else. > So obviously event removal is _very_ simple for edge-triggered events - > the INTACK basically removes the event (and also re-arms the trigger > logic: which is different from most interrupt controllers, so the analogy > falls down here). And IMHO here's a problem. The events are no longer events. They are just hints saying: after the previous get_events() something has happened. You don't know if you've already handled this event. There's no synchronization between what the app does and the triggering of 'hints'. For example your waitpid-loop: you get the event, start the waitpid-loop. While processing, another process dies. You handle it too (still in the loop). But a new 'hint' has already been registered. So on the next get_event you'll be notified again. I just hope every event-generator has a WNOHANG flag... It could even be possible that you are unable to perform some actions without triggering hints despite the fact that the conditions will already be gone before the next get_event. May generate a lot of bogus hints. At least the current semantic of for example "POLL_IN on fd was signaled so I may read without blocking" gets lost. Maybe (I don't know, kernel-wise) it makes sense to check in the kernel if the events to be returned to userspace are still valid. The user space has to do it anyway... But that way you get a more level-based design ;) Another thing: while toying with cooperative userspace multithreading I found it much more versatile to have a req_type/req_data tuple in the request structure (ie READ/, TIMEOUT/, WAKEUP/). Ciao, ET.
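The WNOHANG loop ET is alluding to, sketched (reap() is a hypothetical bookkeeping helper):

    /* Because "child exited" is only a hint, the handler reaps with
     * WNOHANG until nothing is left; a later redundant hint then finds
     * no children to reap and is harmless.  Sketch only. */
    #include <errno.h>
    #include <sys/wait.h>

    extern void reap(pid_t pid, int status);   /* hypothetical */

    void on_child_event(void)
    {
        int status;
        pid_t pid;

        while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
            reap(pid, status);
        /* pid == 0: children remain but none exited yet.
         * pid < 0 && errno == ECHILD: no children at all.
         * Either way, a stale hint costs one cheap syscall. */
    }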
Re: Linux's implementation of poll() not scalable?
Linus Torvalds <[EMAIL PROTECTED]> writes: > bind_event(sock, POLLIN, NULL, accept_fn); [...] > (In fact, you might as well move the event array completely inside > "get_event()", because nobody would be supposed to look at the raw array > any more. So the "get_event()" interface would be even simpler: just a > timeout, nothing more. Talk about simple programming. It might be good to have bind_event return any previous accept_fn that's there, in case a library or something wants to chain a handler rather than clobbering what's there. Or maybe have a way to query, although this seems fraught with danger if another thread binds between a query and a bind. - M -- Mark "Monty" Montague | [EMAIL PROTECTED] | I don't do Windows(tm) I'm dubious about any company whose assets can be destroyed by rm -rf http://www.gg.caltech.edu/~monty/monty.shtml X-PGP-Fingerprint: E4 EA 6D B1 82 46 DB A1 B0 FF 60 B9 F9 5D 5C F7
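If bind_event() did return the previous handler, chaining might look like this; the whole interface here is Linus's proposal plus Mark's suggested return value, so everything is hypothetical:

    /* Hypothetical: bind_event_fn() is assumed to install a handler and
     * return whatever handler was previously bound (Mark's suggestion).
     * Per-fd bookkeeping is omitted; one saved link for brevity. */
    struct event;
    typedef void (*event_fn_t)(struct event *ev);

    extern event_fn_t bind_event_fn(int fd, unsigned long mask,
                                    event_fn_t fn);      /* assumed */

    static event_fn_t next_handler;          /* saved link in the chain */

    static void library_handler(struct event *ev)
    {
        /* ... the library's own processing ... */
        if (next_handler)
            next_handler(ev);                /* pass the event along */
    }

    void library_attach(int fd, unsigned long mask)
    {
        next_handler = bind_event_fn(fd, mask, library_handler);
    }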
Re: Linux's implementation of poll() not scalable?
Followup to: <[EMAIL PROTECTED]> By author: Linus Torvalds <[EMAIL PROTECTED]> In newsgroup: linux.dev.kernel > > Oh, I agree. > > And I think something like CLONE_EVENTS would be fine - and decide > yourself what kind of threads you want (do you want indistinguishable > "anonymous" threads like apache, or do you want threads that handle > separate event queues). Or you might have a mixture of the two - for web > serving the "accept()" event list might be a separate thing with a few > threads doing that, while "worker threads" handle the actual IO.. > > But if you want to have some event queue ID, I just wonder how you'd set > it up sanely without tons of complications.. > How about handling the event queue ID as an event mask internally? -hpa -- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > But user code currently written for poll() has the luxury of dropping > > events because poll() will happily report on the current readiness of > > the socket every time. /dev/poll is level-triggered because it's trying > > to make conversion of poll()-based code easy. With your scheme, > > whatever user code is receiving the edges better darn well do something > > about them, because it's only going to get them once. > > Oh, I agree. I'm not saying that my approach magically fixes bugs in user > space ;) With poll(), it was *not a bug* for the user code to drop events; with your proposed interface, it *is a bug* for the user code to drop events. I'm just emphasizing this because Simon Kirby ([EMAIL PROTECTED]) posted incorrectly that your interface "has the same semantics as poll from the event perspective". [ Linus explains why the level-triggered style is difficult ] > - the "proactively remove events when the thing that triggered them goes > away" approach means that each anti-event (like a read that empties the > buffer) needs to undo its events, but it also needs to be careful that > it doesn't undo combined events, and it needs to be very careful about > races (new packet coming in), so the proactive remove actually ends up > being less than trivial - and in a performance-critical section. That's the approach I was thinking of. It should be about as difficult as continuously maintaining an integer for each fd which matches what poll() with a zero timeout on that fd would return. (The difference is that as the bits of the integer change, you would move the fd's record in and out of the ready list.) How hard is that? > Implement it, and see. I bet you'll find that it gets really interesting > when you have concurrent accesses to the same file descriptor etc. Sure, but lots of the kernel is really interesting. - Dan
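The bookkeeping Dan has in mind might be shaped like this (a sketch; the ready_list_* helpers are assumed, and the locking a real version needs is omitted):

    /* Sketch: keep a per-fd mask equal to what poll() with a zero
     * timeout would report, and move the record on/off the ready list
     * as the mask changes.  Helper names assumed; locking omitted. */
    struct fd_state {
        unsigned long level;       /* current poll()-style readiness bits */
        int on_ready_list;
    };

    extern void ready_list_add(struct fd_state *f);   /* assumed helpers */
    extern void ready_list_del(struct fd_state *f);

    /* Called on every state change: data arrival sets bits, a read that
     * empties the buffer clears them, and so on. */
    void update_level(struct fd_state *f, unsigned long new_level)
    {
        f->level = new_level;
        if (new_level && !f->on_ready_list) {
            ready_list_add(f);
            f->on_ready_list = 1;
        } else if (!new_level && f->on_ready_list) {
            ready_list_del(f);
            f->on_ready_list = 0;
        }
    }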
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Evan Jeffrey wrote: > > > Multiple event queues are bad, because it completely breaks the notion of > > event-driven programming. How do you want to listen to them all? You can't. > > You can only listen to one event queue at a time - unless you create some > > You can listen to one event queue per thread. Oh, I agree. And I think something like CLONE_EVENTS would be fine - and decide yourself what kind of threads you want (do you want indistinguishable "anonymous" threads like apache, or do you want threads that handle separate event queues). Or you might have a mixture of the two - for web serving the "accept()" event list might be a separate thing with a few threads doing that, while "worker threads" handle the actual IO.. But if you want to have some event queue ID, I just wonder how you'd set it up sanely without tons of complications.. Linus
Re: Linux's implementation of poll() not scalable?
> Multiple event queues are bad, because it completely breaks the notion of > event-driven programming. How do you want to listen to them all? You can't. > You can only listen to one event queue at a time - unless you create some You can listen to one event queue per thread. Maybe in the case people always bring up (Apache handling 1 concurrent connections), threads are indistinguishable, but for many applications different threads or thread groups will be assigned to different tasks. An ORB ought to be allowed to clone off an event handler thread that doesn't have to worry about receiving X events, which might be really slow to handle. (Or vice versa, handling a mouse_move event shouldn't have to wait for a CORBA upcall that recalculates a spreadsheet). Also, what happens if someone does a clone with CLONE_FILES but no CLONE_VM? Now the event handlers and opaque data point to things that may not even be valid in this VM. I see a couple of ways to do this that don't give in much from the "One event handler list per process" idea. Add a CLONE_EVENTS flag to clone. This is rather nasty, since it violates orthogonality (since a struct event references both VM and fd objects), and makes it difficult for pthreads users to choose whether to share event handlers. The second is to tag events with a threadid, and allow add_event to specify a flag saying "only deliver to this thread" and get_events to say "wait for this thread's events, or any thread's events". The third is to just declare "event handlers are shared by any processes sharing both a VM and FILES". This seems rather silly (since it would still have to be independent in the task_struct, but with no user space control over it), and doesn't help the poor SOB who has to handle it. > ... unless you create some > kind of "hierarchy" of event queues, where you have one event queue that > you wait on that contains the _other_ event queues, so that you can tell > which one has anything pending.. Agreed. The only reason to do this is if your event queue sucks as bad as poll() and you have to do so for efficiency reasons... (Like people do with 10 threads each handling 1/10 of the fds). Evan --- Fear is the mind killer. -- Frank Herbert, "Dune"
Re: Linux's implementation of poll() not scalable?
> Shouldn't there also be a way to add non-filedescriptor based events > into this, such as "child exited" or "signal caught" or shm things? Waiting on pthreads condition variables, POSIX message queues, and semaphores (as well as fd's) at the same time would *rock*... Unifying all these "waitable objects" would be tremendously helpful to fully exploit the "library transparency" advantage that Linus brought up. Some libraries might want to wait on things that are not file descriptors... Regards, Dan
Re: Linux's implementation of poll() not scalable?
In article <[EMAIL PROTECTED]>, Linus Torvalds <[EMAIL PROTECTED]> wrote: > struct event { > unsigned long id; /* file descriptor ID the event is on */ > unsigned long event; /* bitmask of active events */ > }; > int bind_event(int fd, struct event *event); Shouldn't there also be a way to add non-filedescriptor based events into this, such as "child exited" or "signal caught" or shm things? Or should there simply be ways to make those events available through filedescriptors anyway? Then you'd get /dev/events - open it, tell it what kind of event you want to bind to the fd, pass it to bind_event, read a struct event_info from it when there's something to read. Just a random thought. Ignore if not appropriate. Mike. -- Q: How many ears do Trekkies have? A: Three; the left ear, the right ear, and the final front ear.
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > * it doesn't add extra syscalls > > Sure it does. > > What do you think ioctl's are? As I explained a few lines down from where you stopped quoting (and probably stopped reading), the ioctl() use is just an artifact of Solaris's icky implementation. It could and should use read(). Not an extra syscall. > Sure, you can have multiple "poll lists" by opening /dev/poll multiple > times, and you seem to see this as a great feature. I'm telling you that > multiple event queues are a horrible disaster, and it's a perfect example > of what an engineer with the "more is better" mindset comes up with. It has nothing to do with how exciting of a feature it is. It's a matter of "hey, if you implement it with /dev/poll, you get multiple queues FOR FREE". Those sorts of things are an indication that you've got the implementation right. The fact that in your scheme multiple queues would be extra work is just residue from the fact that you've got the idea wrong. And how do you propose dealing with multi-threaded programs? Obviously for SMP you'd want multiple threads to be able to grab events, so I guess the "one great queue" will be shared among all the threads. But, suppose I want to dedicate a thread to coping with a specific set of events. Hmm... guess you'll need a CLONE_EVENTQUEUE flag to support that. With /dev/poll none of this foolishness is needed - you just open two of them and let the threads do what they want. > Multiple event queues are bad, because it completely breaks the notion of > event-driven programming. How do you want to listen to them all? You can't. Explained in detail in my mail. > You can only listen to one event queue at a time - unless you create some > kind of "hierarchy" of event queues, where you have one event queue that > you wait on that contains the _other_ event queues, so that you can tell > which one has anything pending.. No, event queues of event queues are messy in implementation. There's nothing fundamentally wrong with them, but you need to make sure the graph stays acyclic, which is just an implementation hassle. Not enough gain for too much code (unless someone has a beautiful algorithm). However, it's perfectly reasonable to use select() or poll() on two event queues. As I explained in my email, this is needed if you're already using a library that does its own event queue based on poll or select. > Basically, tastes differ. Everything you seem to see as an advantage of > /dev/poll, I see as a horrible kludge. Well, everything I see indicates that /dev/poll would be the easier of the two interfaces to actually implement. All the added flexibility is just gravy. -Mitch
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Abramo Bagnara wrote: > Linus Torvalds wrote: > > > > > > struct event { > > int fd; > > unsigned long mask; > > void *opaque; > > void (*event_fn)(int fd, unsigned long mask, void *opaque); > > My experience says that: > > unsigned long rmask; > void (*event_fn)(struct event *event); > > is a far better solution (more type safe, more descriptive). You're probably right. It certainly makes it easier to add fields later on if we'd want to extend the ID part without having to change old users.. Linus
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Dan Kegel wrote: > > But user code currently written for poll() has the luxury of dropping > events because poll() will happily report on the current readiness of > the socket every time. /dev/poll is level-triggered because it's trying > to make conversion of poll()-based code easy. With your scheme, > whatever user code is receiving the edges better darn well do something > about them, because it's only going to get them once. Oh, I agree. I'm not saying that my approach magically fixes bugs in user space ;) > > The BSD kevent paper goes on about "level and edge triggered" and it > > becomes a big thing for them, and they selected level-triggered events as > > if it made any difference. And sure - it _does_ make a difference, but the > > only difference is in how hard it is to implement, and level-triggered is > > noticeably harder. > > I don't see why edge triggered is that much harder. All it adds is > a layer which receives the edges and moves fds back and forth between > a 'ready' list and a 'not ready' list. Easy as pie. Not true. For example, if you're truly level-triggered, and you have a socket that gets data, the event moves onto the event queue. So far so fine: both edge and level agree about this one. The point where they disagree is when the event gets removed from the event queue. For edge triggered, this one is trivial: when a get_events() thing happens and moves it into user land. This is basically a one-liner, and it is local to get_events() and needs absolutely no help from anybody else. So obviously event removal is _very_ simple for edge-triggered events - the INTACK basically removes the event (and also re-arms the trigger logic: which is different from most interrupt controllers, so the analogy falls down here). For level, the thing is not as easy at ALL. Suddenly removal becomes a big issue, and needs help from the actual driver. You can do it two ways: calling down to the driver when you remove (to see if the event should be dismissed or not once it has been read) or have the driver pro-actively remove the event whenever a read() happens (or whatever it is that undoes the event). Both are actually fairly hard. Much harder than they sound. For different reasons. - the callback approach at get_events() time sounds trivial, but actually has two problems: cache footprint for "get_events()" grows a _lot_ (because the events are likely to be spread out a lot if there are a lot of them pending, so you don't get a nice tight inner loop at all), and you get "double events" - by the time the event first happens, it will still be active, so we cannot actually remove it at that time (there is still data to be read - and the event doesn't go away until we read it) so we'll get the event _again_, and on the next get_events() it will notice that the event was bogus, and remove it (and we can optimize it away from reporting it to user land at that point, so only the kernel needs to look at it twice and do two callbacks) - the "proactively remove events when the thing that triggered them goes away" approach means that each anti-event (like a read that empties the buffer) needs to undo its events, but it also needs to be careful that it doesn't undo combined events, and it needs to be very careful about races (new packet coming in), so the proactive remove actually ends up being less than trivial - and in a performance-critical section. Now, compare that to a one-liner that just does a "list_del(&event->list)" as it copies over the event to user mode. 
Wouldn't you say that the edge-triggered version is simpler? > > The reason "edge-triggered" (ie only when an event changes) is preferable > > is that it's MUCH simpler, and means that the "get_events()" system call > > doesn't need to actually understand what the events mean at all. > > Not much understanding is required on the part of the edge-to-level filter. Implement it, and see. I bet you'll find that it gets really interesting when you have concurrent accesses to the same file descriptor etc. Linus
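For comparison, the edge-triggered copy-out Linus describes would be roughly this (a kernel-flavored sketch modeled on the 2.4 list API; struct event_entry and the locking a real version needs are invented/omitted):

    /* Kernel-flavored sketch, not actual kernel source.  The removal
     * really is the single list_del(); the bind record stays allocated
     * and its trigger logic re-arms on the next state change. */
    #include <linux/list.h>
    #include <asm/uaccess.h>        /* copy_to_user, 2.4-era location */

    struct event { unsigned long id; unsigned long event; };

    struct event_entry {            /* hypothetical per-bind record */
        struct list_head list;
        struct event event;
    };

    /* Locking against concurrent queuers omitted for brevity. */
    int get_events_core(struct list_head *queue, struct event *ubuf, int maxnr)
    {
        int n = 0;

        while (n < maxnr && !list_empty(queue)) {
            struct event_entry *ev =
                list_entry(queue->next, struct event_entry, list);

            list_del(&ev->list);    /* the entire removal story */
            if (copy_to_user(&ubuf[n], &ev->event, sizeof(ev->event)))
                return n ? n : -EFAULT;
            n++;
        }
        return n;
    }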
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > > * Do you get an event whenever an fd is ready for something, or > > only when its readiness changes? (Presumably whenever an fd is ready for > something?) > > Only when its readiness changes - probably with the addition that it would > simplify things that a new event always (unconditionally) gets added to > the event queue. > > Note that there are no races in this implementation - there is no way for > events to get lost, even trivially seen in SMP/threaded environments. > That's important. But user code currently written for poll() has the luxury of dropping events because poll() will happily report on the current readiness of the socket every time. /dev/poll is level-triggered because it's trying to make conversion of poll()-based code easy. With your scheme, whatever user code is receiving the edges better darn well do something about them, because it's only going to get them once. > The BSD kevent paper goes on about "level and edge triggered" and it > becomes a big thing for them, and they selected level-triggered events as > if it made any difference. And sure - it _does_ make a difference, but the > only difference is in how hard it is to implement, and level-triggered is > noticeably harder. I don't see why edge triggered is that much harder. All it adds is a layer which receives the edges and moves fds back and forth between a 'ready' list and a 'not ready' list. Easy as pie. > The reason "edge-triggered" (ie only when an event changes) is preferable > is that it's MUCH simpler, and means that the "get_events()" system call > doesn't need to actually understand what the events mean at all. Not much understanding is required on the part of the edge-to-level filter. That said, the edge-to-level filter can be in userspace, and the kernel part can be just edge-triggered -- as long as the kernel delivers the downward edges as well as the upward edges. We could do that by adding NOTPOLLIN, NOTPOLLOUT, etc., i.e. events that are delivered when a socket becomes UN-ready for reading. - Dan
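Dan's userspace edge-to-level filter could be as small as this sketch; NOTPOLLIN here is the hypothetical falling-edge event he proposes, and the ready_set_* helpers are assumed:

    /* Userspace edge-to-level filter: consume both edges, maintain a
     * ready set, and let a level-style wrapper report whatever is
     * currently in the set.  NOTPOLLIN is hypothetical.  Sketch only. */
    #include <poll.h>

    #define NOTPOLLIN 0x10000       /* hypothetical "became unreadable" */

    extern void ready_set_add(int fd);   /* assumed set helpers */
    extern void ready_set_del(int fd);

    void on_edge(int fd, unsigned long mask)
    {
        if (mask & POLLIN)
            ready_set_add(fd);      /* rising edge: fd now readable */
        if (mask & NOTPOLLIN)
            ready_set_del(fd);      /* falling edge: no longer readable */
    }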
Re: Linux's implementation of poll() not scalable?
There is only one thing I don't understand about this... why can't we re-implement the poll() implementation of Linux instead of introducing another system call? If I understood Linus correctly, what he is saying is that the bind_event system call is needed to give the kernel a hint that the application is interested in a certain event associated with a file descriptor. If the kernel kept such an event queue per process anyway (as soon as the process opened the file/socket)... then the poll implementation would be exactly like the get_events system call. What is wrong with this? Thanks Lee On Mon, 23 Oct 2000, Linus Torvalds wrote: > > > What is your favourite interface then? > > I suspect a good interface that can easily be done efficiently would > basically be something where the user _does_ do the equivalent of a > read-only mmap() of poll entries - and explicit and controlled > "add_entry()" and "remove_entry()" controls, so that the kernel can > maintain the cache without playing tricks. Actually, forget the mmap, it's not needed. Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have. First, let's see what is so nice about "select()" and "poll()". They do have one _huge_ advantage, which is why you want to fall back on poll() once the RT signal interface stops working. What is that? Basically, RT signals or any kind of event queue has a major fundamental queuing theory problem: if you have events happening really quickly, the events pile up, and queuing theory tells you that as you start having queueing problems, your latency increases, which in turn tends to mean that later events are even more likely to queue up, and you end up in a nasty meltdown scenario where your queues get longer and longer. This is why RT signals suck so badly as a generic interface - clearly we cannot keep sending RT signals forever, because we'd run out of memory just keeping the signal queue information around. Neither poll() nor select() have this problem: they don't get more expensive as you have more and more events - their expense is the number of file descriptors, not the number of events per se. In fact, both poll() and select() tend to perform _better_ when you have pending events, as they are both amenable to optimizations when there is no need for waiting, and scanning the arrays can use early-out semantics. So sticky arrays of events are good, while queues are bad. Let's take that as one of the fundamentals. So why do people still like RT signals? They do have one advantage, which is that you do NOT have that silly array traversal when there is nothing to do. Basically, the RT signals kind of approach is really good for the cases where select() and poll() suck: no need to traverse mostly empty and non-changing arrays all the time. It boils down to one very simple rule: dense arrays of sticky status information is good. So let's design a good interface for a dense array. Basically, the perfect interface for events would be struct event { unsigned long id; /* file descriptor ID the event is on */ unsigned long event; /* bitmask of active events */ }; int get_events(struct event * event_array, int maxnr, struct timeval *tmout); where you say "I want an array of pending events, and I have an array you can fill with up to 'maxnr' events - and if you have no events for me, please sleep until you get one, or until 'tmout'". The above looks like a _really_ simple interface to me. 
Much simpler than either select() or poll(). Agreed? Now, we still need to inform the kernel of what kind of events we want, ie the "binding" of events. The most straightforward way to do that is to just do a simple "bind_event()" system call: int bind_event(int fd, struct event *event); which basically says: I'm interested in the events in "event" on the file descriptor "fd". The way to stop being interested in events is to just set the event bitmask to zero. Now, the perfect interface would be the above. Nothing more. Nothing fancy, nothing complicated. Only really simple stuff. Remember the old rule: "keep it simple, stupid". The really nice part of the above is that it's trivial to implement. It's about 50 lines of code, plus some simple logic to various drivers etc to actually inform about the events. The way to do this simply is to limit it in very clear ways, the most notable one being simply that there is only one event queue per process (or rather, per "struct files_struct" - so threads would automatically share the event queue). This keeps the implementation simple, but it's also what keeps the interfaces simple: no queue ID's to pass around etc. Implementation is straightforward: the event queue basically consists of - a queue head in "struct files_struct", initially empty. - doing a "bind_event()" basically adds a fasync entry to the file structure, but rather than cause a signal, it just looks a
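For a sense of how the proposal would be used, here is a sketch of an application event loop on top of it; bind_event() and get_events() are the proposed calls from the message above and do not actually exist, and handle() is a hypothetical dispatcher:

    /* Usage sketch only: the two syscalls are Linus's proposal, not a
     * real interface.  handle() is a hypothetical dispatcher. */
    #include <poll.h>
    #include <stddef.h>
    #include <sys/time.h>

    struct event {
        unsigned long id;       /* file descriptor ID the event is on */
        unsigned long event;    /* bitmask of active events */
    };

    extern int bind_event(int fd, struct event *event);
    extern int get_events(struct event *event_array, int maxnr,
                          struct timeval *tmout);

    extern void handle(unsigned long fd, unsigned long mask);

    void event_loop(int listen_fd)
    {
        struct event ev = { listen_fd, POLLIN };
        bind_event(listen_fd, &ev);     /* declare interest exactly once */

        struct event batch[64];
        for (;;) {
            /* Sleep until something happens; at most one event per bound
             * fd, so 'batch' bounds memory use, not the packet rate. */
            int n = get_events(batch, 64, NULL);
            for (int i = 0; i < n; i++)
                handle(batch[i].id, batch[i].event);
        }
    }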
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> struct event {
>         int fd;
>         unsigned long mask;
>         void *opaque;
>         void (*event_fn)(int fd, unsigned long mask, void *opaque);

My experience says that:

        unsigned long rmask;
        void (*event_fn)(struct event *event);

is a far better solution (more type safe, more descriptive).

-- Abramo Bagnara mailto:[EMAIL PROTECTED] Opera Unica Via Emilia Interna, 140 Phone: +39.0546.656023 48014 Castel Bolognese (RA) - Italy Fax: +39.0546.656023 ALSA project is http://www.alsa-project.org sponsored by SuSE Linux http://www.suse.com

It sounds good!
Re: Linux's implementation of poll() not scalable?
On Tue, Oct 24, 2000 at 10:03:04AM -0700, Linus Torvalds wrote:
> Basically, with get_events(), there is a maximum of one event per "bind".
> And the memory for that is statically allocated at bind_event() time.
>...
> But you'd be doing so in a controlled manner: the memory use wouldn't go
> up just because there is a sudden influx of 50000 packets. So it scales
> with load by virtue of simply not _caring_ about the load - it only cares
> about the number of fd's you're waiting on.

Nice. I like this. It would be easy for existing userspace code to start using this interface as it has the same semantics as select/poll from the event perspective. But it would make things even easier, as the bind would follow the life of the descriptor and thus wouldn't need to be "requeued" before every get_events call, so that part of userspace code could just be ripped out^W^W disabled and kept only for portability. In most of the daemons I have written, I've ended up using memcpy() to keep a non-scribbled-over copy of the fdsets around so I don't have to walk data structures and requeue fds on every loop for select()...nasty.

Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. ]
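For reference, the select() idiom Simon describes looks roughly like this; master_set and max_fd are hypothetical bookkeeping kept up to date as connections come and go, and it is exactly this per-loop copying that a persistent in-kernel binding would eliminate:

	#include <string.h>
	#include <sys/select.h>

	static fd_set master_set;   /* the non-scribbled-over copy */
	static int max_fd;

	int wait_for_io(struct timeval *tmout)
	{
		fd_set ready;
		/* select() scribbles over its arguments, so hand it a scratch copy */
		memcpy(&ready, &master_set, sizeof(ready));
		return select(max_fd + 1, &ready, NULL, NULL, tmout);
	}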
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Simon Kirby wrote:
>
> However, isn't there already something like this, albeit maybe without
> the ability to return multiple events at a time? When discussing
> select/poll on IRC a while ago with sct, sct said:
>
> Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for
> SIGIO as usual.

[ siginfo and RT signals, but using "sigwaitinfo()" instead of actual signal handlers ]

The problem with RT signals (whether you use sigwaitinfo() or signal handling overhead) is that they don't scale to lots of events. The memory pressure for keeping track of millions of siginfo structures is more than any system can handle - so all RT signals implementations will have to eventually fall back on something else. Which makes for really hard to debug bugs, and complex programming (you essentially end up having to duplicate all of the event handling, and the "fallback" case has the added difficulty that you won't even see it except under heavy load).

> Also, what is different in your above interface that prevents it from
> being able to queue up too many events?

Basically, with get_events(), there is a maximum of one event per "bind". And the memory for that is statically allocated at bind_event() time.

> I guess the structure is only
> sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000
> FDs on x86, but I don't see how the other method would be significantly
> different. The kernel would have to store the queued events still,
> surely...

Oh, the bind information would be more like 32 bytes per bind, so assuming you have 20000 fd's you're waiting for, you'd certainly be using a lot of memory. Still less than the quite complex SIGIO structures, though. But you'd be doing so in a controlled manner: the memory use wouldn't go up just because there is a sudden influx of 50000 packets. So it scales with load by virtue of simply not _caring_ about the load - it only cares about the number of fd's you're waiting on. (Obviously _other_ parts of the system will have to keep up with tons of new packets, but that's why protocols like TCP have flow control, and that is something that is done independently of the event notification, so regardless of _what_ kind of event model you have you have to handle the simple load of TCP ;).

Linus
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 10:39:36PM -0700, Linus Torvalds wrote:
> Actually, forget the mmap, it's not needed.
>
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.
>...
> Basically, the perfect interface for events would be
>
>         struct event {
>                 unsigned long id;    /* file descriptor ID the event is on */
>                 unsigned long event; /* bitmask of active events */
>         };
>
>         int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

I like. :)

However, isn't there already something like this, albeit maybe without the ability to return multiple events at a time? When discussing select/poll on IRC a while ago with sct, sct said:

	Simon: You just put your sockets into O_NONBLOCK|FASYNC mode for SIGIO as usual.
	Simon: Then fcntl(fd, F_SETSIG, rtsignum)
	Simon: And you'll get a signal queue which passes you the fd of each SIGIO in turn.
	sct: easy :)
	Simon: You don't even need the overhead of a signal handler: instead of select(),
	you just do "sigwaitinfo(&siginfo, timeout)" and it will do a select-style IO
	wait, returning the fd in the siginfo when it's available.

(Captured from IRC on Nov 12th, 1998.)

Or does this method still have the overhead of encapsulating the events into signals within the kernel?

Also, what is different in your above interface that prevents it from being able to queue up too many events? I guess the structure is only sizeof(int) * 2 bytes per fd, so it would only take, say, 80kB for 20,000 FDs on x86, but I don't see how the other method would be significantly different. The kernel would have to store the queued events still, surely...

Simon- [ Stormix Technologies Inc. ][ NetNation Communications Inc. ] [ [EMAIL PROTECTED] ][ [EMAIL PROTECTED]] [ Opinions expressed are not necessarily those of my employers. ]
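The sct recipe above, sketched out; handle_fd() is a hypothetical application handler, F_SETSIG is the Linux-specific fcntl, and note that the call that actually accepts a timeout is sigtimedwait(), not sigwaitinfo():

	#define _GNU_SOURCE             /* F_SETSIG is Linux-specific */
	#include <fcntl.h>
	#include <signal.h>
	#include <unistd.h>

	extern void handle_fd(int fd, long band);   /* hypothetical application handler */

	void setup(int fd, sigset_t *set)
	{
		sigemptyset(set);
		sigaddset(set, SIGRTMIN);
		sigaddset(set, SIGIO);              /* plain SIGIO arrives when the RT queue overflows */
		sigprocmask(SIG_BLOCK, set, NULL);  /* block them: we collect synchronously instead */

		fcntl(fd, F_SETOWN, getpid());
		fcntl(fd, F_SETSIG, SIGRTMIN);      /* queue SIGRTMIN with siginfo, not plain SIGIO */
		fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
	}

	void event_loop(sigset_t *set)
	{
		for (;;) {
			siginfo_t info;
			struct timespec tmout = { 1, 0 };
			int sig = sigtimedwait(set, &info, &tmout);
			if (sig == SIGIO) {
				/* RT queue overflowed: fall back to one big poll() here */
				continue;
			}
			if (sig > 0)
				handle_fd(info.si_fd, info.si_band);
		}
	}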
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Mitchell Blank Jr wrote:
> I think everyone should take a timeout and look at Solaris 8's /dev/poll
> interface. This discussion is reinventing the wheel, the lever, and the
> inclined plane.
>
> http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123
>
> I think this is a lot cleaner than your approach:

You must be kidding. /dev/poll is the interface from _hell_. It's absolutely disgusting.

> * it doesn't add extra syscalls

Sure it does. What do you think ioctl's are? Every single ioctl number is an extra system call. It's just indirected inside the device driver instead of at the normal system call entry point. The advantage? It's noticeably slower than a real system call. And the final reason why /dev/poll should just go away: not only is the interface ugly, but it's hard to use too. It's really hard to mix different users of /dev/poll - you'll have to create a whole new event structure around it. Sure, you can have multiple "poll lists" by opening /dev/poll multiple times, and you seem to see this as a great feature. I'm telling you that multiple event queues are a horrible disaster, and it's a perfect example of what an engineer with the "more is better" mindset comes up with. Multiple event queues are bad, because it completely breaks the notion of event-driven programming. How do you want to listen to them all? You can't. You can only listen to one event queue at a time - unless you create some kind of "hierarchy" of event queues, where you have one event queue that you wait on that contains the _other_ event queues, so that you can tell which one has anything pending.. Trust me. Multiple event queues are idiotic. Anybody who _designs_ for multiple event queues has some serious problems. Basically, tastes differ. Everything you seem to see as an advantage of /dev/poll, I see as a horrible kludge.

Linus
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Dan Kegel wrote:
> Linus Torvalds wrote:
> > Basically, the main loop would boil down to
> >         for (;;) {
> >                 static struct event ev_list[MAXEV];
> >                 get_event(ev_list, MAXEV, &tmout);
> >                 .. timeout handling here ..
> >         }
> >
> > because get_event() would end up doing all the user-mode calls too (so
> > "get_event()" is no longer a system call: it's a system call + a for-loop
> > to call all the ID handler functions that were associated with the events
> > that triggered).
>
> Occurs to me that this stuff will be used from other languages
> than C, which might prefer to do their own event dispatching,
> and would want the system to treat event_fn as another opaque quantity.

Yes and no. The KERNEL would treat it as just another opaque quantity: after all, it would never ever touch the thing other than get it as part of "bind_event()" and return it as part of "get_events()". So as far as the kernel is concerned, both "event_fn" and "opaque" are just random information. However, it is very important to have every common user agree about the meaning of the ID's in user space: the whole point of encapsulating "event_fn" as a function pointer is that you can have different parts of the same program use the "bind_event()" interface, and they won't step on each other's toes. So if you program in Fortran, for example, and you expect to link against the C library, and the C library uses events for something, then you'd better make sure that the fortran interfaces end up using the event_fn thing in the same way as the C one..

> So I don't object to a function that dispatches events, e.g.
>         int get_event(&tmout);
> as long as there's also
>         int get_events(ev_list, MAXEV, &tmout)
> that treats event_fn as an opaque pointer and *does not* dispatch the events.

That's a user-mode issue, and obviously, if you don't care about compatibility with anything else you can do this. The kernel won't know, or care. But I think it is unlikely that you'd ever use the raw events in any other way - most environments under UNIX tend to have a C library component somewhere, because the low-level stuff is written in C and uses the common code that way even when the programmer isn't aware of it.

Linus
Re: Linux's implementation of poll() not scalable?
[ Moving on to practical matters ]

On Tue, 24 Oct 2000, Dan Kegel wrote:
>
> Might be good to pick more unique names than 'struct event' and 'get_event'.
> People probably use those already.

I agree. I would never implement them under those names, but it's easier to talk about "event" than about something very specific.

> Hiding ev_list is probably ok. However,
> http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf
> suggests that knowing how many events are pending is a useful measure
> of server load, and that if more than a certain number of events
> are pending, web servers should reject new connections. Thus it might
> be handy to make the return value of get_event be the number of events
> gotten.

Note that my "get_event()" function actually did return exactly that: it returned the number of events, even though the actual event array handling was hidden inside the library function.

> > So now you'd start everything off (assuming the same kind of "listen to
> > everything and react to it" server as in my previous example) by just
> > setting
> >
> >         bind_event(sock, POLLIN, NULL, accept_fn);
>
> A couple questions:
>
> * how do you unbind? (By calling bind_event with NULL as the accept_fn?)

If done that way, I'd prefer something where an event _mask_ of 0 means that the event obviously no longer exists, as it can no longer trigger. The accept_fn would be part of the identifier, and as such zeroing it out would lose the identification. But I'd personally probably prefer to be more explicit, and have a separate function for it. Note that whether it's a separate function or just making "bind_event()" really be "modify_or_bind_or_remove_event()" is pretty much not anything I care about - it gets to be so low-level details that it would be something that is probably best tried out and something that I don't have any religious issues with. I care about high-level interfaces, because _those_ are the ones that screw me years down the line. For example, select() (and later bind()) was always an interface that did _not_ fit into the UNIX way of doing things AT ALL. It really stands out among all the IO interfaces as being very different. And it's one of the nastier ones to maintain.

> * how do you change a mask? (By calling bind_event with a new mask?)

Same as above. Either bind with a new mask (zero being remove), or separate call.

> * Is it ok to do these things while in an event_fn? (Yes?)

Oh, absolutely. The kernel doesn't even _know_ when something is in an event function - the kernel only sees "change the event" and "get the list of events".

> * Do you get an event whenever an fd is ready for something, or
>   only when its readiness changes? (Presumably whenever an fd is ready for
>   something?)

Only when its readiness changes - probably with the addition that it would simplify things that a new event always (unconditionally) gets added to the event queue. Note that there are no races in this implementation - there is no way for events to get lost, even trivially seen in SMP/threaded environments. That's important. The BSD kevent paper goes on about "level and edge triggered" and it becomes a big thing for them, and they selected level-triggered events as if it made any difference. And sure - it _does_ make a difference, but the only difference is in how hard it is to implement, and level-triggered is noticeably harder. The bind_event()/get_events() interface is edge-triggered, and works fundamentally the same way as SIGCHLD.
If you don't keep up, you get SIGCHLD only once, even if ten children died on you in rapid succession. And it's edge-triggered: that doesn't mean that you lose events (like the BSD paper implies), it only means that you have to reap your children with a loop:

	while ((err = waitpid(..)) >= 0) {
		.. get all children ..
	}

The reason "edge-triggered" (ie only when an event changes) is preferable is that it's MUCH simpler, and means that the "get_events()" system call doesn't need to actually understand what the events mean at all. It will just take the events off the list. If anything triggers after that, it will be put back on the list again, but you can guarantee that you don't get any lost events simply by the simple fact that - by the time get_events() returns to user mode (ie before the action functions have been called), any new event that happens even before we have had to react to the old events will cause the event to be queued up again. What does this mean? It basically means that if you continue to get events on a file descriptor during your event handler function, the event will have moved back to the event list, and next time you _will_ get an event again.

Linus
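Applying the same reap-in-a-loop rule to an edge-triggered readable event on a non-blocking socket might look like this; consume(), handle_eof() and handle_error() are hypothetical application handlers:

	#include <errno.h>
	#include <unistd.h>

	extern void consume(const char *buf, ssize_t n);  /* hypothetical */
	extern void handle_eof(int fd);                   /* hypothetical */
	extern void handle_error(int fd);                 /* hypothetical */

	void on_readable(int fd)
	{
		char buf[4096];
		for (;;) {
			ssize_t n = read(fd, buf, sizeof(buf));
			if (n > 0) {
				consume(buf, n);
			} else if (n == 0) {
				handle_eof(fd);     /* peer closed */
				return;
			} else if (errno == EAGAIN) {
				return;             /* drained: safe to wait for the next edge */
			} else if (errno != EINTR) {
				handle_error(fd);
				return;
			}
		}
	}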
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.

I think everyone should take a timeout and look at Solaris 8's /dev/poll interface. This discussion is reinventing the wheel, the lever, and the inclined plane. http://docs.sun.com/ab2/coll.40.6/REFMAN7/@Ab2PageView/55123 I think this is a lot cleaner than your approach:

* it doesn't add extra syscalls

* it handles multiple event queues, and does so without ugliness. all the per-queue state is held in the /dev/poll's "struct file" instance

* in your method you have a per-process queue - but under what clone() conditions is it shared among siblings? Here the user has a choice - they can share the open("/dev/poll") file descriptor or not using any of the existing means. Granted, they also would probably want to arrange to share the fd's being polled in order to make this useful.

* No new fields in task_struct

A few simple improvements can be made to the Sun model, though:

* The fact that the fd of /dev/poll can't be used for poll(2) or select(2) is ugly. Sure, you probably don't want an open instance of /dev/poll being added to another /dev/poll, but being able to call "select" on them would be really useful:

1. Imagine a server that has to process connections from both high-priority and low-priority clients - and that requests from the high-priority ones always take precedence. With this interface you could easily have two open instances of /dev/poll and then call select(2) on them. This ability just falls out naturally from the interface.

2. Some libraries are internally driven by select(2) loops (I think Xlib is like this, IIRC). If you have a lot of fd's you want to watch, this means you must take the hit of calling select(2) on all of them. If you could just pass in a fd for /dev/poll, problem solved.

* I think the fact that you add events via write(2) but retrieve them via ioctl(2) is an ugly asymmetry. From what I can see, the entire motivation for using ioctl as opposed to read(2) is to allow the user to specify a timeout. If you could use poll(2)/select(2) on /dev/poll this issue would be moot (see above)

* It would be nice if the interface were flexible enough to report items _other_ than "activity on fd" in the future. Maybe SYSV IPC? itimers? directory update notification? It seems that every time UNIX adds a mechanism of waiting for events, we spoil it by not making it flexible enough to wait on everything you might want. Let's not repeat the mistake with a new interface.

* The "struct pollfd" and "struct dvpoll" should also include a 64-bit opaque quantity supplied by userland which is returned with each event on that fd. This would save the program from having to look up which connection context goes with each fd - the kernel could just give you the pointer to the structure. Not having this capability isn't a burden for programs dealing with a small number of fd's (since they can just have a lookup table) but if you potentially have 1000's of connections it may be undesirable to allocate an array for them all.

The linux patch of /dev/poll implements mmap'ing of the in-kernel poll table... I don't think this is a good idea. First, the user just wants to be able to add events and dequeue them - both linear operations. Second, why should the kernel be forced to maintain a fixed-size linear list of events we're looking for... this seems mindlessly inefficient.
Why not just pull a "struct pollfd" out of slab each time a new event is listened for?

My unresolved concerns:

* Is this substantially better than the already existing rtsig+poll solution? Enough to make implementing it worth while?

* How do we quickly do the "struct file" -> "struct pollfd" conversion each time an event happens? Especially if there are multiple /dev/poll instances open in the current process. Probably each "struct file" would need a pointer to the instance of /dev/poll which would have some b-tree variant (or maybe a hash table). The limitation would be that a single fd couldn't be polled for events by two different /dev/poll instances, even for different events. This is probably good for sanity's sake anyway.

-Mitch
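For concreteness, the usage pattern of the Sun interface being debated looks roughly like this, assuming the <sys/devpoll.h> definitions from the man page cited above (error handling omitted):

	#include <sys/devpoll.h>
	#include <sys/ioctl.h>
	#include <fcntl.h>
	#include <poll.h>
	#include <unistd.h>

	static int dp;

	void dp_init(void)
	{
		dp = open("/dev/poll", O_RDWR);
	}

	void dp_watch(int fd, short events)      /* events = POLLREMOVE drops the fd */
	{
		struct pollfd pfd = { fd, events, 0 };
		write(dp, &pfd, sizeof(pfd));        /* registrations go in via write(2)... */
	}

	int dp_wait(struct pollfd *out, int max, int timeout_ms)
	{
		/* ...but events come out via ioctl(2): the asymmetry complained
		   about above; returns the number of ready fds */
		struct dvpoll dvp = { out, max, timeout_ms };
		return ioctl(dp, DP_POLL, &dvp);
	}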
Re: Linux's implementation of poll() not scalable?
Thus spake Linus Torvalds ([EMAIL PROTECTED]): > I disagree. > Let's just face it, poll() is a bad interface scalability-wise. Is that a reason to implement it badly? Felix
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 09:06:11PM -0700, Linus Torvalds wrote:
>
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> >
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array)
>
> That's _not_ how poll() works at all.

But how /dev/poll would work. Sorry for not marking it clearly.

> We don't _have_ a poll table in the kernel, and no way to mmap it. The
> poll() tables gets created dynamically based on the stuff that the user
> has set up in the table. And the user can obviously change the fd's etc in
> the table directly, so in order for the caching to work you need to do
> various games with page table dirty or writable bits, or at least test
> dynamically whether the poll table is the same as it was before.

Actually, assuming the user does not duplicate fds in the poll array, it would always work out the lazy way (and when he duplicates fds I would say the behaviour is undefined -- no spec says in what order poll must process fds in the supplied poll table). You either find the correct fd at the offset you cached, and if not, the user has changed the table and you need to recompute all cached offsets. No need to play pte tricks.

> Sure, it's doable, and apparently Solaris does something like this. But
> what _is_ the overhead of the Solaris code for small number of fd's? I bet
> it really is quite noticeable. I also suspect it is very optimized toward
> an unchanging poll-table.

For a small number of fds you use the fast_poll/fast_select that I implemented in the patch I sent to you.

> > What is your favourite interface then ?
>
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

I think when you declare the case of duplicated fd entries in a poll array as undefined, you don't need the controlled add/remove_entry. [and from a quick glimpse at Single Unix it does not say anything about the order of fd processing in poll, so the spec is definitely flexible enough for that]

-Andi
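A user-space model of the lazy scheme Andi sketches, reduced to a single cached (fd, offset) pair; in the real thing a miss would trigger recomputing all cached offsets, and the names here are purely illustrative:

	#include <poll.h>

	struct cached { int fd; int offset; };

	static int find_entry(struct pollfd *ufds, int nfds, struct cached *c)
	{
		if (c->offset < nfds && ufds[c->offset].fd == c->fd)
			return c->offset;           /* table untouched: the cheap path */
		for (int i = 0; i < nfds; i++)  /* user rearranged the table: rescan */
			if (ufds[i].fd == c->fd)
				return c->offset = i;
		return -1;                      /* fd gone; the undefined-duplicates rule
		                                   means the first match is always fine */
	}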
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> Basically, the main loop would boil down to
>         for (;;) {
>                 static struct event ev_list[MAXEV];
>                 get_event(ev_list, MAXEV, &tmout);
>                 .. timeout handling here ..
>         }
>
> because get_event() would end up doing all the user-mode calls too (so
> "get_event()" is no longer a system call: it's a system call + a for-loop
> to call all the ID handler functions that were associated with the events
> that triggered).

Occurs to me that this stuff will be used from other languages than C, which might prefer to do their own event dispatching, and would want the system to treat event_fn as another opaque quantity. So I don't object to a function that dispatches events, e.g.

	int get_event(&tmout);

as long as there's also

	int get_events(ev_list, MAXEV, &tmout)

that treats event_fn as an opaque pointer and *does not* dispatch the events. These probably both want to be C library functions (rather than get_events being only a system call) on the chance that some operating systems will implement Linux emulation at the shared library level rather than the syscall level.

- Dan
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
> Basically, the main loop would boil down to
>
>         for (;;) {
>                 static struct event ev_list[MAXEV];
>                 get_event(ev_list, MAXEV, &tmout);
>                 .. timeout handling here ..
>         }
>
> because get_event() would end up doing all the user-mode calls too (so
> "get_event()" is no longer a system call: it's a system call + a for-loop
> to call all the ID handler functions that were associated with the events
> that triggered).
>
> So the "struct event" would just be:
>
>         struct event {
>                 int fd;
>                 unsigned long mask;
>                 void *opaque;
>                 void (*event_fn)(int fd, unsigned long mask, void *opaque);
>         };
>
> and there's no need for separate event queues, because the separate event
> queues have been completely subsumed by the fact that every single event
> has a separate event function.

OK, guess I buy the one-queue-to-bind-them-all argument. Might be good to pick more unique names than 'struct event' and 'get_event'. People probably use those already. Hiding ev_list is probably ok. However, http://www.citi.umich.edu/techreports/reports/citi-tr-00-7.pdf suggests that knowing how many events are pending is a useful measure of server load, and that if more than a certain number of events are pending, web servers should reject new connections. Thus it might be handy to make the return value of get_event be the number of events gotten.

> So now you'd start everything off (assuming the same kind of "listen to
> everything and react to it" server as in my previous example) by just
> setting
>
>         bind_event(sock, POLLIN, NULL, accept_fn);

A couple questions:

* how do you unbind? (By calling bind_event with NULL as the accept_fn?)
* how do you change a mask? (By calling bind_event with a new mask?)
* Is it ok to do these things while in an event_fn? (Yes?)
* Do you get an event whenever an fd is ready for something, or only when its readiness changes? (Presumably whenever an fd is ready for something?)

> (This is also the ideal event programming interface - signals get racy and
> hard to handle, while in the above example you can trivially just be
> single-threaded. Which doesn't mean that you CANNOT be multi-threaded if
> you want to: you multi-thread things by just having multiple threads that
> all call "get_event()" on their own).

I for one will be glad to not have to think of the races caused by the delay between signal queueing and signal pickup.

- Dan
Re: Linux's implementation of poll() not scalable?
At 10:39 PM 23/10/2000 -0700, Linus Torvalds wrote:
>First, let's see what is so nice about "select()" and "poll()". They do
>have one _huge_ advantage, which is why you want to fall back on poll()
>once the RT signal interface stops working. What is that?

RT methods are bad if they consume too many resources. SIGIO is a good example of this - the current overhead of passing events to user-space incurs both a spinlock and a memory copy of 512 bytes for each event. while it removes the requirement to "walk lists", the signal semantics in the kernel and the overhead of memory copies to userspace negate its performance a fair bit. that isn't to say that all "event-driven" methods are bad. in the past year, i've done many experiments at making SIGIO more efficient. some of these experiments include --

[1] 'aggregate' events. that is, if you've registered a POLL_IN, no need to register another POLL_IN. this was marginally successful, but ultimately still didn't scale.

[2] create a new interface for event delivery. for [2] i settled on a 16-byte structure sufficient to pass all of the relevant information:

	typedef struct zerocopy_buf {
		int       fd;
		short int cmd;
	#define ZEROCOPY_VALID_BUFFER 0xe1e2
		short int valid_buffer;
		void      *buf;     /* skbuff */
	#ifdef __KERNEL__
		volatile
	#endif
		struct zerocopy_buf *next;
	} zerocopy_buf_t;

so, we get down to 16 bytes per-event; these are allocated ... coupled with this was an interface whereby user-space could view kernel-space (via read-only mmap). in my case, this allowed for user-space to be able to read the above chain of zerocopy_buf events with no kernel-to-user memory copies. an ioctl on a character driver could ask the kernel to give it the head of the chain of the current zerocopy_buf structure. a similar ioctl() call allows it to pass a chain of instructions to the kernel (adding/removing events from notification) and other housekeeping. since user-space had read-only visibility into kernel memory address-space, one could then pick up skbuff's in userspace without the overhead of copies. ... and so-on. the above is a bit of a simplification of what goes on. using flip-buffers of queues, one can use this in multiple processes and be SMP-safe without the requirements for spinlocks or semaphores in the "fast path". solving the "walk the list of fd's" and "incur the overhead of memory copies" problems, tied in with network hardware capable of handling scatter/gather DMA and IP and TCP checksum calculations, i've more than doubled the performance of an existing application which depended on poll()-type behaviour. while i agree that it isn't necessarily a 'generic' interface, and won't necessarily appeal to everyone as the cure-all, the techniques used have removed two significant bottlenecks to high-network-performance i/o on tens-of-thousands of TCP sockets for an application we've been working on.

cheers, lincoln.
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Dan Kegel wrote:
>
> kqueue lets you associate an arbitrary integer with each event
> specification; the integer is returned along with the event.
> This is very handy for, say, passing the 'this' pointer of the
> object that should handle the event. Yes, you can simulate it
> with an array indexed by fd, but I like having it included.

I agree. I think the ID part is the most important part of the interfaces, because you definitely want to re-use the same functions over and over again - and you want to have some way other than just the raw fd to identify where the actual _buffers_ for that IO are, and what stage of the state machine we're in for that fd etc etc. Also, it's potentially important to have different "id"s for even the same fd - in some cases you want to have the same event handle both the read and the write part on an fd, but in other cases it might make more conceptual sense to separate out the read handling from the write handling, and instead of using "mask = POLLIN | POLLOUT", you'd just have two separate events, one with POLLIN and one with POLLOUT. This was what my "unsigned long id" was, but that is much too hard to use. See my expanded suggestion of it just a moment ago. (And yes, I'm sure you can do all this with kevents. I just abhor the syntax of those things).

Linus
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Dan Kegel wrote:
>
> http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html
> describes the FreeBSD kqueue interface for events:

I've actually read the BSD kevent stuff, and I think it's classic over-design. It's not easy to see what it's all about, and the whole tuple crap is just silly. Looks much too complicated. I don't believe in the "library" argument at all, and I think multiple event queues completely detract from the whole point of being simple to use and implement.

Now, I agree that my bind_event()/get_event() has limitations, and the biggest one is probably the event "id". It needs to be better, and it needs to have more structure. The "id" really should be something that not only contains the "fd", but also contains the actor function to be called, along with some opaque data for that function. In fact, if you take my example server, and move the "handle[id]()" call _into_ get_events() (and make the "handle[id]()" function pointer a part of the ID of the event), then the library argument goes away too: it doesn't matter _who_ calls the get_event() function, because the end result is always going to be the same regardless of whether it is called from within a library or from a main loop: it's going to call the function handler associated with the ID that triggered. Basically, the main loop would boil down to

	for (;;) {
		static struct event ev_list[MAXEV];
		get_event(ev_list, MAXEV, &tmout);
		.. timeout handling here ..
	}

because get_event() would end up doing all the user-mode calls too (so "get_event()" is no longer a system call: it's a system call + a for-loop to call all the ID handler functions that were associated with the events that triggered). So the "struct event" would just be:

	struct event {
		int fd;
		unsigned long mask;
		void *opaque;
		void (*event_fn)(int fd, unsigned long mask, void *opaque);
	};

and there's no need for separate event queues, because the separate event queues have been completely subsumed by the fact that every single event has a separate event function. So now you'd start everything off (assuming the same kind of "listen to everything and react to it" server as in my previous example) by just setting

	bind_event(sock, POLLIN, NULL, accept_fn);

which basically creates the event inside the kernel, and will pass it to the "__get_event()" system call through the event array, so the get_event() library function basically looks like

	int get_event(struct event *array, int maxevents, struct timeval *tv)
	{
		int nr = __get_event(array, maxevents, tv);
		int i;
		for (i = 0; i < nr; i++) {
			array->event_fn(array->fd, array->mask, array->opaque);
			array++;
		}
		return nr;
	}

and tell me why you'd want to have multiple event queues any more? (In fact, you might as well move the event array completely inside "get_event()", because nobody would be supposed to look at the raw array any more. So the "get_event()" interface would be even simpler: just a timeout, nothing more. Talk about simple programming.)

(This is also the ideal event programming interface - signals get racy and hard to handle, while in the above example you can trivially just be single-threaded. Which doesn't mean that you CANNOT be multi-threaded if you want to: you multi-thread things by just having multiple threads that all call "get_event()" on their own).

Linus
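To see how the pieces compose, here is a rough sketch of such a server written against the proposed calls; bind_event() and get_event() do not exist, their shapes are taken from the description above, and setup_listener() plus the per-connection state are made up for illustration:

	#include <poll.h>
	#include <stdlib.h>
	#include <sys/socket.h>
	#include <sys/time.h>

	/* proposed interface, not a real syscall */
	extern int bind_event(int fd, unsigned long mask, void *opaque,
	                      void (*fn)(int, unsigned long, void *));
	extern int get_event(struct timeval *tmout);  /* the dispatching wrapper above */
	extern int setup_listener(void);              /* hypothetical helper */

	static void read_fn(int fd, unsigned long mask, void *opaque)
	{
		/* opaque points at this connection's buffers / state machine */
	}

	static void accept_fn(int fd, unsigned long mask, void *opaque)
	{
		int conn = accept(fd, NULL, NULL);
		/* a library could bind its own events here without stepping
		   on ours: the handler travels with the binding */
		bind_event(conn, POLLIN, malloc(1024) /* per-connection state */, read_fn);
	}

	int main(void)
	{
		bind_event(setup_listener(), POLLIN, NULL, accept_fn);
		for (;;)
			get_event(NULL);    /* all work happens in the handlers */
	}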
Re: Linux's implementation of poll() not scalable?
Followup to: <[EMAIL PROTECTED]>
By author: Dave Zarzycki <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
>
> Maybe I'm missing something, but why do you separate out fd from the event
> structure. Why not just "int bind_event(struct event *event)"
>
> The only thing I might have done differently is allow for multiple event
> queues per process, and the ability to add event queues to event
> queues. But these features might not be all that useful in real life.

This could be useful if it doesn't screw up the implementation too much. Pretty much, what Linus is saying is the following: select()/poll() have one sizable cost, and that is to set up and destroy the set of events we want to trigger on. We would like to amortize this cost by making it persistent. It would definitely be useful for the user to have more than one such "event set" installed at any one time, so that you can call different wait_for_event() [or whatever] as appropriate. However, if that means we're doing lots of constructing and deconstructing in kernel space, then we probably didn't gain much. The other things I think we'd really like in a new interface is an interface where you can explicitly avoid the "storming hordes" problem -- if N servers are waiting for the same event, it should be at least possible to tell the kernel to only wake up one (arbitrarily chosen) of them, rather than all. Finally, it would be somewhat nice to have a unified interface for synchronous and asynchronous notification. This should be quite easily doable by adding a call event_notify(event_set, signal) that causes real-time signal "signal" to be raised (presumably with the event_set as the argument), when the specified event_set triggers.

-hpa

-- <[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private! "Unix gives you enough rope to shoot yourself in the foot." http://www.zytor.com/~hpa/puzzle.txt
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > [kqueue is] Pretty similar to yours, with the following additions: > > Your proposal seems to only have one stream of available events per > process. kqueue() returns a handle to an event queue, and kevent() > takes that handle as a first parameter. > > [kqueue] uses a single call to both bind (or unbind) an array of fd's > and pick up new events. Probably reduces overhead. > > [kqueue] allows you to watch not just sockets, but also plain files > or directories. (Hey, haven't we heard people talk about letting apps > do that under Linux? What interface does that use?) I forgot to mention: kqueue lets you associate an arbitrary integer with each event specification; the integer is returned along with the event. This is very handy for, say, passing the 'this' pointer of the object that should handle the event. Yes, you can simulate it with an array indexed by fd, but I like having it included. - Dan
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote:
> where you say "I want an array of pending events, and I have an array you
> can fill with up to 'maxnr' events - and if you have no events for me,
> please sleep until you get one, or until 'tmout'".
>
> The above looks like a _really_ simple interface to me. Much simpler than
> either select() or poll(). Agreed?

Totally. I've been wanting an API like this in user-space for a long time. One question about "int bind_event(int fd, struct event *event)" Maybe I'm missing something, but why do you separate out fd from the event structure. Why not just "int bind_event(struct event *event)" The only thing I might have done differently is allow for multiple event queues per process, and the ability to add event queues to event queues. But these features might not be all that useful in real life. Hmmm... It might be cute if you could do something like this:

	int num_of_resulting_events;
	int r, q = open("/dev/eventq", O_RDWR);
	struct event interested_events[1024];
	struct event events_that_happened[1024];

	/* fill up the interested_events array */

	write(q, interested_events, sizeof(interested_events));
	r = read(q, events_that_happened, sizeof(events_that_happened));
	num_of_resulting_events = r / sizeof(struct event);

davez

-- Dave Zarzycki http://thor.sbay.org/~dave/
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote:
> Here's a suggested "good" interface that would certainly be easy to
> implement, and very easy to use, with none of the scalability issues that
> many interfaces have.
...
> It boils down to one very simple rule: dense arrays of sticky status
> information is good. So let's design a good interface for a dense array.
>
> Basically, the perfect interface for events would be
>
>         struct event {
>                 unsigned long id;    /* file descriptor ID the event is on */
>                 unsigned long event; /* bitmask of active events */
>         };
>
>         int get_events(struct event * event_array, int maxnr, struct timeval *tmout);
>
>         int bind_event(int fd, struct event *event);

http://www.FreeBSD.org/cgi/man.cgi?query=kqueue&apropos=0&sektion=0&manpath=FreeBSD+5.0-current&format=html describes the FreeBSD kqueue interface for events:

	struct kevent {
		uintptr_t ident;   /* identifier for this event */
		short     filter;  /* filter for event */
		u_short   flags;   /* action flags for kqueue */
		u_int     fflags;  /* filter flag value */
		intptr_t  data;    /* filter data value */
		void      *udata;  /* opaque user data identifier */
	};

	int kqueue(void);

	int kevent(int kq, const struct kevent *changelist, int nchanges,
	           struct kevent *eventlist, int nevents,
	           const struct timespec *timeout);

Pretty similar to yours, with the following additions: Your proposal seems to only have one stream of available events per process. kqueue() returns a handle to an event queue, and kevent() takes that handle as a first parameter. Their proposal uses a single call to both bind (or unbind) an array of fd's and pick up new events. Probably reduces overhead. Their proposal allows you to watch not just sockets, but also plain files or directories. (Hey, haven't we heard people talk about letting apps do that under Linux? What interface does that use?)

> The really nice part of the above is that it's trivial to implement. It's
> about 50 lines of code, plus some simple logic to various drivers etc to
> actually inform about the events. The way to do this simply is to limit it
> in very clear ways, the most notable one being simply that there is only
> one event queue per process (or rather, per "struct files_struct" - so
> threads would automatically share the event queue). This keeps the
> implementation simple, but it's also what keeps the interfaces simple: no
> queue ID's to pass around etc.

I dislike having only one event queue per process. Makes it hard to use in a library. ("But wait, *I* was going to use bind_event()!")

> Advantage: everything is O(1), except for "get_event()" which is O(n)
> where 'n' is the number of active events that it fills in.

This is a big win, and is exactly the payoff that Solaris gets with /dev/poll.

> Example "server":

See http://www.monkeys.com/kqueue/echo.c for a very similar example server for kqueue() / kevent().

> You get the idea. Very simple, and looks like it should perform quite
> admirably. With none of the complexities of signal handling, and none of
> the downsides of select() or poll(). Both of the new system calls would be
> on the order of 20-30 lines of code (along with the infrastructure to
> extend the fasync stuff to also be able to handle events)

Go, Linus, go! Let's do it! But don't go *too* simple. I really would like a few of the things kqueue has.

> Yes, I'm probably missing something.
> But this is the kind of thing I think we should look at (and notice that
> you can _emulate_ this with poll(), so you can use this kind of interface
> even on systems that wouldn't support these kinds of events natively).

- Dan

p.s. my collection of notes on kqueue is at http://www.kegel.com/c10k.html#nb.kqueue
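For comparison, the corresponding kqueue calls look roughly like this, per the FreeBSD man page quoted above; conn_state is the per-connection pointer riding in the udata field Dan likes:

	#include <sys/types.h>
	#include <sys/event.h>
	#include <sys/time.h>

	void watch_read(int kq, int fd, void *conn_state)
	{
		struct kevent change;
		EV_SET(&change, fd, EVFILT_READ, EV_ADD, 0, 0, conn_state);
		kevent(kq, &change, 1, NULL, 0, NULL);  /* submit change, collect nothing */
	}

	int collect(int kq, struct kevent *out, int max)
	{
		struct timespec tmout = { 1, 0 };
		/* a single kevent() call can both submit a changelist and pick up
		   events; out[i].udata carries each connection's pointer back */
		return kevent(kq, NULL, 0, out, max, &tmout);
	}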
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Jordan Mendelson wrote:
> What you describe is exactly what the /dev/poll interface patch from the
> Linux scalability project does.
>
> It creates a special device which you can open up and write
> add/remove/modify entries you wish to be notified of using the standard
> struct pollfd. Removing entries is done by setting the events in a
> struct written to the device to POLLREMOVE.

And that's an ugly crap.

* you add struct {} into the kernel API. _Always_ a bad idea.
* you either create yet another example of "every open() gives a new instance" kind of device or you got to introduce a broker process.
* no easy way to check the current set.
* no fscking way to use that from scripts/etc.

> You can optionally mmap() memory which the notifications are written to.
> Two ioctl() calls are provided for the initial allocation and also to
> force it to check all items in your poll() list.

* useless use of ioctl() award

> Solaris has this same interface minus the mmap()'ed memory.

Oh, yes. Solaris. Great example of good taste...
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Linus Torvalds wrote:
> > What is your favourite interface then ?
>
> I suspect a good interface that can easily be done efficiently would
> basically be something where the user _does_ do the equivalent of a
> read-only mmap() of poll entries - and explicit and controlled
> "add_entry()" and "remove_entry()" controls, so that the kernel can
> maintain the cache without playing tricks.

Actually, forget the mmap, it's not needed. Here's a suggested "good" interface that would certainly be easy to implement, and very easy to use, with none of the scalability issues that many interfaces have.

First, let's see what is so nice about "select()" and "poll()". They do have one _huge_ advantage, which is why you want to fall back on poll() once the RT signal interface stops working. What is that? Basically, RT signals or any kind of event queue has a major fundamental queuing theory problem: if you have events happening really quickly, the events pile up, and queuing theory tells you that as you start having queueing problems, your latency increases, which in turn tends to mean that later events are even more likely to queue up, and you end up in a nasty meltdown scenario where your queues get longer and longer. This is why RT signals suck so badly as a generic interface - clearly we cannot keep sending RT signals forever, because we'd run out of memory just keeping the signal queue information around. Neither poll() nor select() have this problem: they don't get more expensive as you have more and more events - their expense is the number of file descriptors, not the number of events per se. In fact, both poll() and select() tend to perform _better_ when you have pending events, as they are both amenable to optimizations when there is no need for waiting, and scanning the arrays can use early-out semantics. So sticky arrays of events are good, while queues are bad. Let's take that as one of the fundamentals. So why do people still like RT signals? They do have one advantage, which is that you do NOT have that silly array traversal when there is nothing to do. Basically, the RT signals kind of approach is really good for the cases where select() and poll() suck: no need to traverse mostly empty and non-changing arrays all the time. It boils down to one very simple rule: dense arrays of sticky status information is good. So let's design a good interface for a dense array. Basically, the perfect interface for events would be

	struct event {
		unsigned long id;    /* file descriptor ID the event is on */
		unsigned long event; /* bitmask of active events */
	};

	int get_events(struct event * event_array, int maxnr, struct timeval *tmout);

where you say "I want an array of pending events, and I have an array you can fill with up to 'maxnr' events - and if you have no events for me, please sleep until you get one, or until 'tmout'". The above looks like a _really_ simple interface to me. Much simpler than either select() or poll(). Agreed?

Now, we still need to inform the kernel of what kind of events we want, ie the "binding" of events. The most straightforward way to do that is to just do a simple "bind_event()" system call:

	int bind_event(int fd, struct event *event);

which basically says: I'm interested in the events in "event" on the file descriptor "fd". The way to stop being interested in events is to just set the event bitmask to zero. Now, the perfect interface would be the above. Nothing more. Nothing fancy, nothing complicated. Only really simple stuff.
Remember the old rule: "keep it simple, stupid".

The really nice part of the above is that it's trivial to implement. It's about 50 lines of code, plus some simple logic to various drivers etc to actually inform about the events. The way to do this simply is to limit it in very clear ways, the most notable one being simply that there is only one event queue per process (or rather, per "struct files_struct" - so threads would automatically share the event queue). This keeps the implementation simple, but it's also what keeps the interfaces simple: no queue ID's to pass around etc.

Implementation is straightforward: the event queue basically consists of

- a queue head in "struct files_struct", initially empty.

- doing a "bind_event()" basically adds a fasync entry to the file structure, but rather than cause a signal, it just looks at whether the fasync entry is already linked into the event queue, and if it isn't (and the event is one of the ones in the event bitmask), it adds itself to the event queue.

- get_event() just traverses the event queue and fills in the array, removing them from the event queue. End of story. If the event queue is empty, it trivially sees that in a single line of code (+ timeout handling)

Advantage: everything is O(1), except for "get_event()" which is O(n) where 'n' is the number of active events that it fills in.
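A minimal sketch of a server loop against the two proposed calls, to make the shape concrete; get_events() and bind_event() are the proposal itself, while MAXEV, listen_socket() and handle() are illustrative stand-ins:

	#include <poll.h>
	#include <sys/time.h>

	#define MAXEV 128

	struct event {
		unsigned long id;    /* file descriptor ID the event is on */
		unsigned long event; /* bitmask of active events */
	};

	/* proposed interface, not real syscalls */
	extern int get_events(struct event *event_array, int maxnr, struct timeval *tmout);
	extern int bind_event(int fd, struct event *event);

	extern int listen_socket(void);                            /* hypothetical */
	extern void handle(unsigned long id, unsigned long mask);  /* hypothetical dispatch */

	int main(void)
	{
		int sock = listen_socket();
		struct event ev = { sock, POLLIN };
		bind_event(sock, &ev);   /* sticky: never needs requeueing */

		for (;;) {
			struct event ev_list[MAXEV];
			int i, n = get_events(ev_list, MAXEV, NULL);  /* NULL: wait forever */
			for (i = 0; i < n; i++)
				handle(ev_list[i].id, ev_list[i].event);
		}
	}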
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote:
>
> On Tue, 24 Oct 2000, Andi Kleen wrote:
> >
> > I don't see the problem. You have the poll table allocated in the kernel,
> > the drivers directly change it and the user mmaps it (I was not proposing
> > to let poll make a kiobuf out of the passed array)
>
> The problem with poll() as-is is that the user doesn't really tell the
> kernel explicitly when it is changing the table..

What you describe is exactly what the /dev/poll interface patch from the Linux scalability project does. It creates a special device which you can open up and write add/remove/modify entries you wish to be notified of using the standard struct pollfd. Removing entries is done by setting the events in a struct written to the device to POLLREMOVE. You can optionally mmap() memory which the notifications are written to. Two ioctl() calls are provided for the initial allocation and also to force it to check all items in your poll() list. Solaris has this same interface minus the mmap()'ed memory.

Jordan
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote:
>
> I don't see the problem. You have the poll table allocated in the kernel,
> the drivers directly change it and the user mmaps it (I was not proposing
> to let poll make a kiobuf out of the passed array)

That's _not_ how poll() works at all. We don't _have_ a poll table in the kernel, and no way to mmap it. The poll() tables gets created dynamically based on the stuff that the user has set up in the table. And the user can obviously change the fd's etc in the table directly, so in order for the caching to work you need to do various games with page table dirty or writable bits, or at least test dynamically whether the poll table is the same as it was before. Sure, it's doable, and apparently Solaris does something like this. But what _is_ the overhead of the Solaris code for small number of fd's? I bet it really is quite noticeable. I also suspect it is very optimized toward an unchanging poll-table.

> What is your favourite interface then ?

I suspect a good interface that can easily be done efficiently would basically be something where the user _does_ do the equivalent of a read-only mmap() of poll entries - and explicit and controlled "add_entry()" and "remove_entry()" controls, so that the kernel can maintain the cache without playing tricks. Basically, something like a user interface to something that looks like the linux poll_table_page structures, with the difference being that it doesn't have to be created and torn down all the time because the user would explicitly ask for "add this fd" and "remove this fd" from the table. The problem with poll() as-is is that the user doesn't really tell the kernel explicitly when it is changing the table..

Linus
Re: Linux's implementation of poll() not scalable?
Nick Piggin ([EMAIL PROTECTED]) wrote:
> > I'm trying to write a server that handles 10000 clients. On 2.4.x,
> > the RT signal queue stuff looks like the way to achieve that.
>
> I would suggest you try multiple polling threads. Not only will you get
> better SMP scalability, if you have say 16 threads, each one only has to
> handle ~ 600 fds.

Good point. My code is already able to use multiple network threads, and I have done what you suggest in the past. But I'm interested in pushing the state of the art here, and want to see if Linux can handle it with just a single network thread. (My server has enough non-network threads to keep multiple CPUs busy, don't worry :-)

- Dan
Re: Linux's implementation of poll() not scalable?
David Schwartz wrote:
> > I'm trying to write a server that handles 10000 clients. On 2.4.x,
> > the RT signal queue stuff looks like the way to achieve that.
> > Unfortunately, when the RT signal queue overflows, the consensus seems
> > to be that you fall back to a big poll(). And even though the RT signal
> > queue [almost] never overflows, it certainly can, and servers have to be
> > able to handle it.
>
> Don't let that bother you. In the case where you get a hit on a significant
> fraction of the descriptors you are polling on, poll is very efficient. The
> inefficiency comes when you have to wade through 10,000 uninteresting file
> descriptors to find the one interesting one. If the poll set is rich in
> ready descriptors, there is little advantage to signal queues over poll
> itself.
>
> In fact, if you assume the percentage of ready file descriptors (as opposed
> to the number of file descriptors) is constant, then poll is just as
> scalable (theoretically) as any other method. Under both schemes, with twice
> as many file descriptors you have to do twice as much work.

Yep, I've made similar arguments myself. It's just that seeing poll() take 14 milliseconds to return on a 650 MHz system is a little daunting. I'll report again when I have results for RT signal stuff and different percentages of idle sockets (probably 0, 1, and 10).

- Dan
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote:
>
> Jordan Mendelson ([EMAIL PROTECTED]) wrote:
> > An implementation of /dev/poll for Linux already exists and has shown to
> > be more scalable than using RT signals under my tests. A patch for 2.2.x
> > and 2.4.x should be available at the Linux Scalability Project @
> > http://www.citi.umich.edu/projects/linux-scalability/ in the patches
> > section.
>
> If you'll look at the page I linked to in my original post,
> http://www.kegel.com/dkftpbench/Poller_bench.html
> you'll see that I also benchmarked /dev/poll.

The Linux /dev/poll implementation has a few "non-standard" features such as the ability to mmap() the poll structure memory to eliminate a memory copy:

	int dpoll_fd;
	unsigned char *dpoll;
	struct pollfd *mmap_dpoll;

	dpoll_fd = open("/dev/poll", O_RDWR, 0);
	ioctl(dpoll_fd, DP_ALLOC, 10000);

	dpoll = mmap(0, DP_MMAP_SIZE(10000), PROT_WRITE|PROT_READ, MAP_SHARED, dpoll_fd, 0);
	mmap_dpoll = (struct pollfd *)dpoll;

Use this memory when reading and write() to add/remove and see if you get any boost in performance. Also, I believe there is a hash table associated with /dev/poll in the kernel patch which might slow down your performance tests when it's first growing to resize itself.

Jordan
RE: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 10,000 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. > Unfortunately, when the RT signal queue overflows, the consensus seems > to be that you fall back to a big poll(). And even though the RT signal > queue [almost] never overflows, it certainly can, and servers have to be > able to handle it. Don't let that bother you. In the case where you get a hit on a significant fraction of the descriptors you are polling on, poll is very efficient. The inefficiency comes when you have to wade through 10,000 uninteresting file descriptors to find the one interesting one. If the poll set is rich in ready descriptors, there is little advantage to signal queues over poll itself. In fact, if you assume the percentage of ready file descriptors (as opposed to the number of file descriptors) is constant, then poll is just as scalable (theoretically) as any other method. Under both schemes, with twice as many file descriptors you have to do twice as much work. DS
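[A toy cost model (my notation, not David's) makes the constant-fraction argument precise: if one poll() over $n$ descriptors costs $t(n) = a + cn$ and a fixed fraction $p$ of them are ready, each call retires $pn$ events, so the cost per serviced event is

\[ \frac{t(n)}{pn} = \frac{a + cn}{pn} = \frac{a}{pn} + \frac{c}{p} \longrightarrow \frac{c}{p} \qquad (n \to \infty), \]

a constant: doubling the descriptors doubles the scan, but also doubles the events retired per call.]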
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 06:42:39PM -0700, Linus Torvalds wrote: > > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > Also with the poll table mmap'ed via /dev/poll and the optimizations I > > described poll could be made quite nice (I know that queued SIGIO exists, > > but it has its drawbacks too and often you need to fall back to poll anyway) > > The problem is that your proposed optimizations would be horrible: the > poll events themselves happen when some other process is running, so you'd > have to do some serious VM hacking and consider the poll table to be some > kind of direct-IO thing etc. Ugh. I don't see the problem. You have the poll table allocated in the kernel, the drivers directly change it, and the user mmaps it (I was not proposing to let poll make a kiobuf out of the passed array). > > And the file->fd mapping is not a simple mapping, it's a 1:m mapping from > file -> potentially many [process,fd] pairs. Yes, but in 95% of all cases it is a 1:1 mapping, so you could just fall back to the old slow method when that isn't the case. > > Doing the caching across multiple poll-calls is even worse, you'd have to > cache where in user space people had the array etc. Not pretty. The file -> fdnum reverse table does not depend on where the array sits in user space. On every poll call you can walk the poll table and fix the pointer from file to poll entry. That can be done cheaply during copyin for normal poll. mmap'ed /dev/poll could probably use a lazy method (cache the pointers, and verify and walk again if the verify fails). > I'm sure it can be speeded up, I'm just not sure it's really worth it if > you'd instead just have a better interface that wouldn't need this crap, > and consider poll() as just a compatibility layer. What is your favourite interface then? /dev/poll is nice in that it can be relatively easily hacked into older programs. -Andi
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > Also with the poll table mmap'ed via /dev/poll and the optimizations I > described poll could be made quite nice (I know that queued SIGIO exists, > but it has its drawbacks too and often you need to fall back to poll anyway) The problem is that your proposed optimizations would be horrible: the poll events themselves happen when some other process is running, so you'd have to do some serious VM hacking and consider the poll table to be some kind of direct-IO thing etc. Ugh. And the file->fd mapping is not a simple mapping, it's a 1:m mapping from file -> potentially many [process,fd] pairs. Doing the caching across multiple poll-calls is even worse, you'd have to cache where in user space people had the array etc. Not pretty. I'm sure it can be speeded up, I'm just not sure it's really worth it if you'd instead just have a better interface that wouldn't need this crap, and consider poll() as just a compatibility layer. Linus
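[A concrete userspace illustration of the 1:m point (my example, not Linus's): after dup() and fork(), one kernel struct file is reachable from several [process, fd] pairs, so a wakeup on the file alone does not identify a unique fd to report.]

    #include <unistd.h>

    int main(void)
    {
        int a = dup(0);      /* [parent,0] and [parent,a]: one struct file */

        if (fork() == 0) {
            /* now also [child,0] and [child,a] -- four [process,fd]
               pairs all sharing the same struct file */
            _exit(0);
        }
        return 0;
    }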
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 06:16:24PM -0700, Linus Torvalds wrote: > > > On Tue, 24 Oct 2000, Andi Kleen wrote: > > > > It would be possible to setup a file -> fdnum reverse table (possibly cached > > over poll calls, I think Solaris does that) and let the async events directly > > change the bits in the output buffer in O(1). > > I disagree. > > Let's just face it, poll() is a bad interface scalability-wise. It is, but Linux poll is far from being a good implementation even within the limitations of the interface, as the comparison with Solaris shows. It is inherently O(n), but the constant factor in Linux could certainly be improved. Also with the poll table mmap'ed via /dev/poll and the optimizations I described, poll could be made quite nice (I know that queued SIGIO exists, but it has its drawbacks too, and often you need to fall back to poll anyway). > > > Also the current 2.4 poll is very wasteful both in memory and cycles > > for small numbers of fd. > > Yes, we could go back to the "optimize the case of n < 8" thing. Which > should be particularly easy with the new "poll_table_page" setup: the > thing is more abstracted than it used to be in 2.2. I did it for n < 40. It also puts the wait queues on the stack, so it is even more efficient than 2.0. I attached the patch for your reference; it is against test3 or so (but IIRC select/poll has not changed after that). I'm not proposing it for inclusion right now. It does only the easy parts, of course; I think with some VFS extensions it could be made a lot better. BTW, the Linux select can get a lower constant factor too, by simply testing whole words of the FD_SET for being zero and skipping them quickly (that makes it a lot cheaper when you have bigger holes in the select fd space). The patch does that too. -Andi

--- linux/fs/select.c	Mon Jul 24 07:39:44 2000
+++ linux-work/fs/select.c	Sat Aug 5 17:23:18 2000
@@ -12,6 +12,9 @@
  * 24 January 2000
  *     Changed sys_poll()/do_poll() to use PAGE_SIZE chunk-based allocation
  *     of fds to overcome nfds < 16390 descriptors limit (Tigran Aivazian).
+ *
+ * July 2000
+ *     Add fast poll/select (Andi Kleen)
  */
 
 #include <linux/slab.h>
@@ -24,6 +27,8 @@
 #define ROUND_UP(x,y) (((x)+(y)-1)/(y))
 #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)
 
+#define __MAX_POLL_TABLE_ENTRIES ((PAGE_SIZE - sizeof (struct poll_table_page)) / \
+		sizeof (struct poll_table_entry))
+
 struct poll_table_entry {
 	struct file * filp;
 	wait_queue_t wait;
@@ -32,42 +37,43 @@
 
 struct poll_table_page {
 	struct poll_table_page * next;
-	struct poll_table_entry * entry;
+	int nr, max;
 	struct poll_table_entry entries[0];
 };
 
-#define POLL_TABLE_FULL(table) \
-	((unsigned long)((table)->entry+1) > PAGE_SIZE + (unsigned long)(table))
-
 /*
- * Ok, Peter made a complicated, but straightforward multiple_wait() function.
- * I have rewritten this, taking some shortcuts: This code may not be easy to
- * follow, but it should be free of race-conditions, and it's practical. If you
- * understand what I'm doing here, then you understand how the linux
- * sleep/wakeup mechanism works.
- *
- * Two very simple procedures, poll_wait() and poll_freewait() make all the
- * work. poll_wait() is an inline-function defined in <linux/poll.h>,
- * as all select/poll functions have to call it to add an entry to the
- * poll table.
+ * Tune fast poll/select. Limit is the kernel stack.
  */
+#define FAST_SELECT_LIMIT 40
+#define FAST_POLL_LIMIT 40
+
+#define FAST_SELECT_BYTES ((FAST_SELECT_LIMIT + 7) / 8)
+#define FAST_SELECT_ULONG \
+	((FAST_SELECT_BYTES + sizeof(unsigned long) - 1) / sizeof(unsigned long))
+
+static inline void do_freewait(struct poll_table_page *p)
+{
+	struct poll_table_entry * entry;
+	entry = p->entries + p->nr;
+	while (p->nr > 0) {
+		p->nr--;
+		entry--;
+		remove_wait_queue(entry->wait_address, &entry->wait);
+		fput(entry->filp);
+	}
+}
 
 void poll_freewait(poll_table* pt)
 {
+	struct poll_table_page *old;
 	struct poll_table_page * p = pt->table;
 	while (p) {
-		struct poll_table_entry * entry;
-		struct poll_table_page *old;
-
-		entry = p->entry;
-		do {
-			entry--;
-			remove_wait_queue(entry->wait_address,&entry->wait);
-			fput(entry->filp);
-		} while (entry > p->entries);
 		old = p;
+		do_freewait(p);
 		p = p->next;
-		free_page((unsigned long) old);
+		if (old->max == __MAX_POLL_TABLE_ENTRIES) {
+			free_page((unsigned long) old);
+		}
 	}
 }
 
@@ -75,7 +81,7 @@
 {
 	struct poll_table_page *table = p->table;
 
-	if (!table || POLL_TABLE_FULL(table)) {
+	if (!table ||
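[The patch is truncated here in the archive. As an aside, an illustrative userspace rendering of the FD_SET word-skipping trick Andi describes above; the patch does the equivalent inside the kernel. The cast assumes glibc's layout of fd_set as an array of longs, and for_each_set_fd is an invented name:]

    #include <sys/select.h>
    #include <limits.h>

    #define LONG_BITS (sizeof(unsigned long) * CHAR_BIT)

    /* Call fn(fd) for every fd set in 'set', looking at individual bits
     * only inside non-zero words. */
    static void for_each_set_fd(fd_set *set, int maxfd, void (*fn)(int))
    {
        unsigned long *words = (unsigned long *) set;
        int nwords = (maxfd + LONG_BITS - 1) / LONG_BITS;
        int w, b;

        for (w = 0; w < nwords; w++) {
            unsigned long word = words[w];
            if (word == 0)
                continue;                 /* skip 32/64 idle fds at once */
            for (b = 0; b < (int) LONG_BITS; b++)
                if (word & (1UL << b))
                    fn(w * (int) LONG_BITS + b);
        }
    }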
Re: Linux's implementation of poll() not scalable?
> I'm trying to write a server that handles 10,000 clients. On 2.4.x, > the RT signal queue stuff looks like the way to achieve that. I would suggest you try multiple polling threads. Not only will you get better SMP scalability, if you have say 16 threads, each one only has to handle ~ 600 fds. Nick.
Re: Linux's implementation of poll() not scalable?
On Tue, 24 Oct 2000, Andi Kleen wrote: > > It would be possible to setup a file -> fdnum reverse table (possibly cached > over poll calls, I think Solaris does that) and let the async events directly > change the bits in the output buffer in O(1). I disagree. Let's just face it, poll() is a bad interface scalability-wise. If you want to do efficient events, you should have some other approach, like an event queue. Yes, I know it's a dirty word because NT uses it, but let's face it, poll() was a hack to make it easier to do something select()-like with the same machinery as select. > Also the current 2.4 poll is very wasteful both in memory and cycles > for small numbers of fd. Yes, we could go back to the "optimize the case of n < 8" thing. Which should be particularly easy with the new "poll_table_page" setup: the thing is more abstracted than it used to be in 2.2. Linus
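[To make "some other approach, like an event queue" concrete, a purely hypothetical sketch of such an interface; none of these evq_* calls exist anywhere. The shape is roughly that of FreeBSD's kqueue, which was already shipping at the time, and of the epoll API Linux grew later:]

    /* Register interest once; afterwards dequeue only the fds that
     * changed, so waiting is O(ready), not O(interest set). */
    struct ev {
        int fd;             /* which descriptor changed */
        short events;       /* POLLIN/POLLOUT-style mask that became true */
    };

    int evq_create(void);                       /* -> a queue descriptor   */
    int evq_add(int qfd, int fd, short events); /* register interest, once */
    int evq_del(int qfd, int fd);               /* drop interest           */
    int evq_wait(int qfd, struct ev *out,       /* fills only changed fds  */
                 int max, int timeout_ms);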
Re: Linux's implementation of poll() not scalable?
Jordan Mendelson ([EMAIL PROTECTED]) wrote: > An implementation of /dev/poll for Linux already exists and has been shown to > be more scalable than using RT signals under my tests. A patch for 2.2.x > and 2.4.x should be available at the Linux Scalability Project @ > http://www.citi.umich.edu/projects/linux-scalability/ in the patches > section. If you'll look at the page I linked to in my original post, http://www.kegel.com/dkftpbench/Poller_bench.html you'll see that I also benchmarked /dev/poll. When finding 1 active fd among 10,000, the Solaris implementation creamed the living snot out of the Linux one, even though the Solaris hardware was 5 or so times slower (see the lmbench results on that page). Here are the numbers (times in microseconds):

On a 167MHz sun4u Sparc Ultra-1 running SunOS 5.7 (Solaris 7) Generic_106541-11:

  pipes       100   1000   10000
  select      151      -       -
  poll        470    676    3742
  /dev/poll    61     70      92

On an idle 650 MHz dual Pentium III running Red Hat Linux 6.2 with kernel 2.2.14smp plus the /dev/poll patch:

  pipes       100   1000   10000
  select       28      -       -
  poll         23    890   11333
  /dev/poll    19    146    4264

On the same machine as above, but with vanilla kernel 2.4.0-test10-pre4 smp:

  pipes       100   1000   10000
  select       52      -       -
  poll         49   1184   14660

> It works fairly well, but I was actually somewhat disappointed to find > that it wasn't the primary cause for the system CPU suckage for my > particular system. Granted, when you only have to poll a few times per > second, the overhead of standard poll() just isn't that bad. If you have to poll 100 or fewer sockets, Linux's poll is quite good. If you have to poll 1000 or so sockets, Linux's /dev/poll works well (I wish it were available for the current kernels). But if you have to poll 10,000 or so sockets on Linux, neither standard poll nor /dev/poll as currently implemented performs adequately. I suspect that RT signals on Linux will do much better for N=10,000 than the current /dev/poll, and hope to benchmark that soon. - Dan
Re: Linux's implementation of poll() not scalable?
On Mon, Oct 23, 2000 at 11:14:35AM -0700, Linus Torvalds wrote: > [ Small treatise on "scalability" included. People obviously do not > understand what "scalability" really means. ] > > In article <[EMAIL PROTECTED]>, > Dan Kegel <[EMAIL PROTECTED]> wrote: > >I ran a benchmark to see how long a call to poll() takes > >as you increase the number of idle fd's it has to wade through. > >I used socketpair() to generate the fd's. > > > >Under Solaris 7, when the number of idle sockets was increased from > >100 to 10,000, the time to check for active sockets with poll() > >increased by a factor of only 6.5. That's a sublinear increase in time, > >pretty spiffy. > > Yeah. It's pretty spiffy. > > Basically, poll() is _fundamentally_ an O(n) interface. There is no way > to avoid it - you have an array, and there simply is _no_ known > algorithm to scan an array in faster than O(n) time. Sorry. One problem in Linux is that it scans multiple times. At least 4 times currently: copyin, setup of wait queues, ask for results, copyout. copyin and copyout are relatively cheap (and could be made even cheaper with /dev/poll); the problem is the two other passes, which involve a lot of function pointer calls that generally cause pipeline stalls in modern CPUs. It would be possible to set up a file -> fdnum reverse table (possibly cached over poll calls, I think Solaris does that) and let the async events directly change the bits in the output buffer in O(1). This would save one pass. It may also be possible to cache the wait queue setup over polls; this would make poll much cheaper in terms of cache lines used. Also the current 2.4 poll is very wasteful both in memory and cycles for small numbers of fd. I did some experiments with a poll fast path for small n by falling back to the 2.0 stack allocation method, and it decreased latency dramatically (by more than 30%). It also saves a lot of memory, because all the daemons only polling a few fds wouldn't use two additional pages on top of their stack page. [patches are available, I just didn't want to submit them during code freeze] -Andi
Re: Linux's implementation of poll() not scalable?
Dan Kegel wrote: > > Linus Torvalds wrote: > > Dan Kegel wrote: > >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ] > >> [ With only one active fd and N idle ones, poll's execution time scales > >> [ as 6N on Solaris, but as 300N on Linux. ] > > > > Basically, poll() is _fundamentally_ an O(n) interface. There is no way > > to avoid it - you have an array, and there simply is _no_ known > > algorithm to scan an array in faster than O(n) time. Sorry. > > ... > > Under Linux, I'm personally more worried about the performance of X etc, > > and small poll()'s are actually common. So I would argue that the > > Solaris scalability is going the wrong way. But as performance really > > depends on the load, and maybe that 10,000 entry load is what you > > consider "real life", you are of course free to disagree (and you'd be > > equally right ;) > The way I'm implementing RT signal support is by writing a userspace > wrapper to make it look like an OO version of poll(), more or less, > with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's. > When and if I get that working, I may move it into the kernel as an > implementation of /dev/poll -- and then I won't need to worry about > the RT signal queue overflowing anymore, and I won't care how scalable > poll() is. An implementation of /dev/poll for Linux already exists and has been shown to be more scalable than using RT signals under my tests. A patch for 2.2.x and 2.4.x should be available at the Linux Scalability Project @ http://www.citi.umich.edu/projects/linux-scalability/ in the patches section. It works fairly well, but I was actually somewhat disappointed to find that it wasn't the primary cause for the system CPU suckage for my particular system. Granted, when you only have to poll a few times per second, the overhead of standard poll() just isn't that bad. Jordan
Re: Linux's implementation of poll() not scalable?
Linus Torvalds wrote: > Dan Kegel wrote: >> [ http://www.kegel.com/dkftpbench/Poller_bench.html ] >> [ With only one active fd and N idle ones, poll's execution time scales >> [ as 6N on Solaris, but as 300N on Linux. ] > > Basically, poll() is _fundamentally_ an O(n) interface. There is no way > to avoid it - you have an array, and there simply is _no_ known > algorithm to scan an array in faster than O(n) time. Sorry. > ... > Under Linux, I'm personally more worried about the performance of X etc, > and small poll()'s are actually common. So I would argue that the > Solaris scalability is going the wrong way. But as performance really > depends on the load, and maybe that 10,000 entry load is what you > consider "real life", you are of course free to disagree (and you'd be > equally right ;) I don't think I was being as clueless as you feared. Solaris' poll() somehow manages to skip inactive fd's very efficiently. But I'm happy to agree that small poll()'s are very important. I'd prefer to never use big poll()'s. The RT signal stuff scales much better. I'm trying to write a server that handles 10,000 clients. On 2.4.x, the RT signal queue stuff looks like the way to achieve that. Unfortunately, when the RT signal queue overflows, the consensus seems to be that you fall back to a big poll(). And even though the RT signal queue [almost] never overflows, it certainly can, and servers have to be able to handle it. The way I'm implementing RT signal support is by writing a userspace wrapper to make it look like an OO version of poll(), more or less, with an 'add(int fd)' method so the wrapper manages the arrays of pollfd's. When and if I get that working, I may move it into the kernel as an implementation of /dev/poll -- and then I won't need to worry about the RT signal queue overflowing anymore, and I won't care how scalable poll() is. - Dan
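[A minimal sketch of the RT-signal machinery Dan is wrapping, using the documented 2.4-era fcntl interface (F_SETOWN / F_SETSIG / O_ASYNC); handle_event and poll_everything are placeholders for the application's work and for the big-poll() fallback:]

    #define _GNU_SOURCE         /* F_SETSIG is Linux-specific */
    #include <fcntl.h>
    #include <signal.h>
    #include <time.h>
    #include <unistd.h>

    #define EVSIG (SIGRTMIN + 1)

    /* Put one socket into queued-signal, non-blocking async mode. */
    static void arm_fd(int fd)
    {
        int fl = fcntl(fd, F_GETFL);
        fcntl(fd, F_SETOWN, getpid());        /* route the signal to us     */
        fcntl(fd, F_SETSIG, EVSIG);           /* queue an RT signal, with   */
                                              /* si_fd/si_band filled in    */
        fcntl(fd, F_SETFL, fl | O_NONBLOCK | O_ASYNC);
    }

    static void event_loop(void)
    {
        sigset_t sigs;
        siginfo_t info;
        struct timespec ts = { 1, 0 };

        sigemptyset(&sigs);
        sigaddset(&sigs, EVSIG);
        sigaddset(&sigs, SIGIO);              /* plain SIGIO = queue overflowed */
        sigprocmask(SIG_BLOCK, &sigs, NULL);  /* collect them synchronously    */

        for (;;) {
            int sig = sigtimedwait(&sigs, &info, &ts);
            if (sig == EVSIG) {
                /* info.si_fd names the socket, info.si_band is a poll mask */
                /* handle_event(info.si_fd, info.si_band);  -- placeholder  */
            } else if (sig == SIGIO) {
                /* The RT queue overflowed and events were lost; the only
                   safe recovery is one big poll() over everything. */
                /* poll_everything();  -- placeholder */
            }
        }
    }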
Re: Linux's implementation of poll() not scalable?
On Mon, 23 Oct 2000, Tobias Ringstrom wrote: > On 23 Oct 2000, Linus Torvalds wrote: > > > Either > > > > (a) Solaris has solved the faster-than-light problem, and Sun engineers > > should get a Nobel prize in physics or something. > > > > (b) Solaris "scales" by being optimized for 10,000 entries, and not > > speeding up sufficiently for a small number of entries. > > > > You make the call. > > You will probably get the 6.5 factor because you have some (big) constant > setup time. For example > > t = 1700 + n > > This gives you an increase of 6.5 from 100 to 10,000, although it is of > course O(n). No magic there... :-) Indeed. NOTE! I'm not saying that this is necessarily bad. It may well be that the setup time means that the Solaris code actually _does_ perform really well for the 10,000 entry case. I hope nobody took my rant against "scalability" to mean that I consider the Solaris case to necessarily be bad engineering. My rant was more about people often thinking that "scalability == performance", which it isn't. It may be that Solaris simply wants to do 10,000 entries really fast and that they actually do really well, but it is clear that if so they have a huge performance hit for the small cases, and people should realize that. It's basically a matter of making the proper trade-offs. I think that the "few file descriptors" case is actually arguably a very important one. Sun (who has long since given up on the desktop) probably have a different tradeoff, and maybe their big constant setup-time is worth it. Linus
Re: Linux's implementation of poll() not scalable?
On 23 Oct 2000, Linus Torvalds wrote: > Either > > (a) Solaris has solved the faster-than-light problem, and Sun engineers > should get a Nobel prize in physics or something. > > (b) Solaris "scales" by being optimized for 10,000 entries, and not > speeding up sufficiently for a small number of entries. > > You make the call. You will probably get the 6.5 factor because you have some (big) constant setup time. For example

t = 1700 + n

This gives you an increase of 6.5 from 100 to 10,000 (since (1700 + 10000) / (1700 + 100) = 11700 / 1800 = 6.5), although it is of course O(n). No magic there... :-) /Tobias
Re: Linux's implementation of poll() not scalable?
Hi, If you have similar machines (in terms of machine configuration) for both your Solaris and Linux tests... could you tell us what the difference in total time for 100 and 10,000 was? I.e., don't compare Solaris with 100 descriptors vs. Solaris with 10,000 descriptors, but rather Linux 100 descriptors vs. Solaris 100 descriptors AND Linux 10,000 descriptors vs. Solaris 10,000 descriptors. That would be useful information... I think. Thanks Lyle
RE: Linux's implementation of poll() not scalable?
> Under Solaris 7, when the number of idle sockets was increased from > 100 to 10,000, the time to check for active sockets with poll() > increased by a factor of only 6.5. That's a sublinear increase in time, > pretty spiffy. Under Solaris 7, when the number of idle sockets was decreased from 10,000 to 100, the time to check for active sockets with poll() decreased by a factor of only 6.5. Shouldn't it be more like 100? Sounds like Solaris' poll implementation is horribly broken for the most common case. DS
Re: Linux's implementation of poll() not scalable?
[ Small treatise on "scalability" included. People obviously do not understand what "scalability" really means. ] In article <[EMAIL PROTECTED]>, Dan Kegel <[EMAIL PROTECTED]> wrote: >I ran a benchmark to see how long a call to poll() takes >as you increase the number of idle fd's it has to wade through. >I used socketpair() to generate the fd's. > >Under Solaris 7, when the number of idle sockets was increased from >100 to 10,000, the time to check for active sockets with poll() >increased by a factor of only 6.5. That's a sublinear increase in time, >pretty spiffy. Yeah. It's pretty spiffy. Basically, poll() is _fundamentally_ an O(n) interface. There is no way to avoid it - you have an array, and there simply is _no_ known algorithm to scan an array in faster than O(n) time. Sorry. (Yeah, you could parallelize it. I know, I know. Put one CPU on each entry, and you can get it down to O(1). Somehow I doubt Solaris does that. In fact, I'll bet you a dollar that it doesn't). So what does this mean? Either (a) Solaris has solved the faster-than-light problem, and Sun engineers should get a Nobel prize in physics or something. (b) Solaris "scales" by being optimized for 10,000 entries, and not speeding up sufficiently for a small number of entries. You make the call. Basically, for poll(), perfect scalability is that poll() scales by a factor of 100 when you go from 100 to 10,000 entries. Anybody who does NOT scale by a factor of 100 is not scaling right - and claiming that 6.5 is a "good" scale factor only shows that you've bought into marketing hype. In short, a 6.5 scale factor STINKS. The only thing it means is that Solaris is slow as hell on the 100 descriptor case. >Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets >was increased from 100 to 10,000, the time to check for active sockets >with poll() increased by a factor of 493 [or 300, respectively]. So, what you're showing is that Linux actually is _closer_ to the perfect scaling (Linux is off by a factor of 5, while Solaris is off by a factor of 15 from the perfect scaling line, and scales down really badly). Now, that factor of 5 (or 3, for 2.4.0) is still bad. I'd love to see Linux scale perfectly (which in this case means that 10,000 fd's should take exactly 100 times as long to poll() as 100 entries take). But I suspect that there are a few things going on, one of the main ones probably being that the kernel data working set for 100 entries fits in the cache or something like that. >Please, somebody point out my mistake. Linux can't be this bad! I suspect we could improve Linux in this area, but I hope that I pointed out the most fundamental mistake you did, which was thinking that "scalability" equals "speed". It doesn't. Scalability really means that the effort to handle a problem grows reasonably with the hardness of the problem. And _deviations_ from that are indications of something being wrong. Some people think that super-linear improvements in scalability are signs of "goodness". They aren't. For example, the classical reason for super-linear SMP improvement (with number of CPU's) that people get so excited about really means that something is wrong on the low end. Often the "wrongness" is lack of cache - some problems will scale better than perfectly simply because with multiple CPU's you have more cache. The "wrongness" is often also selecting the wrong algorithm: something that "scales well" by just being horribly slow for the small case, and being "less bad" for the big cases. In the end, the notion of "scalability" is meaningless. The only meaningful thing is how quickly something happens for the load you have. That's something called "performance", and unlike "scalability", it actually has real-life meaning. Under Linux, I'm personally more worried about the performance of X etc, and small poll()'s are actually common. So I would argue that the Solaris scalability is going the wrong way. But as performance really depends on the load, and maybe that 10,000 entry load is what you consider "real life", you are of course free to disagree (and you'd be equally right ;) Linus
Linux's implementation of poll() not scalable?
I ran a benchmark to see how long a call to poll() takes as you increase the number of idle fd's it has to wade through. I used socketpair() to generate the fd's. Under Solaris 7, when the number of idle sockets was increased from 100 to 10,000, the time to check for active sockets with poll() increased by a factor of only 6.5. That's a sublinear increase in time, pretty spiffy. Under Linux 2.2.14 [or 2.4.0-test1-pre4], when the number of idle sockets was increased from 100 to 10,000, the time to check for active sockets with poll() increased by a factor of 493 [or 300, respectively]. This is terribly, horribly bad scaling behavior, since the time required appears to increase three times faster than the number of sockets. Detailed measurements and the source code used to make them are available at http://www.kegel.com/dkftpbench/Poller_bench.html Please, somebody point out my mistake. Linux can't be this bad! - Dan
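[A minimal reconstruction of the kind of harness the post describes; the real one is Poller_bench at the URL above. This version creates N idle socketpairs, makes exactly one fd active, and times a single poll(); it assumes the fd limit (ulimit -n) has been raised well above 2N:]

    #include <sys/socket.h>
    #include <sys/time.h>
    #include <poll.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 10000;
        struct pollfd *pfds = malloc(n * sizeof(*pfds));
        int *peers = malloc(n * sizeof(*peers));
        struct timeval t0, t1;
        long us;
        int i, nready, sv[2];

        for (i = 0; i < n; i++) {
            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
                perror("socketpair");   /* probably the fd limit */
                return 1;
            }
            pfds[i].fd = sv[0];
            pfds[i].events = POLLIN;
            peers[i] = sv[1];           /* keep the peer open so sv[0] stays idle */
        }
        write(peers[n - 1], "x", 1);    /* exactly one active descriptor */

        gettimeofday(&t0, NULL);
        nready = poll(pfds, n, 0);
        gettimeofday(&t1, NULL);

        us = (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
        printf("poll() over %d fds (%d ready): %ld us\n", n, nready, us);
        return 0;
    }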