On 31/03/2009, at 1:12 AM, Erick Tryzelaar wrote:

> On Mon, Mar 30, 2009 at 12:49 AM, john skaller
> <skal...@users.sourceforge.net> wrote:
>> Yes. It never works properly, and is impossible to maintain on
>> multiple platforms.
>>
>> Also, Async TCP/IP is stuffed as we found out... someone should write
>> a paper on that and send it to some conference, it is a SERIOUS issue
>> costing billions of dollars and compromising world network security.
>
> Which bugs are you referring to? Maybe the erlang folks have some
> ideas on working with it. Ulf Wiger, you still subscribed?


No no, this is a *design fault* in the TCP protocol C interface (sockets).

Roughly speaking, SO_LINGER does not work with asynchronous sockets.

Under Linux this means that when you close a socket, the unsent
buffers are simply discarded. This means you CANNOT close the socket
(at least without setting a user space timer ***), because when you
do, any transmissions you've made asynchronously might be lost.

The only reliable way to close a socket is with a synchronous close.
To make that asynchronous, you have to launch a pthread for every
socket, imposing a massive overhead and pthread-related limits
on your program, and defeating the purpose of asynchronous I/O.
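
Something like this (a sketch only, not the Felix code; the helper
names are made up and error handling is minimal):

/* Sketch of the "one pthread per close" workaround described above. */
#include <fcntl.h>
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

static void *lingering_close(void *arg)
{
    int fd = (int)(long)arg;

    /* put the socket back into blocking mode so close() can linger */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags != -1)
        fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);

    /* let close() block for up to 10 seconds while unsent data drains */
    struct linger lin = { .l_onoff = 1, .l_linger = 10 };
    setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof lin);

    close(fd);
    return NULL;
}

/* called from the async event loop instead of a plain close(fd) */
static int async_close(int fd)
{
    pthread_t tid;
    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&tid, &attr, lingering_close, (void *)(long)fd);
    pthread_attr_destroy(&attr);
    return rc;
}

That is one OS thread per dying socket, which is exactly the overhead
async I/O was supposed to avoid.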

This bug manifested immediately in tools/webserver, so I have no
idea how millions of people are doing high performance web stuff
on POSIX. (Windows probably doesn't have this issue, a good reason
to switch to Windows for networking .. arrrgggghhh :)

The guts of the problem is this: a webserver MUST use a finite
number of threads to manage an unbounded number of sockets
(any bounds are applied by connect failures or user counters).
Unbounded threads aren't tenable because the OS can't schedule
them fast enough (and threads are resource hungry).

Given the above assumptions -- that we have to use async I/O --
Felix uses just two threads: one uses readiness notifications (like
epoll) to perform synchronous I/O on behalf of the client thread,
which uses Felix f-threads for servicing, since these schedule in O(1).
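
Roughly, the notification side is the usual epoll readiness loop; a
generic sketch (not the actual Felix demuxer code) looks like this:

/* Generic single-threaded epoll readiness loop, not the Felix source. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

void run_event_loop(int listen_fd)
{
    int epfd = epoll_create1(0);
    if (epfd == -1) { perror("epoll_create1"); exit(1); }

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                /* accept(), set O_NONBLOCK, register the new socket */
            } else {
                /* socket is ready: do the read or write on behalf of
                   the f-thread waiting on this descriptor */
            }
        }
    }
}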

All this works just fine. The problem is that we want to avoid
Denial of Service (DoS) attacks by a rogue client, but on
the other hand clients can make requests of unbounded
length. So we read until End of Message or until a maximum
number of bytes has been read (in the latter case we can close
the socket, assuming a rogue client).
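
A sketch of that bounded read (MAX_REQUEST and the end-of-message
test are placeholders, not real Felix names):

/* Read until end-of-message or a hard byte limit; the limit guards
   against a rogue client sending an unbounded request. */
#define _GNU_SOURCE         /* for memmem */
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define MAX_REQUEST 8192    /* illustrative cap on request size */

/* returns 1 when a complete request is buffered, 0 to keep reading,
   -1 to drop the connection (limit exceeded, peer gone, hard error) */
int read_bounded(int fd, char *buf, size_t *len)
{
    if (*len >= MAX_REQUEST)
        return -1;                      /* assume a rogue client: close */

    ssize_t n = read(fd, buf + *len, MAX_REQUEST - *len);
    if (n > 0) {
        *len += (size_t)n;
        /* end of message, e.g. the blank line ending an HTTP header */
        if (memmem(buf, *len, "\r\n\r\n", 4) != NULL)
            return 1;
        return 0;
    }
    if (n == 0)
        return -1;                      /* peer closed the connection */
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 0 : -1;
}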

All fine. Now also, clients can read/write data very slowly.
In fact, they can read/write a bit and then just hang, and this
blocks the socket -- another DoS attack possibility.
So to write/read reliably, we have to set a timer that triggers
even if the socket doesn't report itself ready (or use the
notification service's timer and/or any OS level facility). Then if
an I/O op fails (reads/writes 0 bytes) we can also close the socket.
Otherwise we have progress at some minimal rate.
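
In code terms that just means waiting with a timeout and treating "no
progress inside the window" as grounds for closing. A per-socket
sketch using poll(2) (in the real server the timeout would sit on the
central epoll_wait or an OS timer, but the idea is the same):

/* Wait for readiness, but give up if the peer makes no progress
   within a fixed window.  Single-socket sketch only. */
#include <poll.h>
#include <unistd.h>

#define IO_TIMEOUT_MS 30000    /* illustrative progress window */

/* returns bytes read, 0 if the peer stalled or closed, -1 on error */
ssize_t read_with_progress(int fd, char *buf, size_t cap)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int n = poll(&p, 1, IO_TIMEOUT_MS);
    if (n == 0)
        return 0;          /* no readiness within the window: stalled */
    if (n < 0)
        return -1;
    return read(fd, buf, cap);  /* 0 bytes here also means give up */
}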

So by the above algorithm we can read and write
everything from and to the client browser, and if the
browser tries to flood write to us, or tries to starve us
on either read or write, we can detect it with a timer
and a progress failure.

Note there is NO OTHER way to do this with sockets.

The problem is that there is no way to delay the actual close of
a socket in async mode. For synchronous sockets, the TCP
gurus decided reliability wasn't possible without SO_LINGER.
This causes a close on a socket to hang for a while to give it
a chance to finish writing before closing the underlying socket:
without lingering EVERY write followed by a close would fail!
The reason is that the C interface is synchronous, but the underlying
transport is not. So you basically have to say "if the writes don't
go thru in X seconds we're not going to waste resources any longer,
kill the connection".
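
For a blocking socket you can say exactly that with a standard
setsockopt call; a sketch:

/* The standard synchronous trade-off: with SO_LINGER set on a *blocking*
   socket, close() waits up to l_linger seconds for unsent data to go
   out, then gives up and aborts the connection. */
#include <sys/socket.h>
#include <unistd.h>

int close_with_linger(int fd, int seconds)
{
    struct linger lin = { .l_onoff = 1, .l_linger = seconds };
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lin, sizeof lin) == -1)
        return -1;
    return close(fd);   /* blocks until data is sent or the timer expires */
}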

The gurus stuffed up. SO_LINGER must work on asynchronous
sockets too. Although an async socket cannot return an error code
on closing, and the close function SHOULD return immediately,
the OS should NOT be allowed to simply discard the buffers and
close the socket (Linux DOES). It should wait SO_LINGER time
before doing that, otherwise there is NO possibility of a previous
write succeeding.

And that's what happens. Clients of the Felix webserver
lose the end of the page being downloaded a good fraction
of the time, almost completely reliably when the page is long
(and the connection is to another computer).

*** setting a user space timer on close is NOT ACCEPTABLE because
it leads to a DoS attack based on opening too many sockets: a socket
that would ordinarily be closed in milliseconds may be held onto for
seconds, starving the system of free sockets.
The timer MUST be implemented in the OS (TCP library) so that
the socket can be closed when the data is transferred OR the
time limit is up, whichever comes first. It is NOT possible for
the client to test if the data has been transmitted.

Hence THIS IS A BUG IN THE POSIX SOCKET INTERFACE.

HIGH PERFORMANCE (I.E. ASYNCHRONOUS) SOCKET I/O
CANNOT BE MADE RELIABLE.

I hope I'm wrong.. but I doubt it. At least Linux should be fixed,
at the moment it is screwed.

Note: a summary of the problem shows people just didn't think.
Async I/O clearly implies async close. It's stupidity to have
buffered I/O and unbuffered close. The buffering (delay) must
be inside the OS. Consequently there's no way to know if the
close succeeded in writing all the data or not. This COULD be fixed
by a notification signal (e.g. in Linux by adding a case to the epoll
service; however it is non-trivial because the closed socket can no
longer be identified by its descriptor, which is now invalid).

Another solution would be to be able to TEST whether the underlying
transport is ready. This can be done for lower level I/O
operations (meaning lower down the ISO stack), and it can be done to
see if you can READ, but there is no way to test whether the write
buffers are empty.
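
For example, with poll(2) (a sketch):

/* poll(2) can report that a socket is readable, or that there is *room*
   to write more, but neither event means the kernel send buffer has
   drained -- which is the test you would need before a safe async close. */
#include <poll.h>
#include <stdio.h>

void report_readiness(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN | POLLOUT };
    if (poll(&p, 1, 0) > 0) {
        if (p.revents & POLLIN)
            printf("readable: incoming data is waiting\n");
        if (p.revents & POLLOUT)
            printf("writable: buffer space is free, NOT 'all data sent'\n");
    }
}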

--
john skaller
skal...@users.sourceforge.net





