In the interests of open development, I wanted to get the opinions
of the OpenSolaris developers on this mailing list.

In Solaris 10, Sun introduced the concept of "fused" TCP connections.
The idea is that most of the TCP algorithms are designed to deal with
unreliable network wires. However, there is no need for all of that
baggage when both ends of the connection are on the same machine,
since there are no unreliable wires between them. There is no reason
to limit the packet flow because of Nagle, silly window syndrome,
or anything else; just put the data directly into the receive buffer
and be done with it.

This was a great idea. However, a slight modification to the standard
STREAMS flow control was added to the fused connections. This
modification placed a restriction on the number of unread data blocks
on the queue. In the context of TCP and the kernel, a data block
amounts to the data written in a single write syscall, and the queue
is the receive buffer. What this means in practical terms is that the
producer process can only do 7 write calls without the consumer doing
a read; the 8th write will block until the consumer reads.
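
To make that behavior concrete, here is a rough sketch of how you can
see it for yourself (untested, and the file name, compile line, and
loop count are just mine, not anything taken from the kernel code):
open a loopback TCP connection, never read on the accepting side, and
write one byte at a time. On a fused connection the write that trips
the unread-block limit should simply hang, long before the receive
buffer is anywhere near full.

/*
 * Minimal sketch: loopback TCP connection, no reader on the accepting
 * side.  Each one-byte write() is a separate data block on the peer's
 * receive queue.  With the fused-connection limit described above, the
 * program is expected to hang at the write that trips the limit.
 * Compile on Solaris with: cc fusetest.c -o fusetest -lsocket -lnsl
 * (Error checking omitted to keep the sketch short.)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int
main(void)
{
    int lsd, csd, asd, i;
    struct sockaddr_in sin;
    socklen_t len = sizeof (sin);

    /* Listener on the loopback address; both ends are on the same
     * machine, so the connection is a candidate for fusion. */
    lsd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lsd, (struct sockaddr *)&sin, sizeof (sin));
    listen(lsd, 1);
    getsockname(lsd, (struct sockaddr *)&sin, &len);

    csd = socket(AF_INET, SOCK_STREAM, 0);
    connect(csd, (struct sockaddr *)&sin, sizeof (sin));
    asd = accept(lsd, NULL, NULL);

    /* Nobody ever reads asd.  Watch which write number is the last
     * one printed before the program stops making progress. */
    for (i = 1; i <= 32; i++) {
        printf("write %d\n", i);
        fflush(stdout);
        if (write(csd, "x", 1) != 1) {
            perror("write");
            break;
        }
    }
    printf("all writes completed\n");
    return (0);
}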

The restriction is there to balance process scheduling and keep the
producer from starving the consumer of the CPU cycles it needs to read
the data. The number was determined experimentally, by tuning until
the results looked good on an important benchmark.

I am distrustful of the reasoning, and very distrustful of the results.
You can see how it might improve performance by reducing latency:
if your benchmark has a producer and a consumer, you want the consumer
to start consuming as soon as possible; otherwise the startup cost gets
high. Also, if the producer produces a bunch of data before the
consumer consumes any of it, you have to allocate more data buffers
than might otherwise be necessary. But I am not convinced that it
should be up to TCP/IP to enforce that. It seems like it should be the
job of the scheduler, or of the application itself. And tuning to a
particular benchmark strikes me as particularly troublesome.

Furthermore, it introduces a deadlock situation that did not exist
before. Applications that know the size of the records they deal with
often use MSG_PEEK or FIONREAD to query the amount of available data
and wait until a full record has arrived before reading it. If the
producer writes the record in 8 or more chunks, the producer will
block waiting for the consumer, and the consumer will never read,
because it is still waiting for the rest of the record to arrive.
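
Here is an equally rough sketch of the pattern I mean (again untested;
the record size and the spin loop are only for illustration). The
consumer polls FIONREAD until a whole record is queued, and the
producer writes the record a byte at a time. If the producer blocks
after a handful of unread writes, FIONREAD never reports a full record
and neither side ever makes progress.

/*
 * Minimal sketch of the record-at-a-time consumer pattern.  The
 * producer (child) writes one RECSZ-byte record in RECSZ one-byte
 * writes; the consumer (parent) waits via FIONREAD until the whole
 * record is available before reading it.  With the unread-block
 * limit, the producer blocks partway through the record and the
 * FIONREAD loop never sees RECSZ bytes: deadlock.
 * Compile on Solaris with: cc peektest.c -o peektest -lsocket -lnsl
 * (Error checking omitted to keep the sketch short.)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <stropts.h>        /* ioctl() */
#include <sys/filio.h>      /* FIONREAD */
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define RECSZ   64          /* record size, far smaller than the receive buffer */

int
main(void)
{
    int lsd, csd, asd, i, navail;
    struct sockaddr_in sin;
    socklen_t len = sizeof (sin);
    char rec[RECSZ];

    /* Same loopback setup as before. */
    lsd = socket(AF_INET, SOCK_STREAM, 0);
    memset(&sin, 0, sizeof (sin));
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    bind(lsd, (struct sockaddr *)&sin, sizeof (sin));
    listen(lsd, 1);
    getsockname(lsd, (struct sockaddr *)&sin, &len);
    csd = socket(AF_INET, SOCK_STREAM, 0);
    connect(csd, (struct sockaddr *)&sin, sizeof (sin));
    asd = accept(lsd, NULL, NULL);

    if (fork() == 0) {
        /* Producer: one record, one byte per write().  Blocks when
         * the unread-block limit is hit, well short of RECSZ writes. */
        for (i = 0; i < RECSZ; i++)
            write(csd, "x", 1);
        _exit(0);
    }

    /*
     * Consumer: don't read until the whole record has arrived.  A
     * real application would wait more politely than this spin loop,
     * but the outcome is the same: navail never reaches RECSZ.
     */
    do {
        navail = 0;
        ioctl(asd, FIONREAD, &navail);
    } while (navail < RECSZ);

    read(asd, rec, RECSZ);
    printf("got a full record\n");
    return (0);
}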

Now, this same deadlock was always a possibility with ordinary flow
control, but as long as the record size was considerably smaller than
the receive buffer size, the application never had to worry about it.
With this type of blocking, the effective receive buffer can be as
small as 7 bytes (seven unread one-byte writes), making the deadlock
a very real possibility.

So, I am open to discussion on this. Is this a reasonable approach to
context switching between a producer and a consumer, or should the
scheduler be handling it better? Perhaps instead of blocking, the
producer should just lose the rest of its time slice? (I don't know
if that is feasible.) Any thoughts on the subject?

blu

"The genius of you Americans is that you never make clear-cut stupid
 moves, only complicated stupid moves which make us wonder at the
 possibility that there may be something to them which we are missing."
 - Gamal Abdel Nasser
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom