Brian Utterback writes:
 > In the interests of open development, I wanted to get the opinions
 > of the OpenSolaris developers on this mailing list.
 > 
 > In Solaris 10, Sun introduced the concept of "fused" TCP connections.
 > The idea is that most of the TCP algorithms are designed to deal with
 > unreliable network wires. However, there is no need for all of that
 > baggage when both ends of the connection are on the same machine,
 > since there are no unreliable wires between them. There is no reason
 > to limit the packet flow because of Nagle, silly window syndrome
 > avoidance, or anything else; just put the data directly into the
 > receive buffer and have done with it.
 > 
 > This was a great idea; however, a slight modification to the standard
 > STREAMS flow control was added to the fused connections. This
 > modification placed a restriction on the number of unread data blocks
 > on the queue. In the context of TCP and the kernel, a data block
 > amounts to the data written in a single write syscall, and the queue
 > is the receive buffer. What this means in practical terms is that the
 > producer process can make only 7 write calls without the consumer
 > doing a read. The 8th write will block until the read happens.
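 > 
 > To make that concrete, here is a minimal user-level sketch of the
 > behavior (illustrative only; it assumes fd is a connected TCP socket
 > whose peer is on the same machine, so the connection is fused):
 > 
 >     /* Writes 1 through 7 return immediately even though the
 >        receive buffer has ample space; the 8th blocks until the
 >        consumer reads.  Fragment, not a complete program. */
 >     #include <stdio.h>
 >     #include <unistd.h>
 > 
 >     void produce(int fd)
 >     {
 >         char byte = 'x';
 >         for (int i = 1; i <= 8; i++) {
 >             printf("write %d\n", i);
 >             (void) write(fd, &byte, 1);   /* blocks when i == 8 */
 >         }
 >     }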
 > 
 > This is done to balance process scheduling and to prevent the producer
 > from starving the consumer of the cycles it needs to read the data.
 > The number was determined experimentally, by tuning to get good
 > results on an important benchmark.
 > 
 > I am distrustful of the reasoning, and very distrustful of the results.
 > You can see how it might improve performance by reducing the latency.
 > If your benchmark has a producer and a consumer, you want the consumer
 > to start consuming as soon as possible; otherwise the startup cost
 > gets high. Also, if the producer produces a large batch of data before
 > the consumer consumes it, you have to allocate more data buffers than
 > might otherwise be necessary. But I am not convinced that it should be
 > up to TCP/IP to enforce that. It seems like it should be the job of
 > the scheduler, or the application itself. And tuning to a particular
 > benchmark strikes me as particularly troublesome.
 > 
 > Furthermore, it introduces a deadlock situation that did not exist
 > before. Applications that have some knowledge of the size of the
 > records that they deal with often use MSG_PEEK or FIONREAD to query
 > the available data and wait until a full record arrives before reading
 > the data.  If the producer writes the record in more than 8 chunks,
 > then the producer will block waiting for the consumer, while the
 > consumer will never read because it is still waiting for the rest of
 > the record to arrive.
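 > 
 > The consumer side of that deadlock might look like the following
 > (RECLEN is a hypothetical record size, and the busy-wait poll is for
 > brevity):
 > 
 >     /* Wait for a full record before reading.  If the producer
 >        writes the record in more than 8 chunks, it blocks on its
 >        8th write and avail never reaches RECLEN: deadlock. */
 >     #include <sys/filio.h>    /* FIONREAD on Solaris */
 >     #include <sys/ioctl.h>
 >     #include <unistd.h>
 > 
 >     #define RECLEN 1024       /* hypothetical record size */
 > 
 >     void consume(int fd, char *buf)
 >     {
 >         int avail = 0;
 >         while (avail < RECLEN)
 >             ioctl(fd, FIONREAD, &avail);  /* unread byte count */
 >         (void) read(fd, buf, RECLEN);     /* read the whole record */
 >     }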
 > 
 > Now this same deadlock was always a possibility with ordinary flow
 > control, but as long as the record size was considerably smaller than
 > the receive buffer size, the application never had to worry about it.
 > With this type of blocking, the receive buffer can effectively be as
 > small as 8 bytes (when each write is a single byte), making the
 > deadlock a very real possibility.
 > 
 > So, I am open to discussion on this. Is this a reasonable approach to
 > context switching between a producer and consumer, or should the
 > scheduler do this better? Perhaps instead of blocking, the process
 > should just lose the rest of its time slice? (I don't know whether
 > that is feasible.) Any thoughts on the subject?
 > 
 > blu
 > 
 > "The genius of you Americans is that you never make clear-cut stupid
 >   moves, only complicated stupid moves which make us wonder at the
 >   possibility that there may be something to them which we are missing."
 >   - Gamal Abdel Nasser
 > ----------------------------------------------------------------------
 > Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
 > Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
 > _______________________________________________
 > networking-discuss mailing list
 > [email protected]


What you say appears quite reasonable.

I don't understand why we should block before having buffered the sum
of a socket receive buffer and a socket transmit buffer. On a
single-CPU system I can imagine having a provision to have the
transmitter yield() to a runnable receiver based on some threshold
such as N chunks or M bytes.

____________________________________________________________________________________
        Performance, Availability & Architecture Engineering  

Roch Bourbonnais                        Sun Microsystems, ICNC-Grenoble
Senior Performance Analyst              180, Avenue De L'Europe, 38330, 
                                        Montbonnot Saint Martin, France
http://icncweb.france/~rbourbon         http://blogs.sun.com/roch
[EMAIL PROTECTED]               (+33).4.76.18.83.20


_______________________________________________
networking-discuss mailing list
[email protected]
