In the interests of open development, I wanted to get the opinions of the OpenSolaris developers on this mailing list.
In Solaris 10, Sun introduced the concept of "fused" TCP connections. The idea is that most of the TCP algorithms are designed to deal with unreliable network wires, but there is no need for all of that baggage when both ends of the connection are on the same machine, since there are no unreliable wires between them. There is no reason to limit the packet flow because of Nagle, silly window syndrome, or anything else; just put the data directly into the receive buffer and be done with it.

This was a great idea; however, a slight modification to the standard STREAMS flow control was added to fused connections. This modification places a restriction on the number of unread data blocks on the queue. In the context of TCP and the kernel, a data block amounts to the data written in a single write syscall, and the queue is the receive buffer. In practical terms, this means the producer process can do only 7 write calls without the consumer doing a read; the 8th write blocks until the consumer reads. This is done to balance process scheduling and prevent the producer from starving the consumer of cycles to read the data. The number was determined experimentally, by tuning for good results on an important benchmark.

I am distrustful of the reasoning, and very distrustful of the results. You can see how it might improve performance by reducing latency: if your benchmark has a producer and a consumer, you want the consumer to start consuming as soon as possible, otherwise the startup cost gets high. Also, if the producer produces a large batch of data before the consumer consumes any of it, more data buffers have to be allocated than might otherwise be necessary. But I am not convinced that it should be up to TCP/IP to enforce that; it seems like it should be the job of the scheduler, or of the application itself. And tuning to a particular benchmark strikes me as particularly troublesome.

Furthermore, it introduces a deadlock situation that did not exist before. Applications that have some knowledge of the size of the records they deal with often use MSG_PEEK or FIONREAD to query the available data and wait until a full record arrives before reading it. If the producer writes a record in more than 8 chunks, the producer blocks waiting for the consumer, which never reads because it is still waiting for the rest of the record to arrive (a minimal sketch of this scenario is appended at the end of this message). This same deadlock was always a possibility with ordinary flow control, but as long as the record size was considerably smaller than the receive buffer size, the application never had to worry about it. With this type of blocking, the receive buffer can effectively be as small as 8 bytes, making the deadlock a very real possibility.

So, I am open to discussion on this. Is this a reasonable approach to context switching between a producer and consumer, or should the scheduler handle it better? Perhaps instead of blocking, the process should just lose the rest of its time slice? (I don't know if that is feasible.) Any thoughts on the subject?

blu

"The genius of you Americans is that you never make clear-cut stupid
moves, only complicated stupid moves which make us wonder at the
possibility that there may be something to them which we are missing."
- Gamal Abdel Nasser

----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
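
Appended below is a minimal sketch of the deadlock scenario. Everything in it is illustrative: the 4096-byte record, the 256-byte chunks, and the loopback setup are assumptions I picked to make the example self-contained, not details from any real application. On a stack without fusion it runs to completion; on a fused loopback connection with the 8-unread-block limit, the producer should block on its 8th write while the consumer spins in the FIONREAD loop waiting for a full record that never arrives.

/*
 * deadlock_sketch.c -- illustrative only; the sizes and setup here are
 * assumptions, not taken from any real application.  The producer writes
 * one record as 16 small chunks over a loopback TCP connection; the
 * consumer polls FIONREAD until the whole record is available before
 * reading.  Without fusion this completes; with the fused-connection
 * limit of 8 unread data blocks, the producer blocks on its 8th write
 * while the consumer spins forever waiting for a full record.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stropts.h>      /* ioctl() on Solaris */
#include <sys/filio.h>    /* FIONREAD on Solaris; <sys/ioctl.h> elsewhere */

#define RECORD_SIZE 4096
#define CHUNK_SIZE  256                /* 4096/256 = 16 chunks > 8 blocks */

static void producer(int fd)           /* write one record in small chunks */
{
    char chunk[CHUNK_SIZE];
    size_t sent = 0;

    memset(chunk, 'x', sizeof (chunk));
    while (sent < RECORD_SIZE) {
        /* On a fused connection, the 8th unread write blocks here. */
        ssize_t n = write(fd, chunk, sizeof (chunk));
        if (n < 0) {
            perror("write");
            exit(1);
        }
        sent += n;
    }
}

static void consumer(int fd)           /* wait for a complete record */
{
    char record[RECORD_SIZE];
    int avail = 0;

    do {
        if (ioctl(fd, FIONREAD, &avail) < 0) {
            perror("ioctl(FIONREAD)");
            exit(1);
        }
    } while (avail < RECORD_SIZE);     /* never true once the producer blocks */

    if (read(fd, record, sizeof (record)) < 0)
        perror("read");
}

int main(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof (addr);
    int lsn, conn;

    /* Loopback TCP connection: both endpoints on the same machine. */
    lsn = socket(AF_INET, SOCK_STREAM, 0);
    memset(&addr, 0, sizeof (addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                 /* let the system pick a free port */
    if (lsn < 0 || bind(lsn, (struct sockaddr *)&addr, sizeof (addr)) < 0 ||
        listen(lsn, 1) < 0 ||
        getsockname(lsn, (struct sockaddr *)&addr, &len) < 0) {
        perror("listener setup");
        return (1);
    }

    if (fork() == 0) {                 /* child: producer */
        conn = socket(AF_INET, SOCK_STREAM, 0);
        if (connect(conn, (struct sockaddr *)&addr, sizeof (addr)) < 0) {
            perror("connect");
            return (1);
        }
        producer(conn);
    } else {                           /* parent: consumer */
        consumer(accept(lsn, NULL, NULL));
    }
    return (0);
}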
