Brian, I agree. IMHO, tuning defaults for a specific benchmark rarely
benefits the average user. Perhaps the default should be the receive-buffer
size divided by the block size, with an alternate preset available to revert
to the current behaviour. Even then I'd prefer the existing limit to be
tunable within a defined safe range, with that preset selected in
/etc/system. (A sketch that should reproduce the deadlock you describe is
in the PS below.)
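
As a sketch of what I mean, the preset might look like the /etc/system
fragment below. The variable name matches the unread-block counter I've
seen in the OpenSolaris tcp_fusion code, but treat the name, the ip module
prefix, and whether it is safely settable this way as assumptions on my
part, not a documented interface:

    * Sketch only: tcp_fusion_rcv_unread_min and the ip: prefix are
    * assumptions, not a documented tunable.
    *
    * Preset to keep the current behaviour (block on the 8th unread write):
    set ip:tcp_fusion_rcv_unread_min = 8
    *
    * Or, within a defined safe range, something larger:
    * set ip:tcp_fusion_rcv_unread_min = 64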
George

> Date: Wed, 03 Jan 2007 09:09:45 -0500
> From: Brian Utterback <[EMAIL PROTECTED]>
> Subject: [networking-discuss] When should fused TCP connections block.
> To: [email protected]
>
> In the interests of open development, I wanted to get the opinions
> of the OpenSolaris developers on this mailing list.
>
> In Solaris 10, Sun introduced the concept of "fused" TCP connections.
> The idea is that most of the TCP algorithms are designed to deal with
> unreliable network wires. However, there is no need for all of that
> baggage when both ends of the connection are on the same machine,
> since there are no unreliable wires between them. There is no reason
> to limit the packet flow because of Nagle, silly window syndrome, or
> anything else; just put the data directly into the receive buffer and
> have done with it.
>
> This was a great idea; however, a slight modification to the standard
> STREAMS flow control was added to the fused connections. This
> modification placed a restriction on the number of unread data blocks
> on the queue. In the context of TCP and the kernel, a data block
> amounts to the data written in a single write syscall, and the queue
> is the receive buffer. In practical terms this means that the
> producer process can only do 7 write calls without the consumer doing
> a read. The 8th write will block until the read.
>
> This is done to balance process scheduling and prevent the producer
> from starving the consumer of the cycles it needs to read the data.
> The number was determined experimentally by tuning to get good
> results on an important benchmark.
>
> I am distrustful of the reasoning, and very distrustful of the
> results. You can see how it might improve performance by reducing
> latency. If your benchmark has a producer and a consumer, you want
> the consumer to start consuming as soon as possible; otherwise the
> startup cost gets high. Also, by having the producer produce a bunch
> of data before the consumer consumes it, you have to allocate more
> data buffers than might otherwise be necessary. But I am not
> convinced that it should be up to TCP/IP to enforce that. It seems
> like it should be the job of the scheduler, or of the application
> itself. And tuning to a particular benchmark strikes me as
> particularly troublesome.
>
> Furthermore, it introduces a deadlock situation that did not exist
> before. Applications that know the size of the records they deal
> with often use MSG_PEEK or FIONREAD to query the available data and
> wait until a full record arrives before reading it. If the data is
> written in more than 8 chunks by the producer, then the producer
> will block waiting for the consumer, who will never read, waiting
> for the rest of the data to arrive.
>
> Now this same deadlock was always a possibility with flow control,
> but as long as the record size was considerably smaller than the
> receive buffer size, the application never had to worry about it.
> With this type of blocking, the receive buffer can effectively be 8
> bytes, making the deadlock a very real possibility.
>
> So, I am open to discussion on this. Is this a reasonable approach
> to context switching between a producer and a consumer, or should
> the scheduler do this better? Perhaps instead of blocking, the
> process should just lose the rest of its time slice? (I don't know
> whether that is feasible.) Any thoughts on the subject?
>
> blu
>
> "The genius of you Americans is that you never make clear-cut stupid
>  moves, only complicated stupid moves which make us wonder at the
>  possibility that there may be something to them which we are
>  missing." - Gamal Abdel Nasser
> ----------------------------------------------------------------------
> Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
> Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom

George Shepherd                          http://clem.uk/~georges/
==============================================================================
Solaris Revenue Product Engineering:     | SUN Microsystems
Core team - Internet                     | Guillemont Park
Email: [EMAIL PROTECTED]                 | Camberley GU17 9QG
Disclaimer: Less is more, more or less   | United Kingdom
==============================================================================
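
PS: Here is the sketch I mentioned, a minimal C program that should
reproduce the deadlock you describe, assuming the fused-connection
semantics above; all names and sizes in it are mine, for illustration.
The producer writes one 16-byte record a byte at a time over loopback,
while the consumer polls FIONREAD and refuses to read until the whole
record is queued. If the 8th unread write blocks as described, FIONREAD
stalls at 7 bytes and neither side ever progresses; on a kernel with only
ordinary flow control the program runs to completion, so it doubles as a
quick regression check.

    /*
     * Sketch of the fused-connection deadlock: producer writes a
     * 16-byte record in 16 one-byte write()s; consumer waits for the
     * whole record via FIONREAD before reading.
     */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/ioctl.h>
    #include <sys/filio.h>          /* FIONREAD lives here on Solaris */
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define RECSZ   16              /* record size, written in RECSZ chunks */

    int
    main(void)
    {
            int lsn, prod, cons, i, avail;
            char buf[RECSZ];
            struct sockaddr_in sin;
            socklen_t len = sizeof (sin);

            /* Loopback listener: both endpoints are local, so the
               connection is eligible for TCP fusion. */
            lsn = socket(AF_INET, SOCK_STREAM, 0);
            (void) memset(&sin, 0, sizeof (sin));
            sin.sin_family = AF_INET;
            sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
            sin.sin_port = 0;                       /* any ephemeral port */
            if (bind(lsn, (struct sockaddr *)&sin, sizeof (sin)) != 0 ||
                listen(lsn, 1) != 0) {
                    perror("bind/listen");
                    return (1);
            }
            (void) getsockname(lsn, (struct sockaddr *)&sin, &len);

            prod = socket(AF_INET, SOCK_STREAM, 0);
            if (connect(prod, (struct sockaddr *)&sin, len) != 0) {
                    perror("connect");
                    return (1);
            }
            cons = accept(lsn, NULL, NULL);

            if (fork() == 0) {
                    /* Producer: with the fused-connection limit, the
                       8th write() below blocks until the consumer
                       reads, which it never does. */
                    for (i = 0; i < RECSZ; i++) {
                            (void) write(prod, "x", 1);
                            (void) printf("producer: wrote byte %d\n", i + 1);
                    }
                    _exit(0);
            }

            /* Consumer: the FIONREAD pattern from your mail; wait for
               a complete record before reading. With the unread-block
               limit, avail stalls at 7 and this loop never exits. */
            for (;;) {
                    if (ioctl(cons, FIONREAD, &avail) != 0) {
                            perror("FIONREAD");
                            return (1);
                    }
                    (void) printf("consumer: %d of %d bytes queued\n",
                        avail, RECSZ);
                    if (avail >= RECSZ)
                            break;
                    (void) sleep(1);
            }
            (void) read(cons, buf, RECSZ);
            (void) printf("consumer: got full record\n");
            return (0);
    }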
