Roch - PAE wrote:
Brian Utterback writes:
> In the interests of open development, I wanted to get the opinions
> of the OpenSolaris developers on this mailing list.
>
> In Solaris 10, Sun introduced the concept of "fused" TCP connections.
> The idea is that most of the TCP algorithms are designed to deal with
> unreliable network wires. However, there is no need for all of that
> baggage when both ends of the connection are on the same machine,
> since there are no unreliable wires between them. There is no reason
> to limit the packet flow because of Nagle, or silly window syndrome,
> or anything else; just put the data directly into the receive buffer
> and have done with it.
>
> This was a great idea; however, there was a slight modification to
> the standard STREAMS flow control added for fused connections. This
> modification placed a restriction on the number of unread data blocks
> on the queue. In the context of TCP and the kernel, a data block
> amounts to the data written in a single write syscall, and the queue
> is the receive buffer. What this means in practical terms is that the
> producer process can only do 7 write calls without the consumer doing
> a read. The 8th write will block until the consumer reads.
>
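> A quick way to see what I mean is something like this untested
> sketch (the 20 writes and the one-byte chunks are arbitrary, and
> O_NONBLOCK is only there so a stalled write shows up as an error
> instead of a hang):
>
> #include <sys/socket.h>
> #include <netinet/in.h>
> #include <arpa/inet.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> int
> main(void)
> {
> 	int lsn, snd, rcv, i;
> 	struct sockaddr_in sin;
> 	socklen_t len = sizeof (sin);
> 	char byte = 'x';
>
> 	/* Loopback TCP connection, so the fused code path applies. */
> 	lsn = socket(AF_INET, SOCK_STREAM, 0);
> 	memset(&sin, 0, sizeof (sin));
> 	sin.sin_family = AF_INET;
> 	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> 	sin.sin_port = 0;	/* any free port */
> 	bind(lsn, (struct sockaddr *)&sin, sizeof (sin));
> 	listen(lsn, 1);
> 	getsockname(lsn, (struct sockaddr *)&sin, &len);
> 	snd = socket(AF_INET, SOCK_STREAM, 0);
> 	connect(snd, (struct sockaddr *)&sin, sizeof (sin));
> 	rcv = accept(lsn, NULL, NULL);	/* never read from rcv */
>
> 	/* Non-blocking, so a stalled write returns instead of hanging. */
> 	fcntl(snd, F_SETFL, fcntl(snd, F_GETFL, 0) | O_NONBLOCK);
>
> 	for (i = 1; i <= 20; i++) {
> 		if (write(snd, &byte, 1) < 0) {
> 			printf("write %d failed, errno %d\n", i, errno);
> 			break;
> 		}
> 		printf("write %d ok\n", i);
> 	}
> 	return (0);
> }
>
> If the restriction works as described, this should report a failure
> on the 8th write even though the receive buffer is essentially empty.
>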
> This is done to balance the process scheduling and prevent the producer
> from starving the consumer for cycles to read the data. The number was
> determined experimentally by tuning to get good results on an important
> benchmark.
>
> I am distrustful of the reasoning, and very distrustful of the results.
> You can see how it might improve performance by reducing the latency.
> If your benchmark has a producer and a consumer, you want the consumer
> to start consuming as soon as possible, otherwise the startup cost gets
> high. Also, by having a producer produce a bunch of data and then having
> the consumer consume it, you have to allocate more data buffers than
> might otherwise be necessary. But I am not convinced that it should be
> up to TCP/IP to enforce that. It seems like it should be the job of
> the scheduler, or the application itself. And tuning to a particular
> benchmark strikes me as particularly troublesome.
>
> Furthermore, it introduces a deadlock situation that did not exist
> before. Applications that have some knowledge of the size of the
> records that they deal with often use MSG_PEEK or FIONREAD to query
> the available data and wait until a full record arrives before reading
> the data. If the data is written in 8 or more chunks by the
> producer, then the producer will block waiting for the consumer, while
> the consumer never reads because it is still waiting for the rest of
> the record to arrive.
>
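> For instance (again untested; the 16-byte record and the one-byte
> writes are arbitrary, and a real consumer would poll or sleep rather
> than spin on FIONREAD):
>
> #include <sys/socket.h>
> #include <sys/ioctl.h>
> #include <sys/filio.h>	/* FIONREAD */
> #include <netinet/in.h>
> #include <arpa/inet.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
>
> #define	RECSIZE	16	/* any record needing 8 or more writes will do */
>
> int
> main(void)
> {
> 	int lsn, snd, rcv, i, avail = 0;
> 	struct sockaddr_in sin;
> 	socklen_t len = sizeof (sin);
> 	char c = 'x', rec[RECSIZE];
>
> 	/* Same loopback setup as in the previous sketch. */
> 	lsn = socket(AF_INET, SOCK_STREAM, 0);
> 	memset(&sin, 0, sizeof (sin));
> 	sin.sin_family = AF_INET;
> 	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
> 	bind(lsn, (struct sockaddr *)&sin, sizeof (sin));
> 	listen(lsn, 1);
> 	getsockname(lsn, (struct sockaddr *)&sin, &len);
> 	snd = socket(AF_INET, SOCK_STREAM, 0);
> 	connect(snd, (struct sockaddr *)&sin, sizeof (sin));
> 	rcv = accept(lsn, NULL, NULL);
>
> 	if (fork() == 0) {
> 		/* Producer: writes the record one byte at a time and
> 		 * should block in write() on the 8th byte. */
> 		for (i = 0; i < RECSIZE; i++)
> 			write(snd, &c, 1);
> 		_exit(0);
> 	}
>
> 	/* Consumer: wait for a whole record before reading it. */
> 	while (avail < RECSIZE)
> 		ioctl(rcv, FIONREAD, &avail);
> 	read(rcv, rec, RECSIZE);
> 	printf("got a full record\n");	/* never reached if the claim holds */
> 	return (0);
> }
>
> The producer stalls with only 7 bytes queued, FIONREAD never reports
> a full record, and neither side makes progress.
>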
> Now this same deadlock was always a possibility with the flow control,
> but as long as the record size was considerably smaller than the receive
> buffer size, the application never had to worry about it. With this type
> of blocking, the receive buffer can effectively be as small as 8 bytes,
> making the
> deadlock a very real possibility.
>
> So, I am open to discussion on this. Is this a reasonable approach to
> context switching between a producer and consumer, or should the
> scheduler do this better? Perhaps instead of blocking, the process
> should just lose the rest of its time slice? (I don't know if that
> is feasible) Any thoughts on the subject?
...
What you say appears quite reasonable.
I don't understand why we should block before having
buffered data equal to the sum of the socket receive buffer
and the socket transmit buffer. On a single-CPU system I can
imagine having a provision for the transmitter to yield() to
a runnable receiver based on some threshold such as N chunks
or M bytes.
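Something like this, purely as a sketch of the policy (the names and
the constants N and M are invented here; none of it is actual
tcp_fuse code):

#include <stdio.h>

#define	N_CHUNKS	8	/* yield once this many chunks are unread... */
#define	M_BYTES		16384	/* ...or this many bytes */

typedef enum { TX_CONTINUE, TX_YIELD, TX_BLOCK } tx_action_t;

/* Decision the transmit side would make on each fused write. */
static tx_action_t
fused_tx_policy(size_t unread_bytes, int unread_chunks,
    size_t rcvbuf, size_t sndbuf, int rx_runnable)
{
	if (unread_bytes >= rcvbuf + sndbuf)
		return (TX_BLOCK);	/* genuine flow control */
	if (rx_runnable &&
	    (unread_chunks >= N_CHUNKS || unread_bytes >= M_BYTES))
		return (TX_YIELD);	/* hand the CPU to the reader */
	return (TX_CONTINUE);
}

int
main(void)
{
	/* 9 small chunks unread, 48K buffers each way: yield, don't block. */
	printf("action = %d\n", fused_tx_policy(9, 9, 49152, 49152, 1));
	return (0);
}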
I'd rather see the yield() happen every N chunks (or even every
X ticks or some such) rather than on every write once a threshold
is reached. Otherwise the degenerate case of writing one byte at
a time becomes very expensive once you pass the threshold.
E.g. if the reader is waiting for 1k and the threshold is 8 (as
it is now), that's 1016 yield()s after the threshold is crossed.
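To put numbers on the difference (a toy count, nothing to do with
the real code):

#include <stdio.h>

int
main(void)
{
	int bytes = 1024, thresh = 8, w;
	int past_thresh = 0, every_n = 0;

	for (w = 1; w <= bytes; w++) {
		if (w > thresh)
			past_thresh++;	/* yield on every write past the threshold */
		if (w % thresh == 0)
			every_n++;	/* yield once per 8 writes */
	}
	printf("past-threshold policy: %d yields\n", past_thresh);	/* 1016 */
	printf("every-N-chunks policy: %d yields\n", every_n);		/* 128 */
	return (0);
}

That's 128 yield()s for the same 1k instead of 1016, and the reader
still gets the CPU regularly.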
Darren
_______________________________________________
networking-discuss mailing list
[email protected]