Kacheong Poon wrote:
Brian Utterback wrote:
That doesn't make sense to me. Nagle is necessary because of the
high overhead of sending packets over the network. And it never
prevents data from flowing; it only slows the flow down.


A couple of points.  The Nagle algorithm is *not* necessary.  It
is an optimization.  The overhead in a loopback TCP connection
is the system call and context switch.


If the application writes the data with small buffers, the amount
of resources used per unit of data will be the same with or without
the block. The same number of data blocks will be used, and the same
number of context switches.


I think you misunderstood the following *IF* part in my previous
email.


---
*IF* the sender is clever and does its own buffering so
that when it is unblocked, it can send a huge chunk of
data, it will be a win.
---


Right. I agree that it is better to buffer the data and then
write it in a single big write and be done with it. I have been
ranting about that for years, telling everybody who would listen
about the evils of TCP_NODELAY. But the cost of bytes across the
network is way more than the cost of cycles. If we haven't been
able to get customers to conserve the one, I don't see how we will
ever get them to adopt the same habits for the other. Except by
using a bigger stick. "Forget performance, your app will stop working
entirely! Bawahhha!"


It will be one single big write after the unblock *IF* the app
is clever and does its own buffering.  And this is indeed what
some apps do to handle write blocking.  It is a win.  The
optimization is not about storage space.  Hope this clarifies
the dynamics.
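
Just so we are picturing the same thing, here is a rough sketch of the
kind of sender-side buffering being described: small messages accumulate
in user space so they can go out in one big write. The buffer size and
flush policy are made up for illustration (a real app would also flush
when the socket becomes writable again), not taken from any actual app
or from the fusion code:

#include <string.h>
#include <unistd.h>

#define SENDBUF_SIZE    (64 * 1024)     /* arbitrary for the example */

static char sendbuf[SENDBUF_SIZE];
static size_t buffered;                 /* bytes queued but not yet written */

/*
 * Queue a small message in user space; when the buffer fills, push
 * everything queued so far to the socket with one big write().
 */
int
buffered_send(int sock, const void *msg, size_t len)
{
        if (len > SENDBUF_SIZE - buffered) {
                ssize_t n = write(sock, sendbuf, buffered);

                if (n < 0)
                        return (-1);    /* e.g. would block; caller retries */
                memmove(sendbuf, sendbuf + n, buffered - n);
                buffered -= n;
                if (len > SENDBUF_SIZE - buffered)
                        return (-1);    /* still no room */
        }
        memcpy(sendbuf + buffered, msg, len);
        buffered += len;
        return (0);
}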


In fact, the block might actually
increase the number of syscalls used by the reader.


Could you explain your reasoning above?  It does not seem to
be true universally.  In fact, the app mentioned above wins
because the number of syscalls done by the reader is also
reduced, exactly the opposite of the above point.

If the reader does not use poll at all and instead reads in a delay
loop, each read will contain less data because of the block, thus
forcing it to make more read calls for the same amount of data.
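
To illustrate the two reader styles I mean (the buffer size and sleep
interval here are arbitrary, just for the sketch):

#include <poll.h>
#include <unistd.h>

#define RDBUF   (32 * 1024)             /* arbitrary for the example */

/*
 * Style 1: delay loop.  Read, sleep a fixed interval, repeat.  If the
 * writer was blocked early, each read() returns fewer bytes, so more
 * read() calls are needed for the same total amount of data.
 */
ssize_t
delay_loop_read(int sock, char *buf)
{
        ssize_t n = read(sock, buf, RDBUF);

        (void) usleep(10000);           /* 10 ms pause between reads */
        return (n);
}

/*
 * Style 2: poll until data is available, then drain whatever has
 * accumulated in a single read().
 */
ssize_t
poll_read(int sock, char *buf)
{
        struct pollfd pfd;

        pfd.fd = sock;
        pfd.events = POLLIN;
        if (poll(&pfd, 1, -1) < 0)
                return (-1);
        return (read(sock, buf, RDBUF));
}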



The only
thing saved is that the high water mark of simultaneous data blocks
used might be lower.


I believe this is never the intention of the current code.
And this is not what makes some apps perform better.

Right. My point is that it is a minor improvement.



Nagle tries to reduce the number of packets
used by allowing the application to write more data before sending
the packet. That doesn't work in this case.


Why?  Actual data shows that some apps indeed perform better.
Could you explain your reasoning as to why the performance data
does not correctly represent the dynamics of the code flow, and
why the improvement is not real?

Simple. As we have already noted, we don't consolidate data into
a buffer. The data, when written, is copied into a data block of the
size it was given to the syscall. The cost of processing the data
is then fixed by the code paths, which are the same no matter
how much data is in the blocks, give or take. So, yes, having a
single data block is a big win, but not as much of a win as having
a single network packet. And we can't get the customers to optimize
for *that*. Blocking does not reduce the number of data blocks
processed. It might reduce the number of context switches done by
the reader, but it might also increase them.
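
A quick way to see the per-write cost I am talking about on a loopback
connection (the sizes here are arbitrary; the point is only that the
cost scales with the number of write calls, not with the bytes in
each one):

#include <unistd.h>

#define TOTAL   (64 * 1024)             /* arbitrary total amount of data */
#define SMALL   64                      /* arbitrary small write size */

void
many_small_writes(int sock, const char *buf)
{
        int i;

        /* 1024 write(2) calls, so 1024 trips through the same code paths. */
        for (i = 0; i < TOTAL / SMALL; i++)
                (void) write(sock, buf + i * SMALL, SMALL);
}

void
one_big_write(int sock, const char *buf)
{
        /* One write(2) call for the same amount of data. */
        (void) write(sock, buf, TOTAL);
}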

Some types of benchmarks will see a performance improvement because
of a reduction in data latency. By blocking, we allow the
reader to consume the data, reducing the average time the data
waits to be read.

To take an extreme case, suppose you had two real-time processes,
one of which continually wrote timestamps to a TCP connection, and
the other read them until the connection was closed. If the first
wrote for 10 seconds before yielding, and then the second read them
one at a time at the same rate and compared them to the current time,
we would have each timestamp being read after a 10-second wait. But
if both processes did a yield between each read or write, the
time delay would be extremely small and the whole test would
be done in 10 seconds instead of 20 seconds. In terms of the actual
work done, it would be the same in both cases. In terms of efficiency,
the first might actually be more efficient because of the effects
of instruction page caching.
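
For concreteness, the "yield between each read or write" variant of that
experiment might look something like this; gettimeofday() and
sched_yield() are just stand-ins for whatever a real-time app would
actually use, and error handling is omitted:

#include <sys/time.h>
#include <sched.h>
#include <unistd.h>

void
writer(int sock, int nmsgs)
{
        struct timeval tv;
        int i;

        for (i = 0; i < nmsgs; i++) {
                (void) gettimeofday(&tv, NULL);
                (void) write(sock, &tv, sizeof (tv));
                (void) sched_yield();   /* let the reader run right away */
        }
}

void
reader(int sock, int nmsgs)
{
        struct timeval sent, now;
        int i;

        for (i = 0; i < nmsgs; i++) {
                /* Assumes each timestamp arrives whole; fine for a sketch. */
                if (read(sock, &sent, sizeof (sent)) != (ssize_t)sizeof (sent))
                        break;
                (void) gettimeofday(&now, NULL);
                /* (now - sent) is the per-timestamp delay described above. */
                (void) sched_yield();
        }
}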

So a yield after every read or write is probably not a good idea. A
block after X writes is also not a good idea.  What is best? It
still seems to me to be a function of the scheduler and not the I/O
system.



It seems backwards to me because:

1. The purpose of fused connections was to reduce the TCP baggage,
not introduce more.


Could you clarify what you implied by introducing more TCP
baggage?  I think the current code removes all TCP protocol
data processing in the loopback case, rather than introducing more.

Except we are replacing it. Granted, we removed a ton of flow
control baggage that slowed the data flow down needlessly, but
we introduced an ounce of flow control that stops it altogether.



2. As far as I know, no other IPC method imposes this kind of flow
control. Why single out TCP?


Maybe there is another misunderstanding.  TCP is not being
singled out.  There are some apps which need to talk to both
local and remote peers.  To simplify the code, the app writer
chooses to use TCP sockets to communicate with all of them.
If an app only needs to do IPC, I believe TCP is probably
not the right choice in many circumstances.  And because the
app needs to communicate with remote peers, it also has a
mechanism to handle write blocking.  The current code optimization
is based on this fact.

Of course. It was not unreasonable to have the TCP flow control on
loopback when TCP connections are used. Not unreasonable, but
unnecessary. It is also reasonable to remove the unnecessary TCP
flow control for loopback connections. It is not reasonable to
introduce new flow control that requires the app to know that a
connection is loopback so that it can work at all.

If it were a case of 90% of apps getting a 5% performance boost
while 10% saw a 10% reduction, I wouldn't quibble, although
some might complain if any app saw a performance reduction. The
problem is that while 90% might see a boost, some stop working
entirely.


Again, the question to answer is how to make it work better.
But if you really think that the above observation on the
dynamics of data flow is incorrect, please state the reasoning.

I gave my reasoning above. I think that the answer was already
given earlier in the thread. Instead of blocking, yield. Either
that, or go back to the "if the buffer is full, block" semantics.
Hmm. Maybe the real answer is to do the former on single-processor
systems and the latter on multi-processor systems?





--
blu

"The genius of you Americans is that you never make clear-cut stupid
 moves, only complicated stupid moves which make us wonder at the
 possibility that there may be something to them which we are missing."
 - Gamal Abdel Nasser
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
_______________________________________________
networking-discuss mailing list
[email protected]
