Brian Utterback wrote:

> Right. I agree that it is better to buffer the data and then
> write it in a single big write and be done with it. I have been
> ranting about that for years, telling everybody that would listen
> about the evils of TCP_NODELAY. 


I hope we are not really telling our customers that using
TCP_NODELAY is "evil."  If an app writer knows exactly what
it means and understands the consequences but still chooses
to set it because of the nature of the app, I believe it
is a legitimate use.
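
For what it's worth, a legitimate use looks as simple as the
hypothetical sketch below (the connected socket "fd" is assumed);
the app just turns off Nagle because it knows its small writes are
latency sensitive and it accepts sending extra packets:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /*
     * Hypothetical sketch: disable Nagle on an already connected
     * socket because the app sends small, latency-sensitive
     * messages and understands the extra-packet trade-off.
     */
    int
    set_nodelay(int fd)
    {
            int on = 1;

            return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY,
                &on, sizeof (on)));
    }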


> But the cost of bytes across the
> network is way more than the cost of cycles. 


They are different costs.  I think comparing them this way
is like comparing apples and oranges.


> If we haven't been
> able to get customers to conserve the one, I don't see how we will
> ever get them to adopt the same habits for the other. Except using
> the bigger stick. "Forget performance, your app will stop working
> entirely! Bawahhha!"


The motivation for the change is that there are real apps
which benefit from it.  I was told that the known app which
does not work fails because it does not handle EAGAIN.  The
app assumes that every write must succeed, so it does not
handle the EAGAIN error.  Regardless of the fusion change,
I think this app may one day fail anyway.  While we should
improve the current code, I don't think there is anything we
can do if an app does not handle the error case.
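
To be concrete about what handling the error case means, a minimal
sketch of a robust non-blocking write loop (the names here are mine,
not from the app in question) would be something like:

    #include <errno.h>
    #include <poll.h>
    #include <unistd.h>

    /*
     * Hypothetical sketch: write "len" bytes on a non-blocking
     * socket, coping with EAGAIN and with partial writes instead
     * of assuming that every write() succeeds in full.
     */
    int
    write_all(int fd, const char *buf, size_t len)
    {
            struct pollfd pfd;

            while (len > 0) {
                    ssize_t n = write(fd, buf, len);

                    if (n > 0) {
                            buf += n;
                            len -= n;
                            continue;
                    }
                    if (n == -1 && (errno == EAGAIN || errno == EINTR)) {
                            /* Wait until the socket is writable again. */
                            pfd.fd = fd;
                            pfd.events = POLLOUT;
                            (void) poll(&pfd, 1, -1);
                            continue;
                    }
                    return (-1);    /* a real error */
            }
            return (0);
    }

An app written this way keeps working whether or not the transport
underneath is fused loopback TCP.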


> If the reader does not use poll at all and instead reads in a delay
> loop, each read will contain less data because of the block thus
> forcing it to do more read calls for the same amount of data.


I suspect that the above is not always true.  While one can
certainly write a delay loop with exactly the timing needed
to make things behave badly, it may not happen often.  For
example, netperf is such a simple program.  I remember Adi
did extensive testing using it, and the current code did
not make performance worse.  If Adi is reading this, he can
confirm this with data.  The performance comparison is
against normal fused TCP, which only blocks the sending
side when the receive buffer is full.  I believe Adi ran
such tests on both single-processor (single-threaded) and
multiprocessor systems.


>> Why?  Actual data shows that some apps indeed perform better.
>> Could you explain your reasoning why the performance data does
>> not correctly represent the dynamics of the code flow and
>> the reason why the improvement is not real?
> 
> Simple. As we have already noted, we don't consolidate data into
> a buffer. 


I guess there is a misunderstanding.  The app I mentioned does
consolidate the data into one write after the unblock.  This is
where part of the gain comes from.  The other part is that the
reader can then do a single big read.  That's what I meant by
"actual data" above.

By "simple" above, do you have data showing that the original
analysis of those apps is not correct?  If you do, could you
share the data?


> The data when written is copied into a data block in
> the size it was given to the syscall. The cost of processing the
> data is then fixed by the code paths which are the same no matter
> how much data is in the blocks, give or take. So, yes, having a
> single data block is a big win, but not as much of a win as having
> a single network packet. 


As mentioned earlier, comparing network cost and system cost
is like comparing apples and oranges.  What we need to compare
is a single big write and read versus many writes and reads.


> And we can't get the customers to optimize
> for *that*. Blocking does not reduce the number of data blocks
> processed. It might reduce the number of context switches done by
> the reader but it might also increase them.


In reality, it was the fact that our customers are writing
very optimized code that prompted Adi to come up with the
current algorithm to further optimize fusion performance.
The original fusion code only blocks when the receive
buffer is full.


> Some types of benchmarks will see performance improvement because
> of a reduction in the data latency. By blocking, we are allowing the
> reader to consume the data, reducing the average time the data is
> waiting to be read.


I assume you have some data to support the above observation.
Could you share it?  Was it done using a single-threaded
processor?


> To take an extreme case, suppose you had two real time processes,
> one of which continually wrote timestamps to a TCP connection, and
> the other reads them until the connection is closed. If the first
> wrote for 10 seconds before yielding, and then the second read them
> one at a time at the same rate and compared them to the current time,
> we would have each timestamp being read after a 10 second wait. But
> if both processes did a yield between each read or write, the
> time delay would be extremely small and the whole test would
> be done in 10 seconds instead of 20 seconds. In terms of the actual
> work done, it would be the same in both cases. In terms of efficiency,
> the first might actually be the more efficient because of the effects
> of instruction page caching.
> 
> So a yield after every read or write is probably not a good idea. A
> block after X writes is also not a good idea.  What is the best? It
> still seems to me to be a function of the scheduler and not the I/O
> system.


As I mentioned a couple of times, it is the app which makes the
big difference in the performance gain.  The gain is there on
both single-processor (single-threaded) and multiprocessor
systems.  If you have data showing that this is actually not the
reason for the performance gain, please share it.  It will help
clear things up.


> Of course. It was not unreasonable to have the TCP flow control on
> loopback when TCP connections are used. Not unreasonable, but
> unnecessary. It is also reasonable to remove the unnecessary TCP
> flow control for loopback connections. 


As mentioned above, the current code wins when compared to
fused TCP which only blocks when the receive buffer is full.


> It is not reasonable to
> introduce new flow control that requires the app know that a
> connection is loopback so that it can work at all.


And could you explain why a well-written app would need to
know that it is using a loopback connection in order to
work at all?


> If it were a case of 90% of apps getting a 5% performance boost
> while 10% saw a 10% reduction, I wouldn't quibble, although
> some might complain if any app saw a performance reduction. The
> problem is that while 90% might see a boost, some stop working
> entirely.


AFAIK, only those apps that expect certain behavior which
cannot be guaranteed will fail to work.  For those apps,
I think it is better to make them more robust.


> I gave my reasoning above. I think that the answer was already
> given earlier in the thread. Instead of blocking, yield. Either
> that, or go back to the "if the buffer is full, block" semantics.
> Hmm. Maybe the real answer is do the former on single processor
> systems and the latter on multi-processor systems?


Have you implemented this, and do you have data showing that it
actually performs much better?  As I said, we should improve the
code.  It is better to see some code and data than just "thinking."



-- 

                                                K. Poon.
                                                [EMAIL PROTECTED]

_______________________________________________
networking-discuss mailing list
[email protected]
