Kacheong Poon wrote:
Brian Utterback wrote:
Right. I agree that it is better to buffer the data and then
write it in a single big write and be done with it. I have been
ranting about that for years, telling everybody who would listen
about the evils of TCP_NODELAY.
I hope we are not really telling our customers that using
TCP_NODELAY is "evil." If an app writer knows exactly what
it means and understands the consequences but still chooses
to set it because of the nature of the app, I believe it
is a legitimate use.
Oh, I agree completely. However, in the course of customer issues I
have never actually dealt with an application that used TCP_NODELAY
"legitimately", while I have dealt with many that did not. The vast
majority of applications that use TCP_NODELAY do so because they had
poor performance and somebody told the programmer that setting
TCP_NODELAY would improve it; and lo and behold, it did!
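For readers following along, part of why this gets cargo-culted is that
turning Nagle off is a one-liner. Something along these lines, using
nothing beyond the standard sockets API:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm on an already-connected TCP socket. */
int
set_nodelay(int fd)
{
	int on = 1;

	return (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof (on)));
}

One setsockopt call and the latency of the small writes disappears,
along with any pressure to fix the real problem, which is the write
pattern itself.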
But the cost of bytes across the
network is way more than the cost of cycles.
They are different costs. I think comparing them this way
is like comparing apples and oranges.
Agreed. But like a farmer deciding what to grow, sometimes you
have to compare apples and oranges. For instance, deciding when
to use compression on a TCP data stream is exactly that kind of
trade-off: cycles for packets. I happen to feel that sending single
bytes across a network, with its 6600% overhead plus the attendant
cost in cycles of processing each packet plus the cost of processing
the kernel data blocks, is more worth avoiding than processing the
same data in the same manner when only the cost of processing the
kernel blocks is in the picture.
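(In case anyone wonders where a figure like 6600% comes from: one
back-of-the-envelope accounting is roughly 20 bytes of TCP header,
20 of IP, and on the order of 26 of Ethernet framing, so about 66
bytes of overhead carrying a single byte of payload, i.e. 66/1, or
6600%. The exact number obviously varies with options and media.)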
If we haven't been
able to get customers to conserve the one, I don't see how we will
ever get them to adopt the same habits for the other. Except by
using the bigger stick: "Forget performance, your app will stop working
entirely! Bawahhha!"
The motivation of the change is that there are real apps
which benefit from it. I was told that the one known app which
does not work fails because it does not handle EAGAIN. The app
assumes that all writes must succeed, so it does not handle the
EAGAIN error. I think regardless of the fusion change,
this app may one day fail anyway. While we should improve
the current code, I don't think there is anything we can
do if an app does not handle the error case.
I don't think that EAGAIN was the issue. I no longer have the
information about the original call (it wasn't mine, I was consulting)
and I have not been able to find it. However, from the emails I was
able to find, none of them mentioned EAGAIN. Further, I have recently
had another call where an app was getting EAGAIN, and I told the customer
exactly what you said, i.e., that the app should handle EAGAIN.
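For the record, handling EAGAIN is not a lot of code. A rough sketch
of what I tell customers, assuming a non-blocking socket and using
poll() to wait for the descriptor to become writable again; the
function name and shape here are mine, purely for illustration:

#include <errno.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Write "len" bytes from "buf" to the non-blocking socket "fd",
 * waiting with poll() whenever the send side is temporarily full. */
ssize_t
write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;

	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n > 0) {
			done += n;
			continue;
		}
		if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
			struct pollfd pfd;

			pfd.fd = fd;
			pfd.events = POLLOUT;
			pfd.revents = 0;
			(void) poll(&pfd, 1, -1);	/* wait until writable */
			continue;
		}
		if (n == -1 && errno == EINTR)
			continue;
		return (-1);				/* real error */
	}
	return ((ssize_t)done);
}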
Besides, EAGAIN is not the issue anyway. The issue is an application
(possibly real, possibly theoretical) that deadlocks when run with
both ends on the same system on Solaris 10, but does not deadlock
when run on different systems, on the same system on Solaris 9, or
on any other platform.
If the reader does not use poll at all and instead reads in a delay
loop, each read will contain less data because of the block, thus
forcing it to do more read calls for the same amount of data.
I suspect that the above is not always true. While one can
definitely write such a delay loop with the exact timing
to make things behave badly, it may not happen often. For
example, netperf is such a simple program. I remember Adi
had done extensive testing using it and the current code
did not make the performance worse. If Adi is reading this,
he can confirm this with data. The performance comparison
is against a normal fused TCP which only blocks the sending
side when the receive buffer is full. I think Adi has done
such tests on both single-processor (single-threaded) and
multiprocessor systems.
It doesn't take any crafting to get just the right timing. All
you need is a writer that writes more than 9 times in the interval
between the reader's reads, and a reader that is able to read all that
the writer would have written. For instance, if the writer writes
one byte a second and the reader reads every 20 seconds, instead of
getting 20 bytes it will only get 8 every 20 seconds.
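It is easy enough to mock up, by the way. A rough sketch of the
scenario follows; nothing here is Solaris-specific, error checking is
omitted for brevity, and the claim that the sender is held to roughly
8 unread writes is the fusion behavior under discussion, not something
this program asserts:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
	int lsn, wr, rd;
	struct sockaddr_in sin;
	socklen_t slen = sizeof (sin);

	/* Set up a loopback TCP connection: wr writes, rd reads. */
	lsn = socket(AF_INET, SOCK_STREAM, 0);
	memset(&sin, 0, sizeof (sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;			/* any free port */
	(void) bind(lsn, (struct sockaddr *)&sin, sizeof (sin));
	(void) listen(lsn, 1);
	(void) getsockname(lsn, (struct sockaddr *)&sin, &slen);
	wr = socket(AF_INET, SOCK_STREAM, 0);
	(void) connect(wr, (struct sockaddr *)&sin, sizeof (sin));
	rd = accept(lsn, NULL, NULL);

	if (fork() == 0) {
		/* Writer: one byte per second for a minute. */
		int i;

		(void) close(rd);
		(void) close(lsn);
		for (i = 0; i < 60; i++) {
			(void) write(wr, "x", 1);
			(void) sleep(1);
		}
		_exit(0);
	}

	/* Reader: wake up every 20 seconds and drain whatever is there. */
	(void) close(wr);
	(void) close(lsn);
	for (;;) {
		char buf[1024];
		ssize_t n;

		(void) sleep(20);
		n = read(rd, buf, sizeof (buf));
		if (n <= 0)
			break;
		(void) printf("read %ld bytes\n", (long)n);
	}
	return (0);
}

On the wire, or on Solaris 9 loopback, each read should report close
to 20 bytes; with the current fusion behavior it should report far
fewer, if the writer is being blocked after a handful of unread writes.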
Why? Actual data shows that some apps indeed perform better.
Could you explain your reasoning as to why the performance data
does not correctly represent the dynamics of the code flow and
why the improvement is not real?
Simple. As we have already noted, we don't consolidate data into
a buffer.
I guess there is a misunderstanding. The app I mentioned does
consolidate the data in the write after the unblock. This is
where part of the gain is from. The other part is that the reader
can then do a single big read. That's what I meant by "actual
data" above.
That is consolidation in the app in userland, not in kernel land.
I am not saying that it isn't better to do all of that kind of
optimization in the application. As you noted, you can get huge
performance gains that way. What I am asking is: what about the
less well written apps and the legacy apps? I wouldn't even mind
a performance degradation; it's total failure that I find a problem.
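Just so we are talking about the same thing: by consolidation in
userland I mean something like the following, where the app gathers
its small messages and hands them to the kernel in one call. The
names are mine, purely for illustration, and a real app would also
loop on short writes, as in the earlier EAGAIN sketch.

#include <sys/types.h>
#include <sys/uio.h>

#define	BATCH_MAX	64

/*
 * Instead of one write(2) per small message, gather a batch of
 * pending messages and hand them to the kernel in a single writev(2).
 */
ssize_t
flush_batch(int fd, char **msgs, size_t *lens, int nmsgs)
{
	struct iovec iov[BATCH_MAX];
	int i;

	if (nmsgs > BATCH_MAX)
		nmsgs = BATCH_MAX;
	for (i = 0; i < nmsgs; i++) {
		iov[i].iov_base = msgs[i];
		iov[i].iov_len = lens[i];
	}
	return (writev(fd, iov, nmsgs));
}

That is great when the app is written, or can be rewritten, to do it.
My concern is the app that was never going to be.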
By "simple" above, do you have data showing that the original
analysis of those apps is not correct? If you do, could you
share the data?
I don't have it, but I will try to get some. Do you have an application
that shows an improvement with the blocking behavior?
The data, when written, is copied into a data block of the size
it was given to the syscall. The cost of processing the
data is then fixed by the code paths which are the same no matter
how much data is in the blocks, give or take. So, yes, having a
single data block is a big win, but not as much of a win as having
a single network packet.
As mentioned earlier, comparing network cost and system cost
is like comparing apples and oranges. What we need to compare
is a single big write and read versus many writes and reads.
No, that would also be comparing apples to oranges. The question
is not whether or not big reads and writes are more efficient; I'll
concede that. The question is whether blocking is better than something
else. The argument I made in this paragraph was merely pointing out
why I think that taking extra measures to trade performance for
efficiency makes sense for networks and not for cycles. It doesn't
affect the question at hand.
And we can't get the customers to optimize
for *that*. Blocking does not reduce the number of data blocks
processed. It might reduce the number of context switches done by
the reader but it might also increase them.
In reality, it is the fact that our customers are actually
writing very optimized code which led Adi to come up
with the current algorithm to further optimize the fusion
performance. The original fusion code only blocks when the
receive buffer is full.
Maybe some customers write highly optimized code, but some do not.
You cannot rely on that being the case. Quick and dirty is also
popular. As I said, I have no problem with quick and dirty also
performing poorly, but it should still work.
Some types of benchmarks will see a performance improvement because
of a reduction in the data latency. By blocking, we allow the
reader to consume the data, reducing the average time the data is
waiting to be read.
I assume you have some data to support the above observation.
Could you share the data? Was it done using a single-threaded
processor?
A priori reasoning. I am willing to do the testing and report back.
To take an extreme case, suppose you had two real-time processes,
one of which continually wrote timestamps to a TCP connection, and
the other read them until the connection was closed. If the first
wrote for 10 seconds before yielding, and then the second read them
one at a time at the same rate and compared them to the current time,
we would have each timestamp being read after a 10 second wait. But
if both processes did a yield between each read or write, the
time delay would be extremely small and the whole test would
be done in 10 seconds instead of 20 seconds. In terms of the actual
work done, it would be the same in both cases. In terms of efficiency,
the first might actually be the more efficient because of the effects
of instruction page caching.
So a yield after every read or write is probably not a good idea. A
block after X writes is also not a good idea. What is the best? It
still seems to me to be a function of the scheduler and not the I/O
system.
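If it helps, here is roughly what that thought experiment looks like
in code. This is a sketch I have not run; it assumes a loopback TCP
connection already set up (as in the earlier sketch), leaves out the
real-time scheduling class, and ignores short reads for brevity. The
only difference between the two behaviors is whether the yield is
compiled in:

#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define	YIELD_EACH_OP	0	/* 0: run out the slice; 1: yield per op */

/* Writer side: stamp the current time into the connection. */
static void
write_stamps(int fd, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		struct timespec ts;

		(void) clock_gettime(CLOCK_MONOTONIC, &ts);
		(void) write(fd, &ts, sizeof (ts));
		if (YIELD_EACH_OP)
			(void) sched_yield();
	}
}

/* Reader side: read each stamp back and report how long it waited. */
static void
read_stamps(int fd, int count)
{
	int i;

	for (i = 0; i < count; i++) {
		struct timespec then, now;

		if (read(fd, &then, sizeof (then)) != sizeof (then))
			break;
		(void) clock_gettime(CLOCK_MONOTONIC, &now);
		(void) printf("waited %.6f sec\n",
		    (double)(now.tv_sec - then.tv_sec) +
		    (now.tv_nsec - then.tv_nsec) / 1e9);
		if (YIELD_EACH_OP)
			(void) sched_yield();
	}
}

With YIELD_EACH_OP set to 0, every timestamp sits in the connection
for the length of the writer's run; with it set to 1, the reported
waits should be tiny, while the total work done is the same.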
As I mentioned a couple of times, it is the app which makes the
big difference in performance gain. The gain is there on both
single-processor (single-threaded) and multiprocessor systems. If you have
data showing that this is actually not the reason for the performance
gain, please share the data. It will help clear things up.
I will try to clear this up as you suggest. But apps can fail now,
suggesting that something needs to be fixed.
Of course. It was not unreasonable to have TCP flow control on
loopback when TCP connections are used. Not unreasonable, but
unnecessary. It is also reasonable to remove the unnecessary TCP
flow control for loopback connections.
As mentioned above, the current code wins when compared to
fused TCP which only blocks when the receive buffer is full.
It is not reasonable to
introduce new flow control that requires the app to know that a
connection is loopback so that it can work at all.
And could you explain the reason that a well written app will
need to know that it is using a loopback connection in order
to work at all?
Define "well written"? And should the criteria be "well written"
or "correct"? The point I was making is that a reasonable strategy
for over the wire will fail with the current setup. Not perform
poorly, just fail. Thus the app will need to behave differently
for the loopback case and will thus need to know when it is in use.
If it were a case of 90% of apps getting a 5% performance boost
while 10% saw a 10% reduction, I wouldn't quibble, although
some might complain if any app saw a performance reduction. The
problem is that while 90% might see a boost, some stop working
entirely.
AFAIK, only those apps that expect certain behavior which
cannot always be guaranteed will fail to work. For those apps,
I think it is better to make them more robust.
Do we really want to keep making the allowed expected behavior
narrower and narrower, and just for Solaris? Can we?
I gave my reasoning above. I think that the answer was already
given earlier in the thread. Instead of blocking, yield. Either
that, or go back to the "if the buffer is full, block" semantics.
Hmm. Maybe the real answer is to do the former on single-processor
systems and the latter on multiprocessor systems?
Have you implemented this, and do you have data showing that it actually
performs much better? As I said, we should improve the code.
It is better to see some code and data than just "thinking."
Nope, but I will. See you later.
--
blu
"The genius of you Americans is that you never make clear-cut stupid
moves, only complicated stupid moves which make us wonder at the
possibility that there may be something to them which we are missing."
- Gamal Abdel Nasser
----------------------------------------------------------------------
Brian Utterback - Solaris RPE, Sun Microsystems, Inc.
Ph:877-259-7345, Em:brian.utterback-at-ess-you-enn-dot-kom
_______________________________________________
networking-discuss mailing list
[email protected]