Re: [UPDATE] zerocopy patch against 2.4.2-pre2

2001-02-13 Thread David Rees

On Wed, Feb 14, 2001 at 12:27:10AM +1100, Andrew Morton wrote:
> 
> It's getting very lonely testing this stuff. It would be useful if
> someone else could help out - at least running the bw_tcp tests. It's
> pretty simple:
> 
>   bw_tcp -s ; bw_tcp 0

OK, here are my bw_tcp results on a K6-2 450. I ran bw_tcp 10 times, then
averaged the results.

bw_tcp (MB/sec)
2.4.2-pre3     57.0
2.4.2-pre3zc   52.6

-Dave



Re: [UPDATE] zerocopy patch against 2.4.2-pre2

2001-02-13 Thread Andrew Morton

"David S. Miller" wrote:
> 
> Andrew Morton writes:
>  > Changing the memory copy function did make some difference
>  > in my setup.  But the performance drop on send(8k) is only approx 10%,
>  > partly because I changed the way I'm testing it - `cyclesoak' is
>  > now penalised more heavily by cache misses, and the amount of cache
>  > missing that networking causes for cyclesoak is basically the same,
>  > whether or not the ZC patch is applied.
> 
> Ok ok ok, but are we at the point where there are no sizable "over the
> wire" performance anomalies anymore?  That is what is important: what
> are the localhost bandwidth measurements looking like for you now
> with/without the patch applied?

Using 2.4.2-pre3 + zerocopy-2.4.2p3-1.diff

All numbers in megs/sec

zcc/zcs is doing read(8k)/send(8k) to localhost.

On the dual 500MHz PII:

               zcc/zcs   bw_tcp

  Unpatched:     70        66
  Patched:       67        66

Single 500MHz PII:

  Unpatched:     58        54
  Patched:       49        52

Single 650MHz PIII Coppermine:

  Unpatched:    140       180-250
  Patched:      107       159


With or without ZC, there is Weird Stuff happening with local
networking. Throughput is all over the place.

- With zcs reporting throughput once per second, the numbers were jumping
  around by +/-10%.  Had to bump the averaging period to 5 seconds to
  make much sense of it.   With a real network, they're rock solid.

- The difference between the PII and PIII is far beyond anything I see
  with any other workload.

- The difference between zcc/zcs and bw_tcp on the PIII is interesting.
  It's still apparent when zcc/zcs uses a 64k transfer buffer, like bw_tcp.
  zcc/zcs is doing file system reads, whereas bw_tcp isn't.  But the
  discrepancy isn't there on the PII.

- On the unpatched kernel, I saw one bw_tcp run after a reboot report
  410 Mbytes/sec.  Thereafter it's around 210.  err..  make that 180. No,
  make that 254. WTF?

Amongst all the noise it seems there's a problem on the PIII but
not the PII.

It's getting very lonely testing this stuff. It would be useful if
someone else could help out - at least running the bw_tcp tests. It's
pretty simple:

bw_tcp -s ; bw_tcp 0


> I want to reach a known state where we can conclude "over the wire is
> about as good or better than before, but there is a cpu/cache usage
> penalty from the zerocopy stuff".
> 
> This is important.  It lets us get to the next stage which is to
> use your tools, numbers, and some profiling to see if we can get
> some of that cpu overhead back.

It seems that with the 100baseT NIC, the performance drop on the Coppermine
is only half that of the Mendocino.  I _think_ the Mendocino's L2 is
only 4-way associative, but reports vary on this.  The Coppermine's is 8-way.




Re: [UPDATE] zerocopy patch against 2.4.2-pre2

2001-02-12 Thread David S. Miller


Andrew Morton writes:
 > Changing the memory copy function did make some difference
 > in my setup.  But the performance drop on send(8k) is only approx 10%,
 > partly because I changed the way I'm testing it - `cyclesoak' is
 > now penalised more heavily by cache misses, and the amount of cache
 > missing that networking causes for cyclesoak is basically the same,
 > whether or not the ZC patch is applied.

Ok ok ok, but are we at the point where there are no sizable "over the
wire" performance anomalies anymore?  That is what is important: what
are the localhost bandwidth measurements looking like for you now
with/without the patch applied?

I want to reach a known state where we can conclude "over the wire is
about as good or better than before, but there is a cpu/cache usage
penalty from the zerocopy stuff".

This is important.  It lets us get to the next stage which is to
use your tools, numbers, and some profiling to see if we can get
some of that cpu overhead back.

Later,
David S. Miller
[EMAIL PROTECTED]




Re: [UPDATE] zerocopy patch against 2.4.2-pre2

2001-02-11 Thread Andrew Morton

"David S. Miller" wrote:
> 
> As usual:
> 
> ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.2p2-1.diff.gz
> 
> It's updated to be against the latest (2.4.2-pre2) and I've removed
> the non-zerocopy related fixes from the patch (because I've sent them
> under separate cover to Linus).
> 

Changing the memory copy function did make some difference
in my setup.  But the performance drop on send(8k) is only approx 10%,
partly because I changed the way I'm testing it - `cyclesoak' is
now penalised more heavily by cache misses, and the amount of cache
missing that networking causes for cyclesoak is basically the same,
whether or not the ZC patch is applied.

I tried a number of things to optimise this situation
on an SG-capable NIC with the ZC patch:

while (more_to_send) {
        read(fd, buf, 8192);
        send(sock, buf, 8192, 0);
}
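
For reference, a minimal self-contained version of that loop might look
like the sketch below; the destination port (5001), the command-line
interface and the error handling are illustrative assumptions, not the
actual test harness that produced the numbers here:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(int argc, char **argv)
{
        char buf[8192];
        struct sockaddr_in sin;
        int fd, sock;
        ssize_t n;

        if (argc != 3) {
                fprintf(stderr, "usage: %s <file> <dest-ip>\n", argv[0]);
                return 1;
        }

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        sock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);             /* assumed port for the sink */
        sin.sin_addr.s_addr = inet_addr(argv[2]);
        if (sock < 0 || connect(sock, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
                perror("connect");
                return 1;
        }

        /* pump the file out in 8k chunks, as in the loop above */
        while ((n = read(fd, buf, sizeof(buf))) > 0) {
                if (send(sock, buf, n, 0) != n) {
                        perror("send");
                        break;
                }
        }

        close(sock);
        close(fd);
        return 0;
}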

Things I tried:

- Use the csum_copy() functions

- Use copy_from_user()

- Use copy_from_user if src and dest are 8-byte aligned,
  else use csum_copy.

- Force data alignment.

  Explain:  If an application sends a few bytes to a connection
  (say, some headers) and then starts pumping bulk data down the
  same connection, we end up in the situation where the source of
  a copy_from_user is poorly aligned, and it *stays* that way for
  the whole operation.  This is because new, incoming data is always
  tacked onto the end of the socket write buffer.

  Copying from a poorly aligned source address takes 1.5 to 2 times
  as long, depending upon the combination of source-cached and
  dest-cached.

  So I special-cased this in tcp_sendmsg: if we see a large write
  from userspace and we're poorly aligned then just send out a single
  undersized frame so we can drop back into alignment.

  This didn't make a lot of difference, which perhaps indicates
  that the dominating factor is misses, not alignment.  If it
  _is_ misses, they're probably due to aliasing - Ingo said his
  toy has 2 megs of full-speed L2.

- skbuff_cache.

  Explain: When we build an skbuff for ZC transmit it is always
  the same size - it only holds the headers.  The data is put
  into the fragment buffer.  So I created a slab cache for
  skbuffs whose data length is <= 256 bytes, and used that.

  This didn't make much difference.
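
  As a rough illustration of that idea (not the patch's actual code), a
  dedicated 2.4-style slab cache for header-plus-small-data skbuffs might
  be set up as below; the cache name, the 256-byte data area folded into
  the object size, and the single-object layout are assumptions, since the
  real sk_buff path allocates the struct and its data separately:

#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/errno.h>

#define SMALL_SKB_DATA  256     /* assumed data area, per the description above */

static kmem_cache_t *small_skb_cache;

static int __init small_skb_cache_init(void)
{
        /* 2.4-era kmem_cache_create(): name, size, offset, flags, ctor, dtor */
        small_skb_cache = kmem_cache_create("small_skbuff",
                                            sizeof(struct sk_buff) + SMALL_SKB_DATA,
                                            0, SLAB_HWCACHE_ALIGN, NULL, NULL);
        return small_skb_cache ? 0 : -ENOMEM;
}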


send(8k), no SG                                           19.2%
send(8k), SG, csum_copy                                   20.3%
send(8k), SG, copy_from_user                              20.9%
send(8k), SG, choose copy                                 20.6% (huh?)
send(8k), SG, page-aligned, choose copy                   20.3%
send(8k), SG, page-aligned, csum_copy                     20.2%
send(8k), SG, csum_copy, skbuff_cache                     20.5% (huh?)
send(8k), SG, csum_copy, skbuff_cache, page-aligned       20.2%
send(8k), SG, copy_from_user, skbuff_cache, page-aligned  20.2%


That's all pretty uninteresting, except for the observation
that not using Pentium string ops on non-8-byte-aligned data is the
biggest win.  And the two "(huh?)" results, the first of which is
bizarre.  I've checked that code over and over:

   if (((long)_from | (long)_to) & 7)
           csum_and_copy()
   else
           copy_from_user()

and it's slower than an unconditional csum_and_copy().  Weird.

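For concreteness, the "Force data alignment" trick in the list above boils
down to something like the sketch below; the function name, and deriving
the alignment from the queued byte count, are illustrative assumptions
rather than the actual tcp_sendmsg change:

/*
 * Illustrative only: if the bytes already queued on the connection leave
 * the next copy off an 8-byte boundary, send one deliberately undersized
 * chunk so that subsequent bulk copies start aligned again.
 */
static inline unsigned int next_chunk_len(unsigned long queued_bytes,
                                          unsigned int full_size)
{
        unsigned long misalign = queued_bytes & 7;

        if (misalign)
                return 8 - misalign;    /* short chunk restores 8-byte alignment */
        return full_size;               /* already aligned: full-sized chunk */
}
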
The profiles are more interesting:

send(8k), no SG                18.2%
====================================

c0224734 tcp_transmit_skb 47   0.0347
c01127dc schedule 54   0.0340
c021599c ip_output54   0.1688
c010a768 handle_IRQ_event 55   0.4583
c02041ec skb_release_data 60   0.5357
c0211068 ip_route_input   69   0.1938
c022abac tcp_v4_rcv   75   0.0470
c0215adc ip_queue_xmit76   0.0571
c0204410 skb_clone85   0.1986
c0219a54 tcp_sendmsg_copy 99   0.0270
c02209fc tcp_clean_rtx_queue 101   0.1153
c02042c4 __kfree_skb 113   0.3404
c024a3cc csum_partial_copy_generic   436   1.7581
c0125580 file_read_actor 548   6.5238
 total  2874   0.0021

send(8k), SG, csum copy        20.3%
====================================

c0211068 ip_route_input   47   0.1320
c011be60 del_timer49   0.6806
c021599c ip_output49   0.1531
c010a768 handle_IRQ_event 56   0.4667
c022abac tcp_v4_rcv   66   0.0414
c02041ec skb_release_data 69   0.6161
c0215adc ip_queue_xmit69   0.0518
c0204410 skb_clone

[UPDATE] zerocopy patch against 2.4.2-pre2

2001-02-09 Thread David S. Miller


As usual:

ftp://ftp.kernel.org/pub/linux/kernel/people/davem/zerocopy-2.4.2p2-1.diff.gz

It's updated to be against the latest (2.4.2-pre2) and I've removed
the non-zerocopy related fixes from the patch (because I've sent them
under separate cover to Linus).

Enjoy.  As usual, I am very seriously interested in any bugs or
performance problems introduced by this patch.  Thanks.

Later,
David S. Miller
[EMAIL PROTECTED]