On Tue, 19 Apr 2005, Bosko Milekic wrote:

 My experience with 6.0-CURRENT has been that I am able to push at
 least about 400kpps INTO THE KERNEL from a gigE em card on its own
 64-bit PCI-X 133MHz bus (i.e., the bus is uncontested) and that's

A 64-bit bus doesn't seem to be essential for reasonable performance.

I get about 210 kpps (receive) for a bge card on an old Athlon system
with a 32-bit PCI 33MHz bus.  Overclocking this bus speeds up at least
sending almost proportionally to the overclocking :-).  This is with
my version of an old version of -current, with no mpsafenet, no driver
tuning, and no mistuning (no INVARIANTS, etc., no POLLING, no HZ > 100).
Sending goes slightly slower (about 200 kpps).

I get about 220 kpps (send) for a much-maligned (last year) sk non-card
on a much-maligned newer Athlon (nForce2) system with a 32-bit
PCI 33MHz bus.  This is with a similar setup but with sending in the
driver changed to not use the braindamaged sk interrupt moderation.
The changes don't improve the throughput significantly since it is
limited by the sk or bus to 4 us per packet, but they reduce interrupt
overhead.

 basically out of the box GENERIC on a dual-CPU box with HTT disabled
 and no debugging options, with small 50-60 byte UDP packets.

I used an old version of ttcp for testing.  A small packet for me is
5 bytes of UDP data, since that is the minimum that ttcp will send,
but I repeated the tests with a packet size of 50 for comparison.
For the sk, the throughput with a packet size of 5 is only slightly
larger (240 kpps).
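
For concreteness, the sending side of such a test looks something like
this (classic ttcp flags; the receiver's name and the packet count are
made up):

    ttcp -r -u -l5                      # on the receiver, first
    ttcp -t -u -l5 -n1000000 receiver   # blast 10^6 5-byte UDP packets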

There are some kernel deficiencies which at best break testing using
simple programs like ttcp and at worst reduce throughput:
- when the tx queue fills up, the application should stop sending, at
  least in the udp case, but there is no way for userland to tell
  when the queue becomes non-full so that it is useful to try to add
  to it -- select() doesn't work for this.  Applications either have
  to waste cycles by retrying immediately or waste send slots by
  retrying after a short sleep (both sketched in the first example
  after this list).

  The old version of ttcp that I use uses the latter method, with a
  sleep interval of 1000 usec.  This works poorly, especially with HZ
  = 100 (which gives an actual sleep interval of 10000 to 20000 usec),
  or with devices that have a smaller tx queue than sk (511).  The tx
  queue always fills up when blasted with packets; it becomes non-full
  a few usec later after a tx interrupt, and it becomes empty a few
  usec or msec later, and then the transmitter is idle while ttcp
  sleeps.  With sk and HZ = 100, throughput is reduced to approximately
  511 * (1000000 / 15000) = 34066 pps.  HZ = 1000 is just large enough
  for the sleep to always be shorter than the tx draining time (2/HZ
  seconds = 2 msec < 4 * 511 usec = 2.044 msec), so transmission can
  stream.

  Newer versions of ttcp like the one in ports are aware of this problem
  but can't fix it since it is in the kernel.  tools/netrate is less
  explicitly aware of this problem and can't fix it...  However, if
  you don't care about using the sender for anything else and don't
  want to measure efficiency of sending, then retrying immediately can
  be used to generate almost the maximum pps.  Parts of netrate do this.

- the tx queue length is too small for all drivers, so the tx queue fills
  up too often.  It defaults to IFQ_MAXLEN = 50.  This may be right for
  1 Mbps ethernet or even for 10 Mbps ethernet, but it is too small for
  100 Mbps ethernet and far too small for 1000 Mbps ethernet.  Drivers
  with a larger hardware tx queue length all bump it up to their tx
  queue length (often, bogusly, less 1), but it needs to be larger for
  transmission to stream.  I use (SK_TX_RING_CNT + imax(2*tick, 10000) / 4)
  for sk, as in the second sketch below.
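
To make the first deficiency concrete, here is a minimal sketch of the
two userland workarounds.  It is not code from ttcp or netrate, and
the address, port, and packet count are made up for illustration:

    /*
     * Minimal sketch (not code from ttcp or netrate) of the two
     * userland workarounds for ENOBUFS from a blasting UDP sender.
     * The address, port, and packet count are made up.
     */
    #include <sys/socket.h>

    #include <netinet/in.h>
    #include <arpa/inet.h>

    #include <err.h>
    #include <errno.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        struct sockaddr_in sin;
        char buf[5];    /* 5 bytes of UDP data, as in the tests above */
        int i, s;

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) == -1)
            err(1, "socket");
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5001);                     /* made up */
        sin.sin_addr.s_addr = inet_addr("10.0.0.2");    /* made up */
        memset(buf, 0, sizeof(buf));

        for (i = 0; i < 1000000; ) {
            if (sendto(s, buf, sizeof(buf), 0,
                (struct sockaddr *)&sin, sizeof(sin)) == -1) {
                if (errno != ENOBUFS)
                    err(1, "sendto");
                /*
                 * ttcp-style workaround: sleep and hope the tx queue
                 * drains meanwhile.  Wastes send slots whenever the
                 * actual sleep exceeds the queue's drain time, which
                 * it always does with HZ = 100 (the requested 1000
                 * usec becomes 10000-20000 usec).
                 */
                usleep(1000);
                /*
                 * netrate-style alternative: delete the usleep() and
                 * retry immediately.  Gives almost the maximum pps
                 * but burns 100% of the CPU busy-waiting.
                 */
                continue;
            }
            i++;
        }
        return (0);
    }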
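
For the second deficiency, the usual driver pattern and the larger
length described above look roughly like this.  This is a fragment
from memory modelled on sk's attach routine, not a drop-in patch:

    /* What most drivers do: bump IFQ_MAXLEN (50) up to the size of
     * the hardware tx ring (often, bogusly, less 1). */
    ifp->if_snd.ifq_maxlen = SK_TX_RING_CNT - 1;

    /* What I use for sk: enough extra software queue to cover a
     * sleep of 2 clock ticks (2/HZ seconds), or at least 10000 usec,
     * at sk's limit of about 4 usec per packet; `tick' is the
     * kernel's usec-per-clock-tick variable. */
    ifp->if_snd.ifq_maxlen = SK_TX_RING_CNT + imax(2 * tick, 10000) / 4;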

 My tests were done without polling so with very high interrupt load
 and that also sucks when you have a high-traffic scenario.

Interrupt load isn't necessarily very high, relevant, or reduced by
polling.  For transmission, with non-broken hardware and software,
there should be not many more than (pps / <size of hardware tx queue>)
tx interrupts per second, and <size of hardware tx queue> should be
large enough that there aren't many txintrs/sec.  For sk, this gives
240000 / 511 = ~470.  After reprogramming sk's interrupt handling, I
get 539.  The standard driver used to get 7000+ with the old interrupt
moderation timeout of 200 usec (actually 137 usec for Yukon, 200 for
Genesis), and now 14000+ with an interrupt moderation timeout of 100
(68.5) usec.

The interrupt load for 539 txintrs/sec and 240 kpps is 10% on an
overclocked AthlonXP2600 (Barton).  Very little of this is related to
interrupts, so the term "interrupt load" is misleading.  About 480
packets are handled for every tx interrupt (512 less 32 for watermark
stuff).  Much more than 90% of the handling is useful work that would
have to be done somewhere; it just happens to be done in the interrupt
handler, and that is the best place to do it.  With polling, it would
take longer to do it, and the load is poorly reported, so it is hard
to see.  The system load for 539 txintrs/sec and 240 kpps is much
larger: about 45% (up from 25% in RELENG_4 :-().
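
Spelling out that arithmetic (all numbers are the ones quoted above;
nothing is measured by this throwaway program):

    /* Spell out the interrupt-rate arithmetic from the text; all
     * numbers are the quoted ones, not measurements. */
    #include <stdio.h>

    int
    main(void)
    {
        double pps = 240000.0;      /* small-packet send rate */
        double txring = 511.0;      /* sk hardware tx queue entries */
        double watermark = 32.0;    /* entries reserved for watermark */

        /* Ideal: one tx interrupt per nearly-full ring's worth. */
        printf("ideal txintrs/sec: %.0f\n", pps / txring);      /* ~470 */
        /* Packets handled per interrupt, given the watermark. */
        printf("packets/txintr: %.0f\n", 512.0 - watermark);    /* 480 */
        /* A 68.5 usec moderation timeout instead gives: */
        printf("moderated txintrs/sec: %.0f\n", 1e6 / 68.5);    /* ~14600 */
        return (0);
    }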

[Context almost lost to top posting.]

On 4/19/2005 1:32 PM, Eivind Hestnes wrote:

 I have an Intel Pro 1000 MT (PWLA8490MT) NIC (em(4) driver 1.7.35)
 installed in a Pentium III 500 Mhz with 512 MB RAM (100 Mhz) running
 FreeBSD 5.4-RC3.  The machine is routing traffic between multiple
 VLANs.  Recently I did a benchmark with/without device polling
 enabled.  Without device polling I was able to transfer roughly 180
 Mbit/s.  The router however was suffering when doing this benchmark.
 Interrupt load was peaking 100% - overall the system itself was quite
 unusable (_very_ high system load).

I think it is CPU-bound.  My Athlon2600 (overclocked) is many times
faster than your P3/500 (5-10 times?), but it doesn't have much CPU
left over (sending 240000 5-byte udp packets per second from sk takes
60% of the CPU, and sending 53000 1500-byte udp packets per second
takes 30% of the CPU; sending tcp packets takes less CPU but goes
slower).  Apparently 2 or 3 P3/500's worth of CPU is needed just to
keep up with the transmitter (with 100% of the CPU used but no
transmission slots missed).  RELENG_4 has lower overheads so it might
need only 1 or 2 P3/500's worth of CPU to keep up.

 With device polling enabled the interrupt kept stable around 40-50%
 and max transfer rate was nearly 70 Mbit/s. Not very scientific
 tests, but it gave me a pin point.

I don't believe in device polling.  It's not surprising that it
reduces throughput for a device that has large enough hardware queues.
It just lets a machine that is too slow to handle 1Gbps ethernet (at
least under FreeBSD) sort of work, by not using the hardware to its
full potential.  70 Mbit/s is still bad -- it's easy to get more than
that with a 100Mbps NIC.

 [EMAIL PROTECTED]:~$ sysctl -a | grep kern.polling
 ...
 kern.polling.idle_poll: 0

Setting this should increase throughput when the system is idle, by
taking 100% of the CPU then.  With just polling every 1 msec (from HZ
= 1000), there are the same problems as with ttcp retrying every 10-20
msec, but scaled down by a factor of 10-20.  For my ttcp example, the
transmitter runs dry every 2.044 msec, so the polling interval must be
shorter than 2.044 msec, but this is with a full hardware tx queue
(511 entries) on a not very fast NIC.  If the hardware is just twice
as fast, or the tx queue is just half as large or half as full, then
the hardware tx queue will run dry when polled every 1 msec and
hardware capability will be wasted.  This problem can be reduced by
increasing HZ some more, but I don't believe in increasing it beyond
100, since only software that does too much polling would notice it
being larger.
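
As a quick check of the drain-time constraint (the 4 usec per packet
figure is sk's limit from earlier in this mail; this is arithmetic,
not a measurement):

    /* The polling constraint described above: the poll interval must
     * be shorter than the tx queue's drain time.  The 4 usec/packet
     * figure is sk's limit, quoted earlier in this mail. */
    #include <stdio.h>

    int
    main(void)
    {
        double usec_per_pkt = 4.0;  /* sk's limit, ~250 kpps */
        double qlen = 511.0;        /* sk hardware tx queue entries */

        /* HZ = 1000 polls every 1000 usec: fine while 1000 < 2044... */
        printf("full queue drains in %.0f usec\n", qlen * usec_per_pkt);
        /* ...but a half-full queue (or hardware twice as fast) drains
         * in ~1022 usec, so a 1000 usec poll interval barely keeps up
         * and any jitter wastes transmit slots. */
        printf("half-full queue drains in %.0f usec\n",
            qlen / 2.0 * usec_per_pkt);
        return (0);
    }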

Bruce