Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Fri, Jul 11, 2008 at 11:44 PM, Brian McGinty [EMAIL PROTECTED] wrote: Hi Brian I very much doubt that this is ceteris paribus. This is 384 random IPs - 384 random IP addresses with a flow lookup for each packet. Also, I've read through igb on Linux - it has a lot of optimizations that the FreeBSD driver lacks and I have yet to implement. Hey Kip, when will you push the optimization into FreeBSD? Hi Brian, I'm hoping to get to it some time in August. I'm a bit behind in my contracts at the moment. FYI: I'm actually able to forward 2.3Mpps between 2 10Gig interfaces on an 8-core system. I'm hoping to push it up to 3Mpps. Thanks, Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
G'day Kip,

I'm hoping to get to it some time in August. I'm a bit behind in my contracts at the moment.

A few weeks ago, I did a quick comparison of the driver between FreeBSD and Linux, and found quite a few differences that are worth pulling over. The guy from Intel working on FreeBSD, Jack?, is he the one that does this sort of sync-up of the drivers between the two distributions, or you? There have been a lot of changes recently, including full support for multiple Rx/Tx queues that significantly ups the ante on performance. FreeBSD doesn't support multiple Rx/Tx, or does something half arsed.

FYI: I'm actually able to forward 2.3Mpps between 2 10Gig interfaces on an 8-core system. I'm hoping to push it up to 3Mpps.

Is this a no-loss number, and how did you test it? I don't have throughput numbers for the Oplin. I'm waiting to get some time on the Ixia at work to generate performance numbers for 1G and 10G for all packet sizes, on FreeBSD and Linux, on a 16 core system, and blast it to the list. I expect Linux to do 2-3 times better :-)

Later,
Brian
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Sat, Jul 19, 2008 at 7:17 PM, Brian McGinty [EMAIL PROTECTED] wrote:

G'day Kip, I'm hoping to get to it some time in August. I'm a bit behind in my contracts at the moment. A few weeks ago, I did a quick comparison of the driver between FreeBSD and Linux, and found quite a few differences that are worth pulling over. The guy from Intel working on FreeBSD, Jack?, is he the one that does this sort of sync-up of the drivers between the two distributions, or you? There have been a lot of changes recently, including full support for multiple Rx/Tx queues that significantly ups the ante on performance. FreeBSD doesn't support multiple Rx/Tx, or does something half arsed.

This is on a variant of RELENG_6 FreeBSD with a recent version of ULE and running the Checkpoint firewall. It also uses the full number of queues available to igb (4) and #queues == #cores (8 in this case) for ixgbe. The drivers in CVS have some bugs that I have fixed in this FreeBSD variant. FreeBSD's CVS version of the Intel drivers definitely lags Linux in terms of some optimizations. Even my version doesn't have some of the Linux optimizations.

FYI: I'm actually able to forward 2.3Mpps between 2 10Gig interfaces on an 8-core system. I'm hoping to push it up to 3Mpps.

This is tested with an IXIA; I don't currently have zero-loss numbers. This is not fully loaded. However, ixgbe spews out pause frames when rx gets backed up, so losses never get much above 0.1%.

Is this a no-loss number, and how did you test it? I don't have throughput numbers for the Oplin. I'm waiting to get some time on the Ixia at work to generate performance numbers for 1G and 10G for all packet sizes, on FreeBSD and Linux, on a 16 core system, and blast it to the list. I expect Linux to do 2-3 times better :-)

Sure, if you don't care about packet reordering. On their own box Checkpoint claims that Linux is currently able to do 20% better than we are seeing. Even they don't claim 200% - 300%. I know people who are switching off of Linux for memcache because they simply can't make it perform. So your mileage really varies depending on the workload. I'm not sure where you get your numbers from. I would really like to get a hold of this magical Linux distribution to do a side by side comparison on the same workload. A 200% - 300% performance delta would definitely justify switching.

Thanks,
Kip
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Robert Watson wrote: On Mon, 7 Jul 2008, Bruce Evans wrote:

(1) sendto() to a specific address and port on a socket that has been bound to INADDR_ANY and a specific port. (2) sendto() on a specific address and port on a socket that has been bound to a specific IP address (not INADDR_ANY) and a specific port. (3) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to INADDR_ANY and a specific port. (4) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to a specific IP address (not INADDR_ANY) and a specific port. The last of these should really be quite a bit faster than the first of these, but I'd be interested in seeing specific measurements for each if that's possible!

Not sure if I understand networking well enough to set these up quickly. Does netrate use one of (3) or (4) now?

(3) and (4) are effectively the same thing, I think, since connect(2) should force the selection of a source IP address, but I think it's not a bad idea to confirm that. :-) The structure of the desired micro-benchmark here is basically: ...

I hacked netblast.c to do this:

% --- /usr/src/tools/tools/netrate/netblast/netblast.c	Fri Dec 16 17:02:44 2005
% +++ netblast.c	Mon Jul 14 21:26:52 2008
% @@ -44,9 +44,11 @@
%  {
%
% -	fprintf(stderr, "netblast [ip] [port] [payloadsize] [duration]\n");
% -	exit(-1);
% +	fprintf(stderr, "netblast ip port payloadsize duration bind connect\n");
% +	exit(1);
%  }
%
% +static int gconnected;
%  static int global_stop_flag;
% +static struct sockaddr_in *gsin;
%
%  static void
% @@ -116,6 +118,13 @@
%  			counter++;
%  		}
% -		if (send(s, packet, packet_len, 0) < 0)
% +		if (gconnected && send(s, packet, packet_len, 0) < 0) {
%  			send_errors++;
% +			usleep(1000);
% +		}
% +		if (!gconnected && sendto(s, packet, packet_len, 0,
% +		    (struct sockaddr *)gsin, sizeof(*gsin)) < 0) {
% +			send_errors++;
% +			usleep(1000);
% +		}
%  		send_calls++;
%  	}
% @@ -146,9 +155,10 @@
%  	struct sockaddr_in sin;
%  	char *dummy, *packet;
% -	int s;
% +	int bind_desired, connect_desired, s;
%
% -	if (argc != 5)
% +	if (argc != 7)
%  		usage();
%
% +	gsin = &sin;
%  	bzero(&sin, sizeof(sin));
%  	sin.sin_len = sizeof(sin);
% @@ -176,4 +186,7 @@
%  		usage();
%
% +	bind_desired = (strcmp(argv[5], "b") == 0);
% +	connect_desired = (strcmp(argv[6], "c") == 0);
% +
%  	packet = malloc(payloadsize);
%  	if (packet == NULL) {
% @@ -189,7 +202,19 @@
%  	}
%
% -	if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
% -		perror("connect");
% -		return (-1);
% +	if (bind_desired) {
% +		struct sockaddr_in osin;
% +
% +		osin = sin;
% +		if (inet_aton("0", &sin.sin_addr) == 0)
% +			perror("inet_aton(0)");
% +		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% +			err(-1, "bind");
% +		sin = osin;
% +	}
% +
% +	if (connect_desired) {
% +		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
% +			err(-1, "connect");
% +		gconnected = 1;
%  	}
%

This also fixes some bugs in usage() (bogus [] around non-optional args and bogus exit code) and adds a sleep after send failure. Without the sleep, netblast distorts the measurements by taking 100% CPU. This depends on kernel queues having enough buffering to not run dry during the sleep time (rounded up to a tick boundary). I use ifq_maxlen = DRIVER_TX_RING_CNT + imax(2 * tick / 4, 1) = 10512 for DRIVER = bge and HZ = 100. This is actually wrong now.
The magic 2 is to round up to a tick boundary and the magic 4 is for bge taking a minimum of 4 usec per packet on old hardware, but bge actually takes about 1.5 usec on the test hardware and I'd like it to take 0.66 usec. The queues rarely run dry in practice, but running dry just a few times for a few msec each would explain some anomalies. Old SGI ttcp uses a select timeout of 18 msec here. nttcp and netsend use more sophisticated methods that don't work unless HZ is too small. It's just impossible for a program to schedule its sleeps with a fine enough resolution to ensure waking up before the queue runs dry, unless HZ is too small or the queue is too large. select() for writing doesn't work for the queue part of socket i/o.

Results:

~5.2 sendto (1):  630 kpps   98% CPU  11 cm/p (cache misses/packet (min))
-cur sendto:      590 kpps  100% CPU  10 cm/p (July 8 -current)
(2): no significant difference - see below
~5.2 send (3):    620 kpps   75% CPU  9.5 cm/p
-cur send:        520 kpps   60% CPU  8
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi Brian I very much doubt that this is ceteris paribus. This is 384 random IPs - 384 random IP addresses with a flow lookup for each packet. Also, I've read through igb on Linux - it has a lot of optimizations that the FreeBSD driver lacks and I have yet to implement. Hey Kip, when will you push the optimization into FreeBSD? Cheers, Brian ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi Paul, Paul wrote: I tested Linux in bridge configuration with the same machine and it CPUed out at about 600kpps through the bridge.. 600kpps incoming or 600kpps incoming+ outgoing ? That's a bit low :/ Soft interrupt using all the cpu. Same opteron , 82571EB Pci express NIC. Tried SMP/ non-smp , load balanced irqs, etc.. Does hwpmc work out of the box (FreeBSD) with those CPUs? Good news is using iptables only adds a few percentage onto the CPU usage. But still, what's with that.. So far FreeBSD got the highest pps rating for forwarding. I haven't tried bridge mode. Ipfw probably takes a big hit in that too though. Looking for an 82575 to test.. P.S. It was a nice chat, but what we can expect from the future? Any plans, patches etc? Someone suggested to install 8-current and test with it as this is the fast way to have something included in FreeBSD. I can do this - I can install 8-current, patch it and put it under load and report results, but need patches :) I guess Paul is in the same situation .. -- Best Wishes, Stefan Lambrev ICQ# 24134177 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Good news is using iptables only adds a few percentage onto the CPU usage. But still, what's with that.. So far FreeBSD got the highest pps rating for forwarding. I haven't tried bridge mode. Ipfw probably takes a big hit in that too though. Looking for an 82575 to test.. P.S. It was a nice chat, but what we can expect from the future? Any plans, patches etc? Someone suggested to install 8-current and test with it as this is the fast way to have something included in FreeBSD. I can do this - I can install 8-current, patch it and put it under load and report results, but need patches :) I guess Paul is in the same situation ..

I'm in the same situation as well. Would anyone be interested in very specific work aimed at improving IP forwarding? I would happily put out a bounty for this, and I'm quite sure I'm not alone.

PS Paul: did you get around to testing C2D?

Kind regards,
Met vriendelijke groet / With kind regards,
Bart Van Kerckhove
http://friet.net/pgp.txt
There are 10 kinds of ppl; those who read binary and those who don't
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I tested Linux in bridge configuration with the same machine and it CPUed out at about 600kpps through the bridge.. That's a bit low :/ Soft interrupt using all the cpu. Same opteron , 82571EB Pci express NIC. Tried SMP/ non-smp , load balanced irqs, etc.. Good news is using iptables only adds a few percentage onto the CPU usage. But still, what's with that.. So far FreeBSD got the highest pps rating for forwarding. I haven't tried bridge mode. Ipfw probably takes a big hit in that too though. Looking for an 82575 to test.. Paul ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, Jul 7, 2008 at 6:07 PM, Mike Tancsa [EMAIL PROTECTED] wrote: At 02:44 PM 7/7/2008, Paul wrote: Also my 82571 NIC supports multiple received queues and multiple transmit queues so why hasn't anyone written the driver to support this? It's not a 10gb card and it still supports it and it's widely available and not too expensive either. The new 82575/6 chips support even more queues and the two port version will be out this month and the 4 port in october (PCI-E cards). Motherboards are already shipping with the 82576.. (82571 supports 2x/2x 575/6 support 4x/4x) Actually, do any of your NICs attach via the igb driver ? I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system. -Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, Jul 7, 2008 at 6:22 PM, Paul [EMAIL PROTECTED] wrote: I read through the IGB driver, and it says 82575/6 only... which is the new chip Intel is releasing on the cards this month 2 port and october 4 port, but the chips are on some of the motherboards right now. Why can't it also use the 82571 ? doesn't make any sense.. I haven't tried it but just browsing the driver source doesn't look like it will work. The igb driver has been written to remove a lot of the cruft that has accumulated to work around deficiencies in earlier 8257x hardware. Although it supports legacy descriptor handling it has a new mode of descriptor handling that is ostensibly better. I don't have access to the data sheets for pre-zoar hardware so I'm not sure what it would take to support multiple queues on that hardware. -Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Artem Belevich wrote: As was already mentioned, we can't avoid all cache misses as there's data that's recently been updated in memory via DMA and therefor kicked out of cache. However, we may hide some of the latency penalty by prefetching 'interesting' data early. I.e. we know that we want to access some ethernet headers, so we may start pulling relevant data into cache early. Ideally, by the time we need to access the field, it will already be in the cache. When we're counting nanoseconds per packet this may bring some performance gain. There were some patches floating around for if_em to do a prefetch of the first bit of packet data on packets before handing them up the stack. My understanding is that they moved the hot spot earlier, but didn't make a huge difference because it doesn't really take that long to get to the point where you're processing the IP header in our current stack (a downside to optimization...). However, that's a pretty anecdotal story, and a proper study of the effects of prefetching would be most welcome. One thing that I'd really like to see someone look at is whether, by doing a bit of appropriately timed prefetching, we can move cache misses out from under hot locks that don't really relate to the data being prefetched. Robert N M Watson Computer Laboratory University of Cambridge ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
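As a rough illustration of what such a driver-level prefetch looks like, here is a minimal C sketch. The structures and function names (rx_slot, rx_ring_drain, stack_input) are invented for the example and are not the if_em driver's actual code; the only point is starting the memory access one packet ahead of where it is consumed.

#include <stddef.h>
#include <stdint.h>

struct rx_slot {
	void	*buf;	/* DMA buffer holding the received frame */
	uint16_t len;	/* frame length written back by the NIC */
};

/* Stand-in for handing the frame to the stack (ether_input() in a real driver). */
static void
stack_input(void *frame, uint16_t len)
{
	(void)frame;
	(void)len;
}

void
rx_ring_drain(struct rx_slot *ring, int head, int count, int ring_size)
{
	int i;

	for (i = 0; i < count; i++) {
		/*
		 * Start pulling the headers of the *next* frame into the
		 * cache while the current frame is being processed, so the
		 * DRAM latency overlaps with useful work instead of being
		 * paid at the moment the IP header is first touched.
		 */
		if (i + 1 < count)
			__builtin_prefetch(ring[(head + i + 1) % ring_size].buf);

		stack_input(ring[(head + i) % ring_size].buf,
		    ring[(head + i) % ring_size].len);
	}
}

As noted above, this mainly moves the stall earlier rather than eliminating it, unless there is enough independent work in flight to overlap with the fetch.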
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi, Kip Macy wrote: On Mon, Jul 7, 2008 at 6:07 PM, Mike Tancsa [EMAIL PROTECTED] wrote: At 02:44 PM 7/7/2008, Paul wrote: Also my 82571 NIC supports multiple received queues and multiple transmit queues so why hasn't anyone written the driver to support this? It's not a 10gb card and it still supports it and it's widely available and not too expensive either. The new 82575/6 chips support even more queues and the two port version will be out this month and the 4 port in october (PCI-E cards). Motherboards are already shipping with the 82576.. (82571 supports 2x/2x 575/6 support 4x/4x) Actually, do any of your NICs attach via the igb driver ? I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system. Is this on 1gbps or on 10gbps NIC? -Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] -- Best Wishes, Stefan Lambrev ICQ# 24134177 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system. Is this on 1gbps or on 10gbps NIC? Hi Stefan, The hardware that igb supports is just the latest revision of the hardware supported by em, i.e. it is 1gbps. Cheers, Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Will someone confirm if it will support the 82571EB ? I don't see a reason why not as it's very similar hardware and it's available now in large quantities so making 82571 part of igb I think would be a good idea. Kip Macy wrote: I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system. Is this on 1gbps or on 10gbps NIC? Hi Stefan, The hardware that igb supports is just the latest revision of the hardware supported by em, i.e. it is 1gbps. Cheers, Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Erik Trulsson wrote: On Mon, Jul 07, 2008 at 10:30:53PM +1000, Bruce Evans wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote:

The theoretical maximum at 64byte frames is 1,488,100. I've looked up my notes; the 1.244Mpps number can be adjusted to 1.488Mpps.

Where is the extra? I still get 1.644736 Mpps (10^9/(8*64+96)). 1.488095 is for 64 bits extra (10^9/(8*64+96+64)).

A standard ethernet frame (on the wire) consists of:

      7 octets  preamble
      1 octet   Start Frame Delimiter
      6 octets  destination address
      6 octets  source address
      2 octets  length/type
46-1500 octets  data (+padding if needed)
      4 octets  Frame Check Sequence

Followed by (at least) 96 bits interFrameGap, before the next frame starts. For minimal packet size this gives a maximum packet rate at 1Gbit/s of 1e9/((7+1+6+6+2+46+4)*8+96) = 1488095 packets/second. You probably missed the preamble and start frame delimiter in your calculation.

Thanks. Yes, that was it.

Bruce
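For reference, the two rates being compared work out as follows; this small C program only reproduces the arithmetic from the frame layout above, assuming a 1 Gbit/s line rate and minimum-size frames.

#include <stdio.h>

int
main(void)
{
	const double link_bps = 1e9;
	/* preamble + SFD + dst + src + type + min payload + FCS, in octets */
	const int frame_octets = 7 + 1 + 6 + 6 + 2 + 46 + 4;	/* = 72 */
	const int ifg_bits = 96;				/* inter-frame gap */

	/* Counting preamble and SFD (what is actually on the wire): */
	printf("with preamble/SFD: %.0f pps\n",
	    link_bps / (frame_octets * 8 + ifg_bits));		/* 1488095 */

	/* Forgetting preamble and SFD (64-byte frame + IFG only): */
	printf("without:           %.0f pps\n",
	    link_bps / (64 * 8 + ifg_bits));			/* 1644736 */
	return (0);
}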
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi, Kip Macy wrote: On Mon, Jul 7, 2008 at 6:22 PM, Paul [EMAIL PROTECTED] wrote: I read through the IGB driver, and it says 82575/6 only... which is the new chip Intel is releasing on the cards this month 2 port and october 4 port, but the chips are on some of the motherboards right now. Why can't it also use the 82571 ? doesn't make any sense.. I haven't tried it but just browsing the driver source doesn't look like it will work. The igb driver has been written to remove a lot of the cruft that has accumulated to work around deficiencies in earlier 8257x hardware. Although it supports legacy descriptor handling it has a new mode of descriptor handling that is ostensibly better. I don't have access to the data sheets for pre-zoar hardware so I'm not sure what it would take to support multiple queues on that hardware. May be we should ask Jack Vogel? He will have some news probably. -Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] -- Best Wishes, Stefan Lambrev ICQ# 24134177 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On 7/8/08, Robert Watson [EMAIL PROTECTED] wrote: There were some patches floating around for if_em to do a prefetch of the first bit of packet data on packets before handing them up the stack. My I found Andre Oppermann's optimization patch mentioned in july 2005 status report: http://lists.freebsd.org/pipermail/freebsd-announce/2005-July/001012.html http://www.nrg4u.com/freebsd/tcp_reass+prefetch-20041216.patch Is that the patch you had in mind? In the report Andre says: Use [of prefetch] in both of these places show a very significant performance gain but not yet fully quantified. very significant bit looks promising. Unfortunately, it does not look like prefetch changes in the patch made it into official kernel. I wonder why. It should be easy enough to apply prefetch-related changes and see if/how it affects forwarding performance. --Artem ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
But this is probably with no routing table, and a single source and dst IP or a very limited number of IPs and ports. The entire problem with Linux is the route cache; try to generate random source IPs and random source/dst ports and it won't even do 100kpps without problems. I would like to log into the machine and see 1.4Mpps going through 3 NICs :)

Brian McGinty wrote: I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system. I have a 8 core system running stock Linux that easily does line rate (ie, 1.488 Mpps) on 3 (82575) interfaces. Ie, 3 * 1.48 Mpps! Cheers, Brian. -Kip
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Tue, Jul 8, 2008 at 1:46 PM, Brian McGinty [EMAIL PROTECTED] wrote: I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system. I have a 8 core system running stock Linux that easily does line rate (ie, 1.488 Mpps) on 3 (82575) interfaces. Ie, 3 * 1.48 Mpps! Hi Brian I very much doubt that this is ceteris paribus. This is 384 random IPs - 384 random IP addresses with a flow lookup for each packet. Also, I've read through igb on Linux - it has a lot of optimizations that the FreeBSD driver lacks and I have yet to implement. Thanks, Kip ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I have a pre-production card. With some bug fixes and some tuning of interrupt handling (custom stack - I've been asked to push the changes back in to CVS, I just don't have time right now) an otherwise unoptimized igb can forward 1.04Mpps from one port to another (1.04 Mpps in on igb0 and 1.04 Mpps out on igb1) using 3.5 cores on an 8 core system.

I have a 8 core system running stock Linux that easily does line rate (ie, 1.488 Mpps) on 3 (82575) interfaces. Ie, 3 * 1.48 Mpps!

Cheers,
Brian.

-Kip
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Robert Watson wrote: Experience suggests that forwarding workloads see significant lock contention in the routing and transmit queue code. The former needs some kernel hacking to address in order to improve parallelism for routing lookups. The latter is harder to address given the hardware you're using: modern 10gbps cards frequently offer multiple transmit queues that can be used independently (which our cxgb driver supports), but 1gbps cards generally don't. Actually the routing code is not contended. The workload in router is mostly serialized without much opportunity for contention. With many interfaces and any-to-any traffic patterns it may get some contention. The locking overhead per packet is always there and has some impact though. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Andre Oppermann wrote: Robert Watson wrote: Experience suggests that forwarding workloads see significant lock contention in the routing and transmit queue code. The former needs some kernel hacking to address in order to improve parallelism for routing lookups. The latter is harder to address given the hardware you're using: modern 10gbps cards frequently offer multiple transmit queues that can be used independently (which our cxgb driver supports), but 1gbps cards generally don't.

Actually the routing code is not contended. The workload in router is mostly serialized without much opportunity for contention. With many interfaces and any-to-any traffic patterns it may get some contention. The locking overhead per packet is always there and has some impact though.

Yes, I don't see any real sources of contention until we reach the output code, which will run in the input if_em taskqueue threads, as the input path generates little or no contention if the packets are not destined for local delivery. I was a little concerned about the mention of degrading performance as firewall complexity grows -- I suspect there's a nice project for someone to do looking at why this is the case. I was under the impression that, in 7.x and later, we use rwlocks to protect firewall state, and that unless stateful firewall rules are used, these are locked read-only rather than writable...

Robert N M Watson Computer Laboratory University of Cambridge
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Ingo Flaschberger wrote: Dear Paul, I tried all of this :/ still, 256/512 descriptors seem to work the best. Happy to let you log into the machine and fiddle around if you want :) yes, but I'm shure I will also not be able to achieve much more pps. As it seems that you hit hardware-software-level-barriers, my only idea is to test dragonfly bsd, which seems to have less software overhead. I tested DragonFly some time ago with an Agilent N2X tester and it was by far the slowest of the pack. I don't think you will be able to route 64byte packets at 1gbit wirespeed (2Mpps) with a current x86 platform. You have to take inter-frame gap and other overheads too. That gives about 1.244Mpps max on a 1GigE interface. In general the chipsets and buses are able to transfer quite a bit of data. On a dual-opteron 848 I was able to sink 2.5Mpps into the machine with ifconfig em[01] monitor without hitting the cpu ceiling. This means that the bus and interrupt handling is not where most of the time is spent. When I did my profiling the saturation point was the cache miss penalty for accessing the packet headers. At saturation point about 50% of the time was spent waiting for the memory to make its way into the CPU. I hoped to reach 1Mpps with the hardware I mentioned some mails before, but 2Mpps is far far away. Currently I get 160kpps via pci-32mbit-33mhz-1,2ghz mobile pentium. This is more or less expected. PCI32 is not able to sustain high packet rates. The bus setup times kill the speed. For larger packets the ratio gets much better and some reasonable throughput can be achieved. Perhaps you have some better luck at some different hardware systems (ppc, mips, ..?) or use freebsd only for routing-table-updates and special network-cards (netfpga) for real routing. NetFPGA doesn't have enough TCAM space to be useful for real routing (as in Internet sized routing table). The trick many embedded networking CPUs use is cache prefetching that is integrated with the network controller. The first 64-128bytes of every packet are transferred automatically into the L2 cache by the hardware. This allows relatively slow CPUs (700 MHz Broadcom BCM1250 in Cisco NPE-G1 or 1.67-GHz Freescale 7448 in NPE-G2) to get more than 1Mpps. Until something like this is possible on Intel or AMD x86 CPUs we have a ceiling limited by RAM speed. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
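A rough back-of-the-envelope for why PCI32 caps out so low, sketched in C. The ~133 MB/s figure is the theoretical 32-bit/33 MHz bus bandwidth; the descriptor size is an illustrative assumption, not a measurement of any particular NIC.

#include <stdio.h>

int
main(void)
{
	const double pci_bps = 33.33e6 * 4;	/* ~133 MB/s theoretical PCI32/33 */
	const double pkt = 64.0;		/* packet data moved per DMA */
	const double desc = 32.0;		/* assumed descriptor read + write-back */

	/* Upper bound if the bus moved nothing but packet data: */
	printf("data only:          %.0f kpps\n", pci_bps / pkt / 1e3);

	/* With assumed per-packet descriptor traffic: */
	printf("data + descriptors: %.0f kpps\n", pci_bps / (pkt + desc) / 1e3);

	/*
	 * A forwarded packet crosses the bus twice (RX and TX) when both
	 * NICs share the same PCI bus, halving the above, and every bus
	 * transaction also pays arbitration/address/latency cycles -- the
	 * "bus setup times" mentioned above -- so observed rates like
	 * 160 kpps sit well below these idealized bounds.
	 */
	return (0);
}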
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Robert Watson wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Robert Watson wrote: Experience suggests that forwarding workloads see significant lock contention in the routing and transmit queue code. The former needs some kernel hacking to address in order to improve parallelism for routing lookups. The latter is harder to address given the hardware you're using: modern 10gbps cards frequently offer multiple transmit queues that can be used independently (which our cxgb driver supports), but 1gbps cards generally don't.

Actually the routing code is not contended. The workload in router is mostly serialized without much opportunity for contention. With many interfaces and any-to-any traffic patterns it may get some contention. The locking overhead per packet is always there and has some impact though.

Yes, I don't see any real sources of contention until we reach the output code, which will run in the input if_em taskqueue threads, as the input path generates little or no contention if the packets are not destined for local delivery. I was a little concerned about mention of

The interface output was the second largest block after the cache misses IIRC. The output part seems to have received only moderate attention and detailed performance analysis compared to the interface input path. Most network drivers do a write to the hardware for every packet sent, in addition to other overhead that may be necessary for their transmit DMA rings. That adds significant overhead compared to the RX path, where those costs are amortized over a larger number of packets.

degrading performance as firewall complexity grows -- I suspect there's a nice project for someone to do looking at why this is the case. I was under the impression that, in 7.x and later, we use rwlocks to protect firewall state, and that unless stateful firewall rules are used, these are locked read-only rather than writable...

Just looking at the packet (twice) in ipfw or other firewall packages is a huge overhead. The main loop of ipfw is a very large block of code. Unless one implements compilation of firewall rules to native machine code there is not much that can be done. With LLVM we will see some very interesting opportunities in that area. Other than that, the ipfw instruction overhead per rule seems to be quite close to the optimum. I'm not saying one shouldn't take a close look with a profiler to verify this is actually the case.

-- Andre
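To make the "main loop is a large interpreter" point concrete, here is a toy C sketch of the general shape: each rule is a list of micro-instructions matched per packet. The opcodes and struct layouts are invented for the example and are not ipfw's real data structures; the point is only that every packet pays an opcode-dispatch loop, which is what rule compilation to native code would remove.

#include <stdint.h>
#include <stddef.h>

enum op { O_PROTO, O_DST_PORT, O_ACCEPT, O_DENY };

struct insn { enum op op; uint32_t arg; };
struct rule { const struct insn *cmd; int ninsn; const struct rule *next; };
struct pkt  { uint8_t proto; uint16_t dst_port; };

/* Returns 1 to accept, 0 to deny; walks the rules until one fully matches. */
int
fw_check(const struct rule *chain, const struct pkt *p)
{
	const struct rule *r;
	int i, match;

	for (r = chain; r != NULL; r = r->next) {
		match = 1;
		for (i = 0; i < r->ninsn && match; i++) {
			switch (r->cmd[i].op) {	/* per-packet dispatch cost */
			case O_PROTO:
				match = (p->proto == r->cmd[i].arg);
				break;
			case O_DST_PORT:
				match = (p->dst_port == r->cmd[i].arg);
				break;
			case O_ACCEPT:		/* reached only if all matches held */
				return (1);
			case O_DENY:
				return (0);
			}
		}
	}
	return (0);	/* default deny */
}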
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Paul, to get a systematic analysis of the performance please do the following tests and put them into a table for easy comparison:

1. inbound pps w/o loss with interface in monitor mode (ifconfig em0 monitor)
2. inbound pps w/ fastforward into a single blackhole route
3. inbound pps w/ fastforward into a single blackhole route w/ ipfw and just one allow all rule
4. inbound pps w/ fastforward into a single blackhole route w/ ipfw and just one deny all rule
5. inbound pps w/ fastforward into the disc(4) discard network interface
6. inbound pps w/ fastforward into the disc(4) discard network interface w/ ipfw and just one allow all rule

All surrounding parameters like RX/TX interface queue length, scheduler and so on may be varied but should be noted.

-- Andre

Paul wrote: UP 32 bit test vs 64 bit: negligible difference in forwarding performance without polling, slightly better polling performance but still errors at lower packet rates, same massive hit with ipfw loaded. Installing dragonfly in a bit.. If anyone has a really fast PPC type system or SUN or something i'd love to try it :) Something with a really big L1 cache :P

Paul wrote: ULE + PREEMPTION for non SMP: no major differences with SMP with ULE/4BSD and preemption ON/OFF. 32 bit UP test coming up with new cpu and I'm installing dragonfly sometime this weekend :] UP: 1mpps in one direction with no firewall/no routing table is not too bad, but 1mpps both directions is the goal here. 700kpps with full bgp table in one direction is not too bad. Ipfw needs a lot of work, barely gets 500kpps with no routing table with a few ipfw rules loaded.. that's horrible. Linux barely takes a hit when you start loading iptables rules, but then again linux has a HUGE problem with routing random packet sources/ports .. grr. My problem is I need some box to do fast routing and some to do firewall.. :/ I'll have a 32 bit 7-stable UP test with ipfw/routing table and then move on to dragonfly. I'll post the dragonfly results here as well as sign up for their mailing list.

Bart Van Kerckhove wrote: Paul / Ingo, I tried all of this :/ still, 256/512 descriptors seem to work the best. Happy to let you log into the machine and fiddle around if you want :) I've been watching this thread closely, since I'm in a very similar situation. A few questions/remarks: Does ULE provide better performance than 4BSD for forwarding? Did you try freebsd4 as well? This thread had a report about that quite opposite to my own experiences, -4 seemed to be a lot faster at forwarding than anything else I've tried so far. Obviously the thing I'm interested in is IMIX - and 64byte packets. Does anyone have any benchmarks for DragonFly? I asked around on IRC, but neither that nor google turned up any useful results. snip I don't think you will be able to route 64byte packets at 1gbit wirespeed (2Mpps) with a current x86 platform. Are there actual hardware related reasons this should not be possible, or is this purely lack of dedicated work towards this goal? snip There's a sun used at quagga dev as bgp-route-server. http://quagga.net/route-server.php (but they didn't answer my question regarding fw-performance).
the Quagga guys are running a sun T1000 (niagara 1) route server - I happen to have the machine in my racks, please let me know if you want to run some tests on it, I'm sure they won't mind ;-) It should also make a great testbed for SMP performance testing imho (and they're pretty cheap these days) Also, feel free to use me as a relay for your questions, they're not always very reachable. snap Perhaps you have some better luck at some different hardware systems (ppc, mips, ..?) or use freebsd only for routing-table-updates and special network-cards (netfpga) for real routing. The netfpga site seems more or less dead - is this project still alive? It does look like a very interesting idea, even though it's currently quite linux-centric (and according to docs doesn't have VLAN nor ip6 support, the former being quite a dealbreaker) Paul: I'm looking forward to the C2D 32bit benchmarks (maybe throw in a freebsd4 and/or dragonfly bench if you can..) - appreciate the lots of information you are providing us :)

Met vriendelijke groet / With kind regards,
Bart Van Kerckhove
http://friet.net/pgp.txt
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Paul wrote: SMP DISABLED on my Opteron 2212 (ULE, Preemption on) yields ~750kpps in em0 and out em1 (one direction). I am miffed why this yields more pps than a) with all 4 cpus running and b) 4 cpus with lagg load balanced over 3 incoming connections so 3 taskq threads.

SMP adds quite some overhead in the generic case and is currently not well suited for high performance packet forwarding. On SMP interrupts are delivered to one CPU but not necessarily the one that will later on handle the taskqueue to process the packets. That adds overhead. Ideally the interrupt for each network interface is bound to exactly one pre-determined CPU and the taskqueue is bound to the same CPU. That way the overhead for interrupt and taskqueue scheduling can be kept at a minimum. Most of the infrastructure to do this binding already exists in the kernel but is not yet exposed to the outside for us to make use of it. I'm also not sure if the ULE scheduler skips the more global locks when the interrupt and the thread are on the same CPU.

Distributing the interrupts and taskqueues among the available CPUs gives concurrent forwarding with bi- or multi-directional traffic. All incoming traffic from any particular interface is still serialized though.

-- Andre

I would be willing to set up test equipment (several servers plugged into a switch) with ipkvm and power port access if someone or a group of people want to figure out ways to improve the routing process, ipfw, and lagg. Maximum PPS with one ipfw rule on UP: tops out about 570Kpps.. almost 200kpps lower ? (frown) I'm going to drop in a 3ghz opteron instead of the 2ghz 2212 that's in here and see how that scales, using UP same kernel etc I have now.

Julian Elischer wrote: Paul wrote: ULE without PREEMPTION is now yielding better results.

            input          (em0)           output
   packets  errs      bytes    packets  errs      bytes colls
    571595 40639   34564108          1     0        226     0
    577892 48865   34941908          1     0        178     0
    545240 84744   32966404          1     0        178     0
    587661 44691   35534512          1     0        178     0
    587839 38073   35544904          1     0        178     0
    587787 43556   35540360          1     0        178     0
    540786 39492   32712746          1     0        178     0
    572071 55797   34595650          1     0        178     0

*OUCH, IPFW HURTS.. loading ipfw, and adding one ipfw rule allow ip from any to any drops 100Kpps off :/ what's up with THAT? unloaded ipfw module and back 100kpps more again, that's not right with ONE rule.. :/

ipfw needs to gain a lock on the firewall before running, and is quite complex.. I can believe it.. in FreeBSD 4.8 I was able to use ipfw and filter 1Gb between two interfaces (bridged) but I think it has slowed down since then due to the SMP locking.

em0 taskq is still jumping cpus.. is there any way to lock it to one cpu or is this just a function of ULE? running a tar czpvf all.tgz * and seeing if pps changes.. negligible.. guess the scheduler is doing its job at least.. Hmm. even when it's getting 50-60k errors per second on the interface I can still SCP a file through that interface although it's not fast.. 3-4MB/s.. You know, I wouldn't care if it added 5ms latency to the packets when it was doing 1mpps as long as it didn't drop any.. Why can't it do that? Queue them up and do them in big chunks so none are dropped... hmm? 32 bit system is compiling now.. won't do 400kpps with GENERIC kernel, as with 64 bit did 450k with GENERIC, although that could be the difference between opteron 270 and opteron 2212..
Paul
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Andre Oppermann wrote: Distributing the interrupts and taskqueues among the available CPUs gives concurrent forwarding with bi- or multi-directional traffic. All incoming traffic from any particular interface is still serialized though. ... although not on multiple input queue-enabled hardware and drivers. While I've really only focused on local traffic performance with my 10gbps Chelsio setup, it should be possible to do packet forwarding from multiple input queues using that hardware and driver today. I'll update the netisr2 patches, which allow work to be pushed to multiple CPUs from a single input queue. However, these necessarily take a cache miss or two on packet header data in order to break out the packets from the input queue into flows that can be processed independently without ordering constraints, so if those cache misses on header data are a big part of the performance of a configuration, load balancing in this manner may not help. What would be neat is if the cards without multiple input queues could still tag receive descriptors with a flow identifier generated from the IP/TCP/etc layers that could be used for work placement. Robert N M Watson Computer Laboratory University of Cambridge -- Andre I would be willing to set up test equipment (several servers plugged into a switch) with ipkvm and power port access if someone or a group of people want to figure out ways to improve the routing process, ipfw, and lagg. Maximum PPS with one ipfw rule on UP: tops out about 570Kpps.. almost 200kpps lower ? (frown) I'm going to drop in a 3ghz opteron instead of the 2ghz 2212 that's in here and see how that scales, using UP same kernel etc I have now. Julian Elischer wrote: Paul wrote: ULE without PREEMPTION is now yeilding better results. input (em0) output packets errs bytespackets errs bytes colls 571595 40639 34564108 1 0226 0 577892 48865 34941908 1 0178 0 545240 84744 32966404 1 0178 0 587661 44691 35534512 1 0178 0 587839 38073 35544904 1 0178 0 587787 43556 35540360 1 0178 0 540786 39492 32712746 1 0178 0 572071 55797 34595650 1 0178 0 *OUCH, IPFW HURTS.. loading ipfw, and adding one ipfw rule allow ip from any to any drops 100Kpps off :/ what's up with THAT? unloaded ipfw module and back 100kpps more again, that's not right with ONE rule.. :/ ipfw need sto gain a lock on hte firewall before running, and is quite complex.. I can believe it.. in FreeBSD 4.8 I was able to use ipfw and filter 1Gb between two interfaces (bridged) but I think it has slowed down since then due to the SMP locking. em0 taskq is still jumping cpus.. is there any way to lock it to one cpu or is this just a function of ULE running a tar czpvf all.tgz * and seeing if pps changes.. negligible.. guess scheduler is doing it's job at least.. Hmm. even when it's getting 50-60k errors per second on the interface I can still SCP a file through that interface although it's not fast.. 3-4MB/s.. You know, I wouldn't care if it added 5ms latency to the packets when it was doing 1mpps as long as it didn't drop any.. Why can't it do that? Queue them up and do them in bi chunks so none are droppedhmm? 32 bit system is compiling now.. won't do 400kpps with GENERIC kernel, as with 64 bit did 450k with GENERIC, although that could be the difference between opteron 270 and opteron 2212.. 
Paul
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Robert Watson wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Distributing the interrupts and taskqueues among the available CPUs gives concurrent forwarding with bi- or multi-directional traffic. All incoming traffic from any particular interface is still serialized though. ... although not on multiple input queue-enabled hardware and drivers. While I've really only focused on local traffic performance with my 10gbps Chelsio setup, it should be possible to do packet forwarding from multiple input queues using that hardware and driver today. I'll update the netisr2 patches, which allow work to be pushed to multiple CPUs from a single input queue. However, these necessarily take a cache miss or two on packet header data in order to break out the packets from the input queue into flows that can be processed independently without ordering constraints, so if those cache misses on header data are a big part of the performance of a configuration, load balancing in this manner may not help. What would be neat is if the cards without multiple input queues could still tag receive descriptors with a flow identifier generated from the IP/TCP/etc layers that could be used for work placement. The cache miss is really the elephant in the room. If the network card supports multiple RX rings with separate interrupts and a stable hash based (that includes IP+Port src+dst) distribution they can be bound to different CPUs. It is very important to maintain the packet order for flows that go through the router. Otherwise TCP and VoIP will suffer. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
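A minimal C sketch of the kind of stable flow hashing being described. The hash below is a simple illustrative mix over the 4-tuple, not the Toeplitz hash real multi-queue NICs use; the only point is that all packets of one flow always land on the same queue, so per-flow ordering is preserved while different flows spread across CPUs.

#include <stdint.h>
#include <stdio.h>

struct flow {
	uint32_t src_ip, dst_ip;	/* network-order addresses */
	uint16_t src_port, dst_port;
};

/* Simple illustrative mix over the 4-tuple; stable for a given flow. */
static uint32_t
flow_hash(const struct flow *f)
{
	uint32_t h;

	h = f->src_ip ^ f->dst_ip ^
	    ((uint32_t)f->src_port << 16 | f->dst_port);
	h ^= h >> 16;		/* final avalanche so nearby */
	h *= 0x45d9f3bU;	/* addresses still spread over queues */
	h ^= h >> 16;
	return (h);
}

static int
rx_queue_for(const struct flow *f, int nqueues)
{
	return (flow_hash(f) % nqueues);
}

int
main(void)
{
	struct flow f = { 0x0a000001, 0x0a000002, 12345, 80 };

	/* Every packet of this flow maps to the same queue. */
	printf("queue %d of 8\n", rx_queue_for(&f, 8));
	return (0);
}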
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bruce Evans wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Ingo Flaschberger wrote: I don't think you will be able to route 64byte packets at 1gbit wirespeed (2Mpps) with a current x86 platform. You have to take inter-frame gap and other overheads too. That gives about 1.244Mpps max on a 1GigE interface. What are the other overheads? I calculate 1.644Mpps counting the inter-frame gap, with 64-byte packets and 64-header_size payloads. If the 64 bytes is for the payload, then the max is much lower. The theoretical maximum at 64byte frames is 1,488,100. I've looked up my notes the 1.244Mpps number can be ajusted to 1.488Mpps. I hoped to reach 1Mpps with the hardware I mentioned some mails before, but 2Mpps is far far away. Currently I get 160kpps via pci-32mbit-33mhz-1,2ghz mobile pentium. This is more or less expected. PCI32 is not able to sustain high packet rates. The bus setup times kill the speed. For larger packets the ratio gets much better and some reasonable throughput can be achieved. I get about 640 kpps without forwarding (sendto: slightly faster; recvfrom: slightly slower) on a 2.2GHz A64. Underclocking the memory from 200MHz to 100MHz only reduces the speed by about 10%, while not overclocking the CPU by 10% reduces the speed by the same 10%, so the system is apparently still mainly CPU-bound. On [EMAIL PROTECTED] He's using a 1.2GHz Mobile Pentium on top of that. NetFPGA doesn't have enough TCAM space to be useful for real routing (as in Internet sized routing table). The trick many embedded networking CPUs use is cache prefetching that is integrated with the network controller. The first 64-128bytes of every packet are transferred automatically into the L2 cache by the hardware. This allows relatively slow CPUs (700 MHz Broadcom BCM1250 in Cisco NPE-G1 or 1.67-GHz Freescale 7448 in NPE-G2) to get more than 1Mpps. Until something like this is possible on Intel or AMD x86 CPUs we have a ceiling limited by RAM speed. Does using fa$ter memory (speed and/or latency) help here? 64 bytes is so small that latency may be more of a problem, especially without a prefetch. Latency. For IPv4 packet forwarding only one cache line per packet is fetched. More memory speed only helps with the DMA from/to the network card. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Andre Oppermann wrote: Robert Watson wrote: Experience suggests that forwarding workloads see significant lock contention in the routing and transmit queue code. The former needs some kernel hacking to address in order to improve parallelism for routing lookups. The latter is harder to address given the hardware you're using: modern 10gbps cards frequently offer multiple transmit queues that can be used independently (which our cxgb driver supports), but 1gbps cards generally don't. Actually the routing code is not contended. The workload in router is mostly serialized without much opportunity for contention. With many interfaces and any-to-any traffic patterns it may get some contention. The locking overhead per packet is always there and has some impact though. Actually contention from route locking is a major bottleneck even on packet generation from multiple CPUs on a single host. It is becoming increasingly necessary that someone look into fixing this. Kris ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Andre Oppermann wrote: Bruce Evans wrote: What are the other overheads? I calculate 1.644Mpps counting the inter-frame gap, with 64-byte packets and 64-header_size payloads. If the 64 bytes is for the payload, then the max is much lower. The theoretical maximum at 64byte frames is 1,488,100. I've looked up my notes the 1.244Mpps number can be ajusted to 1.488Mpps. Where is the extra? I still get 1.644736 Mpps (10^9/(8*64+96)). 1.488095 is for 64 bits extra (10^9/(8*64+96+64)). I hoped to reach 1Mpps with the hardware I mentioned some mails before, but 2Mpps is far far away. Currently I get 160kpps via pci-32mbit-33mhz-1,2ghz mobile pentium. This is more or less expected. PCI32 is not able to sustain high packet rates. The bus setup times kill the speed. For larger packets the ratio gets much better and some reasonable throughput can be achieved. I get about 640 kpps without forwarding (sendto: slightly faster; recvfrom: slightly slower) on a 2.2GHz A64. Underclocking the memory from 200MHz to 100MHz only reduces the speed by about 10%, while not overclocking the CPU by 10% reduces the speed by the same 10%, so the system is apparently still mainly CPU-bound. On [EMAIL PROTECTED] He's using a 1.2GHz Mobile Pentium on top of that. Yes. My example shows that FreeBSD is more CPU-bound than I/O bound up to CPUs considerably faster than a 1.2GHz Pentium (though PentiumM is fast relative to its clock speed). The memory interface may matter more than the CPU clock. NetFPGA doesn't have enough TCAM space to be useful for real routing (as in Internet sized routing table). The trick many embedded networking CPUs use is cache prefetching that is integrated with the network controller. The first 64-128bytes of every packet are transferred automatically into the L2 cache by the hardware. This allows relatively slow CPUs (700 MHz Broadcom BCM1250 in Cisco NPE-G1 or 1.67-GHz Freescale 7448 in NPE-G2) to get more than 1Mpps. Until something like this is possible on Intel or AMD x86 CPUs we have a ceiling limited by RAM speed. Does using fa$ter memory (speed and/or latency) help here? 64 bytes is so small that latency may be more of a problem, especially without a prefetch. Latency. For IPv4 packet forwarding only one cache line per packet is fetched. More memory speed only helps with the DMA from/to the network card. I use low-end memory, but on the machine that does 640 kpps it somehow has latency almost 4 times as low as on new FreeBSD cluster machines (~42 nsec instead of ~150). perfmon (fixed for AXP and A64) and hwpmc report an average of 11 k8-dc-misses per sendto() while sending via bge at 640 kpps. 11 * 42 accounts for 442 nsec out of the 1562 per packet at this rate. 11 * 150 = 1650 would probably make this rate unachievable despite the system having 20 times as much CPU and bus. Bruce ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
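To put those numbers side by side, here is the same arithmetic as a small C snippet; the 640 kpps, 11 misses/packet and the 42 ns / ~150 ns latencies are the figures quoted above, nothing else is assumed.

#include <stdio.h>

static void
budget(double kpps, double misses_per_pkt, double miss_ns)
{
	double pkt_ns = 1e9 / (kpps * 1e3);		/* time budget per packet */
	double stall_ns = misses_per_pkt * miss_ns;	/* time stalled on DRAM */

	printf("%4.0f kpps: %7.1f ns/packet, %7.1f ns in misses (%.0f%% of budget)\n",
	    kpps, pkt_ns, stall_ns, 100.0 * stall_ns / pkt_ns);
}

int
main(void)
{
	budget(640, 11, 42);	/* the low-latency test box described above */
	budget(640, 11, 150);	/* same workload at ~150 ns memory latency */
	return (0);
}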
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Bruce Evans wrote: I use low-end memory, but on the machine that does 640 kpps it somehow has latency almost 4 times as low as on new FreeBSD cluster machines (~42 nsec instead of ~150). perfmon (fixed for AXP and A64) and hwpmc report an average of 11 k8-dc-misses per sendto() while sending via bge at 640 kpps. 11 * 42 accounts for 442 nsec out of the 1562 per packet at this rate. 11 * 150 = 1650 would probably make this rate unachievable despite the system having 20 times as much CPU and bus. Since you're doing fine-grained performance measurements of a code path that interests me a lot, could you compare the cost per-send on UDP for the following four cases: (1) sendto() to a specific address and port on a socket that has been bound to INADDR_ANY and a specific port. (2) sendto() on a specific address and port on a socket that has been bound to a specific IP address (not INADDR_ANY) and a specific port. (3) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to INADDR_ANY and a specific port. (4) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to a specific IP address (not INADDR_ANY) and a specific port. The last of these should really be quite a bit faster than the first of these, but I'd be interested in seeing specific measurements for each if that's possible! Thanks, Robert N M Watson Computer Laboratory University of Cambridge ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
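For anyone wanting to reproduce these, a minimal sketch of the four setups in plain socket calls; make_socket() and its arguments are invented for the example, error handling is omitted, and the addresses are placeholders.

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

static int
make_socket(const char *local_ip, const char *remote_ip, uint16_t port,
    int do_connect)
{
	struct sockaddr_in sin;
	int s;

	s = socket(PF_INET, SOCK_DGRAM, 0);

	/* Cases (1)/(3): local_ip == NULL binds to INADDR_ANY.
	 * Cases (2)/(4): bind to a specific local IP address. */
	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);
	sin.sin_addr.s_addr = (local_ip != NULL) ?
	    inet_addr(local_ip) : htonl(INADDR_ANY);
	bind(s, (struct sockaddr *)&sin, sizeof(sin));

	/* Cases (3)/(4): fix the destination once with connect(), so the
	 * benchmark loop can use send() instead of sendto(). */
	if (do_connect) {
		memset(&sin, 0, sizeof(sin));
		sin.sin_family = AF_INET;
		sin.sin_port = htons(port);
		sin.sin_addr.s_addr = inet_addr(remote_ip);
		connect(s, (struct sockaddr *)&sin, sizeof(sin));
	}
	return (s);
}

Case (1) would be make_socket(NULL, dst, port, 0) followed by sendto() per packet, case (2) the same with a specific local address, and cases (3)/(4) pass do_connect and then use send() in the loop, which is why they are expected to be cheaper per call.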
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Robert Watson wrote: The last of these should really be quite a bit faster than the first of these, but I'd be interested in seeing specific measurements for each if that's possible! And, if you're feeling particularly subject to suggestion, you might consider comparing 7.0 and recent 8.x along the same dimensions :-). Robert N M Watson Computer Laboratory University of Cambridge ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Robert Watson wrote: Since you're doing fine-grained performance measurements of a code path that interests me a lot, could you compare the cost per-send on UDP for the following four cases: (1) sendto() to a specific address and port on a socket that has been bound to INADDR_ANY and a specific port. (2) sendto() on a specific address and port on a socket that has been bound to a specific IP address (not INADDR_ANY) and a specific port. (3) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to INADDR_ANY and a specific port. (4) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to a specific IP address (not INADDR_ANY) and a specific port. The last of these should really be quite a bit faster than the first of these, but I'd be interested in seeing specific measurements for each if that's possible! Not sure if I understand networking well enough to set these up quickly. Does netrate use one of (3) or (4) now? I can tell you vaguely about old results for netrate (send()) vs ttcp (sendto()). send() is lighter weight of course, and this made a difference of 10-20%, but after further tuning the difference became smaller, which suggests that everything ends up waiting for something in common. Now I can measure cache misses better and hope that a simple count of cache misses will be a more reproducible indicator of significant bottlenecks than pps. I got nowhere trying to reduce instruction counts, possibly because it would take avoiding 100's of instructions to get the same benefit as avoiding a single cache miss. Bruce ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
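For anyone wanting to reproduce the cache-miss-per-send figures quoted in this thread, hwpmc(4) plus pmcstat(8) can count an event for just the benchmark process; an illustrative (untested) invocation using the event name from above and the netrate netblast tool might look like this -- the target address, port, payload size and duration are only example arguments:

pmcstat -p k8-dc-misses ./netblast 10.0.0.2 5001 18 10

Dividing the reported miss count by the number of packets actually sent gives the cm/p figure used elsewhere in this thread.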
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, Jul 07, 2008 at 10:30:53PM +1000, Bruce Evans wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Bruce Evans wrote: What are the other overheads? I calculate 1.644Mpps counting the inter-frame gap, with 64-byte packets and 64-header_size payloads. If the 64 bytes is for the payload, then the max is much lower. The theoretical maximum at 64byte frames is 1,488,100. I've looked up my notes the 1.244Mpps number can be ajusted to 1.488Mpps. Where is the extra? I still get 1.644736 Mpps (10^9/(8*64+96)). 1.488095 is for 64 bits extra (10^9/(8*64+96+64)). A standard ethernet frame (on the wire) consists of:
7 octets preamble
1 octet Start Frame Delimiter
6 octets destination address
6 octets source address
2 octets length/type
46-1500 octets data (+padding if needed)
4 octets Frame Check Sequence
Followed by (at least) 96 bits interFrameGap, before the next frame starts. For minimal packet size this gives a maximum packet rate at 1Gbit/s of 1e9/((7+1+6+6+2+46+4)*8+96) = 1488095 packets/second. You probably missed the preamble and start frame delimiter in your calculation. -- Insert your favourite quote here. Erik Trulsson [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
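To make the arithmetic easy to re-check for other frame sizes, here is a small stand-alone calculator based on the framing breakdown above; the list of frame sizes is just a set of common examples.

/* Theoretical maximum Ethernet packet rates at 1 Gbit/s, using the
 * framing described above: preamble + SFD = 8 octets on top of the
 * frame (which already includes the FCS), plus 96 bit times of
 * inter-frame gap. */
#include <stdio.h>

int
main(void)
{
	const double rate = 1e9;		/* link speed in bits/s */
	const int sizes[] = { 64, 128, 256, 512, 1024, 1518 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		double bits = (sizes[i] + 8) * 8 + 96;	/* on-wire bits per frame */
		printf("%5d-byte frames: %9.0f pps\n", sizes[i], rate / bits);
	}
	return (0);
}

For 64-byte frames this prints 1488095, matching the figure above; for 1518-byte frames it gives about 81274 pps.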
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bruce Evans wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Bruce Evans wrote: What are the other overheads? I calculate 1.644Mpps counting the inter-frame gap, with 64-byte packets and 64-header_size payloads. If the 64 bytes is for the payload, then the max is much lower. The theoretical maximum at 64byte frames is 1,488,100. I've looked up my notes the 1.244Mpps number can be ajusted to 1.488Mpps. Where is the extra? I still get 1.644736 Mpps (10^9/(8*64+96)). 1.488095 is for 64 bits extra (10^9/(8*64+96+64)). The preamble has 64 bits and is in addition to the inter-frame gap. I hoped to reach 1Mpps with the hardware I mentioned some mails before, but 2Mpps is far far away. Currently I get 160kpps via pci-32mbit-33mhz-1,2ghz mobile pentium. This is more or less expected. PCI32 is not able to sustain high packet rates. The bus setup times kill the speed. For larger packets the ratio gets much better and some reasonable throughput can be achieved. I get about 640 kpps without forwarding (sendto: slightly faster; recvfrom: slightly slower) on a 2.2GHz A64. Underclocking the memory from 200MHz to 100MHz only reduces the speed by about 10%, while not overclocking the CPU by 10% reduces the speed by the same 10%, so the system is apparently still mainly CPU-bound. On [EMAIL PROTECTED] He's using a 1.2GHz Mobile Pentium on top of that. Yes. My example shows that FreeBSD is more CPU-bound than I/O bound up to CPUs considerably faster than a 1.2GHz Pentium (though PentiumM is fast relative to its clock speed). The memory interface may matter more than the CPU clock. NetFPGA doesn't have enough TCAM space to be useful for real routing (as in Internet sized routing table). The trick many embedded networking CPUs use is cache prefetching that is integrated with the network controller. The first 64-128bytes of every packet are transferred automatically into the L2 cache by the hardware. This allows relatively slow CPUs (700 MHz Broadcom BCM1250 in Cisco NPE-G1 or 1.67-GHz Freescale 7448 in NPE-G2) to get more than 1Mpps. Until something like this is possible on Intel or AMD x86 CPUs we have a ceiling limited by RAM speed. Does using fa$ter memory (speed and/or latency) help here? 64 bytes is so small that latency may be more of a problem, especially without a prefetch. Latency. For IPv4 packet forwarding only one cache line per packet is fetched. More memory speed only helps with the DMA from/to the network card. I use low-end memory, but on the machine that does 640 kpps it somehow has latency almost 4 times as low as on new FreeBSD cluster machines (~42 nsec instead of ~150). perfmon (fixed for AXP and A64) and hwpmc report an average of 11 k8-dc-misses per sendto() while sending via bge at 640 kpps. 11 * 42 accounts for 442 nsec out of the 1562 per packet at this rate. 11 * 150 = 1650 would probably make this rate unachievable despite the system having 20 times as much CPU and bus. We were talking routing here. That is a packet received via network interface and sent out on another. Crosses the PCI bus twice. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Bruce Evans wrote: (1) sendto() to a specific address and port on a socket that has been bound to INADDR_ANY and a specific port. (2) sendto() on a specific address and port on a socket that has been bound to a specific IP address (not INADDR_ANY) and a specific port. (3) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to INADDR_ANY and a specific port. (4) send() on a socket that has been connect()'d to a specific IP address and a specific port, and bound to a specific IP address (not INADDR_ANY) and a specific port. The last of these should really be quite a bit faster than the first of these, but I'd be interested in seeing specific measurements for each if that's possible! Not sure if I understand networking well enough to set these up quickly. Does netrate use one of (3) or (4) now? (3) and (4) are effectively the same thing, I think, since connect(2) should force the selection of a source IP address, but I think it's not a bad idea to confirm that. :-) The structure of the desired micro-benchmark here is basically:

int
main(int argc, char *argv[])
{
	struct sockaddr_in sin;

	/* Parse command line arguments such as addresses and ports. */
	if (bind_desired) {
		/* Set up local sockaddr_in. */
		if (bind(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
			err(-1, "bind");
	}
	/* Set up destination sockaddr_in. */
	if (connect_desired) {
		if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0)
			err(-1, "connect");
	}
	while (appropriate_condition) {
		if (connect_desired) {
			if (send(s, ...) < 0)
				errors++;
		} else {
			if (sendto(s, ..., (struct sockaddr *)&sin, sizeof(sin)) < 0)
				errors++;
		}
	}
}

I can tell you vaguely about old results for netrate (send()) vs ttcp (sendto()). send() is lighter weight of course, and this made a difference of 10-20%, but after further tuning the difference became smaller, which suggests that everything ends up waiting for something in common. Now I can measure cache misses better and hope that a simple count of cache misses will be a more reproducible indicator of significant bottlenecks than pps. I got nowhere trying to reduce instruction counts, possibly because it would take avoiding 100's of instructions to get the same benefit as avoiding a single cache miss. If you look at the design of the higher performance UDP applications, they will generally bind a specific IP (perhaps every IP on the host with its own socket), and if they do sustained communication to a specific endpoint they will use connect(2) rather than providing an address for each send(2) system call to the kernel. udp_output(2) makes the trade-offs there fairly clear: with the most recent rev, the optimal case is once connect(2) has been called, allowing a single inpcb read lock and no global data structure access, vs. an application calling sendto(2) for each system call and the local binding remaining INADDR_ANY. Middle ground applications, such as named(8), will force a local binding using bind(2), but then still have to pass an address to each sendto(2). In the future, this case will be further optimized in our code by using a global read lock rather than a global write lock: we have to check for collisions, but we don't actually have to reserve the new 4-tuple for the UDP socket as it's an ephemeral association rather than a connect(2). Robert N M Watson Computer Laboratory University of Cambridge ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Andre Oppermann wrote: Paul, to get a systematic analysis of the performance please do the following tests and put them into a table for easy comparison: 1. inbound pps w/o loss with interface in monitor mode (ifconfig em0 monitor) ... I won't be running many of these tests, but found this one interesting -- I didn't know about monitor mode. It gives the following behaviour:
-monitor ttcp receiving on bge0 at 397 kpps: 35% idle (8.0-CURRENT) 13.6 cm/p
monitor ttcp receiving on bge0 at 397 kpps: 83% idle (8.0-CURRENT) 5.8 cm/p
-monitor ttcp receiving on em0 at 580 kpps: 5% idle (~5.2) 12.5 cm/p
monitor ttcp receiving on em0 at 580 kpps: 65% idle (~5.2) 4.8 cm/p
cm/p = k8-dc-misses (bge0 system)
cm/p = k7-dc-misses (em0 system)
So it seems that the major overheads are not near the driver (as I already knew), and upper layers are responsible for most of the cache misses. The packet header is accessed even in monitor mode, so I think most of the cache misses in upper layers are not related to the packet header. Maybe they are due mainly to perfect non-locality for mbufs. Other cm/p numbers:
ttcp sending on bge0 at 640 kpps: (~5.2) 11 cm/p
ttcp sending on bge0 at 580 kpps: (8.0-CURRENT) 9 cm/p (-current is 10% slower despite having lower cm/p. This seems to be due to extra instructions executed)
ping -fq -c100 localhost at 171 kpps: (8.0-CURRENT) 12-33 cm/p (This is certainly CPU-bound. lo0 is much slower than bge0. Latency (rtt) is 2 us. It is 3 us in ~5.2 and was 4 in -current until very recently.)
ping -fq -c100 etherhost at 40 kpps: (8.0-CURRENT) 55 cm/p (The rate is quite low because flood ping doesn't actually flood. It tries to limit the rate to max(100, 1/latency), but it tends to go at a rate of ql(t)/latency where ql(t) is the average hardware queue length at the current time t. ql(t) starts at 1 and builds up after a minute or 2 to a maximum of about 10 on my hardware. Latency is always ~100 us, so the average ql(t) must have been ~4.)
ping -fq -c100 etherhost at 20 kpps: (8.0-CURRENT) 45 cm/p (Another run to record the average latency (it was 121) showed high variance.)
netblast sending on bge0 at 582 kpps: (8.0-CURRENT) 9.8 cm/p (Packet blasting benchmarks actually flood, unlike flood ping. This is hard to implement, since select() for output-ready doesn't work. netblast has to busy wait, while ttcp guesses how long to sleep but cannot sleep for a short enough interval unless queues are too large or hz is too small. My systems are configured with HZ = 100 and snd.ifq too large so that sleeping for 1/Hz works for ttcp. netblast still busy-waits. This gives an interesting difference for netblast. It tries to send 800 k packets in 1 second but only successfully sends 582 k. 9.8 cm/p is for #misses / 582k. The 300k unsuccessful sends apparently don't cause many cache misses. But variance is high...)
ttcp sending on bge0 at 577 kpps: (8.0-CURRENT) 15.5 cm/p (Another run shows high variance.)
ttcp rates have low variance for a given kernel but high variance for different kernels (an extra unrelated byte in the text section can cause a 30% change). High variance would also be explained by non-locality of mbufs. Cycling through lots of mbufs would maximize cache misses but random reuse of mbufs would give variance. Or the cycling and variance might be more in general allocation. There is silliness in getsockaddr(): sendit() calls getsockaddr() and getsockaddr() always uses malloc(), but allocation on the stack works for the call from sendit(). 
This malloc() seemed to be responsible for a cache miss or two, but when I changed it to use the stack the results were inconclusive. Bruce ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bruce Evans wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Paul, to get a systematic analysis of the performance please do the following tests and put them into a table for easy comparison: 1. inbound pps w/o loss with interface in monitor mode (ifconfig em0 monitor) ... I won't be running many of these tests, but found this one interesting -- I didn't know about monitor mode. It gives the following behaviour: -monitor ttcp receiving on bge0 at 397 kpps: 35% idle (8.0-CURRENT) 13.6 cm/p monitor ttcp receiving on bge0 at 397 kpps: 83% idle (8.0-CURRENT) 5.8 cm/p -monitor ttcp receiving on em0 at 580 kpps: 5% idle (~5.2)12.5 cm/p monitor ttcp receiving on em0 at 580 kpps: 65% idle (~5.2) 4.8 cm/p cm/p = k8-dc-misses (bge0 system) cm/p = k7-dc-misses (em0 system) So it seems that the major overheads are not near the driver (as I already knew), and upper layers are responsible for most of the cache misses. The packet header is accessed even in monitor mode, so I think most of the cache misses in upper layers are not related to the packet header. Maybe they are due mainly to perfect non-locality for mbufs. Monitor mode doesn't access the payload packet header. It only looks at the mbuf (which has a structure called mbuf packet header). The mbuf header it hot in the cache because the driver just touched it and filled in the information. The packet content (the payload) is cold and just arrived via DMA in DRAM. -- Andre ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, 7 Jul 2008, Andre Oppermann wrote: Bruce Evans wrote: So it seems that the major overheads are not near the driver (as I already knew), and upper layers are responsible for most of the cache misses. The packet header is accessed even in monitor mode, so I think most of the cache misses in upper layers are not related to the packet header. Maybe they are due mainly to perfect non-locality for mbufs. Monitor mode doesn't access the payload packet header. It only looks at the mbuf (which has a structure called mbuf packet header). The mbuf header is hot in the cache because the driver just touched it and filled in the information. The packet content (the payload) is cold and just arrived via DMA in DRAM. Why does it use ntohs() then? :-). From if_ethersubr.c:

% static void
% ether_input(struct ifnet *ifp, struct mbuf *m)
% {
% 	struct ether_header *eh;
% 	u_short etype;
%
% 	if ((ifp->if_flags & IFF_UP) == 0) {
% 		m_freem(m);
% 		return;
% 	}
% #ifdef DIAGNOSTIC
% 	if ((ifp->if_drv_flags & IFF_DRV_RUNNING) == 0) {
% 		if_printf(ifp, "discard frame at !IFF_DRV_RUNNING\n");
% 		m_freem(m);
% 		return;
% 	}
% #endif
% 	/*
% 	 * Do consistency checks to verify assumptions
% 	 * made by code past this point.
% 	 */
% 	if ((m->m_flags & M_PKTHDR) == 0) {
% 		if_printf(ifp, "discard frame w/o packet header\n");
% 		ifp->if_ierrors++;
% 		m_freem(m);
% 		return;
% 	}
% 	if (m->m_len < ETHER_HDR_LEN) {
% 		/* XXX maybe should pullup? */
% 		if_printf(ifp, "discard frame w/o leading ethernet "
% 				"header (len %u pkt len %u)\n",
% 				m->m_len, m->m_pkthdr.len);
% 		ifp->if_ierrors++;
% 		m_freem(m);
% 		return;
% 	}
% 	eh = mtod(m, struct ether_header *);

Point outside of mbuf header.

% 	etype = ntohs(eh->ether_type);

First access outside of mbuf header. But this seems to be bogus and might be fixed by compiler optimization, since etype is not used until after the monitor mode returns. This may have been broken by debugging cruft -- in 5.2, etype is used immediately after here in a printf about discarding oversize frames. The compiler might also pessimize things by reordering code.

% 	if (m->m_pkthdr.rcvif == NULL) {
% 		if_printf(ifp, "discard frame w/o interface pointer\n");
% 		ifp->if_ierrors++;
% 		m_freem(m);
% 		return;
% 	}
% #ifdef DIAGNOSTIC
% 	if (m->m_pkthdr.rcvif != ifp) {
% 		if_printf(ifp, "Warning, frame marked as received on %s\n",
% 		    m->m_pkthdr.rcvif->if_xname);
% 	}
% #endif
%
% 	if (ETHER_IS_MULTICAST(eh->ether_dhost)) {
% 		if (ETHER_IS_BROADCAST(eh->ether_dhost))
% 			m->m_flags |= M_BCAST;
% 		else
% 			m->m_flags |= M_MCAST;
% 		ifp->if_imcasts++;
% 	}

Another dereference of eh (2 unless optimizable and optimized). Here the result is actually used early, but I think you don't care enough about maintaining if_imcasts to do this.

%
% #ifdef MAC
% 	/*
% 	 * Tag the mbuf with an appropriate MAC label before any other
% 	 * consumers can get to it.
% 	 */
% 	mac_ifnet_create_mbuf(ifp, m);
% #endif
%
% 	/*
% 	 * Give bpf a chance at the packet.
% 	 */
% 	ETHER_BPF_MTAP(ifp, m);

I think this can access the whole packet, but usually doesn't.

%
% 	/*
% 	 * If the CRC is still on the packet, trim it off. We do this once
% 	 * and once only in case we are re-entered. Nothing else on the
% 	 * Ethernet receive path expects to see the FCS.
% 	 */
% 	if (m->m_flags & M_HASFCS) {
% 		m_adj(m, -ETHER_CRC_LEN);
% 		m->m_flags &= ~M_HASFCS;
% 	}
%
% 	ifp->if_ibytes += m->m_pkthdr.len;
%
% 	/* Allow monitor mode to claim this frame, after stats are updated. */
% 	if (ifp->if_flags & IFF_MONITOR) {
% 		m_freem(m);
% 		return;
% 	}

Finally return in monitor mode. I don't see any stats update before here except for the stray if_imcasts one. 
BTW, stats behave strangely in monitor mode: - netstat -I interface 1 works except: - the byte counts are 0 every second second (the next second counts the previous 2), while the packet counts are update every second - one system started showing bge0 stats for all interfaces. Perhaps unrelated. - systat -ip shows all counts 0. I think this is due to stats maintained by the driver working but other stats not. The mixture seems strange at user level. Bruce ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
one that will later on handle the taskqueue to process the packets. That adds overhead. Ideally the interrupt for each network interface is bound to exactly one pre-determined CPU and the taskqueue is bound to the same CPU. That way the overhead for interrupt and taskqueue scheduling can be kept at a minimum. Most of the infrastructure to do this binding already exists in the kernel but is not yet exposed to the outside for us to make use of it. I'm also not sure if the ULE scheduler skips the more global locks when interrupt and the thread are on the same CPU. Distributing the interrupts and taskqueues among the available CPUs gives concurrent forwarding with bi- or multi-directional traffic. All incoming traffic from any particular interface is still serialized though. I used etherchannel to distribute incoming packets over 3 separate cpus evenly but the output was on one interface.. What I got was less performance than with one cpu and all three cpus were close to 100% utilizied. em0,em1,em2 were all receiving packets and sending them out em3. The machine had 4 cpus in it. em3 taskq was low cpu usage and em0,1,2 were using cpu0,1,2(for example) almost fully used. With all that cpu power being used and I got less performance than with 1 cpu :/ Obviously in SMP there is a big issue somewhere. Also my 82571 NIC supports multiple received queues and multiple transmit queues so why hasn't anyone written the driver to support this? It's not a 10gb card and it still supports it and it's widely available and not too expensive either. The new 82575/6 chips support even more queues and the two port version will be out this month and the 4 port in october (PCI-E cards). Motherboards are already shipping with the 82576.. (82571 supports 2x/2x 575/6 support 4x/4x) Paul ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
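As an aside for readers on later releases: the interrupt-to-CPU binding Andre describes above was later exposed to userland through cpuset(1). As a rough, untested illustration only -- the IRQ numbers are invented (check vmstat -i for the real ones), and whether the em/igb taskqueue follows its interrupt depends on the driver:

cpuset -l 0 -x 256	# pin em0's interrupt to CPU 0 (example IRQ number)
cpuset -l 1 -x 257	# pin em1's interrupt to CPU 1 (example IRQ number)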
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I use low-end memory, but on the machine that does 640 kpps it somehow has latency almost 4 times as low as on new FreeBSD cluster machines (~42 nsec instead of ~150). perfmon (fixed for AXP and A64) and hwpmc report an average of 11 k8-dc-misses per sendto() while sending via bge at 640 kpps. 11 * 42 accounts for 442 nsec out of the 1562 per packet at this rate. 11 * 150 = 1650 would probably make this rate unachievable despite the system having 20 times as much CPU and bus. Any of the buffered dimms or ddr3 or high cas ddr2 are going to have a lot more latency than older ones because the frequency is so high or the buffering. The best is to use ddr2 with the lowest timings that it supports at the highest frequency but not the highest frequency it supports at higher timings.. for instance i have some 1100mhz ddr2 ram but it's 5-5-5-15 but it will do 5-4-4-12 at 1000 or 900 Mhz so I think the latency may have more impact on the speed than the actual MHz of the ram itself. This works for several benchmarks which I have tested before running the ram at 1:1 with the FSB (400 FSB(1600fsb actual) with ram at 800 and the latency is a lot lower than ram at 1:1.20 FSB even though the bandwidth is higher) With higher latency in the 'server' machines we probably need to do things in bigger chunks.. Anyone using a FBSD router isn't going to care about a 1ms delay in the packet but they will care if packets are dropped or reordered. Paul ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Tue, 8 Jul 2008, Bruce Evans wrote: On Mon, 7 Jul 2008, Andre Oppermann wrote: Bruce Evans wrote: So it seems that the major overheads are not near the driver (as I already knew), and upper layers are responsible for most of the cache misses. The packet header is accessed even in monitor mode, so I think most of the cache misses in upper layers are not related to the packet header. Maybe they are due mainly to perfect non-locality for mbufs. Monitor mode doesn't access the payload packet header. It only looks at the mbuf (which has a structure called mbuf packet header). The mbuf header is hot in the cache because the driver just touched it and filled in the information. The packet content (the payload) is cold and just arrived via DMA in DRAM. Why does it use ntohs() then? :-). From if_ethersubr.c: ... % eh = mtod(m, struct ether_header *); Point outside of mbuf header. % etype = ntohs(eh->ether_type); First access outside of mbuf header. ... % % /* Allow monitor mode to claim this frame, after stats are updated. */ % if (ifp->if_flags & IFF_MONITOR) { % m_freem(m); % return; % } Finally return in monitor mode. I don't see any stats update before here except for the stray if_imcasts one. There are some error stats with printfs, but I've never seen these do anything except with a buggy sk driver. Testing verifies that accessing eh above gives a cache miss. Under ~5.2 receiving on bge0 at 397 kpps:
-monitor: 17% idle 19 cm/p (18% less idle than under -current)
monitor: 66% idle 8 cm/p (17% less idle than under -current)
+monitor: 71% idle 7 cm/p (idle time under -current not measured)
+monitor is monitor mode with the exit moved to the top of ether_input(). If the cache miss takes the time measured by lmbench2 (42 ns), then 397 k of these per second gives 17 ms or 1.7% CPU, which is vaguely consistent with the improvement of 5% by not taking this cache miss. Avoiding most of the 19 cache misses should give much more than a 5% improvement. Maybe -current gets its 17% improvement by avoiding some. More stats weirdness in userland: - in monitor mode, em0 gives byte counts delayed while bge0 gives byte counts always 0. - netstat -I interface 1 seems to be broken in ~5.2 in all modes -- it gives output for interfaces with drivers but no hardware. All this is for UP. An SMP kernel on the same UP system loses 5% for at least tx. Bruce ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
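The "+monitor" variant itself is not shown in the thread. As a rough sketch of the experiment (an illustration of the idea, not the committed code), the change is simply to hoist the IFF_MONITOR check ahead of anything that dereferences the frame data:

/* Sketch of the "+monitor" experiment: claim the frame for monitor mode
 * before anything touches the packet contents.  Only mbuf header fields,
 * already hot in the cache, are read before the early return. */
static void
ether_input(struct ifnet *ifp, struct mbuf *m)
{
	struct ether_header *eh;
	u_short etype;

	if ((ifp->if_flags & IFF_UP) == 0) {
		m_freem(m);
		return;
	}

	if (ifp->if_flags & IFF_MONITOR) {
		ifp->if_ibytes += m->m_pkthdr.len;	/* stats from the mbuf header only */
		m_freem(m);
		return;
	}

	eh = mtod(m, struct ether_header *);		/* first access to the frame itself */
	etype = ntohs(eh->ether_type);
	/* ... remainder of ether_input() unchanged ... */
}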
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi, As was already mentioned, we can't avoid all cache misses as there's data that's recently been updated in memory via DMA and therefore kicked out of cache. However, we may hide some of the latency penalty by prefetching 'interesting' data early. I.e. we know that we want to access some ethernet headers, so we may start pulling relevant data into cache early. Ideally, by the time we need to access the field, it will already be in the cache. When we're counting nanoseconds per packet this may bring some performance gain. Just my $0.02. --Artem ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
At 02:44 PM 7/7/2008, Paul wrote: Also my 82571 NIC supports multiple received queues and multiple transmit queues so why hasn't anyone written the driver to support this? It's not a 10gb card and it still supports it and it's widely Intel actually maintains the driver. Not sure if there are plans or not, but perhaps they can comment ? ---Mike available and not too expensive either. The new 82575/6 chips support even more queues and the two port version will be out this month and the 4 port in october (PCI-E cards). Motherboards are already shipping with the 82576.. (82571 supports 2x/2x 575/6 support 4x/4x) Paul ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Artem Belevich wrote: Hi, As was already mentioned, we can't avoid all cache misses as there's data that's recently been updated in memory via DMA and therefore kicked out of cache. However, we may hide some of the latency penalty by prefetching 'interesting' data early. I.e. we know that we want to access some ethernet headers, so we may start pulling relevant data into cache early. Ideally, by the time we need to access the field, it will already be in the cache. When we're counting nanoseconds per packet this may bring some performance gain. Prefetching when you are waiting for the data isn't a help. What you need is a speculative prefetch where you can tell the processor "We will probably need the following address, so start getting it while we go do other stuff." As far as I know we have no capacity to do that.. Just my $0.02. --Artem ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On 2008-Jul-07 13:25:13 -0700, Julian Elischer [EMAIL PROTECTED] wrote: What you need is a speculative prefetch where you can tell the processor "We will probably need the following address, so start getting it while we go do other stuff." This looks like the PREFETCH instructions that exist in at least amd64 and SPARC. Unfortunately, their optimal use is very implementation-dependent and the AMD documentation suggests that incorrect use can degrade performance. -- Peter Jeremy Please excuse any delays as the result of my ISP's inability to implement an MTA that is either RFC2821-compliant or matches their claimed behaviour.
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Peter Jeremy wrote: On 2008-Jul-07 13:25:13 -0700, Julian Elischer [EMAIL PROTECTED] wrote: What you need is a speculative prefetch where you can tell the processor "We will probably need the following address, so start getting it while we go do other stuff." This looks like the PREFETCH instructions that exist in at least amd64 and SPARC. Unfortunately, their optimal use is very implementation-dependent and the AMD documentation suggests that incorrect use can degrade performance. It might be worth looking to see if the network processing threads might be able to prefetch the IP header at least :-) ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Prefetching when you are waiting for the data isn't a help. Agreed. Got to start the prefetch around <put your memory latency here> ns before you actually need the data and move on doing other things that do not depend on the data you've just started prefetching. What you need is a speculative prefetch where you can tell the processor "We will probably need the following address, so start getting it while we go do other stuff." It does not have to be 'speculative' either. In this particular case we have a very good idea that we *will* need some data from the ethernet header and, probably, IP and TCP headers as well. We might as well tell the hardware to start pulling data in without stalling the CPU. Intel has instructions specifically for this purpose. I assume AMD has them too. --Artem ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
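To make the suggestion concrete: a driver's receive loop could start pulling the next frame's first cache line(s) in while the current frame is still being processed. The fragment below is hypothetical -- rxd[], nrx and process_frame() are placeholder names, not any real driver's API -- and uses GCC's __builtin_prefetch(), which compiles to the PREFETCH instructions discussed above.

/* Hypothetical rx loop fragment: prefetch the headers of packet i+1
 * while packet i is being processed.  All names are placeholders. */
int i;

for (i = 0; i < nrx; i++) {
	struct mbuf *cur = rxd[i].mbuf;

	if (i + 1 < nrx) {
		char *next_hdr = mtod(rxd[i + 1].mbuf, char *);

		/* args: address, 0 = prefetch for read, 1 = low temporal locality */
		__builtin_prefetch(next_hdr, 0, 1);
		__builtin_prefetch(next_hdr + 64, 0, 1);	/* second cache line: IP/TCP headers */
	}
	process_frame(ifp, cur);	/* the normal receive path */
}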
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
At 02:44 PM 7/7/2008, Paul wrote: Also my 82571 NIC supports multiple received queues and multiple transmit queues so why hasn't anyone written the driver to support this? It's not a 10gb card and it still supports it and it's widely available and not too expensive either. The new 82575/6 chips support even more queues and the two port version will be out this month and the 4 port in october (PCI-E cards). Motherboards are already shipping with the 82576.. (82571 supports 2x/2x 575/6 support 4x/4x) Actually, do any of your NICs attach via the igb driver ? ---Mike ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I read through the IGB driver, and it says 82575/6 only... which is the new chip Intel is releasing on the cards this month 2 port and october 4 port, but the chips are on some of the motherboards right now. Why can't it also use the 82571 ? doesn't make any sense.. I haven't tried it but just browsing the driver source doesn't look like it will work. Mike Tancsa wrote: At 02:44 PM 7/7/2008, Paul wrote: Also my 82571 NIC supports multiple received queues and multiple transmit queues so why hasn't anyone written the driver to support this? It's not a 10gb card and it still supports it and it's widely available and not too expensive either. The new 82575/6 chips support even more queues and the two port version will be out this month and the 4 port in october (PCI-E cards). Motherboards are already shipping with the 82576.. (82571 supports 2x/2x 575/6 support 4x/4x) Actually, do any of your NICs attach via the igb driver ? ---Mike ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I'm no expert, but I imagine the problem is because the net processing of FreeBSD is not pipelined enough. We are now able to affordably throw many gigabytes of RAM into a machine, as well as 2 to 8 CPUs. So why not allow for big buffers and multiple processing steps? I'd be happy to give up a bit of latency in order to increase the parallel processing ability of packets travelling through the system. I could be wrong but I imagine it would be better to treat the processing of packets as a series of stages with queues (that can grow quite large if necessary). ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
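A userland toy of that staged idea (purely illustrative -- nothing here is kernel code or a concrete proposal): one thread plays the "receive" stage, another plays the "forwarding" stage, and a bounded queue protected by a mutex/condvar pair sits between them. A larger queue trades latency for the ability to absorb bursts, which is exactly the trade-off described above.

/* Toy two-stage pipeline: stage 1 produces work items, stage 2 consumes
 * them through a bounded queue.  Purely illustrative. */
#include <pthread.h>

#define QLEN 1024

static int q[QLEN];
static int q_head, q_tail, q_count;
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_notempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t q_notfull = PTHREAD_COND_INITIALIZER;

static void
enqueue(int pkt)
{
	pthread_mutex_lock(&q_mtx);
	while (q_count == QLEN)			/* queue full: stage 1 waits */
		pthread_cond_wait(&q_notfull, &q_mtx);
	q[q_tail] = pkt;
	q_tail = (q_tail + 1) % QLEN;
	q_count++;
	pthread_cond_signal(&q_notempty);
	pthread_mutex_unlock(&q_mtx);
}

static int
dequeue(void)
{
	int pkt;

	pthread_mutex_lock(&q_mtx);
	while (q_count == 0)			/* queue empty: stage 2 waits */
		pthread_cond_wait(&q_notempty, &q_mtx);
	pkt = q[q_head];
	q_head = (q_head + 1) % QLEN;
	q_count--;
	pthread_cond_signal(&q_notfull);
	pthread_mutex_unlock(&q_mtx);
	return (pkt);
}

static void *
stage2(void *arg)
{
	(void)arg;
	for (;;)
		(void)dequeue();		/* "forwarding" work would go here */
	return (NULL);
}

int
main(void)
{
	pthread_t t;
	int i;

	pthread_create(&t, NULL, stage2, NULL);
	for (i = 0; ; i++)
		enqueue(i);			/* stage 1: "receive" packets */
	/* NOTREACHED */
}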
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
ULE + PREEMPTION for non SMP no major differences with SMP with ULE/4BSD and preemption ON/OFF 32 bit UP test coming up with new cpu and I'm installing dragonfly sometime this weekend :] UP: 1mpps in one direction with no firewall/no routing table is not too bad, but 1mpps both directions is the goal here 700kpps with full bgp table in one direction is not too bad Ipfw needs a lot of work, barely gets 500kpps with no routing table with a few ipfw rules loaded.. that's horrible Linux barely takes a hit when you start loading iptables rules , but then again linux has a HUGE problem with routing random packet sources/ports .. grr My problem Is I need some box to do fast routing and some to do firewall.. :/ I'll have 32 bit 7-stable UP test with ipfw/routing table and then move on to dragonfly. I'll post the dragonfly results here as well as sign up for their mailing list. Bart Van Kerckhove wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Paul / Ingo, I tried all of this :/ still, 256/512 descriptors seem to work the best. Happy to let you log into the machine and fiddle around if you want :) I've been watching this thread closely, since I'm in a very similair situation. A few questions/remarks: Does ULE provide better performance than 4BSD for forwarding? Did you try freebsd4 as well? This thread had a report about that quite opposite to my own experiences, -4 seemed to be a lot faster at forwarding than anything else I 've tried so far. Obviously the thing I'm interested in is IMIX - and 64byte packets. Does anyone have any benchmarks for DragonFly? I asked around on IRC, but that nor google turned up any useful results. snip I don't think you will be able to route 64byte packets at 1gbit wirespeed (2Mpps) with a current x86 platform. Are there actual hardware related reasons this should not be possible, or is this purely lack of dedicated work towards this goal? snip Theres a sun used at quagga dev as bgp-route-server. http://quagga.net/route-server.php (but they don't answered my question regarding fw-performance). the Quagga guys are running a sun T1000 (niagara 1) route server - I happen to have the machine in my racks, please let me know if you want to run some tests on it, I'm sure they won't mind ;-) It should also make a great testbed for SMP performance testing imho (and they're pretty cheap these days) Also, feel free to use me as a relay for your questions, they're not always very reachable. snap Perhaps you have some better luck at some different hardware systems (ppc, mips, ..?) or use freebsd only for routing-table-updates and special network-cards (netfpga) for real routing. The netfpga site seems more or less dead - is this project still alive? It does look like a very interesting idea, even though it's currently quite linux-centric (and according to docs doesn't have VLAN nor ip6 support, the former being quite a dealbreaker) Paul: I'm looking forward to the C2D 32bit benchmarks (maybe throw in a freebsd4 and/or dragonfly bench if you can..) - appreciate the lots of information you are providing us :) Met vriendelijke groet / With kind regards, Bart Van Kerckhove http://friet.net/pgp.txt -BEGIN PGP SIGNATURE- iQA/AwUBSG/tMgoIFchBM0BKEQKUSQCcCJqsw2wtUX7HQi050HEDYX3WPuMAnjmi eca31f7WQ/oXq9tJ8TEDN3CA =YGYq -END PGP SIGNATURE- ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
UP 32 bit test vs 64 bit: negligible difference in forwarding performance without polling slightly better polling performance but still errors at lower packet rates same massive hit with ipfw loaded Installing dragonfly in a bit.. If anyone has a really fast PPC type system or SUN or something i'd love to try it :) Something with a really big L1 cache :P Paul wrote: ULE + PREEMPTION for non SMP no major differences with SMP with ULE/4BSD and preemption ON/OFF 32 bit UP test coming up with new cpu and I'm installing dragonfly sometime this weekend :] UP: 1mpps in one direction with no firewall/no routing table is not too bad, but 1mpps both directions is the goal here 700kpps with full bgp table in one direction is not too bad Ipfw needs a lot of work, barely gets 500kpps with no routing table with a few ipfw rules loaded.. that's horrible Linux barely takes a hit when you start loading iptables rules , but then again linux has a HUGE problem with routing random packet sources/ports .. grr My problem Is I need some box to do fast routing and some to do firewall.. :/ I'll have 32 bit 7-stable UP test with ipfw/routing table and then move on to dragonfly. I'll post the dragonfly results here as well as sign up for their mailing list. Bart Van Kerckhove wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Paul / Ingo, I tried all of this :/ still, 256/512 descriptors seem to work the best. Happy to let you log into the machine and fiddle around if you want :) I've been watching this thread closely, since I'm in a very similair situation. A few questions/remarks: Does ULE provide better performance than 4BSD for forwarding? Did you try freebsd4 as well? This thread had a report about that quite opposite to my own experiences, -4 seemed to be a lot faster at forwarding than anything else I 've tried so far. Obviously the thing I'm interested in is IMIX - and 64byte packets. Does anyone have any benchmarks for DragonFly? I asked around on IRC, but that nor google turned up any useful results. snip I don't think you will be able to route 64byte packets at 1gbit wirespeed (2Mpps) with a current x86 platform. Are there actual hardware related reasons this should not be possible, or is this purely lack of dedicated work towards this goal? snip Theres a sun used at quagga dev as bgp-route-server. http://quagga.net/route-server.php (but they don't answered my question regarding fw-performance). the Quagga guys are running a sun T1000 (niagara 1) route server - I happen to have the machine in my racks, please let me know if you want to run some tests on it, I'm sure they won't mind ;-) It should also make a great testbed for SMP performance testing imho (and they're pretty cheap these days) Also, feel free to use me as a relay for your questions, they're not always very reachable. snap Perhaps you have some better luck at some different hardware systems (ppc, mips, ..?) or use freebsd only for routing-table-updates and special network-cards (netfpga) for real routing. The netfpga site seems more or less dead - is this project still alive? It does look like a very interesting idea, even though it's currently quite linux-centric (and according to docs doesn't have VLAN nor ip6 support, the former being quite a dealbreaker) Paul: I'm looking forward to the C2D 32bit benchmarks (maybe throw in a freebsd4 and/or dragonfly bench if you can..) 
- appreciate the lots of information you are providing us :) Met vriendelijke groet / With kind regards, Bart Van Kerckhove http://friet.net/pgp.txt -BEGIN PGP SIGNATURE- iQA/AwUBSG/tMgoIFchBM0BKEQKUSQCcCJqsw2wtUX7HQi050HEDYX3WPuMAnjmi eca31f7WQ/oXq9tJ8TEDN3CA =YGYq -END PGP SIGNATURE- ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 Paul / Ingo, I tried all of this :/ still, 256/512 descriptors seem to work the best. Happy to let you log into the machine and fiddle around if you want :) I've been watching this thread closely, since I'm in a very similair situation. A few questions/remarks: Does ULE provide better performance than 4BSD for forwarding? Did you try freebsd4 as well? This thread had a report about that quite opposite to my own experiences, -4 seemed to be a lot faster at forwarding than anything else I 've tried so far. Obviously the thing I'm interested in is IMIX - and 64byte packets. Does anyone have any benchmarks for DragonFly? I asked around on IRC, but that nor google turned up any useful results. snip I don't think you will be able to route 64byte packets at 1gbit wirespeed (2Mpps) with a current x86 platform. Are there actual hardware related reasons this should not be possible, or is this purely lack of dedicated work towards this goal? snip Theres a sun used at quagga dev as bgp-route-server. http://quagga.net/route-server.php (but they don't answered my question regarding fw-performance). the Quagga guys are running a sun T1000 (niagara 1) route server - I happen to have the machine in my racks, please let me know if you want to run some tests on it, I'm sure they won't mind ;-) It should also make a great testbed for SMP performance testing imho (and they're pretty cheap these days) Also, feel free to use me as a relay for your questions, they're not always very reachable. snap Perhaps you have some better luck at some different hardware systems (ppc, mips, ..?) or use freebsd only for routing-table-updates and special network-cards (netfpga) for real routing. The netfpga site seems more or less dead - is this project still alive? It does look like a very interesting idea, even though it's currently quite linux-centric (and according to docs doesn't have VLAN nor ip6 support, the former being quite a dealbreaker) Paul: I'm looking forward to the C2D 32bit benchmarks (maybe throw in a freebsd4 and/or dragonfly bench if you can..) - appreciate the lots of information you are providing us :) Met vriendelijke groet / With kind regards, Bart Van Kerckhove http://friet.net/pgp.txt -BEGIN PGP SIGNATURE- iQA/AwUBSG/tMgoIFchBM0BKEQKUSQCcCJqsw2wtUX7HQi050HEDYX3WPuMAnjmi eca31f7WQ/oXq9tJ8TEDN3CA =YGYq -END PGP SIGNATURE- ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bart Van Kerckhove wrote: The netfpga site seems more or less dead - is this project still alive? It does look like a very interesting idea, even though it's currently quite linux-centric (and according to docs doesn't have VLAN nor ip6 support, the former being quite a dealbreaker) Just last Thursday they made another release so it certainly doesn't look dead. I've been following the project for awhile now to see where it's going to go. The lack of FreeBSD support isn't great but I doubt it's going to happen until someone steps up and makes it so. The same is likely true for VLAN support. So far it's primarily been a proof of concept from what I can tell and could be molded into any number of different applications with the appropriate support. Considering all high performance routing platforms separate the management and routing/switching into two (or more) different hardware sections it wouldn't surprise me at all to see this as the only real option to get some serious routing and firewalling performance out of i386/amd64 type servers. Throwing faster and faster cpus at it is only going to get you so far (re: opteron 2212 vs ). Even so, 1.1Mpps is a considerable rate. Regards, Chris ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bart Van Kerckhove wrote: Perhaps you have some better luck at some different hardware systems (ppc, mips, ..?) or use freebsd only for routing-table-updates and special network-cards (netfpga) for real routing. The netfpga site seems more or less dead - is this project still alive? It does look like a very interesting idea, even though it's currently quite linux-centric (and according to docs doesn't have VLAN nor ip6 support, the former being quite a dealbreaker) netfpga is very much alive. I'm on the mailing lists.. but it is summer break and it's an academically driven project. ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Fri, 4 Jul 2008, Paul wrote: Numbers are maximum with near 100% cpu usage and some errors occurring, just for testing. FreeBSD 7.0-STABLE FreeBSD 7.0-STABLE #6: Thu Jul 3 19:32:38 CDT 2008 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/ROUTER amd64 CPU: Dual-Core AMD Opteron(tm) Processor (3015.47-MHz K8-class CPU) NON-SMP KERNEL em driver, intel 82571EB NICs fastforwarding on, isr.direct on, ULE, Preemption (NOTE: Interesting thing, without preemption gets errors similar to polling) PREEMPTION is certainly needed with UP. Without it, interrupts don't actually work (to work, they need to preempt the running thread, but they often (usually?) don't do that). Then with UP, there is a good chance that the interrupt thread doesn't get scheduled to run for a long time, but with SMP (especially with lots of CPUs) there is a good chance that another CPU gets scheduled to run the interrupt thread. em (unless misconfigured) doesn't have an interrupt thread; it uses a taskq which might take even longer to be scheduled than an interrupt thread. I use PREEMPTION with UP and !PREEMPTION with SMP. With polling, missed polls cause the same packet loss as not preempting. I tried polling, and I tried the polling patch that was posted to the list and both work but generate too many errors (missed packets). Without polling the packet errors ONLY occur when the cpu is near 100% usage. Polling should also only cause packet loss when the CPU is near 100% usage, but now transients of near 100% usually cause packet loss, while with interrupts it takes a transient of 100% on the competing interrupt-driven resources to cause packet loss. Please trim quotes. Bruce ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
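For completeness, the knob Bruce describes lives in the kernel config; a minimal illustration of the two setups he runs (the config names are made up):

# ROUTER_UP (name invented): uniprocessor forwarding kernel
options 	PREEMPTION	# without this, the interrupt/taskq threads may not run in time
#options 	SMP

# ROUTER_SMP (name invented): multiprocessor kernel, PREEMPTION left out
options 	SMP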
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul, Opteron UP mode, no polling input (em0) output packets errs bytespackets errs bytes colls 1071020 0 66403248 2 0404 0 that looks good. (but seems to be near the limit). Polling turned on provided better performance on 32 bit, but it gets strange errors on 64 bit.. Even at low pps I get small amounts of errors, and high pps same thing.. you would think that if it got errors at low pps it would get more errors at high pps but that isn't the case.. Polling on: packets errs bytespackets errs bytes colls 979736 963 60743636 1 0226 0 991838 496 61493960 1 0178 0 996125 460 61759754 1 0178 0 979381 326 60721626 1 0178 0 1022249 379 63379442 1 0178 0 991468 557 61471020 1 0178 0 lowering pps a little... input (em0) output packets errs bytespackets errs bytes colls 818688 151 50758660 1 0226 0 837920 179 51951044 1 0178 0 826217 168 51225458 1 0178 0 801017 100 49663058 1 0178 0 761857 287 47235138 1 0178 0 what could cause this? *) kern.polling.idle_poll enabled? *) kern.polling.user_frac ? *) kern.polling.reg_frac ? *) kern.polling.burst_max ? *) kern.polling.each_burst ? Kind regards, Ingo Flaschberger ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul, what could cause this? *) kern.polling.idle_poll enabled? *) kern.polling.user_frac ? *) kern.polling.reg_frac ? *) kern.polling.burst_max ? *) kern.polling.each_burst ? I tried tons of different values for these and nothing made any significant difference. Idle polling makes a difference, allows more pps, but still errors. Without idle polling it seems PPS is limited to HZ * descriptors, or 1000 * 256 or 512 but 1000 * 1024 is the same as 512.. 4000 * 256 or 2000 * 512 works but starts erroring 600kpps (SMP right now but it happens in UP too) I have patched src/sys/kern/kern_poll.c to support higher burst_max values: #define MAX_POLL_BURST_MAX 1 When setting kern.polling.burst_max to higher values, the server reach a point, where cpu-usage goes up without load, so try to keep below this values. I also have set the network card to 4096 rx-ram, to have more room for late polls. Kind regards, Ingo Flaschberger ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
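For readers reproducing these polling experiments: the kernel needs "options DEVICE_POLLING" (and a suitable kern.hz), polling is switched on per interface, and the sysctls being discussed are set as below. The each_burst value is one Paul tries later in the thread; the other values are placeholders, not recommendations:

ifconfig em0 polling
sysctl -w kern.polling.idle_poll=1
sysctl -w kern.polling.burst_max=1000
sysctl -w kern.polling.each_burst=50
sysctl -w kern.polling.user_frac=50
sysctl -w kern.polling.reg_frac=20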
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I tried all of this :/ still, 256/512 descriptors seem to work the best. Happy to let you log into the machine and fiddle around if you want :) Paul Ingo Flaschberger wrote: Dear Paul, what could cause this? *) kern.polling.idle_poll enabled? *) kern.polling.user_frac ? *) kern.polling.reg_frac ? *) kern.polling.burst_max ? *) kern.polling.each_burst ? I tried tons of different values for these and nothing made any significant difference. Idle polling makes a difference, allows more pps, but still errors. Without idle polling it seems PPS is limited to HZ * descriptors, or 1000 * 256 or 512 but 1000 * 1024 is the same as 512.. 4000 * 256 or 2000 * 512 works but starts erroring 600kpps (SMP right now but it happens in UP too) I have patched src/sys/kern/kern_poll.c to support higher burst_max values: #define MAX_POLL_BURST_MAX 1 When setting kern.polling.burst_max to higher values, the server reach a point, where cpu-usage goes up without load, so try to keep below this values. I also have set the network card to 4096 rx-ram, to have more room for late polls. Kind regards, Ingo Flaschberger ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Preliminary 32 bit results... When I started out it looked like 32 bit was worse than 64 bit, but it's just the timers are different. For instance, 4000 hz in 64 bit gives better results than 4000hz in 32 bit. Low HZ gives better result with polling on in 32 bit Bottom line, so far I'm not able to get any better performance out of 32 bit at all. In fact I think it might be even a tad slower. I didn't see as high of bursts like I did on 64 bit so far but I'm still testing. Tomorrow comes opteron so it's 1ghz faster than this one, and I can see if it scales directly with cpu speed or what happens. I did another SMP test with an interesting results. I took one of the cpus out of the machine, so it was just left with a single 2212 (dual core) and it performed better. Less contention I suppose? some results: kern.hz=4000 hw.em.rxd=512 hw.em.txd=512 polling on, idle polling on (only way I can get a reliable netstat output) input (em0) output packets errs bytespackets errs bytes colls 681961 117612 42281586 1 0226 0 655095 83418 40615892 2 0220 0 683881 93559 42400626 1 0178 0 683637 90452 42385498 1 0178 0 683345 87471 42367394 1 0178 0 682737 81483 42329696 2 0220 0 683154 95413 4232 1 0178 0 684556 111013 42442476 1 0178 0 684365 110960 42430634 1 0178 0 679089 116440 42103518 3 0534 0 684328 122713 42428340 1 0178 0 684852 121387 42460828 1 0178 0 685358 113256 42492200 1 0178 0 685060 123110 42473724 1 0178 0 684463 118335 42436710 1 0178 0 677182 127788 41985300 2 0356 0 685920 126144 42527044 1 0178 0 684946 107034 42466656 1 0178 0 (reboot) kern.hz=1000 input (em0) output packets errs bytespackets errs bytes colls 679611 97394 42136046 5 0762 0 663939 104714 41164254 5 0 1322 0 685538 91102 42503412 4 0536 0 676704 94629 41955668 2 0404 0 685323 115060 42490030 1 0178 0 675954 105506 41909164 2 0356 0 655321 92118 40629906 1 0178 0 686826 85674 42583228 2 0356 0 686378 89983 42555440 1 0178 0 685539 80180 42503422 1 0178 0 686704 88626 42575652 1 0178 0 686567 88596 42567158 1 0178 0 687031 82640 42595936 3 0398 0 sysctl -w kern.polling.each_burst=50 kern.polling.each_burst: 256 - 50 [EMAIL PROTECTED] ~]# netstat -w1 -I em0 input (em0) output packets errs bytespackets errs bytes colls 693036 39992 42968315 3 0400 0 695538 58189 43123360 1 0178 0 692670 62765 42945544 1 0178 0 693219 60755 42979580 2 0220 0 692637 64761 42943498 1 sysctl -w kern.polling.each_burst=33 kern.polling.each_burst: 50 - 33 [EMAIL PROTECTED] ~]# netstat -w1 -I em0 input (em0) output packets errs bytespackets errs bytes colls 690530 63359 42812868 1 0226 0 689748 57670 42764380 1 0178 0 690489 57874 42810322 1 0178 0 689655 60606 42758614 1 0178 0 ^C [EMAIL PROTECTED] ~]# sysctl -w kern.polling.each_burst=3 kern.polling.each_burst: 33 - 3 [EMAIL PROTECTED] ~]# netstat -w1 -I em0 input (em0) output packets errs bytespackets errs bytes colls 612234 110896 37958512 1 0226 0 614391 112506 38092246 1 0178 0 ^C [EMAIL PROTECTED] ~]# sysctl -w kern.polling.each_burst=800 kern.polling.each_burst: 3 - 800 [EMAIL PROTECTED] ~]# netstat -w1 -I em0 input (em0) output packets errs bytespackets errs bytes colls 668057 76496 41419538 1 0226 0 667689 88674 41396720 2 0220 0 670526 106654 41572616 1 0178 0 667326 97832 41374216 1 0178 0 ^C [EMAIL PROTECTED] ~]# sysctl -w
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Wed, 2 Jul 2008, Paul wrote: ... ---Reboot with 4096/4096(my guess is that it will be a lot worse, more errors..) Without polling, 4096 is horrible, about 200kpps less ... :/ Turning on polling.. polling on, 4096 is bad, input (em0) output packets errs bytespackets errs bytes colls 622379 307753 38587506 1 0178 0 635689 277303 39412718 1 0178 0 ... --Rebooting with 256/256 descriptors.. .. No polling: 843762 25337 52313248 1 0178 0 763555 0 47340414 1 0178 0 830189 0 51471722 1 0178 0 838724 0 52000892 1 0178 0 813594 939 50442832 1 0178 0 807303 763 50052790 1 0178 0 791024 0 49043492 1 0178 0 768316 1106 47635596 1 0178 0 Machine is maxed and is unresponsive.. That's the most interesting one. Even 1% packet loss would probably destroy performance, so the benchmarks that give 10-50% packet loss are uninteresting. All indications are that you are running out of CPU and memory (DMA and/or cache fills) throughput. The above apparently hits both limits at the same time, while with more descriptors memory throughput runs out first. 1 CPU is apparently barely enough for 800 kpps (is this all with UP now?), and I think more CPUs could only be slower, as you saw with SMP, especially using multiple em taskqs, since memory traffic would be higher. I wouldn't expect this to be fixed soon (except by throwing better/different hardware at it). The CPU/DMA balance can probably be investigated by slowing down the CPU/ memory system. You may remember my previous mail about getting higher pps on bge. Again, all indications are that I'm running out of CPU, memory, and bus throughput too since the bus is only PCI 33MHz. These interact in a complicated way which I haven't been able to untangle. -current is fairly consistently slower than my ~5.2 by about 10%, apparently due to code bloat (extra CPU and related extra cache misses). OTOH, like you I've seen huge variations for changes that should be null (e.g., disturbing the alignment of the text section without changing anything else). My ~5.2 is very consistent since I rarely change it, while -current changes a lot and shows more variation, but with no sign of getting near the ~5.2 plateau or even its old peaks. Polling ON: input (em0) output packets errs bytespackets errs bytes colls 784138 179079 48616564 1 0226 0 788815 129608 48906530 2 0356 0 75 142997 46844426 2 0468 0 803670 144459 49827544 1 0178 0 777649 147120 48214242 1 0178 0 779539 146820 48331422 1 0178 0 786201 148215 48744478 2 0356 0 776013 101660 48112810 1 0178 0 774239 145041 48002834 2 0356 0 771774 102969 47850004 1 0178 0 Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm really mistified by this.. Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy to explain (perhaps incorrectly). Polling can then read at most 256 descriptors every 1/2000 second, giving a max throughput of 512 kpps. Packets descriptors in general but might be equal here (for small packets). You seem to actually get 784 kpps, which is too high even in descriptors unless, but matches exactly if the errors are counted twice (784 - 179 - 505 ~= 512). CPU is getting short too, but 40% still happens to be left over after giving up at 512 kpps. Most of the errors are probably handled by the hardware at low cost in CPU by dropping packets. There are other types of errors but none except dropped packets is likely. 
Every time it maxes out and gets errors, top reports: CPU: 0.0% user, 0.0% nice, 10.1% system, 45.3% interrupt, 44.6% idle, pretty much the same line every time. 256/256 blows away 4096, probably because it fits the descriptors into the CPU's cache lines, while 4096 has too many cache misses and causes worse performance.

Quite likely. Maybe your systems have memory systems that are weak relative to other resources, so that they hit this limit sooner than expected. I should look at my fixes for bge, one that changes rxd from 256 to 512, and one that increases the ifq tx length from txd = 512 to about 2. Both of these might thrash caches. The former makes little difference except for polling at 4000 Hz, but I don't believe in or use polling. The latter works around select() for write descriptors not working on sockets, so that high frequency
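Bruce's 40%-idle explanation boils down to simple arithmetic: with polling, each clock tick can drain at most one ring's worth of receive descriptors, so the ceiling is roughly HZ times the rx descriptor count, assuming one packet per descriptor (which, as he notes, need not hold in general). A quick sanity check of that bound on a running box:

    # rough polled-receive ceiling: HZ * rx descriptors, one packet per descriptor assumed
    sysctl -n kern.hz
    echo $((2000 * 256))    # 512000 pps for the hz=2000, rxd=256 case discussed above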
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bruce Evans wrote: On Wed, 2 Jul 2008, Paul wrote: ... ---Reboot with 4096/4096(my guess is that it will be a lot worse, more errors..) Without polling, 4096 is horrible, about 200kpps less ... :/ Turning on polling.. polling on, 4096 is bad, input (em0) output packets errs bytespackets errs bytes colls 622379 307753 38587506 1 0178 0 635689 277303 39412718 1 0178 0 ... --Rebooting with 256/256 descriptors.. .. No polling: 843762 25337 52313248 1 0178 0 763555 0 47340414 1 0178 0 830189 0 51471722 1 0178 0 838724 0 52000892 1 0178 0 813594 939 50442832 1 0178 0 807303 763 50052790 1 0178 0 791024 0 49043492 1 0178 0 768316 1106 47635596 1 0178 0 Machine is maxed and is unresponsive.. That's the most interesting one. Even 1% packet loss would probably destroy performance, so the benchmarks that give 10-50% packet loss are uninteresting. But you realize that it's outputting all of these packets on em3 and I'm watching them coming out and they are consistent with the packets received on em0 that netstat shows are 'good' packets. All indications are that you are running out of CPU and memory (DMA and/or cache fills) throughput. The above apparently hits both limits at the same time, while with more descriptors memory throughput runs out first. 1 CPU is apparently barely enough for 800 kpps (is this all with UP now?), and I think more CPUs could only be slower, as you saw with SMP, especially using multiple em taskqs, since memory traffic would be higher. I wouldn't expect this to be fixed soon (except by throwing better/different hardware at it). The CPU/DMA balance can probably be investigated by slowing down the CPU/ memory system. I'm using a server opteron which supposedly has the best memory performance out of any CPU right now. Plus opterons have the biggest l1 cache, but small l2 cache. Do you think larger l2 cache on the Xeon (6mb for 2 core) would be better? I have a opteron coming which is 1ghz faster so we will see what happens : My NIC is PCI-E 4x so there's no bottleneck there. You may remember my previous mail about getting higher pps on bge. Again, all indications are that I'm running out of CPU, memory, and bus throughput too since the bus is only PCI 33MHz. These interact in a complicated way which I haven't been able to untangle. -current is fairly consistently slower than my ~5.2 by about 10%, apparently due to code bloat (extra CPU and related extra cache misses). OTOH, like you I've seen huge variations for changes that should be null (e.g., disturbing the alignment of the text section without changing anything else). My ~5.2 is very consistent since I rarely change it, while -current changes a lot and shows more variation, but with no sign of getting near the ~5.2 plateau or even its old peaks. Polling ON: input (em0) output packets errs bytespackets errs bytes colls 784138 179079 48616564 1 0226 0 788815 129608 48906530 2 0356 0 75 142997 46844426 2 0468 0 803670 144459 49827544 1 0178 0 777649 147120 48214242 1 0178 0 779539 146820 48331422 1 0178 0 786201 148215 48744478 2 0356 0 776013 101660 48112810 1 0178 0 774239 145041 48002834 2 0356 0 771774 102969 47850004 1 0178 0 Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm really mistified by this.. Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy to explain (perhaps incorrectly). Polling can then read at most 256 descriptors every 1/2000 second, giving a max throughput of 512 kpps. Packets descriptors in general but might be equal here (for small packets). 
You seem to actually get 784 kpps, which is too high even in descriptors unless, but matches exactly if the errors are counted twice (784 - 179 - 505 ~= 512). CPU is getting short too, but 40% still happens to be left over after giving up at 512 kpps. Most of the errors are probably handled by the hardware at low cost in CPU by dropping packets. There are other types of errors but none except dropped packets is likely. Read above, it's actually transmitting 770kpps out of em3 so it can't just be 512kpps. I suppose multiple packets can fit in 1 descriptor? I am using VERY small tcp packets.. Every time it maxes out and gets errors, top reports: CPU: 0.0% user, 0.0%
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Thu, 3 Jul 2008, Paul wrote: Bruce Evans wrote: No polling: 843762 25337 52313248 1 0178 0 763555 0 47340414 1 0178 0 830189 0 51471722 1 0178 0 838724 0 52000892 1 0178 0 813594 939 50442832 1 0178 0 807303 763 50052790 1 0178 0 791024 0 49043492 1 0178 0 768316 1106 47635596 1 0178 0 Machine is maxed and is unresponsive.. That's the most interesting one. Even 1% packet loss would probably destroy performance, so the benchmarks that give 10-50% packet loss are uninteresting. But you realize that it's outputting all of these packets on em3 and I'm watching them coming out and they are consistent with the packets received on em0 that netstat shows are 'good' packets. Well, output is easier. I don't remember seeing the load on a taskq for em3. If there is a memory bottleneck, it might to might not be more related to running only 1 taskq per interrupt, depending on how independent the memory system is for different CPU. I think Opterons have more indenpendence here than most x86's. I'm using a server opteron which supposedly has the best memory performance out of any CPU right now. Plus opterons have the biggest l1 cache, but small l2 cache. Do you think larger l2 cache on the Xeon (6mb for 2 core) would be better? I have a opteron coming which is 1ghz faster so we will see what happens I suspect lower latency memory would help more. Big memory systems have inherently higher latency. My little old A64 workstation and laptop have main memory latencies 3 times smaller than freebsd.org's new Core2 servers according to lmbench2 (42 nsec for the overclocked DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec). If there are a lot of cache misses, then the extra 100 nsec can be important. Profiling of sendto() using hwpmc or perfmon shows a significant number of cache misses per packet (2 or 10?). Polling ON: input (em0) output packets errs bytespackets errs bytes colls 784138 179079 48616564 1 0226 0 788815 129608 48906530 2 0356 0 Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm really mistified by this.. Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy to explain (perhaps incorrectly). Polling can then read at most 256 descriptors every 1/2000 second, giving a max throughput of 512 kpps. Packets descriptors in general but might be equal here (for small packets). You seem to actually get 784 kpps, which is too high even in descriptors unless, but matches exactly if the errors are counted twice (784 - 179 - 505 ~= 512). CPU is getting short too, but 40% still happens to be left over after giving up at 512 kpps. Most of the errors are probably handled by the hardware at low cost in CPU by dropping packets. There are other types of errors but none except dropped packets is likely. Read above, it's actually transmitting 770kpps out of em3 so it can't just be 512kpps. Transmitting is easier, but with polling its even harder to send faster than hz * queue_length than to receive. This is without polling in idle. I was thinking of trying 4 or 5.. but how would that work with this new hardware? Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally has lower overheads and latency, but is missing important improvements (mainly tcp optimizations in upper layers, better DMA and/or mbuf handling, and support for newer NICs). FreeBSD-5 is also missing the overhead+latency advantage. Here are some benchmarks. (ttcp mainly tests sendto(). 4.10 em needed a 2-line change to support a not-so-new PCI em NIC. 
Summary: - my bge NIC can handle about 600 kpps on my faster machine, but only achieves 300 in 4.10 unpatched. - my em NIC can handle about 400 kpps on my slower machine, except in later versions it can receive at about 600 kpps. - only 6.x and later can achieve near wire throughput for 1500-MTU packets (81 kpps vs 76 kpps). This depends on better DMA or mbuf handling... I now remember the details -- it is mainly better mbuf handling: old versions split the 1500-MTU packets into 2 mbufs and this causes 2 descriptors per packet, which causes extra software overheads and even larger overheads for the hardware. %%% Results of benchmarks run on 23 Feb 2007: my~5.2 bge -- ~4.10 em tx rx kpps load%ipskppsload%ips ttcp -l5-u -t 639 981660 398* 77 8k ttcp -l5 -t 6.01003960 6.0 65900 ttcp -l1472 -u -t 76 27 395 76 40 8k ttcp -l1472-t 51 40 11k 51
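Earlier in this message Bruce mentions profiling sendto() with hwpmc or perfmon to count cache misses per packet. A sketch of the hwpmc side using the driver's generic event aliases; whether the dc-misses alias maps to a real counter depends on the CPU, so treat the event names here as assumptions:

    # load hwpmc and count data-cache misses plus instructions system-wide while traffic runs
    kldload hwpmc
    pmcstat -s dc-misses -s instructions -w 5 sleep 30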
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Stefan,

So my maximum without polling is close to 800kpps but if I push that it starts locking me from doing things, or how many kpps do you want to achieve?

Do not know for Paul, but I want to be able to route (and/or bridge, to handle) a 600-700mbps syn flood, which is something like 1500kpps in every direction. Is it unrealistic?

Yes, I think so. Look at this project: http://yuba.stanford.edu/NetFPGA/ These cards could do that. The maximum count of routes seems to be limited, but with lpf it should work. A FreeBSD kernel interface is missing.

If the code is optimized to fully utilize MP I do not see a reason why a quad core processor should not be able to do this. After all, a single core seems to handle 500kpps; if we utilize four, eight or even more cores we should be able to route 1500kpps+?

There's a Sun used at quagga dev as a BGP route server: http://quagga.net/route-server.php (but they haven't answered my question regarding fw-performance).

I hope TOE, once MFCed to 7-STABLE, will help too?

I don't think TOE will help. Kind regards, Ingo Flaschberger
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul,

Tomorrow comes the opteron, so it's 1GHz faster than this one, and I can see if it scales directly with cpu speed or what happens.

Can you send me a lspci -v?

I did another SMP test with an interesting result. I took one of the CPUs out of the machine, so it was just left with a single 2212 (dual core), and it performed better. Less contention I suppose?

In SMP, locking is a performance killer. My next router appliance will be: http://www.axiomtek.com.tw/products/ViewProduct.asp?view=429

Kind regards, Ingo Flaschberger
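lspci is the Linux tool; the FreeBSD counterpart is pciconf, which is what Paul posts in a later message:

    # FreeBSD's rough equivalent of Linux's 'lspci -v'
    pciconf -lv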
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Stefan,

So my maximum without polling is close to 800kpps but if I push that it starts locking me from doing things, or how many kpps do you want to achieve?

Do not know for Paul, but I want to be able to route (and/or bridge, to handle) a 600-700mbps syn flood, which is something like 1500kpps in every direction. Is it unrealistic?

I would also give DragonFly BSD a try, as Mike had the best results with it.

Kind regards, Ingo Flaschberger
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Bruce Evans wrote: On Thu, 3 Jul 2008, Paul wrote: Bruce Evans wrote: No polling: 843762 25337 52313248 1 0178 0 763555 0 47340414 1 0178 0 830189 0 51471722 1 0178 0 838724 0 52000892 1 0178 0 813594 939 50442832 1 0178 0 807303 763 50052790 1 0178 0 791024 0 49043492 1 0178 0 768316 1106 47635596 1 0178 0 Machine is maxed and is unresponsive.. That's the most interesting one. Even 1% packet loss would probably destroy performance, so the benchmarks that give 10-50% packet loss are uninteresting. But you realize that it's outputting all of these packets on em3 and I'm watching them coming out and they are consistent with the packets received on em0 that netstat shows are 'good' packets. Well, output is easier. I don't remember seeing the load on a taskq for em3. If there is a memory bottleneck, it might to might not be more related to running only 1 taskq per interrupt, depending on how independent the memory system is for different CPU. I think Opterons have more indenpendence here than most x86's. Opterons have on cpu memory controller.. That should help a little. :P But I must be getting more than 1 packet per descriptor because I can do HZ=100 and still get it without polling.. idle polling helps in all cases of polling that I have tested it with, seems moreso on 32 bit I'm using a server opteron which supposedly has the best memory performance out of any CPU right now. Plus opterons have the biggest l1 cache, but small l2 cache. Do you think larger l2 cache on the Xeon (6mb for 2 core) would be better? I have a opteron coming which is 1ghz faster so we will see what happens I suspect lower latency memory would help more. Big memory systems have inherently higher latency. My little old A64 workstation and laptop have main memory latencies 3 times smaller than freebsd.org's new Core2 servers according to lmbench2 (42 nsec for the overclocked DDR PC3200 one and 55 for the DDR2 PC5400 (?) one, vs 145-155 nsec). If there are a lot of cache misses, then the extra 100 nsec can be important. Profiling of sendto() using hwpmc or perfmon shows a significant number of cache misses per packet (2 or 10?). The opterons are 667mhz DDR2 [registered], I have a Xeon that is ddr3 but i think the latency is higher than ddr2. I'll look up those programs you mentioned and see If I can run some tests. Polling ON: input (em0) output packets errs bytespackets errs bytes colls 784138 179079 48616564 1 0226 0 788815 129608 48906530 2 0356 0 Machine is responsive and has 40% idle cpu.. Why ALWAYS 40% ? I'm really mistified by this.. Is this with hz=2000 and 256/256 and no polling in idle? 40% is easy to explain (perhaps incorrectly). Polling can then read at most 256 descriptors every 1/2000 second, giving a max throughput of 512 kpps. Packets descriptors in general but might be equal here (for small packets). You seem to actually get 784 kpps, which is too high even in descriptors unless, but matches exactly if the errors are counted twice (784 - 179 - 505 ~= 512). CPU is getting short too, but 40% still happens to be left over after giving up at 512 kpps. Most of the errors are probably handled by the hardware at low cost in CPU by dropping packets. There are other types of errors but none except dropped packets is likely. Read above, it's actually transmitting 770kpps out of em3 so it can't just be 512kpps. Transmitting is easier, but with polling its even harder to send faster than hz * queue_length than to receive. This is without polling in idle. 
What I'm saying, though, is that it's not giving up at 512kpps, because 784kpps is coming in em0 and going out em3, so obviously it's reading more than 256 packets every 1/2000th of a second. What would be the best (theoretical) settings for 1Mpps processing? I actually don't have a problem 'receiving' more than 800kpps with much lower CPU usage if it's going to a blackhole, so obviously it can receive a lot more, maybe even line-rate pps, but I can't generate that much.

I was thinking of trying 4 or 5.. but how would that work with this new hardware?

Poorly, except possibly with polling in FreeBSD-4. FreeBSD-4 generally has lower overheads and latency, but is missing important improvements (mainly TCP optimizations in upper layers, better DMA and/or mbuf handling, and support for newer NICs). FreeBSD-5 is also missing the overhead+latency advantage. Here are some benchmarks. (ttcp mainly tests sendto(). 4.10 em needed a 2-line change to support a not-so-new PCI em NIC.) Summary: - my bge NIC can handle about 600 kpps on my faster machine, but only achieves
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Err.. pciconf -lv ? [EMAIL PROTECTED]:0:0:0: class=0x05 card=0x151115d9 chip=0x036910de rev=0xa2 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 Memory Controller' class = memory subclass = RAM [EMAIL PROTECTED]:0:1:0: class=0x060100 card=0x151115d9 chip=0x036410de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 LPC Bridge' class = bridge subclass = PCI-ISA [EMAIL PROTECTED]:0:1:1: class=0x0c0500 card=0x151115d9 chip=0x036810de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 SMBus' class = serial bus subclass = SMBus [EMAIL PROTECTED]:0:2:0: class=0x0c0310 card=0x151115d9 chip=0x036c10de rev=0xa1 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 USB Controller' class = serial bus subclass = USB [EMAIL PROTECTED]:0:2:1: class=0x0c0320 card=0x151115d9 chip=0x036d10de rev=0xa2 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 USB Controller' class = serial bus subclass = USB [EMAIL PROTECTED]:0:4:0: class=0x01018a card=0x151115d9 chip=0x036e10de rev=0xa1 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 IDE' class = mass storage subclass = ATA [EMAIL PROTECTED]:0:5:0: class=0x010185 card=0x151115d9 chip=0x037f10de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 SATA Controller' class = mass storage subclass = ATA [EMAIL PROTECTED]:0:5:1: class=0x010185 card=0x151115d9 chip=0x037f10de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 SATA Controller' class = mass storage subclass = ATA [EMAIL PROTECTED]:0:5:2: class=0x010185 card=0x151115d9 chip=0x037f10de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 SATA Controller' class = mass storage subclass = ATA [EMAIL PROTECTED]:0:6:0: class=0x060401 card=0x151115d9 chip=0x037010de rev=0xa2 hdr=0x01 vendor = 'Nvidia Corp' device = 'MCP55 PCI bridge' class = bridge subclass = PCI-PCI [EMAIL PROTECTED]:0:8:0:class=0x02 card=0x151115d9 chip=0x037210de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 Ethernet' class = network subclass = ethernet [EMAIL PROTECTED]:0:9:0:class=0x02 card=0x151115d9 chip=0x037210de rev=0xa3 hdr=0x00 vendor = 'Nvidia Corp' device = 'MCP55 Ethernet' class = network subclass = ethernet [EMAIL PROTECTED]:0:10:0: class=0x060400 card=0x10de chip=0x037610de rev=0xa3 hdr=0x01 vendor = 'Nvidia Corp' device = 'MCP55 PCIe bridge' class = bridge subclass = PCI-PCI [EMAIL PROTECTED]:0:13:0: class=0x060400 card=0x10de chip=0x037810de rev=0xa3 hdr=0x01 vendor = 'Nvidia Corp' device = 'MCP55 PCIe bridge' class = bridge subclass = PCI-PCI [EMAIL PROTECTED]:0:14:0: class=0x060400 card=0x10de chip=0x037510de rev=0xa3 hdr=0x01 vendor = 'Nvidia Corp' device = 'MCP55 PCIe bridge' class = bridge subclass = PCI-PCI [EMAIL PROTECTED]:0:15:0: class=0x060400 card=0x10de chip=0x037710de rev=0xa3 hdr=0x01 vendor = 'Nvidia Corp' device = 'MCP55 PCIe bridge' class = bridge subclass = PCI-PCI [EMAIL PROTECTED]:0:24:0: class=0x06 card=0x chip=0x11001022 rev=0x00 hdr=0x00 vendor = 'Advanced Micro Devices (AMD)' device = '(K8) Athlon 64/Opteron HyperTransport Technology Configuration' class = bridge subclass = HOST-PCI [EMAIL PROTECTED]:0:24:1: class=0x06 card=0x chip=0x11011022 rev=0x00 hdr=0x00 vendor = 'Advanced Micro Devices (AMD)' device = '(K8) Athlon 64/Opteron Address Map' class = bridge subclass = HOST-PCI [EMAIL PROTECTED]:0:24:2: class=0x06 card=0x chip=0x11021022 rev=0x00 hdr=0x00 vendor = 'Advanced Micro Devices (AMD)' device = '(K8) Athlon 64/Opteron DRAM Controller' class = bridge subclass = HOST-PCI [EMAIL PROTECTED]:0:24:3: class=0x06 card=0x chip=0x11031022 rev=0x00 hdr=0x00 vendor = 'Advanced 
Micro Devices (AMD)' device = '(K8) Athlon 64/Opteron Miscellaneous Control' class = bridge subclass = HOST-PCI [EMAIL PROTECTED]:1:6:0: class=0x03 card=0x151115d9 chip=0x515e1002 rev=0x02 hdr=0x00 vendor = 'ATI Technologies Inc' device = 'Radeon ES1000 Radeon ES1000' class = display subclass = VGA [EMAIL PROTECTED]:2:0:0: class=0x060400 card=0x chip=0x01251033 rev=0x07 hdr=0x01 vendor = 'NEC Electronics Hong Kong' class = bridge subclass = PCI-PCI [EMAIL PROTECTED]:2:0:1: class=0x060400 card=0x chip=0x01251033 rev=0x07 hdr=0x01 vendor = 'NEC Electronics Hong Kong' class
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Ingo Flaschberger wrote: My next router appliance will be: http://www.axiomtek.com.tw/products/ViewProduct.asp?view=429 This is exactly the device that I have been testing with (just rebranded). Steve
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
SMP DISABLED on my Opteron 2212 (ULE, Preemption on) Yields ~750kpps in em0 and out em1 (one direction) I am miffed why this yields more pps than a) with all 4 cpus running and b) 4 cpus with lagg load balanced over 3 incoming connections so 3 taskq threads I would be willing to set up test equipment (several servers plugged into a switch) with ipkvm and power port access if someone or a group of people want to figure out ways to improve the routing process, ipfw, and lagg. Maximum PPS with one ipfw rule on UP: tops out about 570Kpps.. almost 200kpps lower ? (frown) I'm going to drop in a 3ghz opteron instead of the 2ghz 2212 that's in here and see how that scales, using UP same kernel etc I have now. Julian Elischer wrote: Paul wrote: ULE without PREEMPTION is now yeilding better results. input (em0) output packets errs bytespackets errs bytes colls 571595 40639 34564108 1 0226 0 577892 48865 34941908 1 0178 0 545240 84744 32966404 1 0178 0 587661 44691 35534512 1 0178 0 587839 38073 35544904 1 0178 0 587787 43556 35540360 1 0178 0 540786 39492 32712746 1 0178 0 572071 55797 34595650 1 0178 0 *OUCH, IPFW HURTS.. loading ipfw, and adding one ipfw rule allow ip from any to any drops 100Kpps off :/ what's up with THAT? unloaded ipfw module and back 100kpps more again, that's not right with ONE rule.. :/ ipfw need sto gain a lock on hte firewall before running, and is quite complex.. I can believe it.. in FreeBSD 4.8 I was able to use ipfw and filter 1Gb between two interfaces (bridged) but I think it has slowed down since then due to the SMP locking. em0 taskq is still jumping cpus.. is there any way to lock it to one cpu or is this just a function of ULE running a tar czpvf all.tgz * and seeing if pps changes.. negligible.. guess scheduler is doing it's job at least.. Hmm. even when it's getting 50-60k errors per second on the interface I can still SCP a file through that interface although it's not fast.. 3-4MB/s.. You know, I wouldn't care if it added 5ms latency to the packets when it was doing 1mpps as long as it didn't drop any.. Why can't it do that? Queue them up and do them in bi chunks so none are droppedhmm? 32 bit system is compiling now.. won't do 400kpps with GENERIC kernel, as with 64 bit did 450k with GENERIC, although that could be the difference between opteron 270 and opteron 2212.. Paul ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul, SMP DISABLED on my Opteron 2212 (ULE, Preemption on) Yields ~750kpps in em0 and out em1 (one direction) I am miffed why this yields more pps than a) with all 4 cpus running and b) 4 cpus with lagg load balanced over 3 incoming connections so 3 taskq threads because less locking, less synchronisation, I would be willing to set up test equipment (several servers plugged into a switch) with ipkvm and power port access if someone or a group of people want to figure out ways to improve the routing process, ipfw, and lagg. Maximum PPS with one ipfw rule on UP: tops out about 570Kpps.. almost 200kpps lower ? (frown) can you post the rule here? I'm going to drop in a 3ghz opteron instead of the 2ghz 2212 that's in here and see how that scales, using UP same kernel etc I have now. really, please try 32bit and 1 cpu. Kind regards, Ingo Flaschberger ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Ipfw rule was simply allow ip from any to any :) This is 64bit i'm testing now.. I have a 32 bit install I tested on another machine but it only has bge NIC and wasn't performing as well so I'll reinstall 32 bit on this 2212 and test then drop in the (3ghz) and test. I still don't like the huge hit ipfw and lagg take :/ ** I tried polling in UP mode and I got some VERY interesting results.. CPU is 44% idle (idle polling isn't on) but I'm getting errors! It's doing 530kpps with ipfw loaded, which without polling uses 100% cpu but now it says my cpu is 44% idle? that makes no sense.. If it was idle why am I getting errors? I only get errors when em taskq was eating 100% cpu.. Idle polling on/off makes no difference. user_frac is set to 5 .. last pid: 1598; load averages: 0.01, 0.16, 0.43 up 0+00:34:41 04:04:43 66 processes: 2 running, 46 sleeping, 18 waiting CPU: 0.0% user, 0.0% nice, 7.3% system, 46.5% interrupt, 46.2% idle Mem: 8064K Active, 6808K Inact, 43M Wired, 92K Cache, 9264K Buf, 1923M Free Swap: 8192M Total, 8192M Free PID USERNAME PRI NICE SIZERES STATETIME WCPU COMMAND 10 root 171 ki31 0K16K RUN 10:10 88.87% idle 1598 root 450 8084K 2052K RUN 0:00 1.12% top 11 root -32- 0K16K WAIT 0:02 0.24% swi4: clock sio 13 root -44- 0K16K WAIT14:13 0.15% swi1: net 1329 root 440 33732K 4572K select 0:00 0.05% sshd input (em0) output packets errs bytespackets errs bytes colls 541186 68741 33107504 1 0 0 0 540036 70611 33044632 1 0178 0 540470 66493 33043148 1 0178 0 541903 67981 33125414 1 0178 0 541238 84979 33105898 1 0178 0 541338 74067 33115984 2 0356 0 539116 49286 32991516 2 0220 0 kldunload ipfw... input (em0) output packets errs bytespackets errs bytes colls 600589 0 36751064 1 0226 0 606294 0 37102868 2 0220 0 616802 0 37733866 1 0178 0 623017 0 38117436 1 0178 0 624800 0 38225470 1 0178 0 626791 0 38347426 1 0178 0 last pid: 1605; load averages: 0.00, 0.13, 0.40 up 0+00:35:30 04:05:32 66 processes: 2 running, 46 sleeping, 18 waiting CPU: 0.0% user, 0.0% nice, 7.1% system, 36.0% interrupt, 56.9% idle Mem: 8064K Active, 6812K Inact, 43M Wired, 92K Cache, 9264K Buf, 1923M Free Swap: 8192M Total, 8192M Free PID USERNAME PRI NICE SIZERES STATETIME WCPU COMMAND 10 root 171 ki31 0K16K RUN 10:16 95.36% idle 13 root -44- 0K16K WAIT14:53 0.24% swi1: net 36 root -68- 0K16K -1:03 0.10% em3 taskq 1605 root 440 8084K 2052K RUN 0:00 0.10% top 11 root -32- 0K16K WAIT 0:02 0.05% swi4: clock sio add some more PPS.. input (em0) output packets errs bytespackets errs bytes colls 749015 169684 46438936 1 0 42 0 749176 184574 46448916 1 0178 0 759576 188462 47093716 1 0178 0 762904 182854 47300052 1 0178 0 798039 147509 49478422 1 0178 0 759528 194297 47090740 1 0178 0 746849 195935 46304642 1 0178 0 747566 186703 46349096 1 0178 0 750011 181630 46500702 2 last pid: 1607; load averages: 0.19, 0.17, 0.40 up 0+00:36:18 04:06:20 66 processes: 2 running, 46 sleeping, 18 waiting CPU: 0.0% user, 0.0% nice, 12.5% system, 45.4% interrupt, 42.1% idle Mem: 8068K Active, 6808K Inact, 43M Wired, 92K Cache, 9264K Buf, 1923M Free Swap: 8192M Total, 8192M Free PID USERNAME PRI NICE SIZERES STATETIME WCPU COMMAND 10 root 171 ki31 0K16K RUN 10:21 85.64% idle 36 root -68- 0K16K -1:07 3.61% em3 taskq 1607 root 440 8084K 2052K RUN 0:00 0.93% top 13 root -44- 0K16K WAIT15:32 0.20% swi1: net 11 root -32- 0K16K WAIT 0:02 0.05% swi4: clock sio So my maximum without polling is close to 800kpps but if I push that it starts locking me from doing things, or my maximum is 750kpps with polling and the console is very responsive? 
How on EARTH can my
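For reference, the single-rule ipfw test Paul describes can be restated as the commands implied by his output: load the module, add one allow rule, later flush and unload. This is only a restatement of those steps, not a recommended firewall configuration:

    # load ipfw and add the single rule used in these tests
    kldload ipfw
    ipfw add 100 allow ip from any to any
    # ...run the traffic, then back the change out
    ipfw -f flush
    kldunload ipfw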
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul,

I still don't like the huge hit ipfw and lagg take :/

I think you can't use fastforwarding with lagg.

** I tried polling in UP mode and I got some VERY interesting results.. CPU is 44% idle (idle polling isn't on) but I'm getting errors! It's doing 530kpps with ipfw loaded, which without polling uses 100% cpu, but now it says my cpu is 44% idle? That makes no sense.. If it was idle, why am I getting errors? I only get errors when the em taskq was eating 100% cpu.. Idle polling on/off makes no difference. user_frac is set to 5 ..

What are your values: kern.polling.reg_frac= kern.polling.user_frac= kern.polling.burst_max= ? I use: kern.polling.reg_frac=20 kern.polling.user_frac=20 kern.polling.burst_max=512. If you need more than 1000, you need to change the code: src/sys/kern/kern_poll.c #define MAX_POLL_BURST_MAX 1000

So my maximum without polling is close to 800kpps but if I push that it starts locking me from doing things, or how many kpps do you want to achieve? HZ=2000 for this test (512/512 descriptors)

You mean hw.em.rxd=512 hw.em.txd=512 ? Can you try with polling: hw.em.rxd=4096 hw.em.txd=4096

Kind regards, Ingo Flaschberger
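Pushing kern.polling.burst_max past 1000 requires the source change Ingo points at, followed by a kernel rebuild. A hedged sketch; the 10000 ceiling below is an arbitrary example, since the value Ingo actually used was lost in the archive:

    # raise the compile-time ceiling in kern_poll.c, then rebuild and install the kernel
    sed -i '' 's/#define MAX_POLL_BURST_MAX[[:space:]]*1000/#define MAX_POLL_BURST_MAX 10000/' \
        /usr/src/sys/kern/kern_poll.c
    # only accepted once the patched kernel is running
    sysctl kern.polling.burst_max=5000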
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi Ingo Flaschberger wrote: Dear Paul, I still don't like the huge hit ipfw and lagg take :/ You have to try PF, then you will respect IPFW again ;) -cut- So my maximum without polling is close to 800kpps but if I push that it starts locking me from doing things, or how many kpps do you want to achieve? Do not know for Paul but, I want to be able to route (and/or bridge to handle) 600-700mbps syn flood, which is something like 1500kpps in every direction. Is it unrealistic? If the code is optimized to fully utilize MP I do not see a reason why quad core processor should not be able to do this. After all single core seems to handle 500kpps, if we utilize four, eight or even more cores we should be able to route 1500kpps + ? I hope TOE once MFCed to 7-STABLE will help too? -- Best Wishes, Stefan Lambrev ICQ# 24134177 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
2008/7/2 Stefan Lambrev [EMAIL PROTECTED]:

Do not know for Paul, but I want to be able to route (and/or bridge, to handle) a 600-700mbps syn flood, which is something like 1500kpps in every direction. Is it unrealistic? If the code is optimized to fully utilize MP I do not see a reason why a quad core processor should not be able to do this. After all, a single core seems to handle 500kpps; if we utilize four, eight or even more cores we should be able to route 1500kpps+? I hope TOE, once MFCed to 7-STABLE, will help too?

But it's not just about CPU use; it's about your NIC, your I/O bus path, your memory interface, your caches .. things get screwy. Especially if you're holding a full Internet routing table.

If you're interested in participating in a group funding project to make this happen then let me know. The more the merrier (read: the more that can be achieved :)

Adrian
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, Jun 30, 2008 at 6:39 PM, Ingo Flaschberger [EMAIL PROTECTED] wrote:

I'm curious now... how do you change individual device polling via sysctl?

Not via sysctl, via ifconfig:
# enable interface polling
/sbin/ifconfig em0 polling
/sbin/ifconfig em1 polling
/sbin/ifconfig em2 polling
/sbin/ifconfig em3 polling
(and via /etc/rc.local also across reboots)

No, you put it into the ifconfig_X lines in /etc/rc.conf as the last option. Or -polling to disable it.
ifconfig_em0='inet 1.2.3.4/24 polling'
ifconfig_em2='inet 1.2.3.5/24 -polling'

-- Freddie Cash [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Hi, Ingo Flaschberger wrote: Dear Rudy, I used polling in FreeBSD 5.x and it helped a bunch. I set up a new router with 7.0 and MSI was recommended to me. (I noticed no difference when moving from polling - MSI, however, on 5.4 polling seemed to help a lot. What are people using in 7.0? polling or MSI? if you have a inet-router with gige-uplinks, it is possible that there will be (d)dos attacks. only polling helps you then to keep the router manageable (but dropping packets). Let me disagree :) I'm experimenting with bridge and Intel 82571EB Gigabit Ethernet Controller. On quad core system I have no problems with the stability of the bridge without polling. taskq em0 takes 100% CPU, but I have another three (cpus/cores) that are free and the router is very very stable, no lag on other interfaces and the average load is not very high too. Kind regards, Ingo Flaschberger ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] -- Best Wishes, Stefan Lambrev ICQ# 24134177 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul,

I have been unable to even come close to livelocking the machine with the em driver's interrupt moderation. So that, to me, throws polling out the window. I tried 8000hz with polling modified to allow 1 burst and it makes no difference.

Higher HZ values give you better latency but less overall speed. 2000hz should be enough.

Kind regards, Ingo Flaschberger
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul,

Dual Opteron 2212. Recompiled kernel with 7-STABLE and removed a lot of junk in the config, added options NO_ADAPTIVE_MUTEXES (not sure if that makes any difference or not, will test without). Used ULE scheduler, used preemption, CPUTYPE=opteron in /etc/make.conf. 7.0-STABLE FreeBSD 7.0-STABLE #4: Tue Jul 1 01:22:18 CDT 2008 amd64

Max input rate .. 587kpps? Take into consideration that these packets are being forwarded out the em1 interface, which causes a great impact on cpu usage. If I set up a firewall rule to block the packets it can do over 1mpps on em0 input.

It would be great if you could also test with 32bit. What value do you have at net.inet.ip.intr_queue_maxlen?

Kind regards, Ingo Flaschberger
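The queue Ingo asks about is a plain sysctl, and its companion drop counter shows whether the default length is being overrun. A minimal sketch, with 4096 purely as an illustrative value rather than a figure from the thread:

    # current IP input queue length and how many packets have been dropped off it
    sysctl net.inet.ip.intr_queue_maxlen net.inet.ip.intr_queue_drops
    # enlarge it if the drop counter keeps climbing (4096 is only an example)
    sysctl net.inet.ip.intr_queue_maxlen=4096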
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Thanks.. I was hoping I wasn't seeing things : I do not like inconsistencies.. :/ Stefan Lambrev wrote: Greetings Paul, --OK I'm stumped now.. Rebuilt with preemption and ULE and preemption again and it's not doing what it did before.. I saw this in my configuration too :) Just leave your test running for longer time and you will see this strange inconsistency in action. In my configuration I almost always have better throughput after reboot, which drops latter (5-10min under flood) with 50-60kpps and after another 10-15min the number of correctly passed packet increase again. Looks like auto tuning of which I'm not aware :) How could that be? Now about 500kpps.. That kind of inconsistency almost invalidates all my testing.. why would it be so much different after trying a bunch of kernel options and rebooting a bunch of times and then going back to the original config doesn't get you what it did in the beginning.. I'll have to dig into this further.. never seen anything like it :) Hopefully the ip_input fix will help free up a few cpu cycles. ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I am going to.. I have an opteron 270 dual set up on 32 bit and the 2212 is set up on 64 bit :) Today should bring some 32 bit results as well as etherchannel results. Ingo Flaschberger wrote: Dear Paul, Dual Opteron 2212, Recompiled kernel with 7-STABLE and removed a lot of junk in the config, added options NO_ADAPTIVE_MUTEXES not sure if that makes any difference or not, will test without. Used ULE scheduler, used preemption, CPUTYPE=opteron in /etc/make.conf 7.0-STABLE FreeBSD 7.0-STABLE #4: Tue Jul 1 01:22:18 CDT 2008 amd64 Max input rate .. 587kpps? Take into consideration that these packets are being forwarded out em1 interface which causes a great impact on cpu usage. If I set up a firewall rule to block the packets it can do over 1mpps on em0 input. would be great if you can also test with 32bit. what value do you have at net.inet.ip.intr_queue_maxlen? kind regards, Ingo Flaschberger ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I can't reproduce the 580kpps maximum that I saw when I first compiled for some reason, I don't understand, the max I get even with ULE and preemption is now about 530 and it dips to 480 a lot.. The first time I tried it it was at 580 and dipped to 520...what the?.. (kernel config attached at end) * noticed that SOMETIMES the em0 taskq jumps around cpus and doesn't use 100% of any one cpu * noticed that the netstat packets per second rate varies explicitly with the CPU usage of em0 taskq (top output with ULE/PREEMPTION compiled in): PID USERNAME PRI NICE SIZERES STATE C TIME WCPU COMMAND 10 root 171 ki31 0K16K RUN3 64:12 94.09% idle: cpu3 36 root -68- 0K16K CPU1 1 5:43 89.75% em0 taskq 13 root 171 ki31 0K16K CPU0 0 63:21 87.30% idle: cpu0 12 root 171 ki31 0K16K RUN1 62:44 66.75% idle: cpu1 11 root 171 ki31 0K16K CPU2 2 62:17 56.49% idle: cpu2 39 root -68- 0K16K - 0 0:54 10.64% em3 taskq this is about 480-500kpps rate. now I wait a minute and PID USERNAME PRI NICE SIZERES STATE C TIME WCPU COMMAND 10 root 171 ki31 0K16K CPU3 3 64:56 100.00% idle: cpu3 36 root -68- 0K16K CPU2 2 6:21 94.14% em0 taskq 13 root 171 ki31 0K16K RUN0 63:55 80.18% idle: cpu0 11 root 171 ki31 0K16K RUN2 62:48 67.38% idle: cpu2 12 root 171 ki31 0K16K CPU1 1 63:04 58.40% idle: cpu1 39 root -68- 0K16K - 1 1:00 10.21% em3 taskq 530kpps rate... drops to 85%.. 480kpps rate goes back up to 95% 530kpps it keeps flopping like this... none of the CPUs are 100% use and none of the cpus add up , like the cpu time of em0 taskq is 94% so one of the cpus should be 6% idle but it's not. This is with ULE/PREEMPTION.. I see different behavior without preemption and with 4bsd.. and I also see different behavior depending on the time of day lol :) Figure that one out I'll post back without preemption and with 4bsd in a min then i'll move on to the 32 bit platform tests ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
ULE without PREEMPTION is now yeilding better results. input (em0) output packets errs bytespackets errs bytes colls 571595 40639 34564108 1 0226 0 577892 48865 34941908 1 0178 0 545240 84744 32966404 1 0178 0 587661 44691 35534512 1 0178 0 587839 38073 35544904 1 0178 0 587787 43556 35540360 1 0178 0 540786 39492 32712746 1 0178 0 572071 55797 34595650 1 0178 0 *OUCH, IPFW HURTS.. loading ipfw, and adding one ipfw rule allow ip from any to any drops 100Kpps off :/ what's up with THAT? unloaded ipfw module and back 100kpps more again, that's not right with ONE rule.. :/ em0 taskq is still jumping cpus.. is there any way to lock it to one cpu or is this just a function of ULE running a tar czpvf all.tgz * and seeing if pps changes.. negligible.. guess scheduler is doing it's job at least.. Hmm. even when it's getting 50-60k errors per second on the interface I can still SCP a file through that interface although it's not fast.. 3-4MB/s.. You know, I wouldn't care if it added 5ms latency to the packets when it was doing 1mpps as long as it didn't drop any.. Why can't it do that? Queue them up and do them in bi chunks so none are droppedhmm? 32 bit system is compiling now.. won't do 400kpps with GENERIC kernel, as with 64 bit did 450k with GENERIC, although that could be the difference between opteron 270 and opteron 2212.. Paul ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
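On Paul's question about locking the em0 taskq to one CPU: cpuset(1) can pin a kernel thread by PID, but it only arrived around FreeBSD 7.1, so whether it is available on the 7-STABLE snapshots in this thread is an assumption. A sketch, using the PID the taskq shows under in the top listings elsewhere in the thread:

    # pin the em0 taskq kernel thread (PID 36 in the top output above) to one core
    # assumes a world new enough to ship cpuset(1)
    cpuset -l 2 -p 36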
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Ok, now THIS is absoultely a whole bunch of ridiculousness.. I set up etherchannel, and I'm evenly distributing packets over em0 em1 and em2 to lagg0 and i get WORSE performance than with a single interface.. Can anyone explain this one? This is horrible. I got em0-em2 taskq's using 80% cpu EACH and they are only doing 100kpps EACH looks: packets errs bytespackets errs bytes colls 105050 110666303000 0 0 0 0 104952 139696297120 0 0 0 0 104331 121216259860 0 0 0 0 input (em1) output packets errs bytespackets errs bytes colls 103734 706586223998 0 0 0 0 103483 757036209046 0 0 0 0 103848 761956230886 0 0 0 0 input (em2) output packets errs bytespackets errs bytes colls 103299 629576197940 1 0226 0 106388 730716383280 1 0178 0 104503 705736270180 4 0712 0 last pid: 1378; load averages: 2.31, 1.28, 0.57 up 0+00:06:27 17:42:32 68 processes: 8 running, 42 sleeping, 18 waiting CPU: 0.0% user, 0.0% nice, 58.9% system, 0.0% interrupt, 41.1% idle Mem: 7980K Active, 5932K Inact, 47M Wired, 16K Cache, 8512K Buf, 1920M Free Swap: 8192M Total, 8192M Free PID USERNAME PRI NICE SIZERES STATE C TIME WCPU COMMAND 11 root 171 ki31 0K16K RUN2 5:18 80.47% idle: cpu2 38 root -68- 0K16K CPU3 3 2:30 80.18% em2 taskq 37 root -68- 0K16K CPU1 1 2:28 76.90% em1 taskq 36 root -68- 0K16K CPU2 2 2:28 72.56% em0 taskq 13 root 171 ki31 0K16K RUN0 3:32 29.20% idle: cpu0 12 root 171 ki31 0K16K RUN1 3:29 27.88% idle: cpu1 10 root 171 ki31 0K16K RUN3 3:21 25.63% idle: cpu3 39 root -68- 0K16K - 3 0:32 17.68% em3 taskq See that's total wrongness.. something is very wrong here. Does anyone have any ideas? I really need to get this working. I figured if I evenly distributed the packets over 3 interfaces it simulates having 3 rx queues because it has a separate process for each interface and the result is WAY more CPU usage and a little over half the pps throughput with a single port .. If anyone is interested in tackling some these issues please e-mail me. It would be greatly appreciated. Paul Julian Elischer wrote: Paul wrote: ULE without PREEMPTION is now yeilding better results. input (em0) output packets errs bytespackets errs bytes colls 571595 40639 34564108 1 0226 0 577892 48865 34941908 1 0178 0 545240 84744 32966404 1 0178 0 587661 44691 35534512 1 0178 0 587839 38073 35544904 1 0178 0 587787 43556 35540360 1 0178 0 540786 39492 32712746 1 0178 0 572071 55797 34595650 1 0178 0 *OUCH, IPFW HURTS.. loading ipfw, and adding one ipfw rule allow ip from any to any drops 100Kpps off :/ what's up with THAT? unloaded ipfw module and back 100kpps more again, that's not right with ONE rule.. :/ ipfw need sto gain a lock on hte firewall before running, and is quite complex.. I can believe it.. in FreeBSD 4.8 I was able to use ipfw and filter 1Gb between two interfaces (bridged) but I think it has slowed down since then due to the SMP locking. em0 taskq is still jumping cpus.. is there any way to lock it to one cpu or is this just a function of ULE running a tar czpvf all.tgz * and seeing if pps changes.. negligible.. guess scheduler is doing it's job at least.. Hmm. even when it's getting 50-60k errors per second on the interface I can still SCP a file through that interface although it's not fast.. 3-4MB/s.. You know, I wouldn't care if it added 5ms latency to the packets when it was doing 1mpps as long as it didn't drop any.. Why can't it do that? Queue them up and do them in bi chunks so none are droppedhmm? 32 bit system is compiling now.. 
won't do 400kpps with GENERIC kernel, as with 64 bit did 450k with GENERIC, although that could be the difference between opteron 270 and opteron 2212.. Paul
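For reference, the three-port aggregate being benchmarked in this message is built with ifconfig and the if_lagg module. A minimal sketch of a loadbalance lagg over the em ports used in the test, with the address purely illustrative:

    # create a load-balancing lagg over the three receive ports
    kldload if_lagg
    ifconfig lagg0 create
    ifconfig lagg0 laggproto loadbalance laggport em0 laggport em1 laggport em2
    ifconfig lagg0 inet 192.0.2.1/24 up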
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Apparently lagg hasn't been giant fixed :/ Can we do something about this quickly? with adaptive giant i get more performance on lagg but the cpu usage is smashed 100% I get about 50k more pps per interface (so 150kpps total which STILL is less than a single gigabit port) Check it out 68 processes: 9 running, 41 sleeping, 18 waiting CPU: 0.0% user, 0.0% nice, 89.5% system, 0.0% interrupt, 10.5% idle Mem: 8016K Active, 6192K Inact, 47M Wired, 108K Cache, 9056K Buf, 1919M Free Swap: 8192M Total, 8192M Free PID USERNAME PRI NICE SIZERES STATE C TIME WCPU COMMAND 38 root -68- 0K16K CPU1 1 3:29 100.00% em2 taskq 37 root -68- 0K16K CPU0 0 3:31 98.78% em1 taskq 36 root -68- 0K16K CPU3 3 2:53 82.42% em0 taskq 11 root 171 ki31 0K16K RUN2 22:48 79.00% idle: cpu2 10 root 171 ki31 0K16K RUN3 20:51 22.90% idle: cpu3 39 root -68- 0K16K RUN2 0:32 16.60% em3 taskq 12 root 171 ki31 0K16K RUN1 20:16 2.05% idle: cpu1 13 root 171 ki31 0K16K RUN0 20:25 1.90% idle: cpu0 input (em0) output packets errs bytespackets errs bytes colls 122588 07355280 0 0 0 0 123057 07383420 0 0 0 0 input (em1) output packets errs bytespackets errs bytes colls 174917 11899 10495032 2 0178 0 173967 11697 10438038 2 0356 0 174630 10603 10477806 2 0268 0 input (em2) output packets errs bytespackets errs bytes colls 175843 3928 10550580 0 0 0 0 175952 5750 10557120 0 0 0 0 Still less performance than single gig-e.. that giant lock really sucks , and why on earth would LAGG require that.. It seems so simple to fix :/ Anyone up for it:) I wish I was a programmer sometimes, but network engineering will have to do. :D Julian Elischer wrote: Paul wrote: Is PF better than ipfw? iptables almost has no impact on routing performance unless I add a swath of rules to it and then it bombs I need maybe 10 rules max and I don't want 20% performance drop for that.. :P well lots of people have wanted to fix it, and I've investigated quite a lot but it takes someone with 2 weeks of free time and all the right clue. It's not inherrent in ipfw but it needs some TLC from someone who cares :-). Ouch! :) Is this going to be fixed any time soon? We have some money that can be used for development costs to fix things like this because we use linux and freebsd machines as firewalls for a lot of customers and with the increasing bandwidth and pps the customers are demanding more and I can't give them better performance with a brand new dual xeon or opteron machine vs the old p4 machines I have them running on now :/ The only difference in the new machine vs old machine is that the new one can take in more pps and drop it but it can't route a whole lot more. Routing/firewalling must still not be lock free, ugh.. :P Thanks Julian Elischer wrote: Paul wrote: ULE without PREEMPTION is now yeilding better results. input (em0) output packets errs bytespackets errs bytes colls 571595 40639 34564108 1 0226 0 577892 48865 34941908 1 0178 0 545240 84744 32966404 1 0178 0 587661 44691 35534512 1 0178 0 587839 38073 35544904 1 0178 0 587787 43556 35540360 1 0178 0 540786 39492 32712746 1 0178 0 572071 55797 34595650 1 0178 0 *OUCH, IPFW HURTS.. loading ipfw, and adding one ipfw rule allow ip from any to any drops 100Kpps off :/ what's up with THAT? unloaded ipfw module and back 100kpps more again, that's not right with ONE rule.. :/ ipfw need sto gain a lock on hte firewall before running, and is quite complex.. I can believe it.. 
in FreeBSD 4.8 I was able to use ipfw and filter 1Gb between two interfaces (bridged) but I think it has slowed down since then due to the SMP locking. em0 taskq is still jumping cpus.. is there any way to lock it to one cpu or is this just a function of ULE running a tar czpvf all.tgz * and seeing if pps changes.. negligible.. guess scheduler is doing it's job at least.. Hmm. even when it's getting 50-60k errors per second on the interface I can still SCP a file through that interface although it's not fast.. 3-4MB/s.. You know, I wouldn't care if it added 5ms latency to the packets when it was doing 1mpps as long as it didn't drop any.. Why can't it do that?
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Paul wrote: The higher I set the buffer the worse it is.. 256 and 512 I get about 50-60k more pps than i do with 2048 or 4096.. You would think it would be the other way around but obviously there is some contention going on. :/ Looks like in bridge mode hw.em.rxd=512 and hw.em.txd=512 yields best results also. reducing or increasing those leads to worse performance. btw is there any news with hwpmc for new CPUs ? last time I checked was real pain to get it working with core2 CPUs :( I'm sticking with 512 for now, as it seems to make it worse with anything higher. Keep in mind, i'm using random source ips, random source and destination ports.. Although that should have zero impact on the amount of PPS it can route but for some reason it seems to.. ? Any ideas on that one? A single stream one source ip/port to one destination ip/port seems to use less cpu, although I haven't generated the same pps with that yet.. I am going to test it soon Ingo Flaschberger wrote: Dear Paul, I tried this.. I put 6-STABLE (6.3), using default driver was slower than FBSD7 have you set the rx/tx buffers? /boot/loader.conf hw.em.rxd=4096 hw.em.txd=4096 bye, Ingo ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED] -- Best Wishes, Stefan Lambrev ICQ# 24134177 ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
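The descriptor counts compared throughout this thread (256 or 512 versus 4096) are loader tunables read at boot, not runtime sysctls, so switching between them means editing /boot/loader.conf and rebooting. A sketch of the two configurations under discussion:

    # /boot/loader.conf: the small rings that tested fastest here
    hw.em.rxd=512
    hw.em.txd=512
    # or the large rings that performed worse in these tests
    #hw.em.rxd=4096
    #hw.em.txd=4096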
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
At 04:04 AM 6/29/2008, Paul wrote: This is just a question but who can get more than 400k pps forwarding performance ? OK, I setup 2 boxes on either end of a RELENG_7 box from about May 7th just now, to see with 2 boxes blasting across it how it would work. *However*, this is with no firewall loaded and, I must enable ip fast forwarding. Without that enabled, the box just falls over. even at 20Kpps, I start seeing all sorts of messages spewing to route -n monitor got message of size 96 on Mon Jun 30 15:39:10 2008 RTM_MISS: Lookup failed on this address: len 96, pid: 0, seq 0, errno 0, flags:DONE locks: inits: sockaddrs: DST default I am starting to wonder if those messages are the results of corrupted packets the machine just cant keep up with ? CPU is CPU: Intel(R) Xeon(R) CPU3070 @ 2.66GHz (2660.01-MHz 686-class CPU) input(Total) output packets errs bytespackets errs bytes colls 611945 0 77892098 611955 0 77013002 0 616727 0 78215508 616742 0 77303454 0 617066 0 78162130 617082 0 77238434 0 618238 0 78302314 618225 0 77377582 0 617035 0 78141000 617038 0 77215672 0 617625 0 78225600 617588 0 77301734 0 616190 0 78017320 616165 0 77091774 0 615583 0 78064130 615628 0 77152800 0 617662 0 78254388 617658 0 77332340 0 618000 0 78269912 617950 0 77344554 0 617248 0 78183136 617315 0 77259588 0 617325 0 78204566 617289 0 77282094 0 618391 0 78337734 618357 0 77413756 0 616025 0 78116070 616082 0 77203116 0 To generate the packets, I am just using /usr/src/tools/tools/netblast on 2 endpoints starting at about the same time # ./netblast 10.10.1.2 500 100 40 start: 1214854131.083679919 finish:1214854171.084668592 send calls:20139141 send errors: 0 approx send rate: 503478 approx error rate: 0 # ./netblast 10.10.1.3 500 10 40 start: 1214854273.882202815 finish:1214854313.882319031 send calls:23354971 send errors: 18757223 approx send rate: 114943 approx error rate: 0 The box in the middle doing the forwarding 1[spare-r7]# ifconfig -u em0: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST metric 0 mtu 1500 options=19bRXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4 ether 00:1b:21:08:32:a8 inet 10.20.1.1 netmask 0xff00 broadcast 10.20.1.255 media: Ethernet autoselect (1000baseTX full-duplex) status: active em1: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST metric 0 mtu 1500 options=9bRXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM ether 00:1b:21:08:32:a9 inet 192.168.43.193 netmask 0xff00 broadcast 192.168.43.255 media: Ethernet autoselect (100baseTX full-duplex) status: active em3: flags=8843UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST metric 0 mtu 1500 options=19bRXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,TSO4 ether 00:30:48:90:4c:ff inet 10.10.1.1 netmask 0xff00 broadcast 10.10.1.255 media: Ethernet autoselect (1000baseTX full-duplex) status: active lo0: flags=8049UP,LOOPBACK,RUNNING,MULTICAST metric 0 mtu 16384 inet 127.0.0.1 netmask 0xff00 I am going to try a few more tests with and without, firewall rules etc as well as an updated kernel to RELENG_7 as of today and see how that goes. ---Mike ___ freebsd-net@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to [EMAIL PROTECTED]
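Two pieces of Mike's setup are easy to miss: the fast-forwarding sysctl he says the box needs, and the netblast tool from the source tree. The argument order shown below (destination, UDP port, payload size, duration) is inferred from his command lines rather than from a manual page:

    # on the forwarding box: enable the fast-forwarding path
    sysctl net.inet.ip.fastforwarding=1
    # on each sender: netblast as invoked above; arguments appear to be
    # destination address, UDP port, payload bytes, and duration in seconds
    ./netblast 10.10.1.2 500 100 40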
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
With hours and days of tweaking I can't even get 500k pps :/ and that's with no firewall, nothing else. What is your kernel config? Sysctl config? The machine I'm testing on is a dual Opteron 2212 with an Intel two-port 82571 NIC. I'm using 7-STABLE, and I also tried 6-STABLE and -CURRENT. I get the RTM_MISS messages with 7 and -CURRENT, but only with certain types of packets at a certain rate. :/

I cannot get more than 500kpps; I've tried everything I could think of. Lowering the rx descriptors on em to 512 instead of 2048 gave me some more; I was stuck at 400kpps until I changed those and lowered the rx processing limit. My tests go in on em0 and out on em1 in one direction only, and it takes major errors once the em0 taskq gets close to 80% CPU. I am pretty disappointed that it maxes out a little over 400kpps, and even then it gets some errors here and there, mainly missed packets due to no buffer and rx overruns (dev.em.0.stats=1).

Mike Tancsa wrote:
 > At 04:04 AM 6/29/2008, Paul wrote:
 > > This is just a question but who can get more than 400k pps forwarding performance ?
 > OK, I setup 2 boxes on either end of a RELENG_7 box from about May 7th just now, to see with 2 boxes
 > blasting across it how it would work. *However*, this is with no firewall loaded and, I must enable
 > ip fast forwarding. Without that enabled, the box just falls over. [...]
 > I am going to try a few more tests with and without firewall rules, as well as an updated kernel to
 > RELENG_7 as of today, and see how that goes.
 >      ---Mike
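The symptoms Paul describes, the em0 taskq saturating a core and the adapter counting missed packets and rx overruns, can be watched directly with stock tools. A quick sketch; the device numbering (em0) matches his setup, and exact sysctl names can vary slightly between driver versions:

    # per-thread CPU usage, including the em0 taskq kernel thread
    top -SH

    # ask the driver to dump its hardware counters (missed packets, rx overruns)
    # into the kernel message buffer, then read them back
    sysctl dev.em.0.stats=1
    dmesg | tail -40

    # interface-level input errors and drops, sampled every second
    netstat -I em0 -w 1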
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Mike Tancsa wrote:
 > At 04:04 AM 6/29/2008, Paul wrote:
 > > This is just a question but who can get more than 400k pps forwarding performance ?
 > OK, I setup 2 boxes on either end of a RELENG_7 box from about May 7th just now [...] Without that
 > enabled, the box just falls over. even at 20Kpps, I start seeing all sorts of messages spewing to
 > route -n monitor
 > RTM_MISS: Lookup failed on this address: len 96, pid: 0, seq 0, errno 0, flags:<DONE>
 > locks:  inits:
 > sockaddrs: <DST>
 >  default

Mike,

Is the monitor running on the 7.0 box in the middle that you are testing? I set up the same configuration, and even with almost no load (< 1Kpps) I can replicate these error messages by making the remote IP address (in your case 'default') disappear (i.e. unplug the cable, DDoS, etc.).

Furthermore, I can even replicate the problem at a single packet per second by trying to ping an IP address that I know for a fact the router cannot get to.

Do you see these error messages if you set up a loopback address with an IP on the router, and effectively chop your test environment in half? In your case, can the router in the middle actually get to a default gateway for external addresses? (When I perform the test, your 'default' is substituted with the IP I am trying to reach, so I am only assuming that 'default' implies the default gateway.)

Steve
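Steve's "chop the test environment in half" suggestion amounts to giving the router a local sink address and aiming the test traffic at it, so the receive and lookup path is exercised without depending on anything downstream. A rough sketch; the 10.99.99.1 address is made up for illustration, and 10.10.1.1 is the router's address from Mike's setup:

    # on the router: add a host address on the loopback to act as the sink
    ifconfig lo0 alias 10.99.99.1 netmask 255.255.255.255

    # on the sending box, which reaches the router via 10.10.1.1:
    route add -host 10.99.99.1 10.10.1.1
    ping 10.99.99.1
    ./netblast 10.99.99.1 500 100 40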
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
I am getting this message with normal routing. Say em0 is 10.1.1.1/24 and em1 is 10.2.2.1/24, with one box at 10.1.1.2 on em0 and another at 10.2.2.2 on em1. I send a packet from 10.1.1.2, which comes in on em0 and of course has a route to 10.2.2.2 out em1, and I get MASSIVE numbers of RTM_MISS messages, but ONLY with certain packets. I don't get it. I posted a tcpdump of the types of packets that generate them and the ones that don't.

RTM_MISS is normal if the box can't find a route; it's the 'destination unreachable' case. I would prefer a kernel option to disable this message to save CPU cycles, though, as it is completely unnecessary to generate. I even set the default gateway to the loopback interface and I STILL get the message. Something is wrong in the code somewhere. Does anyone have any idea how to disable this message? It's causing major CPU usage in my zebra daemon, which is watching the route messages, and is most likely severely limiting pps throughput. :/

It generates the messages with only IPs on em0 and em1, nothing else in the routing table, and a default gateway set, so it has nothing to do with zebra. It happens on 7-STABLE and (8-)CURRENT; I tested both. There are no RTM_MISS messages on 7-RELEASE, so someone changed something in -STABLE. :/

Paul

Steve Bertrand wrote:
 > Mike,
 > Is the monitor running on the 7.0 box in the middle that you are testing? I set up the same
 > configuration, and even with almost no load (< 1Kpps) I can replicate these error messages by making
 > the remote IP address (in your case 'default') disappear (i.e. unplug the cable, DDoS, etc.). [...]
 > Do you see these error messages if you set up a loopback address with an IP on the router, and
 > effectively chop your test environment in half? [...]
 > Steve
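Paul's setup is simple to replicate for anyone who wants to see whether they get the same RTM_MISS storm. A rough sketch of the topology and the observation commands, reusing his addresses and the netblast tool used elsewhere in the thread (the exact packet mix that triggers the messages is not shown here, since his tcpdump is not in this excerpt):

    # on the router
    ifconfig em0 inet 10.1.1.1 netmask 255.255.255.0
    ifconfig em1 inet 10.2.2.1 netmask 255.255.255.0
    sysctl net.inet.ip.forwarding=1
    sysctl net.inet.ip.fastforwarding=1

    # in a second terminal on the router, watch the routing socket while traffic flows
    route -n monitor

    # on the 10.1.1.2 box, send via the router toward the far side
    route add -net 10.2.2.0/24 10.1.1.1
    ./netblast 10.2.2.2 500 100 40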
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Paul,

 > I am getting this message with normal routing. Say em0 is 10.1.1.1/24 and em1 is 10.2.2.1/24 [...]
 > I get MASSIVE numbers of RTM_MISS messages, but ONLY with certain packets. I don't get it. I posted
 > a tcpdump of the types of [...]

There is an open bug report:
http://www.freebsd.org/cgi/query-pr.cgi?pr=124540
Perhaps it has something to do with the multiple-FIB stuff?

kind regards,
Ingo Flaschberger
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Mon, Jun 30, 2008 at 03:44:48PM -0400, Mike Tancsa wrote:
 > OK, I setup 2 boxes on either end of a RELENG_7 box from about May 7th just now, to see with 2 boxes
 > blasting across it how it would work. *However*, this is with no firewall loaded and, I must enable
 > ip fast forwarding. Without that enabled, the box just falls over.

What is ip fast forwarding ?

 -aW
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Alex,

 > > OK, I setup 2 boxes on either end of a RELENG_7 box from about May 7th just now [...] *However*,
 > > this is with no firewall loaded and, I must enable ip fast forwarding. Without that enabled, the
 > > box just falls over.
 > What is ip fast forwarding ?

Instead of copying the whole IP packet into system memory, only the IP header is copied, and a fast path then determines whether the packet can be fast-forwarded. If it can, a new header is created in the other network card's buffer and the IP data is copied from network-card buffer to network-card buffer directly.

Kind regards,
Ingo Flaschberger
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
On Tue, Jul 01, 2008 at 03:00:31AM +0200, Ingo Flaschberger wrote:
 > Instead of copying the whole IP packet into system memory, only the IP header is copied, and a fast
 > path then determines whether the packet can be fast-forwarded. If it can, a new header is created
 > in the other network card's buffer and the IP data is copied from network-card buffer to
 > network-card buffer directly.

So how does one enable ip fast forwarding on FreeBSD ?

 -aW
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Dear Alex,

 > So how does one enable ip fast forwarding on FreeBSD ?

sysctl -w net.inet.ip.fastforwarding=1

Usually interface polling is also chosen, to prevent lock-ups (interrupt livelock) under load. See polling(4).

kind regards,
Ingo Flaschberger
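polling(4) needs kernel support before the per-interface switch does anything. A rough sketch for the 6.x/7.x era being discussed; the knob names can vary slightly between releases, so verify against polling(4) and em(4) on your system:

    # kernel config additions, then rebuild and install the kernel
    options DEVICE_POLLING
    options HZ=1000        # polling granularity is tied to the clock rate

    # on FreeBSD 6.x/7.x, polling is then switched on per interface
    ifconfig em0 polling
    ifconfig em1 polling

    # global polling behaviour lives under kern.polling.*, for example
    sysctl kern.polling.burst_max=1000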
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Ingo Flaschberger wrote:
 > Usually interface polling is also chosen, to prevent lock-ups. See polling(4).

I used polling in FreeBSD 5.x and it helped a bunch. I set up a new router with 7.0 and MSI was recommended to me. (I noticed no difference when moving from polling to MSI; however, on 5.4 polling seemed to help a lot.)

What are people using in 7.0, polling or MSI?

Rudy
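For comparisons like Rudy's it helps to confirm which mode a box is actually running in. A rough sketch of the usual checks (nothing here changes configuration, and output details differ between releases and adapters):

    # is the em interface currently in polling mode? POLLING appears in the
    # options line of ifconfig output when it is active
    ifconfig em0 | grep -i polling

    # are the NIC's interrupts MSI/MSI-X? MSI vectors typically show up in
    # vmstat -i as irq256 and above, and pciconf can list the capability
    vmstat -i
    pciconf -lc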
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Wilkinson, Alex wrote:
 > So how does one enable ip fast forwarding on FreeBSD ?

Not to take anything away from Ingo's response, but so that the setting survives reboots, add the following line to /etc/sysctl.conf:

net.inet.ip.fastforwarding=1

Steve
Re: Freebsd IP Forwarding performance (question, and some info) [7-stable, current, em, smp]
Support (Rudy) wrote:
 > Ingo Flaschberger wrote:
 > > Usually interface polling is also chosen, to prevent lock-ups. See polling(4).
 > I used polling in FreeBSD 5.x and it helped a bunch. I set up a new router with 7.0 and MSI was
 > recommended to me. (I noticed no difference when moving from polling to MSI; however, on 5.4
 > polling seemed to help a lot.)

I'm curious now... how do you change individual device polling via sysctl?

Steve