On 02/03/16 14:37, Meyer, Wolfgang wrote:
Hello,

we are evaluating network performance on a Dell server (PowerEdge R930 with 4 
sockets, hw.model: Intel(R) Xeon(R) CPU E7-8891 v3 @ 2.80GHz) equipped with 10 
GbE cards. We use programs in which the server side accepts connections on an 
IP address and port from the client side; after the connection is established, 
data is sent in turns between server and client in a predefined pattern (the 
server side sends more data than the client side), with sleeps in between the 
send phases. The test set-up is chosen such that every client process initiates 
500 connections handled in threads, and on the server side each process 
representing an IP/port pair also handles 500 connections in threads.

The number of connections is then increased and the overall network throughput 
is observed using nload. With FreeBSD on the server side, errors begin to occur 
at roughly 50,000 connections and the overall throughput won't increase further 
with more connections. With Linux on the server side it is possible to establish 
more than 120,000 connections, and at 50,000 connections the overall throughput 
is double that of FreeBSD with the same sending pattern. Furthermore, system 
load on FreeBSD is much higher, with 50 % system usage on each core and 80 % 
interrupt usage on the 8 cores handling the interrupt queues for the NIC. In 
comparison, Linux has <10 % system usage, <10 % user usage and about 15 % 
interrupt usage on the 16 cores handling the network interrupts for 50,000 
connections.

Varying the number of NIC interrupt queues doesn't change the performance (if 
anything it worsens the situation). Disabling Hyperthreading (utilising 40 
cores) degrades the performance. Increasing MAXCPU to utilise all 80 cores 
brings no improvement over 64 cores; atkbd and uart had to be disabled to avoid 
kernel panics with the increased MAXCPU (thanks to Andre Oppermann for 
investigating this). Initially the tests were made on 10.2-RELEASE; later I 
switched to 10-STABLE (later with ixgbe driver version 3.1.0), but that didn't 
change the numbers.

Some sysctl configurables were modified along the lines of the network 
performance guidelines found on the net (e.g. 
https://calomel.org/freebsd_network_tuning.html, 
https://www.freebsd.org/doc/handbook/configtuning-kernel-limits.html, 
https://pleiades.ucsc.edu/hyades/FreeBSD_Network_Tuning), but most of them 
didn't have any measurable impact. The final sysctl.conf and loader.conf 
settings are appended below. In the end the only tunables that provided any 
improvement were hw.ix.txd and hw.ix.rxd, which were reduced (!) to the minimum 
value of 64, and hw.ix.tx_process_limit and hw.ix.rx_process_limit, which were 
set to -1.
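
For the record, expressed as loader.conf lines that amounts to the following 
(only the ix(4) tunables named above, the rest of the settings follow below):

    # /boot/loader.conf -- ix(4) tunables that made a difference
    hw.ix.txd=64
    hw.ix.rxd=64
    hw.ix.tx_process_limit=-1
    hw.ix.rx_process_limit=-1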

Any ideas which tunables might be changed to get a higher number of TCP 
connections (it's not a question of overall throughput, as changing the sending 
pattern allows me to fully utilise the 10 Gb bandwidth)? How can I determine 
where the kernel is spending the time that causes the high CPU load? Any 
pointers are highly appreciated; I can't believe that there is such a blatant 
difference in network performance compared to Linux.

Regards,
Wolfgang

[SNIP]

Hi Wolfgang,

hwpmc is your friend here if you need to investigate where your processors are 
wasting their time.
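
Something along these lines as a starting point (the generic event aliases 
usually work; pmcstat -L lists what your CPU actually offers):

    kldload hwpmc
    # live, top-like view of where instructions are being spent
    pmcstat -TS instructions -w 1

    # or record samples and build a callgraph afterwards
    pmcstat -S instructions -O /tmp/samples.out sleep 30
    pmcstat -R /tmp/samples.out -G /tmp/callgraph.txt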

Either you will find them contending for the network stack (probably the pcb 
hash table), or they are fighting each other over the scheduler's lock(s) 
trying to steal jobs from the busy cores.
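
If it does look like lock contention, a test kernel built with LOCK_PROFILING 
can name the lock; roughly like this (sysctl names from memory, double-check 
against LOCK_PROFILING(9), and note the option itself adds overhead):

    # kernel config: options LOCK_PROFILING
    sysctl debug.lock.prof.reset=1
    sysctl debug.lock.prof.enable=1
    # ... run the test load for a minute ...
    sysctl debug.lock.prof.enable=0
    sysctl debug.lock.prof.stats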

Also check QPI link activity; it may reveal interesting facts about PCI 
root-complex geography versus process locations and migration.
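
Intel's Performance Counter Monitor tools can show that per-socket traffic; I 
believe they are in ports (as intel-pcm), though the binary names may differ 
by version. Roughly:

    kldload cpuctl       # PCM reads MSRs through the cpuctl(4) devices
    pcm-numa.x 1         # local vs. remote memory accesses per socket
    pcm.x 1              # overall view including QPI traffic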

You have two options here: either you persist in using a 4x10 core machine and 
spend a long time rearranging the stickiness of processes and interrupts to 
specific cores/packages (driver, then isr rings, then userland) and policing 
the whole thing (read: peacekeeping the riot), or you go for the much simpler 
solution of a 1 (yes, one) socket machine with the fastest available processor 
with a low core count (E5-1630v2/3 or 1650) that can handle 10G links hands 
down out of the box.
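
The pinning itself is mostly cpuset(1) plus a driver tunable. A rough sketch, 
where the queue count, core numbers and IRQ numbers are made up for 
illustration (take the real IRQs from vmstat -i, and double-check the tunable 
name against your ix(4) version):

    # /boot/loader.conf: limit the driver to the queues you intend to pin
    hw.ix.num_queues=8

    # bind each NIC interrupt to its own core on the package closest to
    # the NIC's PCI root complex (IRQ numbers are examples)
    cpuset -x 264 -l 0
    cpuset -x 265 -l 1
    # ...and so on for the remaining queues

    # keep the userland server processes on the same package
    cpuset -l 0-9 ./server_program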

Also note that there are specific and interesting optimizations in the L2 
generation on -head that you may want to try if the problem is stack-centered.

You may also have a threading problem (userland ones). In the domain of 
counting instructions per packet (you can practice that with netmap as a 
wonderful means of really 'sensing' what 40Gbps is), threading is bad (and 
Hyperthreading is evil).
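
For that 'sensing' exercise, the pkt-gen tool that ships with netmap in the 
source tree is enough; for example (interface name is just an example):

    # build it from /usr/src/tools/tools/netmap, then:
    pkt-gen -i ix0 -f tx      # generate minimum-size frames at line rate
    pkt-gen -i ix0 -f rx      # on the receiving end, count what arrives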

Thanks.
