Hello list!

I'm quite stuck with bad forwarding performance on many FreeBSD boxes doing firewalling.

A typical configuration is an E5645 / E5675 with an Intel 82599 NIC.
HT is turned off.
(Configs and tunables below).

I'm mostly concerned with unidirectional traffic flowing to a single interface (e.g. via a single route entry).

In most cases the system can forward no more than 700 (or 1,400) kpps, which is quite a bad number (Linux does, say, 5 Mpps on nearly the same hardware).


Test scenario:

Ixia XM2 (traffic generator) <> ix0 (FreeBSD).

Ixia sends 64-byte IP packets from vlan10 (10.100.0.64 - 10.100.0.156) to
destinations in vlan11 (10.100.1.128 - 10.100.1.192).

Static ARP entries are configured for all destination addresses (see the sketch below).

The traffic level is kept slightly above or slightly below what the system can forward.
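For reference, the static ARP entries were populated with something along these lines (a sketch: the MAC address is a made-up placeholder, the address range is the one above):

# pre-populate static ARP entries for every destination in vlan11
# (10.100.1.128 - 10.100.1.192); the MAC address here is a placeholder
for i in `jot 65 128`; do
    arp -S 10.100.1.$i 00:11:22:33:44:55
done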


================= Test 1  =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall

Traffic: 1-1 flow (1 src, 1 dst)
(This is actually a bit different from the scenario described above.)

Result:
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       878k   48k     0        59M       878k     0        56M     0
       874k   48k     0        59M       874k     0        56M     0
       875k   48k     0        59M       875k     0        56M     0

16:41 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
     STATE  C   TIME        CPU         COMMAND
      CPU6  6  17:28    100.00%      kernel{ix0 que}
      CPU9  9  20:42     60.06%    intr{irq265: ix0:que

16:41 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0                 500796        167
irq257: ix0:que 1                6693573       2245
irq258: ix0:que 2                2572380        862
irq259: ix0:que 3                3166273       1062
irq260: ix0:que 4                9691706       3251
irq261: ix0:que 5               10766434       3611
irq262: ix0:que 6                8933774       2996
irq263: ix0:que 7                5246879       1760
irq264: ix0:que 8                3548930       1190
irq265: ix0:que 9               11817986       3964
irq266: ix0:que 10                227561         76
irq267: ix0:link                       1          0

Note that the system is using 2 cores to forward, so 12 cores should be able to forward 4+ Mpps, which is more or less consistent with the Linux results. Note that interrupts are being generated on all queues (as far as I understand, given that AIM is turned off and the interrupt rates are the same as in the previous test). Additionally, despite hw.intr_storm_threshold = 200k, I'm constantly getting the
interrupt storm detected on "irq265:"; throttling interrupt source
message.
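For reference, the threshold and the per-queue rates can be cross-checked roughly like this (a sketch using the tunables from the configs below):

# compare the storm threshold with the configured and observed
# per-queue interrupt rates (tunables are the ones set in the configs below)
sysctl hw.intr_storm_threshold            # runtime threshold
kenv hw.ixgbe.max_interrupt_rate          # loader tunable from loader.conf
vmstat -i | grep 'ix0:que'                # observed rates (rightmost column)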


================= Test 2  =======================
Kernel: FreeBSD-8-S r237994, stock drivers, stock routing, no FLOWTABLE, no firewall

Traffic: Unidirectional, many-to-many

16:20 [0] test15# netstat -I ix0 -hw 1
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       507k  651k     0        74M       508k     0        32M     0
       506k  652k     0        74M       507k     0        28M     0
       509k  652k     0        74M       508k     0        37M     0


16:28 [0] test15# top -nCHSIzs1 | awk '$5 ~ /(K|SIZE)/ { printf " %7s %2s %6s %10s %15s %s\n", $7, $8, $9, $10, $11, $12}'
     STATE  C   TIME        CPU         COMMAND
     CPU10  6   0:40    100.00%      kernel{ix0 que}
      CPU2  2  11:47     84.86%    intr{irq258: ix0:que
      CPU3  3  11:50     81.88%    intr{irq259: ix0:que
      CPU8  8  11:38     77.69%    intr{irq264: ix0:que
      CPU7  7  11:24     77.10%    intr{irq263: ix0:que
      WAIT  1  10:10     74.76%    intr{irq257: ix0:que
      CPU4  4   8:57     63.48%    intr{irq260: ix0:que
      CPU6  6   8:35     61.96%    intr{irq262: ix0:que
      CPU9  9  14:01     60.79%    intr{irq265: ix0:que
       RUN  0   9:07     59.67%    intr{irq256: ix0:que
      WAIT  5   6:13     43.26%    intr{irq261: ix0:que
     CPU11 11   5:19     35.89%      kernel{ix0 que}
         -  4   3:41     25.49%      kernel{ix0 que}
         -  1   3:22     21.78%      kernel{ix0 que}
         -  1   2:55     17.68%      kernel{ix0 que}
         -  4   2:24     16.55%      kernel{ix0 que}
         -  1   9:54     14.99%      kernel{ix0 que}
      CPU0 11   2:13     14.26%      kernel{ix0 que}


16:07 [0] test15# vmstat -i | grep ix0
irq256: ix0:que 0                  13654         15
irq257: ix0:que 1                  87043         96
irq258: ix0:que 2                  39604         44
irq259: ix0:que 3                  48308         53
irq260: ix0:que 4                 138002        153
irq261: ix0:que 5                 169596        188
irq262: ix0:que 6                 107679        119
irq263: ix0:que 7                  72769         81
irq264: ix0:que 8                  30878         34
irq265: ix0:que 9                1002032       1115
irq266: ix0:que 10                 10967         12
irq267: ix0:link                       1          0


Note that all cores are loaded more or less evenly, but the result is _worse_. The first reason for this is the rtentry mtx_lock, which is acquired twice on every lookup: once in in_matroute(), where it can probably be removed, and once again in rtalloc1_fib(). The latter is addressed by andre@ in r234650.

Additionally, although the ithreads are each bound to a single CPU, the kernel 'ix0 que' taskqueue threads are not in the stock setup. However, a configuration with 5 queues and 5 kernel threads, each bound to a different CPU, gives the same bad results.

================= Test 3  =======================
Kernel: FreeBSD-8-S June 4 SVN, +merged ifaddrlock, stock drivers, stock routing, no FLOWTABLE, no firewall


    packets  errs idrops      bytes    packets  errs      bytes colls
       580k   18k     0        38M       579k     0        37M     0
       581k   26k     0        39M       580k     0        37M     0
       580k   24k     0        39M       580k     0        37M     0
................
Enabling ipfw _increases_ performance a bit:

       604k     0     0        39M       604k     0        39M     0
       604k     0     0        39M       604k     0        39M     0
       582k   19k     0        38M       568k     0        37M     0
       527k   81k     0        39M       530k     0        34M     0
       605k    28     0        39M       605k     0        39M     0


================= Test 3.1  =======================

Same as test 3, the only difference is the following:
route add -net 10.100.1.160/27 -iface vlan11.

             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       543k  879k     0        91M       544k     0        35M     0
       547k  870k     0        91M       545k     0        35M     0
       541k  870k     0        91M       539k     0        30M     0
       952k  565k     0        97M       962k     0        48M     0
       1.2M  228k     0        91M       1.2M     0        92M     0
       1.2M  226k     0        90M       1.1M     0        76M     0
       1.1M  228k     0        91M       1.2M     0        76M     0
       1.2M  233k     0        90M       1.2M     0        76M     0

================= Test 3.2  =======================

Same as test 3, splitting the destination into 4 smaller routes:
route add -net 10.100.1.128/28 -iface vlan11
route add -net 10.100.1.144/28 -iface vlan11
route add -net 10.100.1.160/28 -iface vlan11
route add -net 10.100.1.176/28 -iface vlan11

             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       1.4M     0     0       106M       1.6M     0       106M     0
       1.8M     0     0       106M       1.6M     0        71M     0
       1.6M     0     0       106M       1.6M     0        71M     0
       1.6M     0     0        87M       1.6M     0        71M     0
       1.6M     0     0       126M       1.6M     0       212M     0

================= Test 3.3  =======================

Same as test 3, splitting the destination into 16 smaller routes (the commands used are sketched after the results):
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       1.6M     0     0       118M       1.8M     0       118M     0
       2.0M     0     0       118M       1.8M     0       119M     0
       1.8M     0     0       119M       1.8M     0        79M     0
       1.8M     0     0       117M       1.8M     0       157M     0
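The 16 routes were added with a loop along these lines (a sketch; the /30 prefix length is my assumption from splitting the /26 destination range into 16 parts):

# split the 10.100.1.128/26 destination range into 16 /30 routes
for i in `jot 16 0`; do
    route add -net 10.100.1.$((128 + 4 * $i))/30 -iface vlan11
done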


================= Test 4  =======================
Kernel: FreeBSD-8-S June 4 SVN, stock drivers, routing patch 1, no FLOWTABLE, no firewall

             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       1.8M     0     0       114M       1.9M     0       114M     0
       1.7M     0     0       114M       1.7M     0       114M     0
       1.8M     0     0       114M       1.8M     0       114M     0
       1.7M     0     0       114M       1.7M     0       114M     0
       1.8M     0     0       114M       1.8M     0        74M     0
       1.5M     0     0       114M       1.8M     0        74M     0
         2M     0     0       114M       1.8M     0       194M     0


Patch 1 completely eliminates the rtentry mtx_lock on the fastforwarding path, to get an idea of how much performance we can achieve. The result is nearly the same as in test 3.3.

================= Test 4.1  =======================

Same as test 4, same traffic level, with the firewall enabled and a single allow rule (evaluating RLOCK performance; the rule setup is sketched after the results).

22:35 [0] test15# netstat -I ix0 -hw 1
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       1.8M  149k     0       114M       1.6M     0       142M     0
       1.4M  148k     0        85M       1.6M     0       104M     0
       1.8M  149k     0       143M       1.6M     0       104M     0
       1.6M  151k     0       114M       1.6M     0       104M     0
       1.6M  151k     0       114M       1.6M     0       104M     0
       1.4M  152k     0       114M       1.6M     0       104M     0

I.e. roughly a 10% performance loss.
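For reference, the single-rule setup is along these lines (a sketch; the rule number is arbitrary, IPFIREWALL and default_to_accept are already in the configs below):

# turn the firewall on and install one allow rule (rule number arbitrary)
sysctl net.inet.ip.fw.enable=1
ipfw -q flush
ipfw add 100 allow ip from any to any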


================= Test 4.2  =======================

Same as test 4, playing with the number of queues (see the loader tunable sketched below).

5 queues, same traffic level:
       1.5M  225k     0       114M       1.5M     0        99M     0
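The queue count is changed via the loader tunable already present in loader.conf below, e.g.:

# loader.conf: run ix(4) with 5 queues instead of 11 (takes effect on reboot)
hw.ixgbe.num_queues="5"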

================= Test 4.3  =======================

Same as test 4, HT on, number of queues = 16

             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       2.4M     0     0       157M       2.4M     0       156M     0
       2.4M     0     0       156M       2.4M     0       157M     0

However, enabling the firewall immediately drops the rate to 1.9 Mpps, which is nearly the same as in test 4.1 (and a complicated firewall ruleset would probably overload the HT cores much faster).

================= Test 4.4  =======================

Same as test 4, with the kernel 'ix0 que' Tx taskqueue threads bound to specific CPUs (corresponding to the RX ithreads):
18:02 [0] test15# procstat -ak | grep ix0 | sort -nk 2
    12 100045 intr             irq256: ix0:que  <running>
     0 100046 kernel           ix0 que          <running>
    12 100047 intr             irq257: ix0:que  <running>
     0 100048 kernel           ix0 que          mi_switch sleepq_wait msleep_spin taskqueue_thread_loop fork_exit fork_trampoline
    12 100049 intr             irq258: ix0:que  <running>
..

test15# for i in `jot 12 0`; do cpuset -l $i -t $((100046+2*$i)); done
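(The loop assumes the kernel{ix0 que} taskqueue TIDs are every second TID starting at 100046, as in the procstat output above, and pins each one to the CPU number matching its queue.)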

Result:
             input          (ix0)           output
    packets  errs idrops      bytes    packets  errs      bytes colls
       2.1M     0     0       139M         2M     0       193M     0
       2.1M     0     0       139M       2.3M     0       139M     0
       2.1M     0     0       139M       2.1M     0        85M     0
       2.1M     0     0       139M       2.1M     0       193M     0

A quite considerable increase; however, this only helps when traffic is distributed evenly across the queues.


================= Test 5  =======================
Same as test 4, make radix use rmlock (r234648, r234649).

Result: 1.7 MPPS.


================= Test 6  =======================
Same as test 4 + FLOWTABLE

Result: 1.7 MPPS.


================= Test 7  =======================
Same as test 4, built with GCC 4.7.

Result: No performance gain


Further investigations:

================= Test 8  =======================
Test 4 setup, with the kernel built with LOCK_PROFILING.

17:46 [0] test15# sysctl debug.lock.prof.enable=1 ; sleep 2 ; sysctl debug.lock.prof.enable=0

       920k     0     0        59M       920k     0        59M     0
       875k     0     0        59M       920k     0        59M     0
       628k     0     0        39M       566k     0        45M     0
        79k  2.7M     0       186M        57k     0       6.5M     0
        71k  878k     0        61M        73k     0       4.0M     0
       891k  254k     0        72M       917k     0        54M     0
       920k     0     0        59M       920k     0        59M     0


When profiling is enabled, forwarding performance goes down to ~60 kpps.
Profiling was enabled for 2 seconds (so roughly 130k packets were forwarded); the results are attached as a separate file. Several hundred lock contentions in ixgbe, that's all.
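The per-lock statistics (the prof_stats.txt file below) were dumped roughly like this (a sketch):

# dump the lock profiling counters collected during the 2-second window
sysctl debug.lock.prof.stats > prof_stats.txt
sysctl debug.lock.prof.reset=1            # clear counters for the next run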

================= Test 9  =======================
Same as test 4 setup with hwpmc.
Results attached.
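The profile was collected with pmcstat along these lines (a sketch; the sampling event and file names are my assumptions):

# sample the kernel with hwpmc for ~10s and post-process into a
# gprof-style callgraph (event name and file names are assumptions)
pmcstat -S instructions -O /tmp/samples.out sleep 10
pmcstat -R /tmp/samples.out -G kernel.gprof.txt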

================= Test 10  =======================
Kernel: FreeBSD-9-S.
No major difference.


Some (my) preliminary conclusions:
1) The rte mtx_lock should (and can) be eliminated from the stock kernel (and it can be done more or less easily for in_matroute()).
2) The rmlock vs. rwlock performance difference is insignificant (maybe because of 3)).
3) There is lock contention between the ixgbe taskqueue threads and the ithreads. I'm not sure the taskqueue threads are necessary at all for packet forwarding (as opposed to traffic generation).


Maybe I'm missing something else (L2 cache misses or other things)?

What else can I do to debug this further?



Relevant files:
http://static.ipfw.ru/files/fbsd10g/0001-no-rt-mutex.patch
http://static.ipfw.ru/files/fbsd10g/kernel.gprof.txt
http://static.ipfw.ru/files/fbsd10g/prof_stats.txt

============= CONFIGS ====================

sysctl.conf:
kern.ipc.maxsockbuf=33554432
net.inet.udp.maxdgram=65535
net.inet.udp.recvspace=16777216
net.inet.tcp.sendbuf_auto=0
net.inet.tcp.recvbuf_auto=0
net.inet.tcp.sendspace=16777216
net.inet.tcp.recvspace=16777216
net.inet.ip.maxfragsperpacket=64


kern.random.sys.harvest.ethernet=0
kern.random.sys.harvest.point_to_point=0
kern.random.sys.harvest.interrupt=0


net.inet.ip.forwarding=1
net.inet.ip.fastforwarding=1
net.inet.ip.redirect=0

hw.intr_storm_threshold=20000

loader.conf:
kern.ipc.nmbclusters="512000"
ixgbe_load="YES"
hw.ixgbe.rx_process_limit="300"
hw.ixgbe.nojumbobuf="1"
hw.ixgbe.max_loop="100"
hw.ixgbe.max_interrupt_rate="20000"
hw.ixgbe.num_queues="11"


hw.ixgbe.txd=4096
hw.ixgbe.rxd=4096

kern.hwpmc.nbuffers=2048

debug.debugger_on_panic=1
net.inet.ip.fw.default_to_accept=1


kernel:
cpu HAMMER

ident           CORE_RELENG_7
options COMPAT_IA32

makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols

options         SCHED_ULE               # ULE scheduler
options         PREEMPTION              # Enable kernel thread preemption
options         INET                    # InterNETworking
options         INET6                   # IPv6 communications protocols
options         SCTP                    # Stream Control Transmission Protocol
options         FFS                     # Berkeley Fast Filesystem
options         SOFTUPDATES             # Enable FFS soft updates support
options         UFS_ACL                 # Support for access control lists
options         UFS_DIRHASH             # Improve performance on big directories
options         UFS_GJOURNAL            # Enable gjournal-based UFS journaling
options         MD_ROOT                 # MD is a potential root device
options         PROCFS                  # Process filesystem (requires PSEUDOFS)
options         PSEUDOFS                # Pseudo-filesystem framework
options         GEOM_PART_GPT           # GUID Partition Tables.
options         GEOM_LABEL              # Provides labelization
options         COMPAT_43TTY            # BSD 4.3 TTY compat [KEEP THIS!]
options         COMPAT_FREEBSD4         # Compatible with FreeBSD4
options         COMPAT_FREEBSD5         # Compatible with FreeBSD5
options         COMPAT_FREEBSD6         # Compatible with FreeBSD6
options         COMPAT_FREEBSD7         # Compatible with FreeBSD7
options COMPAT_FREEBSD32
options         SCSI_DELAY=4000         # Delay (in ms) before probing SCSI
options         KTRACE                  # ktrace(1) support
options         STACK                   # stack(9) support
options         SYSVSHM                 # SYSV-style shared memory
options         SYSVMSG                 # SYSV-style message queues
options         SYSVSEM                 # SYSV-style semaphores
options         _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options         KBD_INSTALL_CDEV        # install a CDEV entry in /dev
options         AUDIT                   # Security event auditing
options         HWPMC_HOOKS
options         GEOM_MIRROR
options         MROUTING
options         PRINTF_BUFR_SIZE=100

# To make an SMP kernel, the next two lines are needed
options         SMP                     # Symmetric MultiProcessor Kernel

# CPU frequency control
device          cpufreq

# Bus support.
device          acpi
device          pci

device          ada
device          ahci

# SCSI Controllers
device          ahd             # AHA39320/29320 and onboard AIC79xx devices
options         AHD_REG_PRETTY_PRINT    # Print register bitfields in debug
                                         # output.  Adds ~215k to driver.
device          mpt             # LSI-Logic MPT-Fusion
# SCSI peripherals
device          scbus           # SCSI bus (required for SCSI)
device          da              # Direct Access (disks)
device          pass            # Passthrough device (direct SCSI access)
device          ses             # SCSI Environmental Services (and SAF-TE)

# RAID controllers
device          mfi             # LSI MegaRAID SAS

# atkbdc0 controls both the keyboard and the PS/2 mouse
device          atkbdc          # AT keyboard controller
device          atkbd           # AT keyboard
device          psm             # PS/2 mouse

device          kbdmux          # keyboard multiplexer

device          vga             # VGA video card driver

device          splash          # Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device          sc

device          agp             # support several AGP chipsets

## Power management support (see NOTES for more options)
#device         apm
## Add suspend/resume support for the i8254.
#device         pmtimer

# Serial (COM) ports
#device         sio             # 8250, 16[45]50 based serial ports
device          uart            # Generic UART driver

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to sio, uart and/or ppc drivers):
#device         puc

# PCI Ethernet NICs.
device          em              # Intel PRO/1000 adapter Gigabit Ethernet Card
device          bce
#device         ixgb            # Intel PRO/10GbE Ethernet Card
#device         ixgbe

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device          miibus          # MII bus support

# Pseudo devices.
device          loop            # Network loopback
device          random          # Entropy device
device          ether           # Ethernet support
device          pty             # Pseudo-ttys (telnet etc)
device          md              # Memory "disks"
device          firmware        # firmware assist module
device          lagg

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device          bpf             # Berkeley packet filter

# USB support
device          uhci            # UHCI PCI->USB interface
device          ohci            # OHCI PCI->USB interface
device          ehci            # EHCI PCI->USB interface (USB 2.0)
device          usb             # USB Bus (required)
#device         udbp            # USB Double Bulk Pipe devices
device          uhid            # "Human Interface Devices"
device          ukbd            # Keyboard
device          umass           # Disks/Mass storage - Requires scbus and da
device          ums             # Mouse
# USB Serial devices
device          ucom            # Generic com ttys


options         INCLUDE_CONFIG_FILE

options         KDB
options         KDB_UNATTENDED
options         DDB
options         ALT_BREAK_TO_DEBUGGER

options         IPFIREWALL              #firewall
options         IPFIREWALL_FORWARD      #packet destination changes
options         IPFIREWALL_VERBOSE      #print information about
                                         # dropped packets
options         IPFIREWALL_VERBOSE_LIMIT=10000    #limit verbosity

# MRT support
options         ROUTETABLES=16

device          vlan                    #VLAN support

# Size of the kernel message buffer.  Should be N * pagesize.
options         MSGBUF_SIZE=4096000


options         SW_WATCHDOG
options         PANIC_REBOOT_WAIT_TIME=4

#
# Hardware watchdog timers:
#
# ichwd: Intel ICH watchdog timer
#
#device          ichwd

device          smbus
device          ichsmb
device          ipmi




--
WBR, Alexander
