Good day to everyone,

I'm a happy PF user, and have been for over a decade now.  I'm writing
to ask some questions about performance now that I've got a system
that needs to handle some real traffic.  Over the years I've been
digging up various tweaks and settings from the archives (and
elsewhere), and I'd like to know which of those are still useful and
accurate, and which are "folklore".  Sorry for the length of the post,
but I hope that at the very least this thread will collect some
information where the searchbots can find it...

I've got a pair of 3GHz Celeron machines in a failover config.  Each
machine has 1GB RAM and four gigabit Intel (em) interfaces: one LAN,
one WAN, one pfsync, and one unused.  They're running 4.3 GENERIC
uniprocessor.  I intentionally went with a high-clock single-core
box because PF isn't multi-core capable.

The systems work great, but are chewing up about 60% of their time on
interrupts (~9000 interrupts/sec according to vmstat, with ~7500 of
those going to the LAN/WAN cards).  This is fine; everything is
working, and I knew high interrupt load was inevitable when I built
the system.  However, I need to ramp up the traffic on this system
soon (we're at 30Mbps / 3.5kpps right now), so I want to make sure I
can keep the load under control.
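
(For reference, here's how I'm collecting those numbers; both tools
are in the base system:)

  # interrupt totals and per-second rates, broken out by device
  vmstat -i

  # live view of interrupt rates and CPU time
  systat vmstat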

I know that the first thing I should do is upgrade to 4.6, which I
plan to do.  However, I'm looking for other "best practices", which
I've divided into two major sections below:


Interrupt Mitigation:
=====================

Since the system is under moderately heavy interrupt load, I'd like
to try to improve that if possible, since it seems that's the first
limit I'll hit on this system.  In the "Tuning OpenBSD" paper:

  http://www.openbsd.org/papers/tuning-openbsd.ps

they mention "sharing interrupts" on a high-load system.  If I
understand correctly, the theory is that if all my NICs are on the
same interrupt, the kernel can stay in the interrupt handler (no
context switch) and service all the NICs at once, rather than handling
each separately.  Am I understanding this right?  Should I try to lump
all (or some) of my NICs onto the same IRQ?  Or are there better
approaches (see below)?
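
For what it's worth, here's the quick check I've been using to see
which IRQ each NIC landed on (the grep pattern is just illustrative;
the irq number shows up in each em(4) attach line):

  # show the attach line (and IRQ) for each em interface
  dmesg | grep '^em'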

Several sources have suggested using APIC, which should be available
in non-ancient hardware.  I'm not sure if APIC replaces or complements
the suggestion above about interrupt sharing.  I checked my box, and
my dmesg didn't mention APIC, so I don't think I'm taking advantage
of it right now.  The -misc archives have oblique references to APIC
only being enabled on multiprocessor (MP) kernels rather than
uniprocessor (UP) ones.  Is this still true?  I also saw hints that
4.6 now has APIC on in UP by default.  Can anyone confirm or deny?
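
For the record, this is the check I did; I'm assuming that if the
I/O APIC were active, an ioapic device would show up in the attach
messages:

  # no output on my current UP kernel
  dmesg | grep -i apic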

Since PF isn't multi-core capable, I've assumed that UP is the way to
go for firewalls (and my machine isn't multicore anyway).  However,
I'm happy to run MP if there are side benefits like APIC that would
increase performance.
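
If MP really does buy me APIC, I assume trying it is just a matter of
booting the bsd.mp kernel that ships on the install media, something
like:

  # one-off test from the boot loader prompt
  boot> boot bsd.mp

  # or make it the default kernel
  cp /bsd /bsd.sp
  cp /bsd.mp /bsd

(Please correct me if a GENERIC.MP kernel needs more than that.)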

Next up, FreeBSD has been touting support for message-signaled
interrupts (MSI/MSI-X), claiming that this increases performance:

  http://onlamp.com/pub/a/bsd/2008/02/26/whats-new-in-freebsd-70.html?page=4

I'm not quite clear on whether this helps with a packet-forwarding
workload or not.  Is there support for this in OpenBSD, or would it
not really help anyway?


Sysctl Tweaks:
==============

I've been getting errors like:

  WARNING: mclpool limit reached; increase kern.maxclusters

So I did what it asked (I doubled the value to 12288), but am still
getting the error.  I've heard of people increasing this much more
(20x the default!), but also taunts of insanity for doing so:

  http://monkey.org/openbsd/archive/misc/0407/msg01521.html

So, what is a sane value for this?  Are there other causes that need
to be investigated when you get an "mclpool" warning, or should you
just keep cranking up the value?  Also, is there harm in going too
high (besides wasting memory)?
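
For completeness, here's exactly what I did, plus how I've been
watching cluster usage (netstat -m seems like the right way to see
how close to the limit the box actually runs):

  # check the current limit, then double it from the 6144 default
  sysctl kern.maxclusters
  sysctl kern.maxclusters=12288

  # persist across reboots
  echo 'kern.maxclusters=12288' >> /etc/sysctl.conf

  # show mbuf/cluster usage against the limit
  netstat -m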

Next, I've seen interface drops (ifq.drops != 0), so I've cranked up
ifq.maxlen to 256 * #nics (1024) per recommendations on -misc.  I
was still getting occasional drops, so I doubled to 2048, and am
holding steady there.  I've seen recommendations not to go beyond
2500; what should I be worried about in this case?  High latency?
Memory issues?  Do I really need to be worried about a few drops?
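
Again for completeness, here's what I'm running; the drops counter is
what I've been watching to decide when to bump the queue:

  # check the queue length and the drop counter
  sysctl net.inet.ip.ifq.maxlen net.inet.ip.ifq.drops

  # where I'm holding steady now
  sysctl net.inet.ip.ifq.maxlen=2048
  echo 'net.inet.ip.ifq.maxlen=2048' >> /etc/sysctl.conf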

Finally, as was mentioned on the list a few days ago, increasing
recvspace/sendspace doesn't help with a firewall (except for
locally-sourced connections) because it's just forwarding packets.
Just so I'm totally clear, is this true even in the case of packet
reassembly (scrub) and randomization, or do those features cause the
firewall to terminate and re-initiate connections that would benefit
from the buffers?
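
(The knobs I mean are these; as I understand it they only size the
socket buffers for TCP sessions that terminate on the firewall
itself:)

  sysctl net.inet.tcp.recvspace net.inet.tcp.sendspace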

For that matter, are there any protocol options that help performance
of a packet forwarding box (again, ignoring locally-sourced
connections)?  I'm thinking about buffers, default MSS, ECN, window
scaling, SACK, etc.  I know it doesn't hurt to turn them on, but am I
doing any good for the connections I'm forwarding?
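
To be concrete, these are the sysctls I have in mind (names as they
appear on my 4.3 box; worth double-checking on 4.6):

  sysctl net.inet.tcp.mssdflt    # default MSS
  sysctl net.inet.tcp.ecn        # ECN negotiation (off by default)
  sysctl net.inet.tcp.sack       # selective acknowledgements
  sysctl net.inet.tcp.rfc1323    # window scaling + timestamps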

Thanks for any input and advice you can provide; I'm looking forward
to using PF for another 10 years... =)

Jason

--
Jason Healy    |    jhe...@logn.net    |   http://www.logn.net/
