Dear David,

Thank you very much for your detailed answer! Now I have an explanation for some seemingly rather strange things. :-)

However, I have some further questions. Let me explain what I do now so that you can more clearly see the background.

I have recently enabled siitperf to use multiple IP addresses. (Siitperf is an IPv4, IPv6, SIIT, and stateful NAT64/NAT44 benchmarking tool implementing the measurements of RFC 2544, RFC 8219, and this draft: https://datatracker.ietf.org/doc/html/draft-ietf-bmwg-benchmarking-stateful .)

Currently I want to test (and demonstrate) the difference this improvement has made. I have already covered the stateless case by measuring the IPv4 and IPv6 packet forwarding performance of OpenBSD using:

1) the very same test frames, following the test frame format defined in the appendix of RFC 2544,
2) only pseudorandom port numbers, as required by RFC 4814 (no performance improvement compared to case 1),
3) pseudorandom IP addresses from specified ranges (significant performance improvement compared to case 1),
4) both pseudorandom IP addresses and port numbers (same results as in case 3).
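(To make clear what I mean by pseudorandom port numbers and IP addresses, here is a minimal Python sketch of how the varying header fields of cases 1) - 4) can be generated. It is only an illustration, not the actual siitperf code, which is written in C++ on top of DPDK; the networks, address ranges, port ranges, and fixed placeholder values below are made-up examples, the real ones are taken from the benchmarking setup.)

import random

# Illustrative example ranges only (the real ones come from the test setup):
SRC_NET = "198.18.0."          # example IPv4 test network, last octet is varied
DST_NET = "198.19.0."
IP_RANGE = (2, 254)            # pseudorandom last octet of the IP addresses
SPORT_RANGE = (1024, 65535)    # RFC 4814 style source port range
DPORT_RANGE = (1, 49151)       # RFC 4814 style destination port range

def frame_fields(vary_ips, vary_ports):
    """Return the varying header fields of one test frame."""
    src_ip = SRC_NET + (str(random.randint(*IP_RANGE)) if vary_ips else "2")
    dst_ip = DST_NET + (str(random.randint(*IP_RANGE)) if vary_ips else "2")
    sport = random.randint(*SPORT_RANGE) if vary_ports else 7   # fixed placeholder port
    dport = random.randint(*DPORT_RANGE) if vary_ports else 7
    return src_ip, dst_ip, sport, dport

# case 1): frame_fields(False, False)  -- identical test frames
# case 2): frame_fields(False, True)   -- pseudorandom port numbers only
# case 3): frame_fields(True, False)   -- pseudorandom IP addresses only
# case 4): frame_fields(True, True)    -- both are pseudorandom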

Many thanks to OpenBSD developers for enabling multi-core IP packet forwarding!

https://www.openbsd.org/plus72.html says: "Activated parallel IP forwarding, starting 4 softnet tasks but limiting the usage to the number of CPUs."

It is not a fundamental issue, but it seems to me that during my tests not only four but five CPU cores were used by IP packet forwarding:

load averages:  1.34,  0.35, 0.12                               dut.cntrg 20:10:15
36 processes: 35 idle, 1 on processor up 1 days 02:16:56
CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 6.1% intr, 93.7% idle
CPU01 states:  0.0% user,  0.0% nice, 55.8% sys,  7.2% spin, 5.2% intr, 31.9% idle
CPU02 states:  0.0% user,  0.0% nice, 53.6% sys,  8.0% spin, 6.2% intr, 32.1% idle
CPU03 states:  0.0% user,  0.0% nice, 48.3% sys,  7.2% spin, 6.2% intr, 38.3% idle
CPU04 states:  0.0% user,  0.0% nice, 44.2% sys,  9.7% spin, 6.3% intr, 39.8% idle
CPU05 states:  0.0% user,  0.0% nice, 33.5% sys,  5.8% spin, 6.4% intr, 54.3% idle
CPU06 states:  0.0% user,  0.0% nice,  3.2% sys,  0.2% spin, 7.2% intr, 89.4% idle
CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin, 6.0% intr, 93.2% idle
CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 5.4% intr, 94.4% idle
CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 7.2% intr, 92.6% idle
CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 8.9% intr, 90.9% idle
CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 7.6% intr, 92.2% idle
CPU12 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin, 8.6% intr, 91.4% idle
CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin, 6.1% intr, 93.5% idle
CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 6.4% intr, 93.4% idle
CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin, 4.8% intr, 94.8% idle
Memory: Real: 34M/2041M act/tot Free: 122G Cache: 825M Swap: 0K/256M

The above output of the "top" command shows significant system load on CPU cores from CPU01 to CPU05.

*Has the number of softnet tasks been increased from 4 to 5?*

What is more crucial for me are the stateful NAT64 measurements with PF.

My stateful NAT64 measurements are as follows.

1. The maximum connection establishment rate test uses a binary search to find the highest rate at which all connections can be established through the stateful NAT64 gateway when every test frame creates a new connection.

2. The throughput test also uses a binary search to find the highest rate (called the throughput) at which all test frames are forwarded by the stateful NAT64 gateway using bidirectional traffic. (All test frames belong to already established connections. This test requires loading the connections into the connection tracking table of the stateful NAT64 gateway in a preliminary step, using a rate safely lower than the one determined by the maximum connection establishment rate test.)

Both tests need to be repeated multiple times to obtain statistically reliable results.
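(To make the elementary steps concrete, here is a minimal Python sketch of the binary search logic that both tests follow. It is only an illustration of the idea; run_trial() is a hypothetical placeholder for one elementary measurement step, i.e., offering traffic at the given frame rate for the trial duration and checking whether all connections were established / all frames were forwarded, and the rate limits and precision are made-up examples.)

def run_trial(rate):
    """Placeholder: perform one elementary step at 'rate' frames/s, return True if it passed."""
    raise NotImplementedError   # done by the Tester (siitperf) in reality

def binary_search(rate_min, rate_max, precision=1000):
    """Return the highest rate (frames/s) at which the elementary step still passes."""
    highest_passing = 0
    lo, hi = rate_min, rate_max
    while hi - lo > precision:
        mid = (lo + hi) // 2
        if run_trial(mid):
            highest_passing = mid
            lo = mid            # passed: try higher rates
        else:
            hi = mid            # failed: try lower rates
    return highest_passing

# The whole search is then repeated multiple times (e.g., 10 or 20 runs) and the
# results are summarized, e.g., by their median, to get statistically reliable figures.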

As for the seemingly deteriorating performance of PF, I now understand from your explanation that the "pfctl -F states" command does not immediately free the contents of the connection tracking table; it only marks the states as expired.

*Is there any way to completely delete its entire content?*

(E.g., under Linux, I can delete the connection tracking table of iptables or Jool by removing the appropriate kernel module.)

Of course, I can delete it by rebooting the server. However, currently I use a Dell PowerEdge R730 server, and its complete reboot (including shutting down OpenBSD, hardware initialization, booting OpenBSD, and some spare time) takes 5 minutes. This is way too much overhead if I have to do it between every single elementary step (that is, between the steps of the binary search), each of which takes on the order of 1 minute. :-(

(Currently I use the compromise of rebooting the OpenBSD server after finishing each binary search.)

Thank you very much in advance for any further advice!

Best regards,

Gábor

On 8/29/2023 12:01 AM, David Gwynne wrote:
On Mon, Aug 28, 2023 at 01:46:32PM +0200, Gabor LENCSE wrote:
Hi Lyndon,

Sorry for my late reply. Please see my answers inline.

On 8/24/2023 11:13 PM, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
Gabor LENCSE writes:

If you are interested, you can find the results in Tables 18 - 20 of
this (open access) paper: https://doi.org/10.1016/j.comcom.2023.08.009
Thanks for the pointer -- that's a very interesting paper.

After giving it a quick read through, one thing immediately jumps
out.  The paper mentions (section A.4) a boost in performance after
increasing the state table size limit.  Not having looked at the
relevant code, so I'm guessing here, but this is a classic indicator
of a hashing algorithm falling apart when the table gets close to
full.  Could it be that simple?  I need to go digging into the pf
code for a closer look.
Beware, I wrote it about iptables and not PF!

As for iptables, it is really so simple. I have done a deeper analysis of
iptables performance as the function of its hash table size. It is
documented in another (open access) paper:
http://doi.org/10.36244/ICJ.2023.1.6

However, I am not familiar with the internals of the other two tested
stateful NAT64 implementations, Jool and OpenBSD PF. I have no idea, what
kind of data structures they use for storing the connections.
openbsd uses a red-black tree to look up states. packets are parsed into
a key that looks up states by address family, ips, ipproto, ports, etc,
to find the relevant state. if a state isnt found, it falls through to
ruleset evaluation, which is notionally a linked list, but has been
optimised.

You also describe how the performance degrades over time.  This
exactly matches the behaviour we see.  Could the fix be as simple
as cranking 'set limit states' up to, say, two million?  There is
one way to find out ... :-)
As you could see, the highest number of connections was 40M, and the limit
of the states was set to 1000M. It worked well for me then with the PF of
OpenBSD 7.1.

It would be interesting to find the root cause of the phenomenon, why the
performance of PF seems to deteriorate with time. E.g., somehow the internal
data structures of PF become "polluted" if many connections are established
and then deleted?
my first guess is that you're starting to fight against the pf state
purge processing. pf tries to scan the entire state table every 10
seconds (by default) looking for expired states it can remove. this scan
process runs every second, but it tries to cover the whole state table
by 10 seconds. the more states you have the more time this takes, and
this increases linearly with the number of states you have.

until relatively recently (post 7.2), the scan and gc processing
effectively stopped the world. at work we run with about 2 million
states during business hours, and i was seeing the gc processing take up
approx 70ms a second, during which packet processing didnt really
happen.

now the scan can happen without blocking pf packet processing. it still
takes cpu time, so there is a point that processing packets and scanning
for states will fight each other for time, but at least they're not
fighting each other for locks now.

However, I have deleted the content of the state table after each elementary
measurement step using the "pfctl -F states" command. (I am sorry, this
command is missing from the paper, but it is there in my saved "del-pf"
file!)

Perhaps PF developers could advise us, if the deletion of the states
generate a fresh state table or not.
it marks the states as expired, and then the purge scan is able to take
them and actually free them.

Could anyone help us in this question?

Best regards,

Gábor




I use binary search to find the highest lossless rate (throughput).
Especially w


--lyndon
