Dear David,

Thank you very much for your detailed answer! Now I have an explanation for some seemingly rather strange things. :-)

However, I have some further questions. Let me explain what I do now so that you can more clearly see the background.

I have recently enabled siitperf to use multiple IP addresses. (Siitperf is an IPv4, IPv6, SIIT, and stateful NAT64/NAT44 benchmarking tool implementing the measurements of RFC 2544, RFC 8219, and this draft: https://datatracker.ietf.org/doc/html/draft-ietf-bmwg-benchmarking-stateful .)

Currently I want to test (and demonstrate) the difference this improvement has made. I have already covered the stateless case by measuring the IPv4 and IPv6 packet forwarding performance of OpenBSD using:

1) the very same test frames, following the test frame format defined in the appendix of RFC 2544,
2) only pseudorandom port numbers, as required by RFC 4814 (no performance improvement compared to case 1),
3) pseudorandom IP addresses from specified ranges (significant performance improvement compared to case 1),
4) both pseudorandom IP addresses and port numbers (same results as in case 3).
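(To make clear what I mean by pseudorandom port numbers and IP addresses, here is a minimal Python sketch of how the varying header fields of cases 1) - 4) can be generated. It is only an illustration, not the actual siitperf code, which is written in C++ on top of DPDK; the networks, address ranges, port ranges, and fixed placeholder values below are made-up examples, the real ones are taken from the benchmarking setup.)

import random

# Illustrative example ranges only (the real ones come from the test setup):
SRC_NET = "198.18.0."          # example IPv4 test network, last octet is varied
DST_NET = "198.19.0."
IP_RANGE = (2, 254)            # pseudorandom last octet of the IP addresses
SPORT_RANGE = (1024, 65535)    # RFC 4814 style source port range
DPORT_RANGE = (1, 49151)       # RFC 4814 style destination port range

def frame_fields(vary_ips, vary_ports):
    """Return the varying header fields of one test frame."""
    src_ip = SRC_NET + (str(random.randint(*IP_RANGE)) if vary_ips else "2")
    dst_ip = DST_NET + (str(random.randint(*IP_RANGE)) if vary_ips else "2")
    sport = random.randint(*SPORT_RANGE) if vary_ports else 7   # fixed placeholder port
    dport = random.randint(*DPORT_RANGE) if vary_ports else 7
    return src_ip, dst_ip, sport, dport

# case 1): frame_fields(False, False)  -- identical test frames
# case 2): frame_fields(False, True)   -- pseudorandom port numbers only
# case 3): frame_fields(True, False)   -- pseudorandom IP addresses only
# case 4): frame_fields(True, True)    -- both are pseudorandom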

Many thanks to OpenBSD developers for enabling multi-core IP packet forwarding!

https://www.openbsd.org/plus72.html says: "Activated parallel IP forwarding, starting 4 softnet tasks but limiting the usage to the number of CPUs."

It is not a fundamental issue, but it seems to me that during my tests not only four but five CPU cores were used by IP packet forwarding:

load averages:  1.34,  0.35, 0.12                               dut.cntrg 20:10:15
36 processes: 35 idle, 1 on processor up 1 days 02:16:56
CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 6.1% intr, 93.7% idle
CPU01 states:  0.0% user,  0.0% nice, 55.8% sys,  7.2% spin, 5.2% intr, 31.9% idle
CPU02 states:  0.0% user,  0.0% nice, 53.6% sys,  8.0% spin, 6.2% intr, 32.1% idle
CPU03 states:  0.0% user,  0.0% nice, 48.3% sys,  7.2% spin, 6.2% intr, 38.3% idle
CPU04 states:  0.0% user,  0.0% nice, 44.2% sys,  9.7% spin, 6.3% intr, 39.8% idle
CPU05 states:  0.0% user,  0.0% nice, 33.5% sys,  5.8% spin, 6.4% intr, 54.3% idle
CPU06 states:  0.0% user,  0.0% nice,  3.2% sys,  0.2% spin, 7.2% intr, 89.4% idle
CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin, 6.0% intr, 93.2% idle
CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 5.4% intr, 94.4% idle
CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 7.2% intr, 92.6% idle
CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 8.9% intr, 90.9% idle
CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 7.6% intr, 92.2% idle
CPU12 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin, 8.6% intr, 91.4% idle
CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin, 6.1% intr, 93.5% idle
CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 6.4% intr, 93.4% idle
CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.4% spin, 4.8% intr, 94.8% idle
Memory: Real: 34M/2041M act/tot Free: 122G Cache: 825M Swap: 0K/256M

The above output of the "top" command shows significant system load on CPU cores from CPU01 to CPU05.

*Has the number of softnet tasks been increased from 4 to 5?*

What is more crucial for me are the stateful NAT64 measurements with PF.

My stateful NAT64 measurements are as follows.

1. The maximum connection establishment rate test uses a binary search to find the highest rate at which all connections can be established through the stateful NAT64 gateway when every test frame creates a new connection.

2. The throughput test also uses a binary search to find the highest rate (called the throughput) at which all test frames are forwarded by the stateful NAT64 gateway using bidirectional traffic. (All test frames belong to already established connections. This test requires loading the connections into the connection tracking table of the stateful NAT64 gateway in a preliminary step, using a rate safely lower than the one determined by the maximum connection establishment rate test.)

Both tests need to be repeated multiple times to obtain statistically reliable results.
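(To make the elementary steps concrete, here is a minimal Python sketch of the binary search logic that both tests follow. It is only an illustration of the idea; run_trial() is a hypothetical placeholder for one elementary measurement step, i.e., offering traffic at the given frame rate for the trial duration and checking whether all connections were established / all frames were forwarded, and the rate limits and precision are made-up examples.)

def run_trial(rate):
    """Placeholder: perform one elementary step at 'rate' frames/s, return True if it passed."""
    raise NotImplementedError   # done by the Tester (siitperf) in reality

def binary_search(rate_min, rate_max, precision=1000):
    """Return the highest rate (frames/s) at which the elementary step still passes."""
    highest_passing = 0
    lo, hi = rate_min, rate_max
    while hi - lo > precision:
        mid = (lo + hi) // 2
        if run_trial(mid):
            highest_passing = mid
            lo = mid            # passed: try higher rates
        else:
            hi = mid            # failed: try lower rates
    return highest_passing

# The whole search is then repeated multiple times (e.g., 10 or 20 runs) and the
# results are summarized, e.g., by their median, to get statistically reliable figures.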

As for the seemingly deteriorating performance of PF, I now understand from your explanation that the "pfctl -F states" command does not immediately free the contents of the connection tracking table; it only marks the states as expired.

*Is there any way to completely delete its entire content?*

(E.g., under Linux, I can delete the connection tracking table of iptables or Jool by removing the appropriate kernel module.)

Of course, I can delete it by rebooting the server. However, currently I use a Dell PowerEdge R730 server, and its complete reboot (including shutting down OpenBSD, hardware initialization, booting OpenBSD, and some spare time) takes 5 minutes. This is way too much overhead if I have to do it between every single elementary step (that is, between the steps of the binary search), each of which takes on the order of 1 minute. :-(

(Currently I use the compromise of rebooting the OpenBSD server after finishing each binary search.)

Thank you very much in advance for any further advice!

Best regards,

Gábor

On 8/29/2023 12:01 AM, David Gwynne wrote:
On Mon, Aug 28, 2023 at 01:46:32PM +0200, Gabor LENCSE wrote:
Hi Lyndon,

Sorry for my late reply. Please see my answers inline.

On 8/24/2023 11:13 PM, Lyndon Nerenberg (VE7TFX/VE6BBM) wrote:
Gabor LENCSE writes:

If you are interested, you can find the results in Tables 18 - 20 of
this (open access) paper: https://doi.org/10.1016/j.comcom.2023.08.009
Thanks for the pointer -- that's a very interesting paper.

After giving it a quick read through, one thing immediately jumps
out.  The paper mentions (section A.4) a boost in performance after
increasing the state table size limit.  Not having looked at the
relevant code, so I'm guessing here, but this is a classic indicator
of a hashing algorithm falling apart when the table gets close to
full.  Could it be that simple?  I need to go digging into the pf
code for a closer look.
Beware, I wrote it about iptables and not PF!

As for iptables, it is really so simple. I have done a deeper analysis of
iptables performance as the function of its hash table size. It is
documented in another (open access) paper:
http://doi.org/10.36244/ICJ.2023.1.6

However, I am not familiar with the internals of the other two tested
stateful NAT64 implementations, Jool and OpenBSD PF. I have no idea, what
kind of data structures they use for storing the connections.
openbsd uses a red-black tree to look up states. packets are parsed into
a key that looks up states by address family, ips, ipproto, ports, etc,
to find the relevant state. if a state isnt found, it falls through to
ruleset evaluation, which is notionally a linked list, but has been
optimised.

You also describe how the performance degrades over time.  This
exactly matches the behaviour we see.  Could the fix be as simple
as cranking 'set limit states' up to, say, two million?  There is
one way to find out ... :-)
As you could see, the highest number of connections was 40M, and the limit
of the states was set to 1000M. It worked well for me then with the PF of
OpenBSD 7.1.

It would be interesting to find the root cause of the phenomenon, why the
performance of PF seems to deteriorate with time. E.g., somehow the internal
data structures of PF become "polluted" if many connections are established
and then deleted?
my first guess is that you're starting to fight against the pf state
purge processing. pf tries to scan the entire state table every 10
seconds (by default) looking for expired states it can remove. this scan
process runs every second, but it tries to cover the whole state table
by 10 seconds. the more states you have the more time this takes, and
this increases linearly with the number of states you have.

until relatively recently (post 7.2), the scan and gc processing
effectively stopped the world. at work we run with about 2 million
states during business hours, and i was seeing the gc processing take up
approx 70ms a second, during which packet processing didnt really
happen.

now the scan can happen without blocking pf packet processing. it still
takes cpu time, so there is a point that processing packets and scanning
for states will fight each other for time, but at least they're not
fighting each other for locks now.

However, I have deleted the content of the state table after each elementary
measurement step using the "pfctl -F states" command. (I am sorry, this
command is missing from the paper, but it is there in my saved "del-pf"
file!)

Perhaps PF developers could advise us, if the deletion of the states
generate a fresh state table or not.
it marks the states as expired, and then the purge scan is able to take
them and actually free them.

Could anyone help us in this question?

Best regards,

Gábor




I use binary search to find the highest lossless rate (throughput).
Especially w


--lyndon
