https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=295043

            Bug ID: 295043
           Summary: pf: per-packet microuptime() calls from 0493260115db
                    cause major throughput regression with expensive
                    timecounters
           Product: Base System
           Version: 15.0-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: [email protected]
          Reporter: [email protected]

Commit 0493260115db ("pf: store state creation/expiration timestamps
with millisecond precision") replaced time_uptime reads with
microuptime() calls via pf_get_uptime() in pf's per-packet state expire
update path. On systems using HPET (common in KVM guests without ITSC),
this is extremely expensive.

https://github.com/freebsd/freebsd-src/commit/0493260115db10164246762d6d4923880575e529

Measurements (FreeBSD/KVM guest, iperf3, pf enabled with stateful rules):

  vtnet,  HPET:     ~2.3 Gbit/s
  vtnet,  TSC-low:  ~9   Gbit/s
  ixl passthrough + VLANs, HPET:    <1   Gbit/s
  ixl passthrough + VLANs, TSC-low: ~9   Gbit/s
  pf disabled, any timecounter:     ~9   Gbit/s

  kern.timecounter.choice: TSC-low(-100) i8254(0) ACPI-fast(900) HPET(950)
dummy(-1000000)

TSC-low is distrusted because the hypervisor doesn't expose ITSC to the
guest, which is common in many KVM configurations.

How to reproduce:

In a VM with pf enabled and stateful rules:

  sysctl kern.timecounter.hardware=HPET
  iperf3 -c <target>    # observe reduced throughput
  sysctl kern.timecounter.hardware=TSC-low
  iperf3 -c <target>    # throughput restored

Possible approaches:

The millisecond precision is primarily needed by pflow(4) at export
time, not per packet. Defaulting pf_get_uptime() to
(uint64_t)time_uptime * 1000 would preserve the data model changes
while eliminating the per-packet timecounter read. The precise
microuptime() path could be gated behind a sysctl for pflow users who
need it.

Alternatively, the patch could use getmicrouptime() instead of
microuptime(). The former returns a cached value without querying the
hardware timecounter, which is what the man page recommends when
execution time matters more than precision.

-- 
You are receiving this mail because:
You are the assignee for the bug.

Reply via email to