On Mon, Apr 13, 2026 at 11:44:38AM +0200, [email protected] wrote:
> From: Jesper Dangaard Brouer <[email protected]>
>
> Add a selftest that exercises veth's BQL (Byte Queue Limits) code path
> under sustained UDP load. The test creates a veth pair with GRO enabled
> (activating the NAPI path and BQL), attaches a qdisc, optionally loads
> iptables rules in the consumer namespace to slow NAPI processing, and
> floods UDP packets for a configurable duration.
>
> The test serves two purposes: benchmarking BQL's latency impact under
> configurable load (iptables rules, qdisc type and parameters), and
> detecting kernel BUG/Oops from DQL accounting mismatches. It monitors
> dmesg throughout the run and reports PASS/FAIL via kselftest (lib.sh).
>
> Diagnostic output is printed every 5 seconds:
> - BQL sysfs inflight/limit and watchdog tx_timeout counter
> - qdisc stats: packets, drops, requeues, backlog, qlen, overlimits
> - consumer PPS and NAPI-64 cycle time (shows fq_codel target impact)
> - sink PPS (per-period delta), latency min/avg/max (stddev at exit)
> - ping RTT to measure latency under load
>
> Generating enough traffic to fill the 256-entry ptr_ring requires care:
> the UDP sendto() path charges each SKB to sk_wmem_alloc, and the SKB
> stays charged (via sock_wfree destructor) until the consumer NAPI
> thread finishes processing it -- including any iptables rules in the
> receive path. With the default sk_sndbuf (~208KB from wmem_default),
> only ~93 packets can be in-flight before sendto(MSG_DONTWAIT) returns
> EAGAIN. Since 93 < 256 ring entries, the ring never fills and no
> backpressure occurs. The test raises wmem_max via sysctl and sets
> SO_SNDBUF=1MB on the flood socket to remove this bottleneck. An earlier
> multi-namespace routing approach avoided this limit because ip_forward
> creates new SKBs detached from the sender's socket.
>
> The --bql-disable option (sets limit_min=1GB) enables A/B comparison.
> Typical results with --nrules 6000 --qdisc-opts 'target 2ms interval 20ms':
>
>   fq_codel + BQL disabled: ping RTT ~10.8ms, 15% loss, 400KB in ptr_ring
>   fq_codel + BQL enabled:  ping RTT  ~0.6ms,  0% loss,   4KB in ptr_ring
>
> Both cases show identical consumer speed (~20Kpps) and fq_codel drops
> (~255K), proving the improvement comes purely from where packets buffer.
>
> BQL moves buffering from the ptr_ring into the qdisc, where AQM
> (fq_codel/CAKE) can act on it -- eliminating the "dark buffer" that
> hides congestion from the scheduler.
>
> The --qdisc-replace mode cycles through sfq/pfifo/fq_codel/noqueue
> under active traffic to verify that stale BQL state (STACK_XOFF) is
> properly handled during live qdisc transitions.
>
> A companion wrapper (veth_bql_test_virtme.sh) launches the test inside
> a virtme-ng VM, with .config validation to prevent silent stalls.
>
> Usage:
>   sudo ./veth_bql_test.sh [--duration 300] [--nrules 100]
>                           [--qdisc sfq] [--qdisc-opts '...']
>                           [--bql-disable] [--normal-napi]
>                           [--qdisc-replace]
>
> Signed-off-by: Jesper Dangaard Brouer <[email protected]>
> Tested-by: Jonas Köppeler <[email protected]>
> Tested-by: Breno Leitao <[email protected]>
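Nice changelog. For readers who want to reproduce the A/B numbers
outside the script, the moving parts reduce to a handful of commands.
Untested sketch below -- the device/namespace names and addresses
(veth0/veth1, "consumer", 10.0.0.0/24) are mine, not the script's:

  # veth pair, consumer side in its own namespace
  ip netns add consumer
  ip link add veth0 type veth peer name veth1
  ip link set veth1 netns consumer
  ip addr add 10.0.0.1/24 dev veth0
  ip netns exec consumer ip addr add 10.0.0.2/24 dev veth1
  ip link set veth0 up
  ip netns exec consumer ip link set veth1 up

  # GRO flips veth onto the NAPI path, which is where BQL applies
  ethtool -K veth0 gro on
  ip netns exec consumer ethtool -K veth1 gro on

  # AQM on the producer side, matching the changelog's example
  tc qdisc replace dev veth0 root fq_codel target 2ms interval 20ms

  # let the flood outrun the 256-entry ptr_ring: the default ~208KB
  # sk_sndbuf caps in-flight skbs at ~93 (~2.2KB truesize each)
  sysctl -w net.core.wmem_max=1048576  # sender also sets SO_SNDBUF=1MB

  # equivalent of --bql-disable: a huge limit_min means DQL never
  # pushes back, so the standing queue forms in the ptr_ring instead
  echo 1000000000 > \
      /sys/class/net/veth0/queues/tx-0/byte_queue_limits/limit_min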
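And the --qdisc-replace exercise is, roughly (same caveats, just a
sketch), swapping the root qdisc while the flood runs and then checking
dmesg for DQL splats:

  for q in sfq pfifo fq_codel noqueue; do
      tc qdisc replace dev veth0 root $q
      sleep 5
  done
  dmesg | grep -iE 'bug|oops|dql' || echo "no splats"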
>
> diff --git a/tools/testing/selftests/net/config b/tools/testing/selftests/net/config
> index 2a390cae41bf..7b1f41421145 100644
> --- a/tools/testing/selftests/net/config
> +++ b/tools/testing/selftests/net/config
> @@ -97,6 +97,7 @@ CONFIG_NET_PKTGEN=m
>  CONFIG_NET_SCH_ETF=m
>  CONFIG_NET_SCH_FQ=m
>  CONFIG_NET_SCH_FQ_CODEL=m
> +CONFIG_NET_SCH_SFQ=m

nit: This breaks the alphabetical ordering of the config file.
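i.e. the new symbol should slot into the sorted CONFIG_NET_SCH_* run
rather than sit right after FQ_CODEL. Purely illustrative -- the real
neighbours depend on what the file already carries:

   CONFIG_NET_SCH_FQ_CODEL=m
   ...
  +CONFIG_NET_SCH_SFQ=m
   (next CONFIG_NET_SCH_* entry that sorts after SFQ)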

