On Tue, Apr 28, 2026 at 03:41:20PM +0200, Simon Schippers wrote:
> On 4/28/26 15:22, Michael S. Tsirkin wrote:
> > On Tue, Apr 28, 2026 at 03:10:44PM +0200, Simon Schippers wrote:
> >> On 4/28/26 14:50, Michael S. Tsirkin wrote:
> >>> On Tue, Apr 28, 2026 at 02:38:59PM +0200, Simon Schippers wrote:
> >>>> This commit prevents tail-drop when a qdisc is present and the
> >>>> ptr_ring becomes full. Once an entry is successfully produced and
> >>>> the ptr_ring reaches capacity, the netdev queue is stopped instead
> >>>> of dropping subsequent packets.
> >>>>
> >>>> If producing an entry fails anyway due to a race, tun_net_xmit
> >>>> returns NETDEV_TX_BUSY, again avoiding a drop. Such races are
> >>>> expected because LLTX is enabled and the transmit path operates
> >>>> without the usual locking.
> >>>>
> >>>> If no qdisc is present, the previous tail-drop behavior is
> >>>> preserved.
> >>>>
> >>>> The existing __tun_wake_queue() function of the consumer races with
> >>>> the producer for waking/stopping the netdev queue: the consumer may
> >>>> drain the ring just as the producer stops the queue, leading to a
> >>>> permanent stall. To avoid this, the producer re-checks the ring
> >>>> after stopping and wakes the queue itself if space was just made.
> >>>> An smp_mb__after_atomic() is required so the re-peek of the ring
> >>>> sees any drain that the consumer performed.
> >>>> smp_mb__after_atomic() pairs with the test_and_clear_bit() inside
> >>>> netif_wake_subqueue():
> >>>>
> >>>> Consumer CPU                 Producer CPU
> >>>> ========================     =========================
> >>>> __ptr_ring_consume()
> >>>> netif_wake_subqueue()        netif_tx_stop_queue()
> >>>>   /\                         smp_mb__after_atomic()
> >>>>   ||                         __ptr_ring_produce_peek()
> >>>> contains RMW operation
> >>>> test_and_clear_bit()
> >>>>   /\
> >>>>   ||
> >>>> "Fully ordered RMW:
> >>>>  smp_mb() before + after"
> >>>>   - atomic_t.txt
> >>>>
> >>>> Benchmarks:
> >>>> The benchmarks show a slight regression in raw transmission
> >>>> performance, though no packets are lost anymore.
> >>>
> >>> Could you include the packets received as well?
> >>> To demonstrate the gains/lack of loss.
> >>>
> >>
> >> Do you mean the number of packets received by the VM?
> >> They should just be the same as the number sent (shown below), right?
> >
> > Minus the loss? Which this is about, right?
>
> Yes. I simply calculated "Lost/s":
>
>   elapsed_time = 100e6 / sent_pps
>   Lost/s = total_errors / elapsed_time
>
> To get back total_errors, for example for TAP, 1 thread sending:
>
>   elapsed_time = 100e6 / 1.136 Mpps = 88 s
>
>   3.758 Mpps = total_errors / 88 s
>   <=> total_errors = 331 million packets
>
> So, out of 431 million packets sent, 100 million were successfully
> delivered and 331 million were lost.
That is my issue. I kind of have trouble mapping that to the table below.
For example:

| TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
|            +-------------+--------------+----------------+----------+
|            | Lost/s      | 3.758 Mpps   | 0 pps          |          |

How can the # of lost packets exceed the # of transmitted packets?

Thanks!

> >
> >> I assume they would be visible as RX-DRP for TAP.
> >> For TAP + vhost-net I would have to rewrite the XDP drop
> >> program to count the number of dropped packets...
> >> And I would have to automate it...
> >>
> >>>>
> >>>> The previously introduced threshold to only wake after the queue
> >>>> stopped and half of the ring was consumed proved to be a decent
> >>>> choice: waking the queue whenever a consume made space in the ring
> >>>> strongly degrades performance for tap, while waking only when the
> >>>> ring is empty is too late and also hurts throughput for tap &
> >>>> tap+vhost-net. Other ratios (3/4, 7/8) showed similar results (not
> >>>> shown here), so 1/2 was chosen for the sake of simplicity for both
> >>>> tun/tap and tun/tap+vhost-net.
> >>>>
> >>>> Test setup:
> >>>> AMD Ryzen 5 5600X at 4.3 GHz, 3200 MHz RAM, isolated QEMU threads;
> >>>> average over 50 runs @ 100,000,000 packets. SRSO and Spectre v2
> >>>> mitigations disabled.
> >>>>
> >>>> Note for tap+vhost-net:
> >>>> XDP drop program active in VM -> ~2.5x faster, slower for tap due
> >>>> to more syscalls (high utilization of entry_SYSRETQ_unsafe_stack
> >>>> in perf)
> >>>>
> >>>> +--------------------------+--------------+----------------+----------+
> >>>> | 1 thread                 | Stock        | Patched with   | diff     |
> >>>> | sending                  |              | fq_codel qdisc |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 1.136 Mpps   | 1.130 Mpps     | -0.6%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> |            | Lost/s      | 3.758 Mpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 3.858 Mpps   | 3.816 Mpps     | -1.1%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> | +vhost-net | Lost/s      | 789.8 Kpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>>
> >>>> +--------------------------+--------------+----------------+----------+
> >>>> | 2 threads                | Stock        | Patched with   | diff     |
> >>>> | sending                  |              | fq_codel qdisc |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 1.117 Mpps   | 1.087 Mpps     | -2.7%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> |            | Lost/s      | 8.476 Mpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>> | TAP        | Transmitted | 3.679 Mpps   | 3.464 Mpps     | -5.8%    |
> >>>> |            +-------------+--------------+----------------+----------+
> >>>> | +vhost-net | Lost/s      | 5.306 Mpps   | 0 pps          |          |
> >>>> +------------+-------------+--------------+----------------+----------+
> >>>>
> >>>> Co-developed-by: Tim Gebauer <[email protected]>
> >>>> Signed-off-by: Tim Gebauer <[email protected]>
> >>>> Signed-off-by: Simon Schippers <[email protected]>
> >>>> ---
> >>>>  drivers/net/tun.c | 30 ++++++++++++++++++++++++++++--
> >>>>  1 file changed, 28 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >>>> index efe809597622..c2a1618cc9db 100644
> >>>> --- a/drivers/net/tun.c
> >>>> +++ b/drivers/net/tun.c
> >>>> @@ -1011,6 +1011,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>  	struct netdev_queue *queue;
> >>>>  	struct tun_file *tfile;
> >>>>  	int len = skb->len;
> >>>> +	bool qdisc_present;
> >>>> +	int ret;
> >>>>
> >>>>  	rcu_read_lock();
> >>>>  	tfile = rcu_dereference(tun->tfiles[txq]);
> >>>> @@ -1065,13 +1067,37 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
> >>>>
> >>>>  	nf_reset_ct(skb);
> >>>>
> >>>> -	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> >>>> +	queue = netdev_get_tx_queue(dev, txq);
> >>>> +	qdisc_present = !qdisc_txq_has_no_queue(queue);
> >>>> +
> >>>> +	spin_lock(&tfile->tx_ring.producer_lock);
> >>>> +	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> >>>> +	if (__ptr_ring_produce_peek(&tfile->tx_ring) && qdisc_present) {
> >>>> +		netif_tx_stop_queue(queue);
> >>>> +		/* Re-peek and wake if the consumer drained the ring
> >>>> +		 * concurrently in a race. smp_mb__after_atomic() pairs
> >>>> +		 * with the test_and_clear_bit() of netif_wake_subqueue()
> >>>> +		 * in __tun_wake_queue().
> >>>> +		 */
> >>>> +		smp_mb__after_atomic();
> >>>> +		if (!__ptr_ring_produce_peek(&tfile->tx_ring))
> >>>> +			netif_tx_wake_queue(queue);
> >>>> +	}
> >>>> +	spin_unlock(&tfile->tx_ring.producer_lock);
> >>>> +
> >>>> +	if (ret) {
> >>>> +		/* If a qdisc is attached to our virtual device,
> >>>> +		 * returning NETDEV_TX_BUSY is allowed.
> >>>> +		 */
> >>>> +		if (qdisc_present) {
> >>>> +			rcu_read_unlock();
> >>>> +			return NETDEV_TX_BUSY;
> >>>> +		}
> >>>>  		drop_reason = SKB_DROP_REASON_FULL_RING;
> >>>>  		goto drop;
> >>>>  	}
> >>>>
> >>>>  	/* dev->lltx requires to do our own update of trans_start */
> >>>> -	queue = netdev_get_tx_queue(dev, txq);
> >>>>  	txq_trans_cond_update(queue);
> >>>>
> >>>>  	/* Notify and wake up reader process */
> >>>> --
> >>>> 2.43.0
> >>>
> >

