On 2021/4/6 18:13, Juergen Gross wrote:
> On 06.04.21 09:06, Michal Kubecek wrote:
>> On Tue, Apr 06, 2021 at 08:55:41AM +0800, Yunsheng Lin wrote:
>>>
>>> Hi, Jiri
>>> Do you have a reproducer that can be shared here?
>>> With a reproducer, I can debug and test it myself too.
>>
>> I'm afraid we are not aware of a simple reproducer. As mentioned in the
>> original discussion, the race window is extremely small and the other
>> thread has to do quite a lot in the meantime, which is probably why, as
>> far as I know, this was never observed on real hardware, only in
>> virtualization environments. NFS may also be important as, IIUC, it can
>> often issue an RPC request from a different CPU right after a data
>> transfer. Perhaps you could cheat a bit and insert a random delay
>> between the empty queue check and releasing q->seqlock to make it more
>> likely to happen.
>>
>> Other than that, it's rather just "run this complex software in a Xen
>> VM and wait".
>
> Being the one who has managed to reproduce the issue, I can share my
> setup; maybe you can set up something similar (we have seen the issue
> with this kind of setup on two different machines).
>
> I'm using a physical machine with 72 cpus and 48 GB of memory. It is
> running Xen as virtualization platform.
>
> Xen dom0 is limited to 40 vcpus and 32 GB of memory; the dom0 vcpus are
> limited to run on the first 40 physical cpus (no idea whether that
> matters, though).
>
> In a guest with 16 vcpus and 8 GB of memory I'm running 8 parallel
> sysbench instances in a loop. Those instances are prepared via:
>
>   sysbench --file-test-mode=rndrd --test=fileio prepare
>
> and then started in a do-while loop via:
>
>   sysbench --test=fileio --file-test-mode=rndrw --rand-seed=0 \
>            --max-time=300 --max-requests=0 run
>
> Each instance is using a dedicated NFS mount to run on.
> The NFS server for the 8 mounts is running in dom0 of the same server;
> the data of the NFS shares is located in a RAM disk (its size is a
> little bit above 16 GB). The shares are mounted in the guest with:
>
>   mount -t nfs \
>     -o rw,proto=tcp,nolock,nfsvers=3,rsize=65536,wsize=65536,nosharetransport \
>     dom0:/ramdisk/share[1-8] /mnt[1-8]
>
> The guest's vcpus are limited to run on physical cpus 40-55; on the
> same physical cpus I have 16 small guests running eating up cpu time,
> each of those guests pinned to one of the physical cpus 40-55.
>
> That's basically it. All you need to do is watch out for sysbench
> reporting maximum latencies above one second or so (in my setup there
> are latencies of several minutes at least once each hour of testing).
>
> In case you'd like to have some more details about the setup, don't
> hesitate to contact me directly. I can provide you with some scripts
> and config runes if you want.
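For reference, the per-instance run loop described above could be scripted roughly as below. This is only a dry-run sketch that prints the command each instance would run, since sysbench itself, the /mnt[1-8] mount points, and the instance count are specific to the setup described above.

```shell
#!/bin/sh
# Dry-run sketch of the 8-instance sysbench loop described above.
# Prints the per-instance command instead of executing it, because
# sysbench and the /mnt[1-8] NFS mounts are environment-specific.
NUM_INSTANCES=8
for i in $(seq 1 "$NUM_INSTANCES"); do
    echo "cd /mnt$i && sysbench --test=fileio --file-test-mode=rndrw" \
         "--rand-seed=0 --max-time=300 --max-requests=0 run"
done
```

In the real setup each of these commands would be launched in the background and restarted in a loop once its 300-second run completes.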
The setup is rather complex, so I just tried Michal's suggestion using the
below patch:

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -920,7 +920,7 @@ struct sk_buff {
 					*data;
 	unsigned int		truesize;
 	refcount_t		users;
-
+	unsigned long		enqueue_jiffies;
 #ifdef CONFIG_SKB_EXTENSIONS
 	/* only useable after checking ->active_extensions != 0 */
 	struct skb_ext		*extensions;
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 639e465..ba39b86 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -176,8 +176,14 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 static inline void qdisc_run_end(struct Qdisc *qdisc)
 {
 	write_seqcount_end(&qdisc->running);
-	if (qdisc->flags & TCQ_F_NOLOCK)
+	if (qdisc->flags & TCQ_F_NOLOCK) {
+		udelay(10000);
+		udelay(10000);
+		udelay(10000);
+		udelay(10000);
+		udelay(10000);
 		spin_unlock(&qdisc->seqlock);
+	}
 }
 
 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 49eae93..fcfae43 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -630,6 +630,8 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
 			return qdisc_drop_cpu(skb, qdisc, to_free);
 		else
 			return qdisc_drop(skb, qdisc, to_free);
+	} else {
+		skb->enqueue_jiffies = jiffies;
 	}
 
 	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
@@ -651,6 +653,13 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
 			skb = __skb_array_consume(q);
 	}
 	if (likely(skb)) {
+		unsigned int delay_ms;
+
+		delay_ms = jiffies_to_msecs(jiffies - skb->enqueue_jiffies);
+
+		if (delay_ms > 100)
+			netdev_err(qdisc_dev(qdisc), "delay: %u ms\n", delay_ms);
+
 		qdisc_update_stats_at_dequeue(qdisc, skb);
 	} else {
 		WRITE_ONCE(qdisc->empty, true);

Using the below shell script:

while((1))
do
	taskset -c 0 mz dummy1 -d 10000 -c 100 &
	taskset -c 1 mz dummy1 -d 10000 -c 100 &
	taskset -c 2 mz dummy1 -d 10000 -c 100 &
	sleep 3
done

And got the below log:

[   80.881716] hns3 0000:bd:00.0 eth2: delay: 176 ms
[   82.036564] hns3 0000:bd:00.0 eth2: delay: 296 ms
[   87.065820] hns3 0000:bd:00.0 eth2: delay: 320 ms
[   94.134174] dummy1: delay: 1588 ms
[   94.137570] dummy1: delay: 1580 ms
[   94.140963] dummy1: delay: 1572 ms
[   94.144354] dummy1: delay: 1568 ms
[   94.147745] dummy1: delay: 1560 ms
[   99.065800] hns3 0000:bd:00.0 eth2: delay: 264 ms
[  100.106174] dummy1: delay: 1448 ms
[  102.169799] hns3 0000:bd:00.0 eth2: delay: 436 ms
[  103.166221] dummy1: delay: 1604 ms
[  103.169617] dummy1: delay: 1604 ms
[  104.985806] dummy1: delay: 316 ms
[  105.113797] hns3 0000:bd:00.0 eth2: delay: 308 ms
[  107.289805] hns3 0000:bd:00.0 eth2: delay: 556 ms
[  108.912922] hns3 0000:bd:00.0 eth2: delay: 188 ms
[  137.241801] dummy1: delay: 30624 ms
[  137.245283] dummy1: delay: 30620 ms
[  137.248760] dumm1: delay: 30616 ms
[  137.252237] dummy1: delay: 30616 ms

It seems the problem can be easily reproduced using Michal's suggestion.
Will test and debug it using the above reproducer first.

>
> Juergen
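Pulling the worst cases out of a log like the one above is easy to script; a small sketch using awk is below. The 1000 ms threshold is arbitrary, and the here-doc holds sample lines copied from the log above; in practice you would feed it dmesg output instead.

```shell
#!/bin/sh
# Keep only the "delay: N ms" lines from the debug patch whose delay
# exceeds 1000 ms. Sample lines below are taken from the log above.
cat <<'EOF' >/tmp/qdisc_delays.log
[   94.134174] dummy1: delay: 1588 ms
[   99.065800] hns3 0000:bd:00.0 eth2: delay: 264 ms
[  137.241801] dummy1: delay: 30624 ms
EOF
# $(NF-1) is the millisecond value, the field just before the "ms" suffix.
awk '/delay:/ && $(NF-1) + 0 > 1000' /tmp/qdisc_delays.log
```

On the sample input this prints the 1588 ms and 30624 ms lines and drops the 264 ms one.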