Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
Hi all,

I'm still banging my head on this item...

On Wed, 2018-04-18 at 09:44 -0700, John Fastabend wrote:
> There is a set of conditions
> that if met we can run without the lock. Possibly ONETXQUEUE and
> aligned cpu_map is sufficient. We could detect this case and drop
> the locking. For existing systems and high Gbps NICs I think (feel
> free to correct me) assuming a core per cpu is OK. At some point
> though we probably need to revisit this assumption.

I think we can improve measurably by moving the __QDISC_STATE_RUNNING
bit fiddling around the __qdisc_run() call in the 'lockless' path,
instead of keeping it inside qdisc_restart(). Currently, in the
single-sender, packet-rate-below-link-limit scenario, we hit the atomic
bit overhead twice per transmitted packet: once for the successful
dequeue, plus once more for the next, failing, dequeue attempt. With
the wider scope we would always hit it only once.

After that change, __QDISC_STATE_RUNNING usage will look a bit like
qdisc_lock(), at least for the dequeue part. So I'm wondering if we
could replace __QDISC_STATE_RUNNING with spin_trylock(qdisc_lock())
_and_ keep that lock held for the whole qdisc_run(). The comment above
qdisc_restart() clearly states we can't, but I don't see why: acquiring
qdisc_lock() and the xmit lock always in the given sequence looks safe
to me. Can someone please explain? Is there some possible deadlock
condition I'm missing? It looks like the comment itself comes directly
from the pre-BitKeeper era (modulo lock name changes).

Performance-wise, acquiring the qdisc_lock() only once per transmitted
packet should improve 'locked' qdisc performance considerably, both in
the contended and in the uncontended scenarios (and some quick
experiments seem to confirm that).

Thanks,

Paolo
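Concretely, the wider-scope idea could look something like the untested
sketch below; qdisc_run_nolock() is a hypothetical helper name, and it
assumes the per-packet bit handling has been removed from
qdisc_restart():

static inline void qdisc_run_nolock(struct Qdisc *q)
{
	/* Take __QDISC_STATE_RUNNING once around the whole dequeue loop,
	 * instead of once per packet inside qdisc_restart().
	 */
	if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
		return;		/* another CPU is already draining this qdisc */

	__qdisc_run(q);		/* dequeue until the backlog or quota is exhausted */

	clear_bit(__QDISC_STATE_RUNNING, &q->state);
}

With that shape, the single-sender case pays the atomic cost once per
qdisc_run() invocation rather than once per dequeue attempt.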
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
On Wed, 2018-04-18 at 09:44 -0700, John Fastabend wrote:
> Thanks for bringing this up. I'll think about it for a bit maybe
> there is something we can do here. There is a set of conditions
> that if met we can run without the lock. Possibly ONETXQUEUE and
> aligned cpu_map is sufficient.

I think you mean "root qdisc is mq and aligned cpu_map": AFAICS we can
have ONETXQUEUE when the root qdisc is e.g. pfifo_fast, which would not
help here.

> We could detect this case and drop
> the locking. For existing systems and high Gbps NICs I think (feel
> free to correct me) assuming a core per cpu is OK.

I'm sorry, I'm lost. Do you mean "a tx queue per core" instead? I'm
unsure we can assume the above. In my experiments, at least in some
scenarios, it's preferable to configure a limited number of rx/tx
queues, confine BH processing to the related cores, and let user-space
processes run on the others, with a many-to-one relationship between
the cores "assigned" to user space and the cores "assigned" to BH
processing.

Can't we somehow try to leverage TCQ_F_CAN_BYPASS even with a NOLOCK
qdisc? I *think* we could avoid the qdisc_run() call after
sch_direct_xmit() in the bypass scenario, and that would avoid the
blamed atomic ops above.

Cheers,

Paolo
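For concreteness, a rough and untested sketch of such a bypass in the
xmit path could look like the fragment below; qdisc_backlog_is_empty()
is a placeholder name, and validation/error handling is glossed over:

	/* Hypothetical CAN_BYPASS path for a NOLOCK qdisc: with nothing
	 * queued, grab the RUNNING bit, send the skb directly and return
	 * without calling qdisc_run(), so the per-packet atomic ops
	 * measured above are skipped entirely.
	 */
	if ((q->flags & TCQ_F_CAN_BYPASS) && (q->flags & TCQ_F_NOLOCK) &&
	    qdisc_backlog_is_empty(q) &&
	    !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
		sch_direct_xmit(skb, q, dev, txq, NULL, true);
		clear_bit(__QDISC_STATE_RUNNING, &q->state);
		return NET_XMIT_SUCCESS;
	}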
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
>
> let me revive this old thread...
>
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>> wrote:
>>>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>>>> enqueue threads and dequeue threads to run in parallel. On the
>>>> enqueue side the skb bit ooo_okay is used to ensure all related
>>>> skbs are enqueued in-order. On the dequeue side though there is
>>>> no similar logic. What we observe is with fewer queues than CPUs
>>>> it is possible to re-order packets when two instances of
>>>> __qdisc_run() are running in parallel. Each thread will dequeue
>>>> a skb and then whichever thread calls the ndo op first will have
>>>> its skb sent on the wire. This doesn't typically happen because
>>>> qdisc_run() is usually triggered by the same core that did the
>>>> enqueue. However, drivers will trigger __netif_schedule()
>>>> when queues are transitioning from stopped to awake using the
>>>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>>>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>>>> is usually done in the interrupt completion context. This CPU
>>>> is selected with the irq affinity which is unrelated to the
>>>> enqueue operations.
>>>
>>> Interesting. Why is this unique to pfifo_fast? To me it could
>>> happen to other qdiscs too: when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the lock case. Or if it is happening it's infrequent
>> enough that no one noticed the OOO packets.
>
> I think the above could not happen due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
>

Yep, seems to be the case.

>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs as Eric noted. This patch doesn't make it any worse
>> at least. And we have been living with the above race for
>> years.
>
> I've benchmarked this patch in some different scenarios, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
>
> before: 27674032 pps
> after: 23809052 pps

Yeah, more atomic ops :/

>
> I spent some time searching for a way to improve this, without success.
>
> John, did you have any chance to look at this again?
>

If we have multiple cores pulling from the same skb list and feeding
the same txq, this happens. One problem is that even if the normal
dev_queue_xmit path is aligned, drivers call netif_schedule from
interrupt context and that happens on an arbitrary cpu. When the
arbitrary cpu runs the netif_schedule logic it will dequeue from the
skb list using the cpu it was scheduled on.

The lockless case is not _really_ lockless after this patch; we have
managed to pull apart the enqueue and dequeue serialization, though.

Thanks for bringing this up. I'll think about it for a bit maybe
there is something we can do here. There is a set of conditions
that if met we can run without the lock. Possibly ONETXQUEUE and
aligned cpu_map is sufficient. We could detect this case and drop
the locking. For existing systems and high Gbps NICs I think (feel
free to correct me) assuming a core per cpu is OK. At some point
though we probably need to revisit this assumption.

.John

> Thanks,
>
> Paolo
>
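To make the "detect the aligned case and drop the locking" idea a bit
more concrete, here is a purely illustrative, untested sketch; the
helper names and the exact alignment test are assumptions, and a
runtime XPS map change is one obvious complication it ignores:

/* True only if this txq's qdisc is ever dequeued by a single CPU: an
 * mq root with at least one tx queue per CPU and a strict 1:1
 * CPU<->queue XPS mapping. In that case the __QDISC_STATE_RUNNING
 * atomics could in principle be skipped.
 */
static bool qdisc_dequeue_is_cpu_local(struct net_device *dev,
				       struct netdev_queue *txq)
{
	if (dev->num_tx_queues < nr_cpu_ids)
		return false;

	/* xps_map_is_one_to_one() is hypothetical: it would have to
	 * verify that every CPU maps to exactly one queue of its own,
	 * and that the mapping cannot change while packets are queued.
	 */
	return xps_map_is_one_to_one(dev, txq);
}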
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
Hi,

let me revive this old thread...

On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
> On 03/26/2018 10:30 AM, Cong Wang wrote:
> > On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
> > wrote:
> > > After the qdisc lock was dropped in pfifo_fast we allow multiple
> > > enqueue threads and dequeue threads to run in parallel. On the
> > > enqueue side the skb bit ooo_okay is used to ensure all related
> > > skbs are enqueued in-order. On the dequeue side though there is
> > > no similar logic. What we observe is with fewer queues than CPUs
> > > it is possible to re-order packets when two instances of
> > > __qdisc_run() are running in parallel. Each thread will dequeue
> > > a skb and then whichever thread calls the ndo op first will have
> > > its skb sent on the wire. This doesn't typically happen because
> > > qdisc_run() is usually triggered by the same core that did the
> > > enqueue. However, drivers will trigger __netif_schedule()
> > > when queues are transitioning from stopped to awake using the
> > > netif_tx_wake_* APIs. When this happens netif_schedule() calls
> > > qdisc_run() on the same CPU that did the netif_tx_wake_* which
> > > is usually done in the interrupt completion context. This CPU
> > > is selected with the irq affinity which is unrelated to the
> > > enqueue operations.
> >
> > Interesting. Why is this unique to pfifo_fast? To me it could
> > happen to other qdiscs too: when we release the qdisc root
> > lock in sch_direct_xmit(), another CPU could dequeue from
> > the same qdisc and transmit the skb in parallel too?
> >
>
> Agreed, my guess is it never happens because the timing is
> tighter in the lock case. Or if it is happening it's infrequent
> enough that no one noticed the OOO packets.

I think the above could not happen due to the qdisc seqlock - which is
not acquired by NOLOCK qdiscs.

> For net-next we probably could clean this up. I was just
> going for something simple in net that didn't penalize all
> qdiscs as Eric noted. This patch doesn't make it any worse
> at least. And we have been living with the above race for
> years.

I've benchmarked this patch in some different scenarios, and in my
testing it introduces a measurable regression in uncontended/lightly
contended scenarios. The measured peak negative delta is with a pktgen
thread using "xmit_mode queue_xmit":

before: 27674032 pps
after: 23809052 pps

I spent some time searching for a way to improve this, without success.

John, did you have any chance to look at this again?

Thanks,

Paolo
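For reference, the "qdisc seqlock" mentioned above is the per-qdisc
'running' seqcount that the locked dequeue path takes; it behaves
roughly like the helpers below (paraphrased from
include/net/sch_generic.h of this era, with details such as lockdep
annotations omitted). NOLOCK qdiscs skip these helpers, which is why
the patch at the bottom of this thread adds the __QDISC_STATE_RUNNING
bit instead.

static inline bool qdisc_is_running(const struct Qdisc *qdisc)
{
	return (raw_read_seqcount(&qdisc->running) & 1) ? true : false;
}

static inline bool qdisc_run_begin(struct Qdisc *qdisc)
{
	/* Behaves like a trylock: only one CPU at a time wins the right
	 * to dequeue from this qdisc.
	 */
	if (qdisc_is_running(qdisc))
		return false;
	raw_write_seqcount_begin(&qdisc->running);
	return true;
}

static inline void qdisc_run_end(struct Qdisc *qdisc)
{
	write_seqcount_end(&qdisc->running);
}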
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
On 03/26/2018 10:30 AM, Cong Wang wrote:
> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
> wrote:
>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>> enqueue threads and dequeue threads to run in parallel. On the
>> enqueue side the skb bit ooo_okay is used to ensure all related
>> skbs are enqueued in-order. On the dequeue side though there is
>> no similar logic. What we observe is with fewer queues than CPUs
>> it is possible to re-order packets when two instances of
>> __qdisc_run() are running in parallel. Each thread will dequeue
>> a skb and then whichever thread calls the ndo op first will have
>> its skb sent on the wire. This doesn't typically happen because
>> qdisc_run() is usually triggered by the same core that did the
>> enqueue. However, drivers will trigger __netif_schedule()
>> when queues are transitioning from stopped to awake using the
>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>> is usually done in the interrupt completion context. This CPU
>> is selected with the irq affinity which is unrelated to the
>> enqueue operations.
>
> Interesting. Why is this unique to pfifo_fast? To me it could
> happen to other qdiscs too: when we release the qdisc root
> lock in sch_direct_xmit(), another CPU could dequeue from
> the same qdisc and transmit the skb in parallel too?
>

Agreed, my guess is it never happens because the timing is
tighter in the lock case. Or if it is happening it's infrequent
enough that no one noticed the OOO packets.

For net-next we probably could clean this up. I was just
going for something simple in net that didn't penalize all
qdiscs as Eric noted. This patch doesn't make it any worse
at least. And we have been living with the above race for
years.

> ...
>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 7e3fbe9..39c144b 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc
>> *q,
>>  */
>>  static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>  {
>> +	bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
>>  	spinlock_t *root_lock = NULL;
>>  	struct netdev_queue *txq;
>>  	struct net_device *dev;
>>  	struct sk_buff *skb;
>> -	bool validate;
>>
>>  	/* Dequeue packet */
>> +	if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
>> +		return false;
>> +
>
> Nit: you probably want to move the comment below this if check,
> or simply remove it since it is useless...
>

Hmm, I was planning to do a comment rewrite patch to bring the comments
in sch_generic.c up to date in net-next; I'll delete it there. I think
we can live with the extra line in net.

Also Eric pointed out that qdisc_restart is not really a good name
anymore for this routine.

.John
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend wrote:
> After the qdisc lock was dropped in pfifo_fast we allow multiple
> enqueue threads and dequeue threads to run in parallel. On the
> enqueue side the skb bit ooo_okay is used to ensure all related
> skbs are enqueued in-order. On the dequeue side though there is
> no similar logic. What we observe is with fewer queues than CPUs
> it is possible to re-order packets when two instances of
> __qdisc_run() are running in parallel. Each thread will dequeue
> a skb and then whichever thread calls the ndo op first will have
> its skb sent on the wire. This doesn't typically happen because
> qdisc_run() is usually triggered by the same core that did the
> enqueue. However, drivers will trigger __netif_schedule()
> when queues are transitioning from stopped to awake using the
> netif_tx_wake_* APIs. When this happens netif_schedule() calls
> qdisc_run() on the same CPU that did the netif_tx_wake_* which
> is usually done in the interrupt completion context. This CPU
> is selected with the irq affinity which is unrelated to the
> enqueue operations.

Interesting. Why is this unique to pfifo_fast? To me it could
happen to other qdiscs too: when we release the qdisc root
lock in sch_direct_xmit(), another CPU could dequeue from
the same qdisc and transmit the skb in parallel too?

...

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 7e3fbe9..39c144b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc
> *q,
>  */
>  static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>  {
> +	bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
>  	spinlock_t *root_lock = NULL;
>  	struct netdev_queue *txq;
>  	struct net_device *dev;
>  	struct sk_buff *skb;
> -	bool validate;
>
>  	/* Dequeue packet */
> +	if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> +		return false;
> +

Nit: you probably want to move the comment below this if check,
or simply remove it since it is useless...
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
On 03/26/2018 09:36 AM, David Miller wrote:
> From: John Fastabend
> Date: Sat, 24 Mar 2018 22:25:06 -0700
>
>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>> enqueue threads and dequeue threads to run in parallel. On the
>> enqueue side the skb bit ooo_okay is used to ensure all related
>> skbs are enqueued in-order. On the dequeue side though there is
>> no similar logic. What we observe is with fewer queues than CPUs
>> it is possible to re-order packets when two instances of
>> __qdisc_run() are running in parallel. Each thread will dequeue
>> a skb and then whichever thread calls the ndo op first will have
>> its skb sent on the wire. This doesn't typically happen because
>> qdisc_run() is usually triggered by the same core that did the
>> enqueue. However, drivers will trigger __netif_schedule()
>> when queues are transitioning from stopped to awake using the
>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>> is usually done in the interrupt completion context. This CPU
>> is selected with the irq affinity which is unrelated to the
>> enqueue operations.
>>
>> To resolve this we add a RUNNING bit to the qdisc to ensure
>> only a single dequeue per qdisc is running. Enqueue and dequeue
>> operations can still run in parallel and also on multi queue
>> NICs we can still have a dequeue in-flight per qdisc, which
>> is typically per CPU.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakob Unterwurzacher
>> Signed-off-by: John Fastabend
>
> Applied, thanks John.
>

Great. Also, an off-list email from Jakob (I forgot to add him to the
CC list here, oops) told me to add:

Tested-by: Jakob Unterwurzacher

Also, in net-next I'll look to see if we can avoid doing the extra
atomics, especially in cases where they are not actually needed, for
example the 1:1 qdisc to txq mappings. It seems a bit invasive for net
though.

Finally, just an FYI, but I think I'll look at a distributed counter
soon so we can get a lockless token bucket. I need the counter for BPF
as well, so coming soon.

Thanks,
John
Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast
From: John Fastabend
Date: Sat, 24 Mar 2018 22:25:06 -0700

> After the qdisc lock was dropped in pfifo_fast we allow multiple
> enqueue threads and dequeue threads to run in parallel. On the
> enqueue side the skb bit ooo_okay is used to ensure all related
> skbs are enqueued in-order. On the dequeue side though there is
> no similar logic. What we observe is with fewer queues than CPUs
> it is possible to re-order packets when two instances of
> __qdisc_run() are running in parallel. Each thread will dequeue
> a skb and then whichever thread calls the ndo op first will have
> its skb sent on the wire. This doesn't typically happen because
> qdisc_run() is usually triggered by the same core that did the
> enqueue. However, drivers will trigger __netif_schedule()
> when queues are transitioning from stopped to awake using the
> netif_tx_wake_* APIs. When this happens netif_schedule() calls
> qdisc_run() on the same CPU that did the netif_tx_wake_* which
> is usually done in the interrupt completion context. This CPU
> is selected with the irq affinity which is unrelated to the
> enqueue operations.
>
> To resolve this we add a RUNNING bit to the qdisc to ensure
> only a single dequeue per qdisc is running. Enqueue and dequeue
> operations can still run in parallel and also on multi queue
> NICs we can still have a dequeue in-flight per qdisc, which
> is typically per CPU.
>
> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
> Reported-by: Jakob Unterwurzacher
> Signed-off-by: John Fastabend

Applied, thanks John.
[net PATCH v2] net: sched, fix OOO packets with pfifo_fast
After the qdisc lock was dropped in pfifo_fast we allow multiple
enqueue threads and dequeue threads to run in parallel. On the
enqueue side the skb bit ooo_okay is used to ensure all related
skbs are enqueued in-order. On the dequeue side though there is
no similar logic. What we observe is with fewer queues than CPUs
it is possible to re-order packets when two instances of
__qdisc_run() are running in parallel. Each thread will dequeue
a skb and then whichever thread calls the ndo op first will have
its skb sent on the wire. This doesn't typically happen because
qdisc_run() is usually triggered by the same core that did the
enqueue. However, drivers will trigger __netif_schedule()
when queues are transitioning from stopped to awake using the
netif_tx_wake_* APIs. When this happens netif_schedule() calls
qdisc_run() on the same CPU that did the netif_tx_wake_* which
is usually done in the interrupt completion context. This CPU
is selected with the irq affinity which is unrelated to the
enqueue operations.

To resolve this we add a RUNNING bit to the qdisc to ensure
only a single dequeue per qdisc is running. Enqueue and dequeue
operations can still run in parallel and also on multi queue
NICs we can still have a dequeue in-flight per qdisc, which
is typically per CPU.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakob Unterwurzacher
Signed-off-by: John Fastabend
---
 include/net/sch_generic.h |  1 +
 net/sched/sch_generic.c   | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2092d33..8da3267 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
 	__QDISC_STATE_SCHED,
 	__QDISC_STATE_DEACTIVATED,
+	__QDISC_STATE_RUNNING,
 };
 
 struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e3fbe9..39c144b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  */
 static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 {
+	bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
 	spinlock_t *root_lock = NULL;
 	struct netdev_queue *txq;
 	struct net_device *dev;
 	struct sk_buff *skb;
-	bool validate;
 
 	/* Dequeue packet */
+	if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+		return false;
+
 	skb = dequeue_skb(q, &validate, packets);
-	if (unlikely(!skb))
+	if (unlikely(!skb)) {
+		if (nolock)
+			clear_bit(__QDISC_STATE_RUNNING, &q->state);
 		return false;
+	}
 
-	if (!(q->flags & TCQ_F_NOLOCK))
+	if (!nolock)
 		root_lock = qdisc_lock(q);
 
 	dev = qdisc_dev(q);
 	txq = skb_get_tx_queue(dev, skb);
 
-	return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+	more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+	if (nolock)
+		clear_bit(__QDISC_STATE_RUNNING, &q->state);
+	return more;
 }
 
 void __qdisc_run(struct Qdisc *q)