Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-05-08 Thread Paolo Abeni
Hi all,

I'm still banging my head against this item...

On Wed, 2018-04-18 at 09:44 -0700, John Fastabend wrote:
> There is a set of conditions
> that if met we can run without the lock. Possibly ONETXQUEUE and
> aligned cpu_map is sufficient. We could detect this case and drop
> the locking. For existing systems and high Gbps NICs I think (feel
> free to correct me) assuming a core per cpu is OK. At some point
> though we probably need to revisit this assumption.

I think we can get a measurable improvement by moving the
__QDISC_STATE_RUNNING bit fiddling from qdisc_restart() to around the
__qdisc_run() call in the 'lockless' path.

Currently, in the single-sender, pkt-rate-below-link-limit scenario we
hit the atomic bit overhead twice per xmitted packet: once for the
successful dequeue, plus once more for the next, failing, dequeue
attempt. With the wider scope we will hit it only once per packet.
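Roughly something like the following - an untested sketch just to show
where the bit would move; qdisc_restart() would lose its
test_and_set_bit()/clear_bit() pair, the flag and bit names are the
ones already used by John's patch, and qdisc_run_begin()/qdisc_run_end()
are the existing helpers for the locked case:

/* untested sketch: take the RUNNING bit once around the whole dequeue
 * loop for NOLOCK qdiscs, instead of once per qdisc_restart() call
 */
static inline void qdisc_run(struct Qdisc *q)
{
	if (q->flags & TCQ_F_NOLOCK) {
		if (test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
			return;	/* another CPU is already draining this qdisc */

		__qdisc_run(q);	/* qdisc_restart() loop, now bit-free */
		clear_bit(__QDISC_STATE_RUNNING, &q->state);
	} else if (qdisc_run_begin(q)) {
		__qdisc_run(q);
		qdisc_run_end(q);
	}
}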

After that change, __QDISC_STATE_RUNNING usage will look a bit like
qdisc_lock(), for the dequeue part at least. So I'm wondering if we
could replace __QDISC_STATE_RUNNING with spin_trylock(qdisc_lock())
_and_ keep that lock held for the whole qdisc_run()!?
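To make the question concrete, I mean something like this (sketch only;
whether sch_direct_xmit() can really keep root_lock held across the
driver xmit call is exactly what I'm unsure about):

/* sketch only: serialize the whole run with a trylock on the root lock
 * instead of the RUNNING bit; assumes sch_direct_xmit() no longer
 * drops/re-acquires root_lock around the HARD_TX_LOCK section
 */
static inline void qdisc_run(struct Qdisc *q)
{
	spinlock_t *root_lock = qdisc_lock(q);

	if (!spin_trylock(root_lock))
		return;		/* the current owner will drain the qdisc */

	__qdisc_run(q);		/* dequeue + ndo xmit with root_lock held */
	spin_unlock(root_lock);
}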

The comment above qdisc_restart() clearly states we can't, but I don't
see why. Acquiring qdisc_lock() and the xmit lock always in that order
looks safe to me. Can someone please explain? Is there some possible
deadlock condition I'm missing?

It looks like the comment itself comes directly from the pre-BitKeeper
era (modulo lock name changes).

Performance-wise, acquiring the qdisc_lock only once per xmitted packet
should considerably improve 'locked' qdisc performance, both in the
contended and in the uncontended scenario (and some quick experiments
seem to confirm that).

Thanks,

Paolo


Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-04-19 Thread Paolo Abeni
On Wed, 2018-04-18 at 09:44 -0700, John Fastabend wrote:
> Thanks for bringing this up. I'll think about it for a bit maybe
> there is something we can do here. There is a set of conditions
> that if met we can run without the lock. Possibly ONETXQUEUE and
> aligned cpu_map is sufficient. 

I think you mean "root qdisc is mq and aligned cpu_map": AFAICS we can
have ONETXQUEUE when the root qdisc is e.g. pfifo_fast, which would not
help here.

> We could detect this case and drop
> the locking. For existing systems and high Gbps NICs I think (feel
> free to correct me) assuming a core per cpu is OK. 

I'm sorry, I'm lost. Do you mean "a tx queue per core" instead?!

I'm unsure we can assume the above. In my experiments, at least in some
scenarios it's preferable to configure a limited number of rx/tx
queues, confine BH processing to the related cores and let user-space
processes run on the others, with a many-to-one relationship between
the cores "assigned" to user space and the cores "assigned" to BH
processing.

Can't we somehow try to leverage TCQ_F_CAN_BYPASS even with NOLOCK
qdiscs? I *think* we can avoid the qdisc_run() call after
sch_direct_xmit() in the bypass scenario, and that will avoid the
atomic ops blamed above.
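Something along these lines in __dev_xmit_skb() - a rough sketch under a
couple of assumptions: the helper name is made up, and the empty check
for a per-CPU backlog (plain qdisc_qlen() here) is glossed over:

/* hypothetical helper for __dev_xmit_skb(): with an empty NOLOCK qdisc,
 * hand the skb straight to the driver and skip the follow-up
 * qdisc_run(), so the uncontended fast path never touches the
 * RUNNING-bit atomics; returns true if the bypass was taken
 */
static bool nolock_try_bypass(struct sk_buff *skb, struct Qdisc *q,
			      struct net_device *dev,
			      struct netdev_queue *txq)
{
	if (!(q->flags & TCQ_F_CAN_BYPASS) || qdisc_qlen(q))
		return false;	/* backlogged: use the normal enqueue path */

	qdisc_bstats_cpu_update(q, skb);

	/* root_lock is NULL: NOLOCK qdiscs do not use it; intentionally no
	 * qdisc_run() afterwards - how to flush a skb requeued by a busy
	 * driver is the open point
	 */
	sch_direct_xmit(skb, q, dev, txq, NULL, true);
	return true;
}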

Cheers,

Paolo



Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-04-18 Thread John Fastabend
On 04/18/2018 12:28 AM, Paolo Abeni wrote:
> Hi,
> 
> let me revive this old thread...
> 
> On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
>> On 03/26/2018 10:30 AM, Cong Wang wrote:
>>> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>>>  wrote:
 After the qdisc lock was dropped in pfifo_fast we allow multiple
 enqueue threads and dequeue threads to run in parallel. On the
 enqueue side the skb bit ooo_okay is used to ensure all related
 skbs are enqueued in-order. On the dequeue side though there is
 no similar logic. What we observe is with fewer queues than CPUs
 it is possible to re-order packets when two instances of
 __qdisc_run() are running in parallel. Each thread will dequeue
 a skb and then whichever thread calls the ndo op first will
 be sent on the wire. This doesn't typically happen because
 qdisc_run() is usually triggered by the same core that did the
 enqueue. However, drivers will trigger __netif_schedule()
 when queues are transitioning from stopped to awake using the
 netif_tx_wake_* APIs. When this happens netif_schedule() calls
 qdisc_run() on the same CPU that did the netif_tx_wake_* which
 is usually done in the interrupt completion context. This CPU
 is selected with the irq affinity which is unrelated to the
 enqueue operations.
>>>
>>> Interesting. Why is this unique to pfifo_fast? For me it could
>>> happen to other qdiscs too, when we release the qdisc root
>>> lock in sch_direct_xmit(), another CPU could dequeue from
>>> the same qdisc and transmit the skb in parallel too?
>>>
>>
>> Agreed, my guess is it never happens because the timing is
>> tighter in the lock case. Or if it is happening, it's infrequent
>> enough that no one noticed the OOO packets.
> 
> I think the above could not happen due to the qdisc seqlock - which is
> not acquired by NOLOCK qdiscs.
> 

Yep, seems to be the case.

>> For net-next we probably could clean this up. I was just
>> going for something simple in net that didn't penalize all
>> qdiscs as Eric noted. This patch doesn't make it any worse
>> at least. And we have been living with the above race for
>> years.
> 
> I've benchmarked this patch in some different scenarios, and in my
> testing it introduces a measurable regression in uncontended/lightly
> contended scenarios. The measured peak negative delta is with a pktgen
> thread using "xmit_mode queue_xmit":
> 
> before: 27674032 pps
> after: 23809052 pps

Yeah more atomic ops :/

> 
> I spent some time searching for a way to improve this, without success.
> 
> John, did you have any chance to look at this again?
> 

If we have multiple cores pulling from the same skb list and
feeding the same txq, this happens. One problem is that even if the
normal dev_queue_xmit path is aligned, drivers call netif_schedule
from interrupt context and that happens on an arbitrary CPU. When
the arbitrary CPU runs the netif_schedule logic it will dequeue
from the skb list using the CPU it was scheduled on.

The lockless case is not _really_ lockless; after this patch, though,
we have managed to pull apart the enqueue and dequeue serialization.

Thanks for bringing this up. I'll think about it for a bit maybe
there is something we can do here. There is a set of conditions
that if met we can run without the lock. Possibly ONETXQUEUE and
aligned cpu_map is sufficient. We could detect this case and drop
the locking. For existing systems and high Gbps NICs I think (feel
free to correct me) assuming a core per cpu is OK. At some point
though we probably need to revisit this assumption.
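Roughly, the detection could look like this (sketch only;
TCQ_F_ONETXQUEUE is the real flag, the XPS helper below is made up):

/* sketch of the "can we skip the RUNNING bit" check: true only when the
 * qdisc feeds exactly one txq and the XPS map pins every CPU to its own
 * queue, so the enqueueing CPU is the only one that will ever dequeue
 */
static bool qdisc_dequeue_is_cpu_local(const struct net_device *dev,
				       const struct Qdisc *q)
{
	if (!(q->flags & TCQ_F_ONETXQUEUE))
		return false;

	/* hypothetical helper: true when the XPS cpu_map is a strict
	 * 1:1 cpu <-> txq mapping
	 */
	return netif_xps_map_is_aligned(dev);
}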

.John

> Thanks,
> 
> Paolo
> 



Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-04-18 Thread Paolo Abeni
Hi,

let me revive this old thread...

On Mon, 2018-03-26 at 11:16 -0700, John Fastabend wrote:
> On 03/26/2018 10:30 AM, Cong Wang wrote:
> > On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
> >  wrote:
> > > After the qdisc lock was dropped in pfifo_fast we allow multiple
> > > enqueue threads and dequeue threads to run in parallel. On the
> > > enqueue side the skb bit ooo_okay is used to ensure all related
> > > skbs are enqueued in-order. On the dequeue side though there is
> > > no similar logic. What we observe is with fewer queues than CPUs
> > > it is possible to re-order packets when two instances of
> > > __qdisc_run() are running in parallel. Each thread will dequeue
> > > a skb and then whichever thread calls the ndo op first will
> > > be sent on the wire. This doesn't typically happen because
> > > qdisc_run() is usually triggered by the same core that did the
> > > enqueue. However, drivers will trigger __netif_schedule()
> > > when queues are transitioning from stopped to awake using the
> > > netif_tx_wake_* APIs. When this happens netif_schedule() calls
> > > qdisc_run() on the same CPU that did the netif_tx_wake_* which
> > > is usually done in the interrupt completion context. This CPU
> > > is selected with the irq affinity which is unrelated to the
> > > enqueue operations.
> > 
> > Interesting. Why is this unique to pfifo_fast? For me it could
> > happen to other qdiscs too, when we release the qdisc root
> > lock in sch_direct_xmit(), another CPU could dequeue from
> > the same qdisc and transmit the skb in parallel too?
> > 
> 
> Agreed, my guess is it never happens because the timing is
> tighter in the lock case. Or if it is happening, it's infrequent
> enough that no one noticed the OOO packets.

I think the above could not happen due to the qdisc seqlock - which is
not acquired by NOLOCK qdiscs.

> For net-next we probably could clean this up. I was just
> going for something simple in net that didn't penalize all
> qdiscs as Eric noted. This patch doesn't make it any worse
> at least. And we have been living with the above race for
> years.

I've benchmarked this patch in some different scenarios, and in my
testing it introduces a measurable regression in uncontended/lightly
contended scenarios. The measured peak negative delta is with a pktgen
thread using "xmit_mode queue_xmit":

before: 27674032 pps
after: 23809052 pps

I spent some time searching for a way to improve this, without success.

John, did you have any chance to look at this again?

Thanks,

Paolo


Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-03-26 Thread John Fastabend
On 03/26/2018 10:30 AM, Cong Wang wrote:
> On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
>  wrote:
>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>> enqueue threads and dequeue threads to run in parallel. On the
>> enqueue side the skb bit ooo_okay is used to ensure all related
>> skbs are enqueued in-order. On the dequeue side though there is
>> no similar logic. What we observe is with fewer queues than CPUs
>> it is possible to re-order packets when two instances of
>> __qdisc_run() are running in parallel. Each thread will dequeue
>> a skb and then whichever thread calls the ndo op first will
>> be sent on the wire. This doesn't typically happen because
>> qdisc_run() is usually triggered by the same core that did the
>> enqueue. However, drivers will trigger __netif_schedule()
>> when queues are transitioning from stopped to awake using the
>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>> is usually done in the interrupt completion context. This CPU
>> is selected with the irq affinity which is unrelated to the
>> enqueue operations.
> 
> Interesting. Why is this unique to pfifo_fast? For me it could
> happen to other qdiscs too, when we release the qdisc root
> lock in sch_direct_xmit(), another CPU could dequeue from
> the same qdisc and transmit the skb in parallel too?
> 

Agreed, my guess is it never happens because the timing is
tighter in the lock case. Or if it is happening, it's infrequent
enough that no one noticed the OOO packets.

For net-next we probably could clean this up. I was just
going for something simple in net that didn't penalize all
qdiscs as Eric noted. This patch doesn't make it any worse
at least. And we have been living with the above race for
years.

> ...
> 
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 7e3fbe9..39c144b 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>>   */
>>  static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>  {
>> +   bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
>> spinlock_t *root_lock = NULL;
>> struct netdev_queue *txq;
>> struct net_device *dev;
>> struct sk_buff *skb;
>> -   bool validate;
>>
>> /* Dequeue packet */
>> +   if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
>> +   return false;
>> +
> 
> Nit: you probably want to move the comment below this if check,
> or simply remove it since it is useless...
> 

Hmm, I was planning to do a comment rewrite patch in net-next to bring
the comments in sch_generic.c up to date; I'll delete it there. I think
we can live with the extra line in net. Also, Eric pointed out that
qdisc_restart() is not really a good name anymore for this routine.

.John


Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-03-26 Thread Cong Wang
On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
 wrote:
> After the qdisc lock was dropped in pfifo_fast we allow multiple
> enqueue threads and dequeue threads to run in parallel. On the
> enqueue side the skb bit ooo_okay is used to ensure all related
> skbs are enqueued in-order. On the dequeue side though there is
> no similar logic. What we observe is with fewer queues than CPUs
> it is possible to re-order packets when two instances of
> __qdisc_run() are running in parallel. Each thread will dequeue
> a skb and then whichever thread calls the ndo op first will
> be sent on the wire. This doesn't typically happen because
> qdisc_run() is usually triggered by the same core that did the
> enqueue. However, drivers will trigger __netif_schedule()
> when queues are transitioning from stopped to awake using the
> netif_tx_wake_* APIs. When this happens netif_schedule() calls
> qdisc_run() on the same CPU that did the netif_tx_wake_* which
> is usually done in the interrupt completion context. This CPU
> is selected with the irq affinity which is unrelated to the
> enqueue operations.

Interesting. Why is this unique to pfifo_fast? For me it could
happen to other qdiscs too, when we release the qdisc root
lock in sch_direct_xmit(), another CPU could dequeue from
the same qdisc and transmit the skb in parallel too?

...

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 7e3fbe9..39c144b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>   */
>  static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>  {
> +   bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
> spinlock_t *root_lock = NULL;
> struct netdev_queue *txq;
> struct net_device *dev;
> struct sk_buff *skb;
> -   bool validate;
>
> /* Dequeue packet */
> +   if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> +   return false;
> +

Nit: you probably want to move the comment below this if check,
or simply remove it since it is useless...


Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-03-26 Thread John Fastabend
On 03/26/2018 09:36 AM, David Miller wrote:
> From: John Fastabend 
> Date: Sat, 24 Mar 2018 22:25:06 -0700
> 
>> After the qdisc lock was dropped in pfifo_fast we allow multiple
>> enqueue threads and dequeue threads to run in parallel. On the
>> enqueue side the skb bit ooo_okay is used to ensure all related
>> skbs are enqueued in-order. On the dequeue side though there is
>> no similar logic. What we observe is with fewer queues than CPUs
>> it is possible to re-order packets when two instances of
>> __qdisc_run() are running in parallel. Each thread will dequeue
>> a skb and then whichever thread calls the ndo op first will
>> be sent on the wire. This doesn't typically happen because
>> qdisc_run() is usually triggered by the same core that did the
>> enqueue. However, drivers will trigger __netif_schedule()
>> when queues are transitioning from stopped to awake using the
>> netif_tx_wake_* APIs. When this happens netif_schedule() calls
>> qdisc_run() on the same CPU that did the netif_tx_wake_* which
>> is usually done in the interrupt completion context. This CPU
>> is selected with the irq affinity which is unrelated to the
>> enqueue operations.
>>
>> To resolve this we add a RUNNING bit to the qdisc to ensure
>> only a single dequeue per qdisc is running. Enqueue and dequeue
>> operations can still run in parallel and also on multi queue
>> NICs we can still have a dequeue in-flight per qdisc, which
>> is typically per CPU.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakob Unterwurzacher 
>> 
>> Signed-off-by: John Fastabend 
> 
> Applied, thanks John.
> 

Great. Also, an off-list email from Jakob (I forgot to add him to the
CC list here, oops) told me to add:

Tested-by: Jakob Unterwurzacher 

Also, in net-next I'll look to see if we can avoid doing the extra
atomics, especially in cases where they are not actually needed, for
example the 1:1 qdisc-to-txq mappings. It seems a bit too invasive
for net, though.

Finally, just an FYI: I think I'll look at a distributed counter
soon so we can get a lockless token bucket. I need the counter for
BPF as well, so it's coming soon.

Thanks,
John


Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-03-26 Thread David Miller
From: John Fastabend 
Date: Sat, 24 Mar 2018 22:25:06 -0700

> After the qdisc lock was dropped in pfifo_fast we allow multiple
> enqueue threads and dequeue threads to run in parallel. On the
> enqueue side the skb bit ooo_okay is used to ensure all related
> skbs are enqueued in-order. On the dequeue side though there is
> no similar logic. What we observe is with fewer queues than CPUs
> it is possible to re-order packets when two instances of
> __qdisc_run() are running in parallel. Each thread will dequeue
> a skb and then whichever thread calls the ndo op first will
> be sent on the wire. This doesn't typically happen because
> qdisc_run() is usually triggered by the same core that did the
> enqueue. However, drivers will trigger __netif_schedule()
> when queues are transitioning from stopped to awake using the
> netif_tx_wake_* APIs. When this happens netif_schedule() calls
> qdisc_run() on the same CPU that did the netif_tx_wake_* which
> is usually done in the interrupt completion context. This CPU
> is selected with the irq affinity which is unrelated to the
> enqueue operations.
> 
> To resolve this we add a RUNNING bit to the qdisc to ensure
> only a single dequeue per qdisc is running. Enqueue and dequeue
> operations can still run in parallel and also on multi queue
> NICs we can still have a dequeue in-flight per qdisc, which
> is typically per CPU.
> 
> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
> Reported-by: Jakob Unterwurzacher 
> Signed-off-by: John Fastabend 

Applied, thanks John.


[net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-03-24 Thread John Fastabend
After the qdisc lock was dropped in pfifo_fast we allow multiple
enqueue threads and dequeue threads to run in parallel. On the
enqueue side the skb bit ooo_okay is used to ensure all related
skbs are enqueued in-order. On the dequeue side though there is
no similar logic. What we observe is with fewer queues than CPUs
it is possible to re-order packets when two instances of
__qdisc_run() are running in parallel. Each thread will dequeue
a skb and then whichever thread calls the ndo op first will
be sent on the wire. This doesn't typically happen because
qdisc_run() is usually triggered by the same core that did the
enqueue. However, drivers will trigger __netif_schedule()
when queues are transitioning from stopped to awake using the
netif_tx_wake_* APIs. When this happens netif_schedule() calls
qdisc_run() on the same CPU that did the netif_tx_wake_* which
is usually done in the interrupt completion context. This CPU
is selected with the irq affinity which is unrelated to the
enqueue operations.

To resolve this we add a RUNNING bit to the qdisc to ensure
only a single dequeue per qdisc is running. Enqueue and dequeue
operations can still run in parallel and also on multi queue
NICs we can still have a dequeue in-flight per qdisc, which
is typically per CPU.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakob Unterwurzacher 
Signed-off-by: John Fastabend 
---
 include/net/sch_generic.h |1 +
 net/sched/sch_generic.c   |   17 +
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2092d33..8da3267 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -30,6 +30,7 @@ struct qdisc_rate_table {
 enum qdisc_state_t {
__QDISC_STATE_SCHED,
__QDISC_STATE_DEACTIVATED,
+   __QDISC_STATE_RUNNING,
 };
 
 struct qdisc_size_table {
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 7e3fbe9..39c144b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  */
 static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 {
+   bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
spinlock_t *root_lock = NULL;
struct netdev_queue *txq;
struct net_device *dev;
struct sk_buff *skb;
-   bool validate;
 
/* Dequeue packet */
+   if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
+   return false;
+
skb = dequeue_skb(q, &validate, packets);
-   if (unlikely(!skb))
+   if (unlikely(!skb)) {
+   if (nolock)
+   clear_bit(__QDISC_STATE_RUNNING, &q->state);
return false;
+   }
 
-   if (!(q->flags & TCQ_F_NOLOCK))
+   if (!nolock)
root_lock = qdisc_lock(q);
 
dev = qdisc_dev(q);
txq = skb_get_tx_queue(dev, skb);
 
-   return sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+   more = sch_direct_xmit(skb, q, dev, txq, root_lock, validate);
+   if (nolock)
+   clear_bit(__QDISC_STATE_RUNNING, &q->state);
+   return more;
 }
 
 void __qdisc_run(struct Qdisc *q)