Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Song Liu



> On Sep 24, 2018, at 8:30 AM, Eric Dumazet  wrote:
> 
> On Sun, Sep 23, 2018 at 10:04 PM David Miller  wrote:
>> 
>> Series applied, thanks Eric.
> 
> Thanks David.
> 
> Song, would you please test this additional patch ?
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
> {
>int work = 0;
> 
> -       /* net_rx_action's ->poll() invocations and our's are
> -        * synchronized by this test which is only made while
> -        * holding the napi->poll_lock.
> -        */
> -       if (!test_bit(NAPI_STATE_SCHED, &napi->state))
> -               return;
> -
>        /* If we set this bit but see that it has already been set,
>         * that indicates that napi has been disabled and we need
>         * to abort this operation


Reviewed-and-tested-by: Song Liu 

Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Michael Chan
On Tue, Sep 25, 2018 at 11:25 AM Song Liu  wrote:

>
> Hi Michael,
>
> This may not be related. But I am looking at this:
>
> bnxt_poll_work() {
>
>         while (1) {
>
>                 if (rx_pkts == budget)
>                         return
>         }
> }
>
> With budget of 0, the loop will terminate after processing one packet.
> But I think the expectation is to finish all tx packets. So it doesn't
> feel right. Could you please confirm?
>

Right, this in effect is processing only 1 TX packet so it will be
inefficient at least.

But I think fixing it here still will not fix all the issues, because
even if we process all the TX packets here, we may still miss some
that are in flight.  When we exit poll, netpoll may not call us back
again and there may be no interrupts because we don't ARM the IRQ when
budget of 0 is reached.  I will send a test patch shortly for review
and testing.  Thanks.
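
For reference, a minimal userspace model of the loop shape being discussed (this is not the bnxt driver code; the ring contents, event names and counts are invented for illustration). With budget == 0, the rx_pkts == budget test fires after the first completion entry, so the rest of the ring, including pending TX completions, is left unprocessed:

/* Userspace model of a combined TX/RX completion loop that exits
 * when rx_pkts == budget.  Illustrative only, not driver code.
 */
#include <stdio.h>

enum ev_type { EV_TX, EV_RX };

int main(void)
{
    enum ev_type ring[] = { EV_TX, EV_TX, EV_TX, EV_RX, EV_TX };
    int n = sizeof(ring) / sizeof(ring[0]);
    int budget = 0;           /* what netpoll passes for a TX-only drain */
    int rx_pkts = 0, tx_pkts = 0;

    for (int i = 0; i < n; i++) {
        if (ring[i] == EV_TX)
            tx_pkts++;
        else
            rx_pkts++;

        /* The check under discussion: with budget == 0 this is true
         * as soon as any entry has been processed, so the loop stops
         * after one completion and leaves the rest on the ring. */
        if (rx_pkts == budget)
            break;
    }

    printf("processed tx=%d rx=%d of %d entries (budget=%d)\n",
           tx_pkts, rx_pkts, n, budget);
    return 0;
}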


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Song Liu



> On Sep 25, 2018, at 7:43 AM, Michael Chan  wrote:
> 
> On Tue, Sep 25, 2018 at 7:20 AM Eric Dumazet  wrote:
>> 
>> On Tue, Sep 25, 2018 at 7:02 AM Michael Chan  
>> wrote:
>>> 
>>> On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
 
 
 
> On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> 
>> 
>> Interesting, maybe a bnxt specific issue.
>> 
>> It seems their model is to process TX/RX notification in the same queue,
>> they throw away RX events if budget == 0
>> 
>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>> must be reverted.
>> 
>> Otherwise, we have a possibility of blocking a queue under netpoll 
>> pressure.
> 
> Hmm, actually a revert might not be enough, since code at lines 2030-2031
> would fire and we might not call napi_complete_done() anyway.
> 
> Unfortunately this driver logic is quite complex.
> 
> Could you test on other NIC eventually ?
> 
 
 It actually runs OK on ixgbe.
 
 @Michael, could you please help us with this?
 
>>> I've taken a quick look using today's net tree plus Eric's
>>> poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
>>> bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
>>> return without calling napi_complete_done() and without arming the
>>> interrupt.  netpoll doesn't always call us back until we call
>>> napi_complete_done(), right?  So I think if there are in-flight TX
>>> completions, we'll miss those.
>> 
>> That's the whole point of netpoll :
>> 
>> We drain the TX queues, without interrupts being involved at all,
>> by calling ->napi() with a zero budget.
>> 
>> napi_complete(), even if called from ->napi() while budget was zero,
>> should do nothing but return early.
>> 
>> budget==0 means that ->napi() should process all TX completions.
> 
> All TX completions that we can see.  We cannot see the in-flight ones.
> 
> If budget is exceeded, I think the assumption is that poll will always
> be called again.
> 
>> 
>> So it looks like bnxt has a bug, that is showing up after the latest
>> poll_one_napi() patch.
>> This latest patch is needed otherwise the cpu attempting the
>> netpoll-TX-drain might drain nothing at all,
>> since it does not anymore call ndo_poll_controller() that was grabbing
>> SCHED bits on all queues (napi_schedule() like calls)
> 
> I think the latest patch is preventing the normal interrupt -> NAPI
> path from coming in and cleaning the remaining TX completions and
> arming the interrupt.

Hi Michael, 

This may not be related. But I am looking at this:

bnxt_poll_work() {

        while (1) {

                if (rx_pkts == budget)
                        return
        }
}

With budget of 0, the loop will terminate after processing one packet. 
But I think the expectation is to finish all tx packets. So it doesn't
feel right. Could you please confirm?

Thanks,
Song





Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Michael Chan
On Tue, Sep 25, 2018 at 7:20 AM Eric Dumazet  wrote:
>
> On Tue, Sep 25, 2018 at 7:02 AM Michael Chan  
> wrote:
> >
> > On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
> > >
> > >
> > >
> > > > On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> > > >
> > > >>
> > > >> Interesting, maybe a bnxt specific issue.
> > > >>
> > > >> It seems their model is to process TX/RX notification in the same 
> > > >> queue,
> > > >> they throw away RX events if budget == 0
> > > >>
> > > >> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> > > >> must be reverted.
> > > >>
> > > >> Otherwise, we have a possibility of blocking a queue under netpoll 
> > > >> pressure.
> > > >
> > > > Hmm, actually a revert might not be enough, since code at lines 
> > > > 2030-2031
> > > > would fire and we might not call napi_complete_done() anyway.
> > > >
> > > > Unfortunately this driver logic is quite complex.
> > > >
> > > > Could you test on other NIC eventually ?
> > > >
> > >
> > > It actually runs OK on ixgbe.
> > >
> > > @Michael, could you please help us with this?
> > >
> > I've taken a quick look using today's net tree plus Eric's
> > poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
> > bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
> > return without calling napi_complete_done() and without arming the
> > interrupt.  netpoll doesn't always call us back until we call
> > napi_complete_done(), right?  So I think if there are in-flight TX
> > completions, we'll miss those.
>
> That's the whole point of netpoll :
>
>  We drain the TX queues, without interrupts being involved at all,
> by calling ->napi() with a zero budget.
>
> napi_complete(), even if called from ->napi() while budget was zero,
> should do nothing but return early.
>
> budget==0 means that ->napi() should process all TX completions.

All TX completions that we can see.  We cannot see the in-flight ones.

If budget is exceeded, I think the assumption is that poll will always
be called again.

>
> So it looks like bnxt has a bug, that is showing up after the latest
> poll_one_napi() patch.
> This latest patch is needed otherwise the cpu attempting the
> netpoll-TX-drain might drain nothing at all,
> since it does not anymore call ndo_poll_controller() that was grabbing
> SCHED bits on all queues (napi_schedule() like calls)

I think the latest patch is preventing the normal interrupt -> NAPI
path from coming in and cleaning the remaining TX completions and
arming the interrupt.


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Eric Dumazet
On Tue, Sep 25, 2018 at 7:02 AM Michael Chan  wrote:
>
> On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
> >
> >
> >
> > > On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> > >
> > >>
> > >> Interesting, maybe a bnxt specific issue.
> > >>
> > >> It seems their model is to process TX/RX notification in the same queue,
> > >> they throw away RX events if budget == 0
> > >>
> > >> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> > >> must be reverted.
> > >>
> > >> Otherwise, we have a possibility of blocking a queue under netpoll 
> > >> pressure.
> > >
> > > Hmm, actually a revert might not be enough, since code at lines 2030-2031
> > > would fire and we might not call napi_complete_done() anyway.
> > >
> > > Unfortunately this driver logic is quite complex.
> > >
> > > Could you test on other NIC eventually ?
> > >
> >
> > It actually runs OK on ixgbe.
> >
> > @Michael, could you please help us with this?
> >
> I've taken a quick look using today's net tree plus Eric's
> poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
> bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
> return without calling napi_complete_done() and without arming the
> interrupt.  netpoll doesn't always call us back until we call
> napi_complete_done(), right?  So I think if there are in-flight TX
> completions, we'll miss those.

That's the whole point of netpoll :

 We drain the TX queues, without interrupts being involved at all,
by calling ->napi() with a zero budget.

napi_complete(), even if called from ->napi() while budget was zero,
should do nothing but return early.

budget==0 means that ->napi() should process all TX completions.

So it looks like bnxt has a bug, that is showing up after the latest
poll_one_napi() patch.
This latest patch is needed otherwise the cpu attempting the
netpoll-TX-drain might drain nothing at all,
since it does not anymore call ndo_poll_controller() that was grabbing
SCHED bits on all queues (napi_schedule() like calls)
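
A minimal userspace sketch of the poll structure described above, assuming the convention that TX completions are not charged against the budget while RX work is; the names model_poll, pending_tx and pending_rx are invented for illustration and this is not any driver's actual handler:

/* Userspace sketch: drain all TX regardless of budget, bound RX work
 * by the budget, and only "complete" when the budget was not used up.
 * Illustrative only.
 */
#include <stdbool.h>
#include <stdio.h>

static int pending_tx = 5;   /* pretend completions waiting on the ring */
static int pending_rx = 3;

static int model_poll(int budget, bool *completed)
{
    int rx_done = 0;

    /* TX completions are not charged against the budget, so a
     * budget of 0 (netpoll) still drains them all. */
    while (pending_tx > 0)
        pending_tx--;

    while (pending_rx > 0 && rx_done < budget) {
        pending_rx--;
        rx_done++;
    }

    /* Only re-arm / complete when the budget was not exhausted; with
     * budget == 0 we never complete here, which is fine for netpoll
     * because napi_complete_done() bails out under NPSVC anyway. */
    *completed = (rx_done < budget);
    return rx_done;
}

int main(void)
{
    bool completed;
    int done = model_poll(0, &completed);

    printf("rx_done=%d completed=%d pending_tx=%d pending_rx=%d\n",
           done, completed, pending_tx, pending_rx);
    return 0;
}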


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Michael Chan
On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
>
>
>
> > On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> >
> >>
> >> Interesting, maybe a bnxt specific issue.
> >>
> >> It seems their model is to process TX/RX notification in the same queue,
> >> they throw away RX events if budget == 0
> >>
> >> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> >> must be reverted.
> >>
> >> Otherwise, we have a possibility of blocking a queue under netpoll 
> >> pressure.
> >
> > Hmm, actually a revert might not be enough, since code at lines 2030-2031
> > would fire and we might not call napi_complete_done() anyway.
> >
> > Unfortunately this driver logic is quite complex.
> >
> > Could you test on other NIC eventually ?
> >
>
> It actually runs OK on ixgbe.
>
> @Michael, could you please help us with this?
>
I've taken a quick look using today's net tree plus Eric's
poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
return without calling napi_complete_done() and without arming the
interrupt.  netpoll doesn't always call us back until we call
napi_complete_done(), right?  So I think if there are in-flight TX
completions, we'll miss those.


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Song Liu



> On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> 
>> 
>> Interesting, maybe a bnxt specific issue.
>> 
>> It seems their model is to process TX/RX notification in the same queue,
>> they throw away RX events if budget == 0
>> 
>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>> must be reverted.
>> 
>> Otherwise, we have a possibility of blocking a queue under netpoll pressure.
> 
> Hmm, actually a revert might not be enough, since code at lines 2030-2031
> would fire and we might not call napi_complete_done() anyway.
> 
> Unfortunately this driver logic is quite complex.
> 
> Could you test on other NIC eventually ?
> 

It actually runs OK on ixgbe. 

@Michael, could you please help us with this? 

Thanks,
Song






Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Eric Dumazet
>
> Interesting, maybe a bnxt specific issue.
>
> It seems their model is to process TX/RX notification in the same queue,
> they throw away RX events if budget == 0
>
> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> must be reverted.
>
> Otherwise, we have a possibility of blocking a queue under netpoll pressure.

Hmm, actually a revert might not be enough, since code at lines 2030-2031
would fire and we might not call napi_complete_done() anyway.

Unfortunately this driver logic is quite complex.

Could you test on other NIC eventually ?

Thanks.


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Eric Dumazet
On Mon, Sep 24, 2018 at 1:00 PM Song Liu  wrote:
>
>
>
> > On Sep 24, 2018, at 12:41 PM, Eric Dumazet  wrote:
> >
> > On Mon, Sep 24, 2018 at 12:31 PM Song Liu  wrote:
> >
> >> This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
> >> tested other drivers yet).
> >>
> >> I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
> >> napi->poll(). And then cleared by napi->poll().
> >
> > Which part of napi->poll() could possibly clear NAPI_STATE_SCHED when
> > called by netpoll ?
> >
> > AFAIK, napi_complete_done() should exit early (before having a chance
> > to clear NAPI_STATE_SCHED)
> > because of :
> >
> > if (unlikely(n->state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL)))
> > return false;
> >
> > Thanks !
>
> You are right on this condition. But this does trigger dev_watchdog() for
> some reason.

Interesting, maybe a bnxt specific issue.

It seems their model is to process TX/RX notification in the same queue,
they throw away RX events if budget == 0

It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
must be reverted.

Otherwise, we have a possibility of blocking a queue under netpoll pressure.


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Song Liu



> On Sep 24, 2018, at 12:41 PM, Eric Dumazet  wrote:
> 
> On Mon, Sep 24, 2018 at 12:31 PM Song Liu  wrote:
> 
>> This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
>> tested other drivers yet).
>> 
>> I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
>> napi->poll(). And then cleared by napi->poll().
> 
> Which part of napi->poll() could possibly clear NAPI_STATE_SCHED when
> called by netpoll ?
> 
> AFAIK, napi_complete_done() should exit early (before having a chance
> to clear NAPI_STATE_SCHED)
> because of :
> 
> if (unlikely(n->state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL)))
> return false;
> 
> Thanks !

You are right on this condition. But this does trigger dev_watchdog() for
some reason. 

Thanks,
Song




Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Eric Dumazet
On Mon, Sep 24, 2018 at 12:31 PM Song Liu  wrote:

> This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
> tested other drivers yet).
>
> I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
> napi->poll(). And then cleared by napi->poll().

Which part of napi->poll() could possibly clear NAPI_STATE_SCHED when
called by netpoll ?

AFAIK, napi_complete_done() should exit early (before having a chance
to clear NAPI_STATE_SCHED)
because of :

if (unlikely(n->state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL)))
 return false;

Thanks !
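
A small standalone model of that early return (the bit values and the model_complete_done name are illustrative, not the kernel's definitions): when the NPSVC or busy-poll bit is set, the function bails out before it ever gets a chance to clear SCHED:

/* Userspace model of the early return quoted above from
 * napi_complete_done().  Bit positions are made up for illustration. */
#include <stdbool.h>
#include <stdio.h>

#define NAPIF_STATE_SCHED        (1UL << 0)
#define NAPIF_STATE_NPSVC        (1UL << 1)
#define NAPIF_STATE_IN_BUSY_POLL (1UL << 2)

static bool model_complete_done(unsigned long *state)
{
    if (*state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL))
        return false;                /* netpoll/busy-poll: do nothing */

    *state &= ~NAPIF_STATE_SCHED;    /* normal path: release ownership */
    return true;
}

int main(void)
{
    unsigned long state = NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC;

    printf("completed=%d, SCHED still set=%d\n",
           model_complete_done(&state),
           !!(state & NAPIF_STATE_SCHED));
    return 0;
}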


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Song Liu



> On Sep 24, 2018, at 8:30 AM, Eric Dumazet  wrote:
> 
> On Sun, Sep 23, 2018 at 10:04 PM David Miller  wrote:
>> 
>> Series applied, thanks Eric.
> 
> Thanks David.
> 
> Song, would you please test this additional patch ?
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
> {
>int work = 0;
> 
> -       /* net_rx_action's ->poll() invocations and our's are
> -        * synchronized by this test which is only made while
> -        * holding the napi->poll_lock.
> -        */
> -       if (!test_bit(NAPI_STATE_SCHED, &napi->state))
> -               return;
> -
>        /* If we set this bit but see that it has already been set,
>         * that indicates that napi has been disabled and we need
>         * to abort this operation

This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
tested other drivers yet). 

I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
napi->poll(). And then cleared by napi->poll(). So a packet is missed by
napi (set NAPI_STATE_SCHED, but didn't get handled).

Song



Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-24 Thread Eric Dumazet
On Sun, Sep 23, 2018 at 10:04 PM David Miller  wrote:
>
> Series applied, thanks Eric.

Thanks David.

Song, would you please test this additional patch ?

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
 {
        int work = 0;

-       /* net_rx_action's ->poll() invocations and our's are
-        * synchronized by this test which is only made while
-        * holding the napi->poll_lock.
-        */
-       if (!test_bit(NAPI_STATE_SCHED, &napi->state))
-               return;
-
        /* If we set this bit but see that it has already been set,
         * that indicates that napi has been disabled and we need
         * to abort this operation


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-23 Thread David Miller
From: Eric Dumazet 
Date: Fri, 21 Sep 2018 15:27:37 -0700

> As diagnosed by Song Liu, ndo_poll_controller() can
> be very dangerous on loaded hosts, since the cpu
> calling ndo_poll_controller() might steal all NAPI
> contexts (for all RX/TX queues of the NIC).
> 
> This capture, showing one ksoftirqd eating all cycles
> can last for unlimited amount of time, since one
> cpu is generally not able to drain all the queues under load.
> 
> It seems that all networking drivers that do use NAPI
> for their TX completions, should not provide a ndo_poll_controller() :
> 
> Most NAPI drivers have netpoll support already handled
> in core networking stack, since netpoll_poll_dev()
> uses poll_napi(dev) to iterate through registered
> NAPI contexts for a device.
> 
> This patch series take care of the first round, we will
> handle other drivers in future rounds.

Series applied, thanks Eric.
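
As background for the capture effect described in the cover letter, a small userspace model (the queue count and all names are illustrative, and napi_schedule() is reduced to a simple ownership grab): a legacy ndo_poll_controller() that schedules every queue leaves one cpu holding them all until each poll finally completes:

/* Userspace model of the "capture effect": one cpu grabs the SCHED
 * ownership of every queue.  Illustrative only, not kernel code. */
#include <stdio.h>

#define NQUEUES 8

struct model_napi {
    int sched_owner;        /* -1 = not scheduled, else cpu holding SCHED */
};

static struct model_napi queues[NQUEUES];

/* What an old-style ndo_poll_controller() effectively did. */
static void legacy_poll_controller(int cpu)
{
    for (int i = 0; i < NQUEUES; i++)
        if (queues[i].sched_owner == -1)
            queues[i].sched_owner = cpu;    /* napi_schedule()-like grab */
}

int main(void)
{
    for (int i = 0; i < NQUEUES; i++)
        queues[i].sched_owner = -1;

    legacy_poll_controller(0);   /* cpu 0 issues a netpoll TX */

    int owned = 0;
    for (int i = 0; i < NQUEUES; i++)
        if (queues[i].sched_owner == 0)
            owned++;

    /* Under load, cpu 0's ksoftirqd now has to drain all of these. */
    printf("cpu 0 captured %d of %d queues\n", owned, NQUEUES);
    return 0;
}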


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-23 Thread Eric Dumazet
On Sun, Sep 23, 2018 at 12:29 PM David Miller  wrote:
>
> From: Eric Dumazet 
> Date: Fri, 21 Sep 2018 15:27:37 -0700
>
> > As diagnosed by Song Liu, ndo_poll_controller() can
> > be very dangerous on loaded hosts, since the cpu
> > calling ndo_poll_controller() might steal all NAPI
> > contexts (for all RX/TX queues of the NIC).
> >
> > This capture, showing one ksoftirqd eating all cycles
> > can last for unlimited amount of time, since one
> > cpu is generally not able to drain all the queues under load.
> >
> > It seems that all networking drivers that do use NAPI
> > for their TX completions, should not provide a ndo_poll_controller() :
> >
> > Most NAPI drivers have netpoll support already handled
> > in core networking stack, since netpoll_poll_dev()
> > uses poll_napi(dev) to iterate through registered
> > NAPI contexts for a device.
>
> I'm having trouble understanding the difference.
>
> If the drivers are processing all of the RX/TX queue draining by hand
> in their ndo_poll_controller() method, how is that different from the
> generic code walking all of the registered NAPI instances one by one?

(resent in plain text mode this time)

Reading poll_napi() and poll_one_napi() I thought that we were using
NAPI_STATE_NPSVC
and cmpxchg(&napi->poll_owner, -1, cpu) to _temporarily_ [1] own each
napi at a time.

But I do see we also have this part at the beginning of poll_one_napi() :

if (!test_bit(NAPI_STATE_SCHED, &napi->state))
        return;

So we probably should remove it. (The normal napi->poll() calls would
not proceed since napi->poll_owner would not be -1)

[1]
While if a cpu succeeds in setting NAPI_STATE_SCHED, it means it has
to own it as long as the
napi->poll() does not call napi_complete_done(), and this can be
forever (the capture effect)

Basically calling napi_schedule() is the dangerous part.

I believe busy_polling and netpoll are the same intruders (as they can
run on arbitrary cpus).
But netpoll is far more problematic since it iterates through all RX/TX queues.
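
A userspace sketch of the temporary-ownership scheme described above, assuming poll_owner == -1 means unowned; the struct and function names are invented, and the real poll_napi()/poll_one_napi() split is collapsed into one function for brevity:

/* Userspace model: netpoll temporarily owns a napi context via a
 * compare-and-swap on poll_owner plus an NPSVC flag, instead of the
 * SCHED bit.  Illustrative only. */
#include <stdatomic.h>
#include <stdio.h>

struct model_napi {
    atomic_int poll_owner;      /* -1 = unowned, else owning cpu */
    int        npsvc;           /* models NAPI_STATE_NPSVC */
};

static void model_poll_one_napi(struct model_napi *n, int cpu)
{
    int expected = -1;

    /* Temporary ownership: only one cpu wins the cmpxchg. */
    if (!atomic_compare_exchange_strong(&n->poll_owner, &expected, cpu))
        return;                  /* someone else is polling it */

    n->npsvc = 1;                /* ->poll() runs with budget 0, NPSVC set */
    /* ... call napi->poll(napi, 0) here ... */
    n->npsvc = 0;

    atomic_store(&n->poll_owner, -1);   /* release ownership */
}

int main(void)
{
    struct model_napi n = { .poll_owner = -1, .npsvc = 0 };

    model_poll_one_napi(&n, 3);
    printf("owner after poll: %d\n", atomic_load(&n.poll_owner));
    return 0;
}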


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-23 Thread David Miller
From: Eric Dumazet 
Date: Fri, 21 Sep 2018 15:27:37 -0700

> As diagnosed by Song Liu, ndo_poll_controller() can
> be very dangerous on loaded hosts, since the cpu
> calling ndo_poll_controller() might steal all NAPI
> contexts (for all RX/TX queues of the NIC).
> 
> This capture, showing one ksoftirqd eating all cycles
> can last for unlimited amount of time, since one
> cpu is generally not able to drain all the queues under load.
> 
> It seems that all networking drivers that do use NAPI
> for their TX completions, should not provide a ndo_poll_controller() :
> 
> Most NAPI drivers have netpoll support already handled
> in core networking stack, since netpoll_poll_dev()
> uses poll_napi(dev) to iterate through registered
> NAPI contexts for a device.

I'm having trouble understanding the difference.

If the drivers are processing all of the RX/TX queue draining by hand
in their ndo_poll_controller() method, how is that different from the
generic code walking all of the registered NAPI instances one by one?