Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
> On Sep 24, 2018, at 8:30 AM, Eric Dumazet wrote:
>
> On Sun, Sep 23, 2018 at 10:04 PM David Miller wrote:
>>
>> Series applied, thanks Eric.
>
> Thanks David.
>
> Song, would you please test this additional patch ?
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
>  {
>         int work = 0;
>
> -       /* net_rx_action's ->poll() invocations and our's are
> -        * synchronized by this test which is only made while
> -        * holding the napi->poll_lock.
> -        */
> -       if (!test_bit(NAPI_STATE_SCHED, &napi->state))
> -               return;
> -
>         /* If we set this bit but see that it has already been set,
>          * that indicates that napi has been disabled and we need
>          * to abort this operation

Reviewed-and-tested-by: Song Liu
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Tue, Sep 25, 2018 at 11:25 AM Song Liu wrote:
>
> Hi Michael,
>
> This may not be related. But I am looking at this:
>
> bnxt_poll_work() {
>         while (1) {
>                 if (rx_pkts == budget)
>                         return;
>         }
> }
>
> With a budget of 0, the loop will terminate after processing one packet.
> But I think the expectation is to finish all TX packets, so it doesn't
> feel right. Could you please confirm?

Right, this in effect processes only one TX packet, so it will be
inefficient at least. But I think fixing it here still will not fix all
the issues, because even if we process all the TX packets here, we may
still miss some that are in flight. When we exit poll, netpoll may not
call us back again, and there will be no interrupts because we don't
arm the IRQ when the budget of 0 is reached.

I will send a test patch shortly for review and testing.

Thanks.
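The bnxt_poll_work() loop Song quotes can be reduced to a toy model (this is not the actual bnxt code; the event encoding below is invented purely for illustration: 0 = TX completion, 1 = RX packet):

```c
#include <assert.h>

/* Toy model of the budget check under discussion.  Returns the number
 * of completion-ring entries processed before the loop exits. */
static int poll_work_model(const int *ring, int n, int budget)
{
	int rx_pkts = 0;
	int processed = 0;

	for (int i = 0; i < n; i++) {
		processed++;
		if (ring[i] == 1)	/* only RX events consume budget */
			rx_pkts++;
		/* With budget == 0, rx_pkts == budget holds on the very
		 * first iteration, so any later TX completions in the
		 * ring are left unprocessed. */
		if (rx_pkts == budget)
			break;
	}
	return processed;
}
```

With a ring of four entries where only one is an RX event, the model stops after a single entry when called with budget 0, but drains the whole ring with a normal budget, which is the mismatch being discussed.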
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
> On Sep 25, 2018, at 7:43 AM, Michael Chan wrote:
>
> On Tue, Sep 25, 2018 at 7:20 AM Eric Dumazet wrote:
>>
>> On Tue, Sep 25, 2018 at 7:02 AM Michael Chan wrote:
>>>
>>> On Mon, Sep 24, 2018 at 2:18 PM Song Liu wrote:
>>>>
>>>>> On Sep 24, 2018, at 2:05 PM, Eric Dumazet wrote:
>>>>>
>>>>>> Interesting, maybe a bnxt specific issue.
>>>>>>
>>>>>> It seems their model is to process TX/RX notification in the same queue,
>>>>>> they throw away RX events if budget == 0
>>>>>>
>>>>>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>>>>>> must be reverted.
>>>>>>
>>>>>> Otherwise, we have a possibility of blocking a queue under netpoll
>>>>>> pressure.
>>>>>
>>>>> Hmm, actually a revert might not be enough, since code at lines 2030-2031
>>>>> would fire and we might not call napi_complete_done() anyway.
>>>>>
>>>>> Unfortunately this driver logic is quite complex.
>>>>>
>>>>> Could you test on other NICs eventually ?
>>>>
>>>> It actually runs OK on ixgbe.
>>>>
>>>> @Michael, could you please help us with this?
>>>
>>> I've taken a quick look using today's net tree plus Eric's
>>> poll_one_napi() patch. The problem I'm seeing is that netpoll calls
>>> bnxt_poll() with budget 0. And since work_done >= budget of 0, we
>>> return without calling napi_complete_done() and without arming the
>>> interrupt. netpoll doesn't always call us back until we call
>>> napi_complete_done(), right? So I think if there are in-flight TX
>>> completions, we'll miss those.
>>
>> That's the whole point of netpoll :
>>
>> We drain the TX queues, without interrupts being involved at all,
>> by calling ->napi() with a zero budget.
>>
>> napi_complete(), even if called from ->napi() while budget was zero,
>> should do nothing but return early.
>>
>> budget==0 means that ->napi() should process all TX completions.
>
> All TX completions that we can see. We cannot see the in-flight ones.
>
> If budget is exceeded, I think the assumption is that poll will always
> be called again.
>
>> So it looks like bnxt has a bug, that is showing up after the latest
>> poll_one_napi() patch.
>> This latest patch is needed otherwise the cpu attempting the
>> netpoll-TX-drain might drain nothing at all,
>> since it does not anymore call ndo_poll_controller() that was grabbing
>> SCHED bits on all queues (napi_schedule() like calls)
>
> I think the latest patch is preventing the normal interrupt -> NAPI
> path from coming in and cleaning the remaining TX completions and
> arming the interrupt.

Hi Michael,

This may not be related. But I am looking at this:

bnxt_poll_work() {
        while (1) {
                if (rx_pkts == budget)
                        return;
        }
}

With a budget of 0, the loop will terminate after processing one packet.
But I think the expectation is to finish all TX packets, so it doesn't
feel right. Could you please confirm?

Thanks,
Song
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Tue, Sep 25, 2018 at 7:20 AM Eric Dumazet wrote:
>
> On Tue, Sep 25, 2018 at 7:02 AM Michael Chan wrote:
>>
>> On Mon, Sep 24, 2018 at 2:18 PM Song Liu wrote:
>>>
>>>> On Sep 24, 2018, at 2:05 PM, Eric Dumazet wrote:
>>>>
>>>>> Interesting, maybe a bnxt specific issue.
>>>>>
>>>>> It seems their model is to process TX/RX notification in the same queue,
>>>>> they throw away RX events if budget == 0
>>>>>
>>>>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>>>>> must be reverted.
>>>>>
>>>>> Otherwise, we have a possibility of blocking a queue under netpoll
>>>>> pressure.
>>>>
>>>> Hmm, actually a revert might not be enough, since code at lines 2030-2031
>>>> would fire and we might not call napi_complete_done() anyway.
>>>>
>>>> Unfortunately this driver logic is quite complex.
>>>>
>>>> Could you test on other NICs eventually ?
>>>
>>> It actually runs OK on ixgbe.
>>>
>>> @Michael, could you please help us with this?
>>
>> I've taken a quick look using today's net tree plus Eric's
>> poll_one_napi() patch. The problem I'm seeing is that netpoll calls
>> bnxt_poll() with budget 0. And since work_done >= budget of 0, we
>> return without calling napi_complete_done() and without arming the
>> interrupt. netpoll doesn't always call us back until we call
>> napi_complete_done(), right? So I think if there are in-flight TX
>> completions, we'll miss those.
>
> That's the whole point of netpoll :
>
> We drain the TX queues, without interrupts being involved at all,
> by calling ->napi() with a zero budget.
>
> napi_complete(), even if called from ->napi() while budget was zero,
> should do nothing but return early.
>
> budget==0 means that ->napi() should process all TX completions.

All TX completions that we can see. We cannot see the in-flight ones.

If budget is exceeded, I think the assumption is that poll will always
be called again.

> So it looks like bnxt has a bug, that is showing up after the latest
> poll_one_napi() patch.
> This latest patch is needed otherwise the cpu attempting the
> netpoll-TX-drain might drain nothing at all,
> since it does not anymore call ndo_poll_controller() that was grabbing
> SCHED bits on all queues (napi_schedule() like calls)

I think the latest patch is preventing the normal interrupt -> NAPI
path from coming in and cleaning the remaining TX completions and
arming the interrupt.
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Tue, Sep 25, 2018 at 7:02 AM Michael Chan wrote:
>
> On Mon, Sep 24, 2018 at 2:18 PM Song Liu wrote:
>>
>>> On Sep 24, 2018, at 2:05 PM, Eric Dumazet wrote:
>>>
>>>> Interesting, maybe a bnxt specific issue.
>>>>
>>>> It seems their model is to process TX/RX notification in the same queue,
>>>> they throw away RX events if budget == 0
>>>>
>>>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>>>> must be reverted.
>>>>
>>>> Otherwise, we have a possibility of blocking a queue under netpoll
>>>> pressure.
>>>
>>> Hmm, actually a revert might not be enough, since code at lines 2030-2031
>>> would fire and we might not call napi_complete_done() anyway.
>>>
>>> Unfortunately this driver logic is quite complex.
>>>
>>> Could you test on other NICs eventually ?
>>
>> It actually runs OK on ixgbe.
>>
>> @Michael, could you please help us with this?
>
> I've taken a quick look using today's net tree plus Eric's
> poll_one_napi() patch. The problem I'm seeing is that netpoll calls
> bnxt_poll() with budget 0. And since work_done >= budget of 0, we
> return without calling napi_complete_done() and without arming the
> interrupt. netpoll doesn't always call us back until we call
> napi_complete_done(), right? So I think if there are in-flight TX
> completions, we'll miss those.

That's the whole point of netpoll :

We drain the TX queues, without interrupts being involved at all,
by calling ->napi() with a zero budget.

napi_complete(), even if called from ->napi() while budget was zero,
should do nothing but return early.

budget==0 means that ->napi() should process all TX completions.

So it looks like bnxt has a bug, that is showing up after the latest
poll_one_napi() patch.

This latest patch is needed; otherwise the cpu attempting the
netpoll TX drain might drain nothing at all, since it no longer calls
ndo_poll_controller(), which was grabbing SCHED bits on all queues
(napi_schedule()-like calls).
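The budget==0 contract Eric describes can be sketched as a minimal model of a NAPI-style poll handler (invented names and a mock queue, not actual kernel or driver code): TX completions are drained unconditionally, RX only consumes budget, and completion/re-arming can never happen when budget is 0.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock per-queue state for the model. */
struct mock_queue {
	int tx_pending;		/* TX completions waiting to be reaped */
	int rx_pending;		/* RX packets waiting */
	bool irq_armed;		/* set by the "completion" step */
};

/* Sketch of a poll handler honoring the netpoll contract:
 * budget == 0 means "drain all TX, touch no RX, do not complete". */
static int napi_poll_model(struct mock_queue *q, int budget)
{
	int rx;

	q->tx_pending = 0;	/* TX is always drained, even for budget 0 */

	rx = q->rx_pending < budget ? q->rx_pending : budget;
	q->rx_pending -= rx;

	/* "napi_complete + re-arm" only when we did not exhaust budget;
	 * with budget == 0, rx == budget, so this can never fire. */
	if (rx < budget)
		q->irq_armed = true;

	return rx;
}
```

A netpoll-style call (`budget == 0`) on this model reaps every visible TX completion while leaving RX and the interrupt state untouched; a normal call with a positive budget processes RX and completes.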
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Mon, Sep 24, 2018 at 2:18 PM Song Liu wrote:
>
>> On Sep 24, 2018, at 2:05 PM, Eric Dumazet wrote:
>>
>>> Interesting, maybe a bnxt specific issue.
>>>
>>> It seems their model is to process TX/RX notification in the same queue,
>>> they throw away RX events if budget == 0
>>>
>>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>>> must be reverted.
>>>
>>> Otherwise, we have a possibility of blocking a queue under netpoll
>>> pressure.
>>
>> Hmm, actually a revert might not be enough, since code at lines 2030-2031
>> would fire and we might not call napi_complete_done() anyway.
>>
>> Unfortunately this driver logic is quite complex.
>>
>> Could you test on other NICs eventually ?
>
> It actually runs OK on ixgbe.
>
> @Michael, could you please help us with this?

I've taken a quick look using today's net tree plus Eric's
poll_one_napi() patch. The problem I'm seeing is that netpoll calls
bnxt_poll() with budget 0. And since work_done >= budget of 0, we
return without calling napi_complete_done() and without arming the
interrupt. netpoll doesn't always call us back until we call
napi_complete_done(), right? So I think if there are in-flight TX
completions, we'll miss those.
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
> On Sep 24, 2018, at 2:05 PM, Eric Dumazet wrote:
>
>> Interesting, maybe a bnxt specific issue.
>>
>> It seems their model is to process TX/RX notification in the same queue,
>> they throw away RX events if budget == 0
>>
>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>> must be reverted.
>>
>> Otherwise, we have a possibility of blocking a queue under netpoll
>> pressure.
>
> Hmm, actually a revert might not be enough, since code at lines 2030-2031
> would fire and we might not call napi_complete_done() anyway.
>
> Unfortunately this driver logic is quite complex.
>
> Could you test on other NICs eventually ?

It actually runs OK on ixgbe.

@Michael, could you please help us with this?

Thanks,
Song
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
> Interesting, maybe a bnxt specific issue.
>
> It seems their model is to process TX/RX notification in the same queue,
> they throw away RX events if budget == 0
>
> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> must be reverted.
>
> Otherwise, we have a possibility of blocking a queue under netpoll pressure.

Hmm, actually a revert might not be enough, since code at lines 2030-2031
would fire and we might not call napi_complete_done() anyway.

Unfortunately this driver logic is quite complex.

Could you test on other NICs eventually ?

Thanks.
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Mon, Sep 24, 2018 at 1:00 PM Song Liu wrote:
>
>> On Sep 24, 2018, at 12:41 PM, Eric Dumazet wrote:
>>
>> On Mon, Sep 24, 2018 at 12:31 PM Song Liu wrote:
>>
>>> This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
>>> tested other drivers yet).
>>>
>>> I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
>>> napi->poll(). And then cleared by napi->poll().
>>
>> Which part of napi->poll() could possibly clear NAPI_STATE_SCHED when
>> called by netpoll ?
>>
>> AFAIK, napi_complete_done() should exit early (before having a chance
>> to clear NAPI_STATE_SCHED) because of :
>>
>>         if (unlikely(n->state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL)))
>>                 return false;
>>
>> Thanks !
>
> You are right on this condition. But this does trigger dev_watchdog() for
> some reason.

Interesting, maybe a bnxt specific issue.

It seems their model is to process TX/RX notification in the same queue,
they throw away RX events if budget == 0

It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
must be reverted.

Otherwise, we have a possibility of blocking a queue under netpoll
pressure.
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
> On Sep 24, 2018, at 12:41 PM, Eric Dumazet wrote:
>
> On Mon, Sep 24, 2018 at 12:31 PM Song Liu wrote:
>
>> This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
>> tested other drivers yet).
>>
>> I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
>> napi->poll(). And then cleared by napi->poll().
>
> Which part of napi->poll() could possibly clear NAPI_STATE_SCHED when
> called by netpoll ?
>
> AFAIK, napi_complete_done() should exit early (before having a chance
> to clear NAPI_STATE_SCHED) because of :
>
>         if (unlikely(n->state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL)))
>                 return false;
>
> Thanks !

You are right on this condition. But this does trigger dev_watchdog() for
some reason.

Thanks,
Song
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Mon, Sep 24, 2018 at 12:31 PM Song Liu wrote:

> This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
> tested other drivers yet).
>
> I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
> napi->poll(). And then cleared by napi->poll().

Which part of napi->poll() could possibly clear NAPI_STATE_SCHED when
called by netpoll ?

AFAIK, napi_complete_done() should exit early (before having a chance
to clear NAPI_STATE_SCHED) because of :

        if (unlikely(n->state & (NAPIF_STATE_NPSVC | NAPIF_STATE_IN_BUSY_POLL)))
                return false;

Thanks !
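The early return Eric quotes can be captured in a toy model (bit values and names below are invented for the model, not the kernel's): while netpoll services a napi it holds the NPSVC bit, so a complete() call made from inside the poll handler bails out before it can clear SCHED.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented bit values for the model. */
#define M_STATE_SCHED		(1u << 0)
#define M_STATE_NPSVC		(1u << 1)	/* netpoll servicing */
#define M_STATE_IN_BUSY_POLL	(1u << 2)

/* Model of the guard in napi_complete_done(): under netpoll or busy
 * polling, return early and leave SCHED set; otherwise clear SCHED. */
static bool napi_complete_model(unsigned int *state)
{
	if (*state & (M_STATE_NPSVC | M_STATE_IN_BUSY_POLL))
		return false;		/* SCHED left set on purpose */
	*state &= ~M_STATE_SCHED;
	return true;
}
```

This is why, in Eric's reasoning, a driver's napi->poll() running under netpoll should not be able to clear NAPI_STATE_SCHED through napi_complete_done().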
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
> On Sep 24, 2018, at 8:30 AM, Eric Dumazet wrote:
>
> On Sun, Sep 23, 2018 at 10:04 PM David Miller wrote:
>>
>> Series applied, thanks Eric.
>
> Thanks David.
>
> Song, would you please test this additional patch ?
>
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
>  {
>         int work = 0;
>
> -       /* net_rx_action's ->poll() invocations and our's are
> -        * synchronized by this test which is only made while
> -        * holding the napi->poll_lock.
> -        */
> -       if (!test_bit(NAPI_STATE_SCHED, &napi->state))
> -               return;
> -
>         /* If we set this bit but see that it has already been set,
>          * that indicates that napi has been disabled and we need
>          * to abort this operation

This triggers dev_watchdog() on a simple netperf TCP_RR on bnxt (I haven't
tested other drivers yet).

I guess this is because NAPI_STATE_SCHED is set when poll_one_napi() calls
napi->poll(), and then cleared by napi->poll(). So a packet is missed by
napi (NAPI_STATE_SCHED was set, but the packet didn't get handled).

Song
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Sun, Sep 23, 2018 at 10:04 PM David Miller wrote:
>
> Series applied, thanks Eric.

Thanks David.

Song, would you please test this additional patch ?

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
 {
        int work = 0;

-       /* net_rx_action's ->poll() invocations and our's are
-        * synchronized by this test which is only made while
-        * holding the napi->poll_lock.
-        */
-       if (!test_bit(NAPI_STATE_SCHED, &napi->state))
-               return;
-
        /* If we set this bit but see that it has already been set,
         * that indicates that napi has been disabled and we need
         * to abort this operation
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
From: Eric Dumazet
Date: Fri, 21 Sep 2018 15:27:37 -0700

> As diagnosed by Song Liu, ndo_poll_controller() can
> be very dangerous on loaded hosts, since the cpu
> calling ndo_poll_controller() might steal all NAPI
> contexts (for all RX/TX queues of the NIC).
>
> This capture, showing one ksoftirqd eating all cycles,
> can last for an unlimited amount of time, since one
> cpu is generally not able to drain all the queues under load.
>
> It seems that all networking drivers that do use NAPI
> for their TX completions should not provide a ndo_poll_controller() :
>
> Most NAPI drivers have netpoll support already handled
> in the core networking stack, since netpoll_poll_dev()
> uses poll_napi(dev) to iterate through registered
> NAPI contexts for a device.
>
> This patch series takes care of the first round; we will
> handle other drivers in future rounds.

Series applied, thanks Eric.
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
On Sun, Sep 23, 2018 at 12:29 PM David Miller wrote:
>
> From: Eric Dumazet
> Date: Fri, 21 Sep 2018 15:27:37 -0700
>
>> As diagnosed by Song Liu, ndo_poll_controller() can
>> be very dangerous on loaded hosts, since the cpu
>> calling ndo_poll_controller() might steal all NAPI
>> contexts (for all RX/TX queues of the NIC).
>>
>> This capture, showing one ksoftirqd eating all cycles,
>> can last for an unlimited amount of time, since one
>> cpu is generally not able to drain all the queues under load.
>>
>> It seems that all networking drivers that do use NAPI
>> for their TX completions should not provide a ndo_poll_controller() :
>>
>> Most NAPI drivers have netpoll support already handled
>> in the core networking stack, since netpoll_poll_dev()
>> uses poll_napi(dev) to iterate through registered
>> NAPI contexts for a device.
>
> I'm having trouble understanding the difference.
>
> If the drivers are processing all of the RX/TX queue draining by hand
> in their ndo_poll_controller() method, how is that different from the
> generic code walking all of the registered NAPI instances one by one?

(resent in plain text mode this time)

Reading poll_napi() and poll_one_napi(), I thought that we were using
NAPI_STATE_NPSVC and cmpxchg(&napi->poll_owner, -1, cpu) to
_temporarily_ [1] own each napi, one at a time.

But I do see we also have this part at the beginning of poll_one_napi() :

        if (!test_bit(NAPI_STATE_SCHED, &napi->state))
                return;

So we probably should remove it. (The normal napi->poll() calls would
not proceed, since napi->poll_owner would not be -1.)

[1] Whereas if a cpu succeeds in setting NAPI_STATE_SCHED, it has to
own the napi as long as napi->poll() does not call napi_complete_done(),
and this can be forever (the capture effect).

Basically, calling napi_schedule() is the dangerous part.

I believe busy polling and netpoll are the same kind of intruders (as
they can run on arbitrary cpus), but netpoll is far more problematic
since it iterates through all RX/TX queues.
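The ownership contrast Eric draws can be sketched as a small model using C11 atomics (invented names; this is not the kernel's implementation): netpoll claims a napi only for the duration of one poll via a compare-and-swap on poll_owner and releases it right away, whereas grabbing NAPI_STATE_SCHED ties the napi to one cpu until napi_complete_done() eventually runs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of netpoll's temporary ownership:
 * cmpxchg(poll_owner, -1, cpu) to claim, store -1 to release. */
static bool try_poll_one(atomic_int *poll_owner, int cpu)
{
	int expected = -1;

	/* Claim the napi only if nobody else is polling it. */
	if (!atomic_compare_exchange_strong(poll_owner, &expected, cpu))
		return false;	/* another cpu owns this napi right now */

	/* ... napi->poll(napi, 0) would run here ... */

	atomic_store(poll_owner, -1);	/* temporary ownership ends */
	return true;
}
```

Because ownership is released as soon as the poll returns, no single cpu can accumulate every queue's context the way a napi_schedule()-based ndo_poll_controller() could, which is the capture effect the cover letter describes.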
Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
From: Eric Dumazet
Date: Fri, 21 Sep 2018 15:27:37 -0700

> As diagnosed by Song Liu, ndo_poll_controller() can
> be very dangerous on loaded hosts, since the cpu
> calling ndo_poll_controller() might steal all NAPI
> contexts (for all RX/TX queues of the NIC).
>
> This capture, showing one ksoftirqd eating all cycles,
> can last for an unlimited amount of time, since one
> cpu is generally not able to drain all the queues under load.
>
> It seems that all networking drivers that do use NAPI
> for their TX completions should not provide a ndo_poll_controller() :
>
> Most NAPI drivers have netpoll support already handled
> in the core networking stack, since netpoll_poll_dev()
> uses poll_napi(dev) to iterate through registered
> NAPI contexts for a device.

I'm having trouble understanding the difference.

If the drivers are processing all of the RX/TX queue draining by hand
in their ndo_poll_controller() method, how is that different from the
generic code walking all of the registered NAPI instances one by one?
[PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers
As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous
on loaded hosts, since the cpu calling ndo_poll_controller() might
steal all NAPI contexts (for all RX/TX queues of the NIC).

This capture, showing one ksoftirqd eating all cycles, can last for an
unlimited amount of time, since one cpu is generally not able to drain
all the queues under load.

It seems that all networking drivers that do use NAPI for their TX
completions should not provide a ndo_poll_controller() :

Most NAPI drivers have netpoll support already handled in the core
networking stack, since netpoll_poll_dev() uses poll_napi(dev) to
iterate through registered NAPI contexts for a device.

This patch series takes care of the first round; we will handle other
drivers in future rounds.

Eric Dumazet (15):
  netpoll: make ndo_poll_controller() optional
  bonding: use netpoll_poll_dev() helper
  ixgbe: remove ndo_poll_controller
  ixgbevf: remove ndo_poll_controller
  fm10k: remove ndo_poll_controller
  ixgb: remove ndo_poll_controller
  igb: remove ndo_poll_controller
  ice: remove ndo_poll_controller
  i40evf: remove ndo_poll_controller
  mlx4: remove ndo_poll_controller
  mlx5: remove ndo_poll_controller
  bnx2x: remove ndo_poll_controller
  bnxt: remove ndo_poll_controller
  nfp: remove ndo_poll_controller
  tun: remove ndo_poll_controller

 drivers/net/bonding/bond_main.c                 | 11 +
 .../net/ethernet/broadcom/bnx2x/bnx2x_main.c    | 16 ---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c       | 18
 drivers/net/ethernet/intel/fm10k/fm10k.h        |  3 --
 .../net/ethernet/intel/fm10k/fm10k_netdev.c     |  3 --
 drivers/net/ethernet/intel/fm10k/fm10k_pci.c    | 22 --
 .../net/ethernet/intel/i40evf/i40evf_main.c     | 26 ---
 drivers/net/ethernet/intel/ice/ice_main.c       | 27
 drivers/net/ethernet/intel/igb/igb_main.c       | 30 -
 drivers/net/ethernet/intel/ixgb/ixgb_main.c     | 25 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   | 25 ---
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 21 -
 .../net/ethernet/mellanox/mlx4/en_netdev.c      | 20 -
 .../net/ethernet/mellanox/mlx5/core/en_main.c   | 19
 .../ethernet/netronome/nfp/nfp_net_common.c     | 18
 drivers/net/tun.c                               | 43 ---
 include/linux/netpoll.h                         |  5 ++-
 net/core/netpoll.c                              | 19 +++-
 18 files changed, 12 insertions(+), 339 deletions(-)

-- 
2.19.0.444.g18242da7ef-goog