Re: new NAPI interface broken
Benjamin Herrenschmidt <[EMAIL PROTECTED]> wrote on 16.10.2007 11:01:49: > > > Christoph, have any of you tried it on powerpc ? No we didn't try this (yet). This approach makes a lot of sense. Why is this not installed by both large distros on PPC by default? how mature is this for larger SMPs on PPC? > > Cheers, > Ben. > > Gruss / Regards Christoph R. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
Hi, > As far as I know, the x86 in-kernel thingy doesn't but yeah, the > userland one seems much more evolved and does things based on the > "class" of the device. > > Christoph, have any of you tried it on powerpc ? FYI It works fine on PowerPC and its installed by default on some distros (eg RHEL5). Anton - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
David Miller wrote: From: Benjamin Herrenschmidt <[EMAIL PROTECTED]> Date: Tue, 16 Oct 2007 18:28:56 +1000 Allright, so that's an out of tree userland thingy... (which may well work on ppc too I suppose). Definitely not installed by default by my distro so IRQs from the network cards on all x86's using ubuntu gutsy at least are spread to all CPUs :-) But the thing does treat network interfaces differently from other devices. and it works on various architectures The in-kernel x86 thing is going away (it's actually highly suboptimal) Yes it's done in userland, this is the right place so far out of tree... well... there's no good place for such userspace tools in the kernel tree currently otherwise I'd love to have it there. If your distro doesn't install this by default, please file a bug against the distro; I know we (Intel) worked with Fedora, RHEL, SuSE and Ubuntu to get it included maybe others don't? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
On Tue, 2007-10-16 at 01:31 -0700, David Miller wrote: > From: Benjamin Herrenschmidt <[EMAIL PROTECTED]> > Date: Tue, 16 Oct 2007 18:28:56 +1000 > > > Allright, so that's an out of tree userland thingy... (which may well > > work on ppc too I suppose). Definitely not installed by default by my > > distro so IRQs from the network cards on all x86's using ubuntu gutsy at > > least are spread to all CPUs :-) > > But the thing does treat network interfaces differently from > other devices. > > Arjan ran over to my table at kernel summit when I presented this > topic to the audience, to remind me of all of this and he should be > included in on any discussions about this topic. :-) > > I've CC:'d him. As far as I know, the x86 in-kernel thingy doesn't but yeah, the userland one seems much more evolved and does things based on the "class" of the device. Christoph, have any of you tried it on powerpc ? Cheers, Ben. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
From: Benjamin Herrenschmidt <[EMAIL PROTECTED]> Date: Tue, 16 Oct 2007 18:28:56 +1000 > Allright, so that's an out of tree userland thingy... (which may well > work on ppc too I suppose). Definitely not installed by default by my > distro so IRQs from the network cards on all x86's using ubuntu gutsy at > least are spread to all CPUs :-) But the thing does treat network interfaces differently from other devices. Arjan ran over to my table at kernel summit when I presented this topic to the audience, to remind me of all of this and he should be included in on any discussions about this topic. :-) I've CC:'d him. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
On Tue, 2007-10-16 at 00:44 -0700, David Miller wrote: > From: Benjamin Herrenschmidt <[EMAIL PROTECTED]> > Date: Tue, 16 Oct 2007 17:29:47 +1000 > > > Do you have any pointer to how that is done on x86 or sparc64 ? > > Sparc64 does it statically in the kernel. > > For x86, see http://irqbalance.org/ Allright, so that's an out of tree userland thingy... (which may well work on ppc too I suppose). Definitely not installed by default by my distro so IRQs from the network cards on all x86's using ubuntu gutsy at least are spread to all CPUs :-) There's also a balance kernel thread in x86 with has a notion of irq that isn't moveable but that flag isn't set by any driver. Cheers, Ben. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
From: Benjamin Herrenschmidt <[EMAIL PROTECTED]> Date: Tue, 16 Oct 2007 17:29:47 +1000 > Do you have any pointer to how that is done on x86 or sparc64 ? Sparc64 does it statically in the kernel. For x86, see http://irqbalance.org/ - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
> So the powerpc platform just honors the affinity mask, and depending on > the PIC does things that range from nothing to spreading interrupts to > CPUs in the affinity mask. > > All interrupts by defaults are spread to all CPUs (full balancing). > > At this stage, it's afaik userland business to enforce different > policies by changing the affinities via /proc/irq/*. > > Do you have any pointer to how that is done on x86 or sparc64 ? On my > x86 laptop using ubuntu gutsy, I definitely see the IRQ on which the > network card is connected (e1000) happily spread between the 2 cores > just like powerpc would do. More specifically, IRQF_NOBALANCING doesn't seem to be set anywhere except a few arch specific timer interrupts etc... nowhere I can see in network drivers or the network stack (the stack wouldn't know what IRQ anyway since not all drivers set netdev->irq). We currently don't have a balance kthread like x86 has, though I wonder if we should move this one out of x86 and make it generic (hell, it's even hidden in the IO_APIC code :-) But at this stage, HW balancing by the PIC is the norm and seems to be happening. Ben. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
Jumping on an old train ... On Wed, 2007-09-12 at 05:50 -0700, David Miller wrote: > > > This would mean we have a problem on all SMP machines right now. > > This is not a correct statement. > > Only on your platform do network device interrupts get moved > around, no other platform does this. > > Sparc64 doesn't, all interrupts stay in one location after > the cpu is initially choosen. > > x86 and x86_64 specifically do not move around network > device interrupts, even though other device types do > get dynamic IRQ cpu distribution. > > That's why you are the only person seeing this problem. > > I agree that it should be fixed, but we should also fix the IRQ > distribution scheme used on powerpc platforms which is totally > broken in these cases. So the powerpc platform just honors the affinity mask, and depending on the PIC does things that range from nothing to spreading interrupts to CPUs in the affinity mask. All interrupts by defaults are spread to all CPUs (full balancing). At this stage, it's afaik userland business to enforce different policies by changing the affinities via /proc/irq/*. Do you have any pointer to how that is done on x86 or sparc64 ? On my x86 laptop using ubuntu gutsy, I definitely see the IRQ on which the network card is connected (e1000) happily spread between the 2 cores just like powerpc would do. Cheers, Ben. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
On Wednesday 19 September 2007 17:33, Roland Dreier wrote: > > One other thing I observed is that I can not unload the module as the > > ref counter of the eth device is too low. I haven't tracked down the > > source of this problem yet. > > I suspect that this is because netif_rx_reschedule() was missing a > dev_hold() to match the dev_put() in netif_rx_complete(). Dave merged > a fix for that problem yesterday, so current net-2.6.24 should be OK. > Let us know if it's not OK, because then there's another bug... > > - R. > When I applied this netif_rx_reschedule fix this problem did not occur anymore (module could be unloaded). So I guess I hit the problem you described. Thanks, Jan-Bernd - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
> One other thing I observed is that I can not unload the module as the > ref counter of the eth device is too low. I haven't tracked down the > source of this problem yet. I suspect that this is because netif_rx_reschedule() was missing a dev_hold() to match the dev_put() in netif_rx_complete(). Dave merged a fix for that problem yesterday, so current net-2.6.24 should be OK. Let us know if it's not OK, because then there's another bug... - R. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
From: Jan-Bernd Themann <[EMAIL PROTECTED]> Date: Tue, 18 Sep 2007 18:15:45 +0200 > One other thing I observed is that I can not unload the module as the > ref counter of the eth device is too low. I haven't tracked down the > source of this problem yet. This is probably because of the resched device refcounting bug that Roland Dreier just posted a fix for. Look for subject "Fix refcounting problem with netif_rx_reschedule()" I merge that in shortly to net-2.6.24 as well. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
Hi, On Saturday 15 September 2007 00:12, David Miller wrote: > Ok, for now I'm going to try and deal with this by reverting > the list handling to something approximating the old NAPI > code, as per the patch below. > > I've only quickly test booted into this kernel on my workstation. > So take care when trying it out. > > I took the liberty to add some assertions and comments all over > wrt. list and quota handling. One thing to note that (and this > was true previously too) that if ->poll() uses the whole quota > it _MUST_ _NOT_ modify the NAPI state by doing a complete or > reschedule. The net_rx_action code still owns the NAPI state in > that case, and therefore is allowed to make modifications to the > ->poll_list. I'm currently updating the eHEA polling function to work with this scheme. Up to now we had sort of a quota for or TX refill part as well, which was also done in the poll function. This was possible as the device could be scheduled again even if the quota has not been used completely. If I got it right this is not longer possible. The idea was to benefit from the same "fairness" scheme of NAPI as it is done for the RX side. One other thing I observed is that I can not unload the module as the ref counter of the eth device is too low. I haven't tracked down the source of this problem yet. Regards, Jan-Bernd - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
From: Jan-Bernd Themann <[EMAIL PROTECTED]> Date: Fri, 7 Sep 2007 11:37:02 +0200 > Its about the question who inserts and removes devices from the poll list. > > netif_rx_schedule: sets NAPI_STATE_SCHED flag, insert device in poll list. > netif_rx_complete: clears NAPI_STATE_SCHED > netif_rx_reschedule: sets NAPI_STATE_SCHED, insert device in poll list. > net_rx_action: > -removes dev from poll list > -calls poll function > -adds dev to poll list if NAPI_STATE_SCHED still set Ok, for now I'm going to try and deal with this by reverting the list handling to something approximating the old NAPI code, as per the patch below. I've only quickly test booted into this kernel on my workstation. So take care when trying it out. I took the liberty to add some assertions and comments all over wrt. list and quota handling. One thing to note that (and this was true previously too) that if ->poll() uses the whole quota it _MUST_ _NOT_ modify the NAPI state by doing a complete or reschedule. The net_rx_action code still owns the NAPI state in that case, and therefore is allowed to make modifications to the ->poll_list. I realize this adds a lot of IRQ enable/disable overhead back into the code, but we can't get rid of that cleanly without solving this list ownership and modification issue properly. Signed-off-by: David S. Miller <[EMAIL PROTECTED]> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 0106fa6..9837ed3 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -284,7 +284,14 @@ extern int __init netdev_boot_setup(char *str); * Structure for NAPI scheduling similar to tasklet but with weighting */ struct napi_struct { + /* The poll_list must only be managed by the entity which +* changes the state of the NAPI_STATE_SCHED bit. This means +* whoever atomically sets that bit can add this napi_struct +* to the per-cpu poll_list, and whoever clears that bit +* can remove from the list right before clearing the bit. +*/ struct list_headpoll_list; + unsigned long state; int weight; int quota; @@ -336,13 +343,21 @@ static inline void napi_schedule(struct napi_struct *n) * * Mark NAPI processing as complete. */ -static inline void napi_complete(struct napi_struct *n) +static inline void __napi_complete(struct napi_struct *n) { BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state)); + list_del(&n->poll_list); smp_mb__before_clear_bit(); clear_bit(NAPI_STATE_SCHED, &n->state); } +static inline void napi_complete(struct napi_struct *n) +{ + local_irq_disable(); + __napi_complete(n); + local_irq_enable(); +} + /** * napi_disable - prevent NAPI from scheduling * @n: napi context @@ -1184,19 +1199,11 @@ static inline u32 netif_msg_init(int debug_value, int default_msg_enable_bits) return (1 << debug_value) - 1; } -/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete(). - * Do not inline this? - */ +/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete(). */ static inline int netif_rx_reschedule(struct napi_struct *n) { if (napi_schedule_prep(n)) { - unsigned long flags; - - local_irq_save(flags); - list_add_tail(&n->poll_list, - &__get_cpu_var(softnet_data).poll_list); - __raise_softirq_irqoff(NET_RX_SOFTIRQ); - local_irq_restore(flags); + __napi_schedule(n); return 1; } return 0; @@ -1234,7 +1241,7 @@ static inline void netif_rx_schedule(struct net_device *dev, static inline void __netif_rx_complete(struct net_device *dev, struct napi_struct *napi) { - napi_complete(napi); + __napi_complete(napi); dev_put(dev); } diff --git a/net/core/dev.c b/net/core/dev.c index f119dc0..2897352 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2120,6 +2120,13 @@ static int process_backlog(struct napi_struct *napi, int quota) return work; } +static void napi_check_quota_bug(struct napi_struct *n) +{ + /* It is illegal to consume more than the given quota. */ + if (WARN_ON_ONCE(n->quota < 0)) + n->quota = 0; +} + /** * __napi_schedule - schedule for receive * @napi: entry to schedule @@ -2130,10 +2137,8 @@ void fastcall __napi_schedule(struct napi_struct *n) { unsigned long flags; - if (n->quota < 0) - n->quota += n->weight; - else - n->quota = n->weight; + napi_check_quota_bug(n); + n->quota = n->weight; local_irq_save(flags); list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list); @@ -2145,32 +2150,36 @@ EXPORT_SYMBOL(__napi_schedule); static void net_rx_action(struct softirq_ac
Re: new NAPI interface broken for POWER architecture?
On Wednesday 12 September 2007, Christoph Raisch wrote: > David Miller <[EMAIL PROTECTED]> wrote on 12.09.2007 14:50:04: > > > I agree that it should be fixed, but we should also fix the IRQ > > distribution scheme used on powerpc platforms which is totally > > broken in these cases. > > This is definitely not something we can change in the HEA device driver > alone. > It could also affect any other networking cards on POWER (e1000,s2io...). > > Paul, Michael, Arndt, what is your opinion here? The situation on Cell with the existing south bridge chips is that interrupts _never_ get moved around, but are routed to specific SMT threads by the firmware, while Linux does not interfere with this. We have been thinking about changing this so we can distribute interrupts over all SMT threads in a given NUMA node, or even over all logical CPUs in the system by reprogramming the interrupt controller after each interrupt, but the current Axon bridge chip will always have all devices routed to the same target CPU, so it's unclear whether that is even an advantage. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken for POWER architecture?
From: Christoph Raisch <[EMAIL PROTECTED]> Date: Wed, 12 Sep 2007 15:10:08 +0200 > This is definitely not something we can change in the HEA device driver > alone. And it shouldn't be, x86 implements the policy in irq balance daemon, powerpc should do it wherever it would be appropriate there. > Paul, Michael, Arndt, what is your opinion here? I'm all ears too :) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken for POWER architecture?
David Miller <[EMAIL PROTECTED]> wrote on 12.09.2007 14:50:04: > From: Jan-Bernd Themann <[EMAIL PROTECTED]> > Date: Fri, 7 Sep 2007 11:37:02 +0200 > > > 2) On SMP systems: after netif_rx_complete has been called on CPU1 > >(+interruts enabled), netif_rx_schedule could be called on CPU2 > >(irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. > >In that case the device would be added to poll lists of CPU1 and CPU2 > >as net_rx_action would see NAPI_STATE_SCHED set. > >This must not happen. It will be caught when netif_rx_complete is > >called the second time (BUG() called) > > > > This would mean we have a problem on all SMP machines right now. > > This is not a correct statement. > > Only on your platform do network device interrupts get moved > around, no other platform does this. > > Sparc64 doesn't, all interrupts stay in one location after > the cpu is initially choosen. > > x86 and x86_64 specifically do not move around network > device interrupts, even though other device types do > get dynamic IRQ cpu distribution. > > That's why you are the only person seeing this problem. > > I agree that it should be fixed, but we should also fix the IRQ > distribution scheme used on powerpc platforms which is totally > broken in these cases. This is definitely not something we can change in the HEA device driver alone. It could also affect any other networking cards on POWER (e1000,s2io...). Paul, Michael, Arndt, what is your opinion here? Gruss / Regards Christoph Raisch - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: new NAPI interface broken
From: Jan-Bernd Themann <[EMAIL PROTECTED]> Date: Fri, 7 Sep 2007 11:37:02 +0200 > 2) On SMP systems: after netif_rx_complete has been called on CPU1 >(+interruts enabled), netif_rx_schedule could be called on CPU2 >(irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. >In that case the device would be added to poll lists of CPU1 and CPU2 >as net_rx_action would see NAPI_STATE_SCHED set. >This must not happen. It will be caught when netif_rx_complete is >called the second time (BUG() called) > > This would mean we have a problem on all SMP machines right now. This is not a correct statement. Only on your platform do network device interrupts get moved around, no other platform does this. Sparc64 doesn't, all interrupts stay in one location after the cpu is initially choosen. x86 and x86_64 specifically do not move around network device interrupts, even though other device types do get dynamic IRQ cpu distribution. That's why you are the only person seeing this problem. I agree that it should be fixed, but we should also fix the IRQ distribution scheme used on powerpc platforms which is totally broken in these cases. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html