Re: Kernel header changes break glibc build
On Fri, 2006-10-06 at 17:20 +, Joseph S. Myers wrote: > The kernel headers installed by Linux 2.6.19-rc1 "make > headers_install" do not work for building glibc, because glibc expects > to provide various definitions, some of which have > been moved to and some of which have been removed > altogether. > > This kernel patch allows glibc to build again by making rtnetlink.h > include if_addr.h and adding back the removed definitions required by > glibc, but I don't know if it's the correct approach or if glibc > should change the headers it includes and add its own macro > definitions. > > Signed-off-by: Joseph Myers <[EMAIL PROTECTED]> Since it was Thomas Graf who made these changes and David Miller who committed them, you'd do well to Cc both of those people. Sorry for the delayed response on my part -- I've been away concentrating on OLPC stuff, and just about everything else has fallen by the wayside. Thomas, this is in response to your changes in http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1823730fbc89fadde72a7bb3b7bdf03cc7b8835c;hp=47f68512d2685431f1781830dfcbab31bda87644 in which you create and require that it's included directly rather than being part of (or even included from) . Was there a good reason for changing that user-visible header? Is there a reason not to include if_addr.h from rtnetlink.h as Joseph's patch does? I suspect that if the IF{L,}A_{PAYLOAD,RTA} macros aren't used in the kernel then the best answer is for glibc to define those for itself. > --- > Index: include/linux/rtnetlink.h > === > --- include/linux/rtnetlink.h > +++ include/linux/rtnetlink.h > @@ -2,6 +2,7 @@ > #define __LINUX_RTNETLINK_H > > #include > +#include > #include > > / > Index: include/linux/if_link.h > === > --- include/linux/if_link.h > +++ include/linux/if_link.h > @@ -82,6 +82,9 @@ > > #define IFLA_MAX (__IFLA_MAX - 1) > > +#define IFLA_RTA(r) ((struct rtattr*)(((char*)(r)) + > NLMSG_ALIGN(sizeof(struct ifinfomsg > +#define IFLA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifinfomsg)) > + > /* ifi_flags. > > IFF_* flags. > Index: include/linux/if_addr.h > === > --- include/linux/if_addr.h > +++ include/linux/if_addr.h > @@ -52,4 +52,7 @@ > __u32 tstamp; /* updated timestamp, hundredths of seconds */ > }; > > +#define IFA_RTA(r) ((struct rtattr*)(((char*)(r)) + > NLMSG_ALIGN(sizeof(struct ifaddrmsg > +#define IFA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct ifaddrmsg)) > + > #endif > > > -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel header changes break glibc build
On Mon, 2006-12-04 at 10:13 +0100, Thomas Graf wrote: > Userspace is not supposd to directly include kernel headers, instead > it has to make local copies and compile against them. No. It was _never_ sensible to simply declare that userspace "shall not use kernel headers" in the absence of any serious alternative. It was always just a cop-out. Historically, there were a number of efforts to provide 'sanitised' kernel headers which are needed for userspace to build against. Various people forked headers from the kernel at different times, got burned out at different times, took different approaches w.r.t. how much abuse of kernel-private stuff should be permitted. We ended up with a bunch of inconsistent sets of 'kernel headers' each done differently and updated sporadically. Thankfully, this is no longer the case. The kernel now has a 'headers_install' make target which spits out those headers which are suitable for userspace, sanitised by removing anything within __KERNEL__. We have a _consistent_, up to date set of headers which distributions can use for building glibc and other system libraries and tools. Some distributions are already shipping like that; others are still working on doing so. > Binary compatibility is always guaranteed but in times of development > within a stable tree it's wrong to assume that headers never change. > > I do not agree with the change to include if_addr.h in rtnetlink.h. > The point is to move bits apart and have multiple small pieces > of header files defining a specific rtnetlink family which are a > lot easier to maintain for both kernel and userspace than one giant > rtnetlink.h for everything. That makes some sense, but we need to be more careful about making incompatible changes in the headers which we export -- it's no longer acceptable to just toss a pile of crap over the wall and declare it to be someone else's problem. That isn't to say that we should _never_ make such changes, but we should at least make sure we think about it and make sure that glibc adapts to cope, _before_ people start to see build failures. > > I suspect that if the IF{L,}A_{PAYLOAD,RTA} macros aren't used in the > > kernel then the best answer is for glibc to define those for itself. > > Right, if they did it right they would only have noticed when they > updated the kernel headers to some newer versions and only had to > move the bits to some compat header. No. They _are_ doing it right -- they're running 'make headers_install' against the 2.6.19 kernel and only _now_ are they finding that we broke it without even the courtesy of a warning, let alone any consultation. If _we_ had done it right, then they would have been warned when we decided to change this, and we wouldn't have just released 2.6.19 with changes which break the glibc build. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel header changes break glibc build
On Wed, 2006-12-06 at 14:43 +0100, Jakub Jelinek wrote: > > +/* 2.6.19 kernel headers helpfully removed some macros and > + moved lots of stuff into new headers, some of which aren't > + included by linux/rtnetlink.h. */ > + > +#ifndef IFA_MAX > +struct ifaddrmsg > +{ > + uint8_t ifa_family; > + uint8_t ifa_prefixlen; > + uint8_t ifa_flags; > + uint8_t ifa_scope; > + uint32_t ifa_index; > +}; > + > +enum > +{ > + IFA_UNSPEC, > + IFA_ADDRESS, > + IFA_LOCAL, > + IFA_LABEL, > + IFA_BROADCAST, > + IFA_ANYCAST, > + IFA_CACHEINFO, > + IFA_MULTICAST > +}; > +#endif > + > +#ifndef IFA_F_SECONDARY > +# define IFA_F_SECONDARY 0x01 > +# define IFA_F_TEMPORARY IFA_F_SECONDARY > +# define IFA_F_HOMEADDRESS 0x10 > +# define IFA_F_DEPRECATED 0x20 > +#endif This much is still available from the new -- you could include that directly instead of copying it. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel header changes break glibc build
On Wed, 2006-12-06 at 14:57 +0100, Jakub Jelinek wrote: > Yes, but as I said, I'd need to add configure checks for that, using > #include > alone breaks build with older headers. I was thinking that the #ifndef IFA_MAX you already have ought to be sufficient for that. Or even checking KERNEL_VERSION. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel header changes break glibc build
On Wed, 2006-12-06 at 14:59 +0100, Thomas Graf wrote: > Are you suggesting that the kernel has to keep macros around which > are of no use to the kernel itself just because glibc uses them? No, although in fact that _is_ the only reason we use these horrid __uXX types rather than proper C datatypes, isn't it? I'm suggesting that if you want to change things around as you did, you should make sure the users of those headers adapt to cope. You did fix the in-kernel users; you neglected to fix glibc -- and as far as I can tell you didn't even bother to _warn_ glibc folks. We need to do better than that. The dark ages where we used to toss a pile of crap over the wall and declare it was someone else's problem are now behind us, thankfully. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETLINK]: Restore API compatibility of address and neighbour bits
On Thu, 2006-12-07 at 11:55 +0100, Thomas Graf wrote: > Restore API compatibility due to bits moved from rtnetlink.h to > separate headers. For 2.6.19.1 too, please. > Signed-off-by: Thomas Graf <[EMAIL PROTECTED]> > > Index: net-2.6/include/linux/rtnetlink.h > === > --- net-2.6.orig/include/linux/rtnetlink.h2006-12-07 11:25:29.0 > +0100 > +++ net-2.6/include/linux/rtnetlink.h 2006-12-07 11:32:25.0 +0100 > @@ -3,6 +3,8 @@ > > #include > #include > +#include > +#include > > / > * Routing/neighbour discovery messages. > - > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel header changes break glibc build
On Wed, 2006-12-06 at 15:23 +0100, Thomas Graf wrote: > > I'm suggesting that if you want to change things around as you did, you > > should make sure the users of those headers adapt to cope. You did fix > > the in-kernel users; you neglected to fix glibc -- and as far as I can > > tell you didn't even bother to _warn_ glibc folks. > > I didn't warn them because I didn't know better. Fair enough. Now you do -- and thanks for fixing it. > I was under the impression that glibc still maintains their own set of > headers and will fix this automatically when they look at the diff. > That's what I do for my userspace applications that use kernel > headers. That's never been the case for glibc. As I said, various people have _attempted_ to do it, but it really isn't a workable approach and the results were wildly inconsistent. Having a single, centrally-exported set of headers is a massive improvement -- but yes, it does mean we have to take a little care with what we're exporting to userspace. > Ideally install_headers would do the trick but it often fails f.e. > when some application which uses bsd features thus including net/if.h > also wants to use new linux features and includes linux/if.h which > then conflicts. Applications in general shouldn't be doing that -- kernel headers are for system libraries and tools only. We should try to ensure that the _sensible_ use cases don't break, but we don't have to go overboard with keeping our namespace clean and anticipating _every_ strange thing which a userspace app might do with our headers -- we still get to say 'caveat emptor'; it's just that we can't be _quite_ as haphazard about it as we've been in the past. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETLINK]: Schedule removal of old macros exported to userspace
On Sat, 2006-12-09 at 15:58 +0100, Stefan Rompf wrote: > But when even glibc needs internal compat headers to compile with the second > kernel version that provides userspace headers, what is the long-term - even > mid-term - perspective of make headers_install then? We've only _just_ started paying attention to what we export. I tried to keep the initial cut of headers_install very simple and uncontentious -- sticking to implementation, and not policy. So I didn't do a particularly thorough cull of stuff that we shouldn't be exposing -- it's expected that we may lose _some_ more stuff. But breaking userspace like _this_ isn't acceptable, and we're not doing it. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [NETLINK]: Schedule removal of old macros exported to userspace
On Sat, 2006-12-09 at 15:58 +0100, Stefan Rompf wrote: > But when even glibc needs internal compat headers to compile with the second > kernel version that provides userspace headers, what is the long-term - even > mid-term - perspective of make headers_install then? Bear in mind that with the first cut of the headers_install I was trying to keep it entirely uncontentious and stick to _mechanism_, not policy. The _first_ set of exported headers still aren't ideal, and we're still cleaning up some parts of them (like properly getting rid of the _syscallX() macros, etc.). We've only just started to be sensible about what we export to userspace -- it's not entirely set in stone at this point, -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix buglets in mpc5200 FEC code that are corrupting memory.
On Sun, 2007-11-18 at 20:49 -0500, Jon Smirl wrote: > On 11/9/07, Grant Likely <[EMAIL PROTECTED]> wrote: > > On 11/9/07, Domen Puncer <[EMAIL PROTECTED]> wrote: > > > On 09/11/07 00:31 -0500, Jon Smirl wrote: > > > > This is the reason I couldn't get user space started or connect to my > > > > nfs server. Patch is against current linus git. > > > > > > > > mpc5200 fec driver is corrupting memory. This patch fixes two bugs > > > > where the wrong skb buffer was being referenced. > > > > > > > > Signed-off-by: Jon Smirl <[EMAIL PROTECTED]> > > > > > > Acked-by: Domen Puncer <[EMAIL PROTECTED]> > > > > Signed-off-by: Grant Likely <[EMAIL PROTECTED]> > > > > Jeff, can you please pick this up for .24? > > This is didn't make it into rc3. Without this patch this driver is broken. Jeff? -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] SET_NETDEV_DEV() in fec_mpc52xx.c
This helps to allow the Fedora installer to use the built-in Ethernet on the Efika for a network install. Signed-off-by: David Woodhouse <[EMAIL PROTECTED]> --- a/drivers/net/fec_mpc52xx.c +++ b/drivers/net/fec_mpc52xx.c @@ -971,6 +971,8 @@ mpc52xx_fec_probe(struct of_device *op, const struct of_device_id *match) mpc52xx_fec_reset_stats(ndev); + SET_NETDEV_DEV(ndev, &op->dev); + /* Register the new network device */ rv = register_netdev(ndev); if (rv < 0) -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Stop phy code from returning success to unknown ioctls.
This kind of sucks, and prevents the Fedora installer from using the device for network installs... [EMAIL PROTECTED] phy]# iwconfig eth0 Warning: Driver for device eth0 has been compiled with an ancient version of Wireless Extension, while this program support version 11 and later. Some things may be broken... eth0ESSID:off/any Nickname:"" NWID:0 Channel:0 Access Point: 00:00:BF:81:14:E0 Bit Rate:-1.08206e+06 kb/s Sensitivity=0/0 RTS thr:off Fragment thr:off Encryption key: Power Management:off Signed-off-by: David Woodhouse <[EMAIL PROTECTED]> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index 9bc1177..7c9e6e3 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -406,6 +406,9 @@ int phy_mii_ioctl(struct phy_device *phydev, && phydev->drv->config_init) phydev->drv->config_init(phydev); break; + + default: + return -ENOTTY; } return 0; -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PASEMI_MAC] Don't claim to do IPv6 checksum offload
Signed-off-by: David Woodhouse <[EMAIL PROTECTED]> diff --git a/drivers/net/pasemi_mac.c b/drivers/net/pasemi_mac.c index 09b4fde..a8db5d7 100644 --- a/drivers/net/pasemi_mac.c +++ b/drivers/net/pasemi_mac.c @@ -1362,7 +1362,7 @@ pasemi_mac_probe(struct pci_dev *pdev, const struct pci_device_id *ent) netif_napi_add(dev, &mac->napi, pasemi_mac_poll, 64); - dev->features = NETIF_F_HW_CSUM | NETIF_F_LLTX | NETIF_F_SG; + dev->features = NETIF_F_IP_CSUM | NETIF_F_LLTX | NETIF_F_SG; /* These should come out of the device tree eventually */ mac->dma_txch = index; -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix memory corruption in fec_mpc52xx
From: Jon Smirl <[EMAIL PROTECTED]> The mpc5200 fec driver is corrupting memory. This patch fixes two bugs where the wrong skb was being referenced. Signed-off-by: Jon Smirl <[EMAIL PROTECTED]> Acked-by: Domen Puncer <[EMAIL PROTECTED]> Signed-off-by: Grant Likely <[EMAIL PROTECTED]> Signed-off-by: David Woodhouse <[EMAIL PROTECTED]> --- a/drivers/net/fec_mpc52xx.c +++ b/drivers/net/fec_mpc52xx.c @@ -422,7 +422,7 @@ static irqreturn_t mpc52xx_fec_rx_interrupt(int irq, void *dev_id) rskb = bcom_retrieve_buffer(priv->rx_dmatsk, &status, (struct bcom_bd **)&bd); - dma_unmap_single(&dev->dev, bd->skb_pa, skb->len, DMA_FROM_DEVICE); + dma_unmap_single(&dev->dev, bd->skb_pa, rskb->len, DMA_FROM_DEVICE); /* Test for errors in received frame */ if (status & BCOM_FEC_RX_BD_ERRORS) { @@ -467,7 +467,7 @@ static irqreturn_t mpc52xx_fec_rx_interrupt(int irq, void *dev_id) bcom_prepare_next_buffer(priv->rx_dmatsk); bd->status = FEC_RX_BUFFER_SIZE; - bd->skb_pa = dma_map_single(&dev->dev, rskb->data, + bd->skb_pa = dma_map_single(&dev->dev, skb->data, FEC_RX_BUFFER_SIZE, DMA_FROM_DEVICE); bcom_submit_next_buffer(priv->rx_dmatsk, skb); -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.24] pasemi_mac: Fix reuse of free'd skb
Oops: Exception in kernel mode, sig: 5 [#1] nfs: server pmac not responding, still trying SMP NR_CPUS=128 NUMA PA Semi PWRficient Modules linked in: appletouch cbc blkcipher dm_crypt dm_emc dm_round_robin dm_mu ltipath dm_snapshot dm_mirror dm_zero dm_mod xfs jfs reiserfs lock_nolock gfs2 m sdos linear raid10 raid456 async_memcpy async_tx async_xor xor raid1 raid0 sata_ mv libata pasemi_mac gpio_mdio libphy ehci_hcd ohci_hcd iscsi_tcp libiscsi scsi_ transport_iscsi sr_mod sd_mod scsi_mod ide_cd cdrom ipv6 ext2 ext4dev crc16 jbd2 ext3 mbcache jbd squashfs loop nfs nfs_acl lockd sunrpc vfat fat cramfs NIP: c03af2f0 LR: c03af2ec CTR: c00791c0 REGS: c9c0 TRAP: 0700 Not tainted (2.6.24-0.67.rc3.git7.fc9) MSR: 90029032 CR: 2428 XER: TASK = c0677720[0] 'swapper' THREAD: c0758000 CPU: 0 GPR00: c03af2ec cc40 c0753d98 0087 GPR04: GPR08: c0677720 0004 GPR12: c0678200 GPR16: 4140 GPR20: 0040 06e0 06e8 0037 GPR24: c0003ba40370 00dc 540045ee0111a3a8 GPR28: 05ee c0003d104900 c0712b18 c0003b939800 NIP [c03af2f0] .skb_over_panic+0x50/0x58 LR [c03af2ec] .skb_over_panic+0x4c/0x58 Call Trace: [cc40] [c03af2ec] .skb_over_panic+0x4c/0x58 (unreliable) [ccd0] [d02dfb10] .pasemi_mac_clean_rx+0x2f0/0x480 [pasemi_m ac] [cda0] [d02e008c] .pasemi_mac_poll+0x44/0xe4 [pasemi_mac] [ce40] [c03bba54] .net_rx_action+0xf8/0x214 [cef0] [c008eb58] .__do_softirq+0xa8/0x164 [cf90] [c002b468] .call_do_softirq+0x14/0x24 [c075b940] [c000d4d0] .do_softirq+0x68/0xac [c075b9d0] [c008ecc8] .irq_exit+0x60/0xb0 [c075ba50] [c000db00] .do_IRQ+0x174/0x1e8 [c075baf0] [c0004c24] hardware_interrupt_entry+0x24/0x28 --- Exception: 501 at .ppc64_runlatch_off+0x0/0x54 LR = .cpu_idle+0x70/0x1f0 [c075bde0] [c00135c4] .cpu_idle+0x11c/0x1f0 (unreliable) [c075be60] [c0009b08] .rest_init+0x78/0x90 [c075bee0] [c05c59ec] .start_kernel+0x3f8/0x41c [c075bf90] [c0008590] .start_here_common+0x60/0xd0 Instruction dump: 80a30068 e8e300c8 e90300d0 812300bc 814300c0 2fa0 409e0008 e81e8018 e87e8028 f8010070 4bcd92cd 6000 <0fe0> 4800 7c0802a6 faa1ffa8 Kernel panic - not syncing: Fatal exception in interrupt Rebooting in 180 seconds.. -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.24] pasemi_mac: Fix reuse of free'd skb
On Tue, 2007-12-04 at 18:04 +, David Woodhouse wrote: > Oops: Exception in kernel mode, sig: 5 [#1] > > nfs: server pmac not responding, still trying Oops, sorry. That was meant to be just to Olof. And it helps if I'm actually running the kernel in which his patch is applied, too... -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2.6.24-rc3] Fix /proc/net breakage
On Mon, 2007-11-26 at 15:17 -0700, Eric W. Biederman wrote: > Well I clearly goofed when I added the initial network namespace support > for /proc/net. Currently things work but there are odd details visible > to user space, even when we have a single network namespace. > > Since we do not cache proc_dir_entry dentries at the moment we can > just modify ->lookup to return a different directory inode depending > on the network namespace of the process looking at /proc/net, replacing > the current technique of using a magic and fragile follow_link method. > > To accomplish that this patch: > - introduces a shadow_proc method to allow different dentries to > be returned from proc_lookup. > - Removes the old /proc/net follow_link magic > - Fixes a weakness in our not caching of proc generic dentries. > > As shadow_proc uses a task struct to decided which dentry to return we > can go back later and fix the proc generic caching without modifying any code > that > uses the shadow_proc method. > > Signed-off-by: Eric W. Biederman <[EMAIL PROTECTED]> > --- > fs/proc/generic.c | 12 ++- > fs/proc/proc_net.c | 86 > +++ > include/linux/proc_fs.h |3 ++ > 3 files changed, 19 insertions(+), 82 deletions(-) (commit 2b1e300a9dfc3196ccddf6f1d74b91b7af55e416) This seems to have broken the use of /proc/bus/usb as a mountpoint. It always appears empty now, whatever's supposed to be mounted there. -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] PS3: gelic: Add wireless support for PS3
--- linux-2.6.23.ppc64/drivers/net/ps3_gelic_net.c~ 2007-12-14 01:31:50.0 -0500 +++ linux-2.6.23.ppc64/drivers/net/ps3_gelic_net.c 2007-12-14 01:39:25.0 -0500 @@ -1139,7 +1139,7 @@ static irqreturn_t gelic_card_interrupt( * * see Documentation/networking/netconsole.txt */ -static void gelic_net_poll_controller(struct net_device *netdev) +void gelic_net_poll_controller(struct net_device *netdev) { struct gelic_card *card = netdev_card(netdev); --- linux-2.6.23.ppc64/drivers/net/ps3_gelic_net.h~ 2007-12-14 01:31:50.0 -0500 +++ linux-2.6.23.ppc64/drivers/net/ps3_gelic_net.h 2007-12-14 01:39:47.0 -0500 @@ -346,6 +346,7 @@ extern void gelic_net_tx_timeout(struct extern int gelic_net_change_mtu(struct net_device *netdev, int new_mtu); extern int gelic_net_setup_netdev(struct net_device *netdev, struct gelic_card *card); +extern void gelic_net_poll_controller(struct net_device *netdev); /* shared ethtool ops */ extern void gelic_net_get_drvinfo(struct net_device *netdev, -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH FINAL] Merge the Sonics Silicon Backplane subsystem
On Tue, 2007-07-31 at 20:26 -0700, Andrew Morton wrote: > Look. Kconfig's `select' Just. Does. Not. Work. If you find > yourself contemplating using it, please, don sackcloth, take a cold > shower and several analgesics, then have another go, OK? Amen. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Fix incorrect prototype for ipxrtr_route_packet()
The function ipxrtr_route_packet() takes a 'len' argument of type size_t. However, its prototype in af_ipx.c incorrectly suggests that the corresponding argument is of type 'int' instead. Discovered by building with --combine and letting the compiler see it all at once. Signed-off-by: David Woodhouse <[EMAIL PROTECTED]> --- a/net/ipx/af_ipx.c +++ b/net/ipx/af_ipx.c @@ -87,7 +87,7 @@ extern int ipxrtr_add_route(__be32 network, struct ipx_interface *intrfc, unsigned char *node); extern void ipxrtr_del_routes(struct ipx_interface *intrfc); extern int ipxrtr_route_packet(struct sock *sk, struct sockaddr_ipx *usipx, - struct iovec *iov, int len, int noblock); + struct iovec *iov, size_t len, int noblock); extern int ipxrtr_route_skb(struct sk_buff *skb); extern struct ipx_route *ipxrtr_lookup(__be32 net); extern int ipxrtr_ioctl(unsigned int cmd, void __user *arg); -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Fix incorrect prototype for ipxrtr_route_packet()
On Thu, 2007-05-17 at 15:27 -0700, Andrew Morton wrote: > Lovely. So it was actually generating wrong code on all > sizeof(size_t)!=sizeof(int) architectures. I was trying to work out which architectures would actually be affected. Probably not many. If it's in a register it probably doesn't matter -- the size only matters if you put it on the stack. And it would have to be a 64-bit architecture, none of which (iirc) are so register-starved that this particular argument would be on the stack anyway. So I'm not actually sure it will really generate wrong code on any architecture we support -- not that this excuses it in any way :) > If only we could find some way in which all callers of a function as > well as its definition can see the same declaration? Well, building with --combine helps. :) Admittedly it doesn't necessarily catch _all_ callers -- only those which we actually compile together. That does cover _most_ of the times we do crap like this without a prototype in a header file though, I suspect. As Geert points out, sparse can warn about non-static functions which aren't prototyped; if we start getting anal about that, it might help. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [1/2] 2.6.22-rc4: known regressions with patches
On Tue, 2007-06-05 at 16:54 +0200, Michal Piotrowski wrote: > File systems > > Subject: JFFS2 issues > References : > http://lists.infradead.org/pipermail/linux-mtd/2007-May/018426.html > Submitter : Haavard Skinnemoen <[EMAIL PROTECTED]> > Caused-By : commit 10731f83009e2556f98ffa5c7c2cbffe66dacfb3 > Artem Bityutskiy <[EMAIL PROTECTED]> > Handled-By : Artem Bityutskiy <[EMAIL PROTECTED]> > Patch : > http://lists.infradead.org/pipermail/linux-mtd/2007-May/018453.html > Status : patch available > This is fixed in 2.6.22-rc4 with this commit: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ea55d30798ac206c9f584ac264b6b8eb093d237a -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cbe-oss-dev] [PATCH 0/15] spidernet driver bug fixes
On Wed, 2007-06-13 at 11:14 -0500, Linas Vepstas wrote: > Some googling seems to show that "git pull" has a bug/feature of > ignoring the branch that one is working in, and pulling "master" > no matter what. I have no clue why; this seems broken to me. Branches are generally a PITA -- it's probably best just to avoid them entirely. It's not as if it's hard to create separate trees, and even share objects between the trees. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cbe-oss-dev] [PATCH 0/15] spidernet driver bug fixes
On Thu, 2007-06-14 at 19:01 -0400, Jeff Garzik wrote: > It makes diffing between lines of development more difficult, takes up > more overall space, less cache friendly, ... All of which is much less true if you're sharing object directories or even using alternates. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Cbe-oss-dev] [PATCH 0/15] spidernet driver bug fixes
On Thu, 2007-06-14 at 19:04 -0400, Jeff Garzik wrote: > Think about the actual kernel tree source code, not just the > metadata... Disk is cheap. Waiting for the whole damn thing to rebuild after switching branches and back again is less so. Besides, checking it out is optional. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.23-rc6: known regressions with patches
On Wed, 2007-09-12 at 18:59 +0200, Michal Piotrowski wrote: > MTD > > Subject : include/linux/mtd/map.h:128:2: error: #error "No bus width > supported. What's the point?" > References : http://lkml.org/lkml/2007/9/11/151 > Last known good : ? > Submitter : Toralf Förster <[EMAIL PROTECTED]> > Caused-By : ? > Handled-By : ? > Patch : > http://git.infradead.org/?p=mtd-2.6.git;a=commit;h=241651d04d672fb685b2874707016cbbf95931e5 > Status : patch available Didn't you already _drop_ this from your list of regressions once? -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPV6]: Do not set IF_READY if device is down
On Thu, 2007-03-08 at 03:59 +, Linux Kernel Mailing List wrote: > Gitweb: > http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7ababbdc647e67e953d153ddf62cbdc9fe3297e > Commit: c7ababbdc647e67e953d153ddf62cbdc9fe3297e > Parent: 16bec31db751030171b31d7767fa3a5bdbe980ea > Author: Herbert Xu <[EMAIL PROTECTED]> > AuthorDate: Wed Mar 7 16:02:40 2007 -0800 > Committer: David S. Miller <[EMAIL PROTECTED]> > CommitDate: Wed Mar 7 16:08:12 2007 -0800 > > [IPV6]: Do not set IF_READY if device is down > > Now that we add the IPv6 device at registration time we don't need > to set IF_READY in ipv6_add_dev anymore because we will always get > a NETDEV_UP event later on should the device ever become ready. > > Signed-off-by: Herbert Xu <[EMAIL PROTECTED]> > Signed-off-by: David S. Miller <[EMAIL PROTECTED]> This commit seems to have broken NetworkManager, which removes the link-local IPv6 address from each wireless interface while it's only intending to use it for scanning, then adds it back when it actually wants to _use_ the interface in question. After this commit, any IPv6 address added to an interface which had none will be marked as 'tentative', and will never become real. A failing test looks like this + ip -6 addr del fe80::203:ccff:fe3c:27d/64 dev eth0 + ip -6 addr del 2001:8b0:10b:1:203:ccff:fe3c:27d/64 dev eth0 + ip -6 addr list eth0 + ip -6 addr add fe80::203:ccff:fe3c:27d/64 dev eth0 + ip -6 addr list eth0 4: eth0: mtu 1500 qlen 1000 inet6 fe80::203:ccff:fe3c:27d/64 scope link tentative valid_lft forever preferred_lft forever + sleep 3 + ip -6 addr list eth0 4: eth0: mtu 1500 qlen 1000 inet6 fe80::203:ccff:fe3c:27d/64 scope link tentative valid_lft forever preferred_lft forever ... while on a good kernel it looks like this... + ip -6 addr del fe80::203:ccff:fe3c:27d/64 dev eth0 + ip -6 addr del 2001:8b0:10b:1:203:ccff:fe3c:27d/64 dev eth0 + ip -6 addr list eth0 + ip -6 addr add fe80::203:ccff:fe3c:27d/64 dev eth0 + ip -6 addr list eth0 2: eth0: mtu 1500 qlen 1000 inet6 fe80::203:ccff:fe3c:27d/64 scope link tentative valid_lft forever preferred_lft forever + sleep 3 + ip -6 addr list eth0 2: eth0: mtu 1500 qlen 1000 inet6 2001:8b0:10b:1:203:ccff:fe3c:27d/64 scope global dynamic valid_lft 298sec preferred_lft 118sec inet6 fe80::203:ccff:fe3c:27d/64 scope link valid_lft forever preferred_lft forever If I add an extra IPv6 address before running the script on a 'bad' kernel, it behaves correctly. The problem only happens when you remove _all_ IPv6 addresses from an interface. Answers of "don't do that then" should be accompanied with an alternative plan which allows NetworkManager to use a wireless device for scanning without actually having any other effect. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [IPV6]: Do not set IF_READY if device is down
On Wed, 2007-03-28 at 07:30 +1000, Herbert Xu wrote: > Sorry, that patch is indeed broken. We do need to set IF_READY in > the case where all addresses were deleted (including the link-local) > and then recreated. The IPv6 device will be destroyed and recreated > too in that case. Your new patch fixes the problem for me, at least with my test script. Thanks. -- dwmw2 - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 2.6.20-rc5: known regressions with patches (v2)
On Thu, 2007-01-18 at 20:59 +0100, Adrian Bunk wrote: > Subject: CONFIG_JFFS2_FS_DEBUG=2 compile error > References : http://lkml.org/lkml/2007/1/12/161 > Submitter : Russell King <[EMAIL PROTECTED]> > Caused-By : Al Viro <[EMAIL PROTECTED]> > commit 914e26379decf1fd984b22e51fd2e4209b7a7f1b > Handled-By : David Woodhouse <[EMAIL PROTECTED]> > Status : patch available Linus, please pull from git://git.infradead.org/mtd-2.6.git This fixes the above bug along with a few others. It does also contain a small amount of new code which has been waiting for a while (including the driver for the CAFÉ NAND controller which we use on OLPC.). My apologies for missing the merge window and first asking you to pull this a few hours after 2.6.20-rc1 was cut; I'd been waiting for the bitrev stuff to land, and had waited too long. Adrian Bunk (3): [MTD] SSFDC must depend on BLOCK [MTD] [NAND] rtc_from4.c: use lib/bitrev.c [MTD] make drivers/mtd/cmdlinepart.c:mtdpart_setup() static Adrian Hunter (2): [MTD] OneNAND: Implement read-while-load [MTD] OneNAND: Handle DDP chip boundary during read-while-load Akinobu Mita (1): [JFFS2] Use rb_first() and rb_last() cleanup Alan Cox (1): [MTD] MAPS: esb2rom: use hotplug safe interfaces Alexey Dobriyan (1): [MTD] JEDEC probe: fix comment typo (devic) Amit Choudhary (1): [JFFS2] Fix error-path leak in summary scan Andrew Morton (1): [MTD] Tidy bitrev usage in rtc_from4.c Andrew Victor (2): [MTD] NAND: AT91 NAND driver [MTD] NAND: Support for 16-bit bus-width on AT91. Artem Bityutskiy (10): [MTD] core: trivial comments fix [MTD] NAND: nandsim: support subpage write [MTD] increase MAX_MTD_DEVICES [MTD] add get_mtd_device_nm() function [MTD] add get and put methods [MTD] return error code from get_mtd_device() [MTD] nandsim: bugfix in page addressing [JFFS2] add cond_resched() when garbage collecting deletion dirent [JFFS2] Reschedule in loops [MTD] OneNAND: release CPU in cycles Burman Yan (1): [MTD] replace kmalloc+memset with kzalloc Dave Olsen (1): [MTD] [MAPS] Support for BIOS flash chips on the nvidia ck804 southbridge David Anders (1): [MTD] NOR: leave Intel chips in read-array mode on suspend David Woodhouse (29): [MTD NAND] Initial import of CAFÉ NAND driver. [MTD NAND] OLPC CAFÉ driver update Merge branch 'master' of git://git.kernel.org/.../torvalds/linux-2.6 [MTD] NAND: Combined oob buffer so it's contiguous with data [MTD] NAND: Correct setting of chip->oob_poi OOB buffer Merge git://git.infradead.org/~dwmw2/cafe-2.6 [MTD] NAND: Add hardware ECC correction support to CAFÉ NAND driver [MTD] NAND: CAFÉ NAND driver cleanup, fix ECC on reading empty flash [MTD] NAND: Disable ECC checking on CAFÉ since it's broken for now [MTD] NAND: Café ECC -- remove spurious BUG_ON() in err_pos() [MTD] NAND: Reset Café controller before initialising. [MTD] CAFÉ NAND: Add 'slowtiming' parameter, default usedma and checkecc on [MTD] NAND: Add ECC debugging for CAFÉ [MTD] NAND: Remove empty block ECC workaround [MTD] NAND: Fix timing calculation in CAFÉ debugging message [MTD] NAND: Use register #defines throughout CAFÉ driver, not numbers [MTD] NAND: Add register debugging spew option to CAFÉ driver [MTD] NAND: Fix ECC settings in CAFÉ controller driver. Merge git://git.infradead.org/~dwmw2/cafe-2.6 Merge git://git.infradead.org/~kmpark/onenand-mtd-2.6 [MTD] [NAND] Update CAFÉ driver interrupt handler prototype [MTD] Use EXPORT_SYMBOL_GPL() for exported symbols. [MTD] Remove trailing whitespace Merge branch 'master' of git://git.kernel.org/.../torvalds/linux-2.6 [MTD] Fix SSFDC build for variable blocksize. [MTD] Fix ssfdc blksize typo Merge branch 'master' of git://git.infradead.org/~kmpark/onenand-mtd-2.6 [JFFS2] debug.h: include for current->pid Merge branch 'master' of git://git.kernel.org/.../torvalds/linux-2.6 Haavard Skinnemoen (1): [MTD] bugfix: DataFlash is not bit writable Jeff Garzik (1): [JFFS2] kill warning RE debug-only variables Josh Boyer (1): [MTD] add MTD_BLKDEVS Kconfig option Kyungmin Park (9): MTD: OneNAND: interrupt based wait support [MTD] OneNAND: lock support [MTD] OneNAND: Single bit error detection [MTD] OneNAND: fix oob handling in recent oob patch [JFFS2] use the ref_offset macro [MTD] OneNAND: fix onenand_wait bug [MTD] OneNAND: add subpage write support [MTD] OneNAND: fix onenand_wait bug in read ecc error [MTD] OneNAND: return ecc error code only when 2-bit ecc occurs Lew Glendenning (1): [MTD] MAPS: Support for BIOS f
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Wed, 2015-09-23 at 10:58 -0700, David Miller wrote: > From: David Woodhouse > Date: Wed, 23 Sep 2015 09:14:09 +0100 > > > But sure, for now I'll drop this from the series and I can try to > > convince you separately. > > Yes, let's discuss this independantly to the nice bug fixing > that's happening in the rest of this series. > > Thanks. This is the patch we discussed earlier, and I came away from the conversation believing that I had achieved my goal of trying to convince you separately :) The patch in patchwork at https://patchwork.ozlabs.org/patch/520318/ still applies cleanly; is that sufficient or would you like it resubmitted (perhaps after -rc1)? Thanks. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH] Fix AF_PACKET ABI breakage in 4.2
Commit 7d82410950aa ("virtio: add explicit big-endian support to memory accessors") accidentally changed the virtio_net header used by AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian. Since virtio_legacy_is_little_endian() is a very long identifier, define a VIO_LE macro and use that throughout the code instead of the hard-coded 'false' for little-endian. This restores the ABI to match 4.1 and earlier kernels, and makes my test program work again. Signed-off-by: David Woodhouse --- Or perhaps we should just use (__force u16) and (__force __virtio16) since that's all we really want anyway? diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 7b8e39a..d646623 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -230,6 +230,8 @@ struct packet_skb_cb { } sa; }; +#define VIO_LE virtio_legacy_is_little_endian() + #define PACKET_SKB_CB(__skb) ((struct packet_skb_cb *)((__skb)->cb)) #define GET_PBDQC_FROM_RB(x) ((struct tpacket_kbdq_core *)(&(x)->prb_bdqc)) @@ -2680,15 +2682,15 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) goto out_unlock; if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && - (__virtio16_to_cpu(false, vnet_hdr.csum_start) + -__virtio16_to_cpu(false, vnet_hdr.csum_offset) + 2 > - __virtio16_to_cpu(false, vnet_hdr.hdr_len))) - vnet_hdr.hdr_len = __cpu_to_virtio16(false, -__virtio16_to_cpu(false, vnet_hdr.csum_start) + - __virtio16_to_cpu(false, vnet_hdr.csum_offset) + 2); + (__virtio16_to_cpu(VIO_LE, vnet_hdr.csum_start) + +__virtio16_to_cpu(VIO_LE, vnet_hdr.csum_offset) + 2 > + __virtio16_to_cpu(VIO_LE, vnet_hdr.hdr_len))) + vnet_hdr.hdr_len = __cpu_to_virtio16(VIO_LE, +__virtio16_to_cpu(VIO_LE, vnet_hdr.csum_start) + + __virtio16_to_cpu(VIO_LE, vnet_hdr.csum_offset) + 2); err = -EINVAL; - if (__virtio16_to_cpu(false, vnet_hdr.hdr_len) > len) + if (__virtio16_to_cpu(VIO_LE, vnet_hdr.hdr_len) > len) goto out_unlock; if (vnet_hdr.gso_type != VIRTIO_NET_HDR_GSO_NONE) { @@ -2731,7 +2733,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) hlen = LL_RESERVED_SPACE(dev); tlen = dev->needed_tailroom; skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, - __virtio16_to_cpu(false, vnet_hdr.hdr_len), + __virtio16_to_cpu(VIO_LE, vnet_hdr.hdr_len), msg->msg_flags & MSG_DONTWAIT, &err); if (skb == NULL) goto out_unlock; @@ -2778,8 +2780,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) if (po->has_vnet_hdr) { if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { - u16 s = __virtio16_to_cpu(false, vnet_hdr.csum_start); - u16 o = __virtio16_to_cpu(false, vnet_hdr.csum_offset); + u16 s = __virtio16_to_cpu(VIO_LE, vnet_hdr.csum_start); + u16 o = __virtio16_to_cpu(VIO_LE, vnet_hdr.csum_offset); if (!skb_partial_csum_set(skb, s, o)) { err = -EINVAL; goto out_free; @@ -2787,7 +2789,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) } skb_shinfo(skb)->gso_size = - __virtio16_to_cpu(false, vnet_hdr.gso_size); + __virtio16_to_cpu(VIO_LE, vnet_hdr.gso_size); skb_shinfo(skb)->gso_type = gso_type; /* Header must be checked, and gso_segs computed. */ @@ -3161,9 +3163,9 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, /* This is a hint as to how much should be linear. */ vnet_hdr.hdr_len = - __cpu_to_virtio16(false, skb_headlen(skb)); + __cpu_to_virtio16(VIO_LE, skb_headlen(skb)); vnet_hdr.gso_size = - __cpu_to_virtio16(false, sinfo->gso_size); + __cpu_to_virtio16(VIO_LE, sinfo->gso_size); if (sinfo->gso_type & SKB_GSO_TCPV4) vnet_hdr.gso_type = VIRTIO_NET_HDR_GSO_TCPV4; else if (sinfo->gso_type & SKB_GSO_TCPV6) @@ -3181,9 +3183,9 @@ static int packet_recvmsg(stru
Re: [RFC PATCH 1/3] Avoid making inappropriate requests of NETIF_F_V[46]_CSUM devices
On Mon, 2015-09-21 at 17:29 +0100, David Woodhouse wrote: > > Did we ever resolve this? AFAICT from inspecting the code the > virtio_net device still advertises hardware csum capabilities to the > guest. And accepts packets which need checksumming, calling > skb_partial_csum_set() as appropriate. Likewise tun, xen, macvtap and > af_packet. Here's a test case which provokes the network stack into handing a CHECKSUM_PARTIAL skb to a device which it knows can't handle it. (It obviously needs the AF_PACKET endianness ABI fix I sent earlier.) You might well be right to refer to this as 'injecting unchecked crap', but we are *gaining* injection points with the ability to do this, and for not entirely insane reasons — people want to be able to make full use of hardware offload capabilities. And we *have* a safety check, to avoid handing CHECKSUM_PARTIAL buffers to devices which can't handle them. We already do check the capabilities of the device we end up routing it to, and complete the checksum in software if the device can't cope. All we're talking about here is a corner case when that existing check doesn't actually give the right results, because it assumes a device with NETIF_F_IP_CSUM can checksum *all* Legacy IP packets, not just TCP and UDP. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation #include #include #include #include #include #include #include #include #include #include //#include #define PACKET_VNET_HDR 15 struct virtio_net_hdr { uint8_t flags; uint8_t gso_type; uint16_t hdr_len; uint16_t gso_size; uint16_t csum_start; uint16_t csum_offset; }; #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 unsigned char eth_hdr[] = { 0x52, 0x54, 0x00, 0x3a, 0xbe, 0x28, 0x52, 0x54, 0x00, 0x74, 0x2f, 0xfd, 0x08, 0x00 }; unsigned char icmp_pkt[] = { 0x45, 0x00, 0x00, 0x54, 0x11, 0xec, 0x40, 0x00, 0x40, 0x01, 0xb3, 0x58, 0xc0, 0xa8, 0x7a, 0x12, 0xc0, 0xa8, 0x7a, 0x01, 0x08, 0x00, 0x00, 0x00, 0x07, 0xd2, 0x00, 0x01, 0x63, 0x7f, 0x02, 0x56, 0x00, 0x00, 0x00, 0x00, 0x7c, 0x1b, 0x0b, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2a, 0x2b, 0x2c, 0x2d, 0x2e, 0x2f, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37 }; unsigned char udp_pkt[] = { 0x45, 0x00, 0x00, 0x22, 0xc4, 0x25, 0x40, 0x00, 0x40, 0x11, 0x01, 0x41, 0xc0, 0xa8, 0x7a, 0x12, 0xc0, 0xa8, 0x7a, 0x01, 0xae, 0xc7, 0x1f, 0x90, 0x00, 0x0e, 0x75, 0x84, 0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x0a }; int main(int argc, char **argv) { struct ifreq req; struct sockaddr_ll lladdr; int fd, ret, val; struct { struct virtio_net_hdr vnet; unsigned char eth[14]; unsigned char pkt[2048]; } buf; if (argc != 2) { fprintf(stderr, "Usage: %s \n", argv[0]); exit(1); } fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_IP)); if (fd < 0) { perror("socket"); exit(1); } memset(&req, 0, sizeof(req)); strncpy(req.ifr_name, argv[1], IFNAMSIZ-1); ret = ioctl(fd, SIOCGIFINDEX, &req); if (ret < 0) { perror("SIOGIFINDEX"); exit(1); } memset(&lladdr, 0, sizeof(lladdr)); lladdr.sll_family = AF_PACKET; lladdr.sll_protocol = htons(ETH_P_IP); lladdr.sll_ifindex = req.ifr_ifindex; ret = bind(fd, (const struct sockaddr *)&lladdr, sizeof(lladdr)); if (ret < 0) { perror("bind"); exit(1); } val = 1; ret = setsockopt(fd, SOL_PACKET, PACKET_VNET_HDR, (const char *)&val, sizeof(val)); if (ret < 0) { perror("setsockopt(SOL_PACKET, PACKET_VNET_HDR)"); exit(1); } memset(&buf.vnet, 0, sizeof(buf.vnet)); memcpy(&buf.eth, eth_hdr, sizeof(eth_hdr)); memcpy(&buf.pkt, udp_pkt, sizeof(udp_pkt)); buf.vnet.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM; buf.vnet.csum_start = 0x22; buf.vnet.csum_offset = 0x6; if (write(fd, (void *)&buf, sizeof(buf.vnet) + 14 + sizeof(udp_pkt)) < 0) { perror("Write UDP packet"); exit(1); } memcpy(&buf.pkt, icmp_pkt, sizeof(icmp_pkt)); buf.vnet.csum_offset = 0x2; if (write(fd, (void *)&buf, sizeof(buf.vnet) + 14 + sizeof(icmp_pkt)) < 0) { perror("Write ICMP packet"); exit(1); } return 0; } smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Wed, 2015-08-12 at 17:06 -0500, Jeremy Linton wrote: > > > +static const struct acpi_device_id smsc911x_acpi_match[] = { > + { "ARMH9118", 0 }, > + { } > +}; > +MODULE_DEVICE_TABLE(acpi, smsc911x_acpi_match); > + > static struct platform_driver smsc911x_driver = { > .probe = smsc911x_drv_probe, > .remove = smsc911x_drv_remove, > @@ -2661,6 +2656,7 @@ static struct platform_driver smsc911x_driver = > { > .name = SMSC_CHIPNAME, > .pm = SMSC911X_PM_OPS, > .of_match_table = of_match_ptr(smsc911x_dt_ids), > + .acpi_match_table = ACPI_PTR(smsc911x_acpi_match), > }, > }; Hm, surely you shouldn't need this part? If the ACPI device has a HID of PRP0001, and an appropriate "compatible" property, then it ought to be loaded anyway. If that doesn't work, we should fix that. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
[PATCH v2] Fix AF_PACKET ABI breakage in 4.2
Commit 7d82410950aa ("virtio: add explicit big-endian support to memory accessors") accidentally changed the virtio_net header used by AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian. Since virtio_legacy_is_little_endian() is a very long identifier, define a VIO_LE macro and use that throughout the code instead of the hard-coded 'false' for little-endian. This restores the ABI to match 4.1 and earlier kernels, and makes my test program work again. Signed-off-by: David Woodhouse --- On Wed, 2015-09-23 at 11:09 -0700, David Miller wrote: > > +#define VIO_LE virtio_legacy_is_little_endian() > > When you define a shorthand macro, the defines to a function call, > make the macro have parenthesis too. In which case I suppose it also wants to be lower-case. Although "function call" is a bit strong since it's effectively just a constant. I'm still wondering if it'd be nicer just to use (__force u16) instead. diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 7b8e39a..aa4b15c 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -230,6 +230,8 @@ struct packet_skb_cb { } sa; }; +#define vio_le() virtio_legacy_is_little_endian() + #define PACKET_SKB_CB(__skb) ((struct packet_skb_cb *)((__skb)->cb)) #define GET_PBDQC_FROM_RB(x) ((struct tpacket_kbdq_core *)(&(x)->prb_bdqc)) @@ -2680,15 +2682,15 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) goto out_unlock; if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && - (__virtio16_to_cpu(false, vnet_hdr.csum_start) + -__virtio16_to_cpu(false, vnet_hdr.csum_offset) + 2 > - __virtio16_to_cpu(false, vnet_hdr.hdr_len))) - vnet_hdr.hdr_len = __cpu_to_virtio16(false, -__virtio16_to_cpu(false, vnet_hdr.csum_start) + - __virtio16_to_cpu(false, vnet_hdr.csum_offset) + 2); + (__virtio16_to_cpu(vio_le(), vnet_hdr.csum_start) + +__virtio16_to_cpu(vio_le(), vnet_hdr.csum_offset) + 2 > + __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len))) + vnet_hdr.hdr_len = __cpu_to_virtio16(vio_le(), +__virtio16_to_cpu(vio_le(), vnet_hdr.csum_start) + + __virtio16_to_cpu(vio_le(), vnet_hdr.csum_offset) + 2); err = -EINVAL; - if (__virtio16_to_cpu(false, vnet_hdr.hdr_len) > len) + if (__virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len) > len) goto out_unlock; if (vnet_hdr.gso_type != VIRTIO_NET_HDR_GSO_NONE) { @@ -2731,7 +2733,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) hlen = LL_RESERVED_SPACE(dev); tlen = dev->needed_tailroom; skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, - __virtio16_to_cpu(false, vnet_hdr.hdr_len), + __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len), msg->msg_flags & MSG_DONTWAIT, &err); if (skb == NULL) goto out_unlock; @@ -2778,8 +2780,8 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) if (po->has_vnet_hdr) { if (vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) { - u16 s = __virtio16_to_cpu(false, vnet_hdr.csum_start); - u16 o = __virtio16_to_cpu(false, vnet_hdr.csum_offset); + u16 s = __virtio16_to_cpu(vio_le(), vnet_hdr.csum_start); + u16 o = __virtio16_to_cpu(vio_le(), vnet_hdr.csum_offset); if (!skb_partial_csum_set(skb, s, o)) { err = -EINVAL; goto out_free; @@ -2787,7 +2789,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) } skb_shinfo(skb)->gso_size = - __virtio16_to_cpu(false, vnet_hdr.gso_size); + __virtio16_to_cpu(vio_le(), vnet_hdr.gso_size); skb_shinfo(skb)->gso_type = gso_type; /* Header must be checked, and gso_segs computed. */ @@ -3161,9 +3163,9 @@ static int packet_recvmsg(struct socket *sock, struct msghdr *msg, size_t len, /* This is a hint as to how much should be linear. */ vnet_hdr.hdr_len = - __cpu_to_virtio16(false, skb_headlen(skb)); + __cpu_to_virtio16(vio_le(), skb_headlen(skb)); vnet_hdr.gso_size = - __cpu_to_virtio16(false, sinfo-&g
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Wed, 2015-09-23 at 10:58 -0700, David Miller wrote: > From: David Woodhouse > Date: Wed, 23 Sep 2015 09:14:09 +0100 > > > But sure, for now I'll drop this from the series and I can try to > > convince you separately. > > Yes, let's discuss this independantly to the nice bug fixing > that's happening in the rest of this series. Since you explicitly call them 'bug fixes'... I suppose the last patch in the latest series (enabling the various offloads by default) isn't a bug fix and should be held back for net-next. I think I can live with pushing the rest for the net tree. Just to recap, the series is: 8139cp: Do not re-enable RX interrupts in cp_tx_timeout() 8139cp: Fix tx_queued debug message to print correct slot numbers 8139cp: Fix TSO/scatter-gather descriptor setup 8139cp: Reduce duplicate csum/tso code in cp_start_xmit() 8139cp: Fix DMA unmapping of transmitted buffers 8139cp: Dump contents of descriptor ring on TX timeout 8139cp: Enable offload features by default The penultimate patch is also not strictly a bug fix, but it's only adding additional diagnostics to a code path which was already broken until I fixed it anyway, so I can live with that in net although I'll also be happy if you want to defer it. The fourth patch which removes the three duplicate versions of the csum/tso checks might also be deferrable but really, it's *also* fixing the failure mode when we see an inappropriate CHECKSUM_PARTIAL packet, and it's touching a code path which I've *also* fixed in this patch set. So it might as well stay. I can shuffle things around if you disagree, but assuming Francois concurs I'd suggest merging patches 1-6 (or just pulling from git://git.infradead.org/users/dwmw2/linux-8139cp.git fixes-only) and then I'll resubmit the last patch some time later, after you next pull net into net-next. Otherwise, please let me know if/how you'd like me to reorganise it. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Wed, 2015-09-23 at 22:51 +0200, Rafael J. Wysocki wrote: > > But if the device ID is assigned already, why would it hurt to > actually use it? It doesn't hurt at all, as long as we understand that there was no need to assign it a device ID at all, at least for Linux's benefit. And as long as it doesn't set a precedent that makes other people think they need to do so. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
[PATCH] 8139cp: Set GSO max size and enforce it
The maximum MSS value we can set is 0xFFF. When fixing the TSO code I noticed we just mask the gso_size value to the hardware's maximum and don't care about the consequences. That seems... unsagacious. Instead of masking, WARN and drop the packet if it's too large. And use netif_set_gso_max_size() so it shouldn't happen in the first place. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index e5173f3..fa8f850 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -754,10 +754,16 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0; mss = skb_shinfo(skb)->gso_size; + if (mss > MSSMask) { + WARN_ONCE(1, "Net bug: GSO size %d too large for 8139CP\n", + mss); + goto out_dma_error; + } + opts2 = cpu_to_le32(cp_tx_vlan_tag(skb)); opts1 = DescOwn; if (mss) - opts1 |= LargeSend | ((mss & MSSMask) << MSSShift); + opts1 |= LargeSend | (mss << MSSShift); else if (skb->ip_summed == CHECKSUM_PARTIAL) { const struct iphdr *ip = ip_hdr(skb); if (ip->protocol == IPPROTO_TCP) @@ -1994,6 +2000,8 @@ static int cp_init_one (struct pci_dev *pdev, const struct pci_device_id *ent) dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO | NETIF_F_HIGHDMA; + netif_set_gso_max_size(dev, MSSMask); + rc = register_netdev(dev); if (rc) goto err_out_iomap; -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Wed, 2015-09-23 at 14:48 -0700, David Miller wrote: > I've applied patches #1-#6 and left #7 sitting around in patchwork > for when I next merge net into net-next, thanks. Great, thanks. I've got one more for the net tree; just masking the gso_size to the maximum (0xfff) that the hardware can handle, and never telling the net stack that that's our maximum, doesn't seem like such a good idea... I think I really am going to stop looking now :) -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Thu, 2015-09-24 at 00:44 +0200, Francois Romieu wrote: > David Woodhouse : > [...] > > I can shuffle things around if you disagree, but assuming Francois > > concurs I'd suggest merging patches 1-6 (or just pulling from > > git://git.infradead.org/users/dwmw2/linux-8139cp.git fixes-only) > > and then I'll resubmit the last patch some time later, after you > > next pull net into net-next. > > No objection but I do not really trust whatever review I do this late. > > Did you try #5 with the DMA allocation debug code enabled ? Only in QEMU (where there wasn't a problem in the first place). But yes. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Thu, 2015-09-24 at 00:44 +0200, Francois Romieu wrote: > David Woodhouse : > [...] > > But sure, for now I'll drop this from the series and I can try to > > convince you separately. > > You may as well change the IRQ storm avoidance logic so that it does not > require conformant driver code if you do not want to waste more time :o) That's not entirely trivial — at a certain level we *do* require that drivers don't lie to the core IRQ logic. I suppose even if a rogue driver is returning IRQ_HANDLED in an IRQ storm, when it's *not* actually doing anything or making any progress, perhaps there's a way we could detect that we're always going straight back into the next interrupt handler as *soon* as we exit the previous one. Perhaps there's merit in exploring that option. I'm not sure that gives you a valid excuse for not wanting to fix the driver though :) I might start by "clarifying" the documentation on the IRQ handler's return value... -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Wed, 2015-09-23 at 16:56 -0700, Hanjun Guo wrote: > > It really depends on the people who writing the ASL code (DSDT), > I'm not sure if Windows will use _DSD and how to use it, but > firmware guys may just use the device ID to make the firmware > to be usable both by Linux and Windows, and that's reasonable > I think. And the ideal way to do that is to add a _CID of "PRP0001" and a compatible string, and then Linux doesn't need to care about the special new HID that you've added purely to pander to the limitations of Windows. That way, once drivers are converted from the legacy of-property APIs to the new device-property APIs, there should be *nothing* required to make device drivers work under ACPI. That was kind of the point. If people want to add HIDs, fine. But let's not go down a path which *requires* modifying drivers. (I'm planning to try to learn Coccinelle and do a bombing run convertin to the new API some time soon, FWIW). -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
[PATCH WTF] 8139cp: Fix GSO MSS handling
When fixing the TSO support I noticed we just mask skb->gso_size with the MSSMask value and don't care about the consequences. That seems... unsagacious. Instead of masking, WARN and drop the packet if the maximum is exceeded. And call netif_set_gso_max_size() so it shouldn't happen in the first place. Finally, Francois Romieu noticed that we didn't even have the right value for MSSMask anyway; it should be 0x7ff (11 bits) not 0xfff. Signed-off-by: David Woodhouse --- Nice and simple and obvious, right? So why does it stop TSO from happening at all? When I call netif_set_gso_max_size() with a value of 0xaad (2733) or higher, I see TSO being used (with an MSS of 1214; this is just between a VM and its host; Legacy IP with an MTU of 1500). If I set a maximum of 0xaac or lower, TSO never gets used. Am I doing something wrong? Or is there a problem somewhere else? drivers/net/ethernet/realtek/8139cp.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 686334f..ae24d42 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -175,7 +175,7 @@ enum { LastFrag= (1 << 28), /* Final segment of a packet */ LargeSend = (1 << 27), /* TCP Large Send Offload (TSO) */ MSSShift= 16,/* MSS value position */ - MSSMask = 0xfff, /* MSS value: 11 bits */ + MSSMask = 0x7ff, /* MSS value: 11 bits */ TxError = (1 << 23), /* Tx error summary */ RxError = (1 << 20), /* Rx error summary */ IPCS= (1 << 18), /* Calculate IP checksum */ @@ -754,10 +754,16 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0; mss = skb_shinfo(skb)->gso_size; + if (mss > MSSMask) { + WARN_ONCE(1, "Net bug: GSO size %d too large for 8139CP\n", + mss); + goto out_dma_error; + } + opts2 = cpu_to_le32(cp_tx_vlan_tag(skb)); opts1 = DescOwn; if (mss) - opts1 |= LargeSend | ((mss & MSSMask) << MSSShift); + opts1 |= LargeSend | (mss << MSSShift); else if (skb->ip_summed == CHECKSUM_PARTIAL) { const struct iphdr *ip = ip_hdr(skb); if (ip->protocol == IPPROTO_TCP) @@ -1994,6 +2000,8 @@ static int cp_init_one (struct pci_dev *pdev, const struct pci_device_id *ent) dev->vlan_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_TSO | NETIF_F_HIGHDMA; + netif_set_gso_max_size(dev, MSSMask); + rc = register_netdev(dev); if (rc) goto err_out_iomap; -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2] Fix AF_PACKET ABI breakage in 4.2
On Thu, 2015-09-24 at 12:55 +0300, Michael S. Tsirkin wrote: > > I'm fine with this patch > > Acked-by: Michael S. Tsirkin Thanks. In fact Dave has already merged it. > but if you want to re-work it along the lines suggested > by Greg, that's also fine with me. If I'm going to define my own accessors, I'd probably just make them use (__force __virtio16). But TBH none of the options seemed particularly pretty to me. I can live with what we have. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH v2 RFC] 8139cp: Fix GSO MSS handling
When fixing the TSO support I noticed we just mask ->gso_size with the MSSMask value and don't care about the consequences. Provide a .ndo_features_check() method which drops the NETIF_F_TSO feature for any skb which would exceed the maximum, and thus forces it to be segmented by software. Then we can stop the masking in cp_start_xmit(), and just WARN if the maximum is exceeded, which should now never happen. Finally, Francois Romieu noticed that we didn't even have the right value for MSSMask anyway; it should be 0x7ff (11 bits) not 0xfff. Signed-off-by: David Woodhouse --- OK, netif_set_gso_max_size() was *not*, as I had naïvely assumed, a way to set the maximum value of ->gso_size. That's an unfortunate set of naming decisions. I don't see a simple way to tell the net stack about the maximum MSS value we can cope with in the general case. But we can do it this way by clearing NETIF_F_TSO on a per-packet basis instead. Incidentally, other drivers might benefit from doing the same if this is considered reasonable. For example it looks like the r8169 driver manually calls skb_gso_segment() for its fallback case, and I don't see what then prevents it from hitting the "BUG!" at the start of rtl8169_start_xmit() if it runs out of space in the descriptor ring. drivers/net/ethernet/realtek/8139cp.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 686334f..fe1040d 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -175,7 +175,7 @@ enum { LastFrag= (1 << 28), /* Final segment of a packet */ LargeSend = (1 << 27), /* TCP Large Send Offload (TSO) */ MSSShift= 16,/* MSS value position */ - MSSMask = 0xfff, /* MSS value: 11 bits */ + MSSMask = 0x7ff, /* MSS value: 11 bits */ TxError = (1 << 23), /* Tx error summary */ RxError = (1 << 20), /* Rx error summary */ IPCS= (1 << 18), /* Calculate IP checksum */ @@ -754,10 +754,16 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0; mss = skb_shinfo(skb)->gso_size; + if (mss > MSSMask) { + WARN_ONCE(1, "Net bug: GSO size %d too large for 8139CP\n", + mss); + goto out_dma_error; + } + opts2 = cpu_to_le32(cp_tx_vlan_tag(skb)); opts1 = DescOwn; if (mss) - opts1 |= LargeSend | ((mss & MSSMask) << MSSShift); + opts1 |= LargeSend | (mss << MSSShift); else if (skb->ip_summed == CHECKSUM_PARTIAL) { const struct iphdr *ip = ip_hdr(skb); if (ip->protocol == IPPROTO_TCP) @@ -1852,6 +1858,15 @@ static void cp_set_d3_state (struct cp_private *cp) pci_set_power_state (cp->pdev, PCI_D3hot); } +static netdev_features_t cp_features_check(struct sk_buff *skb, + struct net_device *dev, + netdev_features_t features) +{ + if (skb_shinfo(skb)->gso_size > MSSMask) + features &= ~NETIF_F_TSO; + + return vlan_features_check(skb, features); +} static const struct net_device_ops cp_netdev_ops = { .ndo_open = cp_open, .ndo_stop = cp_close, @@ -1864,6 +1879,7 @@ static const struct net_device_ops cp_netdev_ops = { .ndo_tx_timeout = cp_tx_timeout, .ndo_set_features = cp_set_features, .ndo_change_mtu = cp_change_mtu, + .ndo_features_check = cp_features_check, #ifdef CONFIG_NET_POLL_CONTROLLER .ndo_poll_controller= cp_poll_controller, -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
Hi Catalin, I understand your concerns, but I'm still not convinced of your conclusion. On Thu, 2015-09-24 at 11:31 +0100, Catalin Marinas wrote: > PRP0001 opens the door to any DT-like properties in ACPI and, knowing > how the ARM ecosystem works, I'm pretty sure we'll soon lose control (if > we haven't already; not all the developments are done in the open). No. That door is wide open already — people can already do whatever they like in _DSD properties. If they're going to screw up DT properties, they'll screw up ACPI-only _DSD properties just the same. You speak of maintaining "tight control of the _DSD properties that are going to be used in ACPI tables"... but if you're going to confiscate their crackpipe and stand over them while they work, you can do that just as well whether they're using "PRP0001" or "FOO1234" as their HID value. In that sense, the HID is entirely orthogonal. And think about it... we *really* don't want a given peripheral device to have *different* bindings depending on whether it's discovered with a specific ACPI HID, vs. when it's discovered via DT or the PRP0001 HID. That way lies complete insanity. In some ways, your proposal would be actively *counterproductive*. You say you want to train people *not* to keep patching the kernel. But where they *could* have just used PRP0001 and used a pre-existing kernel, you then tell them "oh, but now you need to patch the kernel because we want you to make up a new HID and not be compatible with what already exists." If you go down this road, I predict we'll start seeing *separate* drivers for identical components, because the bindings for DT vs. ACPI properties are different. We really don't want that. > The pro ARM ACPI camp has been very vocal against DT in the server > space. I'd like to seem them as vocal about PRP0001, unless they find it > acceptable (and should apologise for bashing DT ;)). Fundamentally, that DT vs. ACPI distinction has gone away with the introduction of _DSD. We *need* the flexibility that we gain from being able to provide device properties rather than inventing a new HID for every permutation, and with that flexibility comes a certain amount of responsibility to do things sensibly. People have not always designed their bindings sensibly. But that isn't going to be magically solved in the ACPI world, unless you *do* actually stand over them with their crackpipe in your hand, as I joked above. And eschewing PRP0001 really doesn't help you with that. It just makes things harder. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2 RFC] 8139cp: Fix GSO MSS handling
On Thu, 2015-09-24 at 05:05 -0700, Eric Dumazet wrote: > > Right, netif_skb_features() only has the following checks : > > if (gso_segs > dev->gso_max_segs || gso_segs < dev->gso_min_segs) > features &= ~NETIF_F_GSO_MASK; > > But now we have .ndo_features_check() we could remove this generic > check from fast path. Perhaps so, yes. Any thoughts on the other reason I was staring at this same code path this week? I am able to reliably feed inappropriate packets to a NETIF_F_IP_CSUM-capable device with the test program at http://bombadil.infradead.org/~dwmw2/raw.c (and equivalent code paths via virtio_net, tun and others). They're *supposed* to get checksummed by software if the device can't cope, but netif_skb_features() returns the wrong answer, so we fail to do that and they're fed with CHECKSUM_PARTIAL to a device which can't handle them. Causing a WARN() or a BUG() or a silent corruption, depending on the driver. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Thu, 2015-09-24 at 15:01 +0100, Lorenzo Pieralisi wrote: > On Thu, Sep 24, 2015 at 12:52:38PM +0100, David Woodhouse wrote: > > [...] > > > And think about it... we *really* don't want a given peripheral device > > to have *different* bindings depending on whether it's discovered with > > a specific ACPI HID, vs. when it's discovered via DT or the PRP0001 > > HID. That way lies complete insanity. > > There are "devices" that we do _not_ want to discover at all when > booting with ACPI (eg drivers/clk, drivers/regulator) because the way > they interact with ACPI standard power management AML methods may > definitely clash. So it would be quite clearly a firmware bug to *expose* those things in that way, when also exposing AML methods which make use of them. But those really aren't what I'm looking at converting, anyway. The target of the Coccinelle script would be leaf-node drivers, like the one in this thread. Not subsystem code. We have, for example, a generic method for 'get the GPIO' which works equally well whether the actual information is given in the DT form, or whether it's done through the ACPI GPIO abstraction as described in Documentation/acpi/gpio-properties.txt. We certainly shouldn't be looking at a naïve 1:1 literal transliteration of properties. But on the other hand, we *do* want a direct and automatic bijective mapping between the ACPI and DT representations. > PRP0001 allows to pull everything in (depending on what the Coccinelle > script will look like, because AFAIK that's the only thing that stops > people from enabling eg drivers/clk drivers in ACPI systems with PRP0001, > namely the OF to FW node API conversion). > > > In some ways, your proposal would be actively *counterproductive*. You > > say you want to train people *not* to keep patching the kernel. But > > where they *could* have just used PRP0001 and used a pre-existing > > kernel, you then tell them "oh, but now you need to patch the kernel > > because we want you to make up a new HID and not be compatible with > > what already exists." > > We do not want people to reuse DT drivers that are handled in ACPI with > existing methods (again, power management is a prime example). > I do not think we want to enable DT PCI host controllers drivers either > since the ACPI model for PCI works differently already. > > Should I mention CPUfreq and CPUidle DT drivers ? No. Except to note that those are prime examples of the parts that *don't* want to be converted. > > Fundamentally, that DT vs. ACPI distinction has gone away with the > > introduction of _DSD. We *need* the flexibility that we gain from being > > able to provide device properties rather than inventing a new HID for > > every permutation, and with that flexibility comes a certain amount of > > responsibility to do things sensibly. > > How do you enforce that apart from limiting the number of drivers you > convert to the FW node API ? What drivers are we going to convert and > on what basis ? To start with, as I said, just leaf-node device drivers. None of the things like the above, which you are quite right don't really want a trivial conversion. It's insane to talk of "enforcement", so let's not go there. Anyone taking an existing driver and adding an ACPI HID to it would need to start with *precisely* the same patches for those properties anyway. Just as shown in the example in this thread. > I think that's a debate worth having before making a tree-wide conversion > of OF API calls to the generic FW node API. Well, if that's true, that's a debate worth having before introducing the generic FW node API. That having been kind of the *point* of the generic FW node API — to give code which doesn't *need* to be gratuitously firmware-specific, an API that works everywhere. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Thu, 2015-09-24 at 16:15 +0100, Catalin Marinas wrote: > With "PRP0001", they can skip the _DSD properties review process (not > that they bother much currently) as long as the existing DT support > covers their needs. So no change there then. I take it the smsc911x bindings didn't go through this mythical _DSD properties review process either, right? > However, we don't want to emulate DT in ACPI but > first try the > established ACPI methods and only use _DSD where these are > insufficient. Automatically converting existing drivers and encouraging > people to use "PRP0001" does not provide them with any incentive to try > harder and learn the "ACPI way". But again, we're *not* looking at a simplistic transliteration of properties. Where there *is* an "ACPI way", such as the GPIO example I gave, that should absolutely be used. And there should be sufficiently high-level access functions that it shouldn't *matter* to the driver whether the information is coming from ACPI or DT. > > In that sense, the HID is entirely orthogonal. > > > > And think about it... we *really* don't want a given peripheral device > > to have *different* bindings depending on whether it's discovered with > > a specific ACPI HID, vs. when it's discovered via DT or the PRP0001 > > HID. That way lies complete insanity. > > I'm fine with reusing the same property names between _DSD and DT but I > strongly consider that _DSD should only cover a _subset_ of the DT > bindings and they should be reviewed independently. > > Take the smsc911x driver for example: the DT bindings for the ARM Juno > platform include things like "clocks", "vdd33a-supply". Such concepts > are not available in ACPI (but we've seen people trying to emulate > them), so rather than enabling such clocks in firmware, vendors will be > tempted to do things the DT way, possibly with "PRP0001" HIDs covering > the clock devices (though I think they currently require some kernel > changes). Aren't those optional? If nothing is provided, they are basically a no-op. AFAICT the existing patches for smsc911x don't touch that code at all. So whatever motivation you imagine there is for ACPI firmware vendors to "do things the DT way" is still there with the existing patches. Again, this is *orthogonal* to the question of whether the device is matched by PRP0001 + its compatible string, vs. a newly-invented HID. > > In some ways, your proposal would be actively *counterproductive*. You > > say you want to train people *not* to keep patching the kernel. But > > where they *could* have just used PRP0001 and used a pre-existing > > kernel, you then tell them "oh, but now you need to patch the kernel > > because we want you to make up a new HID and not be compatible with > > what already exists." > > I gave an example above with the clocks but let's take a simpler, > device-specific property like "smsc,irq-active-high". Is this documented > anywhere apart from the kernel driver and the in-kernel DT smsc911x.txt? > No. Are other non-Linux OS vendors going to look in the kernel source > tree to implement support in their drivers? I doubt it and we could end > up with two different paths in firmware for handling the same device. > ACPI probably never was truly OS agnostic but with "PRP0001" we don't > even try. Where is the documentation for the INT_CFG_IRQ_POL_ bit in the INT_CFG register, that it controls? The documentation on the bindings really ought to live near that, in an ideal world. But that's still a non-sequitur in the context of this particular discussion. The patch that was posted *already* keeps that very same (optional) smsc,irq-active-high property. Again, it isn't relevant to the question of whether the driver is matched via PRP0001 or a newly-allocated HID. The driver, as the patch was posted, *does* have the same set of properties with the same meanings — because anything else would be insane. > The idea of registering a HID is to also have a process for getting > corresponding _DSD properties approved, published. Some of them like > "mac-address" would be more generic and not require further review but > we'll have more specific properties. > > Patching a kernel driver to support a new HID is a (weak) method of > keeping track of which devices are used with ACPI. Given how quickly > these two patches were merged while still under discussion, I doubt that > arch maintainers would be in a position to review/reject drivers using > non-approved _DSD properties. Again, nothing has changed there. > Apart from a _DSD properties review process, what we need is (main) OS > vendors enforcing it, possibly with stricter ACPI test suite. Disabling > PRP0001 for ARM wouldn't solve the problem but at least it would make > people think about what _DSD properties they need registered for their > device (or I'm over optimistic and I should just stop caring ;)). I'm all for requiring pre-existing DT bindings to be "vetted" by the nascent _DSD review process, b
[RFC PATCH] Fix false positives in can_checksum_protocol()
The check in harmonize_features() is supposed to match the skb against the features of the device in question, and prevent us from handing a skb to a device which can't handle it. It doesn't work correctly. A device with NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM capabilities is only required to checksum TCP or UDP, on Legacy IP and IPv6 respectively. But the existing check will allow us to pass it *any* ETH_P_IP/ETH_P_IPV6 packets for hardware checksum offload. Depending on the driver in use, this leads to a BUG, a WARNING, or just silent data corruption. This is one approach to fixing that, and my test program at http://bombadil.infradead.org/~dwmw2/raw.c can no longer trivially reproduce the problem. The test does now have false *negatives*, but those shouldn't happen for locally-generated packets; only packets injected from af_packet, tun, virtio_net and other places that allow us to inject CHECKSUM_PARTIAL packets in order to make use of hardware offload features. And false negatives aren't anywhere near as much of a problem as false positives are — we just finish the checksum in software and send the packet anyway. It would be possible to fix those false negatives, if we really wanted to — perhaps by adding an additional bit in the skbuff which indicates that it *is* a TCP or UDP packet, rather than using ->sk->sk_protocol. Then that bit could be set if appropriate in skb_partial_csum_set(), as well as the places where we locally generate such packets. And the check in can_checksum_protocol() would just check for that bit. Signed-off-by: David Woodhouse --- Since can_checksum_protocol is inline, the compiler ought to know that it doesn't even need to dereference skb->sk in the case where the device has the NETIF_F_GEN_CSUM feature. So the additional check should not slow down the (hopefully) common case in the fast path. diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 2d15e38..76c8330 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3628,15 +3628,23 @@ struct sk_buff *skb_gso_segment(struct sk_buff *skb, netdev_features_t features) __be16 skb_network_protocol(struct sk_buff *skb, int *depth); static inline bool can_checksum_protocol(netdev_features_t features, -__be16 protocol) -{ - return ((features & NETIF_F_GEN_CSUM) || - ((features & NETIF_F_V4_CSUM) && -protocol == htons(ETH_P_IP)) || - ((features & NETIF_F_V6_CSUM) && -protocol == htons(ETH_P_IPV6)) || - ((features & NETIF_F_FCOE_CRC) && -protocol == htons(ETH_P_FCOE))); +__be16 protocol, u8 sk_protocol) +{ + if ((features & NETIF_F_GEN_CSUM) || + ((features & NETIF_F_FCOE_CRC) && protocol == htons(ETH_P_FCOE))) + return 1; + + /* NETIF_F_V[46]_CSUM are defined to work only on TCP and UDP. +* That is, when it needs to start checksumming at the transport +* header, and place the result at an offset of either 6 (for UDP) +* or 16 (for TCP). +*/ + if features & NETIF_F_V4_CSUM) && protocol == htons(ETH_P_IP)) || +((features & NETIF_F_V6_CSUM) && protocol == htons(ETH_P_IPV6))) && + (sk_protocol == IPPROTO_TCP || sk_protocol == IPPROTO_UDP)) + return 1; + + return 0; } #ifdef CONFIG_BUG diff --git a/net/core/dev.c b/net/core/dev.c index 6bb6470..3c40957 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2613,7 +2613,8 @@ static netdev_features_t harmonize_features(struct sk_buff *skb, features = net_mpls_features(skb, features, type); if (skb->ip_summed != CHECKSUM_NONE && - !can_checksum_protocol(features, type)) { + !can_checksum_protocol(features, type, + skb->sk ? skb->sk->sk_protocol : 0)) { features &= ~NETIF_F_ALL_CSUM; } else if (illegal_highdma(skb->dev, skb)) { features &= ~NETIF_F_SG; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index dad4dd3..9126c6f 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3004,7 +3004,8 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb, return ERR_PTR(-EINVAL); csum = !head_skb->encap_hdr_csum && - !!can_checksum_protocol(features, proto); + !!can_checksum_protocol(features, proto, + head_skb->sk ? head_skb->sk->sk_protocol : 0); headroom = skb_headroom(head_skb); pos = skb_headlen(head_skb); -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Fri, 2015-09-25 at 16:28 +0100, Catalin Marinas wrote: > > This only works as long as they target an existing driver with prior DT > support (usually with reviewed bindings). If they have a new driver and > only ACPI in mind, I'm pretty sure we'll end up with new insane things. > That's why we need some form of _DSD properties review and "compatible" > is one such property. Sure, that makes a lot of sense. My main concern is that we don't end up with gratuitously *different* property sets for DT vs. ACPI. That way we end up with either two separate drivers, or abstracting the core of the driver out and having two bindings for it (much as we do for PCI vs. platform/etc for some devices already). We don't want that pain where we can avoid it. And we don't want people to *have* to hack the kernel driver to migrate to ACPI, if we can avoid it. Sometimes it might be worth being different — if the DT binding is utterly crap, and deserves to be thrown away. In that case, by all means invent a new binding. But if we're going to accept the pain of having multiple bindings, why not make the *new* one work via DT too anyway. I'd be happy to see the existing DT bindings put through the nascent _DSD review process — such as it is — and into the database. One at a time on a case-by-case basis as they get used, perhaps. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH v2 RFC] 8139cp: Fix GSO MSS handling
On Sun, 2015-09-27 at 22:37 -0700, Tom Herbert wrote: > > Which drivers are doing this? It is up to the driver to determine > whether a particular packet being sent can have checksum offloaded to > the device. If it cannot offload the checksum it must call > skb_checksum_help. Not so. A driver sets the NETIF_F_IP_CSUM feature to indicate that it can do the checksum on Legacy IP TCP or UDP frames and *nothing* else. It most certainly does not expect to be handed any other kind of packet for checksumming, and bad things will often happen if it is. If drivers *do* spot that they've been given something they don't handle, I see BUG() calls and warnings, but I don't see any of them calling skb_checksum_help() to silently cope. Many of them just feed it to the hardware and don't even notice at all because it's the *hardware* which decides whether to do a TCP or a UDP checksum. So who knows what'll happen. The check is supposed to be done in can_checksum_protocol(), called from harmonize_features(). But as noted, that check has false positives and lets some inappropriate packets through — for NETIF_F_IP_CSUM it lets through *all* skbuffs with ->protocol == ETH_P_IP instead of only TCP and UDP. I originally couldn't see how to deal with this except by looking at the contents of the packet, which sucked. But I think I've found a somewhat more acceptable approach now: http://lists.openwall.net/netdev/2015/09/25/85 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [RFC PATCH] Fix false positives in can_checksum_protocol()
On Mon, 2015-09-28 at 10:03 -0700, Tom Herbert wrote: > > + if features & NETIF_F_V4_CSUM) && protocol == htons(ETH_P_IP)) > > || > > +((features & NETIF_F_V6_CSUM) && protocol == > > htons(ETH_P_IPV6))) && > > + (sk_protocol == IPPROTO_TCP || sk_protocol == IPPROTO_UDP)) > > + return 1; > > + > Relying on skb->sk->sk_protocol is problematic. This is making the > assumption that the checksum being offloaded for the packet is the > same as that of the protocol for the socket-- this may not be the > case when we are offloading an outer checksum in encapsulation. > Currently this wouldn't a be problem since we're probably only > offloading outer UDP checksums, but if we ever start trying to > offload other outer checksums (e.g. GRE) then this probably doesn't > work so well. That makes sense. > Also, this doesn't help those drivers that that can offload TCP and > UDP for IPv6 but only if there are no extension headers, in those > case the driver needs to look at the packet to see if it is a > "simple" UDP/TCP packet. Hm, are such devices even permitted to set NETIF_F_IPV6_CSUM? > AFAIK, the only non UDP/TCP transport IP checksum in the stack is GRE > checksum which as I pointed out we don't attempt to offload. So the > only way to trip the bug that you are seeing is probably through a > userspace packet interface like in the test code. I think this > actually might expose a much more serious issue. Looking at tun.c, I > don't see anything that validates that the csum_start and csum_offset > provided by userspace actually refers to a sane checksum offset. That's handled in skb_partial_csum_set(). > Not only is this a way to ask the stack to perform checksums for non > TCP/UDP, but it actually seems like the interface could be used by a > malicious application to have a device arbitrarily overwrite two > bytes anywhere in the packet with it's own data far below the stack, > netfilter, routing. To really fix this we should probably be doing > validation in tun, if the checksum isn't for TCP or UDP then call > skb_checksum_help before sending the packet into the stack. So... if it's never valid to ask for a hardware checksum on anything but TCP or UDP, why do we bother with NETIF_F_GEN_CSUM at all? Should we just be removing it entirely? That seems like something of a retrograde step. Perhaps a better solution would be a bit in the skbuff which indicates that it *is* a TCP or UDP checksum. That would be set by our UDP and TCP sockets, cleared by encapsulation, also set if appropriate by skb_partial_csum_set(). And then the check in can_checksum_protocol() is trivial and clearly correct. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [RFC PATCH] Fix false positives in can_checksum_protocol()
On Mon, 2015-09-28 at 12:13 -0700, Tom Herbert wrote: > > > Perhaps a better solution would be a bit in the skbuff which indicates > > that it *is* a TCP or UDP checksum. That would be set by our UDP and > > TCP sockets, cleared by encapsulation, also set if appropriate by > > skb_partial_csum_set(). > > > Yes I agree. What I have been thinking to do is steal two bits from > csum_offset that would indicate that the checksum is IPv4 or IPv6 > (specifically that the checksum value is seeded with an IPv4 or IPv6 > pseudo header). This information plus the csum_offset would be > sufficient for drivers to identify the checksum as UDP/TCP-IPv4/IPv6. > The other case that needs special handling is inner vs. outer > checksum, but that can be deduced by comparing (inner of outer) > transport offset to checksum start. With this and a couple of utility > functions we should be able to start deprecating NETIF_F_IP_CSUM and > NETIF_F_IPV6_CSUM. You mean drivers which currently set NETIF_F_IP_CSUM would need to provide a .ndo_features_check() which tolerates only the packets they can actually handle? And we'd just ensure that the bits are there for them to use, in the skbuff? That seems reasonable. Note that 'seeded with an IPv[46] pseudo header' isn't quite sufficient. Some hardware like 8139cp is explicitly told to do a UDP or a TCP checksum with a bit in the descriptor, so any UDP-like or TCP -like checksum works out fine. Other hardware works out whether to do a UDP or a TCP checksum for *itself*, so it *can't* cope with other protocols which just happen to look the same. For those it really *must* be IPPROTO_TCP or IPPROTO_UDP and they're going to be looking in the IP header for it. I do suspect we'll want a bit which says it's *actually* TCP or UDP, not just 'seeded with a pseudo-header'. That's the important distinction for NETIF_F_IP_CSUM vs. NETIF_F_HW_CSUM. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [RFC PATCH] Fix false positives in can_checksum_protocol()
On Mon, 2015-09-28 at 12:37 -0700, Tom Herbert wrote: > I think it's easier to just call skb_checksum_help from the driver > when the packet is actually sent to the device (should be no cost for > late binding). That's true for checksum. Not for things like TSO though, and I wonder if it's worth keeping it simple and doing it *all* in .ndo_features_check()? > > Note that 'seeded with an IPv[46] pseudo header' isn't quite > > sufficient. Some hardware like 8139cp is explicitly told to do a UDP or > > a TCP checksum with a bit in the descriptor, so any UDP-like or TCP > > -like checksum works out fine. > > > UDP or TCP can be determined from csum_offset, e.g. 16=>TCP 6=>UDP Kind of. There'll be false positives there too, though. That was actually the basis of my first attempt to address this, at http://lists.openwall.net/netdev/2013/01/14/36 -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [RFC PATCH] Fix false positives in can_checksum_protocol()
On Mon, 2015-09-28 at 20:04 -0700, Tom Herbert wrote: > > > I've been pondering a bit of a redesign in this space. I think the > > skb struct should be explicit in its instructions to hardware for > > which offloads to do for each packet. > > > > In this way, the stack would be *directly* telling the drivers what to > > do (and what not to do), solving all sorts of bugs and really improving > > driver reliability and implementation. > > > Doesn't CHECKSUM_PARTIAL with csum_offset and csum_start already tell > the driver unambiguously what to do wrt checksum offload? Right. That's precisely what we *do* have. But as things stand, we can't *use* it to its full capability. It's fine for decent devices which can handle such explicit instructions (advertised by the NETIF_F_HW_CSUM feature). The problem is the crappy devices that can *only* checksum UDP and TCP frames, advertised with the NETIF_F_IP{V6,}_CSUM features. We make a primitive attempt *not* to feed arbitrary checksum requests to such hardware. But we fail — we end up feeding *all* Legacy IP packets to a NETIF_F_IP_CSUM device, and *all* IPv6 packets to a NETIF_F_IPV6_CSUM device, regardless of whether they're *actually* TCP or UDP packets. That's the problem I'm trying to solve. And then we *can* make full use of the generic checksum offload (I had it working for ICMPv6 at one point: http://lists.openwall.net/netdev/2013/01/14/38 ). -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net-next 3/3] r8169: support IPv6
> Support the IPv6 hw checksum for RTL8111C and later chips. Note > that the hw has the limitation for the transport offset. The > checksum must be calculated by sw, when the transport offset is > out of the range which the hw accepts. It would be better to implement this check in a .ndo_features_check method, just clearing the appropriate CSUM/TSO features for the skbs for which the hardware cannot cope. That way the stack fixes them up for you. > + if (skb_shinfo(skb)->gso_size) { > + netdev_features_t features = tp->dev->features; > + struct sk_buff *segs, *nskb; > + > + features &= ~(NETIF_F_SG | NETIF_F_IPV6_CSUM | NETIF_F_TSO6); > + segs = skb_gso_segment(skb, features); > + if (IS_ERR(segs) || !segs) > + goto drop; > + > + do { > + nskb = segs; > + segs = segs->next; > + nskb->next = NULL; > + rtl8169_start_xmit(nskb, tp->dev); > + } while (segs); > + > + dev_kfree_skb(skb); This loop in particular makes no attempt to avoid exceeding the available space in your descriptor ring, and can drop packets and trigger the warning at the start of your hard_start_xmit function. -- dwmw2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH] Fix false positives in can_checksum_protocol()
On Tue, 2015-09-29 at 15:52 -0700, Tom Herbert wrote: > Please look at ixgbe_tx_csum in ixgbe driver. This one example of how > a driver can determine whether the checksum being offloaded is TCP or > UDP. The bug in this driver is... I think it serves better as an example of why we don't *want* drivers doing that kind of thing for themselves... :) I propose we steal some high bits from csum_offset, as you suggested, and use them to indicate a 'checksum type', which will include TCP and UDP. Then the filter in netif_skb_features() can trivially do the right thing for NETIF_F_IP{V6,}_CSUM devices, so avoid feeding them packets they can't handle. You mentioned that you actually want to deprecate those feature flags — which works for me, but it's kind of orthogonal. If we do that we'd still want to provide generic functions that such drivers can use as their .ndo_features_check() method. And we'd *still* want to do the check based on a simple flag, rather than grubbing around in the packet data. (And the drivers if they *are* asked to do the checksum will sometimes care whether it's TCP vs. UDP too). I don't think we want drivers calling skb_checksum_help() for themselves; we want the pre-filter. Mainly because we *definitely* don't want drivers calling gso_skb_segment() for themselves in the same situation — see the comment I posted on Friday about the r8169 instance of that. ('Re: [PATCH net-next 3/3] r8169: support IPv6'). -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [RFC PATCH] Fix false positives in can_checksum_protocol()
On Mon, 2015-10-05 at 09:23 -0700, Tom Herbert wrote: > > 1) Drivers may advertise NETIF_F_HW_CSUM. The stack will indicate > checksum offload exclusively using the > CHECKSUM_PARTIAL/csum_start/csum_offset interface. No additional > interfaces (bits in skbuff should not be needed) > 2) A driver may inspect packets via ndo_check to decide if it wants to > offload the checksum, if not cancels NETIF_F_HW_CSUM in the packet. > 3) In driver xmit when CHECKSUM_PARTIAL is set the driver MUST > correctly resolve the checksum-- either by properly offloading to the > device or calling skb_checksum_help. In cases where they haven't used .ndo_features_check() to ensure that they don't *see* such packets, sure. But using .ndo_features_check() should probably be the preferred method. > 4) To help drivers for devices with limited offload capabilities we'll > define a helper function to check for typical restrictions (.e.g. IPv4 > only, TCP/UDP only. no encapsulation, no IPv6 extension headers, > etc.). I am working on this helper function and will send RFC shortly. I do suspect that helper function would benefit from seeing TCP/UDP flags in the high bits of csum_offset, rather than grubbing around in the packet for itself to see if it's really TCP/UDP. After all, it's almost free to set those in the first place — at least for locally -generated packets. And not *so* hard to add them in skb_partial_csum_set(). But hey, if you can push an implementation which is grubbing around in the packet then at least *I* don't have to feel dirty for it... :) -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH RFC 0/3] net: Add driver helper function to determine if checksum is offloadable
On Mon, 2015-10-05 at 16:39 -0700, Tom Herbert wrote: > When drivers have support for offloading transport IP checksums, they > will indicate this in the features for the device. As described in > skbuff.h, a driver will usually advertise NETIF_F_HW_CSUM, > NETIF_F_IP_CSUM, and/or NETIF_F_IPV6_CSUM. The first of these > (NETIF_F_HW_CSUM) is the preferred method since this implies that the > device has implemented a generic checksum offload feature that should > work under arbitrary scenarios (e.g. for different protocols, with or > without encapsulation). > > For narrow features support (NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM > for offload of TCP/UDP), the features flags may not be sufficient to > deduce whether a packet may be offloaded. Some devices will not be able > to offload encapsulated checksums, some cannot offload transport > checksums in packet with IPv6 extension headers, etc. In these cases > a driver will need to perform additional packet inspection to determine > if a packet's checksum can be offloaded to a device. > > This patch defines a helper function that drivers can call to check if > it is able to offload the checksum for a particular packet. In an > argument to the function, the driver specifies what type of packets it > is able to offload to a device. The function is intended to check for > the most common restrictions of devices (like by IP version, transport > protocol, encapsulation, extension headers). Since the function includes > checks for IP version and transport protocol, the driver is able > to advertise NETIF_F_HW_CSUM instead of protocol specific support. > This should put us on a path to deprecate NETIF_F_IP_CSUM and > NETIF_F_IPV6_CSUM. > > The helper function may be called from ndo_features_check or the xmit > routine of the driver. The csum_help argument does not need to be set > when the function is called from ndo_features_check. Looks good in general; thanks. I do strongly believe we want to encourage your helper to be called from .ndo_features_check(). Because if you can't do the checksum, then you can't do TSO. And if you can't do TSO, you *really* want your hard_start_xmit() function to be handed one skb at a time and not have to call skb_gso_segment() for itself when it might not have enough room in its descriptor ring for *all* the resulting segments. Also, is your skb_csum_offload_chk() going to do the right thing for SCTP packets? Looks like it doesn't do crc32... And how accurate is the check we have, on the various IP output routines, for rt->dst.dev->features & NETIF_F_xx_CSUM? Could we consider just ditching that check and always using CHECKSUM_PARTIAL, in the knowledge that we check it before it hits a non-capable driver anyway? Or do we benefit from doing the software checksum early, due to improved cache locality when we've probably just copied the data from userspace in many cases? -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] Convert smsc911x to use ACPI as well as DT
On Mon, 2015-10-05 at 17:20 -0700, Charles Garcia-Tobin wrote: > it in ACPI circles > unless we had wider agreement among OSs to use it. AFAIK PRP1 has not > actually been approved yet in the specification forum, and that it in > itself is more of a concern for me,as the code has been pushed upstream. Why would that be a concern? In that context it's just one device ID. Individual devices don't *need* to be approved. OK, the 'PRP' vendor prefix is not officially assigned but that's really a trivial piece of bureaucracy. > I guess it¹s up to Catalin, but disabling for ARM seems like a good idea > right now, another option is to add tests to FWTS. I understand the motivation to avoid embracing a whole bunch of crappy bindings. But I think that eschewing PRP0001 is the wrong technical approach to achieving that. It has false negatives — as soon as you have a *single* existing DT binding, perhaps something as simple as the serial port bindings from the CHRP days, you'll be in a situation where you can't use that. I've *got* hardware where I need to advertise a serial port with a clock-frequency property because it *isn't* compatible with PNP0501. And it has false positives — there's nothing to prevent people from doing ACPI-style bindings with crappy device bindings which also aren't approved. I think it's utterly naïve to believe that simply avoiding the use of PRP0001 + compatible for matching is going to have *any* significant beneficial effect whatsoever. It only makes life harder for all concerned. Perhaps a better approach would be to introduce something like CONFIG_UNAPPROVED_BINDINGS (which can't be set on ARM64), and those drivers which use bindings that *aren't* approved by Catalin's crack team of reviewers need to depend on !UNAPPROVED_BINDINGS. To be honest, I still think even *that* is somewhat naïve, but it's still a better way of implementing what you're actually trying to achieve, however optimistic you have to be to think it'll ever work in practice. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH RFC 0/3] net: Add driver helper function to determine if checksum is offloadable
On Mon, 2015-10-05 at 16:39 -0700, Tom Herbert wrote: > This patch defines a helper function that drivers can call to check > if it is able to offload the checksum for a particular packet. It occurs to me that by itself, this doesn't actually fix the problem. We'd then need to go over all drivers which currently use NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM, and convert them. Would that be the intention? -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
On Mon, 2013-05-20 at 17:27 -0700, Stephen Hemminger wrote: > On Mon, 20 May 2013 23:37:28 +0200 > Francois Romieu wrote: > > > cp_stop_hw includes netdev_reset_queue. > > > > You have imho exhibited a start_xmit after cp_stop_hw race - not sure if > > it happens in cp_tx_timeout or cp_change_mtu. Reverting the analysis above, > > I have not found a place where cp_stop_hw could be called without being > > followed by a cp_clean_rings. The netdev_reset_queue in cp_stop_hw, now > > useless, should thus be removed. > > > > Does it make sense ? > > Your right, you could probably remove it. > > It doesn't solve the problem, still seeing transmit timeouts. > Looks like what happens with DHCP is something else. Did you ever work this out? I'm seeing something similar on the inward -facing interface on my home router under high load — and it doesn't automatically recover. [308309.340644] [ cut here ] [308309.345379] WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0x103/0x190() [308309.352789] Hardware name: Geos [308309.356020] NETDEV WATCHDOG: eth1 (8139cp): transmit queue 0 timed out [308309.362733] Modules linked in: sch_fq_codel sch_teql gpio_keys_polled leds_gpio geodewdt solos_pci ledtrig_heartbeat gpio_cs5535 cs5535_clockevt 8139cp ip6t_REJECT ip6t_rt ip6t_hbh ip6t_mh ip6t_ipv6header ip6t_frag ip6t_eui64 ip6t_ah ip6table_raw ip6table_mangle ip6table_filter ip6_tables nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp xt_HL xt_hl xt_ecn ipt_ECN xt_CLASSIFY xt_time xt_tcpmss xt_statistic xt_mark xt_length xt_DSCP xt_dscp cs5535_mfgpt cs5535_mfd mfd_core ipt_MASQUERADE nf_nat xt_recent xt_helper xt_connmark xt_connbytes pptp l2tp_ppp pppoe xt_conntrack xt_CT iptable_raw xt_state nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack pppox pppoatm ipt_REJECT xt_TCPMSS xt_comment xt_multiport xt_mac xt_limit iptable_mangle iptable_filter ip_tables xt_tcpudp x_tables nsc_gpio ip_gre gre sit l2tp_netlink l2tp_core ppp_mppe tunnel4 tun ppp_async ppp_generic slhc br2684 atm crc_ccitt ipv6 input_polldev msr input_core sha1_generic geode_aes ecb arc4 aes_i586 ohci_hcd ehci_hcd usbcore usb_common [308309.457239] Pid: 0, comm: swapper Not tainted 3.7.1 #1 [308309.462463] Call Trace: [308309.465020] [] ? warn_slowpath_common+0x87/0xb0 [308309.470691] [] ? dev_watchdog+0x103/0x190 [308309.475755] [] ? warn_slowpath_fmt+0x33/0x40 [308309.481159] [] ? dev_watchdog+0x103/0x190 [308309.486244] [] ? pfifo_fast_dequeue+0xd0/0xd0 [308309.491751] [] ? call_timer_fn.isra.42+0x1c/0x80 [308309.497422] [] ? process_backlog+0x54/0xe0 [308309.502674] [] ? run_timer_softirq+0x12a/0x160 [308309.508169] [] ? pfifo_fast_dequeue+0xd0/0xd0 [308309.513697] [] ? __do_softirq+0x6d/0x110 [308309.518675] [] ? __tasklet_schedule+0x40/0x40 [308309.524178][] ? irq_exit+0x31/0x60 [308309.529359] [] ? do_IRQ+0x8d/0xb0 [308309.533723] [] ? do_IRQ+0x8d/0xb0 [308309.538201] [] ? common_interrupt+0x29/0x2e [308309.543440] [] ? rt_mutex_adjust_prio_chain+0x180/0x280 [308309.549829] [] ? default_idle+0x14/0x30 [308309.554719] [] ? cpu_idle+0x2f/0x50 [308309.559259] [] ? start_kernel+0x286/0x28b [308309.564414] [] ? repair_env_string+0x4d/0x4d [308309.569729] ---[ end trace 2e18cc211cee6089 ]--- [308309.574551] 8139cp :00:0b.0 eth1: Transmit timeout, status c 2b0 80ff -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
IPv6 routing/fragmentation panic
kb, 16); SKB_CB(skb)->dma_addr = pci_map_single(card->dev, skb->data, RX_DMA_SIZE, PCI_DMA_FROMDEVICE); Now, I probably should have done this a long time ago, because that lack of headroom probably meant that the machine was always having to reallocate buffers just to fit the Ethernet header on the front of them when routing incoming packets. So I might be happy enough with submitting a variant of the above patch and calling it a performance improvement. But should the kernel *panic* without it? If there are requirements on the headroom I must leave on received packets, where are they documented? Or is this a bug in the IPv6 fragmentation code, to make such assumptions? I'm not entirely sure how to interpret the above stack trace. Is the incoming IPv6 packet being reassembled for netfilter's benefit, then re -fragmented for transmission? -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH RFC] solos-pci: Fix BUG() with shared skb
On Wed, 2013-09-04 at 21:41 +0100, David Woodhouse wrote: > On Wed, 2013-09-04 at 14:30 -0400, David Miller wrote: > > skb_realloc_headroom() should do everything you need. > > Great, thanks! Something like this then... ? > > diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c > index 32784d1..4492c0f 100644 > --- a/drivers/atm/solos-pci.c > +++ b/drivers/atm/solos-pci.c > @@ -1145,19 +1145,19 @@ static int psend(struct atm_vcc *vcc, struct sk_buff > *skb) > >> > return 0; > >> } > > ->> if (!skb_clone_writable(skb, sizeof(*header))) { > ->> > int expand_by = 0; > ->> > int ret; > - > ->> > if (skb_headroom(skb) < sizeof(*header)) > ->> > > expand_by = sizeof(*header) - skb_headroom(skb); > - > ->> > ret = pskb_expand_head(skb, expand_by, 0, GFP_ATOMIC); > ->> > if (ret) { > ->> > > dev_warn(&card->dev->dev, "pskb_expand_head > failed.\n"); > ->> > > solos_pop(vcc, skb); > ->> > > return ret; > ->> > } > +>> if (skb_headroom(skb) < sizeof(*header)) { > +>> > struct sk_buff *nskb; > + > +>> > nskb = skb_realloc_headroom(skb, sizeof(*header)); > +>> > if (!nskb) { > +>> > > solos_pop(vcc, skb); > +>> > > return -ENOMEM; > +>> > } > +>> > if (skb->truesize != nskb->truesize) > +>> > > atm_force_charge(vcc, nskb->truesize - skb->truesize); > + > +>> > dev_kfree_skb_any(skb); > +>> > skb = nskb; > >> } > > >> header = (void *)skb_push(skb, sizeof(*header)); Simon, did you ever test this? Can you still (tell me how to) reproduce the original problem? I think that sending on br2684 was necessary but not sufficient...? -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: IPv6 routing/fragmentation panic
On Wed, 2015-09-16 at 01:48 +0200, Florian Westphal wrote: > > What I don't understand is why you see this with fragmented ipv6 > packets only (and not with all ipv6 forwarded skbs). > > Something like this copy-pastry from ip_finish_output2 should fix it: That works; thanks. Tested-by: David Woodhouse A little extra debugging output shows that the offending fragments were arriving here with skb_headroom(skb)==10. Which is reasonable, being the Solos ADSL card's header of 8 bytes followed by 2 bytes of PPP frame type. The non-fragmented packets, on the other hand, are arriving with a headroom of 42 bytes. Could something else already have reallocated them before they get that far? (Do we have any way to gather statistics on such reallocations? It seems that might be useful for performance investigation.) Johannes and I were talking on IRC yesterday about trying to make this kind of thing easier to reproduce without odd hardware. We postulated a skb_torture() function which, when an appropriate debugging option was enabled, would randomly screw around with the skb in various interesting ways — shifting the data down so that there's no headroom, deliberately making it *non-linear*, temporarily cloning it and freeing the clone a couple of seconds later, etc. Then we could insert calls to skb_torture() in interesting places like netif_rx(), ip6_finish_output2() and anywhere else that seems appropriate (perhaps with flags to indicate *what* kind of torture is permissible in certain locations). And see what breaks... -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH] solos-pci: Increase headroom on received packets
A comment in include/linux/skbuff.h says that: * Various parts of the networking layer expect at least 32 bytes of * headroom, you should not reduce this. This was demonstrated by a panic when handling fragmented IPv6 packets: http://marc.info/?l=linux-netdev&m=144236093519172&w=2 It's not entirely clear if that comment is still valid — and if it is, perhaps netif_rx() ought to be enforcing it with a warning. But either way, it is rather stupid from a performance point of view for us to be receiving packets into a buffer which doesn't have enough room to prepend an Ethernet header — it means that *every* incoming packet is going to be need to be reallocated. So let's fix that. Signed-off-by: David Woodhouse --- Tested in the DMA code path; I don't believe the DMA-capable devices can still be used in MMIO mode. Simon, Guy, would you be able to test the MMIO version? diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c index 74e18b0..be8225e 100644 --- a/drivers/atm/solos-pci.c +++ b/drivers/atm/solos-pci.c @@ -805,13 +805,13 @@ static void solos_bh(unsigned long card_arg) continue; } - skb = alloc_skb(size + 1, GFP_ATOMIC); + skb = alloc_skb(size + NET_SKB_PAD + 1, GFP_ATOMIC); if (!skb) { if (net_ratelimit()) dev_warn(&card->dev->dev, "Failed to allocate sk_buff for RX\n"); continue; } - + skb_reserve(skb, NET_SKB_PAD); memcpy_fromio(skb_put(skb, size), RX_BUF(card, port) + sizeof(*header), size); @@ -869,8 +869,10 @@ static void solos_bh(unsigned long card_arg) /* Allocate RX skbs for any ports which need them */ if (card->using_dma && card->atmdev[port] && !card->rx_skb[port]) { - struct sk_buff *skb = alloc_skb(RX_DMA_SIZE, GFP_ATOMIC); + struct sk_buff *skb = alloc_skb(RX_DMA_SIZE + NET_SKB_PAD, + GFP_ATOMIC); if (skb) { + skb_reserve(skb, NET_SKB_PAD); SKB_CB(skb)->dma_addr = dma_map_single(&card->dev->dev, skb->data, RX_DMA_SIZE, DMA_FROM_DEVICE); -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH] solos-pci: Increase headroom on received packets
On Wed, 2015-09-16 at 03:53 -0700, Eric Dumazet wrote: > You should use netdev_alloc_skb() : This helper is better for rx skbs, > as it allows for better packing of frames in GRO or TCP stack. OK, thanks. I don't have a netdev (this is an ATM device) but I can use dev_alloc_skb(). > Also netdev_alloc_skb_ip_align() might handle the NET_IP_ALIGN stuff > for arches that care. I'd briefly considered NET_IP_ALIGN but decided against it because this isn't Ethernet and my hardware header is a nice sane 8 bytes, not 14. But actually, the primary use cases for this are PPPoATM — with 2 bytes of PPP frame type, and PPPoE over BR2684 — with 14 bytes of Ethernet header. So NET_IP_ALIGN would actually make sense. Unfortunately the FPGA can't do DMA to unaligned addresses, so I can't do it in the DMA case. I can do it for the MMIO code path though (which I still haven't tested). I'll send a new patch in a moment... -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
[PATCH v2] solos-pci: Increase headroom on received packets
A comment in include/linux/skbuff.h says that: * Various parts of the networking layer expect at least 32 bytes of * headroom, you should not reduce this. This was demonstrated by a panic when handling fragmented IPv6 packets: http://marc.info/?l=linux-netdev&m=144236093519172&w=2 It's not entirely clear if that comment is still valid — and if it is, perhaps netif_rx() ought to be enforcing it with a warning. But either way, it is rather stupid from a performance point of view for us to be receiving packets into a buffer which doesn't have enough room to prepend an Ethernet header — it means that *every* incoming packet is going to be need to be reallocated. So let's fix that. Signed-off-by: David Woodhouse --- diff --git a/drivers/atm/solos-pci.c b/drivers/atm/solos-pci.c index 74e18b0..3d7fb65 100644 --- a/drivers/atm/solos-pci.c +++ b/drivers/atm/solos-pci.c @@ -805,7 +805,12 @@ static void solos_bh(unsigned long card_arg) continue; } - skb = alloc_skb(size + 1, GFP_ATOMIC); + /* Use netdev_alloc_skb() because it adds NET_SKB_PAD of +* headroom, and ensures we can route packets back out an +* Ethernet interface (for example) without having to +* reallocate. Adding NET_IP_ALIGN also ensures that both +* PPPoATM and PPPoEoBR2684 packets end up aligned. */ + skb = netdev_alloc_skb_ip_align(NULL, size + 1); if (!skb) { if (net_ratelimit()) dev_warn(&card->dev->dev, "Failed to allocate sk_buff for RX\n"); @@ -869,7 +874,10 @@ static void solos_bh(unsigned long card_arg) /* Allocate RX skbs for any ports which need them */ if (card->using_dma && card->atmdev[port] && !card->rx_skb[port]) { - struct sk_buff *skb = alloc_skb(RX_DMA_SIZE, GFP_ATOMIC); + /* Unlike the MMIO case (qv) we can't add NET_IP_ALIGN +* here; the FPGA can only DMA to addresses which are +* aligned to 4 bytes. */ + struct sk_buff *skb = dev_alloc_skb(RX_DMA_SIZE); if (skb) { SKB_CB(skb)->dma_addr = dma_map_single(&card->dev->dev, skb->data, -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: IPv6 routing/fragmentation panic
On Wed, 2015-09-16 at 15:27 +0200, Florian Westphal wrote: > > David, could you test this? I'd do an official patch submission > then. Compiles. Doesn't fix the problem. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: IPv6 routing/fragmentation panic
On Wed, 2015-09-16 at 15:27 +0200, Florian Westphal wrote: > @@ -599,7 +600,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff > *skb, > /* Correct geometry. */ > if (frag->len > mtu || > ((frag->len & 7) && frag->next) || > - skb_headroom(frag) < hlen) > + skb_headroom(frag) < (hlen + hroom)) > goto slow_path_clean; > > /* Partially cloned skb? */ My test is 'ping -s 2000', and I end up with a fragment of 1280 bytes followed by a fragment of 776 bytes. The test cited above is only actually running on the latter fragment (which for some reason is fine and has headroom of 58 bytes). The first, larger, fragment isn't being checked. And that's the one with only 10 bytes of headroom. [ 62.027984] has frag list [ 62.030616] line 604 check frag ddc5fcc0 len 776 headroom 58 (hlen 40 hroom 16) [ 62.036720] line 678 send skb ded050c0 len 1280 headroom 10 [ 62.041096] skbuff: skb_under_panic: text:c125f9ca len:1294 put:14 head:dec89 000 data:dec88ffc tail:0xdec8950a end:0xdec89f50 dev:br-lan -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net] ipv6: ip6_fragment: fix headroom tests and skb leak
On Wed, 2015-09-16 at 17:26 +0200, Florian Westphal wrote: > I tested this e1000 driver hacked to not allocate additional headroom > (we end up in slowpath, since LL_RESERVED_SPACE is 16). And it works on the originally-offending setup too; thanks. Tested-by: David Woodhouse > Reported-by: David Woodhouse > Diagnosed-by: David Woodhouse They generally prefer me to use @intel.com for those too, if you would. I draw the line at using it for actual email communication though :) -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
_spin_unlock_irqrestore+0xa/0x10 [ 260.116369] EAX: EBX: 00200292 ECX: e04d8100 EDX: 00200292 [ 260.116369] ESI: EDI: ff32 EBP: 0258 ESP: dec0bf90 [ 260.116369] DS: 007b ES: 007b FS: GS: SS: 0068 [ 260.116369] CR0: 8005003b CR2: 098c4c50 CR3: 1ef88000 CR4: 0090 [ 260.116369] Stack: [ 260.116369] dee95000 c1272ecb dee95000 dee95240 0258 8100 c1272d10 dee95000 [ 260.116369] 0100 c105b19e 0410 c1272d10 c105b33d c146d524 0001 [ 260.116369] 0002 c1033777 0002 e577 c13f2000 000a 0020 c13f1f70 [ 260.116369] Call Trace: [ 260.116369] [] ? dev_watchdog+0x1bb/0x200 [ 260.116369] [] ? qdisc_rcu_free+0x30/0x30 [ 260.116369] [] ? call_timer_fn.isra.7+0xe/0x60 [ 260.116369] [] ? qdisc_rcu_free+0x30/0x30 [ 260.116369] [] ? run_timer_softirq+0xfd/0x1b0 [ 260.116369] [] ? __do_softirq+0xa7/0x190 [ 260.116369] [] ? __hrtimer_tasklet_trampoline+0x20/0x20 [ 260.116369] [] ? do_softirq_own_stack+0x1b/0x20 [ 260.116369] [ 260.116369] [] ? do_IRQ+0x35/0xa0 [ 260.116369] [] ? common_interrupt+0x29/0x30 [ 260.116369] [] ? put_unbound_pool+0x17b/0x1a0 [ 260.116369] [] ? default_idle+0x2/0x10 [ 260.116369] [] ? arch_cpu_idle+0x6/0x10 [ 260.116369] [] ? cpu_startup_entry+0xf5/0x190 [ 260.116369] [] ? start_kernel+0x2e5/0x2e8 [ 260.116369] Code: 00 b8 01 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 e8 db 2e d3 ff c3 8d 76 00 8d bc 27 00 00 00 00 53 89 d3 e8 c8 2e d3 ff 53 9d <5b> c3 8d 74 26 00 e8 bb 2e d3 ff fb c3 89 f6 8d bc 27 00 00 00 At this point even sysrq on the serial console isn't working; I'm going to have to go and physically reset it. (gdb) list *dev_watchdog+0x1bb 0xc1272ecb is in dev_watchdog (net/sched/sch_generic.c:304). 299 } 300 301 if (some_queue_timedout) { 302 WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n", 303dev->name, netdev_drivername(dev), i); 304 dev->netdev_ops ->ndo_tx_timeout(dev); 305 } 306 if (!mod_timer(&dev->watchdog_timer, 307round_jiffies(jiffies + 308 dev ->watchdog_timeo))) So that _raw_spin_unlock_irqrestore() is going to be a tailcall from the end of cp_tx_timeout(). Moderately confused that it dies on *unlock*. _raw_spin_lock_irqrestore+0xa is the instruction after a 'popf', which makes me wonder if it dies in an IRQ storm. Although then *some* of the NMI watchdog invocations would surely show an IRQ on the stack, but they don't; they're all right on the same instruction. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
at pppoe nf_nat_i... [ 292.130412] CPU: 0 PID: 0 Comm: swapper Tainted: GW 4.2.0-gx+ #26 [ 292.130412] task: c13f7540 ti: c13f task.ti: c13f [ 292.130412] EIP: 0060:[] EFLAGS: 00200286 CPU: 0 [ 292.130412] EIP is at _raw_spin_unlock_irqrestore+0xa/0x10 [ 292.130412] EAX: EBX: 00200286 ECX: c1409ee4 EDX: 00200286 [ 292.130412] ESI: 00200286 EDI: ddc53428 EBP: ddc53420 ESP: dec0bf68 [ 292.130412] DS: 007b ES: 007b FS: GS: SS: 0068 [ 292.130412] CR0: 8005003b CR2: 0805244c CR3: 1edf6000 CR4: 0090 [ 292.130412] Stack: [ 292.130412] ddc53000 e06f8055 e06f9e2e 00200286 e06f9e18 e06f9f0c 00200286 ddc53000 [ 292.130412] fde1 0258 c1272ecb ddc53000 ddc53240 0258 8100 [ 292.130412] c1272d10 ddc53000 0100 c105b19e 0442 c1272d10 c105b33d ddc7c444 [ 292.130412] Call Trace: [ 292.130412] [] ? cp_tx_timeout+0x1a5/0x1c0 [8139cp] [ 292.130412] [] ? dev_watchdog+0x1bb/0x200 [ 292.130412] [] ? qdisc_rcu_free+0x30/0x30 [ 292.130412] [] ? call_timer_fn.isra.7+0xe/0x60 [ 292.130412] [] ? qdisc_rcu_free+0x30/0x30 [ 292.130412] [] ? run_timer_softirq+0xfd/0x1b0 [ 292.130412] [] ? __do_softirq+0xa7/0x190 [ 292.130412] [] ? __hrtimer_tasklet_trampoline+0x20/0x20 [ 292.130412] [] ? do_softirq_own_stack+0x1b/0x20 [ 292.130412] [ 292.130412] [] ? do_IRQ+0x35/0xa0 [ 292.130412] [] ? common_interrupt+0x29/0x30 [ 292.130412] [] ? put_unbound_pool+0x17b/0x1a0 [ 292.130412] [] ? default_idle+0x2/0x10 [ 292.130412] [] ? arch_cpu_idle+0x6/0x10 [ 292.130412] [] ? cpu_startup_entry+0xf5/0x190 [ 292.130412] [] ? start_kernel+0x2e5/0x2e8 [ 292.130412] Code: 00 b8 01 00 00 00 c3 8d 76 00 8d bc 27 00 00 00 00 e8 db 2e d3 ff c3 8d 76 00 8d bc 27 00 00 00 00 53 89 d3 e8 c8 2e d3 ff 53 9d <5b> c3 8d 74 26 00 e8 bb 2e d3 ff fb c3 89 f6 8d bc 27 00 00 00 [ 292.130412] Kernel panic - not syncing: softlockup: hung tasks [ 292.130412] CPU: 0 PID: 0 Comm: swapper Tainted: GWL 4.2.0-gx+ #26 [ 292.130412] c1319a5f c13f7540 00f9 c1071439 c1399de4 [ 292.130412] dec0bf2c c140b660 c10712f0 0001 c140aac0 c105bf8e 03bbae55 [ 292.130412] 0044 c140aacc 03bbae55 0044 0003 c141b860 c141b801 [ 292.130412] Call Trace: [ 292.130412] [] ? panic+0x76/0x161 [ 292.130412] [] ? watchdog_timer_fn+0x149/0x150 [ 292.130412] [] ? watchdog_cleanup+0x10/0x10 [ 292.130412] [] ? __hrtimer_run_queues.constprop.7+0xae/0x180 [ 292.130412] [] ? hrtimer_interrupt+0x87/0x1d0 [ 292.130412] [] ? tick_handle_oneshot_broadcast+0xcf/0x130 [ 292.130412] [] ? timer_interrupt+0xa/0x10 [ 292.130412] [] ? handle_irq_event_percpu+0x4f/0xf0 [ 292.130412] [] ? handle_irq_event+0x2d/0x50 [ 292.130412] [] ? handle_level_irq+0x69/0xf0 [ 292.130412] [] ? handle_simple_irq+0x80/0x80 [ 292.130412] [] ? handle_irq+0x43/0x70 [ 292.130412][] ? do_IRQ+0x2c/0xa0 [ 292.130412] [] ? common_interrupt+0x29/0x30 [ 292.130412] [] ? remove_waiter+0x11b/0x120 [ 292.130412] [] ? _raw_spin_unlock_irqrestore+0xa/0x10 [ 292.130412] [] ? cp_tx_timeout+0x1a5/0x1c0 [8139cp] -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
On Thu, 2015-09-17 at 12:36 +0100, David Woodhouse wrote: > > Thanks; I'll try that. In fact since updating to 4.2 the problem has > got worse — now the whole machine dies: There is something very strange going on here. I've found two ways to make it stop crashing when cp_tx_timeout() hits the 'popf' when unlocking the spinlock. The first is to comment out the whole of cp_tx_timeout() and let it happen once. Then put that code *back* again and reload the module. Then it can work fine. The second way is to comment out the WARN_ONCE in dev_watchdog(). I remain utterly bemused; I have no idea what's going on there. But that aside, even when it survives running cp_tx_timeout(), it still doesn't *work* — it looks like TX is indeed working and has recovered, but we are not *receiving* any packets. I can't actually trigger the TX timeout at all with debugging enabled; I've hacked things so that cp_set_wol() will also call cp_tx_timeout() and simulate it. And now I see this... [ 4358.499474] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.499488] 8139cp :00:0b.0 eth1: tx done, slot 35 [ 4358.513663] 8139cp :00:0b.0 eth1: tx queued, slot 37, skblen 54 [ 4358.513692] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.513705] 8139cp :00:0b.0 eth1: tx done, slot 36 [ 4358.518880] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 4358.518900] 8139cp :00:0b.0 eth1: rx slot 1 status 0x32014040 len 60 [ 4358.523601] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 4358.526910] 8139cp :00:0b.0 eth1: rx slot 2 status 0x32036052 len 78 [ 4358.547898] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 4358.547996] 8139cp :00:0b.0 eth1: rx slot 3 status 0x32036052 len 78 [ 4358.580526] 8139cp :00:0b.0 eth1: tx queued, slot 38, skblen 70 [ 4358.580555] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.580569] 8139cp :00:0b.0 eth1: tx done, slot 37 [ 4358.601912] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 4358.601932] 8139cp :00:0b.0 eth1: rx slot 4 status 0x32036052 len 78 [ 4358.650678] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 4358.650698] 8139cp :00:0b.0 eth1: rx slot 5 status 0x320145a5 len 1441 [ 4358.665572] will lock... [ 4358.668222] Handling tx timeout, flags 282 [ 4358.672494] nway_reset [ 4358.674858] Will wake queue... [ 4358.677919] Will unlock... flags 282 [ 4358.681525] did unlock... [ 4358.684198] 8139cp :00:0b.0 eth1: Transmit timeout handled, status c 2b0 80ff [ 4358.708234] 8139cp :00:0b.0 eth1: tx queued, slot 1, skblen 92 [ 4358.714567] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.722405] 8139cp :00:0b.0 eth1: tx done, slot 0 [ 4358.747412] 8139cp :00:0b.0 eth1: tx queued, slot 2, skblen 106 [ 4358.753736] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.756824] 8139cp :00:0b.0 eth1: tx done, slot 1 [ 4358.814961] 8139cp :00:0b.0 eth1: tx queued, slot 3, skblen 173 [ 4358.821291] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.824186] 8139cp :00:0b.0 eth1: tx done, slot 2 [ 4358.834352] 8139cp :00:0b.0 eth1: tx queued, slot 4, skblen 86 [ 4358.840579] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.844216] 8139cp :00:0b.0 eth1: tx done, slot 3 [ 4358.853615] 8139cp :00:0b.0 eth1: tx queued, slot 5, skblen 54 [ 4358.859822] 8139cp :00:0b.0 eth1: intr, status 0484 enable 80ff cmd 0c cpcmd 002b [ 4358.863497] 8139cp :00:0b.0 eth1: tx done, slot 4 [ 4358.873111] 8139cp :00:0b.0 eth1: tx queued, slot 6, skblen 66 -- -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
On Thu, 2015-09-17 at 22:44 +0200, Francois Romieu wrote: > David Woodhouse : > > On Thu, 2015-09-17 at 12:36 +0100, David Woodhouse wrote: > > > > > > Thanks; I'll try that. In fact since updating to 4.2 the problem has > > > got worse — now the whole machine dies: > > > > There is something very strange going on here. I've found two ways to > > make it stop crashing when cp_tx_timeout() hits the 'popf' when > > unlocking the spinlock. > > cp_tx_timeout takes lock, disables irq, calls cp_clean_rings, thus > plain dev_kfree_skb if a skb is still referenced in one of the > rx/tx ring. You may replace it with dev_kfree_skb_any. Well spotted; I've made that change locally. Although I don't think it explains the symptoms. Not that I'm sure what *could*. I've also found that adding a call to __cp_set_rx_mode() seems to fix the RX after reset, in some tests. Especially the simulated one via the hack in cp_set_wol(). I think that's necessary, if not sufficient — at least on real hardware. I didn't see the problem at all when running in qemu. Sometimes, though, it still dies in an interrupt storm after re -enabling IRQs: [ 900.004214] 8139cp :00:0b.0 eth1: Transmit timeout, status c 2b0 80ff [ 900.011725] will lock... [ 900.014273] Handling tx timeout, flags 200296 [ 900.018774] Will wake queue... [ 900.021645] Will unlock... flags 200296 [ 900.021645] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 900.021645] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b ... [ 901.628439] 8139cp :00:0b.0 eth1: intr, status 0001 enable 80ff cmd 0c cpcmd 002b [ 901.636291] 8139cp :00:0b.0 eth1: intr, status 0011 enable 80ff cmd 0c cpcmd 002b ... [ 901.966243] 8139cp :00:0b.0 eth1: intr, status 0011 enable 80ff cmd 0c cpcmd 002b [ 901.968353] 8139cp :00:0b.0 eth1: intr, status 0051 enable 80ff cmd 0c cpcmd 002b ... forever... And of course, even if I fix the TX timeout handling, I'd still like to know why it's happening in the first place... -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
[PATCH 1/2] 8139cp: Use dev_kfree_skb_any() instead of dev_kfree_skb() in cp_clean_rings()
This can be called from cp_tx_timeout() with interrupts disabled. Spotted by Francois Romieu Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index d79e33b..52a5334 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -1151,7 +1151,7 @@ static void cp_clean_rings (struct cp_private *cp) desc = cp->rx_ring + i; dma_unmap_single(&cp->pdev->dev,le64_to_cpu(desc->addr), cp->rx_buf_sz, PCI_DMA_FROMDEVICE); - dev_kfree_skb(cp->rx_skb[i]); + dev_kfree_skb_any(cp->rx_skb[i]); } } @@ -1164,7 +1164,7 @@ static void cp_clean_rings (struct cp_private *cp) le32_to_cpu(desc->opts1) & 0x, PCI_DMA_TODEVICE); if (le32_to_cpu(desc->opts1) & LastFrag) - dev_kfree_skb(skb); + dev_kfree_skb_any(skb); cp->dev->stats.tx_dropped++; } } -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 2/2] 8139cp: Call __cp_set_rx_mode() from cp_tx_timeout()
Unless we reset the RX config, on real hardware I don't seem to receive any packets after a TX timeout. Signed-off-by: David Woodhouse --- Now it does actually recover from the TX timeout, lots of the time. Sometimes it still hits that IRQ storm, which probably explains the apparent lockup right after the 'popf'... although I thought we handled it more gracefully than that these days. That's probably a race with the RX handling code, or something. I'll try harder to reproduce the TX timeout with the debugging enabled. Which might shed some light on this, and also on the reason why it happens in the first place. If we're lucky. drivers/net/ethernet/realtek/8139cp.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 52a5334..ba3dab7 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -1261,6 +1261,7 @@ static void cp_tx_timeout(struct net_device *dev) cp_clean_rings(cp); rc = cp_init_rings(cp); cp_start_hw(cp); + __cp_set_rx_mode(dev); cp_enable_irq(cp); netif_wake_queue(dev); -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
On Fri, 2015-09-18 at 01:44 +0200, Francois Romieu wrote: > The TxDmaOkLowDesc register may tell if the Tx dma part is still > making any progress. I have added a TxPoll request. See below. I've just added that into the original TX timeout handler, since that doesn't seem to be crashing the box for me as long as I avoid the IRQ storm. Not sure what we learn from it ('desc 6550' printed as hex)... I've also made it dump the TX descriptor ring (skb, addr, opts1, opts2): [ 1733.027156] 8139cp :00:0b.0 eth1: Transmit timeout, status c 2b head 25 tail 22 desc 65500 80ff [ 1733.036819] TX ring 00: (null) 1dd68d8c 3000cb67 0 [ 1733.037040] TX ring 01: (null) 1dd8774c 3000cb67 0 [ 1733.037040] TX ring 02: (null) 1dd86e7c 3a08 0 [ 1733.037040] TX ring 03: (null) 1dd865ac 3a08 0 [ 1733.037040] TX ring 04: (null) 1dd8540c 3a08 0 [ 1733.037040] TX ring 05: (null) 1dd85cdc 3a08 0 [ 1733.037040] TX ring 06: (null) 1dd84b3c 3a08 0 [ 1733.037040] TX ring 07: (null) 1dd8399c 3a08 0 [ 1733.037040] TX ring 08: (null) 1dd8426c 3a08 0 [ 1733.037040] TX ring 09: (null) 1dd830cc 3a08 0 [ 1733.037040] TX ring 10: (null) 1dd81f2c 3a08 0 [ 1733.037040] TX ring 11: (null) 1dd8165c 3a08 0 [ 1733.037040] TX ring 12: (null) 1dd827fc 3a08 0 [ 1733.037040] TX ring 13: (null) 1dd80d8c 3a08 0 [ 1733.037040] TX ring 14: (null) 1dd804bc 3a08 0 [ 1733.037040] TX ring 15: (null) 1dd6ee7c 3a08 0 [ 1733.037040] TX ring 16: (null) 1dd6f74c 3a08 0 [ 1733.037040] TX ring 17: (null) 1dd6e5ac 3a08 0 [ 1733.037040] TX ring 18: (null) 1dd6dcdc 3a08 0 [ 1733.037040] TX ring 19: (null) 1dd6cb3c 3000f804 0 [ 1733.037040] TX ring 20: (null) 1dd8de02 300083da 0 [ 1733.037040] TX ring 21: (null) 1dd8da02 3000 0 [ 1733.037040] TX ring 22: defcc240 1dd6d40c b5ea 0 [ 1733.037040] TX ring 23: decdb900 1dd6b99c b046 0 [ 1733.037040] TX ring 24: ddd27c00 1dd6b0cc b5ea 0 [ 1733.037040] TX ring 25: (null) 1dd6e5ac 3000ca24 0 [ 1733.037040] TX ring 26: (null) 1dd6dcdc 3000ca24 0 [ 1733.037040] TX ring 27: (null) 1dd6d40c 3000ca24 0 [ 1733.037040] TX ring 28: (null) 1dd6cb3c 3000ca24 0 [ 1733.037040] TX ring 29: (null) 1dd6c26c 3000ca24 0 [ 1733.037040] TX ring 30: (null) 1dd6b99c 3000ca24 0 [ 1733.037040] TX ring 31: (null) 1dd6b0cc 3000ca24 0 [ 1733.037040] TX ring 32: (null) 1dd6a7fc 3000ca24 0 [ 1733.037040] TX ring 33: (null) 1dd69f2c 3000ca24 0 [ 1733.037040] TX ring 34: (null) 1dd68d8c 3000ca24 0 [ 1733.037040] TX ring 35: (null) 1dd6965c 3000cb67 0 [ 1733.037040] TX ring 36: (null) 1dd684bc 3000cb67 0 [ 1733.037040] TX ring 37: (null) 1dd86e7c 3000cb67 0 [ 1733.037040] TX ring 38: (null) 1dd8774c 3000cb67 0 [ 1733.037040] TX ring 39: (null) 1dd865ac 3000cb67 0 [ 1733.037040] TX ring 40: (null) 1dd85cdc 3000cb67 0 [ 1733.037040] TX ring 41: (null) 1dd8540c 3000cb67 0 [ 1733.037040] TX ring 42: (null) 1dd84b3c 3000cb67 0 [ 1733.037040] TX ring 43: (null) 1dd8426c 3000cb67 0 [ 1733.037040] TX ring 44: (null) 1dd8399c 3000cb67 0 [ 1733.037040] TX ring 45: (null) 1dd830cc 3000cb67 0 [ 1733.037040] TX ring 46: (null) 1dd81f2c 3000cb67 0 [ 1733.037040] TX ring 47: (null) 1dd827fc 3000cb67 0 [ 1733.037040] TX ring 48: (null) 1dd8165c 3000cb67 0 [ 1733.037040] TX ring 49: (null) 1dd80d8c 3000cb67 0 [ 1733.037040] TX ring 50: (null) 1dd804bc 3000cb67 0 [ 1733.037040] TX ring 51: (null) 1dd6f74c 3000cb67 0 [ 1733.037040] TX ring 52: (null) 1dd6ee7c 3000cb67 0 [ 1733.037040] TX ring 53: (null) 1dd6dcdc 3000cb67 0 [ 1733.037040] TX ring 54: (null) 1
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
On Fri, 2015-09-18 at 02:04 +0100, David Woodhouse wrote: > On Fri, 2015-09-18 at 01:44 +0200, Francois Romieu wrote: > > The TxDmaOkLowDesc register may tell if the Tx dma part is still > > making any progress. I have added a TxPoll request. See below. > > I've just added that into the original TX timeout handler, since that > doesn't seem to be crashing the box for me as long as I avoid the IRQ > storm. > > Not sure what we learn from it ('desc 6550' printed as hex)... I've > also made it dump the TX descriptor ring (skb, addr, opts1, opts2): The TxDmaOkLowDesc values look sane to me; they always match the low bits of the last descriptor that *should* have been sent, according to the ring dumps. I made it store and dump the original contents of the TX descriptor ring too (before it gets overwritten by the hardware). So we can see what *was* being transmitted. It isn't any more enlightening. I also tried just prodding the hardware by writing to the TxPoll register. That wasn't sufficient to restart it. [26589.024750] 8139cp :00:0b.0 eth1: Transmit timeout, status c 2b head 52 tail 45 desc 96c00 80ff [26589.034632] TX ring 00: (null) 1de16e7c 30003130 0 (b5ea 0) [26589.034632] TX ring 01: (null) 1de165ac 30003130 0 (b5ea 0) [26589.034632] TX ring 02: (null) 1de1540c 30003130 0 (b5ea 0) [26589.034632] TX ring 03: (null) 1de15cdc 30003130 0 (b5ea 0) [26589.034632] TX ring 04: (null) 1de14b3c 30003130 0 (b5ea 0) [26589.034632] TX ring 05: (null) 1de1399c 30003130 0 (b5ea 0) [26589.034632] TX ring 06: (null) 1de130cc 30003130 0 (b5ea 0) [26589.034632] TX ring 07: (null) 1de1426c 30003130 0 (b5ea 0) [26589.034632] TX ring 08: (null) 1de127fc 30003130 0 (b5ea 0) [26589.034632] TX ring 09: (null) 1de11f2c 30003130 0 (b5ea 0) [26589.034632] TX ring 10: (null) 1de10d8c 30003130 0 (b056 0) [26589.034632] TX ring 11: (null) 1de1165c 30003130 0 (b5ea 0) [26589.034632] TX ring 12: (null) 1df4774c 30003130 0 (b5ea 0) [26589.034632] TX ring 13: (null) 1de104bc 30003130 0 (b5ea 0) [26589.034632] TX ring 14: (null) 1df46e7c 30003130 0 (b5ea 0) [26589.034632] TX ring 15: (null) 1df465ac 30003130 0 (b5ea 0) [26589.034632] TX ring 16: (null) 1df45cdc 30003130 0 (b5ea 0) [26589.034632] TX ring 17: (null) 1df44b3c 30003130 0 (b056 0) [26589.034632] TX ring 18: (null) 1df4426c 30003130 0 (b5ea 0) [26589.034632] TX ring 19: (null) 1df4540c 30003130 0 (b5ea 0) [26589.034632] TX ring 20: (null) 1df4399c 30004954 0 (b5ea 0) [26589.034632] TX ring 21: (null) 1df430cc 30004954 0 (b5ea 0) [26589.034632] TX ring 22: (null) 1df427fc 30004954 0 (b5ea 0) [26589.034632] TX ring 23: (null) 1df4165c 30004954 0 (b5ea 0) [26589.034632] TX ring 24: (null) 1df41f2c 30004954 0 (b5ea 0) [26589.034632] TX ring 25: (null) 1df40d8c 30004954 0 (b5ea 0) [26589.034632] TX ring 26: (null) 1dcc9602 3000796b 0 (b06a 0) [26589.034632] TX ring 27: (null) 1de1774c 3000 0 (b097 0) [26589.034632] TX ring 28: (null) 1de16e7c 3000 0 (b467 0) [26589.034632] TX ring 29: (null) 1df404bc 3000 0 (b557 0) [26589.034632] TX ring 30: (null) 1de165ac 3000 0 (b0e7 0) [26589.034632] TX ring 31: (null) 1de1540c 3000 0 (b0d7 0) [26589.034632] TX ring 32: (null) 1dcc9602 300087b2 0 (b062 0) [26589.034632] TX ring 33: (null) 1de15cdc 300087b2 0 (b046 0) [26589.034632] TX ring 34: (null) 1de14b3c 3540 0 (b08b 0) [26589.034632] TX ring 35: (null) 1de1426c 3000709e 0 (b097 0) [26589.034632] TX ring 36: (null) 1de1399c 3000709e 0 (b097 0) [26589.034632] TX ring 37: (null) 1de130cc 3000c169 0 (b084 0) [26589.034632] TX ring 38: (null) 1de127fc 3000a5eb 0 (b5ea 0) [26589.034632] TX ring 39: (null) 1de11f2c 3000d4a3 0 (b557 0) [26589.034632] TX ring 40: (null) 1de10d8c 3000b57e 0 (b046 0) [26589.034632] TX ring 41: (null) 1de104bc 3000 0 (b0a7 0) [26589.034632] TX ring 42: (null) 1dce774c 30004000 0 (b046 0) [26589.034632] TX ring 43: (null) 1dce6e7c 300076f6 0 (b046 0) [26589.034632] TX ring 44: (null) 1dce5cdc 30002034 0 (b096 0) [26589.034632] TX ring 45: dde4c3c0 1dce540c b557 0 (b557 0) [26589.034632] TX ring 46: dddc1900 1dce426c b097 0 (b0
[PATCH 3/2] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
The TX timeout handling has been observed to trigger RX IRQ storms. And since cp_interrupt() just keeps saying that it handled the interrupt, the machine then dies. Fix the return value from cp_interrupt(), and the offending IRQ gets disabled and the machine survives. Signed-off-by: David Woodhouse --- ... and I have to make fewer trips to the cupboard under the stairs. There follows a fix to prevent the IRQ storm from happening at all. drivers/net/ethernet/realtek/8139cp.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index ba3dab7..f1054ad 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -371,7 +371,7 @@ struct cp_private { static void __cp_set_rx_mode (struct net_device *dev); -static void cp_tx (struct cp_private *cp); +static int cp_tx (struct cp_private *cp); static void cp_clean_rings (struct cp_private *cp); #ifdef CONFIG_NET_POLL_CONTROLLER static void cp_poll_controller(struct net_device *dev); @@ -587,8 +587,6 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) if (!status || (status == 0x)) goto out_unlock; - handled = 1; - netif_dbg(cp, intr, dev, "intr, status %04x cmd %02x cpcmd %04x\n", status, cpr8(Cmd), cpr16(CpCmd)); @@ -596,25 +594,30 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) /* close possible race's with dev_close */ if (unlikely(!netif_running(dev))) { + handled = 1; cpw16(IntrMask, 0); goto out_unlock; } - if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr)) + if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr)) { if (napi_schedule_prep(&cp->napi)) { + handled = 1; cpw16_f(IntrMask, cp_norx_intr_mask); __napi_schedule(&cp->napi); } - + } if (status & (TxOK | TxErr | TxEmpty | SWInt)) - cp_tx(cp); - if (status & LinkChg) - mii_check_media(&cp->mii_if, netif_msg_link(cp), false); + handled |= cp_tx(cp); + if (status & LinkChg) { + handled = 1; + mii_check_media(&cp->mii_if, netif_msg_link(cp), false); + } if (status & PciErr) { u16 pci_status; + handled = 1; pci_read_config_word(cp->pdev, PCI_STATUS, &pci_status); pci_write_config_word(cp->pdev, PCI_STATUS, pci_status); netdev_err(dev, "PCI bus error, status=%04x, PCI status=%04x\n", @@ -645,11 +648,12 @@ static void cp_poll_controller(struct net_device *dev) } #endif -static void cp_tx (struct cp_private *cp) +static int cp_tx (struct cp_private *cp) { unsigned tx_head = cp->tx_head; unsigned tx_tail = cp->tx_tail; unsigned bytes_compl = 0, pkts_compl = 0; + int handled = 0; while (tx_tail != tx_head) { struct cp_desc *txd = cp->tx_ring + tx_tail; @@ -661,6 +665,7 @@ static void cp_tx (struct cp_private *cp) if (status & DescOwn) break; + handled = 1; skb = cp->tx_skb[tx_tail]; BUG_ON(!skb); @@ -704,6 +709,8 @@ static void cp_tx (struct cp_private *cp) netdev_completed_queue(cp->dev, pkts_compl, bytes_compl); if (TX_BUFFS_AVAIL(cp) > (MAX_SKB_FRAGS + 1)) netif_wake_queue(cp->dev); + + return handled; } static inline u32 cp_tx_vlan_tag(struct sk_buff *skb) -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 4/2] 8139cp: Do not re-enable RX interrupts in cp_tx_timeout()
If an RX interrupt was already received but NAPI has not yet run when the RX timeout happens, we end up in cp_tx_timeout() with RX interrupts already disabled. Blindly re-enabling them will cause an IRQ storm. This is somewhat less painful than it was a few minutes ago before I fixed the return value from cp_interrupt(), but still suboptimal. Unconditionally leave RX interrupts disabled after the reset, and schedule NAPI to check the receive ring and re-enable them. Signed-off-by: David Woodhouse --- It might seem a little odd to deliberately schedule NAPI when we know there's going to be nothing there since we just cleared the rings. But given that we only get here if the hardware is playing silly buggers, I don't much like the idea of preserving the existing IntrMask that we read from the hardware. And there's no pretty way of determining whether NAPI is already scheduled. So just ensuring that it *is* scheduled seems simplest. It's not like we're going to care about the performance implications of one fruitless poll when we've just suffered a TX timeout and reset the hardware. In future I think I might want to fix cp_tx_timeout() to *only* reset the TX ring anyway. Especially as I'm entirely unsure about its locking w.r.t. cp_rx_poll()... what prevents cp_rx_poll() from running while we're in the middle of screwing around with the RX rings? drivers/net/ethernet/realtek/8139cp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index f1054ad..d12fc50 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -1269,9 +1269,10 @@ static void cp_tx_timeout(struct net_device *dev) rc = cp_init_rings(cp); cp_start_hw(cp); __cp_set_rx_mode(dev); - cp_enable_irq(cp); + cpw16_f(IntrMask, cp_norx_intr_mask); netif_wake_queue(dev); + napi_schedule_irqoff(&cp->napi); spin_unlock_irqrestore(&cp->lock, flags); } -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH net 2/2] 8139cp: reset BQL when ring tx ring cleared
On Fri, 2015-09-18 at 01:44 +0200, Francois Romieu wrote: > The TxDmaOkLowDesc register may tell if the Tx dma part is still > making any progress. I have added a TxPoll request. See below. It isn't making any progress. And TxPoll doesn't help. The only thing I've found that restarts it is to clear and then set the TxOn bit in the Cmd register, which resets the entire ring. I briefly wondered if it was triggered by our constantly banging on the TxPoll register even when the ring is already running, so I put in an optimisation to avoid that except in the case where we've had a TxEmpty (0x80) interrupt. That doesn't help either. When the problem happens, we've put a new descriptor into the ring and written to TxPoll to start the processing. We get a TxEmpty interrupt *without* TxDone, while the descriptor is still marked as being owned by the hardware. At other times we do get a TxEmpty without TxDone, but usually the TxDone happens shortly thereafter. In the problematic case, the TxDone never happens — and the offending descriptor is never given back. (Note that I also fixed the off-by-one in the 'tx queued' debugging message) [35322.861870] 8139cp :00:0b.0 eth1: tx queued, slot 13, skblen 1294 [35322.861870] 8139cp :00:0b.0 eth1: tx kicking tx_poll, head 14 tail 13 desc 54c0 poll 00 [35322.861870] 8139cp :00:0b.0 eth1: intr, status 0484 cmd 0c cpcmd 002b [35322.861870] 8139cp :00:0b.0 eth1: tx done, slot 13, status 0x30003a6e dec 54d0 [35322.861870] 8139cp :00:0b.0 eth1: irq not kicking tx_poll, head 14 tail 14 desc 54d0 poll 00 [35322.875014] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35322.913285] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35322.917250] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35322.943075] 8139cp :00:0b.0 eth1: tx queued, slot 14, skblen 70 [35322.943103] 8139cp :00:0b.0 eth1: tx kicking tx_poll, head 15 tail 14 desc 54d0 poll 00 [35322.943138] 8139cp :00:0b.0 eth1: intr, status 0484 cmd 0c cpcmd 002b [35322.943164] 8139cp :00:0b.0 eth1: tx done, slot 14, status 0x30008001 dec 54e0 [35322.943190] 8139cp :00:0b.0 eth1: irq not kicking tx_poll, head 15 tail 15 desc 54e0 poll 00 [35322.947071] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35322.959487] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.001723] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.041541] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.041541] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.052386] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.113828] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.114824] 8139cp :00:0b.0 eth1: tx queued, slot 15, skblen 1294 [35323.114851] 8139cp :00:0b.0 eth1: tx kicking tx_poll, head 16 tail 15 desc 54e0 poll 00 [35323.114887] 8139cp :00:0b.0 eth1: intr, status 0080 cmd 0c cpcmd 002b [35323.114921] 8139cp :00:0b.0 eth1: Invalid TxEmpty, should have seen 15 at ddea54f0 status c 2b0 80ff desc 54e0 poll 00 [35323.124559] 8139cp :00:0b.0 eth1: irq kicking tx_poll, head 16 tail 15 desc 54e0 poll 00 [35323.126775] 8139cp :00:0b.0 eth1: tx queued, slot 16, skblen 1294 [35323.126803] 8139cp :00:0b.0 eth1: tx not kicking tx_poll, head 17 tail 15 desc 54e0 poll 00 [35323.127154] 8139cp :00:0b.0 eth1: tx queued, slot 17, skblen 1294 [35323.127181] 8139cp :00:0b.0 eth1: tx not kicking tx_poll, head 18 tail 15 desc 54e0 poll 00 [35323.127776] 8139cp :00:0b.0 eth1: tx queued, slot 18, skblen 1294 [35323.127802] 8139cp :00:0b.0 eth1: tx not kicking tx_poll, head 19 tail 15 desc 54e0 poll 00 [35323.147218] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.247288] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35323.314456] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b ... [35331.995477] 8139cp :00:0b.0 eth1: intr, status 0001 cmd 0c cpcmd 002b [35332.024428] 8139cp :00:0b.0 eth1: Transmit timeout, status c 2b0 80ff desc 54e0 poll 00 [35332.033424] TX ring 00 @ddea5400: (null) 1ddd65ac 30006362 0 (b50e 0) [35332.040399] TX ring 01 @ddea5410: (null) 1ddd5cdc 30006362 0 (b50e 0) [35332.043399] TX ring 02 @ddea5420: (null) 1ed9165c 30006362 0 (b50e 0) [35332.043399] TX ring 03 @ddea5430: (null) 1ddd540c 30006362 0 (b50e 0) [35332.043399] TX ring 04 @ddea5440: (null) 1ddd4b3c 30006362 0 (b50e 0) [35332.043399] TX ring 05 @ddea5450: (null) 1ddd426c 30006362 0 (b50e 0) [35332.043399] TX ring 06 @ddea5460: (null) 1ddd399c 30006362 0 (b50e 0) [35332.043399] TX ring 07 @ddea5470: (null) 1ddd27fc 30006362 0 (b50e 0) [35332.043399] TX ring 08 @ddea5480: (null) 1ddd1f2c 30006362 0 (b50e 0) [35332.043399] TX ring 09 @ddea5490: (null) 1ddd165c 30006362 0 (b50e 0) [35332.0433
[PATCH 2/7] 8139cp: Do not re-enable RX interrupts in cp_tx_timeout()
From: David Woodhouse If an RX interrupt was already received but NAPI has not yet run when the RX timeout happens, we end up in cp_tx_timeout() with RX interrupts already disabled. Blindly re-enabling them will cause an IRQ storm. This is somewhat less painful than it was a few minutes ago before I fixed the return value from cp_interrupt(), but still suboptimal. Unconditionally leave RX interrupts disabled after the reset, and schedule NAPI to check the receive ring and re-enable them. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index f1054ad..d12fc50 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -1269,9 +1269,10 @@ static void cp_tx_timeout(struct net_device *dev) rc = cp_init_rings(cp); cp_start_hw(cp); __cp_set_rx_mode(dev); - cp_enable_irq(cp); + cpw16_f(IntrMask, cp_norx_intr_mask); netif_wake_queue(dev); + napi_schedule_irqoff(&cp->napi); spin_unlock_irqrestore(&cp->lock, flags); } -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
From: David Woodhouse The TX timeout handling has been observed to trigger RX IRQ storms. And since cp_interrupt() just keeps saying that it handled the interrupt, the machine then dies. Fix the return value from cp_interrupt(), and the offending IRQ gets disabled and the machine survives. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index ba3dab7..f1054ad 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -371,7 +371,7 @@ struct cp_private { static void __cp_set_rx_mode (struct net_device *dev); -static void cp_tx (struct cp_private *cp); +static int cp_tx (struct cp_private *cp); static void cp_clean_rings (struct cp_private *cp); #ifdef CONFIG_NET_POLL_CONTROLLER static void cp_poll_controller(struct net_device *dev); @@ -587,8 +587,6 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) if (!status || (status == 0x)) goto out_unlock; - handled = 1; - netif_dbg(cp, intr, dev, "intr, status %04x cmd %02x cpcmd %04x\n", status, cpr8(Cmd), cpr16(CpCmd)); @@ -596,25 +594,30 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) /* close possible race's with dev_close */ if (unlikely(!netif_running(dev))) { + handled = 1; cpw16(IntrMask, 0); goto out_unlock; } - if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr)) + if (status & (RxOK | RxErr | RxEmpty | RxFIFOOvr)) { if (napi_schedule_prep(&cp->napi)) { + handled = 1; cpw16_f(IntrMask, cp_norx_intr_mask); __napi_schedule(&cp->napi); } - + } if (status & (TxOK | TxErr | TxEmpty | SWInt)) - cp_tx(cp); - if (status & LinkChg) - mii_check_media(&cp->mii_if, netif_msg_link(cp), false); + handled |= cp_tx(cp); + if (status & LinkChg) { + handled = 1; + mii_check_media(&cp->mii_if, netif_msg_link(cp), false); + } if (status & PciErr) { u16 pci_status; + handled = 1; pci_read_config_word(cp->pdev, PCI_STATUS, &pci_status); pci_write_config_word(cp->pdev, PCI_STATUS, pci_status); netdev_err(dev, "PCI bus error, status=%04x, PCI status=%04x\n", @@ -645,11 +648,12 @@ static void cp_poll_controller(struct net_device *dev) } #endif -static void cp_tx (struct cp_private *cp) +static int cp_tx (struct cp_private *cp) { unsigned tx_head = cp->tx_head; unsigned tx_tail = cp->tx_tail; unsigned bytes_compl = 0, pkts_compl = 0; + int handled = 0; while (tx_tail != tx_head) { struct cp_desc *txd = cp->tx_ring + tx_tail; @@ -661,6 +665,7 @@ static void cp_tx (struct cp_private *cp) if (status & DescOwn) break; + handled = 1; skb = cp->tx_skb[tx_tail]; BUG_ON(!skb); @@ -704,6 +709,8 @@ static void cp_tx (struct cp_private *cp) netdev_completed_queue(cp->dev, pkts_compl, bytes_compl); if (TX_BUFFS_AVAIL(cp) > (MAX_SKB_FRAGS + 1)) netif_wake_queue(cp->dev); + + return handled; } static inline u32 cp_tx_vlan_tag(struct sk_buff *skb) -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] 8139cp: Call __cp_set_rx_mode() from cp_tx_timeout()
On Sun, 2015-09-20 at 22:24 -0700, David Miller wrote: > From: David Woodhouse > Date: Fri, 18 Sep 2015 00:21:54 +0100 > > > Unless we reset the RX config, on real hardware I don't seem to > receive > > any packets after a TX timeout. > > > > Signed-off-by: David Woodhouse > > Applied. Thanks. I'll send another batch, including the original patches 3/2 and 4/3 from this series, in reply to this message. After which, I think we might be able to turn on TX checksumming by default and I also have a way to implement early detection of the TX stall I've been seeing. -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 6/7] 8139cp: Dump contents of descriptor ring on TX timeout
From: David Woodhouse We are seeing unexplained TX timeouts under heavy load. Let's try to get a better idea of what's going on. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 036ad85..6feff9f 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -157,6 +157,7 @@ enum { NWayAdvert = 0x66, /* MII ADVERTISE */ NWayLPAR= 0x68, /* MII LPA */ NWayExpansion = 0x6A, /* MII Expansion */ + TxDmaOkLowDesc = 0x82, /* Low 16 bit address of a Tx descriptor. */ Config5 = 0xD8, /* Config5 */ TxPoll = 0xD9, /* Tell chip to check Tx descriptors for work */ RxMaxSize = 0xDA, /* Max size of an Rx packet (8169 only) */ @@ -1263,7 +1264,7 @@ static void cp_tx_timeout(struct net_device *dev) { struct cp_private *cp = netdev_priv(dev); unsigned long flags; - int rc; + int rc, i; netdev_warn(dev, "Transmit timeout, status %2x %4x %4x %4x\n", cpr8(Cmd), cpr16(CpCmd), @@ -1271,6 +1272,17 @@ static void cp_tx_timeout(struct net_device *dev) spin_lock_irqsave(&cp->lock, flags); + netif_dbg(cp, tx_err, cp->dev, "TX ring head %d tail %d desc %x\n", + cp->tx_head, cp->tx_tail, cpr16(TxDmaOkLowDesc)); + for (i = 0; i < CP_TX_RING_SIZE; i++) { + netif_dbg(cp, tx_err, cp->dev, + "TX slot %d @%p: %08x (%08x) %08x %llx %p\n", + i, &cp->tx_ring[i], le32_to_cpu(cp->tx_ring[i].opts1), + cp->tx_opts[i], le32_to_cpu(cp->tx_ring[i].opts2), + le64_to_cpu(cp->tx_ring[i].addr), + cp->tx_skb[i]); + } + cp_stop_hw(cp); cp_clean_rings(cp); rc = cp_init_rings(cp); -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 5/7] 8139cp: Fix DMA unmapping of transmitted buffers
From: David Woodhouse The low 16 bits of the 'opts1' field in the TX descriptor are supposed to still contain the buffer length when the descriptor is handed back to us. In practice, at least on my hardware, they don't. So stash the original value of the opts1 field and get the length to unmap from there. There are other ways we could have worked out the length, but I actually want a stash of the opts1 field anyway so that I can dump it alongside the contents of the descriptor ring when we suffer a TX timeout. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 07621b5..036ad85 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -341,6 +341,7 @@ struct cp_private { unsignedtx_tail; struct cp_desc *tx_ring; struct sk_buff *tx_skb[CP_TX_RING_SIZE]; + u32 tx_opts[CP_TX_RING_SIZE]; unsignedrx_buf_sz; unsignedwol_enabled : 1; /* Is Wake-on-LAN enabled? */ @@ -670,7 +671,7 @@ static int cp_tx (struct cp_private *cp) BUG_ON(!skb); dma_unmap_single(&cp->pdev->dev, le64_to_cpu(txd->addr), -le32_to_cpu(txd->opts1) & 0x, +cp->tx_opts[tx_tail] & 0x, PCI_DMA_TODEVICE); if (status & LastFrag) { @@ -793,6 +794,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, wmb(); cp->tx_skb[entry] = skb; + cp->tx_opts[entry] = flags; netif_dbg(cp, tx_queued, cp->dev, "tx queued, slot %d, skblen %d\n", entry, skb->len); } else { @@ -856,6 +858,8 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->opts1 = cpu_to_le32(ctrl); wmb(); + + cp->tx_opts[entry] = ctrl; cp->tx_skb[entry] = skb; } @@ -880,6 +884,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->opts1 = cpu_to_le32(ctrl); wmb(); + cp->tx_opts[first_entry] = ctrl; netif_dbg(cp, tx_queued, cp->dev, "tx queued, slots %d-%d, skblen %d\n", first_entry, entry, skb->len); } @@ -1122,6 +1127,7 @@ static int cp_init_rings (struct cp_private *cp) { memset(cp->tx_ring, 0, sizeof(struct cp_desc) * CP_TX_RING_SIZE); cp->tx_ring[CP_TX_RING_SIZE - 1].opts1 = cpu_to_le32(RingEnd); + memset(cp->tx_opts, 0, sizeof(cp->tx_opts)); cp_init_rings_index(cp); @@ -1179,6 +1185,7 @@ static void cp_clean_rings (struct cp_private *cp) memset(cp->rx_ring, 0, sizeof(struct cp_desc) * CP_RX_RING_SIZE); memset(cp->tx_ring, 0, sizeof(struct cp_desc) * CP_TX_RING_SIZE); + memset(cp->tx_opts, 0, sizeof(cp->tx_opts)); memset(cp->rx_skb, 0, sizeof(struct sk_buff *) * CP_RX_RING_SIZE); memset(cp->tx_skb, 0, sizeof(struct sk_buff *) * CP_TX_RING_SIZE); -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 3/7] 8139cp: Fix tx_queued debug message to print correct slot numbers
From: David Woodhouse After a certain amount of staring at the debug output of this driver, I realised it was lying to me. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 15 --- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index d12fc50..058f835 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -793,7 +793,8 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, wmb(); cp->tx_skb[entry] = skb; - entry = NEXT_TX(entry); + netif_dbg(cp, tx_queued, cp->dev, "tx queued, slot %d, skblen %d\n", + entry, skb->len); } else { struct cp_desc *txd; u32 first_len, first_eor; @@ -812,7 +813,6 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, goto out_dma_error; cp->tx_skb[entry] = skb; - entry = NEXT_TX(entry); for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) { const skb_frag_t *this_frag = &skb_shinfo(skb)->frags[frag]; @@ -820,6 +820,8 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, u32 ctrl; dma_addr_t mapping; + entry = NEXT_TX(entry); + len = skb_frag_size(this_frag); mapping = dma_map_single(&cp->pdev->dev, skb_frag_address(this_frag), @@ -855,9 +857,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->opts1 = cpu_to_le32(ctrl); wmb(); - cp->tx_skb[entry] = skb; - entry = NEXT_TX(entry); } txd = &cp->tx_ring[first_entry]; @@ -880,12 +880,13 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->opts1 = cpu_to_le32(first_eor | first_len | FirstFrag | DescOwn); wmb(); + + netif_dbg(cp, tx_queued, cp->dev, "tx queued, slots %d-%d, skblen %d\n", + first_entry, entry, skb->len); } - cp->tx_head = entry; + cp->tx_head = NEXT_TX(entry); netdev_sent_queue(dev, skb->len); - netif_dbg(cp, tx_queued, cp->dev, "tx queued, slot %d, skblen %d\n", - entry, skb->len); if (TX_BUFFS_AVAIL(cp) <= (MAX_SKB_FRAGS + 1)) netif_stop_queue(dev); -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 4/7] 8139cp: Fix TSO/scatter-gather descriptor setup
From: David Woodhouse When sending a TSO frame in multiple buffers, we were neglecting to set the first descriptor up in TSO mode. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 23 +++ 1 file changed, 11 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 058f835..07621b5 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -797,7 +797,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, entry, skb->len); } else { struct cp_desc *txd; - u32 first_len, first_eor; + u32 first_len, first_eor, ctrl; dma_addr_t first_mapping; int frag, first_entry = entry; const struct iphdr *ip = ip_hdr(skb); @@ -817,7 +817,6 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) { const skb_frag_t *this_frag = &skb_shinfo(skb)->frags[frag]; u32 len; - u32 ctrl; dma_addr_t mapping; entry = NEXT_TX(entry); @@ -865,20 +864,20 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->addr = cpu_to_le64(first_mapping); wmb(); - if (skb->ip_summed == CHECKSUM_PARTIAL) { + ctrl = first_eor | first_len | FirstFrag | DescOwn; + if (mss) + ctrl |= LargeSend | + ((mss & MSSMask) << MSSShift); + else if (skb->ip_summed == CHECKSUM_PARTIAL) { if (ip->protocol == IPPROTO_TCP) - txd->opts1 = cpu_to_le32(first_eor | first_len | -FirstFrag | DescOwn | -IPCS | TCPCS); + ctrl |= IPCS | TCPCS; else if (ip->protocol == IPPROTO_UDP) - txd->opts1 = cpu_to_le32(first_eor | first_len | -FirstFrag | DescOwn | -IPCS | UDPCS); + ctrl |= IPCS | UDPCS; else BUG(); - } else - txd->opts1 = cpu_to_le32(first_eor | first_len | -FirstFrag | DescOwn); + } + + txd->opts1 = cpu_to_le32(ctrl); wmb(); netif_dbg(cp, tx_queued, cp->dev, "tx queued, slots %d-%d, skblen %d\n", -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 7/7] 8139cp: Avoid gratuitous writes to TxPoll register when already running
From: David Woodhouse This is a minor optimisation, but as a side-effect it means we can know precisely which descriptors were already in the ring when we last prodded it to run. This will give us a better chance to catch the case where we get a TxEmpty interrupt when it hasn't actually finished the descriptors we *know* it should have seen, before it ends up being a full-blown TX timeout and reset. Since QEMU's emulation doesn't give TxEmpty interrupts, *always* bash on TxPoll until we see the first TxEmpty interrupt and cp->txempty_seen gets set. Signed-off-by: David Woodhouse --- I'm actually having second thoughts about this one since realising that QEMU doesn't implement it correctly. The workaround isn't *that* horrid but it's not clear it's enough of a performance win — or whether it's entirely necessary for catching my TX stall. drivers/net/ethernet/realtek/8139cp.c | 37 +-- 1 file changed, 31 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 6feff9f..67a4fcf 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -340,6 +340,9 @@ struct cp_private { unsignedtx_head cacheline_aligned; unsignedtx_tail; + unsignedtx_running; + unsignedtxempty_seen; + struct cp_desc *tx_ring; struct sk_buff *tx_skb[CP_TX_RING_SIZE]; u32 tx_opts[CP_TX_RING_SIZE]; @@ -611,6 +614,22 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) if (status & (TxOK | TxErr | TxEmpty | SWInt)) handled |= cp_tx(cp); + if ((status & TxEmpty) && cp->tx_running) { + handled = 1; + /* Qemu's emulation doesn't give TxEmpty interrupts */ + cp->txempty_seen = 1; + if (cp->tx_head == cp->tx_tail) { + /* Out of descriptors and we have nothing more for it. + Let it stop. */ + cp->tx_running = 0; + } else { + /* The hardware raced with us adding a new descriptor, + and we didn't get the TxEmpty IRQ in time so we + didn't prod it. Prod it now to restart. */ + cpw8(TxPoll, NormalTxPoll); + } + } + if (status & LinkChg) { handled = 1; mii_check_media(&cp->mii_if, netif_msg_link(cp), false); @@ -796,8 +815,9 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, cp->tx_skb[entry] = skb; cp->tx_opts[entry] = flags; - netif_dbg(cp, tx_queued, cp->dev, "tx queued, slot %d, skblen %d\n", - entry, skb->len); + netif_dbg(cp, tx_queued, cp->dev, + "tx queued, slot %d, skblen %d r %d\n", + entry, skb->len, cp->tx_running); } else { struct cp_desc *txd; u32 first_len, first_eor, ctrl; @@ -886,8 +906,9 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, wmb(); cp->tx_opts[first_entry] = ctrl; - netif_dbg(cp, tx_queued, cp->dev, "tx queued, slots %d-%d, skblen %d\n", - first_entry, entry, skb->len); + netif_dbg(cp, tx_queued, cp->dev, + "tx queued, slots %d-%d, skblen %d r %d\n", + first_entry, entry, skb->len, cp->tx_running); } cp->tx_head = NEXT_TX(entry); @@ -895,11 +916,13 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, if (TX_BUFFS_AVAIL(cp) <= (MAX_SKB_FRAGS + 1)) netif_stop_queue(dev); + if (!cp->tx_running || !cp->txempty_seen) { + cpw8(TxPoll, NormalTxPoll); + cp->tx_running = 1; + } out_unlock: spin_unlock_irqrestore(&cp->lock, intr_flags); - cpw8(TxPoll, NormalTxPoll); - return NETDEV_TX_OK; out_dma_error: dev_kfree_skb_any(skb); @@ -989,6 +1012,7 @@ static void cp_stop_hw (struct cp_private *cp) cp->rx_tail = 0; cp->tx_head = cp->tx_tail = 0; + cp->tx_running = 0; netdev_reset_queue(cp->dev); } @@ -1041,6 +1065,7 @@ static inline void cp_start_hw (struct cp_private *cp) * This variant appears to work fine. */ cpw8(Cmd, RxOn | TxOn); + cp->tx_running = 0; netdev_reset_queue(cp->dev); } -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 2/2] 8139cp: Call __cp_set_rx_mode() from cp_tx_timeout()
On Mon, 2015-09-21 at 14:59 +0100, David Woodhouse wrote: > After which, I think we might be able to turn on TX checksumming by > default and I also have a way to implement early detection of the TX > stall I've been seeing. This is a patch I've been playing with to catch the TX stall. When it happens we get a TxEmpty interrupt *without* TxDone. After loading the driver, we never get TxEmpty without TxDone until the problem has happened. I've got a minimal cp_tx_timeout() which just clears the TxOn bit in the Cmd register and turns it back on again, after moving the descriptors all down to the start of the TX ring. Strangely, *after* this has happened we do see a lot of instances of TxEmpty without TxDone. But then the TxDone usually comes immediately afterwards. Until the driver is reloaded, at which point the next instance of TxEmpty without TxDone *is* the hardware stalling again. I'm very confused about what's going on there. I think I possibly need to set a timer when the TxEmpty/!TxDone interrupt happens, and do a preemptive reset of the TX queue (as shown below) if the timer manages to expire before the TxDone actually happens. diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 67a4fcf..64b44ec 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -592,8 +592,8 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) if (!status || (status == 0x)) goto out_unlock; - netif_dbg(cp, intr, dev, "intr, status %04x cmd %02x cpcmd %04x\n", - status, cpr8(Cmd), cpr16(CpCmd)); + netif_dbg(cp, intr, dev, "intr, status %04x cmd %02x cpcmd %04x desc %04x\n", + status, cpr8(Cmd), cpr16(CpCmd), cpr16(TxDmaOkLowDesc)); cpw16(IntrStatus, status & ~cp_rx_intr_mask); @@ -623,6 +623,10 @@ static irqreturn_t cp_interrupt (int irq, void *dev_instance) Let it stop. */ cp->tx_running = 0; } else { + if (!(status & (TxOK | TxErr))) + netdev_warn(dev, "TxEmpty without TxDone. h %d t %d d %04x\n", + cp->tx_head, cp->tx_tail, cpr16(TxDmaOkLowDesc)); + /* The hardware raced with us adding a new descriptor, and we didn't get the TxEmpty IRQ in time so we didn't prod it. Prod it now to restart. */ @@ -1307,12 +1311,49 @@ static void cp_tx_timeout(struct net_device *dev) le64_to_cpu(cp->tx_ring[i].addr), cp->tx_skb[i]); } - +#if 1 + /* Stop the TX (which is already stopped), move the + descriptors down, and start it again */ + cpw8_f(Cmd, RxOn); + if (cp->tx_tail == 0) { + /* Do nothing */ + } else if (cp->tx_head > cp->tx_tail) { + for (i = 0; i < cp->tx_head - cp->tx_tail; i++) { + int from = i + cp->tx_tail; + cp->tx_ring[i].addr = cp->tx_ring[from].addr; + cp->tx_ring[i].opts2 = cp->tx_ring[from].opts2; + cp->tx_ring[i].opts1 = cp->tx_ring[from].opts1; + cp->tx_ring[from].opts1 &= ~cpu_to_le32(DescOwn); + cp->tx_opts[i] = cp->tx_opts[from]; + cp->tx_skb[i] = cp->tx_skb[from]; + cp->tx_skb[from] = NULL; + printk("Ring move %d->%d\n", from, i); + } + } else { + for (i = cp->tx_tail - cp->tx_head - 1; i >= 0; i--) { + int from = (i + cp->tx_tail) & (CP_TX_RING_SIZE - 1); + cp->tx_ring[i].addr = cp->tx_ring[from].addr; + cp->tx_ring[i].opts2 = cp->tx_ring[from].opts2; + cp->tx_ring[i].opts1 = cp->tx_ring[from].opts1; + cp->tx_ring[from].opts1 &= ~cpu_to_le32(DescOwn); + cp->tx_opts[i] = cp->tx_opts[from]; + cp->tx_skb[i] = cp->tx_skb[from]; + cp->tx_skb[from] = NULL; + printk("Ring move %d->%d\n", from, i); + } + } + cpw8_f(Cmd, RxOn|TxOn); + cp->tx_head = (cp->tx_head - cp->tx_tail) & (CP_TX_RING_SIZE - 1); + cp->tx_tail = 0; + printk("head %d tail %d\n", cp->tx_head, cp->tx_tail); + cpw8(TxPoll, NormalTxPoll); +#else cp_stop_hw(cp); cp_clean_rings(cp); rc = cp_init_rings(cp); cp_start_hw(cp); __cp_set_rx_mode(dev); +#endif cpw16_f(IntrMask, cp_norx_intr_mask); netif_wake_queue(dev); -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [RFC PATCH 1/3] Avoid making inappropriate requests of NETIF_F_V[46]_CSUM devices
On Wed, 2013-01-16 at 18:00 -0500, David Miller wrote: > From: David Woodhouse > Date: Wed, 16 Jan 2013 22:34:18 + > > > On Wed, 2013-01-16 at 15:54 -0500, David Miller wrote: > >> > >> My opinion on this is that the injectors of packets are responsible > >> for ensuring checksum types are set on SKBs in an appropriate way. > >> > >> So we ensure this in the local protocol stacks that generate packets, > >> and if foreign alien entities can inject SKBs with these checksum > >> settings (like the tun device can) the burdon of verification falls > >> upon whatever layer allows that to happen. > >> > >> So really, the fix is in the tun device and the virtio layer. > > > > The virtio layer (and the tun device) expose the equivalent of the > > NETIF_F_HW_CSUM capability to the guest. In the case where we have a > > real device on the host which *also* has NETIF_F_HW_CSUM capability, are > > you saying that the tun driver should do the checksum for non-UDP/TCP > > packets in software *anyway*, just because the packet might end up going > > out a device *without* that capability, and the check in > > harmonize_features() isn't sophisticated enough to cope properly? > > I'm saying that tun can't inject unchecked crap into our stack. Did we ever resolve this? AFAICT from inspecting the code the virtio_net device still advertises hardware csum capabilities to the guest. And accepts packets which need checksumming, calling skb_partial_csum_set() as appropriate. Likewise tun, xen, macvtap and af_packet. And that works fine — it's a nice performance win because it means that VM guests (and other clients) can make full use of the HW csum capabilities of the real network hardware. And when the outbound netdevice *doesn't* have HW csum support, we generally do the right thing and complete the csum in software in the host kernel before transmitting it. Perhaps I'm missing something, but I'm not sure why you refer to that as 'injecting unchecked crap'. Surely it's using CHECKSUM_PARTIAL precisely as it was designed, and allowing the checksum to be completed either by hardware or software as appropriate? The *only* problem is the false positive in harmonize_features(), which was addressed by my patch which started this thread (in 2013). The problem is that an IP packet that *isn't* TCP or UDP, being sent out a device that has only NETIF_F_IP_CSUM capability, ends up being handed to the device unchecksummed because harmonize_features() fails to clear the HW csum flag as it (arguably) should. Original thread at http://comments.gmane.org/gmane.linux.network/254981 I'm only looking at it again because I'm pondering enabling HW csum in 8139cp (now that I've fixed TSO), and it reminded me of this... -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Mon, 2015-09-21 at 22:25 +0200, Francois Romieu wrote: > David Woodhouse : > > From: David Woodhouse > > > > The TX timeout handling has been observed to trigger RX IRQ storms. And > > since cp_interrupt() just keeps saying that it handled the interrupt, > > the machine then dies. Fix the return value from cp_interrupt(), and > > the offending IRQ gets disabled and the machine survives. > > I am not fond of the way it dissociates the hardware status word and the > software "handled" variable. Oh, I like that part very much :) The practice of returning a 'handled' status only when you've actually *done* something you expect to mitigate the interrupt, is a useful way of also protecting against both hardware misbehaviour and software bugs. > What you are describing - RX IRQ storms - looks like a problem between > the irq and poll handlers. That's where I expect the problem to be solved. I already fixed that, in the next patch in the series. But the failure mode *should* have been 'IRQ disabled' and the device continuing to work via polling. Not a complete death of the machine. That's the difference this patch makes. > Sprinkling "handled" operations does not make me terribly confortable, > especially as I'd happily trade the old-style part irq, part napi > processing for a plain napi processing (I can get over it though :o) ). The existing cp_rx_poll() function mostly runs without taking cp->lock. But cp_tx() *does* need cp->lock for the tx_head/tx_tail manipulations. I'm guessing that's why it's still being called directly from the hard IRQ handler? I suppose we could probably find a way to move that out. But I was mostly just trying to fix stuff that was actually broken... :) -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 4/7] 8139cp: Fix TSO/scatter-gather descriptor setup
On Mon, 2015-09-21 at 23:01 +0200, Francois Romieu wrote: > Can you pile a patch to replace BUG with WARN_ON_ONCE(1) ? OK. I can probably contrive a userspace program using AF_PACKET and PACKET_VNET_HDR to trigger it, too¹ :) -- dwmw2 ¹ http://comments.gmane.org/gmane.linux.network/254981 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 7/7] 8139cp: Avoid gratuitous writes to TxPoll register when already running
On Mon, 2015-09-21 at 22:54 +0200, Francois Romieu wrote: > David Woodhouse : > [...] > > I'm actually having second thoughts about this one since realising that > > QEMU doesn't implement it correctly. The workaround isn't *that* horrid > > but it's not clear it's enough of a performance win — or whether it's > > entirely necessary for catching my TX stall. > > It don't indulge in this kind of fetish but I'm fine with people who want > to play with QEMU 8139cp emulation where they could use virtio. They should > imvho realize that it's their job to fix QEMU 8139cp emulation, not the > other way around. Oh, I'll fix the QEMU emulation (but not this week; updating my router has taken long enough already and I have Real Work™ to do). But those fixes will take time to propagate to actual users. I'm not sure it's so reasonable to knowingly break the 8139cp driver for all currently-released versions of QEMU. It wouldn't surprise me to find that QEMU probably accounts for the *majority* of 8139cp use on modern Linux kernels. I've had to fix regressions which *only* fail on real hardware, and were *only* tested on QEMU :) > Please keep things sane (*cough*) and avoid the qemu workaround. I don't even know that I need it. As you saw in my last WIP patch for catching the TX stall, I was just checking for TxEmpty without TxDone. An earlier iteration had actually remembered the last slot that was already present when we prodded the TxPoll, and would warn on getting a TxEmpty interrupt while that slot was still owned by the hardware. But that has the *same* false positives, only *after* the initial stall, that my current version has. So I'm just not sure I care about eliminating the gratuitous TxPoll writes. It's not as if we even wait for them. It's an MMIO write which can complete later. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
Re: [PATCH 4/7] 8139cp: Fix TSO/scatter-gather descriptor setup
On Mon, 2015-09-21 at 23:01 +0200, Francois Romieu wrote: > > Can you pile a patch to replace BUG with WARN_ON_ONCE(1) ? Let's avoid having three copies of the same damn code, while we're at it... http://git.infradead.org/users/dwmw2/linux-8139cp.git has this and the appropriate minor fixes to subsequent patches in the series. What do you think of finally enabling hw csum and TSO by default, btw? Subject: [PATCH 4½/7] 8139cp: Reduce duplicate csum/tso code in cp_start_xmit() We calculate the value of the opts1 descriptor field in three different places. With two different behaviours when given an invalid packet to be checksummed — none of them correct. Sort that out. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 61 --- 1 file changed, 20 insertions(+), 41 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 3b219aa..a2c471d 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -740,7 +740,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, { struct cp_private *cp = netdev_priv(dev); unsigned entry; - u32 eor, flags; + u32 eor, opts1; unsigned long intr_flags; __le32 opts2; int mss = 0; @@ -760,6 +760,21 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, mss = skb_shinfo(skb)->gso_size; opts2 = cpu_to_le32(cp_tx_vlan_tag(skb)); + opts1 = DescOwn; + if (mss) + opts1 |= LargeSend | ((mss & MSSMask) << MSSShift); + else if (skb->ip_summed == CHECKSUM_PARTIAL) { + const struct iphdr *ip = ip_hdr(skb); + if (ip->protocol == IPPROTO_TCP) + opts1 |= IPCS | TCPCS; + else if (ip->protocol == IPPROTO_UDP) + opts1 |= IPCS | UDPCS; + else { + WARN_ONCE(1, + "Net bug: asked to checksum invalid Legacy IP packet\n"); + goto out_dma_error; + } + } if (skb_shinfo(skb)->nr_frags == 0) { struct cp_desc *txd = &cp->tx_ring[entry]; @@ -775,21 +790,9 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->addr = cpu_to_le64(mapping); wmb(); - flags = eor | len | DescOwn | FirstFrag | LastFrag; - - if (mss) - flags |= LargeSend | ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_PARTIAL) { - const struct iphdr *ip = ip_hdr(skb); - if (ip->protocol == IPPROTO_TCP) - flags |= IPCS | TCPCS; - else if (ip->protocol == IPPROTO_UDP) - flags |= IPCS | UDPCS; - else - WARN_ON(1); /* we need a WARN() */ - } + opts1 |= eor | len | FirstFrag | LastFrag; - txd->opts1 = cpu_to_le32(flags); + txd->opts1 = cpu_to_le32(opts1); wmb(); cp->tx_skb[entry] = skb; @@ -800,7 +803,6 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, u32 first_len, first_eor, ctrl; dma_addr_t first_mapping; int frag, first_entry = entry; - const struct iphdr *ip = ip_hdr(skb); /* We must give this initial chunk to the device last. * Otherwise we could race with the device. @@ -832,19 +834,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0; - ctrl = eor | len | DescOwn; - - if (mss) - ctrl |= LargeSend | - ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_PARTIAL) { - if (ip->protocol == IPPROTO_TCP) - ctrl |= IPCS | TCPCS; - else if (ip->protocol == IPPROTO_UDP) - ctrl |= IPCS | UDPCS; - else - BUG(); - } + ctrl = opts1 | eor | len; if (frag == skb_shinfo(skb)->nr_frags - 1) ctrl |= LastFrag; @@ -864,18 +854,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->addr = cpu_to_le64(first_mapping); wmb(); - ctrl = first_eor | first_len | FirstFrag | DescOwn; -
Re: [PATCH 1/7] 8139cp: Improve accuracy of cp_interrupt() return, to survive IRQ storms
On Tue, 2015-09-22 at 16:45 -0700, David Miller wrote: > And if we are getting Rx* interrupts with napi_schedule_prep() > returning false, that's a serious problem. It can mean that the TX > timeout handler's resetting of the chip is either miscoded or is > racing with either NAPI polling or this interrupt handler. Such bugs happen. That's why we have IRQ-storm detection, to help protect us. > And if that's the case your patch is making the chip's IRQ line get > disabled when this race triggers. > > This change is even worse, in my opinion, if patch #2 indeed makes > the problem go away. Even worse than what? When a bug like the one I've fixed in #2 happens, let's look at the options: - With this patch, the IRQ storm is detected. We disable the IRQ line and the driver continues to work, albeit with a higher latency. - Without this patch, the box just dies completely. First the hardware watchdog was triggering an apparently spontaneous reset. Then when I disabled the hardware watchdog, the machine locked up and doesn't even respond to serial sysrq. Eventually I got a little bit of sense out of it using the NMI watchdog. Trust me, the disabled IRQ you get with my patch really *isn't* worse than what happened without it :) If cp_interrupt() had already worked the way I suggest, I would literally have two days of my life back right now. I spent a *lot* of time working out that it was dying in an interrupt storm. And trying to find the underlying problem while having to physically visit it under the stairs and reset it all the time. In fact, I've been seeing undebuggable spontaneous resets of this router for a while now. If I'd had an 'IRQ disabled' backtrace and a smoking gun, I might have been able to fix it a long time ago. I even discounted the IRQ-storm hypothesis, early on, on the basis that "Linux handles those quite gracefully now, and disables the interrupt". Except of course *that* relies on the IRQ handler returning an appropriate value. So it doesn't work with 8139cp as things stand. That's why I'm really quite keen on fixing cp_interrupt() to be more defensive in when it claims to have handled the interrupt. Surely that's *why* we have the IRQ-storm detection in the first place — to help us survive precisely this situation. IRQ storms are either caused by software bugs like the one fixed in patch #2, or hardware misbehaviour. And if IRQ handlers blindly return 'handled' just because the hardware actually *did* assert an interrupt, regardless of whether they've made any *progress*, then we don't get that protection in a *lot* of the situations where we'd want it. But sure, for now I'll drop this from the series and I can try to convince you separately. In fact I think it only affects the last patch in the series which reduces the banging on TxPoll. And I'm going to drop that one too for now anyway. I'll repost shortly. And I think I'll add a new one at the end, enabling TX checksums, SG and TSO by default. -- dwmw2 smime.p7s Description: S/MIME cryptographic signature
[PATCH 3/7] 8139cp: Fix TSO/scatter-gather descriptor setup
From: David Woodhouse When sending a TSO frame in multiple buffers, we were neglecting to set the first descriptor up in TSO mode. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 22 ++ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index cbca0de..75a8cee 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -790,7 +790,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, entry, skb->len); } else { struct cp_desc *txd; - u32 first_len, first_eor; + u32 first_len, first_eor, ctrl; dma_addr_t first_mapping; int frag, first_entry = entry; const struct iphdr *ip = ip_hdr(skb); @@ -810,7 +810,6 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, for (frag = 0; frag < skb_shinfo(skb)->nr_frags; frag++) { const skb_frag_t *this_frag = &skb_shinfo(skb)->frags[frag]; u32 len; - u32 ctrl; dma_addr_t mapping; entry = NEXT_TX(entry); @@ -858,20 +857,19 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->addr = cpu_to_le64(first_mapping); wmb(); - if (skb->ip_summed == CHECKSUM_PARTIAL) { + ctrl = first_eor | first_len | FirstFrag | DescOwn; + if (mss) + ctrl |= LargeSend | ((mss & MSSMask) << MSSShift); + else if (skb->ip_summed == CHECKSUM_PARTIAL) { if (ip->protocol == IPPROTO_TCP) - txd->opts1 = cpu_to_le32(first_eor | first_len | -FirstFrag | DescOwn | -IPCS | TCPCS); + ctrl |= IPCS | TCPCS; else if (ip->protocol == IPPROTO_UDP) - txd->opts1 = cpu_to_le32(first_eor | first_len | -FirstFrag | DescOwn | -IPCS | UDPCS); + ctrl |= IPCS | UDPCS; else BUG(); - } else - txd->opts1 = cpu_to_le32(first_eor | first_len | -FirstFrag | DescOwn); + } + + txd->opts1 = cpu_to_le32(ctrl); wmb(); netif_dbg(cp, tx_queued, cp->dev, "tx queued, slots %d-%d, skblen %d\n", -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 1/7] 8139cp: Do not re-enable RX interrupts in cp_tx_timeout()
From: David Woodhouse If an RX interrupt was already received but NAPI has not yet run when the RX timeout happens, we end up in cp_tx_timeout() with RX interrupts already disabled. Blindly re-enabling them will cause an IRQ storm. (This is made particularly horrid by the fact that cp_interrupt() always returns that it's handled the interrupt, even when it hasn't actually done anything. If it didn't do that, the core IRQ code would have detected the storm and handled it, I'd have had a clear smoking gun backtrace instead of just a spontaneously resetting router, and I'd have at *least* two days of my life back. Changing the return value of cp_interrupt() will be argued about under separate cover.) Unconditionally leave RX interrupts disabled after the reset, and schedule NAPI to check the receive ring and re-enable them. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index ba3dab7..947932d 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -1262,9 +1262,10 @@ static void cp_tx_timeout(struct net_device *dev) rc = cp_init_rings(cp); cp_start_hw(cp); __cp_set_rx_mode(dev); - cp_enable_irq(cp); + cpw16_f(IntrMask, cp_norx_intr_mask); netif_wake_queue(dev); + napi_schedule_irqoff(&cp->napi); spin_unlock_irqrestore(&cp->lock, flags); } -- 2.4.3 -- David WoodhouseOpen Source Technology Centre david.woodho...@intel.com Intel Corporation smime.p7s Description: S/MIME cryptographic signature
[PATCH 4/7] 8139cp: Reduce duplicate csum/tso code in cp_start_xmit()
From: David Woodhouse We calculate the value of the opts1 descriptor field in three different places. With two different behaviours when given an invalid packet to be checksummed — none of them correct. Sort that out. Signed-off-by: David Woodhouse --- drivers/net/ethernet/realtek/8139cp.c | 61 --- 1 file changed, 20 insertions(+), 41 deletions(-) diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c index 75a8cee..b3bd8b1 100644 --- a/drivers/net/ethernet/realtek/8139cp.c +++ b/drivers/net/ethernet/realtek/8139cp.c @@ -733,7 +733,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, { struct cp_private *cp = netdev_priv(dev); unsigned entry; - u32 eor, flags; + u32 eor, opts1; unsigned long intr_flags; __le32 opts2; int mss = 0; @@ -753,6 +753,21 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, mss = skb_shinfo(skb)->gso_size; opts2 = cpu_to_le32(cp_tx_vlan_tag(skb)); + opts1 = DescOwn; + if (mss) + opts1 |= LargeSend | ((mss & MSSMask) << MSSShift); + else if (skb->ip_summed == CHECKSUM_PARTIAL) { + const struct iphdr *ip = ip_hdr(skb); + if (ip->protocol == IPPROTO_TCP) + opts1 |= IPCS | TCPCS; + else if (ip->protocol == IPPROTO_UDP) + opts1 |= IPCS | UDPCS; + else { + WARN_ONCE(1, + "Net bug: asked to checksum invalid Legacy IP packet\n"); + goto out_dma_error; + } + } if (skb_shinfo(skb)->nr_frags == 0) { struct cp_desc *txd = &cp->tx_ring[entry]; @@ -768,21 +783,9 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->addr = cpu_to_le64(mapping); wmb(); - flags = eor | len | DescOwn | FirstFrag | LastFrag; - - if (mss) - flags |= LargeSend | ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_PARTIAL) { - const struct iphdr *ip = ip_hdr(skb); - if (ip->protocol == IPPROTO_TCP) - flags |= IPCS | TCPCS; - else if (ip->protocol == IPPROTO_UDP) - flags |= IPCS | UDPCS; - else - WARN_ON(1); /* we need a WARN() */ - } + opts1 |= eor | len | FirstFrag | LastFrag; - txd->opts1 = cpu_to_le32(flags); + txd->opts1 = cpu_to_le32(opts1); wmb(); cp->tx_skb[entry] = skb; @@ -793,7 +796,6 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, u32 first_len, first_eor, ctrl; dma_addr_t first_mapping; int frag, first_entry = entry; - const struct iphdr *ip = ip_hdr(skb); /* We must give this initial chunk to the device last. * Otherwise we could race with the device. @@ -825,19 +827,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, eor = (entry == (CP_TX_RING_SIZE - 1)) ? RingEnd : 0; - ctrl = eor | len | DescOwn; - - if (mss) - ctrl |= LargeSend | - ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_PARTIAL) { - if (ip->protocol == IPPROTO_TCP) - ctrl |= IPCS | TCPCS; - else if (ip->protocol == IPPROTO_UDP) - ctrl |= IPCS | UDPCS; - else - BUG(); - } + ctrl = opts1 | eor | len; if (frag == skb_shinfo(skb)->nr_frags - 1) ctrl |= LastFrag; @@ -857,18 +847,7 @@ static netdev_tx_t cp_start_xmit (struct sk_buff *skb, txd->addr = cpu_to_le64(first_mapping); wmb(); - ctrl = first_eor | first_len | FirstFrag | DescOwn; - if (mss) - ctrl |= LargeSend | ((mss & MSSMask) << MSSShift); - else if (skb->ip_summed == CHECKSUM_PARTIAL) { - if (ip->protocol == IPPROTO_TCP) - ctrl |= IPCS | TCPCS; - else if (ip->protocol == IPPROTO_UDP) - ctrl |= IPCS | UDPCS; - else - BUG(); -