RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Russell, > Now because things have changed during the last merge window, I've got > an even bigger problem sorting through that patch set and getting it > back into a submittable state. I've just sent out v2 for it onto the > net...@vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from > testers, or acks. I'd very much like to have some testing of it, so > if you want to try it out, I can provide you with a git URL, patches or a > combined patch. We have run v3.16 for about three months now, and many millions of ssh connections on eight separate systems, both without and with your network patches. Our conclusion is that the patches clearly reduce the number of network timeouts, and this is a great improvement. However, after a month or so of uptime, the number of timeouts began to increase again, forcing us to reboot the cards. Best regards, Mattis Lorentzon *** Consider the environment before printing this message. To read Autoliv's Information and Confidentiality Notice, follow this link: http://www.autoliv.com/disclaimer.html *** -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops: 17 SMP ARM (v3.16-rc2)
Hi Mattis,

On Fri, Aug 29, 2014 at 7:57 AM, Mattis Lorentzon wrote:
> Iain,
>
>> Interesting. We obviously have some differences in how we boot, my
>> changes to your config to get it to boot basically amount to reverting the
>> patch you attached and then enabling sata and mmc. So far I've been unable
>> to get your config to fail.
>
> Our version of U-boot doesn't support specifying a device tree separate from
> the kernel, so we append it to the end of the kernel binary. We also enable
> automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:
> console=ttymxc1,115200
> ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
> earlyprintk enable_wait_mode=off

I suppose that this 'enable_wait_mode=off' is a leftover from the time you
used the FSL BSP. It is not needed in mainline.

>> It would be good to know what makes my config work for you, I don't think
>> I've done anything special with it.
>
> With a couple of modifications (attached) we have been able to get your
> config running on our Zynq boards as well, solving our ethernet issues.
>
> The serial port and ethernet are essentially the only things we use. No disks,
> no graphics, no USB, etc. which is why we tried to reduce the kernel
> configuration to a bare minimum. We have no idea which disabled and/or
> enabled options that are causing the stalls.

It's good to hear you do not have the lockups anymore, but this is still a
big mystery for us, as we have not yet understood the root cause or which
'guilty' kernel config option makes the FEC work unreliably.
RE: Oops: 17 SMP ARM (v3.16-rc2)
Iain,

> Interesting. We obviously have some differences in how we boot, my
> changes to your config to get it to boot basically amount to reverting the
> patch you attached and then enabling sata and mmc. So far I've been unable
> to get your config to fail.

Our version of U-boot doesn't support specifying a device tree separate from
the kernel, so we append it to the end of the kernel binary. We also enable
automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:

console=ttymxc1,115200
ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
earlyprintk enable_wait_mode=off

> It would be good to know what makes my config work for you, I don't think
> I've done anything special with it.

With a couple of modifications (attached) we have been able to get your
config running on our Zynq boards as well, solving our ethernet issues.

The serial port and ethernet are essentially the only things we use. No disks,
no graphics, no USB, etc., which is why we tried to reduce the kernel
configuration to a bare minimum. We have no idea which disabled and/or
enabled options are causing the stalls.

Best regards,
Mattis Lorentzon

Attachment: config.patch
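For readers unfamiliar with the ip= boot argument used with CONFIG_IP_PNP, here is a minimal sketch that splits the bootargs quoted above into named fields. The field order (client, server, gateway, netmask, hostname, device, autoconf) follows the kernel's nfsroot documentation; the helper names `parse_bootargs` and `parse_ip_param` are hypothetical and exist only for this illustration. Any fields after the seventh are ignored here.

```python
from typing import Dict

# First seven colon-separated fields of the kernel ip= parameter.
# Later optional fields (if present) are ignored by this sketch.
IP_FIELDS = (
    "client_ip", "server_ip", "gateway", "netmask",
    "hostname", "device", "autoconf",
)


def parse_bootargs(bootargs: str) -> Dict[str, str]:
    """Split a kernel command line into key -> value; bare flags map to ''."""
    result: Dict[str, str] = {}
    for token in bootargs.split():
        key, _, value = token.partition("=")
        result[key] = value
    return result


def parse_ip_param(value: str) -> Dict[str, str]:
    """Name the fields of an ip= value; missing trailing fields become ''."""
    parts = value.split(":")[: len(IP_FIELDS)]
    parts += [""] * (len(IP_FIELDS) - len(parts))
    return dict(zip(IP_FIELDS, parts))


# Bootargs exactly as quoted in the message above.
BOOTARGS = (
    "console=ttymxc1,115200 "
    "ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on "
    "earlyprintk enable_wait_mode=off"
)
```

Calling `parse_ip_param(parse_bootargs(BOOTARGS)["ip"])` shows that the card is configured statically on eth0 with hostname armcard, and that the server and gateway are the same machine.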
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 27/08/14 07:32, Mattis Lorentzon wrote:
> Hi Iain, Russell and Fabio,
>
>> The config is attached. Note that there's a lot of additional stuff enabled
>> as I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
>> Allwinner A10/A20 along with several versions of boards using those
>> particular SoCs.
>>
>> Same kernel binary on all the boards I've tried this on, only real
>> differences will be the devicetree and u-boot
>
> Amazingly we have been able to run a complete nightly test on eight i.MX6
> boards without hiccups using Iain's config! We had to modify it slightly to
> get it to boot, please find attached patch and Iain's patched config.

Interesting. We obviously have some differences in how we boot, my
changes to your config to get it to boot basically amount to reverting the
patch you attached and then enabling sata and mmc. So far I've been unable
to get your config to fail.

I'm attaching the patch showing what I changed in case it sheds any light on
what's going on, although I don't see why any of the changes make any
difference.

My kernel command line is also fairly obvious with nothing I'd think is odd:

console=ttymxc1,115200n8 root=/dev/sda1 ro rootfstype=ext2 rootwait video= ahci-imx.hotplug=1

It would be good to know what makes my config work for you, I don't think
I've done anything special with it.
Iain

3c3
< # Linux/arm 3.16.0 Kernel Configuration
---
> # Linux/arm 3.16.0-rc2 Kernel Configuration
38c38
< CONFIG_KERNEL_GZIP=y
---
> # CONFIG_KERNEL_GZIP is not set
41c41
< # CONFIG_KERNEL_LZO is not set
---
> CONFIG_KERNEL_LZO=y
233,239c233
< CONFIG_PARTITION_ADVANCED=y
< # CONFIG_ACORN_PARTITION is not set
< # CONFIG_AIX_PARTITION is not set
< # CONFIG_OSF_PARTITION is not set
< # CONFIG_AMIGA_PARTITION is not set
< # CONFIG_ATARI_PARTITION is not set
< # CONFIG_MAC_PARTITION is not set
---
> # CONFIG_PARTITION_ADVANCED is not set
241,249d234
< # CONFIG_BSD_DISKLABEL is not set
< # CONFIG_MINIX_SUBPARTITION is not set
< # CONFIG_SOLARIS_X86_PARTITION is not set
< # CONFIG_UNIXWARE_DISKLABEL is not set
< # CONFIG_LDM_PARTITION is not set
< # CONFIG_SGI_PARTITION is not set
< # CONFIG_ULTRIX_PARTITION is not set
< # CONFIG_SUN_PARTITION is not set
< # CONFIG_KARMA_PARTITION is not set
251,252d235
< # CONFIG_SYSV68_PARTITION is not set
< # CONFIG_CMDLINE_PARTITION is not set
265,266d247
< CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
< CONFIG_RWSEM_SPIN_ON_OWNER=y
533,534c514,521
< # CONFIG_ARM_APPENDED_DTB is not set
< CONFIG_CMDLINE=""
---
> CONFIG_ARM_APPENDED_DTB=y
> CONFIG_ARM_ATAG_DTB_COMPAT=y
> CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_EXTEND is not set
> CONFIG_CMDLINE="___console=ttymxc0,115200 ___debug ___LOGLEVEL=8 ___initrd=0x11800040,12383491 ___dyndbg=\"file * +p\""
> CONFIG_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_CMDLINE_EXTEND is not set
> # CONFIG_CMDLINE_FORCE is not set
591c578
< # CONFIG_PM_TEST_SUSPEND is not set
---
> CONFIG_PM_TEST_SUSPEND=y
919d905
< CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
920a907
> CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
1045,1059c1032
< CONFIG_ATA=y
< # CONFIG_ATA_NONSTANDARD is not set
< CONFIG_ATA_VERBOSE_ERROR=y
< CONFIG_SATA_PMP=y
<
< #
< # Controllers with non-SFF native interface
< #
< CONFIG_SATA_AHCI=y
< CONFIG_SATA_AHCI_PLATFORM=y
< CONFIG_AHCI_IMX=y
< # CONFIG_SATA_INIC162X is not set
< # CONFIG_SATA_ACARD_AHCI is not set
< # CONFIG_SATA_SIL24 is not set
< # CONFIG_ATA_SFF is not set
---
> # CONFIG_ATA is not set
1786,1815c1759
< CONFIG_MMC=y
< # CONFIG_MMC_DEBUG is not set
< # CONFIG_MMC_CLKGATE is not set
<
< #
< # MMC/SD/SDIO Card Drivers
< #
< CONFIG_MMC_BLOCK=y
< CONFIG_MMC_BLOCK_MINORS=8
< CONFIG_MMC_BLOCK_BOUNCE=y
< # CONFIG_SDIO_UART is not set
< # CONFIG_MMC_TEST is not set
<
< #
< # MMC/SD/SDIO Host Controller Drivers
< #
< CONFIG_MMC_SDHCI=y
< CONFIG_MMC_SDHCI_IO_ACCESSORS=y
< # CONFIG_MMC_SDHCI_PCI is not set
< CONFIG_MMC_SDHCI_PLTFM=y
< # CONFIG_MMC_SDHCI_OF_ARASAN is not set
< CONFIG_MMC_SDHCI_ESDHC_IMX=y
< # CONFIG_MMC_SDHCI_PXAV3 is not set
< # CONFIG_MMC_SDHCI_PXAV2 is not set
< # CONFIG_MMC_MXC is not set
< # CONFIG_MMC_TIFM_SD is not set
< # CONFIG_MMC_CB710 is not set
< # CONFIG_MMC_VIA_SDMMC is not set
< # CONFIG_MMC_DW is not set
< # CONFIG_MMC_USDHI6ROL0 is not set
---
> # CONFIG_MMC is not set
1968d1911
< # CONFIG_WIMAX_GDM72XX is not set
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Iain, Russell and Fabio,

> The config is attached. Note that there's a lot of additional stuff enabled as
> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
> Allwinner A10/A20 along with several versions of boards using those
> particular SoCs.
>
> Same kernel binary on all the boards I've tried this on, only real differences
> will be the devicetree and u-boot

Amazingly, we have been able to run a complete nightly test on eight i.MX6
boards without hiccups using Iain's config! We had to modify it slightly to
get it to boot; please find attached the patch and Iain's patched config.

On Russell's suggestion we also disabled flow control on the machines.
However, it did not seem to make a difference, because all our Zynq cards
stalled during the same test run (using our own Zynq config).

Iain's config seems promising and we will continue to run tests during the
next couple of days. We will also try to adapt Iain's config to our Zynq board.

Many thanks for all suggestions, patches and configs so far!

Best regards,
Mattis Lorentzon

Attachments: config.patch, config.gz
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 21/08/14 10:39, Iain Paton wrote: > On 19/08/14 07:03, Iain Paton wrote: >> On 17/08/14 22:46, Fabio Estevam wrote: >>> Iain, >>> >>> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton wrote: On 15/08/14 06:42, Mattis Lorentzon wrote: > We mostly run SSH with benchmarks using NFS, it can probably be > triggered by using only SSH with the following loop: > > # while : ; do ssh arm-card date; done Mattis, What sort of time does it take for you to see a problem? I've been running the above for nearly two days on 3.16.0 on a board with fec interrupts routed through gpio_6 and haven't seen a hint of a problem. >>> >>> Thanks for testing. >>> >>> Which mx6 board have you used on this test? >> >> It's currently pointed at a RIoTboard (atheros phy) but I'm happy to >> try it against both a Sabre-Lite and a Wandboard B1, all running the >> same kernel binary, as well. >> >> I'm interested enough in why different people get different results >> with this that I'll put some time towards testing to try to help >> narrow down the cause. >> > > two and a half days of running this against both a sabre-lite and a > wandboard quad B1 and I still have no reason to think there's any > sort of a problem. > > Up to now, my testing has been done with my own config, I'll now > repeat the whole thing using the config Mattis posted to see if > I can reproduce it that way. > > Suggestions on a better / easier / quicker way to reproduce it are > welcome. > So I wasn't able to use Mattis exact configuration as I couldn't get it to boot properly on anything. I made changes enough to enable mmc/sata and to disable the compiled in kernel command line and appended devicetree and initrd. Even then it still won't boot on my WBQUAD. It is running on Sabre-Lite and RIoTboard though, so useful enough to test against the SL in a similar manner to Mattis tests with SL. I've had the test running against both for approx one day and again no sign of any problems. 
I'm happy to leave this running, but at this stage I'm not expecting I'll see any problems even if I leave it running for a week.
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 25/08/14 11:18, Russell King - ARM Linux wrote:
> On Wed, Aug 13, 2014 at 01:39:27PM +, Mattis Lorentzon wrote:
>> All our tests seem to behave the same way on the Sabrelite as on our own
>> board. A working theory is that the switch (3Com Switch 4400) triggers the
>> degeneration of the network stack from which Linux does not seem to recover,
>> even if we later bypass the switch and directly connect the board to the
>> server machine.
>
> Please can you try something - what happens if you completely disable
> pause frame support (flow control) on all machines on the switch?

Russell, while trying to duplicate this I have flow-control disabled on the
switch which leads to it being auto-negotiated off on all devices.
Do you think it could be worth turning it on and trying again?
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Wed, Aug 13, 2014 at 01:39:27PM +, Mattis Lorentzon wrote:
> All our tests seem to behave the same way on the Sabrelite as on our own
> board. A working theory is that the switch (3Com Switch 4400) triggers the
> degeneration of the network stack from which Linux does not seem to recover,
> even if we later bypass the switch and directly connect the board to the
> server machine.

Please can you try something - what happens if you completely disable
pause frame support (flow control) on all machines on the switch?

--
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 22/08/14 01:01, Fabio Estevam wrote:
> On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton wrote:
>
>> two and a half days of running this against both a sabre-lite and a
>> wandboard quad B1 and I still have no reason to think there's any
>> sort of a problem.
>>
>> Up to now, my testing has been done with my own config, I'll now
>> repeat the whole thing using the config Mattis posted to see if
>> I can reproduce it that way.
>>
>> Suggestions on a better / easier / quicker way to reproduce it are
>> welcome.
>
> Thanks, Iain.
>
> Mattis,
>
> What is the silicon version of the mx6 in your sabrelite? What GCC
> version do you use?
>

For reference, both my SL and WBQUAD report silicon rev 1.2.
The RIoTboard uses a Solo and reports silicon rev 1.1.

I'm using vanilla gcc 4.9.1 and compiling the kernel natively on a sabre-lite.
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 14, 2014 at 02:43:56PM +, Mattis Lorentzon wrote:
> Fabio and Russell,
>
> > A working theory is that the switch (3Com Switch 4400) triggers the
> > degeneration of the network stack from which Linux does not seem to
> > recover, even if we later bypass the switch and directly connect the board
> > to the server machine.
>
> After a few more tests we have finally been able to trigger the exact
> same stalls on the Sabrelite board with a direct network connection
> (i.e. without the switch).

That's a setup which I can't reproduce, as all my MX6 hardware runs
root-NFS, so using a direct connection to a machine to test will result
in the MX6 losing its root filesystem.

That said, on SolidRun hardware, there is some investigation going on at
the moment concerning poor UDP performance - this is an on-going problem
that has been present for a long time.

What we find is that TCP performance achieves around the 600Mbps mark,
but UDP performance can be extremely poor with high packet loss. Adding
a udelay(210) into fec_enet_rx() can perversely (on multi-core SoCs)
increase UDP performance to around 500Mbps at the expense of a reduction
in TCP performance.

This "solution" was tripped over while trying to debug this problem,
and it was found that adding printk()s to the driver increased UDP
performance - so substituting udelay() for printk() was then tried.

I tried to run perf on the kernel yesterday to find out what's going on,
but for some reason, perf gave me impossible call traces, so I gave up
with that idea. For example, perf told me that there was a high hit rate
in memcpy() being called from net_rx_action(), but net_rx_action() doesn't
call memcpy(), nor do any of the called functions as a tail-call.

That said, I don't think perf could tell us what's going on - what we
need is a trace of the CPU's execution while iperf is running, *without*
affecting the CPU itself. This is something I can't do with the hardware
I have.

My suspicion (unproven) is that a batch of packets gets processed in the
softirq handler called during the FEC interrupt exit path. Then, because
there's more work to be done, ksoftirqd is scheduled, but it takes time
for ksoftirqd to start running - during which time we drop a lot of
packets. ksoftirqd processes some packets, but then finds that it can't
complete the NAPI "work budget", and so stops running, resulting in the
packet processing being triggered by the next FEC interrupt, and the
cycle repeats.

TCP notices this, and adjusts its sending rate to match, whereas UDP just
carries on regardless, resulting in lots of packets dropped each time we
switch from the tail of hardirq processing to ksoftirqd.

With the udelay() in place, processing takes enough time that it gets
bounced onto ksoftirqd, where it stays.

I'm adding this to this thread in case it has any bearing on the
problem(s) you're seeing - yes, it seems like a different problem, but
could it be related...
RE: Oops: 17 SMP ARM (v3.16-rc2)
Fabio,

> What is the silicon version of the mx6 in your sabrelite? What GCC version do
> you use?

The silicon version is PCIMX6Q6AVT10AA and the GCC version we use is
arm-none-eabi-gcc (Fedora 2013.11.24-2.fc19) 4.8.1.

Iain,

> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if I can
> reproduce it that way.

Thanks for testing this. Could you also send me the config that you used for
your Sabrelite?

Do you know of any options that enable additional debug information about the
network driver state (full buffers etc.)?

Best regards,
Mattis Lorentzon
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton wrote:

> two and a half days of running this against both a sabre-lite and a
> wandboard quad B1 and I still have no reason to think there's any
> sort of a problem.
>
> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if
> I can reproduce it that way.
>
> Suggestions on a better / easier / quicker way to reproduce it are
> welcome.

Thanks, Iain.

Mattis,

What is the silicon version of the mx6 in your sabrelite? What GCC
version do you use?
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 19/08/14 07:03, Iain Paton wrote: > On 17/08/14 22:46, Fabio Estevam wrote: >> Iain, >> >> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton wrote: >>> On 15/08/14 06:42, Mattis Lorentzon wrote: >>> We mostly run SSH with benchmarks using NFS, it can probably be triggered by using only SSH with the following loop: # while : ; do ssh arm-card date; done >>> >>> Mattis, >>> >>> What sort of time does it take for you to see a problem? >>> >>> I've been running the above for nearly two days on 3.16.0 on a board >>> with fec interrupts routed through gpio_6 and haven't seen a hint of >>> a problem. >> >> Thanks for testing. >> >> Which mx6 board have you used on this test? > > It's currently pointed at a RIoTboard (atheros phy) but I'm happy to > try it against both a Sabre-Lite and a Wandboard B1, all running the > same kernel binary, as well. > > I'm interested enough in why different people get different results > with this that I'll put some time towards testing to try to help > narrow down the cause. > two and a half days of running this against both a sabre-lite and a wandboard quad B1 and I still have no reason to think there's any sort of a problem. Up to now, my testing has been done with my own config, I'll now repeat the whole thing using the config Mattis posted to see if I can reproduce it that way. Suggestions on a better / easier / quicker way to reproduce it are welcome. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 17/08/14 22:46, Fabio Estevam wrote:
> Iain,
>
> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton wrote:
>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>
>>> We mostly run SSH with benchmarks using NFS, it can probably be
>>> triggered by using only SSH with the following loop:
>>>
>>> # while : ; do ssh arm-card date; done
>>
>> Mattis,
>>
>> What sort of time does it take for you to see a problem?
>>
>> I've been running the above for nearly two days on 3.16.0 on a board
>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>> a problem.
>
> Thanks for testing.
>
> Which mx6 board have you used on this test?

It's currently pointed at a RIoTboard (atheros phy) but I'm happy to
try it against both a Sabre-Lite and a Wandboard B1, all running the
same kernel binary, as well.

I'm interested enough in why different people get different results
with this that I'll put some time towards testing to try to help
narrow down the cause.
Re: Oops: 17 SMP ARM (v3.16-rc2)
Iain,

On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton wrote:
> On 15/08/14 06:42, Mattis Lorentzon wrote:
>
>> We mostly run SSH with benchmarks using NFS, it can probably be
>> triggered by using only SSH with the following loop:
>>
>> # while : ; do ssh arm-card date; done
>
> Mattis,
>
> What sort of time does it take for you to see a problem?
>
> I've been running the above for nearly two days on 3.16.0 on a board
> with fec interrupts routed through gpio_6 and haven't seen a hint of
> a problem.

Thanks for testing.

Which mx6 board have you used on this test?
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 15/08/14 06:42, Mattis Lorentzon wrote:

> We mostly run SSH with benchmarks using NFS, it can probably be
> triggered by using only SSH with the following loop:
>
> # while : ; do ssh arm-card date; done

Mattis,

What sort of time does it take for you to see a problem?

I've been running the above for nearly two days on 3.16.0 on a board
with fec interrupts routed through gpio_6 and haven't seen a hint of
a problem.
RE: Oops: 17 SMP ARM (v3.16-rc2)
Fabio,

> Do the stalls also happen on a pure 3.16 kernel?

Yes, we just tried this out overnight and we get the same stalls here.

We have seen similar problems on a Zynq-based board. It might be worth
noting that all three boards share some chips, for example the KSZ9021RN,
whereas the FEC driver only runs on the two iMX6 boards.

> How can we reproduce the error?

We mostly run SSH with benchmarks using NFS, it can probably be
triggered by using only SSH with the following loop:

# while : ; do ssh arm-card date; done

Our (pure) 3.16 kernel uses the following config.

http://lkml.iu.edu/hypermail/linux/kernel/1408.1/03045/config.gz

(We have quite generously disabled a lot of sub-systems in our config.)

Best regards,
Mattis Lorentzon
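Since the stalls are stochastic, it may help to count timeouts rather than watch the shell loop by eye. The following is only a sketch of such a counter, not the test harness used in this thread: `run_probe` and `ProbeResult` are hypothetical names, and the timeout value is an assumption to be tuned per setup.

```python
import subprocess
import time
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    ok: int = 0          # command exited with status 0
    failed: int = 0      # command exited with a non-zero status
    timed_out: int = 0   # command did not finish within timeout_s


def run_probe(cmd: List[str], attempts: int, timeout_s: float,
              pause_s: float = 0.0) -> ProbeResult:
    """Run cmd repeatedly and classify each attempt.

    A timeout is counted separately from an ordinary failure, because a
    network stall shows up as a hang rather than as an error exit.
    """
    result = ProbeResult()
    for _ in range(attempts):
        try:
            proc = subprocess.run(
                cmd,
                stdin=subprocess.DEVNULL,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            result.timed_out += 1
        else:
            if proc.returncode == 0:
                result.ok += 1
            else:
                result.failed += 1
        time.sleep(pause_s)
    return result
```

For the loop quoted above, a call such as `run_probe(["ssh", "arm-card", "date"], attempts=1000, timeout_s=10.0)` would report how many of the connections hung; a rising `timed_out` count over successive runs is the symptom described in this thread.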
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 14, 2014 at 11:43 AM, Mattis Lorentzon wrote:

> After a few more tests we have finally been able to trigger the exact same
> stalls on the Sabrelite board with a direct network connection (i.e. without
> the switch).

Do the stalls also happen on a pure 3.16 kernel?

How can we reproduce the error?
RE: Oops: 17 SMP ARM (v3.16-rc2)
Fabio and Russell,

> A working theory is that the switch (3Com Switch 4400) triggers the
> degeneration of the network stack from which Linux does not seem to
> recover, even if we later bypass the switch and directly connect the board to
> the server machine.

After a few more tests we have finally been able to trigger the exact same
stalls on the Sabrelite board with a direct network connection (i.e. without
the switch).

Best regards,
Mattis Lorentzon
RE: Oops: 17 SMP ARM (v3.16-rc2)
Fabio and Russell,

> In order to try to narrow down whether this is a board issue, could you try to
> run the same kernel on a mx6q development board, such as mx6qsabresd,
> cubox-i, wandboard, etc?

Indeed, we have a Sabrelite development board and have run the same kernel
configuration on it (please find attached). Russell's 30 FEC-related patches
are applied. We have also tried with and without the extended interrupts
entry in the DT.

All our tests seem to behave the same way on the Sabrelite as on our own
board. A working theory is that the switch (3Com Switch 4400) triggers the
degeneration of the network stack from which Linux does not seem to recover,
even if we later bypass the switch and directly connect the board to the
server machine.

Since the problem is stochastic in nature we are not completely sure if we
can trigger the problem without the switch. It's the switch that allows us to
run many cards simultaneously and thus trigger the problem more easily. :-)

What are your thoughts?

Best regards,
Mattis Lorentzon

Attachment: config.gz
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Mon, Aug 11, 2014 at 10:32 AM, Mattis Lorentzon wrote:
> Russell and Fabio,
>
>> I'd be interested to hear whether removing the
>>
>> interrupts-extended = ...
>>
>> property from your board's DT file, thereby causing you to revert back to the
>> default I list above, also fixes the instability you are seeing.
>
> We have tried to remove the board specific interrupts-extended field and the
> MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve
> the stalls. Our interrupts look like this now:
>
> 150:      15519          0          0          0       GIC 150  2188000.ethernet
> 151:          0          0          0          0       GIC 151  2188000.ethernet
>
> Our device tree might still be slightly incorrect. We have noticed that our
> RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are
> a bit surprised that this works at all). We are not quite sure how to
> configure this properly.

In order to try to narrow down whether this is a board issue, could
you try to run the same kernel on a mx6q development board, such as
mx6qsabresd, cubox-i, wandboard, etc?
RE: Oops: 17 SMP ARM (v3.16-rc2)
Russell and Fabio,

> I'd be interested to hear whether removing the
>
> interrupts-extended = ...
>
> property from your board's DT file, thereby causing you to revert back to the
> default I list above, also fixes the instability you are seeing.

We have tried to remove the board specific interrupts-extended field and the
MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve
the stalls. Our interrupts look like this now:

150:      15519          0          0          0       GIC 150  2188000.ethernet
151:          0          0          0          0       GIC 151  2188000.ethernet

Our device tree might still be slightly incorrect. We have noticed that our
RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are
a bit surprised that this works at all). We are not quite sure how to
configure this properly.

Best regards,
Mattis Lorentzon
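To check whether the FEC interrupt keeps firing during a stall, two snapshots of /proc/interrupts can be compared. The sketch below parses lines in the format shown in this thread and reports which IRQ lines advanced between snapshots; the function names are hypothetical, and the second snapshot in the usage note is made-up sample data, not a measurement.

```python
from typing import Dict, List, Tuple


def parse_irq_line(line: str) -> Tuple[str, List[int], str]:
    """Split one /proc/interrupts line into (irq, per-CPU counts, description)."""
    irq, _, rest = line.partition(":")
    tokens = rest.split()
    counts: List[int] = []
    # Per-CPU counters are the leading run of purely numeric tokens.
    while tokens and tokens[0].isdigit():
        counts.append(int(tokens.pop(0)))
    return irq.strip(), counts, " ".join(tokens)


def irq_totals(snapshot: str, device: str) -> Dict[str, int]:
    """Sum the per-CPU counts of every IRQ line that belongs to device."""
    totals: Dict[str, int] = {}
    for line in snapshot.splitlines():
        if ":" not in line:
            continue  # skip the CPU header line and blank lines
        irq, counts, description = parse_irq_line(line)
        if description.endswith(device):
            totals[irq] = sum(counts)
    return totals


def advanced(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Return the IRQ numbers whose total increased between two snapshots."""
    return sorted(irq for irq, n in after.items() if n > before.get(irq, 0))
```

Feeding it the two lines quoted above as the first snapshot, and a later reading as the second, returns the list of interrupt lines still counting; an interrupt that advances while the ifconfig RX/TX counters stay flat matches the behaviour reported earlier in the thread.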
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 07, 2014 at 01:12:48PM +0100, Russell King - ARM Linux wrote: > On Thu, Aug 07, 2014 at 11:11:06AM +, Mattis Lorentzon wrote: > > Russell, > > > > > Can you ascertain whether these stalls are a result of some failure of the > > > receive side or the transmit side - you should be able to tell that if > > > you watch > > > the packet counts via ifconfig on the stalled card. Also, it would be > > > useful to > > > know whether the FEC interrupt was firing. > > > > grep eth /proc/interrupts > > 151: 0 0 0 0 GIC 151 > > 2188000.ethernet > > 166:1205661 0 0 0 gpio-mxc 6 > > 2188000.ethernet > > > > The interrupt counter 166 increases regularly during the stalls. > > Ifconfig indicates that the RX and TX counters do not increase. > > Hmm, I'm slightly confused. On my iMX6Q, I have: > > 150: 581754 0 0 0 GIC 150 > 2188000.ethernet > 151: 0 0 0 0 GIC 151 > 2188000.ethernet > > In the DT file, we have: > > fec: ethernet@02188000 { > compatible = "fsl,imx6q-fec"; > reg = <0x02188000 0x4000>; > interrupts-extended = > <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, > <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; > clocks = <&clks 117>, <&clks 117>, <&clks > 190>; > clock-names = "ipg", "ahb", "ptp"; > status = "disabled"; > }; > > which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. > Yet you seem to have nothing registered against GIC 150, instead having > an interrupt against GPIO 6. > > This seems very odd, and as this is an on-SoC device, I don't see why > you would want to bind the interrupts for the FEC device any differently > to standard platforms. > > This could well be the cause of your stalls. > > What's GPIO 6 used for on your board? We have a second report of instability with the FEC today, and the problem board (wanboard) is also using GPIO1 6 for the ethernet IRQ. We have confirmation from the reporter that reverting the change (thus making the FEC use the standard interrupt) fixes their problem. 
Therefore, it seems that the workaround for ERR006687 is itself buggy. I'd be interested to hear whether removing the interrupts-extended = ... property from your board's DT file, thereby causing you to revert back to the default I list above, also fixes the instability you are seeing. Thanks.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net.
Re: Oops: 17 SMP ARM (v3.16-rc2)
Mattis, On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam wrote: > On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux > wrote: > >> Hmm, I'm slightly confused. On my iMX6Q, I have: >> >> 150: 581754 0 0 0 GIC 150 >> 2188000.ethernet >> 151: 0 0 0 0 GIC 151 >> 2188000.ethernet > > Same here on a mx6qsabresd. > >> In the DT file, we have: >> >> fec: ethernet@02188000 { >> compatible = "fsl,imx6q-fec"; >> reg = <0x02188000 0x4000>; >> interrupts-extended = >> <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, >> <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; >> clocks = <&clks 117>, <&clks 117>, <&clks >> 190>; >> clock-names = "ipg", "ahb", "ptp"; >> status = "disabled"; >> }; >> >> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. >> Yet you seem to have nothing registered against GIC 150, instead having >> an interrupt against GPIO 6. >> >> This seems very odd, and as this is an on-SoC device, I don't see why >> you would want to bind the interrupts for the FEC device any differently >> to standard platforms. >> >> This could well be the cause of your stalls. >> >> What's GPIO 6 used for on your board? > > On a imx6q sabreauto I also get: > > 151: 0 0 0 0 GIC 151 > 2188000.ethernet > 166: 4577 0 0 0 gpio-mxc 6 > 2188000.ethernet Could you remove 'interrupts-extended' from the FEC node and also MX6QDL_PAD_GPIO_6__ENET_IRQ from the pinctrl node and test again? -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 8/7/2014 7:38 AM, Fabio Estevam wrote:
> On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam wrote:
>
> ,but I am wondering if we should also do:
>
> --- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
> +++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
> @@ -66,6 +66,7 @@
>  pinctrl-0 = <&pinctrl_enet>;
>  phy-mode = "rgmii";
>  interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>,
> + <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>  <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
>  status = "okay";
>  };
> @@ -226,7 +227,7 @@
>  MX6QDL_PAD_RGMII_RD2__RGMII_RD2 0x1b0b0
>  MX6QDL_PAD_RGMII_RD3__RGMII_RD3 0x1b0b0
>  MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL 0x1b0b0
> - MX6QDL_PAD_GPIO_6__ENET_IRQ 0x000b1
> + MX6QDL_PAD_GPIO_6__ENET_IRQ 0x40b1
>
> Since the Workaround for erratum ERR006687 states that the SION bit
> needs to be used:
>
> "All of the interrupts can be selected by MUX and output to pad GPIO6.
> If GPIO6 is selected to output ENET interrupts and GPIO6 SION is set,
> the resulting GPIO interrupt will wake the system from Wait mode."

arch/arm/boot/dts/imx6q-pinfunc.h:#define MX6QDL_PAD_GPIO_6__ENET_IRQ 0x230 0x600 0x03c 0x11 0xff000609

So, the SION bit should already be set (0x11). But the other way works too.

Troy
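Editorial note on the 0x11 value Troy points to: it is the fourth field (the mux mode) of the pinfunc.h define. The sketch below decodes it under the assumption that the i.MX6 pad mux register keeps the ALT function in bits 2:0 and SION in bit 4. That layout, and the names `MUX_MODE_MASK`, `SION_BIT` and `decode_mux`, are not taken from this thread and should be checked against the reference manual.

```python
# Assumed layout of an i.MX6 pad mux value (not stated in the thread):
# bits 2:0 select the ALT function, bit 4 is SION (software input on).
MUX_MODE_MASK = 0x7
SION_BIT = 1 << 4


def decode_mux(value):
    """Split a mux-mode value into (alt_function, sion_enabled)."""
    return value & MUX_MODE_MASK, bool(value & SION_BIT)


# 0x11 is the mux-mode field of MX6QDL_PAD_GPIO_6__ENET_IRQ quoted above.
alt, sion = decode_mux(0x11)
print(f"ALT{alt}, SION {'set' if sion else 'clear'}")
```

If that layout holds, 0x11 selects ALT1 with SION already enabled, which would match Troy's remark that the extra pad-config bit is redundant.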
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam wrote: > On a imx6q sabreauto I also get: > > 151: 0 0 0 0 GIC 151 > 2188000.ethernet > 166: 4577 0 0 0 gpio-mxc 6 > 2188000.ethernet > > and the GPIO1_6 interrupt comes from this commit: > > commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3 > Author: Troy Kisky > Date: Fri Dec 20 11:47:12 2013 -0700 > > ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt. > > This works around a hardware bug. > > Signed-off-by: Troy Kisky > Signed-off-by: Shawn Guo Actually a more descriptive commit log can be found here: commit 6261c4c8f13eb91f733e8ba6d67c409a2e841667 Author: Troy Kisky Date: Fri Dec 20 11:47:11 2013 -0700 ARM: dts: imx6qdl-sabrelite: use GPIO_6 for FEC interrupt. This works around a hardware bug. From "Chip Errata for the i.MX 6Dual/6Quad" ERR006687 ENET: Only the ENET wake-up interrupt request can wake the system from Wait mode. The ENET block generates many interrupts. Only one of these interrupt lines is connected to the General Power Controller (GPC) block, but a logical OR of all of the ENET interrupts is connected to the General Interrupt Controller (GIC). When the system enters Wait mode, a normal RX Done or TX Done does not wake up the system because the GPC cannot see this interrupt. This impacts performance of the ENET block because its interrupts are serviced only when the chip exits Wait mode due to an interrupt from some other wake-up source. Before this patch, ping times of a Sabre Lite board are quite random: ping 192.168.0.13 -i.5 -c5 PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data. 
64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=15.7 ms 64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=14.4 ms 64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=13.4 ms 64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=12.4 ms 64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=11.4 ms === 192.168.0.13 ping statistics === 5 packets transmitted, 5 received, 0% packet loss, time 2004ms rtt min/avg/max/mdev = 11.431/13.501/15.746/1.508 ms After this patch: ping 192.168.0.13 -i.5 -c5 PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data. 64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=0.120 ms 64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=0.175 ms 64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=0.169 ms 64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=0.168 ms 64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=0.172 ms === 192.168.0.13 ping statistics === 5 packets transmitted, 5 received, 0% packet loss, time 1999ms rtt min/avg/max/mdev = 0.120/0.160/0.175/0.026 ms Also, apply same change to imx6qdl-nitrogen6x. This change may not be appropriate for all boards. Sabre Lite uses GPIO6 as a power down output for a ov5642 camera. As this expansion board does not yet work with mainline, this is not yet a conflict. It would be nice to have an alternative fix for boards where this is a problem. For example Sabre SD uses GPIO6 for I2C3_SDA. It also has long ping times currently. But cannot use this fix without giving up a touchscreen. Its ping times are also random. ping 192.168.0.19 -i.5 -c5 PING 192.168.0.19 (192.168.0.19) 56(84) bytes of data. 
64 bytes from 192.168.0.19: icmp_req=1 ttl=64 time=16.0 ms 64 bytes from 192.168.0.19: icmp_req=2 ttl=64 time=15.4 ms 64 bytes from 192.168.0.19: icmp_req=3 ttl=64 time=14.4 ms 64 bytes from 192.168.0.19: icmp_req=4 ttl=64 time=13.4 ms 64 bytes from 192.168.0.19: icmp_req=5 ttl=64 time=12.4 ms === 192.168.0.19 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 2003ms rtt min/avg/max/mdev = 12.451/14.369/16.057/1.316 ms Signed-off-by: Troy Kisky CC: Ranjani Vaidyanathan Signed-off-by: Shawn Guo ,but I am wondering if we should also do: --- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi +++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi @@ -66,6 +66,7 @@ pinctrl-0 = <&pinctrl_enet>; phy-mode = "rgmii"; interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>, + <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; status = "okay"; }; @@ -226,7 +227,7 @@ MX6QDL_PAD_RGMII_RD2__RGMII_RD2 0x1b0b0 MX6QDL_PAD_RGMII_RD3__RGMII_RD3 0x1b0b0 MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL 0x1b0b0 - MX6QDL_PAD_GPIO_6__ENET_IRQ 0x000b1 + MX6QDL_PAD_GPIO_6__E
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux wrote: > Hmm, I'm slightly confused. On my iMX6Q, I have: > > 150: 581754 0 0 0 GIC 150 > 2188000.ethernet > 151: 0 0 0 0 GIC 151 > 2188000.ethernet Same here on a mx6qsabresd. > In the DT file, we have: > > fec: ethernet@02188000 { > compatible = "fsl,imx6q-fec"; > reg = <0x02188000 0x4000>; > interrupts-extended = > <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, > <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; > clocks = <&clks 117>, <&clks 117>, <&clks > 190>; > clock-names = "ipg", "ahb", "ptp"; > status = "disabled"; > }; > > which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. > Yet you seem to have nothing registered against GIC 150, instead having > an interrupt against GPIO 6. > > This seems very odd, and as this is an on-SoC device, I don't see why > you would want to bind the interrupts for the FEC device any differently > to standard platforms. > > This could well be the cause of your stalls. > > What's GPIO 6 used for on your board? On a imx6q sabreauto I also get: 151: 0 0 0 0 GIC 151 2188000.ethernet 166: 4577 0 0 0 gpio-mxc 6 2188000.ethernet and the GPIO1_6 interrupt comes from this commit: commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3 Author: Troy Kisky Date: Fri Dec 20 11:47:12 2013 -0700 ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt. This works around a hardware bug. Signed-off-by: Troy Kisky Signed-off-by: Shawn Guo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Aug 07, 2014 at 11:11:06AM +, Mattis Lorentzon wrote: > Russell, > > > Can you ascertain whether these stalls are a result of some failure of the > > receive side or the transmit side - you should be able to tell that if you > > watch > > the packet counts via ifconfig on the stalled card. Also, it would be > > useful to > > know whether the FEC interrupt was firing. > > grep eth /proc/interrupts > 151: 0 0 0 0 GIC 151 > 2188000.ethernet > 166:1205661 0 0 0 gpio-mxc 6 > 2188000.ethernet > > The interrupt counter 166 increases regularly during the stalls. > Ifconfig indicates that the RX and TX counters do not increase. Hmm, I'm slightly confused. On my iMX6Q, I have: 150: 581754 0 0 0 GIC 150 2188000.ethernet 151: 0 0 0 0 GIC 151 2188000.ethernet In the DT file, we have: fec: ethernet@02188000 { compatible = "fsl,imx6q-fec"; reg = <0x02188000 0x4000>; interrupts-extended = <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>, <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>; clocks = <&clks 117>, <&clks 117>, <&clks 190>; clock-names = "ipg", "ahb", "ptp"; status = "disabled"; }; which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151. Yet you seem to have nothing registered against GIC 150, instead having an interrupt against GPIO 6. This seems very odd, and as this is an on-SoC device, I don't see why you would want to bind the interrupts for the FEC device any differently to standard platforms. This could well be the cause of your stalls. What's GPIO 6 used for on your board? -- FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up according to speedtest.net. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
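Editorial note: the arithmetic Russell uses to get from the DT cells to the numbers in /proc/interrupts can be sketched as follows. The helper name `gic_hwirq` is made up for illustration; the offset of 32 is the one Russell quotes ("first SPI").

```python
# On the GIC, interrupt IDs 0-31 are per-CPU (SGIs and PPIs), so shared
# peripheral interrupts (SPIs) start at ID 32.
GIC_SPI_BASE = 32


def gic_hwirq(spi_number):
    """Map the SPI number from a DT cell such as <&intc 0 118 ...> to its GIC ID."""
    return spi_number + GIC_SPI_BASE


for spi in (118, 119):
    print(f"SPI {spi} -> GIC {gic_hwirq(spi)}")
```

A board using the default FEC binding should therefore show 2188000.ethernet against GIC 150 and GIC 151, which is what Russell and Fabio see and what is missing from Mattis's listing.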
RE: Oops: 17 SMP ARM (v3.16-rc2)
Russell,

> Can you ascertain whether these stalls are a result of some failure of the
> receive side or the transmit side - you should be able to tell that if you
> watch the packet counts via ifconfig on the stalled card. Also, it would be
> useful to know whether the FEC interrupt was firing.

grep eth /proc/interrupts
151: 0 0 0 0 GIC 151 2188000.ethernet
166: 1205661 0 0 0 gpio-mxc 6 2188000.ethernet

The interrupt counter 166 increases regularly during the stalls. Ifconfig indicates that the RX and TX counters do not increase.

> I hope you have some kind of serial console on these cards?

Yes, indeed. Local stimuli seem to be able to unstall the network in a somewhat random fashion. Running e.g. ifconfig or ping locally may make the network responsive, either immediately or after up to about half a minute. However, it usually degenerates again to a complete stall within seconds.

Without local stimuli the network does not appear to recover at all. The card does not even respond to pings (again, most often without any apparent error messages).

Running both of the following commands in parallel from the FC server seems to trigger the problem within minutes (please note that the arm card stops responding to both ping and ssh):

# while :; do ssh arm-card echo Ok; done
# ping arm-card

We have noticed the same problem on both the i.MX6 and the Zynq cards (using KSZ9021 and Cadence GEM drivers). However, the number of iterations required to trigger the problem varies. Sometimes it might stall after less than 100, but in other cases the stalls begin after nearly 1 iterations. Once stalled (and unstalled after stimuli), the network on that particular card degenerates a lot more often.

Apart from the kernel, IP numbers and MAC addresses, the software configurations are identical between the Zynq and the i.MX6. Perhaps the fault is unrelated to the Freescale driver?

> Hmm. Okay, I think the first thing we need to do is to work out why the
> silent stalls are happening.
Would you have any ideas on what to check next?

Best regards,
Mattis Lorentzon
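Editorial note: the check Mattis describes (is the FEC interrupt counter still moving during a stall?) can be scripted. This is a minimal sketch, assuming the /proc/interrupts layout shown in this thread (per-CPU counts, then the controller name, then the device name). `parse_interrupts` and `SAMPLE` are names invented for the example.

```python
def parse_interrupts(text, device):
    """Return {irq_label: total count over all CPUs} for lines naming `device`."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep or not rest.rstrip().endswith(device):
            continue
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # first non-numeric field is the interrupt controller name
            total += int(field)
        totals[label.strip()] = total
    return totals


# The two lines reported for the stalled card.
SAMPLE = """\
151:          0          0          0          0       GIC 151  2188000.ethernet
166:1205661          0          0          0  gpio-mxc   6  2188000.ethernet
"""

print(parse_interrupts(SAMPLE, "2188000.ethernet"))
```

On a live system the text would come from reading /proc/interrupts twice, a few seconds apart; a counter that keeps rising while the ifconfig RX and TX counts stay flat is the pattern reported above.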
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Wed, Aug 06, 2014 at 11:10:06AM +, Mattis Lorentzon wrote: > Russell, > > > What is on the other end of the link? > > 16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20 > machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05). > > There may be multiple problems. The backtrace has only been seen a few > times, on two different cards. Most of the time, the network for a random > card just stalls without any visible backtrace or error messages. The other > cards seem to be unaffected when this happens. Can you ascertain whether these stalls are a result of some failure of the receive side or the transmit side - you should be able to tell that if you watch the packet counts via ifconfig on the stalled card. Also, it would be useful to know whether the FEC interrupt was firing. I hope you have some kind of serial console on these cards? > > What I would like to do is to stamp each packet in some way with an > > identifier marking its ring position, and then monitor the network to find > > out > > whether the packet at slot 85 was actually transmitted - that's made > > slightly > > harder because packets may be dropped at the receiver when operating in > > promisc mode. This would then allow us to work out some likely causes. > > We would be glad to run this test on our setup, do you have more detailed > information on how to set it up? One of the problems is to find some way to stamp each packet with a 10-bit number without having any side effects. I guess one possibility would be to overwrite the source MAC address on transmit, which hopefully should not cause any side effects. > After a network stall, we usually have to powercycle the ARM hardware to > get it back to a usable state. These stalls last at least several minutes, > perhaps indefinitely. It does not seem to recover properly, and is no longer > reachable via the network. Hmm. 
Okay, I think the first thing we need to do is to work out why the silent stalls are happening.
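Editorial note: Russell's idea of stamping each packet with a 10-bit ring index via the source MAC address could look roughly like the sketch below. The thread does not say which bits would be used, so placing the index in the low 10 bits of the address is purely an assumption, and `stamp_mac` and `read_slot` are hypothetical helpers, not driver code.

```python
SLOT_BITS = 10                      # a 10-bit number, as suggested in the thread
SLOT_MASK = (1 << SLOT_BITS) - 1


def stamp_mac(mac, slot):
    """Overwrite the low 10 bits of a MAC address with a TX ring slot number."""
    if not 0 <= slot <= SLOT_MASK:
        raise ValueError("slot does not fit in 10 bits")
    raw = bytearray(bytes.fromhex(mac.replace(":", "")))
    low16 = int.from_bytes(raw[4:], "big")
    low16 = (low16 & ~SLOT_MASK & 0xFFFF) | slot
    raw[4:] = low16.to_bytes(2, "big")
    return ":".join(f"{b:02x}" for b in raw)


def read_slot(mac):
    """Recover the slot number from a stamped MAC seen in a packet capture."""
    raw = bytes.fromhex(mac.replace(":", ""))
    return int.from_bytes(raw[4:], "big") & SLOT_MASK


stamped = stamp_mac("00:11:22:33:44:55", 378)
print(stamped, read_slot(stamped))
```

A capture on the switch side could then be searched for the slot that the ring dump shows as never completed.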
RE: Oops: 17 SMP ARM (v3.16-rc2)
Russell, > What is on the other end of the link? 16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20 machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05). There may be multiple problems. The backtrace has only been seen a few times, on two different cards. Most of the time, the network for a random card just stalls without any visible backtrace or error messages. The other cards seem to be unaffected when this happens. > What I would like to do is to stamp each packet in some way with an > identifier marking its ring position, and then monitor the network to find out > whether the packet at slot 85 was actually transmitted - that's made slightly > harder because packets may be dropped at the receiver when operating in > promisc mode. This would then allow us to work out some likely causes. We would be glad to run this test on our setup, do you have more detailed information on how to set it up? > Note that after the transmit watchdog, the interface should recover and start > operating normally again - and that should not take "several minutes." After a network stall, we usually have to powercycle the ARM hardware to get it back to a usable state. These stalls last at least several minutes, perhaps indefinitely. It does not seem to recover properly, and is no longer reachable via the network. Best regards, Mattis Lorentzon *** Consider the environment before printing this message. To read Autoliv's Information and Confidentiality Notice, follow this link: http://www.autoliv.com/disclaimer.html *** -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Tue, Aug 05, 2014 at 01:31:29PM +, Mattis Lorentzon wrote: > We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are > currently running some stability tests. > > During our first test round we triggered a timeout which caused the fec driver > to become unresponsive for several minutes. The attached backtrace was > shown when the hardware was rebooted. What is on the other end of the link? > [ cut here ] > WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 > dev_watchdog+0x270/0x27c() > NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out ... > fec 2188000.ethernet eth0: TX ring dump > Nr SC addr len SKB > 00x1c00 0x 66 (null) ... > 830x1c00 0x 66 (null) > 84 H 0x1c00 0x 66 (null) > 850x9c00 0x2e205000 66 9e384f00 > 860x1c00 0x2e204800 66 9e384d80 > 870x1c00 0x2e204000 66 9e384180 ... > 3760x1c00 0x2e252800 66 81cf6180 > 3770x1c00 0x2e253000 66 81cf6240 > 378 S 0x1c00 0x 66 (null) So, the software would insert the next packet into slot 378. However, the slots from 85 to 377 have not been reaped, despite those in 86 to 377 allegedly having been sent. This is because the entry in slot 85 shows that it has yet to be sent. I've no idea what causes this; it looks like there's something screwed with the hardware which causes the transmitter to skip an entry in the ring under certain circumstances. As I've never been able to reproduce it here, I've not been able to investigate it. What I would like to do is to stamp each packet in some way with an identifier marking its ring position, and then monitor the network to find out whether the packet at slot 85 was actually transmitted - that's made slightly harder because packets may be dropped at the receiver when operating in promisc mode. This would then allow us to work out some likely causes. Note that after the transmit watchdog, the interface should recover and start operating normally again - and that should not take "several minutes." 
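Editorial note: Russell's reading of the ring dump (slot 85 shows 0x9c00 while its neighbours show 0x1c00) amounts to checking one status bit per descriptor. The sketch below assumes that bit 0x8000 of the status word is the "ready" bit, meaning the descriptor is still owned by the controller. The slot and status columns are run together in this archive, so the values below are a reading of the dump rather than a verbatim copy.

```python
TX_READY = 0x8000  # assumed: set while the descriptor is waiting to be sent


def unsent_slots(ring):
    """Return the slot numbers whose status word still has the ready bit set."""
    return [slot for slot, status in ring if status & TX_READY]


# A few entries from between the "H" marker (slot 84) and the "S" marker (slot 378).
ring = [(84, 0x1c00), (85, 0x9c00), (86, 0x1c00), (87, 0x1c00), (377, 0x1c00)]
print(unsent_slots(ring))
```

Only slot 85 is flagged, which is why the entries queued behind it cannot be reaped even though their own status words look complete.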
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Fabio,

> Could this problem be the same one as reported at:
> http://www.spinics.net/lists/arm-kernel/msg347914.html ?

The problem you link to describes a permanent issue, whereas our problem seems to be sporadic: most of our tests work fine (at least for a while).

> Which Ethernet PHY do you use? Do you have pull-up in the MDIO line?

Our hardware has the KSZ9021RN PHY, so the MDIO line should be pulled up.

Do you know if there are debug options that could help us determine the cause of the timeout?

Best regards,
Mattis Lorentzon
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Tue, Aug 5, 2014 at 10:31 AM, Mattis Lorentzon wrote:
> We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
> currently running some stability tests.
>
> During our first test round we triggered a timeout which caused the fec driver
> to become unresponsive for several minutes. The attached backtrace was
> shown when the hardware was rebooted.

Could this problem be the same one as reported at:
http://www.spinics.net/lists/arm-kernel/msg347914.html ?

Which Ethernet PHY do you use? Do you have pull-up in the MDIO line?
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Russell! > Now because things have changed during the last merge window, I've got an > even bigger problem sorting through that patch set and getting it back into a > submittable state. I've just sent out v2 for it onto the > net...@vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from testers, > or > acks. I'd very much like to have some testing of it, so if you want to try > it out, > I can provide you with a git URL, patches or a combined patch. We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are currently running some stability tests. During our first test round we triggered a timeout which caused the fec driver to become unresponsive for several minutes. The attached backtrace was shown when the hardware was rebooted. Best regards, Mattis Lorentzon *** Consider the environment before printing this message. To read Autoliv's Information and Confidentiality Notice, follow this link: http://www.autoliv.com/disclaimer.html *** [ cut here ] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c() NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out Modules linked in: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2+ #7 Backtrace: [<8001234c>] (dump_backtrace) from [<80012628>] (show_stack+0x18/0x1c) r6:0108 r5: r4:806ac3dc r3: [<80012610>] (show_stack) from [<804d2c60>] (dump_stack+0x8c/0x9c) [<804d2bd4>] (dump_stack) from [<80025c18>] (warn_slowpath_common+0x74/0x90) r5:0009 r4:8068fd70 [<80025ba4>] (warn_slowpath_common) from [<80025c6c>] (warn_slowpath_fmt+0x38/0x40) r8:806900c0 r7:9e160254 r6:9f4ec800 r5:9e16 r4: [<80025c38>] (warn_slowpath_fmt) from [<803f4578>] (dev_watchdog+0x270/0x27c) r3:9e16 r2:8061ad58 [<803f4308>] (dev_watchdog) from [<8002fee8>] (call_timer_fn+0x74/0xec) r10:8068e008 r9:9e16 r8:803f4308 r7:0100 r6:8068e000 r5:0001 r4:8068fdd8 [<8002fe74>] (call_timer_fn) from [<80030b2c>] (run_timer_softirq+0x1d4/0x254) r10:803f4308 r9:806900c0 
r8:9e16 r7: r6:8068fe28 r5:806cc140 r4:9e160284 [<80030958>] (run_timer_softirq) from [<8002a124>] (__do_softirq+0x168/0x2f0) r10:0001 r9:80690080 r8:4001 r7:8068e000 r6:0100 r5:80690084 r4:0020 [<80029fbc>] (__do_softirq) from [<8002a5d0>] (irq_exit+0xc8/0x10c) r10:8068e000 r9:806cafd9 r8:0001 r7:f4000100 r6: r5:001d r4:8068e008 [<8002a508>] (irq_exit) from [<8000f304>] (handle_IRQ+0x4c/0x98) r5:001d r4:8068ae14 [<8000f2b8>] (handle_IRQ) from [<8000860c>] (gic_handle_irq+0x34/0x64) r6:8068ff20 r5:80696a40 r4:f400010c r3:00a0 [<800085d8>] (gic_handle_irq) from [<80013184>] (__irq_svc+0x44/0x58) Exception stack(0x8068ff20 to 0x8068ff68) ff20: 0001 0001 806996f0 8069652c 806964d8 806cafd9 804db068 ff40: 0001 806cafd9 8068e000 8068ff74 8068ff68 80061a1c 8000f664 ff60: 200f0013 r7:8068ff54 r6: r5:200f0013 r4:8000f664 [<8000f638>] (arch_cpu_idle) from [<8005d154>] (cpu_startup_entry+0x10c/0x164) [<8005d048>] (cpu_startup_entry) from [<804cdefc>] (rest_init+0xc8/0xd8) r7:80683450 r3: [<804cde34>] (rest_init) from [<80651c68>] (start_kernel+0x3a0/0x3ac) r5:0001 r4:806965d0 [<806518c8>] (start_kernel) from [<10008074>] (0x10008074) ---[ end trace b51f6196c5e036f0 ]--- fec 2188000.ethernet eth0: TX ring dump Nr SC addr len SKB 00x1c00 0x 66 (null) 10x1c00 0x 66 (null) 20x1c00 0x 66 (null) 30x1c00 0x 66 (null) 40x1c00 0x 66 (null) 50x1c00 0x 66 (null) 60x1c00 0x 66 (null) 70x1c00 0x 66 (null) 80x1c00 0x 66 (null) 90x1c00 0x 66 (null) 100x1c00 0x 66 (null) 110x1c00 0x 66 (null) 120x1c00 0x 66 (null) 130x1c00 0x 66 (null) 140x1c00 0x 66 (null) 150x1c00 0x 66 (null) 160x1c00 0x 66 (null) 170x1c00 0x 66 (null) 180x1c00 0x 66 (null) 190x1c00 0x 66 (null) 200x1c00 0x 66 (null) 210x1c00 0x 66 (null) 220x1c00 0x 66 (null) 230x1c00 0x 66 (null) 240x1c00 0x 66 (null) 250x1c00 0x 66 (null) 260x1c00 0x 66 (null) 270x1c00 0x 66 (null) 280x1c00 0x 66 (null) 290x1c00 0x 66 (null) 300x1c00 0x 66 (null) 310x1c00 0x 66 (null) 320x1c00 0x 66 (null) 330x1c00 0x 66 (null) 340x1c00 0x 66 (null) 
350x1c00 0x 66 (null) 360x
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Russell,

> > The initial version (marked RFC) attracted very little interest from
> > testers, or acks. I'd very much like to have some testing of it, so
> > if you want to try it out, I can provide you with a git URL, patches
> > or a combined patch.
>
> Sure! A combined gzip patch attachment is fine. Git over HTTP probably
> works too.

We are still interested in trying out your patches to improve network performance. We can do some testing this week and in August.

Best regards,
Fredrik
Re: Oops: 17 SMP ARM (v3.16-rc2)
On 06/30/2014 07:30 AM, Fredrik Noring wrote:
>> On Fri, Jun 27, 2014 at 04:16:57PM +, Fredrik Noring wrote:
>>> Please find below a trace that appeared once with 3.16-rc2. Perhaps it
>>> is of some interest?
>>
>> It's not that serious... I know that the FEC ethernet driver is horrendously
>> racy (I have had a patch set for about the last six months which fixes some
>> of its problems) but as I've had a lot of patches to deal with, and it's been
>> pushed to the back of the queue...
>>
>> The races don't lead to data corruption though, merely timeouts and some
>> lost packets.
>
> It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a properly
> working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to do a lot
> better. No crashes so far with v3.16-rc2!

Did you narrow it down to a particular GCC bug? The symptoms you reported remind me of:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Sadly, unpatched GCC 4.8.1 and 4.8.2 are unsuitable for building ARM kernels.
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Russell, It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a properly working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to do a lot better. No crashes so far with v3.16-rc2! All the best, Fredrik > -Original Message- > Hi Fredrik, > > On Fri, Jun 27, 2014 at 04:16:57PM +, Fredrik Noring wrote: > > Please find below a trace that appeared once with 3.16-rc2. Perhaps it > > is of some interest? > > It's not that serious... I know that the FEC ethernet driver is horrendously > racy (I have had a patch set for about the last six months which fixes some of > its problems) but as I've had a lot of patches to deal with, and it's been > pushed to the back of the queue... > > The races don't lead to data corruption though, merely timeouts and some > lost packets. > > Now because things have changed during the last merge window, I've got an > even bigger problem sorting through that patch set and getting it back into a > submittable state. I've just sent out v2 for it onto the > net...@vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from testers, > or > acks. I'd very much like to have some testing of it, so if you want to try it > out, I can provide you with a git URL, patches or a combined patch. > > -- > FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly > improving, and getting towards what was expected from it. *** Consider the environment before printing this message. To read Autoliv's Information and Confidentiality Notice, follow this link: http://www.autoliv.com/disclaimer.html *** -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Russell, > -Original Message- > It's not that serious... I know that the FEC ethernet driver is horrendously > racy (I have had a patch set for about the last six months which fixes some of > its problems) but as I've had a lot of patches to deal with, and it's been > pushed to the back of the queue... > > The races don't lead to data corruption though, merely timeouts and some > lost packets. The serial port (uart1) and Ethernet are essentially the only things we use. No disks, no graphics, no USB, etc. If not the Ethernet driver, what else is likely to crash NFS so badly? Also, we are happy to change our config if that would simplify things: http://lkml.iu.edu/hypermail/linux/kernel/1406.3/01488/config.gz > Now because things have changed during the last merge window, I've got an > even bigger problem sorting through that patch set and getting it back into a > submittable state. I've just sent out v2 for it onto the > net...@vger.kernel.org mailing list. > > The initial version (marked RFC) attracted very little interest from testers, > or > acks. I'd very much like to have some testing of it, so if you want to try it > out, I can provide you with a git URL, patches or a combined patch. Sure! A combined gzip patch attachment is fine. Git over HTTP probably works too. All the best, Fredrik *** Consider the environment before printing this message. To read Autoliv's Information and Confidentiality Notice, follow this link: http://www.autoliv.com/disclaimer.html *** -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
RE: Oops: 17 SMP ARM (v3.16-rc2)
Hi Russell,

> On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> > That's a similar workload to the one which is mentioned in the
> > previous report. I've just set a similar transfer going, but this
> > will be a 16GB file.
>
> I've run this transfer several times, but so far I've been unable to reproduce the
> issue here.

Many thanks for testing this. We attempted to bisect, but unfortunately the result was not conclusive. One reason might be that the config had to be updated during the process, and so we did not end up with the exact same configuration (things like e.g. IMX_SDMA in DMA_ENGINE etc.). Some runs deadlocked without any visible Oops or printout. Some versions did not have an entirely working console configuration.

Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of some interest?

(We also had memtester run for days on the i.MX6 hardware, without issues.)

All the best,
Fredrik

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2 #19
Backtrace:
[<80012390>] (dump_backtrace) from [<8001266c>] (show_stack+0x18/0x1c)
 r6:0108 r5: r4:8064e29c r3:
[<80012654>] (show_stack) from [<8049791c>] (dump_stack+0x8c/0x9c)
[<80497890>] (dump_stack) from [<80024f4c>] (warn_slowpath_common+0x74/0x90)
 r5:0009 r4:80631d70
[<80024ed8>] (warn_slowpath_common) from [<80024fa0>] (warn_slowpath_fmt+0x38/0x40)
 r8:806320c0 r7:9d85a254 r6:9d879000 r5:9d85a000 r4:
[<80024f6c>] (warn_slowpath_fmt) from [<803b8ff0>] (dev_watchdog+0x270/0x27c)
 r3:9d85a000 r2:805c4790
[<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)
 r10:80630008 r9:9d85a000 r8:803b8d80 r7:0100 r6:8063 r5:0001 r4:80631dd8
[<8002f214>] (call_timer_fn) from [<8002fec8>] (run_timer_softirq+0x1d4/0x254)
 r10:803b8d80 r9:806320c0 r8:9d85a000 r7: r6:80631e28 r5:80667040 r4:9d85a284
[<8002fcf4>] (run_timer_softirq) from [<8002945c>] (__do_softirq+0x17c/0x30c)
 r10:0001 r9:80632080 r8:4001 r7:8063 r6:0100 r5:80632084 r4:0020
[<800292e0>] (__do_softirq) from [<80029920>] (irq_exit+0xd0/0x114)
 r10:8063 r9:80665f19 r8:0001 r7:f4000100 r6: r5:80630008 r4:8063
[<80029850>] (irq_exit) from [<8000f348>] (handle_IRQ+0x4c/0x98)
 r5:001d r4:8062ce44
[<8000f2fc>] (handle_IRQ) from [<80008614>] (gic_handle_irq+0x34/0x64)
 r6:80631f20 r5:80638a40 r4:f400010c r3:00a0
[<800085e0>] (gic_handle_irq) from [<800131c4>] (__irq_svc+0x44/0x58)
Exception stack(0x80631f20 to 0x80631f68)
1f20: 0001 0001 8063b6f0 8063852c 806384d8 80665f19 804a0040
1f40: 0001 80665f19 8063 80631f74 80631f68 800614b8 8000f6a8
1f60: 200f0013
 r7:80631f54 r6: r5:200f0013 r4:8000f6a8
[<8000f67c>] (arch_cpu_idle) from [<8005cbf8>] (cpu_startup_entry+0x10c/0x164)
[<8005caec>] (cpu_startup_entry) from [<80492b68>] (rest_init+0xc8/0xd8)
 r7:80625028 r3:
[<80492aa0>] (rest_init) from [<805f6c5c>] (start_kernel+0x39c/0x3a8)
 r5:0001 r4:806385d0
[<805f68c0>] (start_kernel) from [<10008074>] (0x10008074)
---[ end trace a7b7109ab2d04e11 ]---
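When comparing several traces like the one above, it can help to reduce each to its chain of function names. The following is a minimal sketch, assuming the ARM "[<addr>] (callee) from [<addr>] (caller+offset/size)" frame format shown in this trace; the names FRAME_RE, parse_frames and call_chain are hypothetical helpers, not part of any kernel tooling.

```python
import re

# One backtrace frame, e.g.
#   [<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)
# The "+offset/size" part is optional: the outermost frame in the trace
# above is a bare address, "(0x10008074)".
FRAME_RE = re.compile(
    r"\[<(?P<callee_addr>[0-9a-f]+)>\] \((?P<callee>[^)]+)\) from "
    r"\[<(?P<return_addr>[0-9a-f]+)>\] \((?P<caller>[^)+]+)"
    r"(?:\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+))?\)"
)


def parse_frames(text):
    """Return a list of (callee, caller, offset_in_caller) tuples.

    Register dump lines ("r6:... r5:...") do not match and are skipped.
    offset_in_caller is None when the caller is a bare address.
    """
    frames = []
    for match in FRAME_RE.finditer(text):
        offset = match.group("offset")
        frames.append((
            match.group("callee"),
            match.group("caller"),
            int(offset, 16) if offset is not None else None,
        ))
    return frames


def call_chain(text):
    """Return function names from the innermost frame outwards."""
    frames = parse_frames(text)
    if not frames:
        return []
    return [frames[0][0]] + [caller for _, caller, _ in frames]


SAMPLE = (
    "[<80024f6c>] (warn_slowpath_fmt) from [<803b8ff0>] (dev_watchdog+0x270/0x27c)\n"
    " r3:9d85a000 r2:805c4790\n"
    "[<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)\n"
)

print(" <- ".join(call_chain(SAMPLE)))
```

Because the pattern is applied with finditer, it also works on a trace whose line breaks were lost in an archive.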
Re: Oops: 17 SMP ARM (v3.16-rc2)
Hi Fredrik,

On Fri, Jun 27, 2014 at 04:16:57PM +, Fredrik Noring wrote:
> Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
> some interest?

It's not that serious... I know that the FEC ethernet driver is horrendously racy (I have had a patch set for about the last six months which fixes some of its problems) but as I've had a lot of patches to deal with, and it's been pushed to the back of the queue...

The races don't lead to data corruption though, merely timeouts and some lost packets.

Now because things have changed during the last merge window, I've got an even bigger problem sorting through that patch set and getting it back into a submittable state. I've just sent out v2 for it onto the net...@vger.kernel.org mailing list.

The initial version (marked RFC) attracted very little interest from testers, or acks. I'd very much like to have some testing of it, so if you want to try it out, I can provide you with a git URL, patches or a combined patch.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jun 26, 2014 at 02:44:52PM +, Mattis Lorentzon wrote:
> > We have managed to trigger the Oops by just transferring a large file
> > over nfs
> >   cat /mnt/foo > /dev/null
> > where foo is a file that is approximately 2 GB. There may be some
> > packet losses on this network, perhaps this differs from your workload?
>
> That's a similar workload to the one which is mentioned in the previous
> report. I've just set a similar transfer going, but this will be a 16GB
> file.

I've run this transfer several times, but so far I've been unable to reproduce the issue here.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Thu, Jun 26, 2014 at 02:44:52PM +, Mattis Lorentzon wrote:
> Thank you for your reply,
>
> > On Wed, Jun 25, 2014 at 01:55:05PM +, Mattis Lorentzon wrote:
> > > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> > > Brodkorb for v3.15-rc4.
> > > https://lkml.org/lkml/2014/5/9/330
> >
> > This URL returns no useful information. I find that lkml.org is broken more
> > times than not in recent years. Please use a different archive site when
> > referring to posts, thanks.
>
> http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

I remember that report, but it was never resolved as I think no one has any idea what is causing these, and no one has any idea where to start looking.

> We have managed to trigger the Oops by just transferring a large file
> over nfs
>   cat /mnt/foo > /dev/null
> where foo is a file that is approximately 2 GB. There may be some
> packet losses on this network, perhaps this differs from your workload?

That's a similar workload to the one which is mentioned in the previous report. I've just set a similar transfer going, but this will be a 16GB file.

> We have done some more investigations, please find it in this mail:
>
> http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

Yes, I saw that before I replied, and my reply was written with that message in mind. That's what prompted this paragraph in my previous reply:

"Your other oops dumps also show various other functions apparently returning 0x. I can't believe that there's more than one bug doing this, so I doubt the problem is in these functions. Something else must be going on."

One of the problems is that there's so much work going on with the kernel by many different parties, pulling it in various directions, that no one really has an overview of all the changes, and so no one has much of a feel for what could be the cause of weird bugs like this.

I don't know what to suggest - you could try using git bisect to see if you can track it down to a particular commit, but it sounds like that's going to be very time consuming. You mentioned that 3.12 doesn't show the bug, but 3.13 does - so start off telling git bisect that 3.12 is "good" and 3.13 is "bad".

Hopefully there won't be too many breakages during the 3.13 merge window (between 3.12 and 3.13-rc1), but I don't have much faith in that; people seem to have a habit of holding back fixes until -rc1, which makes _exactly_ this kind of bug much harder for people like yourselves to track down - or maybe even impossible.

I'm afraid I can't offer very much help beyond this until either I can reproduce it, or someone manages to identify a particular change which caused this.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
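For readers unfamiliar with the suggestion above: git bisect is driven with "git bisect start", "git bisect good v3.12" and "git bisect bad v3.13", after which git checks out a commit roughly halfway between the two and asks for a verdict. The underlying idea is a binary search, which can be sketched as follows. This is an illustration only: it assumes a linear list of commits (real kernel history has merges, which git handles itself), and bisect_first_bad, commits and is_bad are hypothetical names.

```python
def bisect_first_bad(commits, is_bad):
    """Binary search for the first bad commit.

    Assumptions (all must hold for the answer to be meaningful):
      - commits[0] is known good and commits[-1] is known bad,
      - once a commit is bad, every later commit is bad too,
      - is_bad() gives a reliable verdict for the commit it is given.
    Returns (first_bad_commit, number_of_tests_run).
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(commits[mid]):
            bad = mid      # the regression is at mid or earlier
        else:
            good = mid     # the regression is after mid
    return commits[bad], tests


# Hypothetical history of 1001 commits where commit 700 introduced the bug.
commits = list(range(1001))
first_bad, tests = bisect_first_bad(commits, lambda c: c >= 700)
print(first_bad, tests)
```

The third assumption is the weak point for a sporadic failure like this one: a run that happens not to crash is reported as good and sends the search in the wrong direction, so each step usually needs a long or repeated test run.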
RE: Oops: 17 SMP ARM (v3.16-rc2)
Thank you for your reply,

> On Wed, Jun 25, 2014 at 01:55:05PM +, Mattis Lorentzon wrote:
> > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> > Brodkorb for v3.15-rc4.
> > https://lkml.org/lkml/2014/5/9/330
>
> This URL returns no useful information. I find that lkml.org is broken more
> times than not in recent years. Please use a different archive site when
> referring to posts, thanks.

http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

> I have had two iMX6 platforms running root-NFS for about the last six to nine
> months with various workloads, and have never seen this oops.
> Unfortunately, the description above gives very little information for what
> the mechanism to trigger this bug may be. For example, if I wanted to
> reproduce it, what would I need to do?

We have managed to trigger the Oops by just transferring a large file over nfs

  cat /mnt/foo > /dev/null

where foo is a file that is approximately 2 GB. There may be some packet losses on this network, perhaps this differs from your workload?

> > The error is sporadic and it seems to occur more frequently when using
> > perf.
>
> So it occurs when not using perf?

Yes, certainly, see above.

We have done some more investigations, please find it in this mail:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

The Oops seems to have been introduced somewhere between v3.12 and v3.13:

- The Oops is reproducible within seconds when running Linux 3.16-rc2.
- We have observed the Oops on 8 different hardware units and two different chipsets (Freescale i.MX6 and Xilinx Zynq).
- The Oops has not been seen on Linux 3.12 so it appears to be good.
- The Oops has been seen on Linux 3.13, 3.14, 3.15, 3.16-rc2 so these appear to be bad.

Configs and a couple of Oops reports are attached to the linked mail.

Best regards,
Mattis Lorentzon
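The reproduction workload described in this message, "cat /mnt/foo > /dev/null", is simply a sequential read of a large file with the data thrown away. A minimal sketch of the same workload with a byte count and timing, which may be convenient when repeating the test across kernel versions, could look like this. The function name read_and_discard and the 1 MiB chunk size are assumptions made for illustration; the path is whatever file sits on the NFS mount.

```python
import time


def read_and_discard(path, chunk_size=1 << 20):
    """Read a file from start to end, discarding the data.

    Returns (bytes_read, elapsed_seconds). Equivalent in effect to
    "cat path > /dev/null"; chunk_size is an arbitrary choice.
    """
    total = 0
    start = time.monotonic()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Calling read_and_discard("/mnt/foo") in a loop gives the repeated transfer used in the thread, with the throughput of each pass available for comparison.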
Re: Oops: 17 SMP ARM (v3.16-rc2)
On Wed, Jun 25, 2014 at 01:55:05PM +, Mattis Lorentzon wrote:
> Hello kernel people,

You may wish to also copy linux-arm-ker...@lists.infradead.org, which is where ARM kernel people are.

> I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> Brodkorb for v3.15-rc4.
> https://lkml.org/lkml/2014/5/9/330

This URL returns no useful information. I find that lkml.org is broken more times than not in recent years. Please use a different archive site when referring to posts, thanks.

> We are running a benchmark application, sometimes using perf, with heavy
> traffic over NFS.

I have had two iMX6 platforms running root-NFS for about the last six to nine months with various workloads, and have never seen this oops. Unfortunately, the description above gives very little information for what the mechanism to trigger this bug may be. For example, if I wanted to reproduce it, what would I need to do?

> The error is sporadic and it seems to occur more frequently when using perf.

So it occurs when not using perf?

> Linux imx6-test0 3.16.0-rc2+ #1 SMP Wed Jun 25 15:04:16 CEST 2014 armv7l
> armv7l armv7l GNU/Linux
>
> Any help is greatly appreciated.
>
> Best regards,
> Mattis Lorentzon
>
> Unable to handle kernel paging request at virtual address
> pgd = 9e338000
> [] *pgd=2fffd821, *pte=, *ppte=
> Internal error: Oops: 17 [#1] SMP ARM
> Modules linked in:
> CPU: 0 PID: 146 Comm: stereo Not tainted 3.16.0-rc2+ #1
> task: 9e07a700 ti: 81c42000 task.ti: 81c42000
> PC is at find_get_entry+0x60/0xfc
> LR is at radix_tree_lookup_slot+0x1c/0x2c
> pc : [<800a34d8>]  lr : [<80290448>]  psr: a013
> sp : 81c43d98 ip : fp : 81c43dcc
> r10: 0001 r9 : 9e30e3c0 r8 : 02a7
> r7 : 9f3758a0 r6 : r5 : 0001 r4 :
> r3 : 81c43d84 r2 : r1 : 02a7 r0 :
...
> Code: e1a01008 eb07b3d6 e350 0a1c (e5904000)

Right, so radix_tree_lookup_slot returned 0x. I've no idea how that happened, and I'm not about to try reading and trying to understand that code. However, as that is generic code, I find it unlikely that the code is buggy. So, I suspect something else must be going on here, such as a compiler bug or memory corruption.

Your other oops dumps also show various other functions apparently returning 0x. I can't believe that there's more than one bug doing this, so I doubt the problem is in these functions. Something else must be going on.

--
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
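As a side note on the subject line: the "17" in "Internal error: Oops: 17" is the ARM fault status register value, printed in hexadecimal. A minimal sketch of decoding it follows, assuming the ARMv7 short-descriptor (non-LPAE) layout: status in bits [3:0] plus bit 10, domain in bits [7:4], write flag in bit 11. The table is a deliberately partial subset, and decode_fsr is a hypothetical helper name.

```python
# Partial table of short-descriptor fault status codes (assumption: non-LPAE
# ARMv7). Only the translation and permission faults are listed here.
FAULT_STATUS = {
    0x05: "section translation fault",
    0x07: "page translation fault",
    0x0D: "section permission fault",
    0x0F: "page permission fault",
}


def decode_fsr(fsr):
    """Split a fault status value into status, domain and access type."""
    # Bit 10 of the register is the fifth (top) bit of the status code.
    status = (fsr & 0xF) | (((fsr >> 10) & 1) << 4)
    return {
        "status": FAULT_STATUS.get(status, "other (code 0x%02x)" % status),
        "domain": (fsr >> 4) & 0xF,
        "write": bool(fsr & (1 << 11)),
    }


print(decode_fsr(0x17))
```

Under these assumptions, 0x17 decodes to a page translation fault on a read, which fits a load through a bad pointer such as the "(e5904000)" instruction, an ldr from [r0], marked in the Code line.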