subject:"\"RE\\\: Oops\\\: 17 SMP ARM \\\(v3.16\\\-rc2\\\)\""

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-12-16 Thread Mattis Lorentzon

Hi Russell,

> Now because things have changed during the last merge window, I've got
> an even bigger problem sorting through that patch set and getting it
> back into a submittable state.  I've just sent out v2 for it onto the
> net...@vger.kernel.org mailing list.
>
> The initial version (marked RFC) attracted very little interest from
> testers, or acks.  I'd very much like to have some testing of it, so
> if you want to try it out, I can provide you with a git URL, patches or a
> combined patch.
 
We have run v3.16 for about three months now, and many millions of ssh
connections on eight separate systems, both without and with your network
patches. Our conclusion is that the patches clearly reduce the number of
network timeouts, and this is a great improvement. However, after a month
or so of uptime, the number of timeouts began to increase again, forcing us
to reboot the cards.

Best regards,
Mattis Lorentzon
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-29 Thread Fabio Estevam

Hi Mattis,

On Fri, Aug 29, 2014 at 7:57 AM, Mattis Lorentzon
 wrote:
> Iain,
>
>> Interesting. We obviously have some differences in how we boot, my
>> changes to your config to get it to boot basically amount to reverting the
>> patch you attached and then enabling sata and mmc. So far I've been unable
>> to get your config to fail.
>
> Our version of U-boot doesn't support specifying a device tree separate from
> the kernel, so we append it to the end of the kernel binary. We also enable
> automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:
> console=ttymxc1,115200
> ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
> earlyprintk enable_wait_mode=off

I suppose that this 'enable_wait_mode=off' is a left over from the
time you used the FSL BSP.

This is not needed in mainline.

>> It would be good to know what makes my config work for you, I don't think
>> I've done anything special with it.
>
> With a couple of modifications (attached) we have been able to get your
> config running on our Zynq boards as well, solving our ethernet issues.
>
> The serial port and ethernet are essentially the only things we use. No disks,
> no graphics, no USB, etc. which is why we tried to reduce the kernel
> configuration to a bare minimum. We have no idea which disabled and/or
> enabled options that are causing the stalls.

It's good to hear you do not have the lockups anymore, but this is
still a big mistery for us as we have not yet understood the root
cause and what is the 'guilty' kernel config option that makes things
FEC to work unreliably.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-29 Thread Mattis Lorentzon

Iain,

> Interesting. We obviously have some differences in how we boot, my
> changes to your config to get it to boot basically amount to reverting the
> patch you attached and then enabling sata and mmc. So far I've been unable
> to get your config to fail.

Our version of U-boot doesn't support specifying a device tree separate from
the kernel, so we append it to the end of the kernel binary. We also enable
automatic configuration of IP addresses (CONFIG_IP_PNP). Our bootargs are:
console=ttymxc1,115200
ip=192.168.2.157:192.168.2.1:192.168.2.1:255.255.255.0:armcard:eth0:on
earlyprintk enable_wait_mode=off

> It would be good to know what makes my config work for you, I don't think
> I've done anything special with it.

With a couple of modifications (attached) we have been able to get your
config running on our Zynq boards as well, solving our ethernet issues.

The serial port and ethernet are essentially the only things we use. No disks,
no graphics, no USB, etc. which is why we tried to reduce the kernel
configuration to a bare minimum. We have no idea which disabled and/or
enabled options that are causing the stalls.

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***


config.patch
Description: config.patch

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-27 Thread Iain Paton

On 27/08/14 07:32, Mattis Lorentzon wrote:
> Hi Iain, Russell and Fabio,
> 
>> The config is attached. Note that there's a lot of additional stuff enabled 
>> as
>> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
>> Allwinner A10/A20 along with several versions of boards using those
>> particular SoCs.
>>
>> Same kernel binary on all the boards I've tried this on, only real 
>> differences
>> will be the devicetree and u-boot
> 
> Amazingly we have been able to run a complete nightly test on eight i.MX6
> boards without hickups using Iain's config! We had to modify it slightly to 
> get
> it to boot, please find attached patch and Iain's patched config.

Interesting. We obviously have some differences in how we boot, my changes to 
your config to get it to boot basically amount to reverting the patch you 
attached 
and then enabling sata and mmc. So far I've been unable to get your config to 
fail.

I'm attaching the patch showing what I changed in case it sheds any light on 
what's going on, although I don't see why any of the changes make any 
difference.

My kernel command line is also fairly obvious with nothing I'd think is odd:
console=ttymxc1,115200n8 root=/dev/sda1 ro rootfstype=ext2 rootwait video= 
ahci-imx.hotplug=1

It would be good to know what makes my config work for you, I don't think I've 
done anything special with it.

Iain

3c3
< # Linux/arm 3.16.0 Kernel Configuration
---
> # Linux/arm 3.16.0-rc2 Kernel Configuration
38c38
< CONFIG_KERNEL_GZIP=y
---
> # CONFIG_KERNEL_GZIP is not set
41c41
< # CONFIG_KERNEL_LZO is not set
---
> CONFIG_KERNEL_LZO=y
233,239c233
< CONFIG_PARTITION_ADVANCED=y
< # CONFIG_ACORN_PARTITION is not set
< # CONFIG_AIX_PARTITION is not set
< # CONFIG_OSF_PARTITION is not set
< # CONFIG_AMIGA_PARTITION is not set
< # CONFIG_ATARI_PARTITION is not set
< # CONFIG_MAC_PARTITION is not set
---
> # CONFIG_PARTITION_ADVANCED is not set
241,249d234
< # CONFIG_BSD_DISKLABEL is not set
< # CONFIG_MINIX_SUBPARTITION is not set
< # CONFIG_SOLARIS_X86_PARTITION is not set
< # CONFIG_UNIXWARE_DISKLABEL is not set
< # CONFIG_LDM_PARTITION is not set
< # CONFIG_SGI_PARTITION is not set
< # CONFIG_ULTRIX_PARTITION is not set
< # CONFIG_SUN_PARTITION is not set
< # CONFIG_KARMA_PARTITION is not set
251,252d235
< # CONFIG_SYSV68_PARTITION is not set
< # CONFIG_CMDLINE_PARTITION is not set
265,266d247
< CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
< CONFIG_RWSEM_SPIN_ON_OWNER=y
533,534c514,521
< # CONFIG_ARM_APPENDED_DTB is not set
< CONFIG_CMDLINE=""
---
> CONFIG_ARM_APPENDED_DTB=y
> CONFIG_ARM_ATAG_DTB_COMPAT=y
> CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_ARM_ATAG_DTB_COMPAT_CMDLINE_EXTEND is not set
> CONFIG_CMDLINE="___console=ttymxc0,115200 ___debug ___LOGLEVEL=8 
> ___initrd=0x11800040,12383491 ___dyndbg=\"file * +p\""
> CONFIG_CMDLINE_FROM_BOOTLOADER=y
> # CONFIG_CMDLINE_EXTEND is not set
> # CONFIG_CMDLINE_FORCE is not set
591c578
< # CONFIG_PM_TEST_SUSPEND is not set
---
> CONFIG_PM_TEST_SUSPEND=y
919d905
< CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
920a907
> CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
1045,1059c1032
< CONFIG_ATA=y
< # CONFIG_ATA_NONSTANDARD is not set
< CONFIG_ATA_VERBOSE_ERROR=y
< CONFIG_SATA_PMP=y
< 
< #
< # Controllers with non-SFF native interface
< #
< CONFIG_SATA_AHCI=y
< CONFIG_SATA_AHCI_PLATFORM=y
< CONFIG_AHCI_IMX=y
< # CONFIG_SATA_INIC162X is not set
< # CONFIG_SATA_ACARD_AHCI is not set
< # CONFIG_SATA_SIL24 is not set
< # CONFIG_ATA_SFF is not set
---
> # CONFIG_ATA is not set
1786,1815c1759
< CONFIG_MMC=y
< # CONFIG_MMC_DEBUG is not set
< # CONFIG_MMC_CLKGATE is not set
< 
< #
< # MMC/SD/SDIO Card Drivers
< #
< CONFIG_MMC_BLOCK=y
< CONFIG_MMC_BLOCK_MINORS=8
< CONFIG_MMC_BLOCK_BOUNCE=y
< # CONFIG_SDIO_UART is not set
< # CONFIG_MMC_TEST is not set
< 
< #
< # MMC/SD/SDIO Host Controller Drivers
< #
< CONFIG_MMC_SDHCI=y
< CONFIG_MMC_SDHCI_IO_ACCESSORS=y
< # CONFIG_MMC_SDHCI_PCI is not set
< CONFIG_MMC_SDHCI_PLTFM=y
< # CONFIG_MMC_SDHCI_OF_ARASAN is not set
< CONFIG_MMC_SDHCI_ESDHC_IMX=y
< # CONFIG_MMC_SDHCI_PXAV3 is not set
< # CONFIG_MMC_SDHCI_PXAV2 is not set
< # CONFIG_MMC_MXC is not set
< # CONFIG_MMC_TIFM_SD is not set
< # CONFIG_MMC_CB710 is not set
< # CONFIG_MMC_VIA_SDMMC is not set
< # CONFIG_MMC_DW is not set
< # CONFIG_MMC_USDHI6ROL0 is not set
---
> # CONFIG_MMC is not set
1968d1911
< # CONFIG_WIMAX_GDM72XX is not set

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-26 Thread Mattis Lorentzon

Hi Iain, Russell and Fabio,

> The config is attached. Note that there's a lot of additional stuff enabled as
> I'm aiming for a single general purpose kernel that covers i.MX6, AM3359,
> Allwinner A10/A20 along with several versions of boards using those
> particular SoCs.
>
> Same kernel binary on all the boards I've tried this on, only real differences
> will be the devicetree and u-boot

Amazingly we have been able to run a complete nightly test on eight i.MX6
boards without hickups using Iain's config! We had to modify it slightly to get
it to boot, please find attached patch and Iain's patched config.

On Russell's suggestion we also began to disable flow control on the machines.
However it did not seem to make a difference because all our Zynq cards
stalled during the same test run (using our own Zynq config).

Iain's config seems promising and we will continue to run tests during the
next couple of days. We will also try to adapt Iain's config to our Zynq board.

Many thanks for all suggestions, patches and configs so far!

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***


config.patch
Description: config.patch


config.gz
Description: config.gz

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-26 Thread Iain Paton

On 21/08/14 10:39, Iain Paton wrote:
> On 19/08/14 07:03, Iain Paton wrote:
>> On 17/08/14 22:46, Fabio Estevam wrote:
>>> Iain,
>>>
>>> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton  wrote:
 On 15/08/14 06:42, Mattis Lorentzon wrote:

> We mostly run SSH with benchmarks using NFS, it can probably be
> triggered by using only SSH with the following loop:
>
> # while : ; do ssh arm-card date; done

 Mattis,

 What sort of time does it take for you to see a problem?

 I've been running the above for nearly two days on 3.16.0 on a board
 with fec interrupts routed through gpio_6 and haven't seen a hint of
 a problem.
>>>
>>> Thanks for testing.
>>>
>>> Which mx6 board have you used on this test?
>>
>> It's currently pointed at a RIoTboard (atheros phy) but I'm happy to 
>> try it against both a Sabre-Lite and a Wandboard B1, all running the 
>> same kernel binary, as well. 
>>
>> I'm interested enough in why different people get different results 
>> with this that I'll put some time towards testing to try to help 
>> narrow down the cause.
>>
> 
> two and a half days of running this against both a sabre-lite and a 
> wandboard quad B1 and I still have no reason to think there's any 
> sort of a problem.
> 
> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if 
> I can reproduce it that way.
> 
> Suggestions on a better / easier / quicker way to reproduce it are 
> welcome.
> 

So I wasn't able to use Mattis exact configuration as I couldn't 
get it to boot properly on anything. 

I made changes enough to enable mmc/sata and to disable the 
compiled in kernel command line and appended devicetree and initrd.
Even then it still won't boot on my WBQUAD. 
It is running on Sabre-Lite and RIoTboard though, so useful enough 
to test against the SL in a similar manner to Mattis tests with SL.

I've had the test running against both for approx one day and again 
no sign of any problems. I'm happy to leave this running, but at 
this stage I'm not expecting I'll see any problems even if I leave 
it running for a week.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-26 Thread Iain Paton

On 25/08/14 11:18, Russell King - ARM Linux wrote:
> On Wed, Aug 13, 2014 at 01:39:27PM +, Mattis Lorentzon wrote:
>> All our tests seem to behave the same way on the Sabrelite as on our own 
>> board.
>> A working theory is that the switch (3Com Switch 4400) triggers the 
>> degeneration
>> of the network stack from which Linux does not seem to recover, even if we 
>> later
>> bypass the switch and directly connect the board to the server machine.
> 
> Please can you try something - what happens if you completely disable
> pause frame support (flow control) on all machines on the switch?

Russell, while trying to duplicate this I have flow-control disabled 
on the switch which leads to it being auto-negotiated off on all devices.
Do you think it could be worth turning it on and trying again?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-25 Thread Russell King - ARM Linux

On Wed, Aug 13, 2014 at 01:39:27PM +, Mattis Lorentzon wrote:
> All our tests seem to behave the same way on the Sabrelite as on our own 
> board.
> A working theory is that the switch (3Com Switch 4400) triggers the 
> degeneration
> of the network stack from which Linux does not seem to recover, even if we 
> later
> bypass the switch and directly connect the board to the server machine.

Please can you try something - what happens if you completely disable
pause frame support (flow control) on all machines on the switch?

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-22 Thread Iain Paton

On 22/08/14 01:01, Fabio Estevam wrote:
> On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton  wrote:
> 
>> two and a half days of running this against both a sabre-lite and a
>> wandboard quad B1 and I still have no reason to think there's any
>> sort of a problem.
>>
>> Up to now, my testing has been done with my own config, I'll now
>> repeat the whole thing using the config Mattis posted to see if
>> I can reproduce it that way.
>>
>> Suggestions on a better / easier / quicker way to reproduce it are
>> welcome.
> 
> Thanks, Iain.
> 
> Mattis,
> 
> What is the silicon version of the mx6 in your sabrelite? What GCC
> version do you use?
> 

For reference, both my SL and WBQUAD report silicon rev 1.2
The RIoTboard uses a Solo and reports silicon rev 1.1

I'm using vanilla gcc 4.9.1 and compiling the kernel natively on a 
sabre-lite.



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-22 Thread Russell King - ARM Linux

On Thu, Aug 14, 2014 at 02:43:56PM +, Mattis Lorentzon wrote:
> Fabio and Russell,
> 
> > A working theory is that the switch (3Com Switch 4400) triggers the
> > degeneration of the network stack from which Linux does not seem to
> > recover, even if we later bypass the switch and directly connect the board 
> > to
> > the server machine.
> 
> After a few more tests we have finally been able to trigger the exact
> same stalls on the Sabrelite board with a direct network connection
> (i.e. without the switch).

That's a setup which I can't reproduce, as all my MX6 hardware runs
root-NFS, so using a direct connection to a machine to test will
result in the MX6 losing its root filesystem.

That said, on SolidRun hardware, there is some investigation going on
at the moment concerning poor UDP performance - this is an on-going
problem that has been present for a long time.

What we find is that TCP performance achieves around the 600mbps mark,
but UDP performance can be extremely poor with high packet loss.
Adding a udelay(210) into the fec_enet_rx() can perversely (on
multi-core SoCs) increase UDP performance to around 500mbps at the
expense of a reduction in TCP performance.

This "solution" was tripped over while trying to debug this problem,
and it was found that adding printk()s to the driver increased UDP
performance - so subsituting udelay() for printk() was then tried.

I tried to run perf on the kernel yesterday to find out what's going
on, but for some reason, perf gave me impossible call traces, so I
gave up with that idea.  For example, perf told me that there was a
high hit rate in memcpy() being called from net_rx_action(), but
net_rx_action() doesn't call memcpy(), nor do any of the called
functions as a tail-call.

That said, I don't think perf could tell us what's going on - what
we need is a trace of the CPU's execution while iperf is running,
*without* affecting the CPU itself.  This is something I can't do
with the hardware I have.

My suspicion (unproven) is that a batch of packets get processed in
the softirq handler called during the FEC interrupt exit path.  Then,
because there's more work to be done, ksoftirqd is scheduled, but it
takes time for ksoftirqd to start running - during which time we drop
a lot of packets.  ksoftirqd processes some packets, but then finds
that it can't complete the NAPI "work budget", and so stops running,
resulting in the packet processing being triggered by the next FEC
interrupt, and the cycle repeats.

TCP notices this, and adjusts its sending rate to match, whereas UDP
just carries on regardless, resulting in lots of packets dropped each
time we switch from the tail of hardirq processing to ksoftirqd.

With the udelay() in place, processing takes enough time that it gets
bounced onto ksoftirqd, where it stays.

I'm adding this to this thread in case it has any bearing on the
problem(s) you're seeing - yes, it seems like a different problem, but
could it be related...

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-21 Thread Mattis Lorentzon

Fabio,

> What is the silicon version of the mx6 in your sabrelite? What GCC version do
> you use?

The silicon version is PCIMX6Q6AVT10AA and the GCC version we use is
arm-none-eabi-gcc (Fedora 2013.11.24-2.fc19) 4.8.1.

Iain,

> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if I can
> reproduce it that way.

Thanks for testing this. Could you also send me the config that you used for
your Sabrelite?

Do you know of any options that enable additional debug information about
the network driver state (full buffers etc.)?

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-21 Thread Fabio Estevam

On Thu, Aug 21, 2014 at 6:39 AM, Iain Paton  wrote:

> two and a half days of running this against both a sabre-lite and a
> wandboard quad B1 and I still have no reason to think there's any
> sort of a problem.
>
> Up to now, my testing has been done with my own config, I'll now
> repeat the whole thing using the config Mattis posted to see if
> I can reproduce it that way.
>
> Suggestions on a better / easier / quicker way to reproduce it are
> welcome.

Thanks, Iain.

Mattis,

What is the silicon version of the mx6 in your sabrelite? What GCC
version do you use?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-21 Thread Iain Paton

On 19/08/14 07:03, Iain Paton wrote:
> On 17/08/14 22:46, Fabio Estevam wrote:
>> Iain,
>>
>> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton  wrote:
>>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>>
 We mostly run SSH with benchmarks using NFS, it can probably be
 triggered by using only SSH with the following loop:

 # while : ; do ssh arm-card date; done
>>>
>>> Mattis,
>>>
>>> What sort of time does it take for you to see a problem?
>>>
>>> I've been running the above for nearly two days on 3.16.0 on a board
>>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>>> a problem.
>>
>> Thanks for testing.
>>
>> Which mx6 board have you used on this test?
> 
> It's currently pointed at a RIoTboard (atheros phy) but I'm happy to 
> try it against both a Sabre-Lite and a Wandboard B1, all running the 
> same kernel binary, as well. 
> 
> I'm interested enough in why different people get different results 
> with this that I'll put some time towards testing to try to help 
> narrow down the cause.
> 

two and a half days of running this against both a sabre-lite and a 
wandboard quad B1 and I still have no reason to think there's any 
sort of a problem.

Up to now, my testing has been done with my own config, I'll now
repeat the whole thing using the config Mattis posted to see if 
I can reproduce it that way.

Suggestions on a better / easier / quicker way to reproduce it are 
welcome.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-18 Thread Iain Paton

On 17/08/14 22:46, Fabio Estevam wrote:
> Iain,
> 
> On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton  wrote:
>> On 15/08/14 06:42, Mattis Lorentzon wrote:
>>
>>> We mostly run SSH with benchmarks using NFS, it can probably be
>>> triggered by using only SSH with the following loop:
>>>
>>> # while : ; do ssh arm-card date; done
>>
>> Mattis,
>>
>> What sort of time does it take for you to see a problem?
>>
>> I've been running the above for nearly two days on 3.16.0 on a board
>> with fec interrupts routed through gpio_6 and haven't seen a hint of
>> a problem.
> 
> Thanks for testing.
> 
> Which mx6 board have you used on this test?

It's currently pointed at a RIoTboard (atheros phy) but I'm happy to 
try it against both a Sabre-Lite and a Wandboard B1, all running the 
same kernel binary, as well. 

I'm interested enough in why different people get different results 
with this that I'll put some time towards testing to try to help 
narrow down the cause.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-17 Thread Fabio Estevam

Iain,

On Sun, Aug 17, 2014 at 6:34 PM, Iain Paton  wrote:
> On 15/08/14 06:42, Mattis Lorentzon wrote:
>
>> We mostly run SSH with benchmarks using NFS, it can probably be
>> triggered by using only SSH with the following loop:
>>
>> # while : ; do ssh arm-card date; done
>
> Mattis,
>
> What sort of time does it take for you to see a problem?
>
> I've been running the above for nearly two days on 3.16.0 on a board
> with fec interrupts routed through gpio_6 and haven't seen a hint of
> a problem.

Thanks for testing.

Which mx6 board have you used on this test?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-17 Thread Iain Paton

On 15/08/14 06:42, Mattis Lorentzon wrote:

> We mostly run SSH with benchmarks using NFS, it can probably be
> triggered by using only SSH with the following loop:
> 
> # while : ; do ssh arm-card date; done

Mattis,

What sort of time does it take for you to see a problem?

I've been running the above for nearly two days on 3.16.0 on a board 
with fec interrupts routed through gpio_6 and haven't seen a hint of 
a problem.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-14 Thread Mattis Lorentzon

Fabio,

> Do the stalls also happen on a pure 3.16 kernel?

Yes, we just tried this out overnight and we get the same stalls here.
We have seen similar problems on a Zynq-based board. It might be
worth noting that a common chip between all three boards is, for
example, the KSZ9021RN, while the FEC driver, for example, only
runs on the two iMX6-boards.

> How can we reproduce the error?

We mostly run SSH with benchmarks using NFS, it can probably be
triggered by using only SSH with the following loop:

# while : ; do ssh arm-card date; done

Our (pure) 3.16 kernel uses the following config.
http://lkml.iu.edu/hypermail/linux/kernel/1408.1/03045/config.gz

(We have quite generously disabled a lot of sub-systems in our config.)

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-14 Thread Fabio Estevam

On Thu, Aug 14, 2014 at 11:43 AM, Mattis Lorentzon
 wrote:

> After a few more tests we have finally been able to trigger the exact same 
> stalls
> on the Sabrelite board with a direct network connection (i.e. without the 
> switch).

Do the stalls also happen on a pure 3.16 kernel?

How can we reproduce the error?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-14 Thread Mattis Lorentzon

Fabio and Russell,

> A working theory is that the switch (3Com Switch 4400) triggers the
> degeneration of the network stack from which Linux does not seem to
> recover, even if we later bypass the switch and directly connect the board to
> the server machine.

After a few more tests we have finally been able to trigger the exact same 
stalls
on the Sabrelite board with a direct network connection (i.e. without the 
switch).

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***N�r��yb�X��ǧv�^�)޺{.n�+{zX����ܨ}���Ơz�&j:+v���zZ+��+zf���h���~i���z��w���?�&�)ߢf��^jǫy�m��@A�a���
0��h���i

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-13 Thread Mattis Lorentzon

Fabio and Russell,

> In order to try to narrow down whether this is a board issue, could you try to
> run the same kernel on a mx6q development board, such as mx6qsabresd,
> cubox-i, wandboard, etc?

Indeed, we have a Sabrelite development board and have run the same kernel
configuration (please find attached). Russells 30 FEC related patches are 
applied.
We have also tried with and without the extended interrupts entry in the DT.

All our tests seem to behave the same way on the Sabrelite as on our own board.
A working theory is that the switch (3Com Switch 4400) triggers the degeneration
of the network stack from which Linux does not seem to recover, even if we later
bypass the switch and directly connect the board to the server machine.

Since the problem is stochastic in nature we are not completely sure if we can
trigger the problem without the switch. It's the switch that allows us to run 
many
cards simultaneously and thus trigger the problem more easily. :-)

What are your thoughts?

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

config.gz
Description: config.gz

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-11 Thread Fabio Estevam

On Mon, Aug 11, 2014 at 10:32 AM, Mattis Lorentzon
 wrote:
> Russell and Fabio,
>
>> I'd be interested to hear whether removing the
>>
>>   interrupts-extended = ...
>>
>> property from your board's DT file, thereby causing you to revert back to the
>> default I list above, also fixes the instability you are seeing.
>
> We have tried to remove the board specific interrupts-extended field and the
> MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve
> the stalls. Our interrupts look like this now:
>
> 150:  15519  0  0  0   GIC 150  
> 2188000.ethernet
> 151:  0  0  0  0   GIC 151  
> 2188000.ethernet
>
> Our device tree might still be slightly incorrect. We have noticed that our
> RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are
> a bit surprised that this works at all). We are not quite sure how to 
> configure
> this properly.

In order to try to narrow down whether this is a board issue, could
you try to run the same kernel on a mx6q development board, such as
mx6qsabresd, cubox-i, wandboard, etc?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-11 Thread Mattis Lorentzon

Russell and Fabio,

> I'd be interested to hear whether removing the
> 
>   interrupts-extended = ...
> 
> property from your board's DT file, thereby causing you to revert back to the
> default I list above, also fixes the instability you are seeing.

We have tried to remove the board specific interrupts-extended field and the
MX6QDL_PAD_GPIO_6__ENET_IRQ entry. Sadly this did not seem to improve
the stalls. Our interrupts look like this now:

150:  15519  0  0  0   GIC 150  2188000.ethernet
151:  0  0  0  0   GIC 151  2188000.ethernet

Our device tree might still be slightly incorrect. We have noticed that our
RGMII_INT is connected to GPIO 19 (P5) which might be nonstandard (we are
a bit surprised that this works at all). We are not quite sure how to configure
this properly.

Best regards,
Mattis Lorentzon
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-08 Thread Russell King - ARM Linux

On Thu, Aug 07, 2014 at 01:12:48PM +0100, Russell King - ARM Linux wrote:
> On Thu, Aug 07, 2014 at 11:11:06AM +, Mattis Lorentzon wrote:
> > Russell,
> > 
> > > Can you ascertain whether these stalls are a result of some failure of the
> > > receive side or the transmit side - you should be able to tell that if 
> > > you watch
> > > the packet counts via ifconfig on the stalled card.  Also, it would be 
> > > useful to
> > > know whether the FEC interrupt was firing.
> > 
> > grep eth /proc/interrupts
> > 151:  0  0  0  0   GIC 151  
> > 2188000.ethernet
> > 166:1205661  0  0  0  gpio-mxc   6  
> > 2188000.ethernet
> > 
> > The interrupt counter 166 increases regularly during the stalls.
> > Ifconfig indicates that the RX and TX  counters do not increase.
> 
> Hmm, I'm slightly confused.  On my iMX6Q, I have:
> 
> 150: 581754  0  0  0   GIC 150  
> 2188000.ethernet
> 151:  0  0  0  0   GIC 151  
> 2188000.ethernet
> 
> In the DT file, we have:
> 
> fec: ethernet@02188000 {
> compatible = "fsl,imx6q-fec";
> reg = <0x02188000 0x4000>;
> interrupts-extended =
> <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
> <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
> clocks = <&clks 117>, <&clks 117>, <&clks 
> 190>;
> clock-names = "ipg", "ahb", "ptp";
> status = "disabled";
> };
> 
> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
> Yet you seem to have nothing registered against GIC 150, instead having
> an interrupt against GPIO 6.
> 
> This seems very odd, and as this is an on-SoC device, I don't see why
> you would want to bind the interrupts for the FEC device any differently
> to standard platforms.
> 
> This could well be the cause of your stalls.
> 
> What's GPIO 6 used for on your board?

We have a second report of instability with the FEC today, and the
problem board (wanboard) is also using GPIO1 6 for the ethernet IRQ.
We have confirmation from the reporter that reverting the change
(thus making the FEC use the standard interrupt) fixes their problem.

Therefore, it seems that the workaround for ERR006687 is itself buggy.

I'd be interested to hear whether removing the 

interrupts-extended = ...

property from your board's DT file, thereby causing you to revert back
to the default I list above, also fixes the instability you are seeing.

Thanks.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-08 Thread Fabio Estevam

Mattis,

On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam  wrote:
> On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux
>  wrote:
>
>> Hmm, I'm slightly confused.  On my iMX6Q, I have:
>>
>> 150: 581754  0  0  0   GIC 150  
>> 2188000.ethernet
>> 151:  0  0  0  0   GIC 151  
>> 2188000.ethernet
>
> Same here on a mx6qsabresd.
>
>> In the DT file, we have:
>>
>> fec: ethernet@02188000 {
>> compatible = "fsl,imx6q-fec";
>> reg = <0x02188000 0x4000>;
>> interrupts-extended =
>> <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>> <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
>> clocks = <&clks 117>, <&clks 117>, <&clks 
>> 190>;
>> clock-names = "ipg", "ahb", "ptp";
>> status = "disabled";
>> };
>>
>> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
>> Yet you seem to have nothing registered against GIC 150, instead having
>> an interrupt against GPIO 6.
>>
>> This seems very odd, and as this is an on-SoC device, I don't see why
>> you would want to bind the interrupts for the FEC device any differently
>> to standard platforms.
>>
>> This could well be the cause of your stalls.
>>
>> What's GPIO 6 used for on your board?
>
> On a imx6q sabreauto I also get:
>
> 151:  0  0  0  0   GIC 151  
> 2188000.ethernet
> 166:   4577  0  0  0  gpio-mxc   6  
> 2188000.ethernet

Could you remove 'interrupts-extended'  from the FEC node and also
MX6QDL_PAD_GPIO_6__ENET_IRQ from the pinctrl node and test again?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-07 Thread Troy Kisky

On 8/7/2014 7:38 AM, Fabio Estevam wrote:
> On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam  wrote:
> 
> ,but I am wondering if we should also do:
> 
> --- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
> +++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
> @@ -66,6 +66,7 @@
> pinctrl-0 = <&pinctrl_enet>;
> phy-mode = "rgmii";
> interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>,
> + <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
>   <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
> status = "okay";
>  };
> @@ -226,7 +227,7 @@
> MX6QDL_PAD_RGMII_RD2__RGMII_RD2 
> 0x1b0b0
> MX6QDL_PAD_RGMII_RD3__RGMII_RD3 
> 0x1b0b0
> MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL   
> 0x1b0b0
> -   MX6QDL_PAD_GPIO_6__ENET_IRQ 
> 0x000b1
> +   MX6QDL_PAD_GPIO_6__ENET_IRQ
>  0x40b1
> 
> Since the Workaround for erratum ERR006687 states that the SION bit
> needs to be used:
> 
> "All of the interrupts can be selected by MUX and output to pad GPIO6.
> If GPIO6 is selected to
> output ENET interrupts and GPIO6 SION is set, the resulting GPIO
> interrupt will wake the system
> from Wait mode."
> 
arch/arm/boot/dts/imx6q-pinfunc.h:#define MX6QDL_PAD_GPIO_6__ENET_IRQ   
0x230 0x600
0x03c 0x11 0xff000609

So, the ion bit should already be set(0x11). But the other way works too.


Troy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-07 Thread Fabio Estevam

On Thu, Aug 7, 2014 at 11:20 AM, Fabio Estevam  wrote:

> On a imx6q sabreauto I also get:
>
> 151:  0  0  0  0   GIC 151  
> 2188000.ethernet
> 166:   4577  0  0  0  gpio-mxc   6  
> 2188000.ethernet
>
> and the GPIO1_6 interrupt comes from this commit:
>
> commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3
> Author: Troy Kisky 
> Date:   Fri Dec 20 11:47:12 2013 -0700
>
> ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt.
>
> This works around a hardware bug.
>
> Signed-off-by: Troy Kisky 
> Signed-off-by: Shawn Guo 

Actually a more descriptive commit log can be found here:

commit 6261c4c8f13eb91f733e8ba6d67c409a2e841667
Author: Troy Kisky 
Date:   Fri Dec 20 11:47:11 2013 -0700

ARM: dts: imx6qdl-sabrelite: use GPIO_6 for FEC interrupt.

This works around a hardware bug.
From "Chip Errata for the i.MX 6Dual/6Quad"

ERR006687 ENET: Only the ENET wake-up interrupt request can wake the
system from Wait mode.

The ENET block generates many interrupts. Only one of these interrupt lines
is connected to the General Power Controller (GPC) block, but a logical OR
of all of the ENET interrupts is connected to the General
Interrupt Controller
(GIC). When the system enters Wait mode, a normal RX Done or TX
Done does not
wake up the system because the GPC cannot see this interrupt. This impacts
performance of the ENET block because its interrupts are serviced only when
the chip exits Wait mode due to an interrupt from some other wake-up source.

Before this patch, ping times of a Sabre Lite board are quite
random:
ping 192.168.0.13 -i.5 -c5
PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data.
64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=15.7 ms
64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=14.4 ms
64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=13.4 ms
64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=12.4 ms
64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=11.4 ms

=== 192.168.0.13 ping statistics ===
5 packets transmitted, 5 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 11.431/13.501/15.746/1.508 ms

After this patch:

ping 192.168.0.13 -i.5 -c5
PING 192.168.0.13 (192.168.0.13) 56(84) bytes of data.
64 bytes from 192.168.0.13: icmp_req=1 ttl=64 time=0.120 ms
64 bytes from 192.168.0.13: icmp_req=2 ttl=64 time=0.175 ms
64 bytes from 192.168.0.13: icmp_req=3 ttl=64 time=0.169 ms
64 bytes from 192.168.0.13: icmp_req=4 ttl=64 time=0.168 ms
64 bytes from 192.168.0.13: icmp_req=5 ttl=64 time=0.172 ms

=== 192.168.0.13 ping statistics ===
5 packets transmitted, 5 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.120/0.160/0.175/0.026 ms

Also, apply same change to imx6qdl-nitrogen6x.

This change may not be appropriate for all boards.
Sabre Lite uses GPIO6 as a power down output for a ov5642
camera. As this expansion board does not yet work with mainline,
this is not yet a conflict. It would be nice to have an alternative
fix for boards where this is a problem.

For example Sabre SD uses GPIO6 for I2C3_SDA. It also
has long ping times currently. But cannot use this fix
without giving up a touchscreen.

Its ping times are also random.

ping 192.168.0.19 -i.5 -c5
PING 192.168.0.19 (192.168.0.19) 56(84) bytes of data.
64 bytes from 192.168.0.19: icmp_req=1 ttl=64 time=16.0 ms
64 bytes from 192.168.0.19: icmp_req=2 ttl=64 time=15.4 ms
64 bytes from 192.168.0.19: icmp_req=3 ttl=64 time=14.4 ms
64 bytes from 192.168.0.19: icmp_req=4 ttl=64 time=13.4 ms
64 bytes from 192.168.0.19: icmp_req=5 ttl=64 time=12.4 ms

=== 192.168.0.19 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 12.451/14.369/16.057/1.316 ms

Signed-off-by: Troy Kisky 
CC: Ranjani Vaidyanathan 
Signed-off-by: Shawn Guo 

,but I am wondering if we should also do:

--- a/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
+++ b/arch/arm/boot/dts/imx6qdl-sabreauto.dtsi
@@ -66,6 +66,7 @@
pinctrl-0 = <&pinctrl_enet>;
phy-mode = "rgmii";
interrupts-extended = <&gpio1 6 IRQ_TYPE_LEVEL_HIGH>,
+ <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
  <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
status = "okay";
 };
@@ -226,7 +227,7 @@
MX6QDL_PAD_RGMII_RD2__RGMII_RD2 0x1b0b0
MX6QDL_PAD_RGMII_RD3__RGMII_RD3 0x1b0b0
MX6QDL_PAD_RGMII_RX_CTL__RGMII_RX_CTL   0x1b0b0
-   MX6QDL_PAD_GPIO_6__ENET_IRQ 0x000b1
+   MX6QDL_PAD_GPIO_6__E

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-07 Thread Fabio Estevam

On Thu, Aug 7, 2014 at 9:12 AM, Russell King - ARM Linux
 wrote:

> Hmm, I'm slightly confused.  On my iMX6Q, I have:
>
> 150: 581754  0  0  0   GIC 150  
> 2188000.ethernet
> 151:  0  0  0  0   GIC 151  
> 2188000.ethernet

Same here on a mx6qsabresd.

> In the DT file, we have:
>
> fec: ethernet@02188000 {
> compatible = "fsl,imx6q-fec";
> reg = <0x02188000 0x4000>;
> interrupts-extended =
> <&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
> <&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
> clocks = <&clks 117>, <&clks 117>, <&clks 
> 190>;
> clock-names = "ipg", "ahb", "ptp";
> status = "disabled";
> };
>
> which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
> Yet you seem to have nothing registered against GIC 150, instead having
> an interrupt against GPIO 6.
>
> This seems very odd, and as this is an on-SoC device, I don't see why
> you would want to bind the interrupts for the FEC device any differently
> to standard platforms.
>
> This could well be the cause of your stalls.
>
> What's GPIO 6 used for on your board?

On a imx6q sabreauto I also get:

151:  0  0  0  0   GIC 151  2188000.ethernet
166:   4577  0  0  0  gpio-mxc   6  2188000.ethernet

and the GPIO1_6 interrupt comes from this commit:

commit bc20a5d6da718f9d60da0a78f70c653c1cd16af3
Author: Troy Kisky 
Date:   Fri Dec 20 11:47:12 2013 -0700

ARM: dts: imx6qdl-sabreauto: use GPIO_6 for FEC interrupt.

This works around a hardware bug.

Signed-off-by: Troy Kisky 
Signed-off-by: Shawn Guo 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-07 Thread Russell King - ARM Linux

On Thu, Aug 07, 2014 at 11:11:06AM +, Mattis Lorentzon wrote:
> Russell,
> 
> > Can you ascertain whether these stalls are a result of some failure of the
> > receive side or the transmit side - you should be able to tell that if you 
> > watch
> > the packet counts via ifconfig on the stalled card.  Also, it would be 
> > useful to
> > know whether the FEC interrupt was firing.
> 
> grep eth /proc/interrupts
> 151:  0  0  0  0   GIC 151  
> 2188000.ethernet
> 166:1205661  0  0  0  gpio-mxc   6  
> 2188000.ethernet
> 
> The interrupt counter 166 increases regularly during the stalls.
> Ifconfig indicates that the RX and TX  counters do not increase.

Hmm, I'm slightly confused.  On my iMX6Q, I have:

150: 581754  0  0  0   GIC 150  2188000.ethernet
151:  0  0  0  0   GIC 151  2188000.ethernet

In the DT file, we have:

fec: ethernet@02188000 {
compatible = "fsl,imx6q-fec";
reg = <0x02188000 0x4000>;
interrupts-extended =
<&intc 0 118 IRQ_TYPE_LEVEL_HIGH>,
<&intc 0 119 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&clks 117>, <&clks 117>, <&clks 190>;
clock-names = "ipg", "ahb", "ptp";
status = "disabled";
};

which, for the gic, would be 118 + 32 (first SPI) = 150, 119 + 32 = 151.
Yet you seem to have nothing registered against GIC 150, instead having
an interrupt against GPIO 6.

This seems very odd, and as this is an on-SoC device, I don't see why
you would want to bind the interrupts for the FEC device any differently
to standard platforms.

This could well be the cause of your stalls.

What's GPIO 6 used for on your board?

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-07 Thread Mattis Lorentzon

Russell,

> Can you ascertain whether these stalls are a result of some failure of the
> receive side or the transmit side - you should be able to tell that if you 
> watch
> the packet counts via ifconfig on the stalled card.  Also, it would be useful 
> to
> know whether the FEC interrupt was firing.

grep eth /proc/interrupts
151:  0  0  0  0   GIC 151  2188000.ethernet
166:1205661  0  0  0  gpio-mxc   6  2188000.ethernet

The interrupt counter 166 increases regularly during the stalls.
Ifconfig indicates that the RX and TX  counters do not increase.

> I hope you have some kind of serial console on these cards?

Yes, indeed. Local stimuli seems to be able to unstall the network in a
somewhat random fashion. Running e.g. ifconfig or ping locally may
immediately or after up to about half a minute make the network responsive.
However, it usually degenerates again to a complete stall within seconds.
Without local stimuli the network does not appear to recover at all. The card
does not even respond to pings (again, most often without any apparent
error messages).

Running both of the following commands in parallel from the FC server seems
to trigger the problem within minutes (please note that the arm card stops
responding to both ping and ssh):

# while :; do ssh arm-card echo Ok; done
# ping arm-card

We have noticed the same problem on both the i.MX6 and the Zynq cards
(using KSZ9021 and Cadence GEM drivers). However, the number of
iterations required to trigger the problem vary. Sometimes it might stall after
less than 100, but in other cases the stalls begin after nearly 1 
iterations.
Once stalled (and unstalled after stimuli), the network on that particular card
degenerates a lot more often. Apart from the kernel, IP numbers and MAC
addresses, the software configurations are identical between the Zynq and
the i.MX6. Perhaps the fault is unrelated to the Freescale driver?

> Hmm.  Okay, I think the first thing we need to do is to work out why the
> silent stalls are happening.

Would you have any ideas on what to check next?

Best regards,
Mattis Lorentzon
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-06 Thread Russell King - ARM Linux

On Wed, Aug 06, 2014 at 11:10:06AM +, Mattis Lorentzon wrote:
> Russell,
> 
> > What is on the other end of the link?
> 
> 16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20
> machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05).
> 
> There may be multiple problems. The backtrace has only been seen a few
> times, on two different cards. Most of the time, the network for a random
> card just stalls without any visible backtrace or error messages. The other
> cards seem to be unaffected when this happens.

Can you ascertain whether these stalls are a result of some failure of the
receive side or the transmit side - you should be able to tell that if you
watch the packet counts via ifconfig on the stalled card.  Also, it would
be useful to know whether the FEC interrupt was firing.

I hope you have some kind of serial console on these cards?

> > What I would like to do is to stamp each packet in some way with an
> > identifier marking its ring position, and then monitor the network to find 
> > out
> > whether the packet at slot 85 was actually transmitted - that's made 
> > slightly
> > harder because packets may be dropped at the receiver when operating in
> > promisc mode.  This would then allow us to work out some likely causes.
> 
> We would be glad to run this test on our setup, do you have more detailed
> information on how to set it up?

One of the problems is to find some way to stamp each packet with a 10-bit
number without having any side effects.  I guess one possibility would be
to overwrite the source MAC address on transmit, which hopefully should not
cause any side effects.

> After a network stall, we usually have to powercycle the ARM hardware to
> get it back to a usable state. These stalls last at least several minutes,
> perhaps indefinitely. It does not seem to recover properly, and is no longer
> reachable via the network.

Hmm.  Okay, I think the first thing we need to do is to work out why
the silent stalls are happening.

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-06 Thread Mattis Lorentzon

Russell,

> What is on the other end of the link?

16 ARM cards connected to a 3Com Switch 4400 connected to a Linux FC 20
machine (Intel Corporation 82541PI Gigabit Ethernet Controller rev 05).

There may be multiple problems. The backtrace has only been seen a few
times, on two different cards. Most of the time, the network for a random
card just stalls without any visible backtrace or error messages. The other
cards seem to be unaffected when this happens.

> What I would like to do is to stamp each packet in some way with an
> identifier marking its ring position, and then monitor the network to find out
> whether the packet at slot 85 was actually transmitted - that's made slightly
> harder because packets may be dropped at the receiver when operating in
> promisc mode.  This would then allow us to work out some likely causes.

We would be glad to run this test on our setup, do you have more detailed
information on how to set it up?

> Note that after the transmit watchdog, the interface should recover and start
> operating normally again - and that should not take "several minutes."

After a network stall, we usually have to powercycle the ARM hardware to
get it back to a usable state. These stalls last at least several minutes,
perhaps indefinitely. It does not seem to recover properly, and is no longer
reachable via the network.

Best regards,
Mattis Lorentzon
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-06 Thread Russell King - ARM Linux

On Tue, Aug 05, 2014 at 01:31:29PM +, Mattis Lorentzon wrote:
> We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
> currently running some stability tests.
> 
> During our first test round we triggered a timeout which caused the fec driver
> to become unresponsive for several minutes. The attached backtrace was
> shown when the hardware was rebooted.

What is on the other end of the link?

> [ cut here ]
> WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 
> dev_watchdog+0x270/0x27c()
> NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
...
> fec 2188000.ethernet eth0: TX ring dump
> Nr SC addr   len  SKB
>   00x1c00 0x   66   (null)
...
>  830x1c00 0x   66   (null)
>  84  H 0x1c00 0x   66   (null)
>  850x9c00 0x2e205000   66 9e384f00
>  860x1c00 0x2e204800   66 9e384d80
>  870x1c00 0x2e204000   66 9e384180
...
> 3760x1c00 0x2e252800   66 81cf6180
> 3770x1c00 0x2e253000   66 81cf6240
> 378 S  0x1c00 0x   66   (null)

So, the software would insert the next packet into slot 378.  However,
the slots from 85 to 377 have not been reaped, despite those in 86 to
377 allegedly having been sent.  This is because the entry in slot 85
shows that it has yet to be sent.

I've no idea what causes this; it looks like there's something screwed
with the hardware which causes the transmitter to skip an entry in the
ring under certain circumstances.  As I've never been able to reproduce
it here, I've not been able to investigate it.

What I would like to do is to stamp each packet in some way with an
identifier marking its ring position, and then monitor the network to
find out whether the packet at slot 85 was actually transmitted - that's
made slightly harder because packets may be dropped at the receiver
when operating in promisc mode.  This would then allow us to work out
some likely causes.

Note that after the transmit watchdog, the interface should recover and
start operating normally again - and that should not take "several
minutes."

-- 
FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-05 Thread Mattis Lorentzon

Hi Fabio,

> Could this problem be the same one as reported at:
> http://www.spinics.net/lists/arm-kernel/msg347914.html ?

The problem you link to describes a permanent issue, our problem seems
to be sporadic as most of our tests work fine (at least for a while).

> Which Ethernet PHY do you use? Do you have pull-up in the MDIO line?

Our hardware has the KSZ9021RN PHY, so the MDIO line should be pull-up.

Do you know if there are debug options that could help us determine the
cause of the timeout?

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-05 Thread Fabio Estevam

On Tue, Aug 5, 2014 at 10:31 AM, Mattis Lorentzon
 wrote:

> We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
> currently running some stability tests.
>
> During our first test round we triggered a timeout which caused the fec driver
> to become unresponsive for several minutes. The attached backtrace was
> shown when the hardware was rebooted.

Could this problem be the same one as reported at:
http://www.spinics.net/lists/arm-kernel/msg347914.html ?

Which Ethernet PHY do you use? Do you have pull-up in the MDIO line?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-08-05 Thread Mattis Lorentzon

Hi Russell!

> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state.  I've just sent out v2 for it onto the
> net...@vger.kernel.org mailing list.
>
> The initial version (marked RFC) attracted very little interest from testers, 
> or
> acks.  I'd very much like to have some testing of it, so if you want to try 
> it out,
> I can provide you with a git URL, patches or a combined patch.

We have applied your V2 patch set of 30 patches on top of v3.16-rc2 and are
currently running some stability tests.

During our first test round we triggered a timeout which caused the fec driver
to become unresponsive for several minutes. The attached backtrace was
shown when the hardware was rebooted.

Best regards,
Mattis Lorentzon

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***
[ cut here ]
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2+ #7
Backtrace: 
[<8001234c>] (dump_backtrace) from [<80012628>] (show_stack+0x18/0x1c)
 r6:0108 r5: r4:806ac3dc r3:
[<80012610>] (show_stack) from [<804d2c60>] (dump_stack+0x8c/0x9c)
[<804d2bd4>] (dump_stack) from [<80025c18>] (warn_slowpath_common+0x74/0x90)
 r5:0009 r4:8068fd70
[<80025ba4>] (warn_slowpath_common) from [<80025c6c>] 
(warn_slowpath_fmt+0x38/0x40)
 r8:806900c0 r7:9e160254 r6:9f4ec800 r5:9e16 r4:
[<80025c38>] (warn_slowpath_fmt) from [<803f4578>] (dev_watchdog+0x270/0x27c)
 r3:9e16 r2:8061ad58
[<803f4308>] (dev_watchdog) from [<8002fee8>] (call_timer_fn+0x74/0xec)
 r10:8068e008 r9:9e16 r8:803f4308 r7:0100 r6:8068e000 r5:0001
 r4:8068fdd8
[<8002fe74>] (call_timer_fn) from [<80030b2c>] (run_timer_softirq+0x1d4/0x254)
 r10:803f4308 r9:806900c0 r8:9e16 r7: r6:8068fe28 r5:806cc140
 r4:9e160284
[<80030958>] (run_timer_softirq) from [<8002a124>] (__do_softirq+0x168/0x2f0)
 r10:0001 r9:80690080 r8:4001 r7:8068e000 r6:0100 r5:80690084
 r4:0020
[<80029fbc>] (__do_softirq) from [<8002a5d0>] (irq_exit+0xc8/0x10c)
 r10:8068e000 r9:806cafd9 r8:0001 r7:f4000100 r6: r5:001d
 r4:8068e008
[<8002a508>] (irq_exit) from [<8000f304>] (handle_IRQ+0x4c/0x98)
 r5:001d r4:8068ae14
[<8000f2b8>] (handle_IRQ) from [<8000860c>] (gic_handle_irq+0x34/0x64)
 r6:8068ff20 r5:80696a40 r4:f400010c r3:00a0
[<800085d8>] (gic_handle_irq) from [<80013184>] (__irq_svc+0x44/0x58)
Exception stack(0x8068ff20 to 0x8068ff68)
ff20: 0001 0001  806996f0 8069652c 806964d8 806cafd9 804db068
ff40: 0001 806cafd9 8068e000 8068ff74  8068ff68 80061a1c 8000f664
ff60: 200f0013 
 r7:8068ff54 r6: r5:200f0013 r4:8000f664
[<8000f638>] (arch_cpu_idle) from [<8005d154>] (cpu_startup_entry+0x10c/0x164)
[<8005d048>] (cpu_startup_entry) from [<804cdefc>] (rest_init+0xc8/0xd8)
 r7:80683450 r3:
[<804cde34>] (rest_init) from [<80651c68>] (start_kernel+0x3a0/0x3ac)
 r5:0001 r4:806965d0
[<806518c8>] (start_kernel) from [<10008074>] (0x10008074)
---[ end trace b51f6196c5e036f0 ]---
fec 2188000.ethernet eth0: TX ring dump
Nr SC addr   len  SKB
  00x1c00 0x   66   (null)
  10x1c00 0x   66   (null)
  20x1c00 0x   66   (null)
  30x1c00 0x   66   (null)
  40x1c00 0x   66   (null)
  50x1c00 0x   66   (null)
  60x1c00 0x   66   (null)
  70x1c00 0x   66   (null)
  80x1c00 0x   66   (null)
  90x1c00 0x   66   (null)
 100x1c00 0x   66   (null)
 110x1c00 0x   66   (null)
 120x1c00 0x   66   (null)
 130x1c00 0x   66   (null)
 140x1c00 0x   66   (null)
 150x1c00 0x   66   (null)
 160x1c00 0x   66   (null)
 170x1c00 0x   66   (null)
 180x1c00 0x   66   (null)
 190x1c00 0x   66   (null)
 200x1c00 0x   66   (null)
 210x1c00 0x   66   (null)
 220x1c00 0x   66   (null)
 230x1c00 0x   66   (null)
 240x1c00 0x   66   (null)
 250x1c00 0x   66   (null)
 260x1c00 0x   66   (null)
 270x1c00 0x   66   (null)
 280x1c00 0x   66   (null)
 290x1c00 0x   66   (null)
 300x1c00 0x   66   (null)
 310x1c00 0x   66   (null)
 320x1c00 0x   66   (null)
 330x1c00 0x   66   (null)
 340x1c00 0x   66   (null)
 350x1c00 0x   66   (null)
 360x

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-07-01 Thread Fredrik Noring

Hi Russell,

> -Original Message-
> > The initial version (marked RFC) attracted very little interest from
> > testers, or acks.  I'd very much like to have some testing of it, so
> > if you want to try it out, I can provide you with a git URL, patches
> > or a combined patch.
> 
> Sure! A combined gzip patch attachment is fine. Git over HTTP probably
> works too.

We are still interested in trying out your patches to improve network
performance. We can do some testing this week and in August.

Best regards,
Fredrik

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-30 Thread Nathan Lynch

On 06/30/2014 07:30 AM, Fredrik Noring wrote:
>>
>> On Fri, Jun 27, 2014 at 04:16:57PM +, Fredrik Noring wrote:
>>> Please find below a trace that appeared once with 3.16-rc2. Perhaps it
>>> is of some interest?
>>
>> It's not that serious... I know that the FEC ethernet driver is horrendously
>> racy (I have had a patch set for about the last six months which fixes some 
>> of
>> its problems) but as I've had a lot of patches to deal with, and it's been
>> pushed to the back of the queue...
>>
>> The races don't lead to data corruption though, merely timeouts and some
>> lost packets.

> It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a
properly
> working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to
do a lot
> better. No crashes so far with v3.16-rc2!
>

Did you narrow it down to a particular GCC bug?  The symptoms you
reported remind me of:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58854

Sadly, unpatched GCC 4.8.1 and 4.8.2 are unsuitable for building ARM
kernels.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-30 Thread Fredrik Noring

Hi Russell,

It seems to be a compiler issue, where (GCC) 4.8.2 does not produce a properly
working kernel. Happily, (Fedora 2013.11.24-2.fc19) 4.8.1 appears to do a lot
better. No crashes so far with v3.16-rc2!

All the best,
Fredrik

> -Original Message-
> Hi Fredrik,
> 
> On Fri, Jun 27, 2014 at 04:16:57PM +, Fredrik Noring wrote:
> > Please find below a trace that appeared once with 3.16-rc2. Perhaps it
> > is of some interest?
> 
> It's not that serious... I know that the FEC ethernet driver is horrendously
> racy (I have had a patch set for about the last six months which fixes some of
> its problems) but as I've had a lot of patches to deal with, and it's been
> pushed to the back of the queue...
> 
> The races don't lead to data corruption though, merely timeouts and some
> lost packets.
> 
> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state.  I've just sent out v2 for it onto the
> net...@vger.kernel.org mailing list.
> 
> The initial version (marked RFC) attracted very little interest from testers, 
> or
> acks.  I'd very much like to have some testing of it, so if you want to try it
> out, I can provide you with a git URL, patches or a combined patch.
> 
> --
> FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
> improving, and getting towards what was expected from it.
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-29 Thread Fredrik Noring

Hi Russell,

> -Original Message-
> It's not that serious... I know that the FEC ethernet driver is horrendously
> racy (I have had a patch set for about the last six months which fixes some of
> its problems) but as I've had a lot of patches to deal with, and it's been
> pushed to the back of the queue...
> 
> The races don't lead to data corruption though, merely timeouts and some
> lost packets.

The serial port (uart1) and Ethernet are essentially the only things we use.
No disks, no graphics, no USB, etc. If not the Ethernet driver, what else is
likely to crash NFS so badly?

Also, we are happy to change our config if that would simplify things:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/01488/config.gz

> Now because things have changed during the last merge window, I've got an
> even bigger problem sorting through that patch set and getting it back into a
> submittable state.  I've just sent out v2 for it onto the
> net...@vger.kernel.org mailing list.
> 
> The initial version (marked RFC) attracted very little interest from testers, 
> or
> acks.  I'd very much like to have some testing of it, so if you want to try it
> out, I can provide you with a git URL, patches or a combined patch.

Sure! A combined gzip patch attachment is fine. Git over HTTP probably works
too.

All the best,
Fredrik

***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-27 Thread Fredrik Noring

Hi Russel,

> On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> > That's a similar workload to the one which is mentioned in the
> > previous report.  I've just set a similar transfer going, but this
> > will be a 16GB file.
> 
> I've run this transfer several times, but so far I've unable to reproduce the
> issue here.

Many thanks for testing this. We attempted to bisect, but unfortunately the
result was not conclusive. One reason might be that the config had to be
updated during the process, and so we did not end up with the exact same
configuration (things like e.g. IMX_SDMA in DMA_ENGINE etc.). Some runs
deadlocked without any visible Oops or printout. Some versions did not have
an entirely working console configuration.

Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
some interest?

(We also had memtester run for days on the i.MX6 hardware, without issues.)

All the best,
Fredrik

[ cut here ]
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:264 dev_watchdog+0x270/0x27c()
NETDEV WATCHDOG: eth0 (fec): transmit queue 0 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.16.0-rc2 #19
Backtrace: 
[<80012390>] (dump_backtrace) from [<8001266c>] (show_stack+0x18/0x1c)
 r6:0108 r5: r4:8064e29c r3:
[<80012654>] (show_stack) from [<8049791c>] (dump_stack+0x8c/0x9c)
[<80497890>] (dump_stack) from [<80024f4c>] (warn_slowpath_common+0x74/0x90)
 r5:0009 r4:80631d70
[<80024ed8>] (warn_slowpath_common) from [<80024fa0>] 
(warn_slowpath_fmt+0x38/0x40)
 r8:806320c0 r7:9d85a254 r6:9d879000 r5:9d85a000 r4:
[<80024f6c>] (warn_slowpath_fmt) from [<803b8ff0>] (dev_watchdog+0x270/0x27c)
 r3:9d85a000 r2:805c4790
[<803b8d80>] (dev_watchdog) from [<8002f280>] (call_timer_fn+0x6c/0xe4)
 r10:80630008 r9:9d85a000 r8:803b8d80 r7:0100 r6:8063 r5:0001
 r4:80631dd8
[<8002f214>] (call_timer_fn) from [<8002fec8>] (run_timer_softirq+0x1d4/0x254)
 r10:803b8d80 r9:806320c0 r8:9d85a000 r7: r6:80631e28 r5:80667040
 r4:9d85a284
[<8002fcf4>] (run_timer_softirq) from [<8002945c>] (__do_softirq+0x17c/0x30c)
 r10:0001 r9:80632080 r8:4001 r7:8063 r6:0100 r5:80632084
 r4:0020
[<800292e0>] (__do_softirq) from [<80029920>] (irq_exit+0xd0/0x114)
 r10:8063 r9:80665f19 r8:0001 r7:f4000100 r6: r5:80630008
 r4:8063
[<80029850>] (irq_exit) from [<8000f348>] (handle_IRQ+0x4c/0x98)
 r5:001d r4:8062ce44
[<8000f2fc>] (handle_IRQ) from [<80008614>] (gic_handle_irq+0x34/0x64)
 r6:80631f20 r5:80638a40 r4:f400010c r3:00a0
[<800085e0>] (gic_handle_irq) from [<800131c4>] (__irq_svc+0x44/0x58)
Exception stack(0x80631f20 to 0x80631f68)
1f20: 0001 0001  8063b6f0 8063852c 806384d8 80665f19 804a0040
1f40: 0001 80665f19 8063 80631f74  80631f68 800614b8 8000f6a8
1f60: 200f0013 
 r7:80631f54 r6: r5:200f0013 r4:8000f6a8
[<8000f67c>] (arch_cpu_idle) from [<8005cbf8>] (cpu_startup_entry+0x10c/0x164)
[<8005caec>] (cpu_startup_entry) from [<80492b68>] (rest_init+0xc8/0xd8)
 r7:80625028 r3:
[<80492aa0>] (rest_init) from [<805f6c5c>] (start_kernel+0x39c/0x3a8)
 r5:0001 r4:806385d0
[<805f68c0>] (start_kernel) from [<10008074>] (0x10008074)
---[ end trace a7b7109ab2d04e11 ]---
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-27 Thread Russell King - ARM Linux

Hi Fredrik,

On Fri, Jun 27, 2014 at 04:16:57PM +, Fredrik Noring wrote:
> Please find below a trace that appeared once with 3.16-rc2. Perhaps it is of
> some interest?

It's not that serious... I know that the FEC ethernet driver is
horrendously racy (I have had a patch set for about the last six months
which fixes some of its problems) but as I've had a lot of patches to
deal with, and it's been pushed to the back of the queue...

The races don't lead to data corruption though, merely timeouts and
some lost packets.

Now because things have changed during the last merge window, I've got
an even bigger problem sorting through that patch set and getting it
back into a submittable state.  I've just sent out v2 for it onto the
net...@vger.kernel.org mailing list.

The initial version (marked RFC) attracted very little interest from
testers, or acks.  I'd very much like to have some testing of it, so
if you want to try it out, I can provide you with a git URL, patches
or a combined patch.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-27 Thread Russell King - ARM Linux

On Thu, Jun 26, 2014 at 04:14:24PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jun 26, 2014 at 02:44:52PM +, Mattis Lorentzon wrote:
> > We have managed to trigger the Oops by just transferring a large file
> > over nfs
> > cat /mnt/foo > /dev/null
> > where foo is a file that is approximately 2 GB. There may be some
> > packet losses on this network, perhaps this differs from your workload?
> 
> That's a similar workload to the one which is mentioned in the previous
> report.  I've just set a similar transfer going, but this will be a 16GB
> file.

I've run this transfer several times, but so far I've unable to reproduce
the issue here.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-26 Thread Russell King - ARM Linux

On Thu, Jun 26, 2014 at 02:44:52PM +, Mattis Lorentzon wrote:
> Thank you for your reply,
> 
> > On Wed, Jun 25, 2014 at 01:55:05PM +, Mattis Lorentzon wrote:
> > > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> > Brodkorb for v3.15-rc4.
> > > https://lkml.org/lkml/2014/5/9/330
> >
> > This URL returns no useful information.  I find that lkml.org is broken more
> > times than not in recent years.  Please use a different archive site when
> > referring to posts, thanks.
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

I remember that report, but it was never resolved as I think no one has
any ideas what is causing these, and no one has any idea where to start
looking.

> We have managed to trigger the Oops by just transferring a large file
> over nfs
> cat /mnt/foo > /dev/null
> where foo is a file that is approximately 2 GB. There may be some
> packet losses on this network, perhaps this differs from your workload?

That's a similar workload to the one which is mentioned in the previous
report.  I've just set a similar transfer going, but this will be a 16GB
file.

> We have done some more investigations, please find it in this mail:
> 
> http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

Yes, I saw that before I replied, and my reply was written with that
message in mind.  That's what prompted this paragraph in my previous
reply:

"Your other oops dumps also show various other functions apparantly
returning 0x.  I can't believe that there's more than one bug
doing this, so I doubt the problem is in these functions.  Something
else must be going on."

One of the problems is that there's soo much work going on with the
kernel by many different parties, pulling it in various directions,
that no one really has an overview of all the changes, and so no one
has much of a feel what could be the cause of weird bugs like this.

I don't know what to suggest - you could try using git bisect to see
if you can track it down to a particular commit, but it sounds like
that's going to be very time consuming.  You mentioned that 3.12
doesn't show the bug, but 3.13 does - so start off telling git bisect
that 3.12 is "good" and 3.13 is "bad".

Hopefully there won't be too many breakages during the 3.13 merge
window (between 3.12 and 3.13-rc1), but I don't have much faith in
that; people seem to have a habbit of holding back fixes until -rc1,
which makes _exactly_ this kind of bug much harder for people like
yourselves to track down - or maybe even impossible.

I'm afraid I can't offer very much help beyond this until either I can
produce it, or someone manages to identify a particular change which
caused this.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

RE: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-26 Thread Mattis Lorentzon

Thank you for your reply,

> On Wed, Jun 25, 2014 at 01:55:05PM +, Mattis Lorentzon wrote:
> > I have a similar issue with v3.16-rc2 as previously reported by Waldemar
> Brodkorb for v3.15-rc4.
> > https://lkml.org/lkml/2014/5/9/330
> 
> This URL returns no useful information.  I find that lkml.org is broken more
> times than not in recent years.  Please use a different archive site when
> referring to posts, thanks.

http://lkml.iu.edu/hypermail/linux/kernel/1405.1/01114.html

> I have had two iMX6 platforms running root-NFS for about the last six to nine
> months with various workloads, and have never seen this oops.
> Unfortunately, the description above gives very little information for what
> the mechanism to trigger this bug may be.  For example, if I wanted to
> reproduce it, what would I need to do?

We have managed to trigger the Oops by just transferring a large file over nfs
cat /mnt/foo > /dev/null
where foo is a file that is approximately 2 GB. There may be some packet losses
on this network, perhaps this differs from your workload?

> > The error is sporadic and it seems to occur more frequently when using
> perf.
> 
> So it occurs when not using perf?

Yes, certainly, see above.

We have done some more investigations, please find it in this mail:

http://lkml.iu.edu/hypermail/linux/kernel/1406.3/02190.html

The Oops seems to have been introduced somewhere between v3.12 and v3.13:

- The Oops is reproducible within seconds when running Linux 3.16-rc2.
- We have observed the Oops on 8 different hardware units and two different 
chipsets (Freescale i.MX6 and Xilinx Zynq).
- The Oops has not been seen on Linux 3.12 so it appears to be good.
- The Oops has been seen on Linux 3.13, 3.14, 3.15, 3.16-rc2 so these appear to 
be bad.

Configs and a couple of Oops reports are attached to the linked mail.

Best regards,
Mattis Lorentzon
***
Consider the environment before printing this message.

To read Autoliv's Information and Confidentiality Notice, follow this link:
http://www.autoliv.com/disclaimer.html
***

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Oops: 17 SMP ARM (v3.16-rc2)

2014-06-26 Thread Russell King - ARM Linux

On Wed, Jun 25, 2014 at 01:55:05PM +, Mattis Lorentzon wrote:
> Hello kernel people,

You may wish to also copy linux-arm-ker...@lists.infradead.org, which is
where ARM kernel people are.

> I have a similar issue with v3.16-rc2 as previously reported by Waldemar 
> Brodkorb for v3.15-rc4.
> https://lkml.org/lkml/2014/5/9/330

This URL returns no useful information.  I find that lkml.org is broken
more times than not in recent years.  Please use a different archive
site when referring to posts, thanks.

> We are running a benchmark application, sometimes using perf, with heavy
> traffic over NFS.

I have had two iMX6 platforms running root-NFS for about the last six to
nine months with various workloads, and have never seen this oops.
Unfortunately, the description above gives very little information for
what the mechanism to trigger this bug may be.  For example, if I wanted
to reproduce it, what would I need to do?

> The error is sporadic and it seems to occur more frequently when using perf.

So it occurs when not using perf?

> Linux imx6-test0 3.16.0-rc2+ #1 SMP Wed Jun 25 15:04:16 CEST 2014 armv7l 
> armv7l armv7l GNU/Linux
> 
> Any help is greatly appreciated.
> 
> Best regards,
> Mattis Lorentzon
> 
> Unable to handle kernel paging request at virtual address 
> pgd = 9e338000
> [] *pgd=2fffd821, *pte=, *ppte=
> Internal error: Oops: 17 [#1] SMP ARM
> Modules linked in:
> CPU: 0 PID: 146 Comm: stereo Not tainted 3.16.0-rc2+ #1
> task: 9e07a700 ti: 81c42000 task.ti: 81c42000
> PC is at find_get_entry+0x60/0xfc
> LR is at radix_tree_lookup_slot+0x1c/0x2c
> pc : [<800a34d8>]lr : [<80290448>]psr: a013
> sp : 81c43d98  ip :   fp : 81c43dcc
> r10: 0001  r9 : 9e30e3c0  r8 : 02a7
> r7 : 9f3758a0  r6 :   r5 : 0001  r4 : 
> r3 : 81c43d84  r2 :   r1 : 02a7  r0 : 
...
> Code: e1a01008 eb07b3d6 e350 0a1c (e5904000)

Right, so radix_tree_lookup_slot returned 0x.  I've no idea how
that happened, and I'm not about to try reading and trying to understand
that code.  However, as that is generic code, I find it unlikely that
the code is buggy.  So, I suspect something else must be going on here,
such as a compiler bug or memory corruption.

Your other oops dumps also show various other functions apparantly
returning 0x.  I can't believe that there's more than one bug
doing this, so I doubt the problem is in these functions.  Something
else must be going on.

-- 
FTTC broadband for 0.8mile line: now at 9.7Mbps down 460kbps up... slowly
improving, and getting towards what was expected from it.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

45 matches

Mail list logo