Pegasos OHCI bug (was Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55)

2010-10-27 Thread pacman
Benjamin Herrenschmidt writes: OK, so you'll have to make up a workaround in prom_init that looks for OHCIs in the device tree and disables them. Check whether the OHCI node has some existing f-code words you can use for that, with "dev /path-to-ohci words" in OF for example. If not, you may need

Re: Pegasos OHCI bug (was Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55)

2010-10-27 Thread Olaf Hering
On Wed, Oct 27, pac...@kosh.dhis.org wrote: "1. How do I locate all usb nodes in the device tree? 2. How do I know if a particular usb node is OHCI?" In the installed system, run 'lspci | grep -i usb'; this gives the PCI bus numbers. Then run 'find /sys -name devspec', and look for the bus
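
The reply above is cut off, so the following is only a rough sketch of one way to answer both questions from a running system, not necessarily what was proposed. It relies on the standard PCI class code for USB host controllers (base class 0x0c, subclass 0x03, programming interface 0x10 for OHCI) and on the sysfs `class` and `devspec` attributes. The function names `usb_controller_kind` and `find_ohci` are made up for illustration, and `devspec` is assumed to be present only on platforms that expose Open Firmware device-tree paths.

```python
from pathlib import Path

# PCI class code layout: base class 0x0c (serial bus), subclass 0x03 (USB),
# then the programming interface byte that names the controller type.
USB_PROG_IF = {0x00: "UHCI", 0x10: "OHCI", 0x20: "EHCI", 0x30: "xHCI"}


def usb_controller_kind(class_code):
    """Return the USB host controller type for a 24-bit PCI class code, or None."""
    if (class_code >> 8) != 0x0C03:
        return None  # not a USB controller at all
    return USB_PROG_IF.get(class_code & 0xFF, "other USB")


def find_ohci(sysfs_root="/sys/bus/pci/devices"):
    """Yield (pci_address, devspec_or_None) for each OHCI controller found.

    sysfs_root is a parameter so the function can be pointed at a test tree.
    """
    for dev in sorted(Path(sysfs_root).glob("*")):
        class_file = dev / "class"
        if not class_file.is_file():
            continue
        code = int(class_file.read_text().strip(), 16)  # e.g. "0x0c0310"
        if usb_controller_kind(code) == "OHCI":
            devspec = dev / "devspec"
            path = None
            if devspec.is_file():
                # devspec holds the device-tree path of the node
                path = devspec.read_text().rstrip("\x00\n")
            yield dev.name, path


if __name__ == "__main__":
    for address, node in find_ohci():
        print(address, node)
```

On a machine without OHCI controllers the script prints nothing; the device-tree path is what a prom_init workaround would ultimately need.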

Re: Pegasos OHCI bug (was Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55)

2010-10-27 Thread Benjamin Herrenschmidt
Since then, the silence has been deafening. My assumption now is that this is not ever getting fixed. I'm certainly not able to fix it. I'm not even a kernel programmer! I got far enough to diagnose the cause just with the "add more printk's and boot it again" technique. Hundreds of reboots

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-22 Thread pacman
Benjamin Herrenschmidt writes: On Wed, 2010-10-20 at 13:33 -0500, pac...@kosh.dhis.org wrote: Just try :-) quiesce is something that AFAIK only Apple ever implemented anyway. It uses hooks inside their OF to shut down all drivers that do bus mastering (among other HW sanitization tasks).

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-20 Thread Benjamin Herrenschmidt
On Tue, 2010-10-19 at 22:23 -0500, pac...@kosh.dhis.org wrote: The diff fragment above applied inside prom_close_stdin, but there are some prom_printf calls after prom_close_stdin. Calling prom_printf after closing stdout sounds like it could be bad. If I moved it down below all the

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-20 Thread pacman
Benjamin Herrenschmidt writes: On Tue, 2010-10-19 at 22:23 -0500, pac...@kosh.dhis.org wrote: The diff fragment above applied inside prom_close_stdin, but there are some prom_printf calls after prom_close_stdin. Calling prom_printf after closing stdout sounds like it could be bad. If

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-20 Thread Benjamin Herrenschmidt
On Wed, 2010-10-20 at 13:33 -0500, pac...@kosh.dhis.org wrote: Just try :-) quiesce is something that AFAIK only Apple ever implemented anyway. It uses hooks inside their OF to shut down all drivers that do bus mastering (among other HW sanitization tasks). I booted a version with a

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Benjamin Herrenschmidt
From there, you might be able to close in on the culprit a bit more, for example, try using the DABR register to set data access breakpoints shortly before the corruption spot. AFAIK, on those old 32-bit CPUs, you can set whether you want it to break on a real or a virtual address. I

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Thomas Gleixner
On Tue, 19 Oct 2010, Helmut Grohne wrote: On Mon, Oct 18, 2010 at 11:55:44PM +0200, Thomas Gleixner wrote: I might be completely one off as usual, but this thing reminds me of a bug I stared at yesterday night: This problem is completely unrelated. My problem was caused by using

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Helmut Grohne
On Mon, Oct 18, 2010 at 11:55:44PM +0200, Thomas Gleixner wrote: I might be completely one off as usual, but this thing reminds me of a bug I stared at yesterday night: This problem is completely unrelated. My problem was caused by using binutils-gold. Helmut

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread pacman
Benjamin Herrenschmidt writes: I thought of that, but as far as I can tell, this CPU doesn't have DABR. AFAIK, the 7447 is just a derivative of the 7450 design which -does- have a DABR ... Unless it's broken :-) Hmm. gdb resorts to single-stepping when I set a watchpoint while debugging

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Segher Boessenkool
I made a new discovery. And this nails it :-) So then I ran dd if=/dev/mem bs=4 count=1 skip=$((0xfc5c080/4)) | od -t x4 a few times very fast, plucking the first affected word directly out of memory by its physical address. The result: The low 16 bits are always zero as before. The
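
As a hedged illustration of the dd invocation quoted above: with bs=4, dd skips whole 4-byte blocks, so the skip count is the physical address divided by 4. The sketch below reproduces that arithmetic and reads one 32-bit word at a byte offset. The helper names `dd_skip_for` and `read_word` are invented here, the word is interpreted as big-endian on the assumption that the host is a big-endian PowerPC (which is what `od -t x4` would show there), and reading the real /dev/mem needs root and may be blocked by kernel configuration.

```python
import struct


def dd_skip_for(phys_addr, block_size=4):
    """Return the dd skip= value that lands exactly on phys_addr."""
    if phys_addr % block_size:
        raise ValueError("address is not aligned to the dd block size")
    return phys_addr // block_size


def read_word(path, phys_addr):
    """Read one 32-bit word at byte offset phys_addr, interpreted big-endian."""
    with open(path, "rb") as f:
        f.seek(phys_addr)
        data = f.read(4)
    if len(data) != 4:
        raise IOError("short read at offset 0x%x" % phys_addr)
    return struct.unpack(">I", data)[0]


if __name__ == "__main__":
    addr = 0xFC5C080  # the address used in the message above
    print("skip=%d" % dd_skip_for(addr))
    # read_word("/dev/mem", addr) would fetch the word itself (root only).
```

Sampling the same address several times in quick succession, as the message describes, is what exposes a value that changes underneath an otherwise clean page.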

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Benjamin Herrenschmidt
On Tue, 2010-10-19 at 13:10 -0500, pac...@kosh.dhis.org wrote: So what type of driver, firmware, or hardware bug puts a 16-bit 1000Hz timer in memory, and does it in little-endian instead of the CPU's native byte order? And why does it stop doing it some time during the early init scripts,

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread Benjamin Herrenschmidt
On Tue, 2010-10-19 at 22:47 +0200, Segher Boessenkool wrote: It looks like it is the frame counter in a USB OHCI HCCA. 16-bit, 1kHz update, offset x'80 in a page. So either the kernel forgot to call quiesce on it, or the firmware doesn't implement that, or the firmware messed up some
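
The diagnosis quoted above can be checked against a sampled word with a few lines of arithmetic. This is a sketch under stated assumptions: the OHCI HCCA keeps a 16-bit little-endian frame number at offset 0x80, followed by a 16-bit pad that the controller writes as zero; the sampled word was read big-endian by the CPU; and pages are 4 KiB. The function names are illustrative only.

```python
HCCA_FRAME_NUMBER_OFFSET = 0x80  # offset of the frame number within the HCCA
PAGE_MASK = 0xFFF                # assumes 4 KiB pages


def decode_hcca_word(word_be):
    """Split a big-endian-read 32-bit word into (frame_number, pad).

    The bytes in memory are little-endian: frame number first, then the pad.
    """
    raw = word_be.to_bytes(4, "big")  # recover the byte order in memory
    frame_number = int.from_bytes(raw[0:2], "little")
    pad = int.from_bytes(raw[2:4], "little")
    return frame_number, pad


def frames_elapsed(earlier, later):
    """Frames between two samples of the 16-bit counter, allowing for wrap."""
    return (later - earlier) & 0xFFFF


def at_frame_number_offset(phys_addr):
    """True if phys_addr sits at offset 0x80 within its page."""
    return (phys_addr & PAGE_MASK) == HCCA_FRAME_NUMBER_OFFSET


if __name__ == "__main__":
    print(at_frame_number_offset(0xFC5C080))
    print(decode_hcca_word(0x34120000))
```

A big-endian CPU sees the frame number byte-swapped in the upper half of the word and the zero pad in the lower half, which is consistent with the reported observation that the low 16 bits are always zero.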

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-19 Thread pacman
Benjamin Herrenschmidt writes: On Tue, 2010-10-19 at 22:47 +0200, Segher Boessenkool wrote: It looks like it is the frame counter in a USB OHCI HCCA. 16-bit, 1kHz update, offset x'80 in a page. So either the kernel forgot to call quiesce on it, or the firmware doesn't implement

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Mel Gorman
On Wed, Oct 13, 2010 at 12:52:05PM -0500, pac...@kosh.dhis.org wrote: Mel Gorman writes: On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: It's corruption of user memory, which is unusual. I'd be wondering if there was a pre-existing bug which 6dda9d55bf545013597 has

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread pacman
Mel Gorman writes: A bit but I still don't know why it would cause corruption. Maybe this is still a caching issue but the difference in timing between list_add and list_add_tail is enough to hide the bug. It's also possible there are some registers ioremapped after the memmap array and

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Andrew Morton
On Mon, 18 Oct 2010 12:33:31 +0100 Mel Gorman m...@csn.ul.ie wrote: A bit but I still don't know why it would cause corruption. Maybe this is still a caching issue but the difference in timing between list_add and list_add_tail is enough to hide the bug. It's also possible there are some

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Benjamin Herrenschmidt
On Wed, 2010-10-13 at 15:40 +0100, Mel Gorman wrote: This is somewhat contrived but I can see how it might happen even on one CPU particularly if the L1 cache is virtual and is loose about checking physical tags. How sensitive/vulnerable is PPC32 to such things? I can not tell you

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Benjamin Herrenschmidt
On Mon, 2010-10-18 at 12:37 -0700, Andrew Morton wrote: Well, you've spotted a bug so I'd say we fix it asap. It's a bit of a shame that we lose the only known way of reproducing a different bug, but presumably that will come back and bite someone else one day, and we'll fix it then :(

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Benjamin Herrenschmidt
On Mon, 2010-10-18 at 14:10 -0500, pac...@kosh.dhis.org wrote: I've been flailing around quite a bit. Here's my latest result: Since I can view the corruption with md5sum /sbin/e2fsck, I know it's in a clean cached page. So I made an extra copy of /sbin/e2fsck, which won't be loaded into

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread pacman
Benjamin Herrenschmidt writes: You can do something fun... like a timer interrupt that peeks at those physical addresses from the linear mapping for example, and try to find out when they get set to the wrong value (you should observe the load from disk, then the corruption, unless they end

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-18 Thread Thomas Gleixner
On Mon, 18 Oct 2010, Andrew Morton wrote: On Mon, 18 Oct 2010 12:33:31 +0100 Mel Gorman m...@csn.ul.ie wrote: A bit but I still don't know why it would cause corruption. Maybe this is still a caching issue but the difference in timing between list_add and list_add_tail is enough

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-13 Thread Mel Gorman
On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: (cc linuxppc-dev@lists.ozlabs.org) On Mon, 11 Oct 2010 15:30:22 +0100 Mel Gorman m...@csn.ul.ie wrote: On Sat, Oct 09, 2010 at 04:57:18AM -0500, pac...@kosh.dhis.org wrote: (What a big Cc: list... scripts/get_maintainer.pl

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-13 Thread pacman
Mel Gorman writes: On Mon, Oct 11, 2010 at 02:00:39PM -0700, Andrew Morton wrote: It's corruption of user memory, which is unusual. I'd be wondering if there was a pre-existing bug which 6dda9d55bf545013597 has exposed - previously the corruption was hitting something harmless.

Re: PROBLEM: memory corrupting bug, bisected to 6dda9d55

2010-10-11 Thread Andrew Morton
(cc linuxppc-dev@lists.ozlabs.org) On Mon, 11 Oct 2010 15:30:22 +0100 Mel Gorman m...@csn.ul.ie wrote: On Sat, Oct 09, 2010 at 04:57:18AM -0500, pac...@kosh.dhis.org wrote: (What a big Cc: list... scripts/get_maintainer.pl made me do it.) This will be a long story with a weak conclusion,