Re: 3.2.0-rc1 panic on PowerPC
On 2011.11.21 at 12:25 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2011-11-20 at 17:17 -0800, Christian Kujau wrote:
> > On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> > > I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> > > couldn't capture the oops log at the time.
> >
> > It just happened again today, after heavy CPU & IO load (rsyncing from/to
> > external disks on dm-crypt). This time the oops was printed on the screen
> > but nothing on netconsole:
> >
> >   http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG
> >
> > It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a
> > random corruption due to hardware issues...?
>
> Yeah it's starting to look like a pattern. Your latest oops looks a lot
> like the one I had (though it was with tg3 on the g5), ie, vfs_read ->
> driver -> allocator -> crash.

I might be seeing a similar issue on x86_64. See:
http://thread.gmane.org/gmane.linux.kernel.mm/70254

--
Markus
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: 3.2.0-rc1 panic on PowerPC
On Mon, 21 Nov 2011 at 12:51, Benjamin Herrenschmidt wrote:
> BTW. SLUB or SLAB ? Mine was SLUB with SLUB_DEBUG enabled (tho the debug
> didn't seem to catch anything).

SLUB, and SLUB_DEBUG=y (but w/o SLUB_DEBUG_ON and SLUB_STATS). Full config
here: http://nerdbynature.de/bits/3.2.0-rc1/oops/config.txt

I'm compiling today's git checkout (mainline) with more debug settings
enabled[0], let's see if this helps anything.

Christian.

[0] diff to old config
+CONFIG_RT_MUTEX_TESTER=y
+CONFIG_DEBUG_LOCKDEP=y
+CONFIG_DEBUG_HIGHMEM=y
+CONFIG_DEBUG_INFO=y
+CONFIG_DEBUG_VM=y
+CONFIG_DEBUG_WRITECOUNT=y
+CONFIG_DEBUG_LIST=y
+CONFIG_ATOMIC64_SELFTEST=y
+CONFIG_XMON=y
+CONFIG_XMON_DEFAULT=y
+CONFIG_XMON_DISASSEMBLY=y
+CONFIG_DEBUGGER=y

--
BOFH excuse #242:

Software uses US measurements, but the OS is in metric...
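[Editorial note] A list like the "diff to old config" above can be produced mechanically from two kernel config files. The following is only a minimal Python sketch and not something posted in the thread: the function names parse_config and added_options and the sample strings are hypothetical, and it assumes the usual .config layout where enabled options read CONFIG_FOO=value and disabled ones read "# CONFIG_FOO is not set".

```python
def parse_config(text):
    """Return {option: value} for every enabled CONFIG_ line in a .config."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options are comments ("# CONFIG_FOO is not set"): skip them.
        if not line.startswith("CONFIG_") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        options[name] = value
    return options


def added_options(old_text, new_text):
    """List options enabled in new_text that were not enabled in old_text."""
    old = parse_config(old_text)
    new = parse_config(new_text)
    return [f"+{name}={new[name]}" for name in new if name not in old]


if __name__ == "__main__":
    # Hypothetical sample data, not the real configs from the thread.
    old_cfg = "CONFIG_SLUB=y\nCONFIG_SLUB_DEBUG=y\n# CONFIG_DEBUG_VM is not set\n"
    new_cfg = "CONFIG_SLUB=y\nCONFIG_SLUB_DEBUG=y\nCONFIG_DEBUG_VM=y\nCONFIG_DEBUG_LIST=y\n"
    for entry in added_options(old_cfg, new_cfg):
        print(entry)
```

The sketch only reports newly enabled options; options whose value changed or that were switched off would need a separate comparison.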
Re: 3.2.0-rc1 panic on PowerPC
On Mon, 2011-11-21 at 12:25 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2011-11-20 at 17:17 -0800, Christian Kujau wrote:
> > On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> > > I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> > > couldn't capture the oops log at the time.
> >
> > It just happened again today, after heavy CPU & IO load (rsyncing from/to
> > external disks on dm-crypt). This time the oops was printed on the screen
> > but nothing on netconsole:
> >
> >   http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG
> >
> > It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a
> > random corruption due to hardware issues...?
>
> Yeah it's starting to look like a pattern. Your latest oops looks a lot
> like the one I had (though it was with tg3 on the g5), ie, vfs_read ->
> driver -> allocator -> crash.
>
> > Any debug or boot options to set in my next kernel build?
>
> Well, you can turn everything on and see whether that makes any difference
> or finds something a bit more precisely.

BTW. SLUB or SLAB ? Mine was SLUB with SLUB_DEBUG enabled (tho the debug
didn't seem to catch anything).

Cheers,
Ben.
Re: 3.2.0-rc1 panic on PowerPC
On Sun, 2011-11-20 at 17:17 -0800, Christian Kujau wrote:
> On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> > I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> > couldn't capture the oops log at the time.
>
> It just happened again today, after heavy CPU & IO load (rsyncing from/to
> external disks on dm-crypt). This time the oops was printed on the screen
> but nothing on netconsole:
>
>   http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG
>
> It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a
> random corruption due to hardware issues...?

Yeah it's starting to look like a pattern. Your latest oops looks a lot
like the one I had (though it was with tg3 on the g5), ie, vfs_read ->
driver -> allocator -> crash.

> Any debug or boot options to set in my next kernel build?

Well, you can turn everything on and see whether that makes any difference
or finds something a bit more precisely.

Cheers,
Ben.

> Thanks,
> Christian.
>
> > Looks like there's some kind of memory corruption happening. So far I
> > haven't been able to get a good target at what could be causing it.
Re: 3.2.0-rc1 panic on PowerPC
On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> couldn't capture the oops log at the time.

It just happened again today, after heavy CPU & IO load (rsyncing from/to
external disks on dm-crypt). This time the oops was printed on the screen
but nothing on netconsole:

  http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG

It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a
random corruption due to hardware issues...?

Any debug or boot options to set in my next kernel build?

Thanks,
Christian.

> Looks like there's some kind of memory corruption happening. So far I
> haven't been able to get a good target at what could be causing it.

--
BOFH excuse #90:

Budget cuts
Re: 3.2.0-rc1 panic on PowerPC
On Sun, 2011-11-20 at 15:31 -0800, Christian Kujau wrote:
> On Tue, 15 Nov 2011 at 00:44, Christian Kujau wrote:
> > I noticed a few crashes on this PowerBook G4 lately, starting somewhere in
> > 3.2.0-rc1. The crashes are really rare and as I'm not on the system all
> > the time I did not notice most of them. By the time I did, the screen was
> > blank already and I had to hard-reset the box. But not this time:
> >
> >   http://nerdbynature.de/bits/3.2.0-rc1/oops/
> >
> > When the crash occurred, the system was fairly loaded (CPU and disk I/O
> > wise), so that may have triggered it. I tried to type off the stack trace,
> > I hope there are not too many typos, see below.
> >
> > The machine is fairly old, so maybe it's "just" bad RAM or something, I
> > wouldn't be surprised. But maybe not, the box is pretty stable most of the
> > time and only now I notice these rare crashes.
>
> Happened again with 3.2.0-rc2-00027-gff0ff78, this time with netconsole
> enabled. But this time the machine just stopped, w/o any output on the
> screen or on netconsole :(

I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
couldn't capture the oops log at the time.

Looks like there's some kind of memory corruption happening. So far I
haven't been able to get a good target at what could be causing it.

Cheers,
Ben.

> Christian.
>
> > If anyone could take a quick look...?
> >
> > Thank you,
> > Christian.
> >
> > Instruction dump:
> > 92c40008 6801 0f00 8004 543c 9004 817f000c 380b
> > 901f000c 2f09 81640018 81440014 <916a0004> 914b 92840014 92a49918
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Call Trace:
> > show_stack+0x70/0x1bc (unreliable)
> > panic+0xc8/0x220
> > die+0x2ac/0x2b8
> > bad_page_fault+0xbc/0x104
> > handle_page_fault+0x7c/0x80
> > Exception: 300 at T.975+0x3f4/0x570
> > LR = T.957+0x300/0x570
> > kmem_cache_alloc+0x150/0x150
> > __aloc_skb+0x50/0x148
> > tcp_send_ack+0x35/0x138
> > tcp_delay_timer+0x140/0x244
> > run_timer_softirq+0x1a0/0x2ec
> > __do_softirq+0xf4/0x1bc
> > call_do_softirq+0x14/0x24
> > do_softirq+0xfc/0x128
> > irq_exit+0xa0/0xa4
> > timer_interrupt+0x148/0x180
> > ret_from_except+0x0/0x14
> > cpu_idle+0xa0/0x118
> > rest_init+0xf0/0x114
> > start_kernel+0x2d0/0x2f0
> > 0x3444
> > Rebooting in 180 seconds..
Re: 3.2.0-rc1 panic on PowerPC
On Tue, 15 Nov 2011 at 00:44, Christian Kujau wrote:
> I noticed a few crashes on this PowerBook G4 lately, starting somewhere in
> 3.2.0-rc1. The crashes are really rare and as I'm not on the system all
> the time I did not notice most of them. By the time I did, the screen was
> blank already and I had to hard-reset the box. But not this time:
>
>   http://nerdbynature.de/bits/3.2.0-rc1/oops/
>
> When the crash occurred, the system was fairly loaded (CPU and disk I/O
> wise), so that may have triggered it. I tried to type off the stack trace,
> I hope there are not too many typos, see below.
>
> The machine is fairly old, so maybe it's "just" bad RAM or something, I
> wouldn't be surprised. But maybe not, the box is pretty stable most of the
> time and only now I notice these rare crashes.

Happened again with 3.2.0-rc2-00027-gff0ff78, this time with netconsole
enabled. But this time the machine just stopped, w/o any output on the
screen or on netconsole :(

Christian.

> If anyone could take a quick look...?
>
> Thank you,
> Christian.
>
> Instruction dump:
> 92c40008 6801 0f00 8004 543c 9004 817f000c 380b
> 901f000c 2f09 81640018 81440014 <916a0004> 914b 92840014 92a49918
> Kernel panic - not syncing: Fatal exception in interrupt
> Call Trace:
> show_stack+0x70/0x1bc (unreliable)
> panic+0xc8/0x220
> die+0x2ac/0x2b8
> bad_page_fault+0xbc/0x104
> handle_page_fault+0x7c/0x80
> Exception: 300 at T.975+0x3f4/0x570
> LR = T.957+0x300/0x570
> kmem_cache_alloc+0x150/0x150
> __aloc_skb+0x50/0x148
> tcp_send_ack+0x35/0x138
> tcp_delay_timer+0x140/0x244
> run_timer_softirq+0x1a0/0x2ec
> __do_softirq+0xf4/0x1bc
> call_do_softirq+0x14/0x24
> do_softirq+0xfc/0x128
> irq_exit+0xa0/0xa4
> timer_interrupt+0x148/0x180
> ret_from_except+0x0/0x14
> cpu_idle+0xa0/0x118
> rest_init+0xf0/0x114
> start_kernel+0x2d0/0x2f0
> 0x3444
> Rebooting in 180 seconds..

--
BOFH excuse #387:

Your computer's union contract is set to expire at midnight.
3.2.0-rc1 panic on PowerPC
Hi,

I noticed a few crashes on this PowerBook G4 lately, starting somewhere in
3.2.0-rc1. The crashes are really rare and as I'm not on the system all
the time I did not notice most of them. By the time I did, the screen was
blank already and I had to hard-reset the box. But not this time:

  http://nerdbynature.de/bits/3.2.0-rc1/oops/

When the crash occurred, the system was fairly loaded (CPU and disk I/O
wise), so that may have triggered it. I tried to type off the stack trace,
I hope there are not too many typos, see below.

The machine is fairly old, so maybe it's "just" bad RAM or something, I
wouldn't be surprised. But maybe not, the box is pretty stable most of the
time and only now I notice these rare crashes.

If anyone could take a quick look...?

Thank you,
Christian.

Instruction dump:
92c40008 6801 0f00 8004 543c 9004 817f000c 380b
901f000c 2f09 81640018 81440014 <916a0004> 914b 92840014 92a49918
Kernel panic - not syncing: Fatal exception in interrupt
Call Trace:
show_stack+0x70/0x1bc (unreliable)
panic+0xc8/0x220
die+0x2ac/0x2b8
bad_page_fault+0xbc/0x104
handle_page_fault+0x7c/0x80
Exception: 300 at T.975+0x3f4/0x570
LR = T.957+0x300/0x570
kmem_cache_alloc+0x150/0x150
__aloc_skb+0x50/0x148
tcp_send_ack+0x35/0x138
tcp_delay_timer+0x140/0x244
run_timer_softirq+0x1a0/0x2ec
__do_softirq+0xf4/0x1bc
call_do_softirq+0x14/0x24
do_softirq+0xfc/0x128
irq_exit+0xa0/0xa4
timer_interrupt+0x148/0x180
ret_from_except+0x0/0x14
cpu_idle+0xa0/0x118
rest_init+0xf0/0x114
start_kernel+0x2d0/0x2f0
0x3444
Rebooting in 180 seconds..

--
BOFH excuse #184:

loop found in loop in redundant loopback
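[Editorial note] Since the call trace above was typed off a photo, it may help to turn it into structured data before comparing it with the later oopses. What follows is a minimal Python sketch under stated assumptions, not a tool used in the thread: the names FRAME_RE and parse_frames are hypothetical, and it assumes each frame line starts with symbol+0xOFFSET/0xSIZE, as the entries in the trace do.

```python
import re

# Matches entries such as "panic+0xc8/0x220": symbol, offset, function size.
FRAME_RE = re.compile(
    r"^(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(trace_text):
    """Return (symbol, offset, size) tuples for every frame line found."""
    frames = []
    for line in trace_text.splitlines():
        # Drop mail quote markers ("> > ") and surrounding whitespace.
        line = line.lstrip("> ").strip()
        match = FRAME_RE.match(line)
        # Lines like "Call Trace:" or "Exception: 300 at ..." do not start
        # with a frame entry and are skipped.
        if match:
            frames.append(
                (match["symbol"], int(match["offset"], 16), int(match["size"], 16))
            )
    return frames


if __name__ == "__main__":
    sample = "Call Trace:\nshow_stack+0x70/0x1bc (unreliable)\npanic+0xc8/0x220\n"
    for symbol, offset, size in parse_frames(sample):
        print(f"{symbol}: offset {offset:#x} of {size:#x}")
```

The parser takes the text as given, so any transcription slips in the hand-typed trace carry through; the photos linked in the message remain the reference.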