Re: 3.2.0-rc1 panic on PowerPC

2011-11-21 Thread Markus Trippelsdorf
On 2011.11.21 at 12:25 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2011-11-20 at 17:17 -0800, Christian Kujau wrote:
> > On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> > > I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> > > couldn't capture the oops log at the time.
> > 
> > It just happened again today, after heavy CPU & IO load (rsyncing from/to 
> > external disks on dm-crypt). This time the oops was printed on the screen 
> > but nothing on netconsole:
> > 
> > http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG
> > 
> > It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a 
> > random corruption due to hardware issues...?
> 
> Yeah it's starting to look like a pattern. Your latest oops looks a lot
> like the one I had (though it was with tg3 on the g5), ie, vfs_read ->
> driver -> allocator -> crash.

I might be seeing a similar issue on x86_64. See:
http://thread.gmane.org/gmane.linux.kernel.mm/70254

-- 
Markus
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: 3.2.0-rc1 panic on PowerPC

2011-11-20 Thread Christian Kujau
On Mon, 21 Nov 2011 at 12:51, Benjamin Herrenschmidt wrote:
> BTW. SLUB or SLAB ? Mine was SLUB with SLUB_DEBUG enabled (tho the debug
> didn't seem to catch anything).

SLUB, and SLUB_DEBUG=y (but w/o SLUB_DEBUG_ON and SLUB_STATS). Full config 
here: http://nerdbynature.de/bits/3.2.0-rc1/oops/config.txt
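
With CONFIG_SLUB_DEBUG=y but without CONFIG_SLUB_DEBUG_ON, the SLUB checks
are compiled in but stay off unless requested at boot. As a hedged sketch
(flag letters as described in the kernel's SLUB documentation; whether they
would catch this particular corruption is an open question), the kernel
command line could include:

```text
slub_debug=FZPU
```

Here F enables sanity checks, Z red zoning, P poisoning and U tracking of
the last alloc/free owner.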

I'm compiling today's git checkout (mainline) with more debug settings 
enabled[0], let's see if this helps anything.

Christian.

[0] diff to old config
+CONFIG_RT_MUTEX_TESTER=y
+CONFIG_DEBUG_LOCKDEP=y
+CONFIG_DEBUG_HIGHMEM=y
+CONFIG_DEBUG_INFO=y
+CONFIG_DEBUG_VM=y
+CONFIG_DEBUG_WRITECOUNT=y
+CONFIG_DEBUG_LIST=y
+CONFIG_ATOMIC64_SELFTEST=y
+CONFIG_XMON=y
+CONFIG_XMON_DEFAULT=y
+CONFIG_XMON_DISASSEMBLY=y
+CONFIG_DEBUGGER=y

-- 
BOFH excuse #242:

Software uses US measurements, but the OS is in metric...


Re: 3.2.0-rc1 panic on PowerPC

2011-11-20 Thread Benjamin Herrenschmidt
On Mon, 2011-11-21 at 12:25 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2011-11-20 at 17:17 -0800, Christian Kujau wrote:
> > On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> > > I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> > > couldn't capture the oops log at the time.
> > 
> > It just happened again today, after heavy CPU & IO load (rsyncing from/to 
> > external disks on dm-crypt). This time the oops was printed on the screen 
> > but nothing on netconsole:
> > 
> > http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG
> > 
> > It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a 
> > random corruption due to hardware issues...?
> 
> Yeah it's starting to look like a pattern. Your latest oops looks a lot
> like the one I had (though it was with tg3 on the g5), ie, vfs_read ->
> driver -> allocator -> crash.
> 
> > Any debug or boot options to set in my next kernel build?
> 
> Well, you can turn everything on and see whether that makes any difference
> or pins something down a bit more precisely.

BTW. SLUB or SLAB ? Mine was SLUB with SLUB_DEBUG enabled (tho the debug
didn't seem to catch anything).

Cheers,
Ben.




Re: 3.2.0-rc1 panic on PowerPC

2011-11-20 Thread Benjamin Herrenschmidt
On Sun, 2011-11-20 at 17:17 -0800, Christian Kujau wrote:
> On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> > I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> > couldn't capture the oops log at the time.
> 
> It just happened again today, after heavy CPU & IO load (rsyncing from/to 
> external disks on dm-crypt). This time the oops was printed on the screen 
> but nothing on netconsole:
> 
> http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG
> 
> It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a 
> random corruption due to hardware issues...?

Yeah it's starting to look like a pattern. Your latest oops looks a lot
like the one I had (though it was with tg3 on the g5), ie, vfs_read ->
driver -> allocator -> crash.

> Any debug or boot options to set in my next kernel build?

Well, you can turn everything on and see whether that makes any difference
or pins something down a bit more precisely.

Cheers,
Ben.

> Thanks,
> Christian.
> 
> > Looks like there's some kind of memory corruption happening. So far I
> > haven't been able to get a good lead on what could be causing it.
> 




Re: 3.2.0-rc1 panic on PowerPC

2011-11-20 Thread Christian Kujau
On Mon, 21 Nov 2011 at 11:58, Benjamin Herrenschmidt wrote:
> I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
> couldn't capture the oops log at the time.

It just happened again today, after heavy CPU & IO load (rsyncing from/to 
external disks on dm-crypt). This time the oops was printed on the screen 
but nothing on netconsole:

http://nerdbynature.de/bits/3.2.0-rc1/oops/oops3m.JPG

It looks like the oops I reported earlier (oops2m.JPG) so I doubt it's a 
random corruption due to hardware issues...?

Any debug or boot options to set in my next kernel build?

Thanks,
Christian.

> Looks like there's some kind of memory corruption happening. So far I
> haven't been able to get a good lead on what could be causing it.

-- 
BOFH excuse #90:

Budget cuts


Re: 3.2.0-rc1 panic on PowerPC

2011-11-20 Thread Benjamin Herrenschmidt
On Sun, 2011-11-20 at 15:31 -0800, Christian Kujau wrote:
> On Tue, 15 Nov 2011 at 00:44, Christian Kujau wrote:
> > I noticed a few crashes on this PowerBook G4 lately, starting somewhere in 
> > 3.2.0-rc1. The crashes are really rare and as I'm not on the system all 
> > the time I did not notice most of them. By the time I did, the screen was 
> > blank already and I had to hard-reset the box. But not this time:
> > 
> >   http://nerdbynature.de/bits/3.2.0-rc1/oops/
> > 
> > When the crash occurred, the system was fairly loaded (CPU and disk I/O
> > wise), so that may have triggered it. I tried to type off the stack trace, 
> > I hope there are not too many typos, see below.
> > 
> > The machine is fairly old, so maybe it's "just" bad RAM or something, I 
> > wouldn't be surprised. But maybe not, the box is pretty stable most of the
> > time and only now I notice these rare crashes.
> 
> Happened again with 3.2.0-rc2-00027-gff0ff78, this time with netconsole 
> enabled. But this time the machine just stopped, w/o any output on the 
> screen or on netconsole :(

I've seen something similar with 3.2-rc2 at cfcfc9ec, unfortunately I
couldn't capture the oops log at the time.

Looks like there's some kind of memory corruption happening. So far I
haven't been able to get a good lead on what could be causing it.

Cheers,
Ben.

> Christian.
> 
> > If anyone could take a quick look...?
> > 
> > Thank you,
> > Christian.
> > 
> > Instruction dump:
> > 92c40008 6801 0f00 8004 543c 9004 817f000c 380b
> > 901f000c 2f09 81640018 81440014 <916a0004> 914b 92840014 92a49918
> > Kernel panic - not syncing: Fatal exception in interrupt
> > Call Trace:
> > show_stack+0x70/0x1bc (unreliable)
> > panic+0xc8/0x220
> > die+0x2ac/0x2b8
> > bad_page_fault+0xbc/0x104
> > handle_page_fault+0x7c/0x80
> > Exception: 300 at T.975+0x3f4/0x570
> > LR = T.957+0x300/0x570
> > kmem_cache_alloc+0x150/0x150
> > __alloc_skb+0x50/0x148
> > tcp_send_ack+0x35/0x138
> > tcp_delack_timer+0x140/0x244
> > run_timer_softirq+0x1a0/0x2ec
> > __do_softirq+0xf4/0x1bc
> > call_do_softirq+0x14/0x24
> > do_softirq+0xfc/0x128
> > irq_exit+0xa0/0xa4
> > timer_interrupt+0x148/0x180
> > ret_from_except+0x0/0x14
> > cpu_idle+0xa0/0x118
> > rest_init+0xf0/0x114
> > start_kernel+0x2d0/0x2f0
> > 0x3444
> > Rebooting in 180 seconds..
> > 
> > -- 
> > BOFH excuse #184:
> > 
> > loop found in loop in redundant loopback
> > 
> 




Re: 3.2.0-rc1 panic on PowerPC

2011-11-20 Thread Christian Kujau
On Tue, 15 Nov 2011 at 00:44, Christian Kujau wrote:
> I noticed a few crashes on this PowerBook G4 lately, starting somewhere in 
> 3.2.0-rc1. The crashes are really rare and as I'm not on the system all 
> the time I did not notice most of them. By the time I did, the screen was 
> blank already and I had to hard-reset the box. But not this time:
> 
>   http://nerdbynature.de/bits/3.2.0-rc1/oops/
> 
> When the crash occurred, the system was fairly loaded (CPU and disk I/O
> wise), so that may have triggered it. I tried to type off the stack trace, 
> I hope there are not too many typos, see below.
> 
> The machine is fairly old, so maybe it's "just" bad RAM or something, I 
> wouldn't be surprised. But maybe not, the box is pretty stable most of the
> time and only now I notice these rare crashes.

Happened again with 3.2.0-rc2-00027-gff0ff78, this time with netconsole 
enabled. But this time the machine just stopped, w/o any output on the 
screen or on netconsole :(

Christian.

> If anyone could take a quick look...?
> 
> Thank you,
> Christian.
> 
> Instruction dump:
> 92c40008 6801 0f00 8004 543c 9004 817f000c 380b
> 901f000c 2f09 81640018 81440014 <916a0004> 914b 92840014 92a49918
> Kernel panic - not syncing: Fatal exception in interrupt
> Call Trace:
> show_stack+0x70/0x1bc (unreliable)
> panic+0xc8/0x220
> die+0x2ac/0x2b8
> bad_page_fault+0xbc/0x104
> handle_page_fault+0x7c/0x80
> Exception: 300 at T.975+0x3f4/0x570
> LR = T.957+0x300/0x570
> kmem_cache_alloc+0x150/0x150
> __alloc_skb+0x50/0x148
> tcp_send_ack+0x35/0x138
> tcp_delack_timer+0x140/0x244
> run_timer_softirq+0x1a0/0x2ec
> __do_softirq+0xf4/0x1bc
> call_do_softirq+0x14/0x24
> do_softirq+0xfc/0x128
> irq_exit+0xa0/0xa4
> timer_interrupt+0x148/0x180
> ret_from_except+0x0/0x14
> cpu_idle+0xa0/0x118
> rest_init+0xf0/0x114
> start_kernel+0x2d0/0x2f0
> 0x3444
> Rebooting in 180 seconds..
> 
> -- 
> BOFH excuse #184:
> 
> loop found in loop in redundant loopback
> 

-- 
BOFH excuse #387:

Your computer's union contract is set to expire at midnight.


3.2.0-rc1 panic on PowerPC

2011-11-15 Thread Christian Kujau
Hi,

I noticed a few crashes on this PowerBook G4 lately, starting somewhere in 
3.2.0-rc1. The crashes are really rare and as I'm not on the system all 
the time I did not notice most of them. By the time I did, the screen was 
blank already and I had to hard-reset the box. But not this time:

  http://nerdbynature.de/bits/3.2.0-rc1/oops/

When the crash occurred, the system was fairly loaded (CPU and disk I/O
wise), so that may have triggered it. I tried to type off the stack trace, 
I hope there are not too many typos, see below.

The machine is fairly old, so maybe it's "just" bad RAM or something, I 
wouldn't be surprised. But maybe not, the box is pretty stable most of the
time and only now I notice these rare crashes.

If anyone could take a quick look...?

Thank you,
Christian.

Instruction dump:
92c40008 6801 0f00 8004 543c 9004 817f000c 380b
901f000c 2f09 81640018 81440014 <916a0004> 914b 92840014 92a49918
Kernel panic - not syncing: Fatal exception in interrupt
Call Trace:
show_stack+0x70/0x1bc (unreliable)
panic+0xc8/0x220
die+0x2ac/0x2b8
bad_page_fault+0xbc/0x104
handle_page_fault+0x7c/0x80
Exception: 300 at T.975+0x3f4/0x570
LR = T.957+0x300/0x570
kmem_cache_alloc+0x150/0x150
__alloc_skb+0x50/0x148
tcp_send_ack+0x35/0x138
tcp_delack_timer+0x140/0x244
run_timer_softirq+0x1a0/0x2ec
__do_softirq+0xf4/0x1bc
call_do_softirq+0x14/0x24
do_softirq+0xfc/0x128
irq_exit+0xa0/0xa4
timer_interrupt+0x148/0x180
ret_from_except+0x0/0x14
cpu_idle+0xa0/0x118
rest_init+0xf0/0x114
start_kernel+0x2d0/0x2f0
0x3444
Rebooting in 180 seconds..
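
Because the trace above was typed off by hand, a quick sanity check can flag
entries that were probably mistyped: PowerPC instructions are 4 bytes wide,
so every symbol offset should be a multiple of 4 and should not exceed the
function size. The following is only a minimal Python sketch, assuming the
symbol+0xOFFSET/0xSIZE format shown above; the helper name
suspicious_entries is hypothetical and not part of any kernel tooling.

```python
import re

# Matches entries such as "panic+0xc8/0x220" (symbol+offset/size).
ENTRY = re.compile(
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-fA-F]+)/0x(?P<size>[0-9a-fA-F]+)"
)


def suspicious_entries(trace, insn_bytes=4):
    """Return (symbol, reason) pairs for entries that look mistyped.

    insn_bytes=4 is an assumption that holds for PowerPC, where every
    instruction is 4 bytes wide.
    """
    findings = []
    for m in ENTRY.finditer(trace):
        off = int(m.group("off"), 16)
        size = int(m.group("size"), 16)
        if off % insn_bytes:
            findings.append(
                (m.group("sym"), "offset not a multiple of %d" % insn_bytes)
            )
        elif off > size:
            findings.append((m.group("sym"), "offset beyond function size"))
    return findings


# Three entries copied from the trace above.
trace = """
panic+0xc8/0x220
tcp_send_ack+0x35/0x138
run_timer_softirq+0x1a0/0x2ec
"""

for sym, reason in suspicious_entries(trace):
    print(sym, reason)
```

On the three sample entries this reports only tcp_send_ack, whose offset
0x35 cannot be a real instruction address; what the correct value was
cannot be recovered from the text alone.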

-- 
BOFH excuse #184:

loop found in loop in redundant loopback