Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Paul Mackerras wrote:
> Kamalesh Babulal writes:
>
>> Thanks, after applying the patch the oops is not reproducible on the machine. The console log had no message starting with SLB: or FWNMI:. I have updated the bugzilla also.
>>
>> Tested-by: Kamalesh Babulal [EMAIL PROTECTED]
>
> Could you test Linus' current git tree and see if you can reproduce the same problem now? The patch I sent upstream was a little different from the one you tested, though it should have the same effect, and I would like to be sure that it is just as effective at fixing the bug.
>
> Thanks,
> Paul.

Hi Paul,

The patch has the same effect. I tested the 2.6.26-rc1-git7 kernel and the oops is not reproducible.

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Kamalesh Babulal writes:

> Thanks, after applying the patch the oops is not reproducible on the machine. The console log had no message starting with SLB: or FWNMI:. I have updated the bugzilla also.
>
> Tested-by: Kamalesh Babulal [EMAIL PROTECTED]

Could you test Linus' current git tree and see if you can reproduce the same problem now? The patch I sent upstream was a little different from the one you tested, though it should have the same effect, and I would like to be sure that it is just as effective at fixing the bug.

Thanks,
Paul.
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Paul Mackerras wrote:
> Kamalesh Babulal writes:
snip
> Please try to reproduce the problem with this patch applied, and if there are any console log messages starting with SLB: or FWNMI:, please send me the console log.
>
> Paul.
snip

Hi Paul,

Thanks, after applying the patch the oops is not reproducible on the machine. The console log had no message starting with SLB: or FWNMI:. I have updated the bugzilla also.

Tested-by: Kamalesh Babulal [EMAIL PROTECTED]
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Kamalesh Babulal writes:

> After applying the patch above and the patch posted on http://lkml.org/lkml/2008/4/8/42 the bug had the following information,

Thanks. The patch below, against Linus' current git tree, fixes one bug that might be the cause of the problem, attempts to detect the erroneous situation earlier and fix it up, and prints some debug information. Please try to reproduce the problem with this patch applied, and if there are any console log messages starting with SLB: or FWNMI:, please send me the console log.

Paul.

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index c0db5b7..f7f0962 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -439,6 +439,19 @@ END_FTR_SECTION_IFSET(CPU_FTR_1T_SEGMENT)
 	mr	r1,r8		/* start using new stack pointer */
 	std	r7,PACAKSAVE(r13)
 
+	/* check that SLB entry 2 contains the right thing */
+	clrrdi	r6,r1,28
+	clrldi.	r0,r6,2
+	beq	3f
+	li	r0,2
+	slbmfee	r7,r0
+	oris	r6,r6,[EMAIL PROTECTED]
+	cmpd	r6,r7
+	beq	3f
+	bl	bad_slb_switch
+	ld	r3,PACACURRENT(r13)
+	addi	r3,r3,THREAD
+3:
 	ld	r6,_CCR(r1)
 	mtcrf	0xFF,r6
 
@@ -540,6 +553,19 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 	ld	r4,_XER(r1)
 	mtspr	SPRN_XER,r4
 
+	/* check that SLB entry 2 contains the right thing */
+	clrrdi	r6,r1,28	/* stack ESID */
+	clrldi.	r0,r6,2
+	beq	57f
+	li	r0,2
+	slbmfee	r7,r0
+	oris	r6,r6,[EMAIL PROTECTED]
+	cmpd	r6,r7
+	beq	57f
+	addi	r3,r1,STACK_FRAME_OVERHEAD
+	bl	bad_slb_exc
+	ld	r3,_MSR(r1)
+57:
 	REST_8GPRS(5, r1)
 
 	andi.	r0,r3,MSR_RI
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index be35ffa..c938134 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -45,6 +45,7 @@
 #include <asm/system.h>
 #include <asm/mpic.h>
 #include <asm/vdso_datapage.h>
+#include <asm/mmu.h>
 #ifdef CONFIG_PPC64
 #include <asm/paca.h>
 #endif
@@ -580,6 +581,10 @@ int __devinit start_secondary(void *unused)
 	atomic_inc(&init_mm.mm_count);
 	current->active_mm = &init_mm;
 
+	/* Bolt in the entry for the kernel stack now */
+	if (cpu_has_feature(CPU_FTR_SLB))
+		slb_flush_and_rebolt();
+
 	smp_store_cpu_info(cpu);
 	set_dec(tb_ticks_per_jiffy);
 	preempt_disable();
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 906daed..bb7765b 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -309,3 +309,34 @@ void slb_initialize(void)
 	 * one. */
 	asm volatile("isync":::"memory");
 }
+
+static void dump_slb(void)
+{
+	long entry;
+	unsigned long esid, vsid;
+
+	printk(KERN_EMERG "SLB contents now:\n");
+	for (entry = 0; entry < 64; ++entry) {
+		asm volatile("slbmfee %0,%1" : "=r" (esid) : "r" (entry));
+		if (esid == 0)
+			/* valid bit is clear along with everything else */
+			continue;
+		asm volatile("slbmfev %0,%1" : "=r" (vsid) : "r" (entry));
+		printk(KERN_EMERG "%d: %.16lx %.16lx\n", entry, esid, vsid);
+	}
+}
+
+void bad_slb_exc(struct pt_regs *regs)
+{
+	printk(KERN_EMERG "SLB: stack not bolted on exception return\n");
+	dump_slb();
+	slb_flush_and_rebolt();
+	show_regs(regs);
+}
+
+void bad_slb_switch(void)
+{
+	printk(KERN_EMERG "SLB: stack not bolted on context switch\n");
+	dump_slb();
+	slb_flush_and_rebolt();
+}
diff --git a/arch/powerpc/platforms/pseries/ras.c b/arch/powerpc/platforms/pseries/ras.c
index a1ab25c..ed68083 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -325,6 +325,8 @@ static int recover_mce(struct pt_regs *regs, struct rtas_error_log * err)
 
 	if (err->disposition == RTAS_DISP_FULLY_RECOVERED) {
 		/* Platform corrected itself */
+		printk(KERN_ALERT "FWNMI: platform corrected error %.16lx\n",
+		       *(unsigned long *)err);
 		nonfatal = 1;
 	} else if ((regs->msr & MSR_RI) &&
 		   user_mode(regs) &&
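The check that both entry_64.S hunks in this patch add can be modelled in plain C. This is only a sketch under stated assumptions, not code from the patch: it assumes 256 MB segments (the `clrrdi r6,r1,28` shift), and because the operand of `oris` is obscured in this archive, it assumes that operand is the SLB valid bit, taken here as 0x08000000 (the hypothetical `ASSUMED_SLB_ESID_V`). The parameter `slb_esid_word` stands in for the value `slbmfee` reads back from SLB entry 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 256 MB segments: the patch clears the low 28 bits of r1 (clrrdi r6,r1,28). */
#define SEGMENT_SHIFT 28

/* Assumption: the constant OR-ed in by "oris" is the SLB valid bit.
 * The real operand is obscured in the archived patch. */
#define ASSUMED_SLB_ESID_V 0x0000000008000000ULL

/* oris can only set bits 16..31, so the assumed constant must fit there. */
static_assert((ASSUMED_SLB_ESID_V & ~0xffff0000ULL) == 0,
	      "oris only ORs a 16-bit immediate shifted left by 16");

/*
 * Model of the added check: true if SLB entry 2 (slb_esid_word, as read
 * by slbmfee) maps the segment holding the stack pointer sp, or if the
 * check is skipped.
 */
static bool stack_slb_entry_ok(uint64_t sp, uint64_t slb_esid_word)
{
	/* clrrdi r6,r1,28: effective segment address of the stack */
	uint64_t esid = sp & ~((1ULL << SEGMENT_SHIFT) - 1);

	/* clrldi. r0,r6,2; beq: ignoring the top two address bits, zero
	 * means the stack is in the first kernel segment, for which the
	 * patch skips the check (presumably covered by another bolted entry). */
	if ((esid & (UINT64_MAX >> 2)) == 0)
		return true;

	/* oris r6,r6,...; cmpd r6,r7; beq: entry 2 must hold this ESID
	 * with the (assumed) valid bit set. */
	return (esid | ASSUMED_SLB_ESID_V) == slb_esid_word;
}
```

The stack pointers used with it are illustrative; the archived oops logs have had runs of zeros stripped, so full 64-bit values cannot be read from them directly.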
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Kamalesh Babulal writes:

> The SHA1 ID of the kernel is 0e81a8ae37687845f7cdfa2adce14ea6a5f1dd34 (2.6.25-rc8), and the source seems to have the patch 44387e9ff25267c78a99229aca55ed750e9174c7. The kernel was patched with only the patch you gave me (http://lkml.org/lkml/2008/4/8/42).

Please try again with both that patch and the one below. Once again it won't fix the bug but will give us more information. When the oops occurs, the kernel will print a lot of debug information that should help locate the problem.

Paul.

diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index e932b43..f16db50 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -144,6 +144,9 @@ int main(void)
 	DEFINE(PACA_SLBSHADOWPTR, offsetof(struct paca_struct, slb_shadow_ptr));
 	DEFINE(PACA_DATA_OFFSET, offsetof(struct paca_struct, data_offset));
 	DEFINE(PACA_TRAP_SAVE, offsetof(struct paca_struct, trap_save));
+	DEFINE(PACASLBLOG, offsetof(struct paca_struct, slblog));
+	DEFINE(PACASLBLOGIX, offsetof(struct paca_struct, slblog_ix));
+	DEFINE(PACALASTSLB, offsetof(struct paca_struct, last_slb));
 
 	DEFINE(SLBSHADOW_STACKVSID,
 	       offsetof(struct slb_shadow, save_area[SLB_NUM_BOLTED - 1].vsid));
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 148a354..663df17 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -419,6 +419,18 @@ END_FTR_SECTION_IFSET(CPU_FTR_1T_SEGMENT)
 	slbmte	r7,r0
 	isync
 
+	ld	r4,PACASLBLOGIX(r13)
+	addi	r4,r4,1
+	clrldi	r4,r4,64-6
+	std	r4,PACASLBLOGIX(r13)
+	add	r4,r4,r13
+	addi	r4,r4,PACASLBLOG
+	li	r5,4
+	std	r5,0(r4)
+	mftb	r5
+	std	r5,8(r4)
+	std	r6,16(r4)
+	std	r0,24(r4)
 2:
 	clrrdi	r7,r8,THREAD_SHIFT	/* base of new stack */
 	/* Note: this uses SWITCH_FRAME_SIZE rather than INT_FRAME_SIZE
@@ -533,6 +545,17 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 
 	stdcx.	r0,0,r1		/* to clear the reservation */
 
+	li	r4,0
+	slbmfee	r2,r4
+	std	r2,PACALASTSLB(r13)
+	slbmfev	r2,r4
+	std	r2,PACALASTSLB+8(r13)
+	li	r4,1
+	slbmfee	r2,r4
+	std	r2,PACALASTSLB+16(r13)
+	slbmfev	r2,r4
+	std	r2,PACALASTSLB+24(r13)
+
 	/*
 	 * Clear RI before restoring r13.  If we are returning to
 	 * userspace and we take an exception after restoring r13,
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 4b5b7ff..c918f33 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1141,6 +1141,40 @@ void SPEFloatingPointException(struct pt_regs *regs)
 }
 #endif
 
+static void dump_unrecov_slb(void)
+{
+#ifdef CONFIG_PPC64
+	long entry, rstart;
+	unsigned long esid, vsid;
+
+	printk(KERN_EMERG "SLB contents now:\n");
+	for (entry = 0; entry < 64; ++entry) {
+		asm volatile("slbmfee %0,%1" : "=r" (esid) : "r" (entry));
+		if (esid == 0)
+			/* valid bit is clear along with everything else */
+			continue;
+		asm volatile("slbmfev %0,%1" : "=r" (vsid) : "r" (entry));
+		printk(KERN_EMERG "%d: %.16lx %.16lx\n", entry, esid, vsid);
+	}
+
+	printk(KERN_EMERG "SLB 0-1 at last exception exit:\n");
+	printk(KERN_EMERG "0: %.16lx %.16lx\n", get_paca()->last_slb[0][0],
+	       get_paca()->last_slb[0][1]);
+	printk(KERN_EMERG "1: %.16lx %.16lx\n", get_paca()->last_slb[1][0],
+	       get_paca()->last_slb[1][1]);
+	printk(KERN_EMERG "SLB update log:\n");
+	rstart = entry = get_paca()->slblog_ix;
+	do {
+		printk(KERN_EMERG "%d: %lx %lx %.16lx %.16lx\n", entry,
+		       get_paca()->slblog[entry][0],
+		       get_paca()->slblog[entry][1],
+		       get_paca()->slblog[entry][2],
+		       get_paca()->slblog[entry][3]);
+		entry = (entry + 1) % 63;
+	} while (entry != rstart);
+#endif
+}
+
 /*
  * We enter here if we get an unrecoverable exception, that is, one
  * that happened at a point where the RI (recoverable interrupt) bit
@@ -1151,6 +1185,8 @@ void unrecoverable_exception(struct pt_regs *regs)
 {
 	printk(KERN_EMERG "Unrecoverable exception %lx at %lx\n",
 	       regs->trap, regs->nip);
+	if (regs->trap == 0x4100)
+		dump_unrecov_slb();
 	die("Unrecoverable exception", regs, SIGABRT);
 }
 
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index 906daed..235edf7 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -105,6 +105,7 @@ void slb_flush_and_rebolt(void)
 	 * appropriately too. */
 	unsigned long linear_llp, vmalloc_llp, lflags, vflags;
 	unsigned
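The SLB update log that the entry_64.S hunk of this patch writes behaves as a ring buffer: the index is advanced, wrapped by `clrldi r4,r4,64-6` (keep the low 6 bits, so 64 slots), and then a four-word record is stored (the tag 4, the timebase from `mftb`, and the values of r6 and r0). A minimal userspace sketch of that idea follows; `struct slblog`, `slblog_add` and `slblog_dump` are hypothetical names, and the record layout is assumed from the `std` offsets, not taken from the kernel's `paca_struct`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* clrldi r4,r4,64-6 keeps the low 6 bits of the index: 64 slots. */
#define SLBLOG_ENTRIES 64
static_assert((SLBLOG_ENTRIES & (SLBLOG_ENTRIES - 1)) == 0,
	      "masking with ENTRIES - 1 requires a power of two");

/* Hypothetical stand-in for the per-CPU slblog/slblog_ix fields. */
struct slblog {
	uint64_t ix;                     /* slot of the newest record */
	uint64_t rec[SLBLOG_ENTRIES][4]; /* tag, timebase, r6, r0 (assumed) */
};

/* Advance the index first, then fill that slot, as the asm does. */
static void slblog_add(struct slblog *sl, uint64_t tag, uint64_t tb,
		       uint64_t w2, uint64_t w3)
{
	uint64_t i = (sl->ix + 1) & (SLBLOG_ENTRIES - 1);

	sl->ix = i;
	sl->rec[i][0] = tag;
	sl->rec[i][1] = tb;
	sl->rec[i][2] = w2;
	sl->rec[i][3] = w3;
}

/* Print every slot exactly once, oldest first; returns the slots visited. */
static unsigned slblog_dump(const struct slblog *sl)
{
	uint64_t start = (sl->ix + 1) & (SLBLOG_ENTRIES - 1);
	uint64_t i = start;
	unsigned visited = 0;

	do {
		printf("%2llu: %llx %llx %.16llx %.16llx\n",
		       (unsigned long long)i,
		       (unsigned long long)sl->rec[i][0],
		       (unsigned long long)sl->rec[i][1],
		       (unsigned long long)sl->rec[i][2],
		       (unsigned long long)sl->rec[i][3]);
		/* same mask as the writer, so the walk ends back at start */
		i = (i + 1) & (SLBLOG_ENTRIES - 1);
		visited++;
	} while (i != start);
	return visited;
}
```

The sketch deliberately uses one modulus for both writer and reader, so the dump visits each of the 64 slots once; the C dump loop in the patch as posted advances with `% 63` while the assembly wraps at 64. Because the writer advances before storing, the slot after `ix` holds the oldest record once the buffer has wrapped.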
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Kamalesh Babulal writes:

> The kernel oops is seen while running kernbench followed by tbench with the 2.6.25-rc2-git4 kernel on powerpc. This oops was reported for the 2.6.24-rc8-mm1 kernel (http://lkml.org/lkml/2008/1/18/71) and is visible with almost all of the mainline -rc releases and their -git snapshots since then. This oops is also visible in the linux-next-20080220 kernel. The machine is a POWER4+ box with four CPUs and 30 GB of RAM.

Please try to replicate the oops with the patch below applied. It doesn't solve the cause of the oops but it should mean the kernel prints out more useful information about the cause of the oops. I assume you can replicate the oops easily on this machine - is that right?

Paul.

diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
index 11b4f6d..a3ac72a 100644
--- a/arch/powerpc/kernel/head_64.S
+++ b/arch/powerpc/kernel/head_64.S
@@ -621,7 +621,7 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 	mtlr	r10
 
 	andi.	r10,r12,MSR_RI	/* check for unrecoverable exception */
-	beq-	unrecov_slb
+	beq-	2f
 
 .machine	push
 .machine	power4
@@ -643,6 +643,22 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
 	rfid
 	b	.	/* prevent speculative execution */
 
+2:
+#ifdef CONFIG_PPC_ISERIES
+BEGIN_FW_FTR_SECTION
+	b	unrecov_slb
+END_FW_FTR_SECTION_IFSET(FW_FEATURE_ISERIES)
+#endif /* CONFIG_PPC_ISERIES */
+	mfspr	r11,SPRN_SRR0
+	clrrdi	r10,r13,32
+	LOAD_HANDLER(r10,unrecov_slb)
+	mtspr	SPRN_SRR0,r10
+	mfmsr	r10
+	ori	r10,r10,MSR_IR|MSR_DR|MSR_RI
+	mtspr	SPRN_SRR1,r10
+	rfid
+	b	.
+
 unrecov_slb:
 	EXCEPTION_PROLOG_COMMON(0x4100, PACA_EXSLB)
 	DISABLE_INTS
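With this patch, an SLB miss taken with RI clear no longer branches straight to `unrecov_slb`; the new `2:` path loads SRR0 and SRR1 and uses `rfid` to enter the handler with translation and RI turned on. A sketch of that register arithmetic in C, under assumptions: that clearing the low 32 bits of the PACA pointer in r13 yields the kernel base, and that `LOAD_HANDLER` adds the handler's offset from the start of kernel text as a 16-bit immediate. `compute_rfid_target`, the offset and the example values are hypothetical; the MSR bit values are the standard 64-bit PowerPC ones.

```c
#include <assert.h>
#include <stdint.h>

/* Standard 64-bit PowerPC MSR bits used by the patch. */
#define MSR_IR (1ULL << 5) /* instruction relocation */
#define MSR_DR (1ULL << 4) /* data relocation */
#define MSR_RI (1ULL << 1) /* recoverable interrupt */

struct rfid_target {
	uint64_t srr0; /* address rfid jumps to */
	uint64_t srr1; /* MSR value rfid installs */
};

/*
 * Model of the "2:" path (assumptions noted):
 *   clrrdi r10,r13,32             -> base = paca & ~0xffffffff
 *   LOAD_HANDLER(r10,unrecov_slb) -> base + handler_offset (assumed)
 *   mfmsr r10; ori r10,r10,MSR_IR|MSR_DR|MSR_RI
 */
static struct rfid_target compute_rfid_target(uint64_t paca,
					      uint64_t handler_offset,
					      uint64_t cur_msr)
{
	struct rfid_target t;

	/* assumption: the offset fits a 16-bit immediate field */
	assert(handler_offset < (1ULL << 15));
	t.srr0 = (paca & ~0xffffffffULL) + handler_offset;
	t.srr1 = cur_msr | MSR_IR | MSR_DR | MSR_RI;
	return t;
}
```

The point of the design is that the handler then runs with IR and DR set, so it can use the normal exception prolog and print debug information instead of dying in real mode.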
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Paul Mackerras wrote:
> Kamalesh Babulal writes:
snip
> Please try to replicate the oops with the patch below applied. It doesn't solve the cause of the oops but it should mean the kernel prints out more useful information about the cause of the oops. I assume you can replicate the oops easily on this machine - is that right?
>
> Paul.
snip

Hi Paul,

The kernel oopses after applying the patch. Sometimes it takes more than one run to reproduce it; this time it was reproduced in the second run.
Unrecoverable exception 4100 at c0008c8c
Oops: Unrecoverable exception, sig: 6 [#1]
SMP NR_CPUS=128 NUMA pSeries
Modules linked in:
NIP: c0008c8c LR: 0ff0135c CTR: 0ff012f0
REGS: c00772343bb0 TRAP: 4100   Not tainted  (2.6.25-rc8-autotest)
MSR: 80001030 ME,IR,DR  CR: 44044228  XER:
TASK = c0077cfa0900[13437] 'cc1' THREAD: c0077234 CPU: 2
GPR00: 4000 c00772343e30 00bb d032
GPR04: 00bb 0400 000a 0002
GPR08:
GPR12: c0734000 0064 ffe6df08
GPR16: 105b 105b 1044 105b
GPR20: ffe6e008 105b 105b 000a
GPR24: 0ffec408 0001 ffe6ddca 0400
GPR28: 0ffec408 f7ff8000 0ffebff4 0400
NIP [c0008c8c] restore+0x8c/0xc0
LR [0ff0135c] 0xff0135c
Call Trace:
[c00772343e30] [c0008cd4] do_work+0x14/0x2c (unreliable)
Instruction dump:
7c840078 7c810164 70604000 41820028 6000 7c4c42e6 e88d01f0 f84d01f0
7c841050 e84d01e8 7c422214 f84d01e8 e9a100d8 7c7b03a6 e84101a0 7c4ff120

(gdb) l *0xc0008cdc
0xc0008cdc is at arch/powerpc/kernel/entry_64.S:608.
603		mtmsrd	r10,1
604
605		andi.	r0,r4,_TIF_NEED_RESCHED
606		beq	1f
607		bl	.schedule
608		b	.ret_from_except_lite
609
610	1:	bl	.save_nvgprs
611		li	r3,0
612		addi	r4,r1,STACK_FRAME_OVERHEAD

Please let me know if you need more information.
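The `MSR:` line of this oops shows translation on (IR, DR) but RI clear, which is consistent with the 0x4100 path in the head_64.S patch, taken when the SLB miss happens with RI clear. A small decoder makes that explicit. This is a sketch under assumptions: it assumes the archived value `80001030` stands for 0x8000000000001030 (the archive drops runs of zeros), it covers only a handful of standard MSR bits, and `msr_decode` is a hypothetical helper, not the kernel's own formatter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A few standard 64-bit PowerPC MSR bits. */
static const struct {
	uint64_t bit;
	const char *name;
} msr_bits[] = {
	{ 1ULL << 63, "SF" }, /* 64-bit mode */
	{ 1ULL << 12, "ME" }, /* machine check enable */
	{ 1ULL << 5,  "IR" }, /* instruction relocation */
	{ 1ULL << 4,  "DR" }, /* data relocation */
	{ 1ULL << 1,  "RI" }, /* recoverable interrupt */
};

/* Write the names of the set bits, comma separated, into buf (len bytes). */
static void msr_decode(uint64_t msr, char *buf, size_t len)
{
	assert(len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(msr_bits) / sizeof(msr_bits[0]); i++) {
		if (!(msr & msr_bits[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, msr_bits[i].name, len - strlen(buf) - 1);
	}
}
```

Decoding the `80001000` value from the original report the same way gives only SF and ME: both relocation bits are off, which is consistent with the real-mode NIP of 0x4570 in those oopses.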
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Kamalesh Babulal writes:

> The kernel oopses after applying the patch. Sometimes it takes more than one run to reproduce it; this time it was reproduced in the second run.
>
> Unrecoverable exception 4100 at c0008c8c
snip
> NIP [c0008c8c] restore+0x8c/0xc0
> LR [0ff0135c] 0xff0135c
> Call Trace:
> [c00772343e30] [c0008cd4] do_work+0x14/0x2c (unreliable)
snip
> (gdb) l *0xc0008cdc
> 0xc0008cdc is at arch/powerpc/kernel/entry_64.S:608.
snip

The exception happened at c...8c8c but you looked at c...8cdc with gdb. What's at c...8c8c?

> please let me know if you need more information.

The .config would be useful, but don't spam everyone on cc with it; just send it to me privately.

Paul.
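Paul's point can be checked with the `symbol+offset/size` notation in the oops itself: subtracting the offset from an address gives the symbol's start, and adding the size gives its end. A sketch in C using only the low address bits that survive in the archive (0x8c8c and 0x8cd4); `sym_range` is a hypothetical helper, and its parser accepts only the `name+0xOFF/0xSIZE` form.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct sym_range {
	char name[64];
	uint64_t start; /* first address of the symbol */
	uint64_t end;   /* one past its last address */
};

/* Parse "name+0xOFF/0xSIZE" seen at address addr; returns 0 on success. */
static int sym_range(const char *text, uint64_t addr, struct sym_range *r)
{
	unsigned long long off, size;

	if (sscanf(text, "%63[^+]+0x%llx/0x%llx", r->name, &off, &size) != 3)
		return -1;
	assert(off < size); /* the address must lie inside the symbol */
	r->start = addr - off;
	r->end = r->start + size;
	return 0;
}
```

With these numbers, restore spans 0x8c00 to 0x8cc0 and do_work starts at 0x8cc0, so 0x8cdc (the address looked up with gdb) lies inside do_work, not at the faulting instruction in restore.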
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Paul Mackerras wrote:
> Kamalesh Babulal writes:
snip
> The exception happened at c...8c8c but you looked at c...8cdc with gdb. What's at c...8c8c?
>
>> please let me know if you need more information.
>
> The .config would be useful, but don't spam everyone on cc with it; just send it to me privately.
>
> Paul.

Hi Paul,

A similar call trace was seen with the 2.6.24-rc3-git2 kernel during bootup; I have attached the boot log to bugzilla (http://bugzilla.kernel.org/attachment.cgi?id=15666&action=view). When looking for the last good kernel, we found that the oops seems to be reproducible from the 2.6.24-rc8-git3 kernel onwards. Thanks to Nishanth for pointing it out.

Please let me know if you need more information.
Re: [BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Paul Mackerras wrote:
> Kamalesh Babulal writes:
>
>> The kernel oopses after applying the patch. Sometimes it takes more than one run to reproduce it; this time it was reproduced in the second run.
snip
> That looks like the bug that was supposed to be fixed by commit 44387e9ff25267c78a99229aca55ed750e9174c7, which is in 2.6.25-rc7 and later. What was the SHA1 ID of the head commit for the kernel source that gave you this oops? Did you have any other patches besides the one I sent you applied?
>
> Paul.

The SHA1 ID of the kernel is 0e81a8ae37687845f7cdfa2adce14ea6a5f1dd34 (2.6.25-rc8), and the source seems to have the patch 44387e9ff25267c78a99229aca55ed750e9174c7. The kernel was patched with only the patch you gave me (http://lkml.org/lkml/2008/4/8/42).
[BUG] 2.6.25-rc2-git4 - Regression Kernel oops while running kernbench and tbench on powerpc
Hi,

A kernel oops is seen while running kernbench followed by tbench with the 2.6.25-rc2-git4 kernel on powerpc. This oops was reported for the 2.6.24-rc8-mm1 kernel (http://lkml.org/lkml/2008/1/18/71) and is visible with almost all of the mainline -rc releases and their -git snapshots since then. This oops is also visible in the linux-next-20080220 kernel. The machine is a POWER4+ box with four CPUs and 30 GB of RAM.

oops while running kernbench:

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: 4570 LR: 0ff0135c CTR: 0ff012f0
REGS: c00771a678c0 TRAP: 0300   Not tainted  (2.6.25-rc2-autotest-next-20080220)
MSR: 80001000 ME  CR: 28000282  XER:
DAR: c00771a67ce0, DSISR: 0a00
TASK = c0077b9d2c00[22046] 'cc1' THREAD: c00771a64000 CPU: 3
GPR00: 4000 c00771a67b40 0052 d032
GPR04: 0052 0400 106823e0 000e
GPR08: 44000288 c00771a67e30 998be2321500
GPR12: 80001030 c05d7e00 1003 1003
GPR16: 105b 105b 1044 105b
GPR20: 105b 105b 105b 105b
GPR24: 105b41e8 0020 105b 0400
GPR28: 0ffec408 f7ff8000 0ffebff4 0400
NIP [4570] 0x4570
LR [0ff0135c] 0xff0135c
Call Trace:
[c00771a67b40] [c062d558] 0xc062d558 (unreliable)
[c00771a67e08] [f7ff8000] 0xf7ff8000
Instruction dump:
4800 41820008 4810 f92101a0
---[ end trace 26a7439b76b3cbab ]---

oops while running tbench:

Unable to handle kernel paging request for data at address 0xc0077e2e3ce0
Faulting instruction address: 0x4570
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: 4570 LR: c000872c CTR:
REGS: c0077e2e38c0 TRAP: 0300   Tainted: G      D   (2.6.25-rc2-autotest-next-20080220)
MSR: 80001000 ME  CR: 28022822  XER: 2000
DAR: c0077e2e3ce0, DSISR: 0a00
TASK = c0077cb55800[3900] 'zmd' THREAD: c0077e2e CPU: 3
GPR00: c0077e2e3b40 c06a29b8 006e
GPR04: 0fe7bac8 54022888 4000 0fe7bae4
GPR08: d032 44022824 c0077e2e3e30 998be2321500
GPR12: 80001030 c05d7e00 0001
GPR16: f7c16708 f7fa0800 100e0040
GPR20: f7ffe018 0001 f5397550 f5397548
GPR24: 4a19 f7d66474 f53975a8
GPR28: 9433 0fe94ff4 f7d66490
NIP [4570] 0x4570
LR [c000872c] syscall_exit+0x0/0x40
Call Trace:
[c0077e2e3b40] [0496000729b8] 0x496000729b8 (unreliable)
Instruction dump:
4800 41820008 4810 f92101a0

The machine