Re: Kernel oops while duming user core.

2008-02-02 Thread Clemens Koller
Scott Wood wrote:
 On Thu, Jan 31, 2008 at 10:15:27AM -0600, Nathan Lynch wrote:
 Rune Torgersen wrote:
 I get the following kernel core while a user program I have is dumping
 core.
 Any ideas what to look for? (This is running 2.6.24, arch/powerpc on
 an 8280.)
 When running the program on 2.6.18 arch/ppc, the program gets a sig 11
 and dumps core.
 On 2.6.24, I get the kernel oops, and then the program hangs around
 forever and is unkillable.
 Hmm, this is the second report of 2.6.24 crashing in
 __flush_dcache_icache during a core dump; see:
 http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html

 Is this easily recreatable?
 
 Yes, this program does it reliably:
 
 #include <pthread.h>
 #include <stdio.h>
 #include <unistd.h>
 #include <signal.h>
 
 void *threadfn(void *arg)
 {
   fprintf(stderr, "threadfn\n");
   fflush(stderr);
   sleep(1);
   /* deliberate NULL write: raises SIGSEGV and starts the core dump */
   *(char *)0 = 0;
   return NULL;
 }
 
 int main(void)
 {
   pthread_t thread[4];
   int i;
 
   for (i = 0; i < 4; i++)
     pthread_create(&thread[i], NULL, threadfn, NULL);
 
   /* spin so the process is still alive when a thread faults */
   for (;;);
 }

Ack!

This is an MPC8540ADS arch/powerpc compatible environment here:

Feb  2 12:59:17 fox_1 kernel: Unable to handle kernel paging request for data at address 0x4802f000
Feb  2 12:59:17 fox_1 kernel: Faulting instruction address: 0xc000d5b8
Feb  2 12:59:17 fox_1 kernel: Oops: Kernel access of bad area, sig: 11 [#1]
Feb  2 12:59:17 fox_1 kernel: MPC85xx ADS
Feb  2 12:59:17 fox_1 kernel: Modules linked in:
Feb  2 12:59:17 fox_1 kernel: NIP: c000d5b8 LR: c0010fb8 CTR: 0080
Feb  2 12:59:17 fox_1 kernel: REGS: c24abb20 TRAP: 0300   Not tainted  (2.6.24)
Feb  2 12:59:17 fox_1 kernel: MSR: 00029000 EE,ME  CR: 2288  XER: 
Feb  2 12:59:17 fox_1 kernel: DEAR: 4802f000, ESR: 
Feb  2 12:59:17 fox_1 kernel: TASK = cf894d20[942] 'oops' THREAD: c24aa000
Feb  2 12:59:17 fox_1 kernel: GPR00: c22c7680 c24abbd0 cf894d20 4802f000 
0080 000f8b60 4802f000 
Feb  2 12:59:17 fox_1 kernel: GPR08:  c22c7680 08d1  
2288 10018a64 0006 c035a300
Feb  2 12:59:17 fox_1 kernel: GPR16: 00024000 c038 c24aa000 c24abc9c 
c24abc98 c2570480 c22c7680 c038
Feb  2 12:59:17 fox_1 kernel: GPR24: c0390420 cf09d000 c0497b60 c5b63948 
4802f000 c24aa000 00bc c0497b60
Feb  2 12:59:17 fox_1 kernel: NIP [c000d5b8] __flush_dcache_icache+0x14/0x40
Feb  2 12:59:17 fox_1 kernel: LR [c0010fb8] update_mmu_cache+0x94/0x98
Feb  2 12:59:17 fox_1 kernel: Call Trace:
Feb  2 12:59:17 fox_1 kernel: [c24abbd0] [c24aa000] 0xc24aa000 (unreliable)
Feb  2 12:59:17 fox_1 kernel: [c24abbe0] [c005d978] handle_mm_fault+0x374/0x6a4
Feb  2 12:59:17 fox_1 kernel: [c24abc30] [c005ddd0] get_user_pages+0x128/0x384
Feb  2 12:59:17 fox_1 kernel: [c24abc90] [c00a80d8] elf_core_dump+0xab8/0xb74
Feb  2 12:59:17 fox_1 kernel: [c24abd30] [c007718c] do_coredump+0x730/0x758
Feb  2 12:59:17 fox_1 kernel: [c24abe30] [c002eeb0] get_signal_to_deliver+0x244/0x3c4
Feb  2 12:59:17 fox_1 kernel: [c24abe80] [c000782c] do_signal+0x48/0x264
Feb  2 12:59:17 fox_1 kernel: [c24abf40] [c000e4ac] do_user_signal+0x74/0xc4
Feb  2 12:59:17 fox_1 kernel: Instruction dump:
Feb  2 12:59:17 fox_1 kernel: 4d820020 7c8903a6 7c001bac 38630020 4200fff8 
7c0004ac 4e800020 6000
Feb  2 12:59:17 fox_1 kernel: 54630026 38800080 7c8903a6 7c661b78 7c00186c 
38630020 4200fff8 7c0004ac
Feb  2 12:59:17 fox_1 kernel: ---[ end trace a1d91e665173315a ]---
Feb  2 12:59:17 fox_1 kernel: note: oops[942] exited with preempt_count 1

It does not oops when the core dump is disabled.
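
For anyone who wants to switch core dumps on and off from inside a test program rather than from the shell, a minimal sketch follows. It assumes a POSIX system with getrlimit()/setrlimit(); the helper name set_core_limit() is made up for illustration and is not part of the reproducer in this thread.

```c
#include <sys/resource.h>

/*
 * Set the soft core-file size limit of the calling process.
 * Passing 0 disables core dumps; passing RLIM_INFINITY asks for the
 * largest value the hard limit allows.
 * set_core_limit() is a hypothetical helper name, not a library call.
 */
static int set_core_limit(rlim_t soft)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;

    /* The soft limit may never exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && soft > rl.rlim_max)
        soft = rl.rlim_max;

    rl.rlim_cur = soft;
    return setrlimit(RLIMIT_CORE, &rl);
}
```

Calling set_core_limit(0) before creating the threads should give the "core dump disabled" case, and set_core_limit(RLIM_INFINITY) the case that reaches elf_core_dump().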

Regards,

Clemens
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


RE: Kernel oops while duming user core.

2008-02-02 Thread Benjamin Herrenschmidt

On Thu, 2008-01-31 at 16:10 -0600, Rune Torgersen wrote:
 Scott Wood wrote:
  Scott Wood wrote:
  Nathan Lynch wrote:
  Is the crashing program multithreaded?  The first report had firefox
  triggering the oops.
  
  OK, I've got a test program that triggers it now.  I'll see if I can
  figure out what's going on.
  
  The problem seems to be that update_mmu_cache() is called on a guard
  page with no access rights. 
  
  Changing update_mmu_cache() to always call flush_dcache_icache_page()
  fixes it, though a better performing fix would probably be to add an
  exception table entry for the dcbst.
 
 I can confirm that this seems to fix it.

Might be better to avoid the flush when the page isn't readable?

Ben.
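
To compare the options discussed here, the decision can be modelled in plain user-space C. This is only a sketch of the logic, not kernel code: the enum and struct names and the "readable" flag are assumptions made for illustration, and the "original" branch mirrors the update_mmu_cache() listing quoted elsewhere in this thread.

```c
#include <stdbool.h>

/* Illustrative user-space model only; none of these names exist in the kernel. */
enum flush_path {
    FLUSH_SKIP,        /* do not flush at all */
    FLUSH_USER_ADDR,   /* stands in for __flush_dcache_icache((void *)address) */
    FLUSH_KERNEL_MAP   /* stands in for flush_dcache_icache_page(page) */
};

enum flush_policy {
    POLICY_ORIGINAL,          /* behaviour of the 2.6.24 code in the thread */
    POLICY_ALWAYS_KERNEL_MAP, /* Scott's workaround */
    POLICY_SKIP_UNREADABLE    /* Ben's suggestion */
};

struct fault_ctx {
    bool page_reserved;  /* PageReserved(page) */
    bool page_clean;     /* PG_arch_1 already set */
    bool same_mm;        /* vma->vm_mm == current->active_mm */
    bool readable;       /* assumption: the mapping grants read access */
};

static enum flush_path choose_flush_path(const struct fault_ctx *c,
                                         enum flush_policy policy)
{
    if (c->page_reserved || c->page_clean)
        return FLUSH_SKIP;

    switch (policy) {
    case POLICY_ALWAYS_KERNEL_MAP:
        return FLUSH_KERNEL_MAP;
    case POLICY_SKIP_UNREADABLE:
        if (!c->readable)
            return FLUSH_SKIP;
        break;
    case POLICY_ORIGINAL:
        break;
    }

    /* A dcbst on a user address faults if the page has no access rights. */
    return c->same_mm ? FLUSH_USER_ADDR : FLUSH_KERNEL_MAP;
}
```

For a guard page in the current mm (same_mm true, readable false) the original policy picks the user-address flush, which is the path that oopses; the other two policies avoid it in different ways.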




Re: Kernel oops while duming user core.

2008-02-01 Thread Scott Wood
On Thu, Jan 31, 2008 at 10:15:27AM -0600, Nathan Lynch wrote:
 Rune Torgersen wrote:
  Hi
  
  I get the following kernel core while a user program I have is dumping
  core.
  Any ideas what to look for? (This is running 2.6.24, arch/powerpc on
  an 8280.)
  When running the program on 2.6.18 arch/ppc, the program gets a sig 11
  and dumps core.
  On 2.6.24, I get the kernel oops, and then the program hangs around
  forever and is unkillable.
 
 Hmm, this is the second report of 2.6.24 crashing in
 __flush_dcache_icache during a core dump; see:
 http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
 
 Is this easily recreatable?

Yes, this program does it reliably:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>

void *threadfn(void *arg)
{
	fprintf(stderr, "threadfn\n");
	fflush(stderr);
	sleep(1);
	/* deliberate NULL write: raises SIGSEGV and starts the core dump */
	*(char *)0 = 0;
	return NULL;
}

int main(void)
{
	pthread_t thread[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&thread[i], NULL, threadfn, NULL);

	/* spin so the process is still alive when a thread faults */
	for (;;);
}


RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Nathan Lynch wrote:
 Hmm, this is the second report of 2.6.24 crashing in
 __flush_dcache_icache during a core dump; see:
 http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html
 
 Is this easily recreatable?

Yes. I have a binary that will do this every time it is started (on this
particular system); it only takes about 10 seconds before it dumps.

I was going to test HEAD of powerpc.git to see if it is still there.
I cannot test any earlier versions as our board port was done on 2.6.24.

Our older kernel port is 2.6.18 on arch/ppc, and it works just fine.


One potential clue:
 Unable to handle kernel paging request for data at address 0x48024000

This address is beyond our physical memory. We have 1GB of mem 
(CONFIG_HIGH_MEM enabled) so 0x3fff_ is the last valid address.
0x4000_ to 0x7fff_ are unused, 0x8000_ to 0x9fff_ is
used by PCI.







RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Rune Torgersen wrote:
 I was going to test HEAD of powerpc.git to see if it is still there.

Still there. Also used GDB on the vmlinux image to get source and
disassembly of the oops:
Unable to handle kernel paging request for data at address 0x48024000
Faulting instruction address: 0xc000f0a0
Oops: Kernel access of bad area, sig: 11 [#1]
PREEMPT Innovative Systems ApMax
Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
NIP: c000f0a0 LR: c0011fec CTR: 0080
REGS: eebe9b70 TRAP: 0300   Tainted: P (2.6.24-test)
MSR: 9032 EE,ME,IR,DR  CR: 24004442  XER: 
DAR: 48024000, DSISR: 2000
TASK = eeba9780[2554] 'armd_crash' THREAD: eebe8000
GPR00: eea44d00 eebe9c20 eeba9780 48024000 0080 37a56181 48024000

GPR08: 37a56181 eea44d00  c200 44004422 10100f38 ef336600
bfff
GPR16: eeff0300 0030 eea44d00  eebe9cdc 0011 eebe9cd8
eebca480
GPR24: eea44d00 37a56181 48024000 eebad580 eebad580 37a56181 48024000
c26f4ac0
NIP [c000f0a0] __flush_dcache_icache+0x14/0x40
LR [c0011fec] update_mmu_cache+0x74/0x114
Call Trace:
[eebe9c20] [eebe8000] 0xeebe8000 (unreliable)
[eebe9c40] [c005cfd0] handle_mm_fault+0x630/0xbc0
[eebe9c80] [c005d954] get_user_pages+0x3f4/0x4fc
[eebe9cd0] [c00aa730] elf_core_dump+0x9a4/0xc5c
[eebe9d60] [c0077954] do_coredump+0x6e0/0x748
[eebe9e50] [c002a520] get_signal_to_deliver+0x40c/0x45c
[eebe9e80] [c0008cec] do_signal+0x50/0x294
[eebe9f40] [c000fc9c] do_user_signal+0x74/0xc4
--- Exception: 300 at 0x10044efc
LR = 0x10044ec0
Instruction dump:
4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 6000
54630026 38800080 7c8903a6 7c661b78 7c00186c 38630020 4200fff8
7c0004ac
---[ end trace 37755b0fb9e79677 ]---
note: armd_crash[2554] exited with preempt_count 2

backtrace using gdb on vmlinux image:

0xc00aa730 is in elf_core_dump (fs/binfmt_elf.c:1762).
1757
1758            for (addr = vma->vm_start; addr < end; addr += PAGE_SIZE) {
1759                    struct page *page;
1760                    struct vm_area_struct *vma;
1761
1762                    if (get_user_pages(current, current->mm, addr, 1, 0, 1,
1763                                            &page, &vma) <= 0) {
1764                            DUMP_SEEK(PAGE_SIZE);
1765                    } else {
1766                            if (page == ZERO_PAGE(0)) {
(gdb) list *0xc005d954
0xc005d954 is in get_user_pages (mm/memory.c:1072).
1067                    cond_resched();
1068                    while (!(page = follow_page(vma, start, foll_flags))) {
1069                            int ret;
1070                            ret = handle_mm_fault(mm, vma, start,
1071                                            foll_flags & FOLL_WRITE);
1072                            if (ret & VM_FAULT_ERROR) {
1073                                    if (ret & VM_FAULT_OOM)
1074                                            return i ? i : -ENOMEM;
1075                                    else if (ret & VM_FAULT_SIGBUS)
1076                                            return i ? i : -EFAULT;
(gdb) list *0xc005cfd0
0xc005cfd0 is in handle_mm_fault (include/asm/thread_info.h:99).
94      {
95              register unsigned long sp asm("r1");
96
97              /* gcc4, at least, is smart enough to turn this into a single
98               * rlwinm for ppc32 and clrrdi for ppc64 */
99              return (struct thread_info *)(sp & ~(THREAD_SIZE-1));
100     }
101
102     #endif /* __ASSEMBLY__ */
103
(gdb)
(gdb) list *0xc0011fec
0xc0011fec is in update_mmu_cache (arch/powerpc/mm/mem.c:489).
484                     _tlbie(address, 0 /* 8xx doesn't care about PID */);
485     #endif
486                     if (!PageReserved(page)
487                         && !test_bit(PG_arch_1, &page->flags)) {
488                             if (vma->vm_mm == current->active_mm) {
489                                     __flush_dcache_icache((void *) address);
490                             } else
491                                     flush_dcache_icache_page(page);
492                             set_bit(PG_arch_1, &page->flags);
493                     }
(gdb) list *0xc000f0a0
No source file for address 0xc000f0a0.
(gdb) disassemble 0xc000f0a0
Dump of assembler code for function __flush_dcache_icache:
0xc000f08c <__flush_dcache_icache+0>:   dec    %esi
0xc000f08d <__flush_dcache_icache+1>:   addb   $0x20,(%eax)
0xc000f090 <__flush_dcache_icache+4>:   push   %esp
0xc000f091 <__flush_dcache_icache+5>:   arpl   %ax,(%eax)
0xc000f093 <__flush_dcache_icache+7>:   cmp    %al,%es:0x897c8000(%eax)
0xc000f09a <__flush_dcache_icache+14>:  add    0x781b667c(%esi),%esp
0xc000f0a0 <__flush_dcache_icache+20>:  jl     0xc000f0a2 <__flush_dcache_icache+22>
0xc000f0a2 <__flush_dcache_icache+22>:  sbb

RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Kumar Gala wrote:
 Can you git-bisect to narrow this down further.

Not easily, as the board port to arch/powerpc was done on 2.6.24-rc7
and up.
Is there a somewhat easy way in git to apply the differences from master
branch to our board branch to a branch created by bisect?

And I don't even know where this started to happen.
Would trying arch/ppc help any? I have our arch/ppc port in a
semiworking state for kernels up to 2.6.23




Re: Kernel oops while duming user core.

2008-01-31 Thread Kumar Gala

On Jan 31, 2008, at 10:26 AM, Rune Torgersen wrote:

 Nathan Lynch wrote:
 Hmm, this is the second report of 2.6.24 crashing in
 __flush_dcache_icache during a core dump; see:
 http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html

 Is this easily recreatable?

 Yes. I have a binary that will do this every time it is started (on  
 this
 particular system),
 only takes about 10 seconds before it dumps.

 I was going to test HEAD of powerpc.git to see if it is still there.
 I cannot test any earlier versions as our board port was done on  
 2.6.24.

 Our older kernel port is 2.6.18 on arch/ppc, and it works just fine.


 One potential clue:
 Unable to handle kernel paging request for data at address 0x48024000

 This address is beyond our physical memory. We have 1GB of mem
 (CONFIG_HIGH_MEM enabled) so 0x3fff_ is the last valid address.
 0x4000_ to 0x7fff_ are unused, 0x8000_ to 0x9fff_ is
 used by PCI.


Can you git-bisect to narrow this down further.

- k


Re: Kernel oops while duming user core.

2008-01-31 Thread Kumar Gala
}
 (gdb) list *0xc000f0a0
 No source file for address 0xc000f0a0.
 (gdb) disassemble 0xc000f0a0
 Dump of assembler code for function __flush_dcache_icache:
 0xc000f08c <__flush_dcache_icache+0>:   dec    %esi
 0xc000f08d <__flush_dcache_icache+1>:   addb   $0x20,(%eax)
 0xc000f090 <__flush_dcache_icache+4>:   push   %esp
 0xc000f091 <__flush_dcache_icache+5>:   arpl   %ax,(%eax)
 0xc000f093 <__flush_dcache_icache+7>:   cmp    %al,%es:0x897c8000(%eax)
 0xc000f09a <__flush_dcache_icache+14>:  add    0x781b667c(%esi),%esp
 0xc000f0a0 <__flush_dcache_icache+20>:  jl     0xc000f0a2 <__flush_dcache_icache+22>
 0xc000f0a2 <__flush_dcache_icache+22>:  sbb    %ch,0x63(%eax,%edi,1)
 0xc000f0a6 <__flush_dcache_icache+26>:  add    %ah,(%eax)
 0xc000f0a8 <__flush_dcache_icache+28>:  inc    %edx
 0xc000f0a9 <__flush_dcache_icache+29>:  add    %bh,%bh
 0xc000f0ab <__flush_dcache_icache+31>:  clc
 0xc000f0ac <__flush_dcache_icache+32>:  jl     0xc000f0ae <__flush_dcache_icache+34>
 0xc000f0ae <__flush_dcache_icache+34>:  add    $0xac,%al
 0xc000f0b0 <__flush_dcache_icache+36>:  jl     0xc000f03b <flush_dcache_range+15>
 0xc000f0b2 <__flush_dcache_icache+38>:  add    0xac37007c(%esi),%esp
 0xc000f0b8 <__flush_dcache_icache+44>:  cmp    %al,%dh
 0xc000f0ba <__flush_dcache_icache+46>:  add    %ah,(%eax)
 0xc000f0bc <__flush_dcache_icache+48>:  inc    %edx
 0xc000f0bd <__flush_dcache_icache+49>:  add    %bh,%bh
 0xc000f0bf <__flush_dcache_icache+51>:  clc
 0xc000f0c0 <__flush_dcache_icache+52>:  jl     0xc000f0c2 <__flush_dcache_icache+54>
 0xc000f0c2 <__flush_dcache_icache+54>:  add    $0xac,%al
 0xc000f0c4 <__flush_dcache_icache+56>:  dec    %esp
 0xc000f0c5 <__flush_dcache_icache+57>:  add    %al,(%ecx)
 0xc000f0c7 <__flush_dcache_icache+59>:  sub    $0x4e,%al
 0xc000f0c9 <__flush_dcache_icache+61>:  addb   $0x20,(%eax)
 End of assembler dump.

This doesn't look like ppc disasm to me :)

- k


Re: Kernel oops while duming user core.

2008-01-31 Thread Nathan Lynch
Rune Torgersen wrote:
 Kumar Gala wrote:
  Can you git-bisect to narrow this down further.
 
 Not easily, as the board port to arch/powerpc was done on 2.6.24-rc7
 and up.
 Is there a somewhat easy way in git to apply the differences from master
 branch to our board branch to a branch created by bisect?
 
 And I don't even know where this started to happen.
 Would trying arch/ppc help any? I have our arch/ppc port in a
 semiworking state for kernels up to 2.6.23

Well, we know this happens on other 32-bit powerpc machines (pmac at
least)... perhaps someone could arrange to bisect on a machine that
works with older powerpc kernels (assuming they have a good repro
case).


Re: Kernel oops while duming user core.

2008-01-31 Thread Scott Wood
On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote:
 Unable to handle kernel paging request for data at address 0x48024000
 Faulting instruction address: 0xc000f0a0
 Oops: Kernel access of bad area, sig: 11 [#1]
 PREEMPT Innovative Systems ApMax

Does it happen without preempt?

 Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
 NIP: c000f0a0 LR: c0011fec CTR: 0080
 REGS: eebe9b70 TRAP: 0300   Tainted: P (2.6.24-test)

Does it happen without the modules?

 MSR: 9032 EE,ME,IR,DR  CR: 24004442  XER: 
 DAR: 48024000, DSISR: 2000

Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB
miss that got redirected to DataAccess (or is there something that causes
DSISR[2] to be set on 8280?  I didn't see anything in the manual...). 
However, SRR1 in that case seems to indicate a store, which dcbst shouldn't
generate (except on 8xx, according to the comment in update_mmu_cache).

Do you have a simple test case that we could try to reproduce?  I tried a
simple core dump on an 8280, and it worked.

Failing that, I'd add code to the page fault handler to dump what is (or
isn't) supposed to be mapped at the faulting address, and something to track
which (if any) TLB miss exception it came through.

-Scott


RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Scott Wood wrote:
 Does it happen without preempt?

Will try shortly, just updated my git to HEAD of Linus's tree
 
 Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
 NIP: c000f0a0 LR: c0011fec CTR: 0080
 REGS: eebe9b70 TRAP: 0300   Tainted: P (2.6.24-test)
 
 Does it happen without the modules?
Cannot test without most of them.

 Do you have a simple test case that we could try to
 reproduce?  I tried a
 simple core dump on an 8280, and it worked.

I do not have a testcase, except an app for our board that does this
reliably after about 10 seconds.

 Failing that, I'd add code to the page fault handler to dump what is
 (or isn't) supposed to be mapped at the faulting address, and
 something to track which (if any) TLB miss exception it came through.

I can test code.



RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Scott Wood wrote:
 Does it happen without preempt?

Yes


RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Nathan Lynch wrote:
 Scott Wood wrote:
 Do you have a simple test case that we could try to reproduce?  I
 tried a simple core dump on an 8280, and it worked.
 
 Is the crashing program multithreaded?  The first report had firefox
 triggering the oops.

The crashing program has 10 threads. (NPTL pthreads, glibc-2.5, gcc
4.1.2)




Re: Kernel oops while duming user core.

2008-01-31 Thread Scott Wood
Nathan Lynch wrote:
 I doubt the modules are the problem; there was a practically identical
 report from someone with an untainted 2.6.24-rc kernel a few weeks ago
 (see my first reply to Rune).

I didn't think they were; I was just trying to eliminate the low hanging 
fruit and get a simpler testcase. :-)

 Do you have a simple test case that we could try to reproduce?  I tried a
 simple core dump on an 8280, and it worked.
 
 Is the crashing program multithreaded?  The first report had firefox
 triggering the oops.

OK, I've got a test program that triggers it now.  I'll see if I can 
figure out what's going on.

-Scott


Re: Kernel oops while duming user core.

2008-01-31 Thread Nathan Lynch
Scott Wood wrote:
 On Thu, Jan 31, 2008 at 11:40:04AM -0600, Rune Torgersen wrote:
  Unable to handle kernel paging request for data at address 0x48024000
  Faulting instruction address: 0xc000f0a0
  Oops: Kernel access of bad area, sig: 11 [#1]
  PREEMPT Innovative Systems ApMax
 
 Does it happen without preempt?
 
  Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
  drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
  NIP: c000f0a0 LR: c0011fec CTR: 0080
  REGS: eebe9b70 TRAP: 0300   Tainted: P (2.6.24-test)
 
 Does it happen without the modules?

I doubt the modules are the problem; there was a practically identical
report from someone with an untainted 2.6.24-rc kernel a few weeks ago
(see my first reply to Rune).

 
  MSR: 9032 EE,ME,IR,DR  CR: 24004442  XER: 
  DAR: 48024000, DSISR: 2000
 
 Hmm, this doesn't look like a valid DSISR, so I'm guessing this was a TLB
 miss that got redirected to DataAccess (or is there something that causes
 DSISR[2] to be set on 8280?  I didn't see anything in the manual...). 
 However, SRR1 in that case seems to indicate a store, which dcbst shouldn't
 generate (except on 8xx, according to the comment in update_mmu_cache).
 
 Do you have a simple test case that we could try to reproduce?  I tried a
 simple core dump on an 8280, and it worked.

Is the crashing program multithreaded?  The first report had firefox
triggering the oops.


Re: Kernel oops while duming user core.

2008-01-31 Thread Scott Wood
Scott Wood wrote:
 Nathan Lynch wrote:
 Is the crashing program multithreaded?  The first report had firefox
 triggering the oops.
 
 OK, I've got a test program that triggers it now.  I'll see if I can 
 figure out what's going on.

The problem seems to be that update_mmu_cache() is called on a guard 
page with no access rights.

Changing update_mmu_cache() to always call flush_dcache_icache_page() 
fixes it, though a better performing fix would probably be to add an 
exception table entry for the dcbst.

-Scott
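
For readers unfamiliar with guard pages, the sketch below shows from user space what "a page with no access rights" means. It is an illustration under assumptions, not the kernel fix: map_with_guard() and read_faults() are made-up helper names, and the code assumes Linux with mmap(), mprotect() and sigaction(). Thread libraries typically place a PROT_NONE page like this next to each thread stack, which is presumably why the multithreaded test program reaches the failing path when the core dump walks every mapping.

```c
#define _DEFAULT_SOURCE
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf probe_env;

static void on_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);   /* unwind out of the faulting read */
}

/* Returns 1 if reading one byte at addr faults, 0 if the read succeeds. */
static int read_faults(const volatile char *addr)
{
    struct sigaction sa, old_segv, old_bus;
    int faulted;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_fault;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);

    if (sigsetjmp(probe_env, 1) == 0) {
        char c = *addr;         /* faults on a PROT_NONE page */
        (void)c;
        faulted = 0;
    } else {
        faulted = 1;
    }

    sigaction(SIGSEGV, &old_segv, NULL);
    sigaction(SIGBUS, &old_bus, NULL);
    return faulted;
}

/*
 * Map `pages` read/write pages preceded by one PROT_NONE guard page.
 * Returns the first usable byte and stores the guard address in *guard_out,
 * or returns NULL on failure.
 */
static void *map_with_guard(size_t pages, void **guard_out)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (pages + 1) * psz;
    char *base = mmap(NULL, len, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (base == MAP_FAILED)
        return NULL;
    if (mprotect(base, psz, PROT_NONE) != 0) {
        munmap(base, len);
        return NULL;
    }
    *guard_out = base;
    return base + psz;
}
```

A read of the usable region succeeds while a read of the guard address faults; a cache flush instruction issued on the guard address from kernel context hits the same protection.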


RE: Kernel oops while duming user core.

2008-01-31 Thread Rune Torgersen
Scott Wood wrote:
 Scott Wood wrote:
 Nathan Lynch wrote:
 Is the crashing program multithreaded?  The first report had firefox
 triggering the oops.
 
 OK, I've got a test program that triggers it now.  I'll see if I can
 figure out what's going on.
 
 The problem seems to be that update_mmu_cache() is called on a guard
 page with no access rights. 
 
 Changing update_mmu_cache() to always call flush_dcache_icache_page()
 fixes it, though a better performing fix would probably be to add an
 exception table entry for the dcbst.

I can confirm that this seems to fix it.




Re: Kernel oops while duming user core.

2008-01-31 Thread Nathan Lynch
Rune Torgersen wrote:
 Hi
 
 I get the following kernel core while a user program I have is dumping
 core.
 Any ideas what to look for? (This is running 2.6.24, arch/powerpc on
 an 8280.)
 When running the program on 2.6.18 arch/ppc, the program gets a sig 11
 and dumps core.
 On 2.6.24, I get the kernel oops, and then the program hangs around
 forever and is unkillable.

Hmm, this is the second report of 2.6.24 crashing in
__flush_dcache_icache during a core dump; see:
http://ozlabs.org/pipermail/linuxppc-dev/2007-December/048662.html

Is this easily recreatable?

 
 Unable to handle kernel paging request for data at address 0x48024000
 Faulting instruction address: 0xc000ef88
 Oops: Kernel access of bad area, sig: 11 [#1]
 PREEMPT Innovative Systems ApMax
 Modules linked in: drv_wd(P) drv_scc devcom drv_pcir tipc drv_ss7
 drv_auxcpu drv_leds(P) drv_ethsw proc_sysinfo(P) i2c_8266(P)
 NIP: c000ef88 LR: c0012180 CTR: 0080
 REGS: eebc9b70 TRAP: 0300   Tainted: P (2.6.24)
 MSR: 9032 EE,ME,IR,DR  CR: 24004442  XER: 
 DAR: 48024000, DSISR: 2000
 TASK = eebac3c0[3131] 'armd' THREAD: eebc8000
 GPR00: ee9b7d00 eebc9c20 eebac3c0 48024000 0080 399a4181 48024000
 
 GPR08: 399a4181 ee9b7d00  c200 44004422 10100f38 ee82fc00
 bfff
 GPR16: ef377060 0030 ee9b7d00  eebc9cdc 0011 eebc9cd8
 eeb96480
 GPR24: ee9b7d00 399a4181 48024000 eeb9a370 eeb9a370 399a4181 48024000
 c2733480
 NIP [c000ef88] __flush_dcache_icache+0x14/0x40
 LR [c0012180] update_mmu_cache+0x74/0x114
 Call Trace:
 [eebc9c20] [eebc8000] 0xeebc8000 (unreliable)
 [eebc9c40] [c005d060] handle_mm_fault+0x630/0xbc0
 [eebc9c80] [c005d9e4] get_user_pages+0x3f4/0x4fc
 [eebc9cd0] [c00aa7c4] elf_core_dump+0x9a4/0xc5c
 [eebc9d60] [c00779e4] do_coredump+0x6e0/0x748
 [eebc9e50] [c002a5b0] get_signal_to_deliver+0x40c/0x45c
 [eebc9e80] [c0008ce8] do_signal+0x50/0x294
 [eebc9f40] [c000fb98] do_user_signal+0x74/0xc4
 --- Exception: 300 at 0x10044efc
 LR = 0x10044ec0
 Instruction dump:
 4d820020 7c8903a6 7c001bac 38630020 4200fff8 7c0004ac 4e800020 6000
 54630026 38800080 7c8903a6 7c661b78 7c00186c 38630020 4200fff8
 7c0004ac
 ---[ end trace 97db37eaf213da3c ]---
 note: armd[3131] exited with preempt_count 2

