Re: panic starting gnome
Terry Lambert wrote:
> Debug:
> [excellent kernel-debugging recipe snipped]

Here's a backtrace of a crashdump that should be more helpful:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; lapic.id =
fault virtual address = 0x34
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc01b28c6
stack pointer = 0x10:0xeb3b17c0
frame pointer = 0x10:0xeb3b17e0
code segment = base 0x0, limit 0xf, type 0x1b
             = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2104 (gconf-sanity-check-)
panic: from debugger
cpuid = 0; lapic.id =

Fatal trap 3: breakpoint instruction fault while in kernel mode
cpuid = 0; lapic.id =
instruction pointer = 0x8:0xc03019ea
stack pointer = 0x10:0xeb3b1534
frame pointer = 0x10:0xeb3b1540
code segment = base 0x0, limit 0xf, type 0x1b
             = DPL 0, pres 1, def32 1, gran 1
processor eflags = IOPL = 0
current process = 2104 (gconf-sanity-check-)
panic: from debugger
cpuid = 0; lapic.id =
boot() called on cpu#0
Uptime: 4m49s
Dumping 1023 MB
 16 32 48 64 80 96 112 128 144 160 176 192 208 224 240 256 272 288 304 320 336 352 368 384 400 416 432 448 464 480 496 512 528 544 560 576 592 608 624 640 656 672 688 704 720 736 752 768 784 800 816 832 848 864 880 896 912 928 944 960 976 992 1008
---
#0 doadump () at /usr/src/sys/kern/kern_shutdown.c:240
240 dumpsys(dumper);
(kgdb) bt
#0 doadump () at /usr/src/sys/kern/kern_shutdown.c:240
#1 0xc01bc00e in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:371
#2 0xc01bc627 in panic (fmt=0xc0349b8d "from debugger") at /usr/src/sys/kern/kern_shutdown.c:542
#3 0xc0148192 in db_panic () at /usr/src/sys/ddb/db_command.c:448
#4 0xc0147fcc in db_command (last_cmdp=0xc037f9a0, cmd_table=0x0, aux_cmd_tablep=0xc0376fb8, aux_cmd_tablep_end=0xc0376fbc) at /usr/src/sys/ddb/db_command.c:346
#5 0xc014820a in db_command_loop () at /usr/src/sys/ddb/db_command.c:470
#6 0xc014af96 in db_trap (type=12, code=0) at /usr/src/sys/ddb/db_trap.c:72
#7 0xc0301697 in kdb_trap (type=12, code=0, regs=0xeb3b1780) at /usr/src/sys/i386/i386/db_interface.c:166
#8 0xc031a590 in trap_fatal (frame=0xeb3b1780, eva=0) at /usr/src/sys/i386/i386/trap.c:839
#9 0xc031a2da in trap_pfault (frame=0xeb3b1780, usermode=0, eva=52) at /usr/src/sys/i386/i386/trap.c:758
#10 0xc0319e95 in trap (frame= {tf_fs = -1038483432, tf_es = 16, tf_ds = -1070202864, tf_edi = 158, tf_esi = 52, tf_ebp = -348448800, tf_isp = -348448852, tf_ebx = 0, tf_edx = -966573056, tf_ecx = -966602272, tf_eax = -966602272, tf_trapno = 12, tf_err = 0, tf_eip = -1071961914, tf_cs = 8, tf_eflags = 66178, tf_esp = 0, tf_ss = -1070141731}) at /usr/src/sys/i386/i386/trap.c:445
#11 0xc0302ff8 in calltrap () at {standard input}:97
#12 0xc02098a4 in namei (ndp=0x9e) at /usr/src/sys/kern/vfs_lookup.c:158
#13 0xc021bcfc in vn_open_cred (ndp=0xeb3b1a44, flagp=0xeb3b1a0c, cmode=0, cred=0xc2195e80) at /usr/src/sys/kern/vfs_vnops.c:185
#14 0xc6acffb4 in ?? ()
#15 0xc01a06b3 in closef (fp=0x2, td=0x0) at vnode_if.h:1225
#16 0xc01a0054 in fdfree (td=0xc662d1e0) at /usr/src/sys/kern/kern_descrip.c:1433
#17 0xc01a5da2 in exit1 (td=0xc662d1e0) at /usr/src/sys/kern/kern_exit.c:254
#18 0xc01a5b11 in sys_exit () at /usr/src/sys/kern/kern_exit.c:116
#19 0xc031ab56 in syscall (frame= {tf_fs = 47, tf_es = 47, tf_ds = 47, tf_edi = 0, tf_esi = 11095, tf_ebp = -1077937128, tf_isp = -348447372, tf_ebx = 679838148, tf_edx = 679837268, tf_ecx = 19, tf_eax = 1, tf_trapno = 12, tf_err = 2, tf_eip = 680166719, tf_cs = 31, tf_eflags = 582, tf_esp = -1077937172, tf_ss = 47}) at /usr/src/sys/i386/i386/trap.c:1033
#20 0xc030304d in Xint0x80_syscall () at {standard input}:139
---Can't read userspace from dump, or kernel process---

(kgdb) up 12
#12 0xc02098a4 in namei (ndp=0x9e) at /usr/src/sys/kern/vfs_lookup.c:158
158 FILEDESC_LOCK(fdp);
(kgdb) list
153 #endif
154
155 /*
156  * Get starting point for the translation.
157  */
158 FILEDESC_LOCK(fdp);
159 ndp->ni_rootdir = fdp->fd_rdir;
160 ndp->ni_topdir = fdp->fd_jdir;
161
162 dp = fdp->fd_cdir;
(kgdb) print ndp
$2 = (struct nameidata *) 0x9e
(kgdb) print fdp
$1 = (struct filedesc *) 0x34
(kgdb)
(kgdb) print p
$3 = (struct proc *) 0x0
(kgdb) print td
$5 = (struct thread *) 0xc662d1e0
(kgdb) print *td
$7 = {td_proc = 0xc66307f0, [...]

Very strange. namei() does essentially the following:

    p = td->td_proc;
    fdp = p->p_fd;

td->td_proc seems reasonable, but p is 0. No idea how this could happen, any guesses?

Thanks,
Lars
--
Lars Eggert [EMAIL PROTECTED] USC Information Sciences Institute

smime.p7s Description: S/MIME Cryptographic Signature
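The fault virtual address of 0x34 is the kind of small address that shows up when a structure member is accessed through a NULL structure pointer: the CPU touches base + offset, and with a base of 0 only the offset remains. The sketch below illustrates that arithmetic only. The struct `demo_filedesc`, its padding, and the helper `member_address` are hypothetical stand-ins, not the real FreeBSD `struct filedesc` or `struct proc` layout, and the thread has not established which member actually sits at 0x34.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in, NOT the real struct filedesc: the padding is
 * chosen so that the lock member lands at offset 0x34, the same value
 * as the fault virtual address in the trap message.
 */
struct demo_filedesc {
    char pad[0x34];
    int  fd_mtx;
};

/* Compile-time check that the demo layout really puts fd_mtx at 0x34. */
static_assert(offsetof(struct demo_filedesc, fd_mtx) == 0x34,
              "demo layout must place fd_mtx at offset 0x34");

/*
 * Address touched when accessing a member that lies `offset` bytes
 * into a structure located at `base`. With base == 0 (a NULL structure
 * pointer) the result is just the member offset.
 */
uintptr_t member_address(uintptr_t base, size_t offset)
{
    return base + offset;
}
```

Calling `member_address(0, offsetof(struct demo_filedesc, fd_mtx))` gives 0x34, while any valid kernel base address would give something far outside page zero.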
Re: panic starting gnome
On Wed, 2003-02-19 at 16:44, Lars Eggert wrote:
> #11 0xc0302ff8 in calltrap () at {standard input}:97
> #12 0xc02098a4 in namei (ndp=0x9e) at /usr/src/sys/kern/vfs_lookup.c:158
> #13 0xc021bcfc in vn_open_cred (ndp=0xeb3b1a44, flagp=0xeb3b1a0c, cmode=0, cred=0xc2195e80) at /usr/src/sys/kern/vfs_vnops.c:185
> #14 0xc6acffb4 in ?? ()
> #15 0xc01a06b3 in closef (fp=0x2, td=0x0) at vnode_if.h:1225
> #16 0xc01a0054 in fdfree (td=0xc662d1e0) at /usr/src/sys/kern/kern_descrip.c:1433
> #17 0xc01a5da2 in exit1 (td=0xc662d1e0) at /usr/src/sys/kern/kern_exit.c:254

Well, I haven't had much luck tracking down the exact cause. For some reason I haven't been able to figure out, all of my crash dumps jump directly from vn_open_cred (line 185 of vfs_vnops.c) to calltrap(). The namei call doesn't show up in the stack at all, almost like the function is being inlined. I'm only using -O, which shouldn't inline anything not explicitly declared as such.

Anyway, using a cvsup binary search I've managed to narrow it down some. The problem did not exist before midnight UTC on 2003-04-15. It does exist as of midnight UTC on 2003-04-16. I've been digging through the commit logs for that day, but it seems it was a busy day for the VFS code with lots of commits.

Since it always happens after an fdfree(), I'm leaning toward a large (in number of files) commit by alfred@ having to do with a lock order reversal and adding a mutex associated with freeing filedesc structures. Just a guess, though.

Reproducing the problem seems to be as simple as killing any process that has an open, locked file on an NFS volume. A simple

    gconfd-1 sleep 5; killall -9 gconfd-1

does it every time for me. I assume this would also happen if a process calls exit() without closing all of its fds first; that is probably why starting GNOME or booting diskless is enough to tickle it.

Craig

To Unsubscribe: send mail to [EMAIL PROTECTED] with unsubscribe freebsd-current in the body of the message
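The trigger Craig describes (a process that dies while still holding a lock on an open file) can be sketched without gconfd. This is a minimal illustration under stated assumptions, not a tested reproducer for this panic: the helper name `open_and_lock` is hypothetical, it uses plain POSIX `fcntl()` advisory locking, and whether it exercises the same kernel path depends on the file living on an NFS mount with the NFS lock machinery in use.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open `path` (creating it if needed) and take an exclusive advisory
 * lock on the whole file. Returns the descriptor on success, -1 on
 * failure. The descriptor is deliberately left open so that the lock
 * is only released when the caller closes it or the process exits.
 */
int open_and_lock(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    struct flock fl = {0};
    fl.l_type = F_WRLCK;    /* exclusive (write) lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "to end of file": whole file */

    if (fcntl(fd, F_SETLK, &fl) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A test program would call `open_and_lock()` on a file under the NFS mount and then either sleep until it is killed or call `_exit(0)` without closing the descriptor, leaving the kernel to drop the lock during exit, which is the exit1 / fdfree / closef sequence seen in the backtrace.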
Re: panic starting gnome
Lars Eggert wrote:
> Terry Lambert wrote:
>> Debug:
> [excellent kernel-debugging recipe snipped]
>
> Here's a backtrace of a crashdump that should be more helpful:
> [ ... ]
> (kgdb) up 12
> #12 0xc02098a4 in namei (ndp=0x9e) at /usr/src/sys/kern/vfs_lookup.c:158
> 158 FILEDESC_LOCK(fdp);
> (kgdb) list
> 153 #endif
> 154
> 155 /*
> 156  * Get starting point for the translation.
> 157  */
> 158 FILEDESC_LOCK(fdp);
> 159 ndp->ni_rootdir = fdp->fd_rdir;
> 160 ndp->ni_topdir = fdp->fd_jdir;
> 161
> 162 dp = fdp->fd_cdir;
> (kgdb) print ndp
> $2 = (struct nameidata *) 0x9e
> (kgdb) print fdp
> $1 = (struct filedesc *) 0x34
> (kgdb)
> (kgdb) print p
> $3 = (struct proc *) 0x0
> (kgdb) print td
> $5 = (struct thread *) 0xc662d1e0
> (kgdb) print *td
> $7 = {td_proc = 0xc66307f0, [...]
>
> Very strange. namei() does essentially the following:
>
>     p = td->td_proc;
>     fdp = p->p_fd;
>
> td->td_proc seems reasonable, but p is 0. No idea how this could happen, any guesses?

Cool. This is not where I was guessing it was at, at all. 8-) 8-).

There's a commit that Alfred made last Friday night that might have something to do with it. It was an attempt to fix a lock order reversal between PROC/filedesc, according to the commit, and it introduced fdesc_mtx. If you grep for that everywhere, and then annotate the involved files, it should be pretty obvious which changes to revert to see if this is the case (1.50 -> 1.49 of /sys/sys/filedesc.h, etc.).

It may also be an issue with some of the recent KSE commits over the last weekend missing an assignment on a context switch.

Probably the easiest thing to do, if you can repeat the problem reliably, is to bsearch, starting 8 days ago, for the commit that broke the camel's back.

It's really tempting to make a script that's capable of carrying out a /usr/src/sys bsearch semi-automatically, because people are really hesitant to use this approach for solving problems, even though it only requires O(log2(N)) reboots to find it...
-- Terry
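The bsearch Terry describes can be sketched as an ordinary binary search over an ordered list of source snapshots, given one snapshot known to be good and a later one known to be bad. This is only an illustration of the search logic: the names `first_bad_snapshot`, `is_bad_fn`, and `bad_at_or_after` are hypothetical, and in real use the callback would stand for "check out that snapshot, build the kernel, reboot, and see whether it panics", which is not automated here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Callback deciding whether the kernel built from snapshot `index`
 * shows the problem. Hypothetical: in practice this is a manual
 * checkout/build/reboot/test cycle.
 */
typedef bool (*is_bad_fn)(int index, void *ctx);

/*
 * Given a known-good snapshot index and a later known-bad one, return
 * the index of the first bad snapshot. If `probes` is non-NULL it
 * receives the number of snapshots that had to be tested, which is at
 * most about log2(bad - good).
 */
int first_bad_snapshot(int good, int bad, is_bad_fn is_bad, void *ctx,
                       int *probes)
{
    assert(good < bad);

    int count = 0;
    while (bad - good > 1) {
        int mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;      /* breakage is at mid or earlier */
        else
            good = mid;     /* breakage is after mid */
        count++;
    }
    if (probes != NULL)
        *probes = count;
    return bad;
}

/* Stand-in predicate for trying the search out: everything at or
 * after the index stored in *ctx is considered broken. */
bool bad_at_or_after(int index, void *ctx)
{
    return index >= *(const int *)ctx;
}
```

With eight daily snapshots between a good index 0 and a bad index 8, the search needs three test cycles, which is where the O(log2(N)) reboots figure comes from.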
Re: panic starting gnome
Craig Boston wrote:
> Well, I haven't had much luck tracking down the exact cause. For some reason I haven't been able to figure out, all of my crash dumps jump directly from vn_open_cred (line 185 of vfs_vnops.c) to calltrap(). The namei call doesn't show up in the stack at all, almost like the function is being inlined. I'm only using -O, which shouldn't inline anything not explicitly declared as such.

Nope. The problem is a NULL pointer dereference, apparently into the proc structure, which is a NULL proc pointer.

> Anyway, using a cvsup binary search I've managed to narrow it down some. The problem did not exist before midnight UTC on 2003-04-15. It does exist on midnight UTC 2003-04-16. I've been digging through the commit logs for that day, but it seems it was a busy day for the VFS code with lots of commits.
>
> Since it always happens after an fdfree(), I'm leaning toward a large (number of files) commit by alfred@ having to do with a lock order reversal and adding a mutex associated with freeing filedesc structures. Just a guess, though.

FWIW, I arrived at the same place, given Lars' debugging information, though it was only my most likely suspect. There are some changes that went in for KSE, as well, but I'm pretty sure they were after last Wednesday.

> Reproducing the problem seems to be as simple as killing any process that has an open, locked file on an NFS volume. A simple
>
>     gconfd-1 sleep 5; killall -9 gconfd-1
>
> does it every time for me. I assume this would also happen if a process calls exit() without closing all of its fds first; probably why starting GNOME or booting diskless is enough to tickle it.

Yes, this is most likely.

-- Terry
panic starting gnome
Hi,

on today's -current, I get the following panic when starting gnome from xdm; a kernel from 2/10 works with today's world, so it must be something in the kernel that changed over the last week:

Fatal trap 12: page fault while in kernel mode
cpuid = 0; lapic.id =
fault virtual address = 0x34
fault code = supervisor read, page not present
instruction pointer = 0x8:0xc01b28a6
stack pointer = 0x10:0xe91a57c0
frame pointer = 0x10:0xe91a57e0
code segment = base 0x0, limit 0xf, type 0x1b
             = DPL 0, pres 1, def32 1, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2444 (gconf-sanity-check-)
kernel: type 12 trap, code=0
Stopped at _mtx_lock_flags+0x26: cmpl $0xc03884a0,0(%esi)
db>
mi_switch(c21b4980,0,c0354624,185,3f8) at mi_switch+0x240
ithread_schedule(c6283b80,1,c0305e16,c658e780,e91a55c0) at ithread_schedule+0x11c
sched_ithd(d) at sched_ithd+0x41
Xintr13() at Xintr13+0xd3
--- interrupt, eip = 0xc02efea2, esp = 0xe91a55a4, ebp = 0xe91a55c0 ---
siocnopen(e91a55d4,3f8,1c200,301,c01f040b) at siocnopen+0x12
siocncheckc(c03b4c80,78,e91a5608,c01f0358,e91a5624) at siocncheckc+0x40
cncheckc(e91a5624,c0149625,e91a57c8,c03a9ac8,e91a5634) at cncheckc+0x2c
cngetc(e91a57c8,c03a9ac8,e91a5634,0,e91a57c8) at cngetc+0x18
db_readline(c03b1b80,78,e91a5658,c01481e6,c03499fb) at db_readline+0x65
db_read_line(c03499fb,c03a9ac8,e91a5658,c0148a28,0) at db_read_line+0x1a
db_command_loop(c01b28a6,a0,0,e91a5680,0) at db_command_loop+0x46
db_trap(c,0,0,e91a56c0,5) at db_trap+0x66
kdb_trap(c,0,e91a5780,1,1) at kdb_trap+0x107
trap_fatal(e91a5780,34,c0372ee0,2e4,c658e780) at trap_fatal+0x250
trap_pfault(e91a5780,0,34,c03e0758,34) at trap_pfault+0x17a
trap(c21a0018,10,c0360010,9e,34) at trap+0x3e5
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0xc01b28a6, esp = 0xe91a57c0, ebp = 0xe91a57e0 ---
_mtx_lock_flags(34,0,c035cf5f,9e,c658e780) at _mtx_lock_flags+0x26
namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134
vn_open_cred(e91a5a44,e91a5a0c,0,c2195e80,0) at vn_open_cred+0x53c
nfs_dolock(e91a5c0c,c658e780,1b3,c03e0748,6001) at nfs_dolock+0x294
closef(c6673834,c658e780,c0353f03,595,c7375934) at closef+0x123
fdfree(c658e780,0,c03543ab,f2,73) at fdfree+0x1d4
exit1(c658e780,0,c03543ab,73,e91a5d40) at exit1+0x282
sys_exit(c658e780,e91a5d10,c0372ee0,407,c658de4c) at sys_exit+0x41
syscall(2f,2f,2f,0,2b57) at syscall+0x3d6
Xint0x80_syscall() at Xint0x80_syscall+0x1d
--- syscall (1), eip = 0x288a853f, esp = 0xbfbffbec, ebp = 0xbfbffc18 ---

Lars
--
Lars Eggert [EMAIL PROTECTED] USC Information Sciences Institute
Re: panic starting gnome
FWIW, this looks nearly identical to the panic I reported last night in the thread "VFS panic (possibly NFS locking related?)". I didn't manage to catch the ddb trace and had to work postmortem with a crash dump and gdb. But it looked just like here.

Lars: Do you by any chance have your home directory on an NFS mount? I think the reason that my gdb trace showed ?? instead of nfs_dolock is that I have nfsclient loaded as a module...

Craig

On Tue, 2003-02-18 at 17:00, Lars Eggert wrote:
> Hi,
>
> on today's -current, I get the following panic when starting gnome from xdm; a kernel from 2/10 works with today's world, so it must be something in the kernel that changed over the last week:
>
> Fatal trap 12: page fault while in kernel mode
> [ ... ]
Re: panic starting gnome
Craig Boston wrote:
> FWIW, this looks nearly identical to the panic I reported last night in the thread "VFS panic (possibly NFS locking related?)".

I missed your message, just read it: yes, that sounds similar.

> I didn't manage to catch the ddb trace and had to work postmortem with a crash dump and gdb. But it looked just like here.
>
> Lars: Do you by any chance have your home directory on an NFS mount?

Yes, I do.

> I think the reason that my gdb trace showed ?? instead of nfs_dolock is that I have nfsclient loaded as a module...

Mine's loaded as a module, too.

Lars
--
Lars Eggert [EMAIL PROTECTED] USC Information Sciences Institute
Re: panic starting gnome
Lars Eggert wrote:
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; lapic.id =
> fault virtual address = 0x34
> fault code = supervisor read, page not present
> instruction pointer = 0x8:0xc01b28a6
[ ... ]
> kernel: type 12 trap, code=0
> Stopped at _mtx_lock_flags+0x26: cmpl $0xc03884a0,0(%esi)
[ ... ]
> trap_fatal(e91a5780,34,c0372ee0,2e4,c658e780) at trap_fatal+0x250
> trap_pfault(e91a5780,0,34,c03e0758,34) at trap_pfault+0x17a
> trap(c21a0018,10,c0360010,9e,34) at trap+0x3e5
> calltrap() at calltrap+0x5
> --- trap 0xc, eip = 0xc01b28a6, esp = 0xe91a57c0, ebp = 0xe91a57e0 ---
> _mtx_lock_flags(34,0,c035cf5f,9e,c658e780) at _mtx_lock_flags+0x26

** Attempt to dereference the value 0x34 as if it were a pointer.

> namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134

Called from here.

Debug:

1) Make sure that the kernel that has the fault was created with config -g, so that there is a debug version of it lying around in the build directory.

2) Make sure that the kernel you installed is the stripped version of the debug kernel (there are two kernels created as a result of config -g; one is kernel.debug (the debug version) and the other is kernel (the stripped version)).

3) If #1 and #2 are not true, then make them true, and repeat the problem.

4) Boot a kernel that doesn't crash instead, so that you can run the debugger.

5) Go to the build directory, and look at the faulting code to see where it gets the value 0x34 to pass in to the _mtx_lock_flags(); this is the bogus value. For example, if you had a debug kernel for the kernel that has the problem, and it was config'ed from i386 GENERIC, you would use the following sequence of commands:

    cd /sys/i386/compile/GENERIC
    gdb -k kernel.debug
    list *(namei+0x134)

6) Change the code so the bogus value is no longer being passed.

7) Live happily ever after.

Note that, to me, this looks like a problem with a dereference of a current process which is not really current, as a result of a wakeup occurring in an interrupt handler for an outstanding request which was satisfied by the interrupt handler.

Note: Under no circumstances should a page 0 address be passed around to anyone, since page zero is typically unmapped in order to trigger NULL pointer dereference faults and/or structure member reference faults for structure elements (at least in the initial 4K: range 0x0-0x1000) when a structure pointer itself is NULL.

IMO, the most likely cause is that you have a NULL structure pointer, and the element at offset 0x34 into the structure is being referenced out of it, without checking that the pointer is not NULL, and the most likely culprit is a proc/kse/thread type structure that's not guaranteed to be valid in interrupt context.

Probably, the scheduler is switching directly from interrupt of a process context Q to a wakeup of the same process Q, without restoring a register value that should normally be restored following an interrupt. I have no idea which of the schedulers you are using, so I have no idea if this should be an expected omission; my best guess is you are using the new one, though, because this is an unlikely problem with the old one, if it's really a scheduler wakeup problem.

> namei(e91a5a44,c0207d5a,c749458c,0,c658e780) at namei+0x134
^
|
> vn_open_cred(e91a5a44,e91a5a0c,0,c2195e80,0) at vn_open_cred+0x53c
^ ^
| |
...all three of these are also incredibly suspicious, at first sight...

Until you are willing to list out the code where the bogus value is being passed to the function call, there's no way any of us are going to be able to correlate your stack traceback to our own source trees, in order to be able to help you, unless you are running a tagged version (e.g. 5.0-RELEASE) with no modifications.

Just saying "the most recent current" or "I CVS'up'ed on xxx date" is really useless to us, because CVS mirrors don't contain well known information relative to a CVS'up date. In many cases, we will need you to check out (at least!) a fresh /sys source tree from the CVS repository, using a date tag, if you are not running a -RELEASE version. Yes, this is a long-standing problem with the FreeBSD project itself. If you can do this, and repeat the