Re: Kernel panics on amd64 recently - do I have bad hardware?
> This part:
>
>   VOP_FSYNC() at VOP_FSYNC+0x2f
>   ffs_sync_vnode() at ffs_sync_vnode+0x77
>   vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
>   ffs_sync() at ffs_sync+0x83
>   sys_sync() at sys_sync+0xa1
>   vfs_syncwait() at vfs_syncwait+0x50
>   vfs_shutdown() at vfs_shutdown+0x32
>   boot() at boot+0x17f
>   panic() at panic+0xf6
>
> is from the boot crash, not the original crash. Looking at the original crash:
>
>   --- trap (number 8) ---
>   ffs_update() at ffs_update+0x19f
>
> That points to the math in the ino_to_fsba() macro in this line of ffs_update():
>
>   error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)), (int)fs->fs_bsize, bp);
>
> It's trying to calculate the block address of the inode so that it can update the timestamps in it, and divided by zero. That means the in-memory copy of the superblock had zeros in one or another member. If the on-disk superblock had zeros there, I would have expected fsck to catch it, or for it to crash earlier, but maybe a forced fsck is in order. Otherwise, something's writing through a bogus pointer in the kernel...

Well, I was hopeful after I manually fscked everything on Monday, but it crashed again last night:

  fatal integer divide fault in supervisor mode
  trap type 8 code 0 rip 81292dff cs 8 rflags 10246 cr2 9c8edee6f0c cpl 0 rsp 8000226bac30
  panic: trap type 8, code 0, pc=81292dff
  Starting stack trace...
  panic() at panic+0xfb
  trap() at trap+0x7f1
  --- trap (number 8) ---
  ffs_update() at ffs_update+0x19f
  ufs_inactive() at ufs_inactive+0xd3
  VOP_INACTIVE() at VOP_INACTIVE+0x28
  vrele() at vrele+0x61
  proc_zap() at proc_zap+0xa1
  dowait4() at dowait4+0x2ca
  sys_wait4() at sys_wait4+0x38
  syscall() at syscall+0x249
  --- syscall (number 11) ---
  end of kernel
  end trace frame: 0x9caee03eba0, count: 247
  0x9caf9cf4aea: End of stack trace.
  syncing disks...

I can re-enable the ddb.panic setting so it *doesn't* automatically reboot, but I don't know what information from the debugger would actually be useful. If you can suggest some commands to run from the ddb prompt, I'll be more than happy to run them the next time it crashes.

Thank you very much for any help!

Benny

--
No matter how tempted I am with the prospect of unlimited power, I will not consume any energy field bigger than my head.
-- #22 on Peter Anspach's Evil Overlord list
Re: Kernel panics on amd64 recently - do I have bad hardware?
> This part:
>
>   VOP_FSYNC() at VOP_FSYNC+0x2f
>   ffs_sync_vnode() at ffs_sync_vnode+0x77
>   vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
>   ffs_sync() at ffs_sync+0x83
>   sys_sync() at sys_sync+0xa1
>   vfs_syncwait() at vfs_syncwait+0x50
>   vfs_shutdown() at vfs_shutdown+0x32
>   boot() at boot+0x17f
>   panic() at panic+0xf6
>
> is from the boot crash, not the original crash. Looking at the original crash:
>
>   --- trap (number 8) ---
>   ffs_update() at ffs_update+0x19f
>
> That points to the math in the ino_to_fsba() macro in this line of ffs_update():
>
>   error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)), (int)fs->fs_bsize, bp);
>
> It's trying to calculate the block address of the inode so that it can update the timestamps in it, and divided by zero. That means the in-memory copy of the superblock had zeros in one or another member. If the on-disk superblock had zeros there, I would have expected fsck to catch it, or for it to crash earlier, but maybe a forced fsck is in order. Otherwise, something's writing through a bogus pointer in the kernel...

Thank you so much, Philip.

I ran each filesystem through a 'fsck -n' just to see what it thought, and it identified three filesystems that seemed to have issues. So I dropped to single-user mode and ran fsck on each one. It didn't say it fixed anything, which kinda surprised me, but I ran fsck on every filesystem and then did a 'fsck -p' for good measure. Everything came up clean?

I booted it back up, and I guess we'll see how things go...

Thank you for your help!

Benny

--
No matter how tempted I am with the prospect of unlimited power, I will not consume any energy field bigger than my head.
-- #22 on Peter Anspach's Evil Overlord list
Kernel panics on amd64 recently - do I have bad hardware?
Hey folks,

I've had a helluva week - my colocated server has crashed at least four times, and I'd like a little sanity check from people that know a lot more than I do. Sorry for the length of this, trying to include all the data I'm aware of that might be relevant and helpful.

For the two crashes that I've been able to capture some output from (one from an IP KVM, one from /var/log/messages after setting ddb.panic=0), I've seen:

  uvm_fault(0x81cf2b20, 0x80cef000, 0, 2) -> e
  kernel: page fault trap, code=0
  Stopped at memmove+0x16: repe movsq (%rsi),%es:(%rdi)

and

  reboot after panic: trap type 8, code=0, pc=81292dff

Because kernel panics are so rare in OpenBSD, I don't have much experience debugging them. Following crash(8), I fired up gdb and took a look at this morning's crash and auto-reboot:

  gdb
  GNU gdb 6.3
  Copyright 2004 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type show copying to see the conditions.
  There is absolutely no warranty for GDB. Type show warranty for details.
  This GDB was configured as amd64-unknown-openbsd5.4.
  (gdb) file /var/crash/bsd.0
  Reading symbols from /var/crash/bsd.0...(no debugging symbols found)...done.
  (gdb) target kvm /var/crash/bsd.0.core
  #0 0x8130a194 in dumpsys ()
  (gdb) where
  #0 0x8130a194 in dumpsys ()
  #1 0x8130a2e5 in boot ()
  #2 0x811a2d76 in panic ()
  #3 0x81313d51 in trap ()
  #4 0x81315766 in alltraps ()
  #5 0x in ?? ()

I don't *think* it was resource starvation:

  vmstat -N /var/crash/bsd.0 -M /var/crash/bsd.0.core -m

  Memory statistics by bucket size
  Size In Use Free Requests HighWater Couldfree
  1646085 47867 109033481280 2417
  32 36535711604650 640 0
  64 4215 12892687011 320 18492
  128 5405 1411 925024 160930
  256 2066286 629177 80 74
  512 1774338 462020 40 9397
  1024 1539685 578108 20 141600
  2048 287 45 78486 10 21570
  4096 83528 144485 5 101528
  8192 20 8 18105 5 7483
  163841 0366 5 0
  327688 0102 5 0
  655362 01909341 5 0
  5242882 0 2 5 0

  Memory usage type by bucket size
  Size Type(s)
  16 devbuf, pcb, routetbl, sem, dirhash, ACPI, exec, UVM amap, UVM aobj, USB, USB device, temp
  32 devbuf, pcb, routetbl, ifaddr, sysctl, vnodes, sem, dirhash, ACPI, in_multi, exec, UVM amap, USB, temp
  64 devbuf, routetbl, ifaddr, vnodes, UFS mount, dirhash, ACPI, proc, VFS cluster, in_multi, ether_multi, VM swap, UVM amap, USB, USB device, NDP, temp
  128 devbuf, pcb, routetbl, sysctl, UFS mount, sem, dirhash, ACPI, NFS srvsock, ttys, pfkey data, inodedep, VM swap, UVM amap, USB, USB device, USB HC, NDP, temp
  256 devbuf, routetbl, ifaddr, ioctlops, vnodes, UFS mount, shm, VM map, sem, dirhash, ACPI, exec, xform_data, UVM amap, USB, USB device, temp
  512 devbuf, routetbl, ifaddr, ioctlops, sem, dirhash, ACPI, file desc, NFS daemon, ttys, xform_data, newblk, UVM amap, USB, temp
  1024 devbuf, pcb, sysctl, ioctlops, mount, UFS mount, shm, dirhash, ACPI, file desc, proc, ttys, exec, UVM amap, crypto data, temp
  2048 devbuf, ioctlops, UFS mount, sem, dirhash, ACPI, file desc, VM swap, UVM amap, UVM aobj, temp
  4096 devbuf, ifaddr, ioctlops, UFS mount, shm, dirhash, file desc, proc, UVM amap, memdesc, temp
  8192 devbuf, file, ttys, pagedep, UVM amap, USB, temp
  16384 devbuf, MSDOSFS mount, indirdep, temp
  32768 devbuf, UFS quota, UFS mount, ISOFS mount, inodedep, indirdep, NTFS hash
  65536 devbuf, temp
  524288 VM swap

  Memory statistics by type
  Type Kern
  Type InUse MemUse HighUse Limit Requests Limit Limit Size(s)
  devbuf 733 495K 2597K 78644K232870 0 16,32,64,128,256,512,1024,2048,4096,8192,16384,32768,65536
  pcb 21834K 42K 78644K407230 0 16,32,128,1024
  routetbl78 9K 10K 78644K 41980 0 16,32,64,128,256,512
  ifaddr5616K 16K 78644K 580 0 32,64,256,512,4096
  sysctl 3 2K 2K 78644K30 0 32,128,1024
  ioctlops 0 0K 4K 78644K 46320 0 256,512,1024,2048,4096
  mount1313K
Re: Kernel panics on amd64 recently - do I have bad hardware?
> For the two crashes that I've been able to capture some output from (one from an IP KVM, one from /var/log/messages after setting ddb.panic=0), I've seen:
>
>   uvm_fault(0x81cf2b20, 0x80cef000, 0, 2) -> e
>   kernel: page fault trap, code=0
>   Stopped at memmove+0x16: repe movsq (%rsi),%es:(%rdi)
>
> and
>
>   reboot after panic: trap type 8, code=0, pc=81292dff

Whoops - the hosting company caught some of the panic message before it rebooted this morning (retyped from a screenshot):

  VOP_FSYNC() at VOP_FSYNC+0x2f
  ffs_sync_vnode() at ffs_sync_vnode+0x77
  vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
  ffs_sync() at ffs_sync+0x83
  sys_sync() at sys_sync+0xa1
  vfs_syncwait() at vfs_syncwait+0x50
  vfs_shutdown() at vfs_shutdown+0x32
  boot() at boot+0x17f
  panic() at panic+0xf6
  trap() at trap+0x7f1
  --- trap (number 8) ---
  ffs_update() at ffs_update+0x19f
  ufs_inactive() at ufs_inactive+0xd3
  VOP_INACTIVE() at VOP_INACTIVE+0x28
  vrele() at vrele+0x61
  proc_zap() at proc_zap+0xa1
  dowait4() at dowait4+0x2ca
  sys_wait4() at sys_wait4+0x38
  syscall() at syscall+0x249
  --- syscall (number 11) ---
  end of kernel
  end trace frame: 0xd6f3ca35c0, count: 215
  0xd6f3cebaea: End of stack trace.

--
No matter how tempted I am with the prospect of unlimited power, I will not consume any energy field bigger than my head.
-- #22 on Peter Anspach's Evil Overlord list
Re: Kernel panics on amd64 recently - do I have bad hardware?
On Sun, Sep 15, 2013 at 6:17 AM, C. Bensend be...@bennyvision.com wrote:
> > For the two crashes that I've been able to capture some output from (one from an IP KVM, one from /var/log/messages after setting ddb.panic=0), I've seen:
> >
> >   uvm_fault(0x81cf2b20, 0x80cef000, 0, 2) -> e
> >   kernel: page fault trap, code=0
> >   Stopped at memmove+0x16: repe movsq (%rsi),%es:(%rdi)
> >
> > and
> >
> >   reboot after panic: trap type 8, code=0, pc=81292dff
>
> Whoops - the hosting company caught some of the panic message before it rebooted this morning (retyped from a screenshot):

This part:

  VOP_FSYNC() at VOP_FSYNC+0x2f
  ffs_sync_vnode() at ffs_sync_vnode+0x77
  vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
  ffs_sync() at ffs_sync+0x83
  sys_sync() at sys_sync+0xa1
  vfs_syncwait() at vfs_syncwait+0x50
  vfs_shutdown() at vfs_shutdown+0x32
  boot() at boot+0x17f
  panic() at panic+0xf6

is from the boot crash, not the original crash. Looking at the original crash:

  --- trap (number 8) ---
  ffs_update() at ffs_update+0x19f

That points to the math in the ino_to_fsba() macro in this line of ffs_update():

  error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)), (int)fs->fs_bsize, bp);

It's trying to calculate the block address of the inode so that it can update the timestamps in it, and divided by zero. That means the in-memory copy of the superblock had zeros in one or another member. If the on-disk superblock had zeros there, I would have expected fsck to catch it, or for it to crash earlier, but maybe a forced fsck is in order. Otherwise, something's writing through a bogus pointer in the kernel...

Philip Guenther
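
To make the divide-by-zero explanation above a little more concrete, here is a minimal user-space sketch in C. It is not the kernel code: struct mini_fs, check_divisors() and ino_locate() are hypothetical names made up for illustration, and the arithmetic is a simplified stand-in for ino_to_fsba() (the real macro also adds the cylinder group's inode-area base and converts blocks to fragments). It assumes the divisors involved are the superblock's inodes-per-cylinder-group and inodes-per-block counts, called fs_ipg and fs_inopb here; the thread itself only says "one or another member".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical, cut-down stand-in for the in-memory superblock: only the
 * two members this sketch assumes the inode arithmetic divides by.
 */
struct mini_fs {
	int32_t fs_ipg;   /* inodes per cylinder group (assumed divisor) */
	int32_t fs_inopb; /* inodes per filesystem block (assumed divisor) */
};

enum sb_divisor { SB_OK = 0, SB_ZERO_IPG, SB_ZERO_INOPB };

/* Report which divisor, if any, would cause an integer divide fault. */
enum sb_divisor
check_divisors(const struct mini_fs *fs)
{
	assert(fs != NULL);
	if (fs->fs_ipg == 0)
		return SB_ZERO_IPG;
	if (fs->fs_inopb == 0)
		return SB_ZERO_INOPB;
	return SB_OK;
}

/*
 * Simplified inode location: which cylinder group the inode lives in and
 * which inode block inside that group. Returns 0 on success, or -1 when a
 * zeroed member would have made the division trap (the kernel has no such
 * guard, which is why a zeroed superblock field shows up as trap type 8).
 */
int
ino_locate(const struct mini_fs *fs, uint32_t ino, uint32_t *cg, uint32_t *blk)
{
	if (check_divisors(fs) != SB_OK)
		return -1;
	*cg = ino / (uint32_t)fs->fs_ipg;
	*blk = (ino % (uint32_t)fs->fs_ipg) / (uint32_t)fs->fs_inopb;
	return 0;
}
```

With fs_ipg = 1024 and fs_inopb = 64, inode 2200 lands in cylinder group 2, inode block 2. With either member zeroed, ino_locate() returns -1 where the unguarded kernel arithmetic would fault.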