Re: Kernel panics on amd64 recently - do I have bad hardware?

2013-09-19 Thread C. Bensend
 This part:

 VOP_FSYNC() at VOP_FSYNC+0x2f
 ffs_sync_vnode() at ffs_sync_vnode+0x77
 vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
 ffs_sync() at ffs_sync+0x83
 sys_sync() at sys_sync+0xa1
 vfs_syncwait() at vfs_syncwait+0x50
 vfs_shutdown() at vfs_shutdown+0x32
 boot() at boot+0x17f
 panic() at panic+0xf6

 is from the boot crash, not the original crash.  Looking at the
 original crash:

 --- trap (number 8) ---
 ffs_update() at ffs_update+0x19f

 That points to the math in the ino_to_fsba() macro in this line of
 ffs_update():
 error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)),
     (int)fs->fs_bsize, &bp);

 It's trying to calculate the block address of the inode so that it can
 update the timestamps in it, and it divided by zero.  That means the
 in-memory copy of the superblock had zeros in one or another member.
 If the on-disk superblock had zeros there, I would have expected fsck to
 catch it, or for it to crash earlier, but maybe a forced fsck is in
 order.  Otherwise, something's writing through a bogus pointer in the
 kernel...

Well, I was hopeful after I manually fscked everything on Monday, but
it crashed again last night:

fatal integer divide fault in supervisor mode
trap type 8 code 0 rip 81292dff cs 8 rflags 10246 cr2  9c8edee6f0c
cpl 0 rsp 8000226bac30
panic: trap type 8, code 0, pc=81292dff
Starting stack trace...
panic() at panic+0xfb
trap() at trap+0x7f1
--- trap (number 8) ---
ffs_update() at ffs_update+0x19f
ufs_inactive() at ufs_inactive+0xd3
VOP_INACTIVE() at VOP_INACTIVE+0x28
vrele() at vrele+0x61
proc_zap() at proc_zap+0xa1
dowait4() at dowait4+0x2ca
sys_wait4() at sys_wait4+0x38
syscall() at syscall+0x249
--- syscall (number 11) ---
end of kernel
end trace frame: 0x9caee03eba0, count: 247
0x9caf9cf4aea:
End of stack trace.
syncing disks...

I can re-enable the ddb.panic setting so it *doesn't* automatically
reboot, but I don't know what information from the debugger would be
actually useful.  If you can suggest some commands to run from the
ddb prompt, I'll be more than happy to do so the next time it crashes.

Thank you very much for any help!

Benny


-- 
No matter how tempted I am with the prospect of unlimited power, I
will not consume any energy field bigger than my head.
  -- #22 on Peter Anspach's Evil Overlord list



Re: Kernel panics on amd64 recently - do I have bad hardware?

2013-09-17 Thread C. Bensend
 This part:

 VOP_FSYNC() at VOP_FSYNC+0x2f
 ffs_sync_vnode() at ffs_sync_vnode+0x77
 vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
 ffs_sync() at ffs_sync+0x83
 sys_sync() at sys_sync+0xa1
 vfs_syncwait() at vfs_syncwait+0x50
 vfs_shutdown() at vfs_shutdown+0x32
 boot() at boot+0x17f
 panic() at panic+0xf6

 is from the boot crash, not the original crash.  Looking at the
 original crash:

 --- trap (number 8) ---
 ffs_update() at ffs_update+0x19f

 That points to the math in the ino_to_fsba() macro in this line of
 ffs_update():
 error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)),
     (int)fs->fs_bsize, &bp);

 It's trying to calculate the block address of the inode so that it can
 update the timestamps in it, and it divided by zero.  That means the
 in-memory copy of the superblock had zeros in one or another member.
 If the on-disk superblock had zeros there, I would have expected fsck to
 catch it, or for it to crash earlier, but maybe a forced fsck is in
 order.  Otherwise, something's writing through a bogus pointer in the
 kernel...

Thank you so much, Philip.

I ran each filesystem through a 'fsck -n' just to see what it thought,
and it identified three filesystems that seemed to have issues.  So,
I dropped it down to single user and ran fsck on each one.  It didn't
say it fixed anything - kinda surprised me - but I ran fsck on every
filesystem, and then did a 'fsck -p' for good measure.  Everything
came up clean?

I booted it back up and I guess we'll see how things go...

Thank you for your help!

Benny





Kernel panics on amd64 recently - do I have bad hardware?

2013-09-15 Thread C. Bensend
Hey folks,

   I've had a helluva week - my colocated server has crashed at least
four times, and I'd like a little sanity check from people that know
a lot more than I do.  Sorry for the length of this, trying to include
all the data I'm aware of that might be relevant and helpful.

   For the two crashes that I've been able to capture some output
from (one from an IP KVM, one from /var/log/messages after setting
ddb.panic=0), I've seen:


uvm_fault(0x81cf2b20, 0x80cef000, 0, 2) -> e
kernel: page fault trap, code=0
Stopped at  memmove+0x16:   repe movsq  (%rsi),%es:(%rdi)

and

reboot after panic: trap type 8, code=0, pc=81292dff


   Because kernel panics are so rare in OpenBSD, I don't have much
experience debugging them.  Following crash(8), I fired up gdb and
took a look at this morning's crash and auto-reboot:


gdb
GNU gdb 6.3
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-unknown-openbsd5.4".
(gdb) file /var/crash/bsd.0
Reading symbols from /var/crash/bsd.0...(no debugging symbols found)...done.
(gdb) target kvm /var/crash/bsd.0.core
#0  0x8130a194 in dumpsys ()
(gdb) where
#0  0x8130a194 in dumpsys ()
#1  0x8130a2e5 in boot ()
#2  0x811a2d76 in panic ()
#3  0x81313d51 in trap ()
#4  0x81315766 in alltraps ()
#5  0x in ?? ()


I don't *think* it was resource starvation:


vmstat -N /var/crash/bsd.0 -M /var/crash/bsd.0.core -m
Memory statistics by bucket size
Size   In Use   Free   Requests  HighWater  Couldfree
  1646085  47867   109033481280   2417
  32 36535711604650 640  0
  64 4215   12892687011 320  18492
 128 5405   1411 925024 160930
 256 2066286 629177  80 74
 512 1774338 462020  40   9397
1024 1539685 578108  20 141600
2048  287 45  78486  10  21570
4096   83528 144485   5 101528
8192   20  8  18105   5   7483
   163841  0366   5  0
   327688  0102   5  0
   655362  01909341   5  0
  5242882  0  2   5  0

Memory usage type by bucket size
  Size  Type(s)
    16  devbuf, pcb, routetbl, sem, dirhash, ACPI, exec, UVM amap, UVM aobj,
        USB, USB device, temp
    32  devbuf, pcb, routetbl, ifaddr, sysctl, vnodes, sem, dirhash, ACPI,
        in_multi, exec, UVM amap, USB, temp
    64  devbuf, routetbl, ifaddr, vnodes, UFS mount, dirhash, ACPI, proc,
        VFS cluster, in_multi, ether_multi, VM swap, UVM amap, USB,
        USB device, NDP, temp
   128  devbuf, pcb, routetbl, sysctl, UFS mount, sem, dirhash, ACPI,
        NFS srvsock, ttys, pfkey data, inodedep, VM swap, UVM amap, USB,
        USB device, USB HC, NDP, temp
   256  devbuf, routetbl, ifaddr, ioctlops, vnodes, UFS mount, shm, VM map,
        sem, dirhash, ACPI, exec, xform_data, UVM amap, USB, USB device, temp
   512  devbuf, routetbl, ifaddr, ioctlops, sem, dirhash, ACPI, file desc,
        NFS daemon, ttys, xform_data, newblk, UVM amap, USB, temp
  1024  devbuf, pcb, sysctl, ioctlops, mount, UFS mount, shm, dirhash, ACPI,
        file desc, proc, ttys, exec, UVM amap, crypto data, temp
  2048  devbuf, ioctlops, UFS mount, sem, dirhash, ACPI, file desc, VM swap,
        UVM amap, UVM aobj, temp
  4096  devbuf, ifaddr, ioctlops, UFS mount, shm, dirhash, file desc, proc,
        UVM amap, memdesc, temp
  8192  devbuf, file, ttys, pagedep, UVM amap, USB, temp
 16384  devbuf, MSDOSFS mount, indirdep, temp
 32768  devbuf, UFS quota, UFS mount, ISOFS mount, inodedep, indirdep,
        NTFS hash
 65536  devbuf, temp
524288  VM swap

Memory statistics by type   Type  Kern
  Type InUse MemUse HighUse  Limit Requests Limit Limit Size(s)
devbuf   733   495K   2597K 78644K232870 0 
16,32,64,128,256,512,1024,2048,4096,8192,16384,32768,65536
   pcb   21834K 42K 78644K407230 0 
16,32,128,1024
  routetbl78 9K 10K 78644K 41980 0 
16,32,64,128,256,512
ifaddr5616K 16K 78644K   580 0 
32,64,256,512,4096
sysctl 3 2K  2K 78644K30 0  32,128,1024
  ioctlops 0 0K  4K 78644K 46320 0 
256,512,1024,2048,4096
 mount1313K   

Re: Kernel panics on amd64 recently - do I have bad hardware?

2013-09-15 Thread C. Bensend
For the two crashes that I've been able to capture some output
 from (one from an IP KVM, one from /var/log/messages after setting
 ddb.panic=0), I've seen:


 uvm_fault(0x81cf2b20, 0x80cef000, 0, 2) -> e
 kernel: page fault trap, code=0
 Stopped at  memmove+0x16:   repe movsq  (%rsi),%es:(%rdi)

 and

 reboot after panic: trap type 8, code=0, pc=81292dff

Whoops - the hosting company caught some of the panic message before
it rebooted this morning (retyped from a screenshot):


VOP_FSYNC() at VOP_FSYNC+0x2f
ffs_sync_vnode() at ffs_sync_vnode+0x77
vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
ffs_sync() at ffs_sync+0x83
sys_sync() at sys_sync+0xa1
vfs_syncwait() at vfs_syncwait+0x50
vfs_shutdown() at vfs_shutdown+0x32
boot() at boot+0x17f
panic() at panic+0xf6
trap() at trap+0x7f1
--- trap (number 8) ---
ffs_update() at ffs_update+0x19f
ufs_inactive() at ufs_inactive+0xd3
VOP_INACTIVE() at VOP_INACTIVE+0x28
vrele() at vrele+0x61
proc_zap() at proc_zap+0xa1
dowait4() at dowait4+0x2ca
sys_wait4() at sys_wait4+0x38
syscall() at syscall+0x249
--- syscall (number 11) ---
end of kernel
end trace frame: 0xd6f3ca35c0, count: 215
0xd6f3cebaea:
End of stack trace.






Re: Kernel panics on amd64 recently - do I have bad hardware?

2013-09-15 Thread Philip Guenther
On Sun, Sep 15, 2013 at 6:17 AM, C. Bensend be...@bennyvision.com wrote:
For the two crashes that I've been able to capture some output
 from (one from an IP KVM, one from /var/log/messages after setting
 ddb.panic=0), I've seen:

 uvm_fault(0x81cf2b20, 0x80cef000, 0, 2) -> e
 kernel: page fault trap, code=0
 Stopped at  memmove+0x16:   repe movsq  (%rsi),%es:(%rdi)

 and

 reboot after panic: trap type 8, code=0, pc=81292dff

 Whoops - the hosting company caught some of the panic message before
 it rebooted this morning (retyped from a screenshot):


This part:

 VOP_FSYNC() at VOP_FSYNC+0x2f
 ffs_sync_vnode() at ffs_sync_vnode+0x77
 vfs_mount_foreach_vnode() at vfs_mount_foreach_vnode+0x38
 ffs_sync() at ffs_sync+0x83
 sys_sync() at sys_sync+0xa1
 vfs_syncwait() at vfs_syncwait+0x50
 vfs_shutdown() at vfs_shutdown+0x32
 boot() at boot+0x17f
 panic() at panic+0xf6

is from the boot crash, not the original crash.  Looking at the
original crash:

 --- trap (number 8) ---
 ffs_update() at ffs_update+0x19f

That points to the math in the ino_to_fsba() macro in this line of ffs_update():
error = bread(ip->i_devvp, fsbtodb(fs, ino_to_fsba(fs, ip->i_number)),
    (int)fs->fs_bsize, &bp);

It's trying to calculate the block address of the inode so that it can
update the timestamps in it, and it divided by zero.  That means the
in-memory copy of the superblock had zeros in one or another member.
If the on-disk superblock had zeros there, I would have expected fsck to
catch it, or for it to crash earlier, but maybe a forced fsck is in
order.  Otherwise, something's writing through a bogus pointer in the
kernel...
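
To make the arithmetic concrete, here is a simplified user-space sketch,
not the kernel's actual code.  It assumes the divisors involved are the
inodes-per-cylinder-group and inodes-per-block members of the superblock;
the struct mini_fs and the function mini_ino_to_fsba() are hypothetical
names made up for illustration, and the cylinder-group start is reduced
to a plain multiply.  Unlike the kernel macro, the sketch checks the
divisors and returns -1 where the kernel would take trap 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the FFS superblock.  The member
 * names mirror real superblock fields, but the real struct fs has many
 * more members. */
struct mini_fs {
    uint32_t fs_ipg;    /* inodes per cylinder group */
    uint32_t fs_inopb;  /* inodes per filesystem block */
    uint32_t fs_fpg;    /* fragments per cylinder group */
    uint32_t fs_iblkno; /* offset of the inode blocks within a group */
    uint32_t fs_frag;   /* fragments per block */
};

/*
 * Simplified inode-number-to-fragment-address calculation.
 * Returns -1 when a divisor is zero; the kernel macro has no such
 * check (assumption), so a zeroed member there means a divide fault.
 */
int64_t
mini_ino_to_fsba(const struct mini_fs *fs, uint32_t ino)
{
    assert(fs != NULL);

    if (fs->fs_ipg == 0 || fs->fs_inopb == 0)
        return -1;

    /* Which cylinder group holds the inode. */
    uint32_t cg = ino / fs->fs_ipg;
    /* Which inode block within that group. */
    uint32_t blk = (ino % fs->fs_ipg) / fs->fs_inopb;

    return (int64_t)cg * fs->fs_fpg + fs->fs_iblkno +
        (int64_t)blk * fs->fs_frag;
}
```

With both members nonzero the result is an ordinary fragment address;
zero either one and the sketch reports -1 at the point where the real
code would fault.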


Philip Guenther