Re: Panic caused by bad memory?

2006-10-26 Thread Charles Sprickman

On Wed, 25 Oct 2006, John Baldwin wrote:


> On Wednesday 25 October 2006 02:28, Charles Sprickman wrote:
>> On Tue, 24 Oct 2006 [EMAIL PROTECTED] wrote:
>>
>>>> I can't get a kernel dump since it fails like this each time:
>>>>
>>>> dumping to dev #da/0x20001, offset 2097152
>>>> dump 1024 1023 1022 1021 Aborting dump due to I/O error.
>>>> status == 0xb, scsi status == 0x0
>>>> failed, reason: i/o error
>>>
>>> Bad memory seems unlikely to cause an I/O error trying to write the
>>> dump to the swap partition.  I'd guess a dicey drive -- and bad
>>> swap space could also account for the original crash.  You might
>>> be able to get a backup by booting single user, provided nothing
>>> activates the (presumably bad) swap partition.
>>
>> Just for the record, this box is running an Adaptec raid controller (2005S
>> - ZCR card) and swap is coming off a mirrored array.
>>
>> Coincidentally, I have a utility box where it had bad blocks on the swap
>> partition (but no others) - what I saw there is that the box would just
>> hang and spit out a bunch of swap_pager timeout messages to the console.
>> Quick and dirty remote fix while waiting for a drive?  Run file-backed
>> swap on /usr. :)
>>
>> Let's pretend for a minute it's not the drive that's the root cause...
>> Not saying it isn't - we're none too thrilled with these Adaptec RAID
>> controllers...  Do those memory addresses in the panic message point
>> towards bad memory if they are always the same?
>
> No, they are virtual addresses.  Having the same EIP means you are crashing in
> the same place.  Did you recently kldunload a module before it crashed?


Same place == same code?  The only change on this box was a massive
portupgrade which included apache, php, mysql, postgres and most of the
additional GNU tools.


There is one module that someone set to load on boot, and that's the 
linuxolator.  I have disabled that in rc.conf for now and we'll see what 
happens after the next panic.


We also have a few sticks of RAM on order now...

Thanks,

Charles


> --
> John Baldwin
> ___
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to [EMAIL PROTECTED]




Re: Panic caused by bad memory?

2006-10-25 Thread Charles Sprickman

On Tue, 24 Oct 2006 [EMAIL PROTECTED] wrote:


>> I can't get a kernel dump since it fails like this each time:
>>
>> dumping to dev #da/0x20001, offset 2097152
>> dump 1024 1023 1022 1021 Aborting dump due to I/O error.
>> status == 0xb, scsi status == 0x0
>> failed, reason: i/o error
>
> Bad memory seems unlikely to cause an I/O error trying to write the
> dump to the swap partition.  I'd guess a dicey drive -- and bad
> swap space could also account for the original crash.  You might
> be able to get a backup by booting single user, provided nothing
> activates the (presumably bad) swap partition.


Just for the record, this box is running an Adaptec raid controller (2005S 
- ZCR card) and swap is coming off a mirrored array.


Coincidentally, I have a utility box where it had bad blocks on the swap 
partition (but no others) - what I saw there is that the box would just 
hang and spit out a bunch of swap_pager timeout messages to the console. 
Quick and dirty remote fix while waiting for a drive?  Run file-backed 
swap on /usr. :)


Let's pretend for a minute it's not the drive that's the root cause... 
Not saying it isn't - we're none too thrilled with these Adaptec RAID 
controllers...  Do those memory addresses in the panic message point 
towards bad memory if they are always the same?


Thanks,

Charles


Re: Panic caused by bad memory?

2006-10-25 Thread John Baldwin
On Wednesday 25 October 2006 02:28, Charles Sprickman wrote:
> On Tue, 24 Oct 2006 [EMAIL PROTECTED] wrote:
>
>>> I can't get a kernel dump since it fails like this each time:
>>>
>>> dumping to dev #da/0x20001, offset 2097152
>>> dump 1024 1023 1022 1021 Aborting dump due to I/O error.
>>> status == 0xb, scsi status == 0x0
>>> failed, reason: i/o error
>>
>> Bad memory seems unlikely to cause an I/O error trying to write the
>> dump to the swap partition.  I'd guess a dicey drive -- and bad
>> swap space could also account for the original crash.  You might
>> be able to get a backup by booting single user, provided nothing
>> activates the (presumably bad) swap partition.
>
> Just for the record, this box is running an Adaptec raid controller (2005S
> - ZCR card) and swap is coming off a mirrored array.
>
> Coincidentally, I have a utility box where it had bad blocks on the swap
> partition (but no others) - what I saw there is that the box would just
> hang and spit out a bunch of swap_pager timeout messages to the console.
> Quick and dirty remote fix while waiting for a drive?  Run file-backed
> swap on /usr. :)
>
> Let's pretend for a minute it's not the drive that's the root cause...
> Not saying it isn't - we're none too thrilled with these Adaptec RAID
> controllers...  Do those memory addresses in the panic message point
> towards bad memory if they are always the same?

No, they are virtual addresses.  Having the same EIP means you are crashing in 
the same place.  Did you recently kldunload a module before it crashed?
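
One way to see which function that EIP lands in is to list the kernel's
symbols in address order (nm -n on the kernel that was actually running)
and take the last text symbol at or below the address.  A minimal Python
sketch of that lookup follows.  The symbol names and addresses in
SAMPLE_NM are hypothetical, invented for illustration; only 0xc028b053
comes from the panic output in this thread, and the helper names are
made up as well.

```python
import bisect


def load_symbols(nm_output):
    """Parse `nm -n`-style lines ("c028b000 T name") into a sorted list
    of (address, name) pairs, keeping only text (code) symbols."""
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("T", "t"):
            symbols.append((int(parts[0], 16), parts[2]))
    symbols.sort()
    return symbols


def lookup(symbols, addr):
    """Return (name, offset) for the last symbol starting at or below
    addr, or None if addr is below every symbol."""
    starts = [start for start, _ in symbols]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    start, name = symbols[i]
    return name, addr - start


# HYPOTHETICAL symbol table: these names and addresses are invented.
# Real input would be the output of `nm -n` on the running kernel.
SAMPLE_NM = """\
c028a000 T example_func_a
c028b000 T example_func_b
c028c000 T example_func_c
"""

symbols = load_symbols(SAMPLE_NM)
print(lookup(symbols, 0xC028B053))
```

With a real symbol table, the same lookup on both panics would name the
same function, since the instruction pointer is identical in each.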

-- 
John Baldwin


Panic caused by bad memory?

2006-10-24 Thread Charles Sprickman

Hello all,

Without a full dump, are there any telltale signs in the panic message
that indicate whether I'm dealing with a hardware or software issue?  I
have a box that has been running 4.11-p10 for quite some time with no
problems.  I upgraded a number of ports (apache/php/mysql) and since then
I've had two panics.  Of course userland apps shouldn't cause this, but
that's the only change I see.


I can't get a kernel dump since it fails like this each time:

dumping to dev #da/0x20001, offset 2097152
dump 1024 1023 1022 1021 Aborting dump due to I/O error.
status == 0xb, scsi status == 0x0
failed, reason: i/o error

The meat of my question, though: what are these lines telling me?

(panic 1)
instruction pointer = 0x8:0xc028b053
stack pointer       = 0x10:0xe138eefc
frame pointer       = 0x10:0xe138ef2c

(panic 2)
instruction pointer = 0x8:0xc028b053
stack pointer       = 0x10:0xe138eefc
frame pointer       = 0x10:0xe138ef2c

Are those physical memory addresses where the code that caused the panic 
resides?  If so, does that point to bad RAM?
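
For what it's worth, the comparison can be done mechanically.  The
sketch below pulls the address fields out of the two trap reports quoted
further down in this message and checks which ones repeat.  It assumes
the stock FreeBSD/i386 layout, where the kernel is mapped at 0xc0000000
and up; KERNBASE here is that assumption, not something read from this
box, and the helper names are made up for illustration.

```python
import re

# Address fields copied from the two trap reports in this message.
PANIC_1 = """\
fault virtual address   = 0xc327c614
instruction pointer = 0x8:0xc028b053
stack pointer   = 0x10:0xe138eefc
frame pointer   = 0x10:0xe138ef2c
"""
PANIC_2 = """\
fault virtual address   = 0xc29d2b94
instruction pointer = 0x8:0xc028b053
stack pointer   = 0x10:0xe138eefc
frame pointer   = 0x10:0xe138ef2c
"""

# ASSUMPTION: default i386 kernel base; adjust if the kernel was built
# with a different address-space split.
KERNBASE = 0xC0000000

# Matches "name = 0xADDR" and "name = 0xSEL:0xADDR", keeping only ADDR.
FIELD = re.compile(
    r"^(fault virtual address|instruction pointer|stack pointer|frame pointer)"
    r"\s*=\s*(?:0x[0-9a-f]+:)?(0x[0-9a-f]+)\s*$",
    re.MULTILINE,
)


def parse_trap(text):
    """Return {field name: address} for the fields found in text."""
    return {name: int(value, 16) for name, value in FIELD.findall(text)}


def same_fields(a, b):
    """Return {field name: True if both reports have the same value}."""
    return {name: a[name] == b[name] for name in a if name in b}


def is_kernel_va(addr):
    """True if addr lies in the (assumed) kernel virtual address range."""
    return addr >= KERNBASE


first, second = parse_trap(PANIC_1), parse_trap(PANIC_2)
for name, same in same_fields(first, second).items():
    print(name, "same" if same else "differs",
          "kernel VA" if is_kernel_va(first[name]) else "not kernel VA")
```

On these two panics the instruction, stack and frame pointers match
while the fault virtual address differs, and every value falls in the
assumed kernel virtual range rather than naming a physical location.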


Thanks,

Charles

Here's more info if anyone is curious:

[-- MARK -- Mon Oct 23 06:00:00 2006]


Fatal trap 12: page fault while in kernel mode
mp_lock = 0002; cpuid = 0; lapic.id = 
fault virtual address   = 0xc327c614
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc028b053
stack pointer           = 0x10:0xe138eefc
frame pointer           = 0x10:0xe138ef2c
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 8 (syncer)
interrupt mask          = none - SMP: XXX
trap number             = 12
panic: page fault
mp_lock = 0002; cpuid = 0; lapic.id = 
boot() called on cpu#0

syncing disks... panic: rslock: cpu: 0, addr: 0xc0391ccc, lock: 0x0001
mp_lock = 0002; cpuid = 0; lapic.id = 
boot() called on cpu#0
Uptime: 441d9h31m5s

dumping to dev #da/0x20001, offset 2097152
dump 1024 1023 1022 1021 Aborting dump due to I/O error.
status == 0xb, scsi status == 0x0
failed, reason: i/o error
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset called on cpu#0
cpu_reset: Stopping other CPUs

[-- MARK -- Tue Oct 24 09:00:00 2006]


Fatal trap 12: page fault while in kernel mode
mp_lock = 0102; cpuid = 1; lapic.id = 0100
fault virtual address   = 0xc29d2b94
fault code              = supervisor read, page not present
instruction pointer     = 0x8:0xc028b053
stack pointer           = 0x10:0xe138eefc
frame pointer           = 0x10:0xe138ef2c
code segment            = base 0x0, limit 0xf, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 8 (syncer)
interrupt mask          = none - SMP: XXX
trap number             = 12
panic: page fault
mp_lock = 0102; cpuid = 1; lapic.id = 0100
boot() called on cpu#1

syncing disks... panic: rslock: cpu: 1, addr: 0xc0391ccc, lock: 0x0101
mp_lock = 0102; cpuid = 1; lapic.id = 0100
boot() called on cpu#1
Uptime: 1d2h55m38s

dumping to dev #da/0x20001, offset 2097152
dump 1024 1023 1022 1021 Aborting dump due to I/O error.
status == 0xb, scsi status == 0x0
failed, reason: i/o error
Automatic reboot in 15 seconds - press a key on the console to abort
Rebooting...
cpu_reset called on cpu#1
cpu_reset: Stopping other CPUs
cpu_reset: Restarting BSP
cpu_reset_proxy: Grabbed mp locckp uf_re sBeStP:
BSP did not grab mp lock


Re: Panic caused by bad memory?

2006-10-24 Thread perryh
> I can't get a kernel dump since it fails like this each time:
>
> dumping to dev #da/0x20001, offset 2097152
> dump 1024 1023 1022 1021 Aborting dump due to I/O error.
> status == 0xb, scsi status == 0x0
> failed, reason: i/o error

Bad memory seems unlikely to cause an I/O error trying to write the
dump to the swap partition.  I'd guess a dicey drive -- and bad
swap space could also account for the original crash.  You might
be able to get a backup by booting single user, provided nothing
activates the (presumably bad) swap partition.