Re: Please help - kernel crashes often

2006-02-02 Thread mike
[EMAIL PROTECTED] ~]# mcelog --k8 --ascii  wrote:
> Yeah, I looked at the memory. It's got a PQI sticker (at least one set)
>
> It's SUPPOSED to be "(Samsung, Micron, Elpida, Infineon, Hynix OEM)" -
> which would align it basically with what Supermicro suggests:
>
> http://supermicro.com/Aplus/support/resources/memory/?sz=1.0&mspd=0.4&mtyp=9&id=51EF70624CA791283EC434A52DA0D4E2
>
> Anyway, I called Supermicro. I'm going to order their
> recommended/proper heatsink, air shroud, and then also call up the
> vendor I got the RAM from and tell them they did not deliver the
> proper stuff. They'll put up a fight, because they don't do good
> business - so tomorrow looks to be fun.
>
> Hopefully between those two any cooling and any RAM issues will be out
> of the equation.
>
> On 2/1/06, Paul Brook <[EMAIL PROTECTED]> wrote:
> > On Wednesday 01 February 2006 16:47, mike wrote:
> > > After running memtest86 (V3.3) for at least 24 hours, I came back and
> > > saw that each machine completed 61-63 cycles of tests, with 0
> > > errors...
> > >
> > > However, I did look through the BIOS for cache disabling - and it
> > > doesn't appear I can disable the CPU cache.
> > >
> > > I did turn on chipkill and some other supposed ECC memory "helpers"
> > > and instantly had the machine crash twice.
> > >
> > > [EMAIL PROTECTED] ~]# mcelog --k8 --ascii
> > > CPU 0 4 northbridge TSC 2
> > >   Northbridge Chipkill ECC error
> > >   Chipkill ECC syndrome = 6ca0
> > >bit32 = err cpu0
> > >bit45 = uncorrected ecc error
> > >bit57 = processor context corrupt
> > >bit61 = error uncorrected
> > >   bus error 'local node origin, request didn't time out
> > >   generic read mem transaction
> > >   memory access, level generic'
> > > STATUS b65020016c080813 MCGSTATUS 4
> > > 332ff8453 ADDR 7ff5faf0
> > > Kernel panic - not syncing: Machine check
> >
> > I had something similar, and it turned out the motherboard just didn't like
> > the brand/model of memory I was using. Replacing it with a different make
> > (this time one that was on the motherboard's recommended list) fixed the
> > problem.
> >
> > Paul
> >
>



Re: Please help - kernel crashes often

2006-02-01 Thread mike
Yeah, I looked at the memory. It's got a PQI sticker (at least one set)

It's SUPPOSED to be "(Samsung, Micron, Elpida, Infineon, Hynix OEM)" -
which would align it basically with what Supermicro suggests:

http://supermicro.com/Aplus/support/resources/memory/?sz=1.0&mspd=0.4&mtyp=9&id=51EF70624CA791283EC434A52DA0D4E2

Anyway, I called Supermicro. I'm going to order their
recommended/proper heatsink, air shroud, and then also call up the
vendor I got the RAM from and tell them they did not deliver the
proper stuff. They'll put up a fight, because they don't do good
business - so tomorrow looks to be fun.

Hopefully between those two any cooling and any RAM issues will be out
of the equation.

On 2/1/06, Paul Brook <[EMAIL PROTECTED]> wrote:
> On Wednesday 01 February 2006 16:47, mike wrote:
> > After running memtest86 (V3.3) for at least 24 hours, I came back and
> > saw that each machine completed 61-63 cycles of tests, with 0
> > errors...
> >
> > However, I did look through the BIOS for cache disabling - and it
> > doesn't appear I can disable the CPU cache.
> >
> > I did turn on chipkill and some other supposed ECC memory "helpers"
> > and instantly had the machine crash twice.
> >
> > [EMAIL PROTECTED] ~]# mcelog --k8 --ascii
> > CPU 0 4 northbridge TSC 2
> >   Northbridge Chipkill ECC error
> >   Chipkill ECC syndrome = 6ca0
> >bit32 = err cpu0
> >bit45 = uncorrected ecc error
> >bit57 = processor context corrupt
> >bit61 = error uncorrected
> >   bus error 'local node origin, request didn't time out
> >   generic read mem transaction
> >   memory access, level generic'
> > STATUS b65020016c080813 MCGSTATUS 4
> > 332ff8453 ADDR 7ff5faf0
> > Kernel panic - not syncing: Machine check
>
> I had something similar, and it turned out the motherboard just didn't like
> the brand/model of memory I was using. Replacing it with a different make
> (this time one that was on the motherboard's recommended list) fixed the
> problem.
>
> Paul
>



Re: Please help - kernel crashes often

2006-02-01 Thread Paul Brook
On Wednesday 01 February 2006 16:47, mike wrote:
> After running memtest86 (V3.3) for at least 24 hours, I came back and
> saw that each machine completed 61-63 cycles of tests, with 0
> errors...
>
> However, I did look through the BIOS for cache disabling - and it
> doesn't appear I can disable the CPU cache.
>
> I did turn on chipkill and some other supposed ECC memory "helpers"
> and instantly had the machine crash twice.
>
> [EMAIL PROTECTED] ~]# mcelog --k8 --ascii
> CPU 0 4 northbridge TSC 2
>   Northbridge Chipkill ECC error
>   Chipkill ECC syndrome = 6ca0
>bit32 = err cpu0
>bit45 = uncorrected ecc error
>bit57 = processor context corrupt
>bit61 = error uncorrected
>   bus error 'local node origin, request didn't time out
>   generic read mem transaction
>   memory access, level generic'
> STATUS b65020016c080813 MCGSTATUS 4
> 332ff8453 ADDR 7ff5faf0
> Kernel panic - not syncing: Machine check

I had something similar, and it turned out the motherboard just didn't like 
the brand/model of memory I was using. Replacing it with a different make 
(this time one that was on the motherboard's recommended list) fixed the 
problem.

Paul


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]



Re: Please help - kernel crashes often

2006-02-01 Thread Lennart Sorensen
On Wed, Feb 01, 2006 at 08:47:30AM -0800, mike wrote:
> After running memtest86 (V3.3) for at least 24 hours, I came back and
> saw that each machine completed 61-63 cycles of tests, with 0
> errors...
> 
> However, I did look through the BIOS for cache disabling - and it
> doesn't appear I can disable the CPU cache.
> 
> I did turn on chipkill and some other supposed ECC memory "helpers"
> and instantly had the machine crash twice.
> 
> [EMAIL PROTECTED] ~]# mcelog --k8 --ascii
> CPU 0 4 northbridge TSC 2
>   Northbridge Chipkill ECC error
>   Chipkill ECC syndrome = 6ca0
>bit32 = err cpu0
>bit45 = uncorrected ecc error
>bit57 = processor context corrupt
>bit61 = error uncorrected
>   bus error 'local node origin, request didn't time out
>   generic read mem transaction
>   memory access, level generic'
> STATUS b65020016c080813 MCGSTATUS 4
> 332ff8453 ADDR 7ff5faf0
> Kernel panic - not syncing: Machine check

Doing a quick google search on "Northbridge Chipkill ECC error" found
this interesting thread:
http://lkml.org/lkml/2006/1/12/385

It certainly appears to point at bad memory, or potentially an
overheating problem.  The chipkill feature warns when ECC finds an
error and has to correct it in memory.  In your case it even seems to be
saying that it found a memory error that it couldn't correct.
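
For anyone who wants to check the flag lines against the raw STATUS
value, here is a minimal sketch in Python. The bit numbers and names are
copied from the two mcelog outputs in this thread, not from the AMD
documentation, so treat the table as an assumption; the meaning of a bit
may also differ between banks. The function name decode_status is made
up for this example.

```python
# Flag bits and their names as they appear in the mcelog output quoted in
# this thread. This table is an assumption taken from that output only.
STATUS_FLAGS = {
    32: "err cpu0",
    45: "uncorrected ecc error",
    57: "processor context corrupt",
    61: "error uncorrected",
    62: "error overflow (multiple errors)",
}


def decode_status(status):
    """Return 'bitN = name' for every known flag bit set in status."""
    return [
        "bit%d = %s" % (bit, name)
        for bit, name in sorted(STATUS_FLAGS.items())
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # STATUS value from the northbridge error above.
    for line in decode_status(0xB65020016C080813):
        print(line)
```

Run on b65020016c080813 this lists bits 32, 45, 57 and 61, which matches
the four flag lines mcelog printed; bit 62 (overflow) is not set in that
value.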

You might have to try and run with half the memory for a number of hours
to see if it fails, given it could be heat related.  Of course memtest
doesn't really stress the cpu or disk or video, so you get a lot less
heat created while running memtest.  Kernel compiles are good though. :)

Len Sorensen





Re: Please help - kernel crashes often

2006-02-01 Thread mike
After running memtest86 (V3.3) for at least 24 hours, I came back and
saw that each machine completed 61-63 cycles of tests, with 0
errors...

However, I did look through the BIOS for cache disabling - and it
doesn't appear I can disable the CPU cache.

I did turn on chipkill and some other supposed ECC memory "helpers"
and instantly had the machine crash twice.

[EMAIL PROTECTED] ~]# mcelog --k8 --ascii 

Re: Please help - kernel crashes often

2006-02-01 Thread Lennart Sorensen
On Tue, Jan 31, 2006 at 08:09:16PM -0800, mike wrote:
> Do you mean disable the CPU cache (L2?) or disable the ECC memory cache?

I meant L2 CPU cache.

> I haven't looked at the BIOS yet to see which options I have, but I
> want to make sure I do the right one :)
> 
> Wouldn't disabling some caches wind up cutting the performance down,
> especially if you mean the CPU cache? Having the dual 1M L2 caches is
> supposed to be really good, disabling those would probably be a big
> performance hit then wouldn't it?

It will seriously hurt performance.  I don't think it is a solution,
just a debugging step to determine where the problem is.  After all, if
it runs without errors with the cache turned off and has errors with the
cache turned on, it makes you suspect the CPU cache.  If the system has
more than one CPU you could always remove one and see if the system runs
fine then, and then try with the other one.  Most likely only one of
the CPUs would be bad.  Or you could take out half the RAM, then try
with the other half, to see if you have a bad stick of memory.

> Thanks for the feedback though. All of this is helping me.

Well the ECC errors certainly point to either a CPU cache problem or a
memory problem.  Problem is just finding out which one it is.  Of course
it could also be a motherboard problem, but then I would expect it to be
more frequent.

Len Sorensen





Re: Please help - kernel crashes often

2006-01-31 Thread mike
Do you mean disable the CPU cache (L2?) or disable the ECC memory cache?

I haven't looked at the BIOS yet to see which options I have, but I
want to make sure I do the right one :)

Wouldn't disabling some caches wind up cutting the performance down,
especially if you mean the CPU cache? Having the dual 1M L2 caches is
supposed to be really good, disabling those would probably be a big
performance hit then wouldn't it?

Thanks for the feedback though. All of this is helping me.

On 1/31/06, Lennart Sorensen <[EMAIL PROTECTED]> wrote:
> It seems to say _data_cache_ ecc error, not ram ecc error.  Sounds like
> a cpu has a defective cache.  Well unless that is how the cpu reports
> when a transfer between cache and ram had an ecc failure.  Need an
> expert on the k8 to answer that I think. :)
>
> 1U machines do tend to run hot, due to limited space for cooling and
> packing everything in so tight.  A cpu that is marginal is much more
> likely to fail in such conditions.  If the bios has an option to disable
> the cache, maybe you could see if that makes it stable.  If it does,
> then you pretty much know where the broken hardware is.



Re: Please help - kernel crashes often

2006-01-31 Thread Lennart Sorensen
On Tue, Jan 31, 2006 at 01:45:01AM -0800, mike wrote:
> Yes, I was able to go down and get on the console, record it, and
> found a thread on how to decipher it.
> 
> 
> The MCE was:
> 
> CPU 0: Machine Check Exception: 4 Bank 0: f60da833
> TSC 23fd7acec1e ADDR 797db2c0
> Kernel panic - not syncing: Machine check
> 
> 
> the output from "mcelog" was:
> 
> web03:~# mcelog --k8 --ascii
> CPU 0 0 data cache TSC 23fd7acec1e
>   Data cache ECC error (syndrome 1b)
>bit45 = uncorrected ecc error
>bit57 = processor context corrupt
>bit61 = error uncorrected
>bit62 = error overflow (multiple errors)
>   bus error 'local node origin, request didn't time out
>   data read mem transaction
>   memory access, level generic'
> STATUS f60da833 MCGSTATUS 4
> Kernel panic - not syncing: Machine check
> 
> 
> I've been running memtest86 V3.3 (if I recall the exact title) on all
> the machines starting earlier today and will be looking at them in the
> next day or two to figure out what they say.
> 
> One thing that disturbs me is that it shows ECC: no in memtest, even
> when I force-enable it - and the RAM is most definitely ECC...

It seems to say _data_cache_ ecc error, not ram ecc error.  Sounds like
a cpu has a defective cache.  Well unless that is how the cpu reports
when a transfer between cache and ram had an ecc failure.  Need an
expert on the k8 to answer that I think. :)

1U machines do tend to run hot, due to limited space for cooling and
packing everything in so tight.  A cpu that is marginal is much more
likely to fail in such conditions.  If the bios has an option to disable
the cache, maybe you could see if that makes it stable.  If it does,
then you pretty much know where the broken hardware is.

Len Sorensen





Re: Please help - kernel crashes often

2006-01-31 Thread mike
Yes, I was able to go down and get on the console, record it, and
found a thread on how to decipher it.


The MCE was:

CPU 0: Machine Check Exception: 4 Bank 0: f60da833
TSC 23fd7acec1e ADDR 797db2c0
Kernel panic - not syncing: Machine check


the output from "mcelog" was:

web03:~# mcelog --k8 --ascii  wrote:
> ECC failures will generate MCE's. The MCE message *should* provide some
> hint as to what is wrong.



Re: Please help - kernel crashes often

2006-01-30 Thread Anthony DeRobertis
Harald Dunkel wrote:
> Hi Mike,
> 
> A machine check exception indicates a hardware problem, i.e.
> a broken CPU. (I am not sure whether it could indicate
> bad ECC memory, too. Did you run memtest86?)

ECC failures will generate MCE's. The MCE message *should* provide some
hint as to what is wrong.





Re: Please help - kernel crashes often

2006-01-30 Thread Anthony DeRobertis
mike wrote:
> Sure enough, one of the two others I tested failed within only 10-15 mins.
> 
> Does this look appropriate for a dual-core, single chip, Opteron 175
> in a 1u chassis? The max any CPU gets is 62C...

That's pretty damn hot. Something closer to 40C is more normal...





Re: Please help - kernel crashes often

2006-01-29 Thread Christoph Fassbach

I would suppose it is a hardware issue, too.

Have you tried memtest86 ( http://www.memtest.org/ ) ?

Let this check run for about 20 hours. On my router, memtest only
reported a problem after 12 runs of the complete test suite, whereas
unzipping one particular file triggered it immediately.

According to
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/30417.pdf
your Opteron is specified for no more than 65°C at the case (processor
package) and 70°C as measured by the thermal sensor diode.
But I would check the fan, heatsink and thermal compound nevertheless.
62°C might be a problem. My San Diego AMD64 4000+ reaches about 40°C
when I am compiling a kernel.
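
To put numbers on that margin, a rough sketch in Python. It assumes the
62°C IPMI reading can be compared directly with the two limits quoted
above; which sensor the IPMI card actually reads is not stated in this
thread, so both cases are shown. The names are made up for this example.

```python
# Limits as quoted above from the AMD document; the reading is the IPMI
# value reported earlier in the thread.
CASE_LIMIT_C = 65.0
DIODE_LIMIT_C = 70.0
READING_C = 62.0


def headroom(reading_c, limit_c):
    """Degrees Celsius left before the limit is reached."""
    return limit_c - reading_c


if __name__ == "__main__":
    print("vs. case limit: ", headroom(READING_C, CASE_LIMIT_C))   # 3.0
    print("vs. diode limit:", headroom(READING_C, DIODE_LIMIT_C))  # 8.0
```

Either way that is only a few degrees of margin under a fairly light
load.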

Greets,

Christoph





Re: Please help - kernel crashes often

2006-01-29 Thread mike
Sure enough, one of the two others I tested failed within only 10-15 mins.

Does this look appropriate for a dual-core, single chip, Opteron 175
in a 1u chassis? The max any CPU gets is 62C...

ipmi>sensors
CPU Temparture (C)     =62.00     Min=0.00     Max=75.00     ok
System Temparture (C)  =25.00     Min=0.00     Max=75.00     ok
DIMM Voltage (V)       =2.70      Min=2.30     Max=2.82      ok
CPU Core Voltage (V)   =1.41      Min=1.04     Max=1.71      ok
3.3V Voltage (V)       =3.34      Min=2.90     Max=3.57      ok
3.3VSB Voltage (V)     =3.33      Min=2.90     Max=3.57      ok
5V Voltage (V)         =5.07      Min=4.49     Max=5.48      ok
12V Voltage (V)        =11.84     Min=10.89    Max=13.29     ok
-12V Voltage (V)       =-12.11    Min=-13.19   Max=-10.81    ok
Battery Voltage (V)    =3.34      Min=3.04     Max=3.72      ok
FAN 1 Fan (RPM)        =6750.00   Min=810.00   Max=17145.00  ok
FAN 5 Fan (RPM)        =4860.00   Min=810.00   Max=17145.00  ok
Power Supply           = ok
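
In case it helps anyone watching several machines, a small Python sketch
for reading lines in this format and reporting how close each value is
to its Min/Max threshold. The line layout is assumed from the output
above only; parse_sensor and the "margin" field are names made up for
this example, not part of any IPMI tool.

```python
import re

# Matches lines such as "CPU Temparture (C)   =62.00 Min=0.00  Max=75.00ok".
# Whitespace around the fields is optional because the pasted output lost
# some of its column spacing.
NUM = r"-?\d+(?:\.\d+)?"
SENSOR_RE = re.compile(
    r"^(?P<name>.+?)\s*=\s*(?P<value>" + NUM + r")\s*"
    r"Min\s*=\s*(?P<min>" + NUM + r")\s*"
    r"Max\s*=\s*(?P<max>" + NUM + r")\s*(?P<state>\w+)\s*$"
)


def parse_sensor(line):
    """Parse one sensor line; return a dict, or None if it has no Min/Max."""
    m = SENSOR_RE.match(line.strip())
    if m is None:
        return None
    value = float(m.group("value"))
    low = float(m.group("min"))
    high = float(m.group("max"))
    return {
        "name": m.group("name"),
        "value": value,
        "min": low,
        "max": high,
        "state": m.group("state"),
        # Distance to the nearer of the two thresholds.
        "margin": min(value - low, high - value),
    }


if __name__ == "__main__":
    cpu = parse_sensor("CPU Temparture (C)   =62.00 Min=0.00  Max=75.00ok")
    print(cpu["name"], cpu["value"], "margin", cpu["margin"])
```

Lines without Min/Max fields, such as the Power Supply line, return None.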

Thanks!



Re: Please help - kernel crashes often

2006-01-29 Thread mike
Thanks for the quick reply.

I have an IPMI card installed and can monitor the CPU temperatures -
it took a LOT to get this up to 70C and set off the threshold alarm.

An idle machine doing a make -j2 shouldn't be brought to its knees, or
made to overheat. Especially on two separate servers... however I'm
open to try anything at this point. Especially since it is on two
servers, I'll try running it on two others and see if those two fail
as well.


On 1/29/06, Harald Dunkel <[EMAIL PROTECTED]> wrote:
> Hi Mike,
>
> A machine check exception indicates a hardware problem, i.e.
> a broken CPU. (I am not sure whether it could indicate
> bad ECC memory, too. Did you run memtest86?)
>
> Since you get the problem on heavy load I would suggest
> checking the CPU fan. If the CPU is overheated then this
> is the kind of problem I would expect.
>
>
> Good luck
>
> Harri
>
>
>



Re: Please help - kernel crashes often

2006-01-29 Thread Harald Dunkel
Hi Mike,

A machine check exception indicates a hardware problem, i.e.
a broken CPU. (I am not sure whether it could indicate
bad ECC memory, too. Did you run memtest86?)

Since you get the problem on heavy load I would suggest
checking the CPU fan. If the CPU is overheated then this
is the kind of problem I would expect.


Good luck

Harri




Re: Please help - kernel crashes often

2006-01-29 Thread mike
I forgot, I should have included dmesg... and lspci I suppose.

Bootdata ok (command line is root=/dev/sda2 ro)
Linux version 2.6.14.3-mike ([EMAIL PROTECTED]) (gcc version 4.0.3 2005
(prerelease) (Debian 4.0.2-4)) #3 SMP Fri Dec 2 05:39:39 PST 2005
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000e - 0010 (reserved)
 BIOS-e820: 0010 - 7fff (usable)
 BIOS-e820: 7fff - 7fffe000 (ACPI data)
 BIOS-e820: 7fffe000 - 8000 (ACPI NVS)
 BIOS-e820: fec0 - fec03000 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: ff78 - 0001 (reserved)
ACPI: RSDP (v000 ACPIAM) @ 0x000f7140
ACPI: RSDT (v001 A M I  OEMRSDT  0x11000517 MSFT 0x0097) @
0x7fff
ACPI: FADT (v002 A M I  OEMFACP  0x11000517 MSFT 0x0097) @
0x7fff0200
ACPI: MADT (v001 A M I  OEMAPIC  0x11000517 MSFT 0x0097) @
0x7fff0390
ACPI: OEMB (v001 A M I  AMI_OEM  0x11000517 MSFT 0x0097) @
0x7fffe040
ACPI: DSDT (v001  0ABSW 0ABSW005 0x0005 INTL 0x02002026) @
0x
Scanning NUMA topology in Northbridge 24
Number of nodes 1
Node 0 MemBase  Limit 7fff
Using 20 for the hash shift. Max adder is 7fff
Using node hash shift of 20
Bootmem setup node 0 -7fff
On node 0 totalpages: 524175
  DMA zone: 3999 pages, LIFO batch:1
  Normal zone: 520176 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:1
ACPI: PM-Timer IO Port: 0x508
ACPI: Local APIC address 0xfee0
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 16
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 16
ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
ACPI: IOAPIC (id[0x02] address[0xfec0] gsi_base[0])
IOAPIC[0]: apic_id 2, version 17, address 0xfec0, GSI 0-15
ACPI: IOAPIC (id[0x03] address[0xfec01000] gsi_base[16])
IOAPIC[1]: apic_id 3, version 17, address 0xfec01000, GSI 16-31
ACPI: IOAPIC (id[0x04] address[0xfec02000] gsi_base[32])
IOAPIC[2]: apic_id 4, version 17, address 0xfec02000, GSI 32-47
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: IRQ0 used by override.
ACPI: IRQ2 used by override.
ACPI: IRQ9 used by override.
Setting APIC routing to flat
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 8800 (gap: 8000:7ec0)
Checking aperture...
CPU 0: aperture @ 828000 size 32 MB
Aperture from northbridge cpu 0 too small (32 MB)
No AGP bridge found
Built 1 zonelists
Kernel command line: root=/dev/sda2 ro
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 3.579545 MHz PM timer.
time.c: Detected 2194.608 MHz processor.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Memory: 2057504k/2097088k available (2567k kernel code, 39196k
reserved, 899k data, 204k init)
Calibrating delay using timer specific routine.. 4395.97 BogoMIPS (lpj=8791943)
Mount-cache hash table entries: 256
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 0(2) -> Node 0 -> Core 0
mtrr: v2.0 (20020519)
Using local APIC timer interrupts.
Detected 12.469 MHz APIC timer.
Booting processor 1/2 APIC 0x1
Initializing CPU#1
Calibrating delay using timer specific routine.. 4389.33 BogoMIPS (lpj=8778663)
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 1024K (64 bytes/line)
CPU 1(2) -> Node 0 -> Core 1
Dual Core AMD Opteron(tm) Processor 175 stepping 02
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff 0 cycles, maxerr 540 cycles)
Brought up 2 CPUs
Disabling vsyscall due to use of PM timer
time.c: Using PM based timekeeping.
testing NMI watchdog ... OK.
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: Using configuration type 1
ACPI: Subsystem revision 20050902
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (:00)
PCI: Probing PCI hardware (bus 00)
PCI: Ignoring BAR0-3 of IDE controller :00:02.1
Boot video device is :00:05.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1._PRT]
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0.P0P1.P1P2._PRT]
ACPI: PCI Interrupt Link [LN00] (IRQs 3 4 5 7 9 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LN01] (IRQs 1 3 4 5 6 7 9 10 11 12 14 15)
*0, disabled.
ACPI: PCI Interrupt Link [LN02] (IRQs 1 3 4 5 6 7 9 10 11 12 14 15)
*0, disabled.
ACPI: PCI Interrupt Link [LN03] (IRQs 1 3 4 5 6 7 9 *10 11 1

Please help - kernel crashes often

2006-01-29 Thread mike
I'm having a problem - any time I do a large rsync, or even compile a
kernel over and over (basically a stress test - but not really
throwing a major load on it) it eventually throws a machine check
exception (I don't have a console into it right now so I can't paste
the exact dump...)
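
For reference, the "compile over and over" test can be scripted roughly
like this. It is only a sketch in Python: stress_loop, the log file name
and the example command are made up for illustration. Each finished
pass is flushed to disk, so after a machine check panic the log still
shows how many passes completed.

```python
import os
import subprocess


def stress_loop(cmd, max_runs, log_path):
    """Run cmd up to max_runs times and return the number of passes
    completed before the first failure."""
    completed = 0
    with open(log_path, "a") as log:
        for i in range(1, max_runs + 1):
            result = subprocess.run(cmd)
            if result.returncode != 0:
                log.write("pass %d failed, exit code %d\n"
                          % (i, result.returncode))
                break
            completed = i
            log.write("pass %d ok\n" % i)
            # Force the line to disk so it survives a kernel panic.
            log.flush()
            os.fsync(log.fileno())
    return completed


# Hypothetical usage from a kernel source tree:
#   stress_loop(["sh", "-c", "make clean && make -j2"], 50, "stress.log")
```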

I'm wondering if perhaps I'm missing something obvious here - it's a
dual-core Opteron setup, which I believe requires NUMA? And perhaps
because of its special needs, I might need to do something different?

Anyway I am happy to try anything and I need to get a dump of it so I
can Google that as well and see if anything specific turns up.

Any help is appreciated! PLEASE help! I need these to work properly
ASAP :) I will happily reply with any other information or try
anything!

Hardware:
* Dual-core Opteron 175
* 2x1 gig DDR3200 ECC
* Seagate 250G SATA
* Serverworks HT1000 chipset on a Supermicro H8SSL-i motherboard
(specs: http://www.supermicro.com/Aplus/motherboard/Opteron/HT1000/H8SSL-i.cfm)

Software:
* Debian-amd64
* gcc version 4.0.3 20060115 (prerelease) (Debian 4.0.2-7)
* GNU ld version 2.16.91 20060118 Debian GNU/Linux

I am using unstable for my apt sources right now, but I don't think
this is an apt package related issue, or a library issue - but a
kernel issue (or perhaps GCC 4.0 compiling the kernel?)

I have 6 identical machines built and so far two of them have the same
issue - at first I was [hoping] it was only the specific one and might
be a hardware issue... I've tried two different versions of kernels
and both have the same problem... One was 2.6.14.3, currently I am
running 2.6.15.1, vanilla sources (no patches), config below:

CONFIG_X86_64=y
CONFIG_64BIT=y
CONFIG_X86=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_MMU=y
CONFIG_RWSEM_GENERIC_SPINLOCK=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_CMPXCHG=y
CONFIG_EARLY_PRINTK=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y

#
# Code maturity level options
#
CONFIG_EXPERIMENTAL=y
CONFIG_CLEAN_COMPILE=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32

#
# General setup
#
CONFIG_LOCALVERSION="-mike"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_SYSCTL=y
# CONFIG_AUDIT is not set
CONFIG_HOTPLUG=y
CONFIG_KOBJECT_UEVENT=y
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_CPUSETS is not set
CONFIG_INITRAMFS_SOURCE=""
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
# CONFIG_EMBEDDED is not set
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SHMEM=y
CONFIG_CC_ALIGN_FUNCTIONS=0
CONFIG_CC_ALIGN_LABELS=0
CONFIG_CC_ALIGN_LOOPS=0
CONFIG_CC_ALIGN_JUMPS=0
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0

#
# Loadable module support
#
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
CONFIG_MODULE_FORCE_UNLOAD=y
CONFIG_OBSOLETE_MODPARM=y
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y

#
# Block layer
#
CONFIG_LBD=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
# CONFIG_IOSCHED_DEADLINE is not set
# CONFIG_IOSCHED_CFQ is not set
CONFIG_DEFAULT_AS=y
# CONFIG_DEFAULT_DEADLINE is not set
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="anticipatory"

#
# Processor type and features
#
CONFIG_MK8=y
# CONFIG_MPSC is not set
# CONFIG_GENERIC_CPU is not set
CONFIG_X86_L1_CACHE_BYTES=64
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_MICROCODE=m
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_MTRR=y
CONFIG_SMP=y
# CONFIG_SCHED_SMT is not set
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_BKL=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NUMA_EMU=y
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_ARCH_DISCONTIGMEM_DEFAULT=y
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
CONFIG_DISCONTIGMEM_MANUAL=y
# CONFIG_SPARSEMEM_MANUAL is not set
CONFIG_DISCONTIGMEM=y
CONFIG_FLAT_NODE_MEM_MAP=y
CONFIG_NEED_MULTIPLE_NODES=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID=y
CONFIG_NR_CPUS=2
# CONFIG_HOTPLUG_CPU is not set
CONFIG_HPET_TIMER=y
CONFIG_X86_PM_TIMER=y
# CONFIG_HPET_EMULATE_RTC is not set
CONFIG_GART_IOMMU=y
CONFIG_SWIOTLB=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_PHYSICAL_START=0x10
# CONFIG_KEXEC is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_GENERIC_HARDIRQS=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_ISA_DMA_API=y
CONFIG_GENERIC_PENDING_IRQ=y

#
# Power management options
#
CONFIG_PM=y
# CONFIG_PM_LEGACY is not set
# CONFIG_PM_DEBUG is not set

#
# ACPI (Advanced Configuration and Power Interface) Support
#
CONFIG_ACPI=y
# CONFIG_ACPI_AC is not set
# C