Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread John R Pierce

William L. Maltby wrote:
> On Fri, 2008-06-20 at 15:57 -0500, Lanny Marcus wrote:
> > On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:
> >
> > > ws174 login: CPU 1: Machine Check Exception: 0005
> > > CPU 0: Machine Check Exception: 0004
> > > Bank 3: f6220002010a at 32c93500
> > > Bank 5: f2300c000e0f
> > > Kernel panic - not syncing: CPU context corrupt
> > > Bank 3: f6220002010a
> >
> > Two banks of Memory (3 and 5) have problems?
> >
> > If the RAM tests OK, suggest you swap the motherboard
>
> IIRC, you have memory interleaved? I've had problems with that, in the
> past, on ... an acer? Anyway, if so, try turning it off in the BIOS
> setup.
>
> Also, make sure you have the latest BIOS for the mainboard.

I'm pretty sure the 'banks' mentioned in that error are the CPU's
machine-check register banks, which report things like on-CPU cache
errors, and not banks of main RAM on the motherboard.

Any ECC in a MACHINE CHECK is likely cache ECC, not main memory ECC.
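
As a rough illustration of that point, the low 16 bits of each Bank value
can be read as an Intel machine-check error code. The Python sketch below
is not from the thread: the function name is made up for illustration, and
it assumes the compound error code bit layout described in Intel's
Software Developer's Manual, with bit 12 masked off as a filtering flag.
Under those assumptions, neither code names a DIMM bank.

```python
# Hypothetical helper, not part of any kernel tool: classify the low 16 bits
# of an IA32_MCi_STATUS value using Intel's compound error code layout.
LEVELS = {0: "L0", 1: "L1", 2: "L2", 3: "generic"}

def classify_mca_code(code):
    code &= 0xEFFF  # drop bit 12 (assumed to be a filtering flag, not part of the class)
    level = LEVELS[code & 0x3]
    if (code & 0xF800) == 0x0800:   # 0000 1PPT RRRR IILL
        return ("bus and interconnect", level)
    if (code & 0xFF00) == 0x0100:   # 0000 0001 RRRR TTLL
        return ("cache hierarchy", level)
    if (code & 0xFF80) == 0x0080:   # 0000 0000 1MMM CCCC
        return ("memory controller", None)
    if (code & 0xFFF0) == 0x0010:   # 0000 0000 0001 TTLL
        return ("TLB", level)
    if (code & 0xFFFC) == 0x000C:   # 0000 0000 0000 11LL
        return ("generic cache hierarchy", level)
    return ("simple or unclassified", None)

# Low 16 bits of the two Bank values quoted in this thread.
for bank, status_low in ((3, 0x010A), (5, 0x0E0F)):
    print(bank, classify_mca_code(status_low))
# 3 ('cache hierarchy', 'L2')
# 5 ('bus and interconnect', 'generic')
```

The class alone does not say which part is at fault; it only narrows down
where the CPU noticed the error.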
___
CentOS mailing list
CentOS@centos.org
http://lists.centos.org/mailman/listinfo/centos


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Lanny Marcus
On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:

> This is a brand new server, which has been tested for days with FreeBSD
> in our office, and a few days with Windows on the site of our hardware
> distributor. Now the customer wants CentOS, which we installed, but after
> a few days we get a kernel panic. Last night at 2:08 it gave the same
> kernel panic.

Have you checked to verify that the fans are spinning?

Since it is a new system, I think you should take it back to your HW
distributor and have them run cerberus (ctcs) on it, as nate
suggested.

If it takes a few days for it to get the Kernel Panic, I doubt that it
is related to the OS.

Let your HW distributor do the work of troubleshooting and replacing
whatever component(s) are faulty. They can get a CentOS Live CD and
run that on it.


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread William L. Maltby
On Fri, 2008-06-20 at 15:57 -0500, Lanny Marcus wrote:
> On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:
> 
> >

> > ws174 login: CPU 1: Machine Check Exception: 0005
> > CPU 0: Machine Check Exception: 0004
> > Bank 3: f6220002010a at 32c93500
> > Bank 5: f2300c000e0f
> > Kernel panic - not syncing: CPU context corrupt
> > Bank 3: f6220002010a
> 
> Two banks of Memory (3 and 5) have problems?
> 
> If the RAM tests OK, suggest you swap the motherboard

IIRC, you have memory interleaved? I've had problems with that, in the
past, on ... an acer? Anyway, if so, try turning it off in the BIOS
setup.

Also, make sure you have the latest BIOS for the mainboard.

HTH
-- 
Bill



Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Lanny Marcus
On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:

> This is a brand new server, which has been tested for days with FreeBSD
> in our office, and a few days with Windows on the site of our hardware
> distributor. Now the customer wants CentOS, which we installed, but after
> a few days we get a kernel panic. Last night at 2:08 it gave the same
> kernel panic.

Those first few days under FreeBSD and Windows may have served as a
burn-in period, and something in the HW has now failed or is failing.
Or possibly CentOS is exercising the HW harder than the other two
OSes did?

I would suggest that you get a Knoppix Live CD, or, preferably, a
CentOS Live CD, and let it roll.

And you get the Kernel Panic only after a few days on CentOS. That
might indicate a memory problem? Or a cooling problem?

> ws174 login: CPU 1: Machine Check Exception: 0005
> CPU 0: Machine Check Exception: 0004
> Bank 3: f6220002010a at 32c93500
> Bank 5: f2300c000e0f
> Kernel panic - not syncing: CPU context corrupt
> Bank 3: f6220002010a

Two banks of Memory (3 and 5) have problems?

If the RAM tests OK, suggest you swap the motherboard


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread nate
Richard Karhuse wrote:

> Dag's Repo has the new memtest86+ 2.01 RPM.  I'd pull it and
> let it run overnight.  While memtest86+ is good, I've recently had
> cases where it didn't find (obvious) memory errors.

My favorite test is cerberus (ctcs). Quite a few OEMs out there
use it to burn in their systems. For me it can typically find a problem
within a few hours, whereas I've let memtest run for a week without it
finding anything useful.

The results of cerberus sometimes won't help you pinpoint the
problem, though (often the result is just a machine crash). But at least you
know there is an issue and can start swapping hardware until it's
fixed (or just replace the whole system).

http://sourceforge.net/projects/va-ctcs/

nate




Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Richard Karhuse
On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
>
> CentOS release 5 (Final)
> Kernel 2.6.18-53.1.21.el5 on an i686
>
> ws174 login: CPU 1: Machine Check Exception: 0005
> CPU 0: Machine Check Exception: 0004
> Bank 3: f6220002010a at 32c93500
> Bank 5: f2300c000e0f
> Kernel panic - not syncing: CPU context corrupt
> Bank 3: f6220002010a
>
>
>
Alwin -->

I would be very, very "surprised" *IF* this wasn't hardware
related.

Dave Jones wrote a nice little program to help decode this:

$ parsemce -b 3 -s f6220002010a -e 5 -a 32c93500
Status: (5) Machine Check in progress.
Restart IP valid.
parsebank(3): f6220002010a @ 32c93500
External tag parity error
CPU state corrupt. Restart not possible
Address in addr register valid
Error enabled in control register
Error not corrected.
Error overflow
Memory hierarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O

and:

$ parsemce -b 5 -s f2300c000e0f -e 4 -a 0
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(5): f2300c000e0f @ 0
External tag parity error
CPU state corrupt. Restart not possible
Error enabled in control register
Error not corrected.
Error overflow
Bus and interconnect error
Participation: Generic
Timeout: Request did not timeout
Request: Generic error
Transaction type : Invalid
Memory/IO : Other
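
The two invocations above follow a fixed pattern, so they can be generated
from pasted console text. A minimal Python sketch, offered as an
illustration only: the function name and the bank-to-exception mapping
argument are hypothetical, and the only parsemce flags used (-b, -s, -e,
-a) are the ones shown above. Which "Machine Check Exception" value goes
with which bank is left to the caller, since the console output from the
two CPUs is interleaved.

```python
import re

# Matches lines such as "Bank 3: f6220002010a at 32c93500" or "Bank 5: f2300c000e0f".
BANK_RE = re.compile(r"Bank (\d+): ([0-9a-f]+)(?: at ([0-9a-f]+))?")

def parsemce_commands(console_text, exception_by_bank):
    """Build one parsemce command line per distinct Bank line.

    exception_by_bank is a caller-supplied mapping (hypothetical parameter)
    from bank number to the value to pass with -e.
    """
    seen = set()
    commands = []
    for bank, status, addr in BANK_RE.findall(console_text):
        if (bank, status) in seen:
            continue  # the console repeats "Bank 3", so skip duplicates
        seen.add((bank, status))
        commands.append(
            f"parsemce -b {bank} -s {status} "
            f"-e {exception_by_bank[bank]} -a {addr or '0'}"
        )
    return commands

console = """\
ws174 login: CPU 1: Machine Check Exception: 0005
CPU 0: Machine Check Exception: 0004
Bank 3: f6220002010a at 32c93500
Bank 5: f2300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f6220002010a
"""

for line in parsemce_commands(console, {"3": "5", "5": "4"}):
    print(line)
```

With the mapping shown, this prints the same two command lines used above.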


Dag's Repo has the new memtest86+ 2.01 RPM.  I'd pull it and
let it run overnight.  While memtest86+ is good, I've recently had
cases where it didn't find (obvious) memory errors.

I've also seen things like SATA disk drives cause MCEs.

This one looks like you're taking memory parity errors somewhere
in the path to the CPU.  In your BIOS, check your event log for
any "interesting" entries, too.
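
For anyone without parsemce handy, the generic flag lines in its output can
be cross-checked by hand. The Python sketch below is an illustration, not
a replacement for the tool. It assumes the standard Intel bit positions for
the per-bank status register (bits 63 down to 57) and for the global status
value printed after "Machine Check Exception:". It also assumes, and this
is only a guess that happens to agree with the parsemce output, that each
Bank value as posted is the high 32-bit word followed by the low word with
its leading zeros dropped. All names in the code are made up for this
example.

```python
# Per-bank status flags (IA32_MCi_STATUS), highest bit first.
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # error overflow
    (61, "UC"),     # error not corrected
    (60, "EN"),     # error enabled in control register
    (59, "MISCV"),  # MISC register valid
    (58, "ADDRV"),  # address register valid
    (57, "PCC"),    # processor context corrupt, restart not possible
]

# Global status flags (IA32_MCG_STATUS).
MCG_FLAGS = [
    (0, "RIPV"),    # restart IP valid
    (1, "EIPV"),    # error IP valid
    (2, "MCIP"),    # machine check in progress
]

def status_from_console(text):
    # ASSUMPTION: first 8 hex digits are the high word, the rest is the
    # low word printed without leading zeros.
    high, low = text[:8], text[8:] or "0"
    return (int(high, 16) << 32) | int(low, 16)

def flag_names(value, table):
    return [name for bit, name in table if (value >> bit) & 1]

print(flag_names(status_from_console("f6220002010a"), STATUS_FLAGS))
print(flag_names(status_from_console("f2300c000e0f"), STATUS_FLAGS))
print(flag_names(0x5, MCG_FLAGS), flag_names(0x4, MCG_FLAGS))
```

Both banks come out with UC and PCC set, which lines up with "Error not
corrected" and "CPU state corrupt" above; only Bank 3 has ADDRV set.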

Hope this helps ...

   -rak-


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Lanny Marcus
On 6/20/08, nate <[EMAIL PROTECTED]> wrote:

> Easiest is to buy from a vendor that can test on your OS of choice,
> there are lots of vendors out there that can do it.
>
> Two such companies I have bought from that do this include
> http://www.siliconmechanics.com/ (HQ in Seattle, WA area)
> http://www.asaservers.com/ (HQ in San Francisco, CA area)
>
> Both specialize in Supermicro/Tyan-based systems (as do most other
> "whitebox" vendors).

That, IMHO, is the best way to go. Another way, if the HW is
available, is to test it with a Live CD for CentOS, before purchasing,
to see if CentOS will run properly on the HW.


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread nate
Michael wrote:

> But I don't want to get into the situation above, where I purchase NEW
> hardware, and CentOS doesn't like it, and furthermore the resolution is
> elusive.
>
> What is the best HW environment for CentOS?
> Brand, MFG, chipset rev, and so on

Easiest is to buy from a vendor that can test on your OS of choice,
there are lots of vendors out there that can do it.

Two such companies I have bought from that do this include
http://www.siliconmechanics.com/ (HQ in Seattle, WA area)
http://www.asaservers.com/ (HQ in San Francisco, CA area)

Both specialize in Supermicro/Tyan-based systems (as do most other
"whitebox" vendors).

nate



Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Michael

Lanny Marcus wrote:
> On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:
>
> > CentOS release 5 (Final)
> > Kernel 2.6.18-53.1.21.el5 on an i686
> >
> > ws174 login: CPU 1: Machine Check Exception: 0005
> > CPU 0: Machine Check Exception: 0004
> > Bank 3: f6220002010a at 32c93500
> > Bank 5: f2300c000e0f
> > Kernel panic - not syncing: CPU context corrupt
> > Bank 3: f6220002010a
>
> Phil or someone else: Do the three (3) "Bank" lines above indicate RAM
> problems?  If not, what do they refer to? Alwin wrote that this is
> brand new HW, so he suspects that it is OK, but it doesn't seem to be
> OK? Lanny

I have the same issue, unresolved. However, I am using old desktop
hardware (a Compaq Presario and an HP something or other). Maybe it is
memory, or CPU, or some kind of incompatibility with something. I was
just making a list of the hardware that should be purchased to run a
low-end SME server using CentOS.


Rack-mountable case, with power supply and fans included.
Motherboard, mid-range processor.
2 GB RAM
USB drive, 1 TB
Two 500 GB or four 300 GB internal hard drives (HW RAID would be nice)
CD/DVD R/W drive
and so on...


But I don't want to get into the situation above, where I purchase NEW 
hardware, and CentOS doesn't like it, and furthermore the resolution is 
elusive.


What is the best HW environment for CentOS?
Brand, MFG, chipset rev, and so on

--
Michael Anderson,
J3k Solutions
Sr.Systems Programmer/Analyst
832.515.3868



Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Lanny Marcus
On 6/20/08, Alwin Roosen <[EMAIL PROTECTED]> wrote:

> CentOS release 5 (Final)
> Kernel 2.6.18-53.1.21.el5 on an i686
>
> ws174 login: CPU 1: Machine Check Exception: 0005
> CPU 0: Machine Check Exception: 0004
> Bank 3: f6220002010a at 32c93500
> Bank 5: f2300c000e0f
> Kernel panic - not syncing: CPU context corrupt
> Bank 3: f6220002010a
>
Phil or someone else: Do the three (3) "Bank" lines above indicate RAM
problems?  If not, what do they refer to? Alwin wrote that this is
brand new HW, so he suspects that it is OK, but it doesn't seem to be
OK? Lanny


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Walid
2008/6/20 Alwin Roosen <[EMAIL PROTECTED]>:

> Hi,
>
>
> Is there someone on this mailing list who could help me figure out
> this issue? We do not know where to look to solve this.
>
If your installation is standard CentOS with no third-party software or
custom configuration, I would first run the vendor hardware checks several
times, since a single pass is usually not good at catching intermittent or
hard-to-find problems. Also run an extensive memtest if possible.

regards

Walid


Re: [CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Phil Schaffner
On Fri, 2008-06-20 at 14:40 +0200, Alwin Roosen wrote:
> Hi,
> 
> 
> Is there someone on this mailing list who could help me figure out
> this issue? We do not know where to look to solve this.
...
> I would be very surprised if this is hardware related. 

A Google search for

"Machine Check Exception" "Kernel panic - not syncing: CPU context corrupt"

turns up 50 results (including your CentOS BZ request referring you to
this list), many of which point to hardware problems: CPU, MB (bad
caps), and chipset are all listed as possible causes.  I'd go back to the
hardware vendor if still under warranty.

Phil




[CentOS] Kernel panic - not syncing: CPU context corrupt

2008-06-20 Thread Alwin Roosen
Hi,


Is there someone on this mailing list who could help me figure out
this issue? We do not know where to look to solve this.

--- Description ---

This is a brand new server, which has been tested for days with FreeBSD
in our office, and a few days with Windows on the site of our hardware
distributor. Now the customer wants CentOS, which we installed, but after
a few days we get a kernel panic. Last night at 2:08 it gave the same
kernel panic.

Please tell me what information I should give you and, most importantly, how
to get it from the system, because we do not have experience with CentOS
(only FreeBSD).

I would be very surprised if this is hardware related. We have used the same
hardware for several years, and run FreeBSD on it very successfully. It
is a SuperMicro PDSMI+ motherboard with a 3ware RAID controller
(8006-2LP). CPU is Xeon 3040 1.8 GHz EM64T 2MB 1066FSB (65W). Memory is
DDR2 Transcend 2048MB ECC Unbuffered 800.

Error message on console is in "Additional Information".

I am hoping that I should switch off some setting in CentOS to fix this,
but I cannot find much useful information about this issue on Google.

--- Additional Information ---

CentOS release 5 (Final)
Kernel 2.6.18-53.1.21.el5 on an i686

ws174 login: CPU 1: Machine Check Exception: 0005
CPU 0: Machine Check Exception: 0004
Bank 3: f6220002010a at 32c93500
Bank 5: f2300c000e0f
Kernel panic - not syncing: CPU context corrupt
Bank 3: f6220002010a

--- Attachments ---

19-06-2008 16-03-31.png (Screenshot of console)


With kind regards,


Alwin Roosen
