Re: CPU error codes

2001-02-07 Thread Alan Cox

> Really? I thought it could be because of RAM. Here's the story:

RAM talks to the chipset, so I don't think it could (unless it confused the
chipset).

> CPU 1: Machine Check Exception: 0004
> Bank 4: b2040151<0>Kernel panic: CPU context corrupt

OK, that decodes as:
Status valid
Uncorrected Error
Error Enabled
Processor Context Corrupt

Memory Hierarchy Error
Instruction Fetch
L1 cache

More than that I can't really say. Power and heat problems can certainly
trigger MCEs. I don't know if I/O devices can influence them.
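
The decode above can be reproduced mechanically. What follows is only a
minimal sketch, not the kernel's own code. It assumes that the eight hex
digits printed after "Bank N:" carry the status flags in the top bits (in the
same order as bits 63..57 of the Intel machine-check status register) and the
MCA error code in the low 16 bits, which is the layout the decode in this
message implies. The function name and the wording of the tables are
hypothetical; the field encodings follow the memory hierarchy error format
(0000 0001 RRRR TTLL) in the Intel manuals.

```python
# Sketch only: the bit positions are an assumption that matches the decode
# given in the message above, not something taken from the kernel source.
FLAGS = {
    31: "Status valid",
    30: "Overflow",
    29: "Uncorrected error",
    28: "Error enabled",
    27: "MISC register valid",
    26: "ADDR register valid",
    25: "Processor context corrupt",
}

# Memory hierarchy error code: 0000 0001 RRRR TTLL
REQUESTS = ["generic error", "generic read", "generic write", "data read",
            "data write", "instruction fetch", "prefetch", "eviction", "snoop"]
TYPES = ["instruction", "data", "generic"]
LEVELS = ["L0", "L1", "L2", "generic level"]


def decode_bank_status(word):
    """Return a list of human-readable lines for a 32-bit status word."""
    # Flags first, highest bit first, so the order matches the message.
    lines = [name for bit, name in sorted(FLAGS.items(), reverse=True)
             if word & (1 << bit)]

    code = word & 0xFFFF
    if code & 0xFF00 == 0x0100:
        rrrr = (code >> 4) & 0xF   # request type
        tt = (code >> 2) & 0x3     # transaction type
        ll = code & 0x3            # cache level
        lines.append("Memory hierarchy error")
        lines.append(REQUESTS[rrrr] if rrrr < len(REQUESTS) else "reserved request")
        lines.append(TYPES[tt] if tt < len(TYPES) else "reserved type")
        lines.append(LEVELS[ll])
    else:
        # Other error classes (TLB, bus/interconnect, ...) are not decoded here.
        lines.append("Other error class (code 0x%04x)" % code)
    return lines


if __name__ == "__main__":
    for line in decode_bank_status(0xB2040151):
        print(line)
```

Under the same assumption, the word b2000175 from the first report in this
thread has the same four flags set and a memory hierarchy code of eviction,
data, L1.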


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: CPU error codes

2001-02-06 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:    Carlos Carvalho <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
> 
> Really? I thought it could be because of RAM. Here's the story:
> 
> The kernel is 2.2.18pre24.
> 
> I'm getting this VERY frequently (sometimes once a day, sometimes once
> a week, sometimes twice a day, on a heavily used machine):
> 
> CPU 1: Machine Check Exception: 0004
> Bank 4: b2040151<0>Kernel panic: CPU context corrupt
> 
> CPU 0: Machine Check Exception: 0004
> Bank 4: b2040151<0>Kernel panic: CPU context corrupt
> 
> CPU 0: Machine Check Exception: 0004
> Bank 4: b2040151<0>Kernel panic: CPU context corrupt
> 
> This is on an ASUS P2B-DS with two PIII 700MHz and 100MHz FSB, 1GB of
> RAM. The MCE happens with both processors (the above is just part of
> it).
> 
> I've already changed the motherboard and processors, and it continued.
> Then I changed the memory, and it continues. I also changed the
> power supply just in case, to no avail...
> 
> It happens with PC100 and PC133 memory. I increased the memory latency
> (the SPD says it's cl2, I put it 3T and 10T DRAM) but the problem
> persists.
> 
> Since I changed the main board and processor, I think the most likely
> cause is RAM. It seems the x86 can access RAM directly, so if there's
> an NMI there, what will happen?
> 

Much more likely is that your CPU is bad, or overclocked.

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: CPU error codes

2001-02-06 Thread Carlos Carvalho

Alan Cox ([EMAIL PROTECTED]) wrote on 31 January 2001 15:23:
 >> > In the Intel databook. Generally an MCE indicates hardware/power/cooling
 >> > issues
 >> 
 >> Doesn't an MCE also cover some hardware memory problems - parity/ECC
 >> issues etc?
 >
 >Parity/ECC on main memory is reported by the chipset and needs separate
 >drivers or apps to handle this

Really? I thought it could be because of RAM. Here's the story:

The kernel is 2.2.18pre24.

I'm getting this VERY frequently (sometimes once a day, sometimes once
a week, sometimes twice a day, on a heavily used machine):

CPU 1: Machine Check Exception: 0004
Bank 4: b2040151<0>Kernel panic: CPU context corrupt

CPU 0: Machine Check Exception: 0004
Bank 4: b2040151<0>Kernel panic: CPU context corrupt

CPU 0: Machine Check Exception: 0004
Bank 4: b2040151<0>Kernel panic: CPU context corrupt

This is on an ASUS P2B-DS with two PIII 700MHz and 100MHz FSB, 1GB of
RAM. The MCE happens with both processors (the above is just part of
it).

I've already changed the motherboard and processors, and it continued.
Then I changed the memory, and it continues. I also changed the
power supply just in case, to no avail...

It happens with PC100 and PC133 memory. I increased the memory latency
(the SPD says it's cl2, I put it 3T and 10T DRAM) but the problem
persists.

Since I changed the main board and processor, I think the most likely
cause is RAM. It seems the x86 can access RAM directly, so if there's
an NMI there, what will happen?

This is happening on a CRITICAL machine, so any help will be much
appreciated.



Re: CPU error codes

2001-01-31 Thread Dan Hollis

On Wed, 31 Jan 2001, James Sutherland wrote:
> On Wed, 31 Jan 2001, Alan Cox wrote:
> > Parity/ECC on main memory is reported by the chipset and needs separate
> > drivers or apps to handle this
> Yes - MCE only covers errors in the CPU's cache, IIRC? (Is there still an
> NMI on main memory parity errors, or has this changed on modern
> chipsets? Presumably ECC is handled differently, being recoverable??)

You can program the northbridge to generate NMI or not, on ECC errors.
Most chipsets still need to scrub memory after an error to reset ECC bits.

-Dan




Re: CPU error codes

2001-01-31 Thread James Sutherland

On Wed, 31 Jan 2001, Alan Cox wrote:

> > > In the Intel databook. Generally an MCE indicates hardware/power/cooling
> > > issues
> > 
> > Doesn't an MCE also cover some hardware memory problems - parity/ECC
> > issues etc?
> 
> Parity/ECC on main memory is reported by the chipset and needs separate
> drivers or apps to handle this

Yes - MCE only covers errors in the CPU's cache, IIRC? (Is there still an
NMI on main memory parity errors, or has this changed on modern
chipsets? Presumably ECC is handled differently, being recoverable??)


James.




Re: CPU error codes

2001-01-31 Thread Alan Cox

> > In the Intel databook. Generally an MCE indicates hardware/power/cooling
> > issues
> 
> Doesn't an MCE also cover some hardware memory problems - parity/ECC
> issues etc?

Parity/ECC on main memory is reported by the chipset and needs separate
drivers or apps to handle this.



Re: CPU error codes

2001-01-28 Thread H. Peter Anvin

Followup to:  <[EMAIL PROTECTED]>
By author:    James Sutherland <[EMAIL PROTECTED]>
In newsgroup: linux.dev.kernel
>
> On Thu, 25 Jan 2001, Alan Cox wrote:
> 
> > > I was wondering if someone could tell me where I can find
> > > Xeon Pentium III cpu error messages/codes
> > 
> > In the Intel databook. Generally an MCE indicates hardware/power/cooling
> > issues
> 
> Doesn't an MCE also cover some hardware memory problems - parity/ECC
> issues etc?
> 

Not main memory, but it does include cache errors.

-hpa
-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt



Re: CPU error codes

2001-01-25 Thread James Sutherland

On Thu, 25 Jan 2001, Alan Cox wrote:

> > I was wondering if someone could tell me where I can find
> > Xeon Pentium III cpu error messages/codes
> 
> In the Intel databook. Generally an MCE indicates hardware/power/cooling
> issues

Doesn't an MCE also cover some hardware memory problems - parity/ECC
issues etc?


James.




Re: CPU error codes

2001-01-24 Thread Alan Cox

> I was wondering if someone could tell me where I can find
> Xeon Pentium III cpu error messages/codes

In the Intel databook. Generally an MCE indicates hardware/power/cooling
issues.



CPU error codes

2001-01-24 Thread James Simmons


I was wondering if someone could tell me where I can find
Xeon Pentium III cpu error messages/codes

I have a machine that crashed with:
kernel: CPU 3: Machine Check Exception: 0004
kernel: Bank 1: b2000175<0>Kernel panic: CPU context




