Re: NMI error and Intel S5000PSL Motherboards

2007-09-28 Thread AndrewL733

[EMAIL PROTECTED] wrote:
> On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:
>
> > Hello,
> >
> > > We have about 100 servers based on Intel S5000PSL-SATA motherboards.
> > > They have been running for anywhere between 1 and 10 months. For the
> > > past few months, after updating them all to the 2.6.20.15 kernel
> > > (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> > > errors. For example:
> > >
> > > Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> > > Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
> > > Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
> >
> > I'm also working with Andrew and Samson.  It seems that the cause of
> > the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
> > defaults to y.
> >
> > With CONFIG_PCIEAER=n, scanpci works fine with no errors.  This is the
> > workaround that they'll likely use for now.
>
> Glad that you found it.
>
> > With CONFIG_PCIEAER=y, scanpci always triggers the NMI error.  The
> > option aerdriver.forceload=1 has no effect.
>
> The 'forceload' option only forces the driver to load even when the
> ACPI hardware initialization routine fails.
>
> It would be nice to be able to disable PCIEAER at boot time though.
> Shouldn't be difficult.

So, looking for some closure here, what do we think is the "root cause"?
Is it:

1)  a defect with Intel's S5000PSL motherboards that is exposed by an
otherwise fine new (since 2.6.19) Linux kernel feature? (in which case
we and others should probably press Intel to recognize they have a
problem, seeing as they only "officially support" distributions running
on 2.6.16 or below so maybe they don't even know about this issue).

2)  a problem with PCIEAER? And maybe "CONFIG_PCIEAER=y"  should NOT be
the default setting? (in which case the kernel maybe needs fixing)

3)  just a bad interaction between a good motherboard and a good Linux
feature that don't play well together? (in which case this is a kernel
"feature" that anybody compiling a kernel to run on the Intel S5000PSL
motherboard should know not to enable -- maybe a note is warranted so
that when configuring the kernel, people with S5000PSL motherboards
might not make the same mistake???).

> > The related dmesg output at boot is:
> >
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   aer_init: AER service init fails - Run ACPI _OSC fails
> >   aer: probe of :00:02.0:pcie01 failed with error 2
> >   aer_init: AER service init fails - No ACPI _OSC support
> >   aer: probe of :00:03.0:pcie01 failed with error 1
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   aer_init: AER service init fails - Run ACPI _OSC fails
> >   aer: probe of :00:04.0:pcie01 failed with error 2
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   aer_init: AER service init fails - Run ACPI _OSC fails
> >   aer: probe of :00:05.0:pcie01 failed with error 2
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   Evaluate _OSC Set fails. Status = 0x0005
> >   aer_init: AER service init fails - Run ACPI _OSC fails
> >   aer: probe of :00:06.0:pcie01 failed with error 2
> >   aer_init: AER service init fails - No ACPI _OSC support
> >   aer: probe of :00:07.0:pcie01 failed with error 1
> >
> > Full dmesg, lspci, and ACPI DSDT are available here:
> >   http://jim.sh/~jim/tmp/nmi/
> >
> > -jim
>
> ---
> ~Randy
> Phaedrus says that Quality is about caring.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NMI error and Intel S5000PSL Motherboards

2007-09-26 Thread Jim Paris
Hello,

> We have about 100 servers based on Intel S5000PSL-SATA motherboards. 
> They have been running for anywhere between 1 and 10 months. For the 
> past few months, after updating them all to the 2.6.20.15 kernel 
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI 
> errors. For example:
> 
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode 
> enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue

I'm also working with Andrew and Samson.  It seems that the cause of
the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
defaults to y.

With CONFIG_PCIEAER=n, scanpci works fine with no errors.  This is the
workaround that they'll likely use for now.

With CONFIG_PCIEAER=y, scanpci always triggers the NMI error.  The
option aerdriver.forceload=1 has no effect.
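
To confirm which way a given kernel was built, the saved build configuration can be checked for CONFIG_PCIEAER. The following is only a minimal sketch: it assumes the configuration text is available as a string (many distributions ship it as /boot/config-<version>, but that location is an assumption, not something stated in this thread), and the helper name kconfig_value is made up for illustration.

```python
import re

def kconfig_value(config_text, option):
    """Return the value of a Kconfig option ('y', 'm', or a literal).

    Returns 'n' when the option is explicitly unset or simply absent,
    which is how a disabled option appears in a kernel .config file.
    """
    set_re = re.compile(r"^%s=(.*)$" % re.escape(option))
    unset_re = re.compile(r"^# %s is not set$" % re.escape(option))
    for line in config_text.splitlines():
        line = line.strip()
        match = set_re.match(line)
        if match:
            return match.group(1)
        if unset_re.match(line):
            return "n"
    return "n"

# Hypothetical excerpt of a .config built with the workaround applied.
sample = """\
CONFIG_PCIEPORTBUS=y
# CONFIG_PCIEAER is not set
"""
print(kconfig_value(sample, "CONFIG_PCIEAER"))  # prints: n
```

On a real machine the string would come from reading the config file for the running kernel; the parsing itself does not depend on where the file lives.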

The related dmesg output at boot is:

  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:02.0:pcie01 failed with error 2
  aer_init: AER service init fails - No ACPI _OSC support
  aer: probe of :00:03.0:pcie01 failed with error 1
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:04.0:pcie01 failed with error 2
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:05.0:pcie01 failed with error 2
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:06.0:pcie01 failed with error 2
  aer_init: AER service init fails - No ACPI _OSC support
  aer: probe of :00:07.0:pcie01 failed with error 1
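
For anyone comparing many machines, the pattern in this log (an aer_init reason followed by a probe failure for one port) can be summarized mechanically. This is a rough sketch that assumes the message formats are exactly as shown above; the function name summarize_aer_failures is hypothetical.

```python
import re

PROBE_RE = re.compile(r"aer: probe of (\S+):pcie01 failed with error (\d+)")
INIT_RE = re.compile(r"aer_init: AER service init fails - (.+)")

def summarize_aer_failures(dmesg_text):
    """Map each PCIe port to (probe error code, preceding aer_init reason)."""
    results = {}
    last_reason = None
    for line in dmesg_text.splitlines():
        line = line.strip()
        match = INIT_RE.search(line)
        if match:
            # Remember the reason; the matching probe line follows it.
            last_reason = match.group(1)
            continue
        match = PROBE_RE.search(line)
        if match:
            results[match.group(1)] = (int(match.group(2)), last_reason)
            last_reason = None
    return results

sample = """\
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:02.0:pcie01 failed with error 2
  aer_init: AER service init fails - No ACPI _OSC support
  aer: probe of :00:03.0:pcie01 failed with error 1
"""
for port, (code, reason) in summarize_aer_failures(sample).items():
    print(port, code, reason)
```

Applied to the full log above, this would list ports 02.0, 04.0, 05.0 and 06.0 with error 2 and ports 03.0 and 07.0 with error 1.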

Full dmesg, lspci, and ACPI DSDT are available here:
  http://jim.sh/~jim/tmp/nmi/

-jim


Re: NMI error and Intel S5000PSL Motherboards

2007-09-26 Thread Randy Dunlap
On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:

> Hello,
> 
> > We have about 100 servers based on Intel S5000PSL-SATA motherboards. 
> > They have been running for anywhere between 1 and 10 months. For the 
> > past few months, after updating them all to the 2.6.20.15 kernel 
> > (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI 
> > errors. For example:
> > 
> > Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> > Aug 29 09:02:10 master kernel: Do you have a strange power saving mode 
> > enabled?
> > Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
> 
> I'm also working with Andrew and Samson.  It seems that the cause of
> the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
> defaults to y.
> 
> With CONFIG_PCIEAER=n, scanpci works fine with no errors.  This is the
> workaround that they'll likely use for now.

Glad that you found it.

> With CONFIG_PCIEAER=y, scanpci always triggers the NMI error.  The
> option aerdriver.forceload=1 has no effect.

The 'forceload' option only forces the driver to load even when the
ACPI hardware initialization routine fails.

It would be nice to be able to disable PCIEAER at boot time though.
Shouldn't be difficult.


> The related dmesg output at boot is:
> 
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:02.0:pcie01 failed with error 2
>   aer_init: AER service init fails - No ACPI _OSC support
>   aer: probe of :00:03.0:pcie01 failed with error 1
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:04.0:pcie01 failed with error 2
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:05.0:pcie01 failed with error 2
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:06.0:pcie01 failed with error 2
>   aer_init: AER service init fails - No ACPI _OSC support
>   aer: probe of :00:07.0:pcie01 failed with error 1
> 
> Full dmesg, lspci, and ACPI DSDT are available here:
>   http://jim.sh/~jim/tmp/nmi/
> 
> -jim


---
~Randy
Phaedrus says that Quality is about caring.


Re: [Re: NMI error and Intel S5000PSL Motherboards]

2007-09-26 Thread Randy Dunlap
On Wed, 26 Sep 2007 15:07:14 -0400 samson yeung wrote:

> Hello,
> 
> I'm working with AndrewL733 on this issue. I'm doing the git bisect right now.
> 
> scanpci -f -1 causes the problem, scanpci -f -2 and scanpci -O do not.

Does the problem always happen when scanpci is making an ioperm
syscall (as in the strace output below)?


> The driver does not even need to be loaded to have the problem
> (e1000). I have not tried the 2.6.18 driver with 2.6.20, but I have
> tried both the in-kernel driver as well as the newer driver from Intel
> with the same result.
> 
> The drive is a Seagate Barracuda 7200.9 80 Gbytes with firmware 3.AAE
> I can include hdparm -i output if it will help.
> 
> The problem is only happening on 64-bit. As noted above, I'm running
> git-bisect to test a stock kernel.org kernel. 32-bit Ubuntu does not
> exhibit the problem; I have not tested a kernel.org 32-bit kernel.
> 
-
> strace: I don't know what syscall_273 does. I trimmed the output to
> include syscall 273 and the lines surrounding it. I can include the
> entirety of the strace if it will help.

Does this include trace info all the way to the end of the trace
output file?  If not, please send that part also.


> arch_prctl(ARCH_SET_FS, 0x2aca24060f50) = 0
> mprotect(0x2aca23e3b000, 12288, PROT_READ) = 0
> munmap(0x2aca238e2000, 36649)   = 0
> set_tid_address(0x2aca24060fe0) = 10319
> syscall_273(0x2aca24060ff0, 0x18, 0x7fff87790188, 0x2aca233193c0,
> 0x2aca24060f50, 0x2aca233352b8, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1,
> 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1,
> 0x1, 0x1, 0x1, 0x1, 0x1) = 0
> rt_sigaction(SIGRTMIN, {0x2aca23e4a3a0, [], SA_RESTORER|SA_SIGINFO,
> 0x2aca23e53200}, NULL, 8) = 0
> rt_sigaction(SIGRT_1, {0x2aca23e4a2f0, [],
> SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x2aca23e53200}, NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
> ioperm(0, 0x400, 0x1)   = 0
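
On the question of syscall_273: in the x86_64 syscall table, number 273 is set_robust_list, which this strace build apparently predates. The second argument, 0x18 (24 bytes), matches the size of the robust list head on a 64-bit system, so this looks like ordinary thread-library start-up rather than anything specific to scanpci. A small sketch for relabelling such lines follows; the table covers only the calls seen in the trace above, and the helper names are made up for illustration.

```python
import re

# x86_64 syscall numbers for the calls that appear in the trace above.
X86_64_SYSCALLS = {
    10: "mprotect",
    11: "munmap",
    13: "rt_sigaction",
    14: "rt_sigprocmask",
    158: "arch_prctl",
    173: "ioperm",
    218: "set_tid_address",
    273: "set_robust_list",
}

def syscall_name(number):
    """Return the known name, or strace's own placeholder if unknown."""
    return X86_64_SYSCALLS.get(number, "syscall_%d" % number)

def relabel(strace_line):
    """Replace 'syscall_<n>(' placeholders with names where known."""
    return re.sub(
        r"syscall_(\d+)\(",
        lambda m: syscall_name(int(m.group(1))) + "(",
        strace_line,
    )

print(relabel("syscall_273(0x2aca24060ff0, 0x18, 0x7fff87790188) = 0"))
```

The long run of 0x1 arguments in the original line is most likely strace printing every register slot for a call whose argument count it does not know.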


---
~Randy
Phaedrus says that Quality is about caring.


[Re: NMI error and Intel S5000PSL Motherboards]

2007-09-26 Thread samson yeung
Hello,

I'm working with AndrewL733 on this issue. I'm doing the git bisect right now.

scanpci -f -1 causes the problem, scanpci -f -2 and scanpci -O do not.
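
For context, and assuming that scanpci's -1 option selects PCI configuration mechanism #1 (as its man page describes) while -O goes through the operating system's own interface: mechanism #1 writes an address word to I/O port 0xCF8 and then reads or writes the data at port 0xCFC. The sketch below only shows how that address word is formed; it performs no port I/O, and the function name is hypothetical.

```python
CONFIG_ADDRESS_PORT = 0xCF8  # address register for mechanism #1
CONFIG_DATA_PORT = 0xCFC     # data register for mechanism #1

def config_address(bus, device, function, register):
    """Build the 32-bit CONFIG_ADDRESS word for PCI mechanism #1.

    Layout: bit 31 enable, bits 23-16 bus, bits 15-11 device,
    bits 10-8 function, bits 7-2 dword-aligned register offset.
    """
    if not (0 <= bus < 256 and 0 <= device < 32
            and 0 <= function < 8 and 0 <= register < 256):
        raise ValueError("bus/device/function/register out of range")
    return (0x80000000 | (bus << 16) | (device << 11)
            | (function << 8) | (register & 0xFC))

# Port 00:02.0 from the lspci output below, register offset 0.
print(hex(config_address(0, 2, 0, 0)))  # prints: 0x80001000
```

Because the register field is only 8 bits wide, this mechanism reaches just the first 256 bytes of each function's configuration space.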

The systems have two 1-Gig sticks in the D1 and C1 slots of the
motherboard. I ran memtest86 overnight and got no errors. (Samsung 1GB
PC2-5300F-555-11-B0)

Neither pci=nomsi nor pci=nommconf changes the situation on Ubuntu's
custom kernel. I can try them on a stock kernel.org kernel after I
finish doing the git bisect.

The driver does not even need to be loaded to have the problem
(e1000). I have not tried the 2.6.18 driver with 2.6.20, but I have
tried both the in-kernel driver as well as the newer driver from Intel
with the same result.

The drive is a Seagate Barracuda 7200.9 80 Gbytes with firmware 3.AAE
I can include hdparm -i output if it will help.

The problem is only happening on 64-bit. As noted above, I'm running
git-bisect to test a stock kernel.org kernel. 32-bit Ubuntu does not
exhibit the problem; I have not tested a kernel.org 32-bit kernel.

Extended command output follows:

cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5160  @ 3.00GHz
stepping: 6
cpu MHz : 1998.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16
xtpr dca lahf_lm
bogomips: 5990.11
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Xeon(R) CPU5160  @ 3.00GHz
stepping: 6
cpu MHz : 1998.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 1
cpu cores   : 2
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
syscall nx lm constant_tsc pni monitor ds_cpl vmx est tm2 ssse3 cx16
xtpr dca lahf_lm
bogomips: 5984.99
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

--
lspci -v:
00:00.0 Host bridge: Intel Corporation Server Memory Controller Hub (rev b1)
Subsystem: Intel Corporation Unknown device 3476
Flags: bus master, fast devsel, latency 0
Capabilities: <access denied>

00:02.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 2-3
(rev b1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=01, subordinate=05, sec-latency=0
I/O behind bridge: 4000-4fff
Memory behind bridge: b800-b89f
Capabilities: <access denied>

00:03.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 3
(rev b1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=06, subordinate=06, sec-latency=0
Capabilities: <access denied>

00:04.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 4-5
(rev b1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=07, subordinate=07, sec-latency=0
Capabilities: <access denied>

00:05.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 5
(rev b1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=08, subordinate=08, sec-latency=0
Capabilities: <access denied>

00:06.0 PCI bridge: Intel Corporation Server PCI Express x8 Port 6-7
(rev b1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=09, subordinate=0c, sec-latency=0
I/O behind bridge: 2000-3fff
Memory behind bridge: b8b0-b8cf
Prefetchable memory behind bridge: b8e0-b8f0
Capabilities: <access denied>

00:07.0 PCI bridge: Intel Corporation Server PCI Express x4 Port 7
(rev b1) (prog-if 00 [Normal decode])
Flags: bus master, fast devsel, latency 0
Bus: primary=00, secondary=0d, subordinate=0d, sec-latency=0
Capabilities: <access denied>

00:08.0 System peripheral: Intel Corporation Server DMA Engine (rev b1)
Subsystem: Intel Corporation Unknown device 3476
Flags: bus master, fast devsel, latency 0, IRQ 1
Memory at fe70 (64-bit, non-prefetchable) [size=1K]
Capabilities: <access denied>

00:10.0 Host bridge: Intel Corporation Server Error Reporting 

Re: NMI error and Intel S5000PSL Motherboards

2007-09-26 Thread Alan Cox
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode 
> enabled?

What would be useful is to know under what situations that board can
raise NMI 30.
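
As background on what the number means: on PC-compatible hardware the kernel reads the NMI reason byte from I/O port 0x61 and, as far as I know, prints it in hexadecimal, so "reason 30" would be 0x30. The sketch below decodes that byte using the conventional PC/AT bit assignments for port 0x61; the labels are a summary from memory, and the chipset datasheet remains the authority for this particular board.

```python
# Conventional bit meanings of the NMI status/control byte at port 0x61.
NMI_REASON_BITS = [
    (0x80, "SERR# / memory parity error reported"),
    (0x40, "IOCHK# (I/O channel check) reported"),
    (0x20, "timer counter 2 output high"),
    (0x10, "refresh cycle toggle"),
    (0x08, "IOCHK# NMI reporting disabled"),
    (0x04, "SERR# NMI reporting disabled"),
    (0x02, "speaker data enabled"),
    (0x01, "timer counter 2 gate enabled"),
]

def decode_nmi_reason(reason):
    """List the labels of all bits set in the reason byte."""
    return [label for mask, label in NMI_REASON_BITS if reason & mask]

def has_known_source(reason):
    """True if either of the two NMI source status bits (7 or 6) is set."""
    return bool(reason & 0xC0)

reason = 0x30  # assumed hex reading of "unknown reason 30"
print(decode_nmi_reason(reason))
print(has_known_source(reason))  # prints: False
```

With 0x30, only the timer and refresh bits are set and neither source status bit is, which is consistent with the kernel reporting the NMI as having an unknown reason.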

> In other words, Intel seems to be blaming the problem we are seeing on 
> something introduced starting with the 2.6.19 kernel. We are not looking 
> to blame anybody. We are only looking for a solution.

The first thing to find out is which kernel introduced the behaviour.
It might also be worth disabling msi in case Intel screwed the board up
somewhat.

> Does anybody have an idea what could be going on here, as well as what 
> the solution may be? Going back to 2.6.18 or lower is not an option.

See if 2.6.20.* with the 2.6.18 driver compiles and how that behaves.
Also see if pci=nomsi and/or pci=nommconf make a difference.

Alan


Re: [Re: NMI error and Intel S5000PSL Motherboards]

2007-09-26 Thread Randy Dunlap
On Wed, 26 Sep 2007 15:07:14 -0400 samson yeung wrote:

> Hello,
>
> I'm working with AndrewL733 on this issue. I'm doing the git bisect right now.
>
> scanpci -f -1 causes the problem, scanpci -f -2 and scanpci -O do not.

Does the problem always happen when scanpci is making an ioperm
syscall (as in the strace output below)?


> The driver does not even need to be loaded to have the problem
> (e1000). I have not tried the 2.6.18 driver with 2.6.20, but I have
> tried both the in-kernel driver as well as the newer driver from Intel
> with the same result.
>
> The drive is a Seagate Barracuda 7200.9 80 Gbytes with firmware 3.AAE.
> I can include hdparm -i output if it will help.
>
> The problem is only happening on 64-bit. As noted above, I'm running
> git-bisect to test a stock kernel.org kernel. 32-bit Ubuntu does not
> exhibit the problem; I have not tested a kernel.org 32-bit kernel.
>
> --
> strace: I don't know what syscall_273 does. I trimmed the output to
> include syscall 273 and the lines surrounding it. I can include the
> entirety of the strace if it will help.

Does this include trace info all the way to the end of the trace
output file?  If not, please send that part also.


> arch_prctl(ARCH_SET_FS, 0x2aca24060f50) = 0
> mprotect(0x2aca23e3b000, 12288, PROT_READ) = 0
> munmap(0x2aca238e2000, 36649)   = 0
> set_tid_address(0x2aca24060fe0) = 10319
> syscall_273(0x2aca24060ff0, 0x18, 0x7fff87790188, 0x2aca233193c0,
> 0x2aca24060f50, 0x2aca233352b8, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1,
> 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1,
> 0x1, 0x1, 0x1, 0x1, 0x1) = 0
> rt_sigaction(SIGRTMIN, {0x2aca23e4a3a0, [], SA_RESTORER|SA_SIGINFO,
> 0x2aca23e53200}, NULL, 8) = 0
> rt_sigaction(SIGRT_1, {0x2aca23e4a2f0, [],
> SA_RESTORER|SA_RESTART|SA_SIGINFO, 0x2aca23e53200}, NULL, 8) = 0
> rt_sigprocmask(SIG_UNBLOCK, [RTMIN RT_1], NULL, 8) = 0
> ioperm(0, 0x400, 0x1)   = 0
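
A note on the syscall_273 line: on x86_64, syscall number 273 is
set_robust_list, which recent glibc versions call during thread start-up.
An strace build that predates the syscall prints the raw number instead of
the name. The first two arguments in the trace (a pointer and 0x18, i.e.
24 bytes) are consistent with that call. The Python sketch below shows one
way to flag such unnamed syscalls in a trace; the lookup table is
hand-written and covers only this one entry, and the function name is
made up for illustration.

```python
import re

# Hand-written lookup for x86_64; only the entry relevant to this thread.
X86_64_SYSCALLS = {273: "set_robust_list"}

# strace prints syscalls it cannot name as "syscall_<number>(...)".
UNNAMED = re.compile(r"^\s*syscall_(\d+)\(")

def unnamed_syscalls(trace_lines):
    """Return (line_number, syscall_number, name_or_None) per unnamed syscall."""
    hits = []
    for lineno, line in enumerate(trace_lines, start=1):
        m = UNNAMED.match(line)
        if m:
            nr = int(m.group(1))
            hits.append((lineno, nr, X86_64_SYSCALLS.get(nr)))
    return hits

# Abbreviated sample taken from the trace quoted above.
sample = [
    "set_tid_address(0x2aca24060fe0) = 10319",
    "syscall_273(0x2aca24060ff0, 0x18, 0x7fff87790188) = 0",
    "ioperm(0, 0x400, 0x1)   = 0",
]
for lineno, nr, name in unnamed_syscalls(sample):
    print(f"line {lineno}: syscall_{nr} -> {name or 'not in table'}")
```

In other words, this line is probably unrelated to the NMI; the ioperm
call that follows is the more interesting one for port access.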


---
~Randy
Phaedrus says that Quality is about caring.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: NMI error and Intel S5000PSL Motherboards

2007-09-26 Thread Randy Dunlap
On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:

> Hello,
>
> > We have about 100 servers based on Intel S5000PSL-SATA motherboards.
> > They have been running for anywhere between 1 and 10 months. For the
> > past few months, after updating them all to the 2.6.20.15 kernel
> > (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> > errors. For example:
> >
> > Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> > Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
> > Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
>
> I'm also working with Andrew and Samson.  It seems that the cause of
> the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
> defaults to y.
>
> With CONFIG_PCIEAER=n, scanpci works fine with no errors.  This is the
> workaround that they'll likely use for now.

Glad that you found it.

> With CONFIG_PCIEAER=y, scanpci always triggers the NMI error.  The
> option aerdriver.forceload=1 has no effect.

The 'forceload' option only forces the driver to load even when the
ACPI hardware initialization routine fails.

It would be nice to be able to disable PCIEAER at boot time though.
Shouldn't be difficult.
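
Until such a boot option exists, it can help to confirm how a given kernel
was built before rebuilding it. A minimal Python sketch, assuming the build
configuration is available as text (many distributions install it as
/boot/config-<release>, and kernels built with CONFIG_IKCONFIG_PROC expose
it as /proc/config.gz; neither is guaranteed). The function name
config_value and the sample text are made up for illustration.

```python
import re

def config_value(config_text, option="CONFIG_PCIEAER"):
    """Return the option's value ('y', 'm', ...), 'n' if explicitly unset,
    or None if the option does not appear in the config at all."""
    for line in config_text.splitlines():
        line = line.strip()
        # Kconfig records disabled options as a comment line.
        if line == f"# {option} is not set":
            return "n"
        m = re.fullmatch(rf"{re.escape(option)}=(.*)", line)
        if m:
            return m.group(1)
    return None

# Made-up excerpt in the usual .config format.
sample = """\
CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
# CONFIG_HOTPLUG_PCI_PCIE is not set
"""
print("CONFIG_PCIEAER =", config_value(sample))
```

A result of None is worth distinguishing from 'n': on 2.6.18 and earlier
the option does not exist, so it will be absent rather than disabled.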


> The related dmesg output at boot is:
>
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:02.0:pcie01 failed with error 2
>   aer_init: AER service init fails - No ACPI _OSC support
>   aer: probe of :00:03.0:pcie01 failed with error 1
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:04.0:pcie01 failed with error 2
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:05.0:pcie01 failed with error 2
>   Evaluate _OSC Set fails. Status = 0x0005
>   Evaluate _OSC Set fails. Status = 0x0005
>   aer_init: AER service init fails - Run ACPI _OSC fails
>   aer: probe of :00:06.0:pcie01 failed with error 2
>   aer_init: AER service init fails - No ACPI _OSC support
>   aer: probe of :00:07.0:pcie01 failed with error 1
>
> Full dmesg, lspci, and ACPI DSDT are available here:
>   http://jim.sh/~jim/tmp/nmi/
>
> -jim


---
~Randy
Phaedrus says that Quality is about caring.


Re: NMI error and Intel S5000PSL Motherboards

2007-09-26 Thread Jim Paris
Hello,

> We have about 100 servers based on Intel S5000PSL-SATA motherboards.
> They have been running for anywhere between 1 and 10 months. For the
> past few months, after updating them all to the 2.6.20.15 kernel
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI
> errors. For example:
>
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
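
A note on "reason 30": the value is printed in hex and, on PC-compatible
hardware, is the status byte that the kernel's default NMI handler reads
from I/O port 0x61. Bits 7 and 6 flag the two legacy error sources (memory
parity/SERR# and I/O channel check). When neither is set, the kernel has
nothing to attribute the NMI to and prints the message above. The Python
sketch below decodes the upper four bits; the names are illustrative and
this is not kernel code.

```python
# Upper nibble of the port 0x61 status byte on PC-compatible hardware.
REASON_BITS = {
    0x80: "memory parity / SERR# error",
    0x40: "I/O channel check (IOCHK#) error",
    0x20: "timer counter 2 output",
    0x10: "refresh cycle toggle",
}

def decode_nmi_reason(hex_text):
    """Decode the hex 'reason' from the kernel's unknown-NMI message.

    Returns (value, names of the set status bits, whether an error
    source in bits 7/6 was flagged).
    """
    reason = int(hex_text, 16)
    flags = [name for bit, name in REASON_BITS.items() if reason & bit]
    has_error_source = bool(reason & 0xC0)
    return reason, flags, has_error_source

value, flags, has_source = decode_nmi_reason("30")
print(hex(value), flags)
print("error source flagged" if has_source else "no error source flagged")
```

For 0x30 only the timer and refresh bits are set, which carry no error
information; that is why the kernel reports the reason as unknown.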

I'm also working with Andrew and Samson.  It seems that the cause of
the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
defaults to y.

With CONFIG_PCIEAER=n, scanpci works fine with no errors.  This is the
workaround that they'll likely use for now.

With CONFIG_PCIEAER=y, scanpci always triggers the NMI error.  The
option aerdriver.forceload=1 has no effect.

The related dmesg output at boot is:

  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:02.0:pcie01 failed with error 2
  aer_init: AER service init fails - No ACPI _OSC support
  aer: probe of :00:03.0:pcie01 failed with error 1
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:04.0:pcie01 failed with error 2
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:05.0:pcie01 failed with error 2
  Evaluate _OSC Set fails. Status = 0x0005
  Evaluate _OSC Set fails. Status = 0x0005
  aer_init: AER service init fails - Run ACPI _OSC fails
  aer: probe of :00:06.0:pcie01 failed with error 2
  aer_init: AER service init fails - No ACPI _OSC support
  aer: probe of :00:07.0:pcie01 failed with error 1

Full dmesg, lspci, and ACPI DSDT are available here:
  http://jim.sh/~jim/tmp/nmi/
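
The boot log above shows two distinct failure modes: ports 02, 04, 05 and
06 fail with error 2 after the _OSC evaluation fails, while ports 03 and 07
fail with error 1 ("No ACPI _OSC support"). A small Python sketch that
pairs each failed probe with the reason logged just before it makes this
easier to see across many machines. The function name is made up, and the
regular expressions assume only the message formats shown above.

```python
import re

INIT = re.compile(r"aer_init: AER service init fails - (.+)")
PROBE = re.compile(r"aer: probe of (\S+) failed with error (\d+)")

def summarize_aer(dmesg_lines):
    """Return (device, error_code, reason) for each failed AER probe,
    using the aer_init reason logged immediately before the probe line."""
    results = []
    last_reason = None
    for line in dmesg_lines:
        m = INIT.search(line)
        if m:
            last_reason = m.group(1).strip()
            continue
        m = PROBE.search(line)
        if m:
            results.append((m.group(1), int(m.group(2)), last_reason))
            last_reason = None
    return results

# First few lines of the dmesg excerpt above.
log = """\
Evaluate _OSC Set fails. Status = 0x0005
Evaluate _OSC Set fails. Status = 0x0005
aer_init: AER service init fails - Run ACPI _OSC fails
aer: probe of :00:02.0:pcie01 failed with error 2
aer_init: AER service init fails - No ACPI _OSC support
aer: probe of :00:03.0:pcie01 failed with error 1
""".splitlines()

for device, code, reason in summarize_aer(log):
    print(f"{device}: error {code} ({reason})")
```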

-jim


Re: NMI error and Intel S5000PSL Motherboards

2007-09-25 Thread Randy Dunlap
On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

> We have about 100 servers based on Intel S5000PSL-SATA motherboards. 
> They have been running for anywhere between 1 and 10 months. For the 
> past few months, after updating them all to the 2.6.20.15 kernel 
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI 
> errors. For example:
> 
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode 
> enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
> 
> Sometimes these errors cause a total system freeze. Most of the time the 
> systems keep running.
> 
> We have determined these errors come most frequently on machines that 
> have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE 
> these cards (it doesn't matter what slot they are in), the NMI errors 
> can occur as frequently as every 3-5 minutes. On machines that do NOT 
> have these Quad Port Adapters, the NMI errors occur about once per month 
> on average. (we have tried the "in-kernel" e1000 drivers, as well as 
> Intel's latest - 7.6.5).
> 
> We have also determined (through a chance discovery) that running 
> “scanpci” can 100 percent reliably reproduce the NMI error on any 
> machine that has the Quad Port NICS. Our various motherboards have 
> different Intel BIOS versions – some have Rev 70, others 74, 79 or 81. 
> They all exhibit the same behavior regardless of BIOS version.
> 
> We have reproduced this problem with:
> 
> Mandriva 2008 RC2 (2.6.22 kernel)
> Mandriva 2007 with custom 2.6.20.15 kernel
> Mandriva 2007 with custom 2.6.19.8 kernel
> Ubuntu “Feisty” with 2.6.20 kernel
> Fedora Core 7 with 2.6.22 kernel
> 
> The problem does NOT occur with any distribution running a 2.6.18 kernel 
> or lower. I.E., CentOS or SUSE 10 and also Mandriva 2007 with included 
> 2.6.17 kernel or custom-compiled 2.6.18 kernel.
> 
> We have been in contact with Intel. Their high level tech support people 
> have basically said,
> 
> “the errors we have logged so far are pointing to a kernel issue and
> not a hardware problem. If we [Intel] can confirm this, it will be
> up to the kernel developer or OS system manufacturer to debug those
> ones, as we do not perform Operating system support.”
> 
> In other words, Intel seems to be blaming the problem we are seeing on 
> something introduced starting with the 2.6.19 kernel. We are not looking 
> to blame anybody. We are only looking for a solution.
> 
> Does anybody have an idea what could be going on here, as well as what 
> the solution may be? Going back to 2.6.18 or lower is not an option.

Answer #2:  if a kernel change was responsible for this problem,
the direct way to find that change is to clone the kernel 'git' tree
and then use git bisect to find the culprit.  If you are certain
that 2.6.18 is good and 2.6.19 is bad, then use those git tree tags
instead of the ones that are used in the example at:
  http://www.kernel.org/pub/software/scm/git/docs/git-bisect.html

git wiki is here:  http://git.or.cz/
and git docs are here:  http://www.kernel.org/pub/software/scm/git/docs/
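
For a sense of the effort involved: git bisect is a binary search over the
commits between a known-good and a known-bad revision (for example,
"git bisect start v2.6.19 v2.6.18", then "git bisect good" or
"git bisect bad" after testing each kernel it checks out). The Python
sketch below illustrates the idea on a simplified linear history; real git
handles merges and a non-linear commit graph, and the commit names, the
commit count and the function name here are made up.

```python
def first_bad(commits, is_bad):
    """Binary search over a linear history.

    commits[0] must be known good and commits[-1] known bad.
    Returns (first bad commit, number of kernels that had to be tested).
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] good, commits[hi] bad
    tests = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                # one build-boot-test cycle per step
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi], tests

# Made-up history of 1001 commits where commit 637 introduced the problem.
history = [f"c{i}" for i in range(1001)]
culprit, builds = first_bad(history, lambda c: int(c[1:]) >= 637)
print(f"first bad commit: {culprit}, kernels tested: {builds}")
```

The number of kernels to build and test grows with the logarithm of the
number of commits, so even a range of several thousand commits needs only
a dozen or so test cycles.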

If you want to use this tool, say so and I think that we (the royal
"we") will try to work you thru it.

---
~Randy
Phaedrus says that Quality is about caring.


Re: NMI error and Intel S5000PSL Motherboards

2007-09-25 Thread Randy Dunlap
On Wed, 26 Sep 2007 02:12:34 -0800 AndrewL733 wrote:

> We have about 100 servers based on Intel S5000PSL-SATA motherboards. 

product info (for others):
http://support.intel.com/support/motherboards/server/s5000psl/index.htm

> They have been running for anywhere between 1 and 10 months. For the 
> past few months, after updating them all to the 2.6.20.15 kernel 
> (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI 
> errors. For example:
> 
> Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
> Aug 29 09:02:10 master kernel: Do you have a strange power saving mode 
> enabled?
> Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
> 
> Sometimes these errors cause a total system freeze. Most of the time the 
> systems keep running.
> 
> We have determined these errors come most frequently on machines that 
> have an Intel PCI-e Quad Port Gigabit Adapter. On machines that HAVE 
> these cards (it doesn't matter what slot they are in), the NMI errors 
> can occur as frequently as every 3-5 minutes. On machines that do NOT 
> have these Quad Port Adapters, the NMI errors occur about once per month 
> on average. (we have tried the "in-kernel" e1000 drivers, as well as 
> Intel's latest - 7.6.5).
> 
> We have also determined (through a chance discovery) that running 
> “scanpci” can 100 percent reliably reproduce the NMI error on any 
> machine that has the Quad Port NICS. Our various motherboards have 
> different Intel BIOS versions – some have Rev 70, others 74, 79 or 81. 
> They all exhibit the same behavior regardless of BIOS version.
> 
> We have reproduced this problem with:
> 
> Mandriva 2008 RC2 (2.6.22 kernel)
> Mandriva 2007 with custom 2.6.20.15 kernel
> Mandriva 2007 with custom 2.6.19.8 kernel
> Ubuntu “Feisty” with 2.6.20 kernel
> Fedora Core 7 with 2.6.22 kernel
> 
> The problem does NOT occur with any distribution running a 2.6.18 kernel 
> or lower. I.E., CentOS or SUSE 10 and also Mandriva 2007 with included 
> 2.6.17 kernel or custom-compiled 2.6.18 kernel.
> 
> We have been in contact with Intel. Their high level tech support people 
> have basically said,
> 
> “the errors we have logged so far are pointing to a kernel issue and
> not a hardware problem. If we [Intel] can confirm this, it will be
> up to the kernel developer or OS system manufacturer to debug those
> ones, as we do not perform Operating system support.”
> 
> In other words, Intel seems to be blaming the problem we are seeing on 
> something introduced starting with the 2.6.19 kernel. We are not looking 
> to blame anybody. We are only looking for a solution.
> 
> Does anybody have an idea what could be going on here, as well as what 
> the solution may be? Going back to 2.6.18 or lower is not an option.


Please provide some basic info, like:

- how much RAM
- what CPUs (be precise: use 'cat /proc/cpuinfo')
- output of 'lspci -v'
- what kind(s) of SATA drives
- are you using 32-bit or 64-bit kernel(s)

Can you test kernels from kernel.org (i.e., not vendor kernels,
  no other [unknown] patches applied to them)?

Does tracing 'scanpci' produce any helpful information?
# strace -o scanpci.trace scanpci


---
~Randy
Phaedrus says that Quality is about caring.

