Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Benjamin Herrenschmidt
On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
> On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to 
> remark:
> > 
> > Additionally, in "real life", very few errors are caused by known errata.
> > If the drivers know about the errata, they usually already work around
> > them. Afaik, most of the errors are caused by transient conditions on
> > the bus or the device, like a bit being flipped, or thermal
> > conditions... 
> 
> 
> Heh. Let me describe "real life" a bit more accurately.
> 
> We've been running with pci error detection enabled here for the last
> two years.  Based on this experience, the ballpark figures are:
> 
> 90% of all detected errors were device driver bugs coupled to 
> pci card hardware errata

Well, that has been in-lab testing to fight driver bugs/errata on early
release kernels; I'm talking about the context of a released solution
with stable drivers/hw.

> 9% poorly seated pci cards (remove/reseat will make the problem go away)
> 
> 1% transient/other.

Ok.

> We've put *EVERY*, and I mean *EVERY*, device driver under stress
> tests (e.g. peak i/o rates for > 72 hours, massive tcp/nfs traffic,
> massive disk i/o traffic, etc.), and *EVERY* driver tripped on an
> EEH error detect that was traced back to a device driver bug.
> Not to blame the drivers: a lot of these were related to pci card
> hardware/firmware bugs.  For example,
> I think grepping for "split completion" and "NAPI" in the 
> patches/errata for e100 and e1000 for the last year will reveal 
> some of the stuff that was found.  As far as I know,
> for every bug found, a patch made it into mainline.

Yah, those are a pain. But then, that isn't the context described by
Nguyen, where the driver "knows" about the errata and how to recover.
It's the context of a bug where the driver does not know what's going on
and/or doesn't have the proper workaround. My point was more that there
are very few cases where a driver will have to recover from a PCI error
in a known situation where it actually expects an error to happen.
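
To make that concrete, the kind of driver-side hooks being proposed
would look roughly like the sketch below. The callback names and types
(error_detected/slot_reset/resume, pci_ers_result_t,
struct pci_error_handlers) are only illustrative of the shape of the
thing, not a statement of what the final interface will be:

#include <linux/pci.h>

/* The platform (e.g. EEH) has detected an error and isolated the slot:
 * stop touching the hardware and ask for a slot reset. */
static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
                                           pci_channel_state_t state)
{
        if (state == pci_channel_io_perm_failure)
                return PCI_ERS_RESULT_DISCONNECT;
        /* quiesce: stop queues, disable interrupts, fail pending I/O */
        return PCI_ERS_RESULT_NEED_RESET;
}

/* The slot has been reset: bring the device back up from scratch. */
static pci_ers_result_t foo_slot_reset(struct pci_dev *pdev)
{
        if (pci_enable_device(pdev))
                return PCI_ERS_RESULT_DISCONNECT;
        pci_set_master(pdev);
        /* reload firmware, rewrite registers, reallocate rings, ... */
        return PCI_ERS_RESULT_RECOVERED;
}

/* Recovery is complete: resume normal operation. */
static void foo_resume(struct pci_dev *pdev)
{
        /* restart queues, re-enable interrupts */
}

static const struct pci_error_handlers foo_err_handler = {
        .error_detected = foo_error_detected,
        .slot_reset     = foo_slot_reset,
        .resume         = foo_resume,
};

static struct pci_driver foo_driver = {
        .name        = "foo",
        /* .id_table, .probe, .remove as usual */
        .err_handler = &foo_err_handler,
};

The nice property is that the slot_reset path is essentially the same
code the driver already has for probe/resume, so a driver with a clean
init path gets most of this for free.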

> As a rule, it seems that finding these device driver bugs was
> very hard; we had some people work on these for months, and in 
> the case of the e1000, we managed to get Intel engineers to fly
> out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
> Ditto for Emulex.  For ipr, we had in-house people.
> 
> So overall, PCI error detection did have the expected effect 
> (protecting the kernel from corruption, e.g. due to DMAs going 
> to wild addresses), but I don't think anybody expected that the
> vast majority would be software/hardware bugs, instead of transient 
> effects.
> 
> What's ironic in all of this is that by adding error recovery,
> device driver bugs will be able to hide more effectively ... 
> if there's a pci bus error due to a driver bug, the pci card
> will get rebooted, the kernel will burp for 3 seconds, and 
> things will keep going, and most sysadmins won't notice or 
> won't care.

Yes, but it will be logged at least, so we'll spot a lot of these during
our tests.

Ben.




Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)

2005-03-18 Thread Linas Vepstas
On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to 
remark:
> 
> Additionally, in "real life", very few errors are caused by known errata.
> If the drivers know about the errata, they usually already work around
> them. Afaik, most of the errors are caused by transient conditions on
> the bus or the device, like a bit being flipped, or thermal
> conditions... 


Heh. Let me describe "real life" a bit more accurately.

We've been running with pci error detection enabled here for the last
two years.  Based on this experience, the ballpark figures are:

90% of all detected errors were device driver bugs coupled to 
pci card hardware errata

9% poorly seated pci cards (remove/reseat will make the problem go away)

1% transient/other.


We've put *EVERY*, and I mean *EVERY*, device driver under stress
tests (e.g. peak i/o rates for > 72 hours, massive tcp/nfs traffic,
massive disk i/o traffic, etc.), and *EVERY* driver tripped on an
EEH error detect that was traced back to a device driver bug.
Not to blame the drivers: a lot of these were related to pci card
hardware/firmware bugs.  For example,
I think grepping for "split completion" and "NAPI" in the 
patches/errata for e100 and e1000 for the last year will reveal 
some of the stuff that was found.  As far as I know,
for every bug found, a patch made it into mainline.
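
For what it's worth, the way a driver usually notices that EEH has
isolated its slot is indirect: once the slot is frozen, every MMIO
load returns all-ones, so an "impossible" 0xffffffff coming back from
a status register is the usual hint.  A rough sketch (the register
name and offset below are made up):

#include <linux/io.h>
#include <linux/types.h>

#define FOO_STATUS 0x08         /* hypothetical status register offset */

/* On an EEH-frozen slot every readl() returns 0xffffffff, so a register
 * that can never legitimately read as all-ones is a cheap way to detect
 * that the slot has been isolated out from under us. */
static bool foo_slot_looks_frozen(void __iomem *regs)
{
        return readl(regs + FOO_STATUS) == 0xffffffff;
}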

As a rule, it seems that finding these device driver bugs was
very hard; we had some people work on these for months, and in 
the case of the e1000, we managed to get Intel engineers to fly
out here and stare at PCI bus traces for a few days.  (Thanks Intel!)
Ditto for Emulex.  For ipr, we had in-house people.

So overall, PCI error detection did have the expected effect 
(protecting the kernel from corruption, e.g. due to DMAs going 
to wild addresses), but I don't think anybody expected that the
vast majority would be software/hardware bugs, instead of transient 
effects.

What's ironic in all of this is that by adding error recovery,
device driver bugs will be able to hide more effectively ... 
if there's a pci bus error due to a driver bug, the pci card
will get rebooted, the kernel will burp for 3 seconds, and 
things will keep going, and most sysadmins won't notice or 
won't care.

--linas


