Re: Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)
On Fri, 2005-03-18 at 18:35 -0600, Linas Vepstas wrote:
> On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
> >
> > Additionally, in "real life", very few errors are caused by known errata.
> > If the drivers know about the errata, they usually already work around
> > them. Afaik, most of the errors are caused by transient conditions on
> > the bus or the device, like a bit being flipped, or thermal
> > conditions...
>
> Heh. Let me describe "real life" a bit more accurately.
>
> We've been running with pci error detection enabled here for the last
> two years. Based on this experience, the ballpark figures are:
>
> 90% of all detected errors were device driver bugs coupled to
> pci card hardware errata

Well, this has been in-lab testing to fight driver bugs/errata on early
release kernels. I'm talking about the context of a released solution
with stable drivers/hw.

> 9% poorly seated pci cards (remove/reseat will make problem go away)
>
> 1% transient/other.

Ok.

> We've seen *EVERY* and I mean *EVERY* device driver that we've put
> under stress tests (e.g. peak i/o rates for > 72 hours, e.g.
> massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY*
> driver tripped on an EEH error detect that was traced back to
> a device driver bug. Not to blame the drivers, a lot of these
> were related to pci card hardware/firmware bugs. For example,
> I think grepping for "split completion" and "NAPI" in the
> patches/errata for e100 and e1000 for the last year will reveal
> some of the stuff that was found. As far as I know,
> for every bug found, a patch made it into mainline.

Yah, those are a pain. But then, it isn't the context described by
Nguyen where the driver "knows" about the errata and how to recover.
It's the context of a bug where the driver does not know what's going
on and/or doesn't have the proper workaround.

My point was more that there are very few cases where a driver will
have to do recovery of a PCI error in known cases where it actually
expects an error to happen.

> As a rule, it seems that finding these device driver bugs was
> very hard; we had some people work on these for months, and in
> the case of the e1000, we managed to get Intel engineers to fly
> out here and stare at PCI bus traces for a few days. (Thanks Intel!)
> Ditto for Emulex. For ipr, we had inhouse people.
>
> So overall, PCI error detection did have the expected effect
> (protecting the kernel from corruption, e.g. due to DMAs going
> to wild addresses), but I don't think anybody expected that the
> vast majority would be software/hardware bugs, instead of transient
> effects.
>
> What's ironic in all of this is that by adding error recovery,
> device driver bugs will be able to hide more effectively ...
> if there's a pci bus error due to a driver bug, the pci card
> will get rebooted, the kernel will burp for 3 seconds, and
> things will keep going, and most sysadmins won't notice or
> won't care.

Yes, but it will be logged at least, so we'll spot a lot of these
during our tests.

Ben.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Real-life pci errors (Was: Re: PCI Error Recovery API Proposal. (WAS:: [PATCH/RFC]PCIErrorRecovery)
On Sat, Mar 19, 2005 at 10:13:02AM +1100, Benjamin Herrenschmidt was heard to remark:
>
> Additionally, in "real life", very few errors are caused by known errata.
> If the drivers know about the errata, they usually already work around
> them. Afaik, most of the errors are caused by transient conditions on
> the bus or the device, like a bit being flipped, or thermal
> conditions...

Heh. Let me describe "real life" a bit more accurately.

We've been running with pci error detection enabled here for the last
two years. Based on this experience, the ballpark figures are:

90% of all detected errors were device driver bugs coupled to
pci card hardware errata

9% poorly seated pci cards (remove/reseat will make problem go away)

1% transient/other.

We've seen *EVERY* and I mean *EVERY* device driver that we've put
under stress tests (e.g. peak i/o rates for > 72 hours, e.g.
massive tcp/nfs traffic, massive disk i/o traffic, etc), *EVERY*
driver tripped on an EEH error detect that was traced back to
a device driver bug. Not to blame the drivers, a lot of these
were related to pci card hardware/firmware bugs. For example,
I think grepping for "split completion" and "NAPI" in the
patches/errata for e100 and e1000 for the last year will reveal
some of the stuff that was found. As far as I know,
for every bug found, a patch made it into mainline.

As a rule, it seems that finding these device driver bugs was
very hard; we had some people work on these for months, and in
the case of the e1000, we managed to get Intel engineers to fly
out here and stare at PCI bus traces for a few days. (Thanks Intel!)
Ditto for Emulex. For ipr, we had inhouse people.

So overall, PCI error detection did have the expected effect
(protecting the kernel from corruption, e.g. due to DMAs going
to wild addresses), but I don't think anybody expected that the
vast majority would be software/hardware bugs, instead of transient
effects.

What's ironic in all of this is that by adding error recovery,
device driver bugs will be able to hide more effectively ...
if there's a pci bus error due to a driver bug, the pci card
will get rebooted, the kernel will burp for 3 seconds, and
things will keep going, and most sysadmins won't notice or
won't care.

--linas