Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Hi Paul-

> I'm suggesting that the rpaphp code has a struct pci_driver whose
> id_table and probe function are such that it will claim the EADS
> bridges.  (It would probably be best to match on vendor=IBM and
> class=PCI-PCI bridge and let the probe function figure out which of
> the bridges it gets asked about are actually EADS bridges.)

Wouldn't this leave out hotplug-capable adapters that have direct PHB
parents, since these parent PHBs don't have pci_devs?

Thoughts?

Thanks-
John
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Linas Vepstas writes:

> > One way to clean this up would be to make rpaphp the driver for the
> > EADS bridges (from the pci code's point of view).
>
> I guess I don't understand what that means.  Are you suggesting moving
> pSeries_pci.c into the rpaphp code directory?

No, not at all. :)  I'm suggesting that the rpaphp code has a struct
pci_driver whose id_table and probe function are such that it will
claim the EADS bridges.  (It would probably be best to match on
vendor=IBM and class=PCI-PCI bridge and let the probe function figure
out which of the bridges it gets asked about are actually EADS
bridges.)

> I would prefer to deprecate the hot-plug based recovery scheme.  This
> is for many reasons, including the fact that some devices that can get
> pci errors are soldered onto the planar, and are not hot-pluggable.

Those devices can still be isolated and reset, can they not?  And they
still have an EADS bridge above them, don't they?  Are there any that
can't be dynamically reassigned from one partition to another?  I.e.,
they may not be physically hotpluggable, but they are still virtually
hotpluggable as far as the partition is concerned, IIUC.

Paul.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
> Not sure that I agree with this.  Not all PCI hotplug slots have EADS
> devices as parents.

Ahem, "PCI hotplug" above should read "EEH-enabled".  Sorry :)

John
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Hi Paul-

> > 2) As a result, the code to call hot-unplug is a bit messy.  In
> >    particular, there's a bit of hoop-jumping when hotplug is built
> >    as a module (and said hoops were wrecked recently when I moved the
> >    code around, out of the rpaphp directory).
>
> One way to clean this up would be to make rpaphp the driver for the
> EADS bridges (from the pci code's point of view).

Not sure that I agree with this.  Not all PCI hotplug slots have EADS
devices as parents.  This sort of topology dependency is something
we're trying to break in rpaphp.  Rpaphp could set this up for devices
that do have EADS parents, but then we've only covered a subset of
EEH-capable devices.

Anyway, isn't the OS forbidden from performing most of the expected
functions of such a driver for EADS devices?

 - Enable the device
 - Access device configuration space
 - Discover resources (addresses and IRQ numbers) provided by the device
 - Allocate these resources
 - Communicate with the device
 - Disable the device

Thanks-
John
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Mon, Aug 29, 2005 at 04:40:20PM +1000, Paul Mackerras was heard to remark:
> Linas Vepstas writes:
>
> > Actually, no. There are three issues:
> > 1) hotplug routines are called from within the kernel.  GregKH has
> >    stated on multiple occasions that doing this is wrong/bad/evil.
> >    This includes calling hot-unplug.
> >
> > 2) As a result, the code to call hot-unplug is a bit messy.  In
> >    particular, there's a bit of hoop-jumping when hotplug is built
> >    as a module (and said hoops were wrecked recently when I moved the
> >    code around, out of the rpaphp directory).
>
> One way to clean this up would be to make rpaphp the driver for the
> EADS bridges (from the pci code's point of view).

I guess I don't understand what that means.  Are you suggesting moving
pSeries_pci.c into the rpaphp code directory?

> Then it would
> automatically get included in the error recovery process and could do
> whatever it should.

John Rose, the current maintainer of the rpaphp code, is pretty
militant about removing things from, not adding things to, the rpaphp
code.  Which is a good idea, as chunks of that code are spaghetti, and
do need simplification and cleanup.

> > 3) Hot-unplug causes scripts to run in user-space.  There is no way to
> >    know when these scripts are done, so it's not clear if we've waited
> >    long enough before calling hot-add (or if waiting is even necessary).
>
> OK, so let's just add a new hotplug event called KOBJ_ERROR or
> something, which tells userspace that an error has occurred which has
> made the device inaccessible.  Greg, would that be OK?

Why do we need such an event?

I would prefer to deprecate the hot-plug based recovery scheme.  This
is for many reasons, including the fact that some devices that can get
pci errors are soldered onto the planar, and are not hot-pluggable.

--linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Fri, Aug 26, 2005 at 09:37:36AM +1000, Benjamin Herrenschmidt was heard to remark:
> On Fri, 2005-08-26 at 09:18 +1000, Paul Mackerras wrote:
> > Benjamin Herrenschmidt writes:
> >
> > > Ok, so what is the problem then ? Why do we have to wait at all ? Why
> > > not just unplug/replug right away ?
> >
> > We'd have to be absolutely certain that the driver could not possibly
> > take another interrupt or try to access the device on behalf of the
> > old instance of the device by the time it returned from the remove
> > function.  I'm not sure I'd trust most drivers that far...
>
> Hrm... If a driver gets that wrong, then it will also blow up when
> unloaded as a module.

:) We've discovered two, so far, that blow up when unloaded: lpfc and
e1000.  I believe these are now fixed in mainline.

--linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Fri, Aug 26, 2005 at 07:43:57AM +1000, Benjamin Herrenschmidt was heard to remark:
> On Thu, 2005-08-25 at 11:21 -0500, Linas Vepstas wrote:
> > On Thu, Aug 25, 2005 at 10:49:03AM +1000, Benjamin Herrenschmidt was heard
> > to remark:
> > >
> > > Of course, we'll possibly end up with a different ethX or whatever, but
> >
> > Yep, but that's not an issue, since all the various device-naming
> > schemes are supposed to be fixing this.  It's a distinct problem;
> > it needs to be solved even across cold-boots.
>
> Ok, so what is the problem then ? Why do we have to wait at all ? Why
> not just unplug/replug right away ?

Paranoia + old versions of udev.

I believe that older versions of udev (such as the ones currently
shipping with Red Hat RHEL4 and SuSE SLES9) failed to serialize events
properly.  I believe that the newer versions do serialize, but have
not verified/tested.

--linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Linas Vepstas writes:

> Actually, no. There are three issues:
> 1) hotplug routines are called from within the kernel.  GregKH has
>    stated on multiple occasions that doing this is wrong/bad/evil.
>    This includes calling hot-unplug.
>
> 2) As a result, the code to call hot-unplug is a bit messy.  In
>    particular, there's a bit of hoop-jumping when hotplug is built
>    as a module (and said hoops were wrecked recently when I moved the
>    code around, out of the rpaphp directory).

One way to clean this up would be to make rpaphp the driver for the
EADS bridges (from the pci code's point of view).  Then it would
automatically get included in the error recovery process and could do
whatever it should.

> 3) Hot-unplug causes scripts to run in user-space.  There is no way to
>    know when these scripts are done, so it's not clear if we've waited
>    long enough before calling hot-add (or if waiting is even necessary).

OK, so let's just add a new hotplug event called KOBJ_ERROR or
something, which tells userspace that an error has occurred which has
made the device inaccessible.  Greg, would that be OK?

The only trouble with that is that if we don't have a hotplug bridge
driver loaded for the slot, we can't tell the driver that its device
has gone away.  I guess that's not a big problem in practice.

Paul.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Fri, 2005-08-26 at 09:18 +1000, Paul Mackerras wrote:
> Benjamin Herrenschmidt writes:
>
> > Ok, so what is the problem then ? Why do we have to wait at all ? Why
> > not just unplug/replug right away ?
>
> We'd have to be absolutely certain that the driver could not possibly
> take another interrupt or try to access the device on behalf of the
> old instance of the device by the time it returned from the remove
> function.  I'm not sure I'd trust most drivers that far...

Hrm... If a driver gets that wrong, then it will also blow up when
unloaded as a module.  All drivers should be fully shut down by the
time they return from remove().  free_irq() is synchronous, as is
iounmap(), and both of those are usually called as part of remove().
I wouldn't be too worried here.

Ben.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Benjamin Herrenschmidt writes:

> Ok, so what is the problem then ? Why do we have to wait at all ? Why
> not just unplug/replug right away ?

We'd have to be absolutely certain that the driver could not possibly
take another interrupt or try to access the device on behalf of the
old instance of the device by the time it returned from the remove
function.  I'm not sure I'd trust most drivers that far...

Paul.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Thu, 2005-08-25 at 11:21 -0500, Linas Vepstas wrote:
> On Thu, Aug 25, 2005 at 10:49:03AM +1000, Benjamin Herrenschmidt was heard to
> remark:
> >
> > Of course, we'll possibly end up with a different ethX or whatever, but
>
> Yep, but that's not an issue, since all the various device-naming
> schemes are supposed to be fixing this.  It's a distinct problem;
> it needs to be solved even across cold-boots.

Ok, so what is the problem then ? Why do we have to wait at all ? Why
not just unplug/replug right away ?

> (Didn't I ever tell you about the day I added a new disk controller to
> my system, and /dev/hda became /dev/hde and thus /home was mounted on
> /usr and /var as /etc and all hell broke loose?  Owww, device naming
> is a serious issue for home users and even more so for enterprise-class
> users).
>
> --linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Thu, Aug 25, 2005 at 10:49:03AM +1000, Benjamin Herrenschmidt was heard to remark:
>
> Of course, we'll possibly end up with a different ethX or whatever, but

Yep, but that's not an issue, since all the various device-naming
schemes are supposed to be fixing this.  It's a distinct problem;
it needs to be solved even across cold-boots.

(Didn't I ever tell you about the day I added a new disk controller to
my system, and /dev/hda became /dev/hde and thus /home was mounted on
/usr and /var as /etc and all hell broke loose?  Owww, device naming
is a serious issue for home users and even more so for enterprise-class
users).

--linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Thu, Aug 25, 2005 at 10:10:45AM +1000, Paul Mackerras was heard to remark:
> Linas Vepstas writes:
>
> > The meta-issue that I'd like to reach consensus on first is whether
> > there should be any hot-plug recovery attempted at all.  Removing
> > hot-plug-recovery support will make many of the issues you raise
> > moot.
>
> Yes, this is probably the thorniest issue we have.
>
> My feeling is that the unplug half of it is probably fairly
> uncontroversial, but the replug half is a can of worms.  Would you
> agree with that?

Actually, no. There are three issues:

1) hotplug routines are called from within the kernel.  GregKH has
   stated on multiple occasions that doing this is wrong/bad/evil.
   This includes calling hot-unplug.

2) As a result, the code to call hot-unplug is a bit messy.  In
   particular, there's a bit of hoop-jumping when hotplug is built
   as a module (and said hoops were wrecked recently when I moved the
   code around, out of the rpaphp directory).

3) Hot-unplug causes scripts to run in user-space.  There is no way to
   know when these scripts are done, so it's not clear if we've waited
   long enough before calling hot-add (or if waiting is even necessary).

> Is it udev that handles the hotplug notifications on the userspace
> side in all cases (do both RHEL and SLES use udev, for instance)?

Yes, and it seems to work fine despite the fact that the current
sles9/rhel4 use rather old versions of udev, which is criminal,
according to Kay Sievers.  I have not tested new versions of udev;
I assume new versions will work "even better".

> How hard is it to add a new sort of notification, on the kernel side
> and in udev?

Why? To achieve what goal?  (Keep in mind that user-space eeh
solutions seem to fail when the affected device is storage, since
block devices and filesystems don't like the underlying storage to
wink out.)

> I think what I'd like to see is that when a slot gets isolated and the
> driver doesn't have recovery code, the kernel calls the driver's
> unplug function and generates a hotplug event to udev.  Ideally this
> would be a variant of the remove event which would say "and by the
> way, please try replugging this slot when you've finished handling the
> remove event" or something along those lines.

Ahh, yes, this addresses the timing issue raised in point 3).

However, I'm thinking that the timing issue is not really an issue,
depending on how udev is designed.  For example, consider a 100% cpu
loaded, heavy i/o loaded PC, and now we rapidly unplug and replug some
USB key.  Presumably, udev will handle this gracefully.  The kernel
error recovery essentially looks the same to udev: the burp on the bus
looks like a rapid-fire unplug-replug.

BTW, zSeries has similar concerns, where channels can come and go,
causing thousands of unplug/replug udev events in rapid succession.
(This was discussed on the udev mailing lists in the past).  It might
be interesting to have the zSeries folks discuss the current EEH
design, as this is something they have far more experience with than
any of the pc or unix crowd.  I personally have not discussed it with
any zSeries people.

--linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
> I think what I'd like to see is that when a slot gets isolated and the
> driver doesn't have recovery code, the kernel calls the driver's
> unplug function and generates a hotplug event to udev.  Ideally this
> would be a variant of the remove event which would say "and by the
> way, please try replugging this slot when you've finished handling the
> remove event" or something along those lines.

I'm still trying to understand why we care.  What prevents us from
just unplugging the previous device and re-plugging right away ?
After all, the driver->remove() function is supposed to guarantee that
no HW access will happen after it returns and that everything was
unmapped.

Of course, we'll possibly end up with a different ethX or whatever,
but I don't see the problem with that ... It's hopeless to think we
might manage to keep that identical anyway, unless the driver
implements proper error recovery.

Ben.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Linas Vepstas writes:

> The meta-issue that I'd like to reach consensus on first is whether
> there should be any hot-plug recovery attempted at all.  Removing
> hot-plug-recovery support will make many of the issues you raise
> moot.

Yes, this is probably the thorniest issue we have.

My feeling is that the unplug half of it is probably fairly
uncontroversial, but the replug half is a can of worms.  Would you
agree with that?

Is it udev that handles the hotplug notifications on the userspace
side in all cases (do both RHEL and SLES use udev, for instance)?
How hard is it to add a new sort of notification, on the kernel side
and in udev?

I think what I'd like to see is that when a slot gets isolated and the
driver doesn't have recovery code, the kernel calls the driver's
unplug function and generates a hotplug event to udev.  Ideally this
would be a variant of the remove event which would say "and by the
way, please try replugging this slot when you've finished handling the
remove event" or something along those lines.

Thoughts?

Paul.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
On Wed, Aug 24, 2005 at 10:45:31AM -0500, John Rose was heard to remark:
> > +++ linux-2.6.13-rc6-git9/arch/ppc64/kernel/eeh_driver.c  2005-08-23
> > 14:34:44.0 -0500
> > +/*
> > + * PCI Hot Plug Controller Driver for RPA-compliant PPC64 platform.
>
> This probably isn't the right header description for this file :)

Yes, this file is a little ball of ugliness that resulted from moving
things out of the rpaphp directory; and, yes, it's rather
un-reconstructed.  I released it under the "release early" program.

The meta-issue that I'd like to reach consensus on first is whether
there should be any hot-plug recovery attempted at all.  Removing
hot-plug-recovery support will make many of the issues you raise
moot.

> > +++ linux-2.6.13-rc6-git9/include/asm-ppc64/prom.h  2005-08-23
> > 13:31:52.0 -0500
> >   int busno;     /* for pci devices */
> >   int bussubno;  /* for pci devices */
> >   int devfn;     /* for pci devices */
>
> How about a pointer to a struct of EEH fields?  Folks are touchy about
> adding anything PCI-specific to device nodes, especially since most DNs
> aren't PCI at all.

I attempted to remove all of the pci-related stuff from this struct,
and got a crash in very very early boot (before the transition from
real to virtual addressing).  Not sure why, I was surprised.  It seems
related to the flattening of the device node tree.  I'll try again
soon.

--linas
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Hi Linas-

I like the idea of splitting the recovery stuff into its own driver.
A few comments on the last reorg patch:

> Index: linux-2.6.13-rc6-git9/arch/ppc64/kernel/eeh.c
...
> +static int
> +eeh_slot_availability(struct device_node *dn)
...
> +void eeh_restore_bars(struct device_node *dn)

Inconsistent spacing in new code...

> --- /dev/null  1970-01-01 00:00:00.0 +
> +++ linux-2.6.13-rc6-git9/arch/ppc64/kernel/eeh_driver.c  2005-08-23
> 14:34:44.0 -0500
> @@ -0,0 +1,361 @@
> +/*
> + * PCI Hot Plug Controller Driver for RPA-compliant PPC64 platform.

This probably isn't the right header description for this file :)

> +
> +/**
> + * rpaphp_find_slot - find and return the slot holding the device
> + * @dev: pci device for which we want the slot structure.
> + */
> +static struct slot *rpaphp_find_slot(struct pci_dev *dev)
> +{
> +        struct list_head *tmp, *n;
> +        struct slot *slot;
> +
> +        list_for_each_safe(tmp, n, &rpaphp_slot_head) {
> +                struct pci_bus *bus;
> +
> +                slot = list_entry(tmp, struct slot, rpaphp_slot_list);
> +
> +                /* PHB's don't have bridges. */
> +                bus = slot->bus;
> +                if (bus == NULL)
> +                        continue;
> +
> +                /* The PCI device could be the slot itself. */
> +                if (bus->self == dev)
> +                        return slot;
> +
> +                if (pci_search_bus_for_dev (bus, dev))
> +                        return slot;
> +        }
> +        return NULL;
> +}

This function breaks if rpaphp is compiled as a module.  It's probably
bad for kernel code to depend on symbols exported from modules.  This
raises two larger questions: Where should this new driver sit, and
should it be possible to compile it as a module as well?

> +/* --- */
> +/**
> + * handle_eeh_events -- reset a PCI device after hard lockup.
> + *
...
> +int eeh_reset_device (struct pci_dev *dev, struct device_node *dn, int
> reconfig)

Header doesn't match the function :)

> +{
> +        struct hotplug_slot *frozen_slot= NULL;
> +        struct hotplug_slot_ops *frozen_ops= NULL;
> +
> +        if (!dev)
> +                return 1;
> +
> +        if (reconfig) {
> +                frozen_slot = pci_hp_find_slot(dev);
> +                if (frozen_slot)
> +                        frozen_ops = frozen_slot->ops;
> +        }
> +
> +        if (frozen_ops) frozen_ops->disable_slot (frozen_slot);
> +
> +        /* Reset the pci controller. (Asserts RST#; resets config space).
> +         * Reconfigure bridges and devices */
> +        rtas_set_slot_reset (dn->child);
> +
> +        /* Walk over all functions on this device */
> +        struct device_node *peer = dn->child;
> +        while (peer) {
> +                rtas_configure_bridge(peer);
> +                eeh_restore_bars(peer);
> +                peer = peer->sibling;
> +        }
> +
> +        /* Give the system 5 seconds to finish running the user-space
> +         * hotplug scripts, e.g. ifdown for ethernet.  Yes, this is a hack,
> +         * but if we don't do this, weird things happen.
> +         */
> +        if (frozen_ops) {
> +                ssleep (5);
> +                frozen_ops->enable_slot (frozen_slot);
> +        }
> +        return 0;
> +}

This dependence on struct hotplug_slot might be problematic as we
restrict the registration of "PCI hotplug" slots to exclude PHBs, VIO,
and embedded slots.  Noticed your comment to this effect.  I can work
with you offline on this.

> +
> +/* The longest amount of time to wait for a pci device
> + * to come back on line, in seconds.
> + */
> +#define MAX_WAIT_FOR_RECOVERY 15
> +
> +int handle_eeh_events (struct eeh_event *event)
> +{
> +        ...
> +        frozen_device = pci_bus_to_OF_node(dev->bus);
> +        if (!frozen_device)
> +        {
> +                printk (KERN_ERR "EEH: Cannot find PCI controller for %s\n",
> +                        pci_name(dev));
> +
> +                return 1;
> +        }
> +        BUG_ON (frozen_device->phb==NULL);
> +
> +        /* We get "permanent failure" messages on empty slots.
> +         * These are false alarms. Empty slots have no child dn. */
> +        if ((event->state == pci_channel_io_perm_failure) && (frozen_device ==
> NULL))

The second part of this conditional will never be true, as this has
just been checked above.

> +        if (frozen_device)
> +                freeze_count = frozen_device->eeh_freeze_count;

This conditional will always be true, as this has also been checked
above.

> Index: linux-2.6.13-rc6-git9/include/asm-ppc64/prom.h
> ===
> --- linux-2.6.13-rc6-git9.orig/include/asm-ppc64/prom.h  2005-08-19
> 15:11:39.0 -0500
> +++ linux-2.6.13-rc6-git9/include/asm-ppc64/prom.h  2005-08-23
> 13:31:52.0 -0500
> @@ -135,11 +135,17 @@
>         int busno;      /* for pci devices */
>         int bussubno;   /* for pci devices */
>         int devfn;
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
I wrote:
> Linas Vepstas writes:
>
> In this patch at least, your mailer seems to have blanked out lines
> that match ^[-+]$.  Could you send them to me again with a different
> mailer or put them on a web or ftp site somewhere?

I got 3 copies of each of these mails, one directly, one through
linuxppc64-dev and one through linux-kernel.  It looks like the copies
that came through the mailing lists are OK but the copy that came
directly to me is corrupted.  Weird.

Regards,
Paul.
Re: [patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Linas Vepstas writes:

In this patch at least, your mailer seems to have blanked out lines
that match ^[-+]$.  Could you send them to me again with a different
mailer or put them on a web or ftp site somewhere?

Thanks,
Paul.
[patch 8/8] PCI Error Recovery: PPC64 core recovery routines
Various PCI bus errors can be signaled by newer PCI controllers.  The
core error recovery routines are architecture dependent.  This patch
adds a recovery infrastructure for the PPC64 pSeries systems.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>

--
 arch/ppc64/kernel/Makefile     |    2
 arch/ppc64/kernel/eeh.c        |  543 +
 arch/ppc64/kernel/eeh_driver.c |  361 +++
 arch/ppc64/kernel/eeh_event.c  |  116
 arch/ppc64/kernel/eeh_event.h  |   52 +++
 arch/ppc64/kernel/rtas_pci.c   |    5
 include/asm-ppc64/eeh.h        |  105 +--
 include/asm-ppc64/prom.h       |   10
 include/asm-ppc64/rtas.h       |    2
 9 files changed, 1001 insertions(+), 195 deletions(-)

Index: linux-2.6.13-rc6-git9/arch/ppc64/kernel/eeh.c
===
--- linux-2.6.13-rc6-git9.orig/arch/ppc64/kernel/eeh.c  2005-08-19 12:52:31.0 -0500
+++ linux-2.6.13-rc6-git9/arch/ppc64/kernel/eeh.c       2005-08-23 16:53:05.0 -0500
@@ -1,32 +1,33 @@
 /*
+ *
  * eeh.c
  * Copyright (C) 2001 Dave Engebretsen & Todd Inglett IBM Corporation
- *
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
  * (at your option) any later version.
- *
+ *
  * This program is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- *
+ *
  * You should have received a copy of the GNU General Public License
  * along with this program; if not, write to the Free Software
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
 */

-#include 
+#include 
 #include 
+#include 
 #include 
-#include 
-#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -34,6 +35,7 @@
 #include 
 #include 
 #include "pci.h"
+#include "eeh_event.h"

 #undef DEBUG

@@ -49,8 +51,8 @@
  * were "empty": all reads return 0xff's and all writes are silently
  * ignored.  EEH slot isolation events can be triggered by parity
  * errors on the address or data busses (e.g. during posted writes),
- * which in turn might be caused by dust, vibration, humidity,
- * radioactivity or plain-old failed hardware.
+ * which in turn might be caused by low voltage on the bus, dust,
+ * vibration, humidity, radioactivity or plain-old failed hardware.
  *
  * Note, however, that one of the leading causes of EEH slot
  * freeze events are buggy device drivers, buggy device microcode,
@@ -75,22 +77,13 @@
 #define BUID_HI(buid) ((buid) >> 32)
 #define BUID_LO(buid) ((buid) & 0x)

-/* EEH event workqueue setup. */
-static DEFINE_SPINLOCK(eeh_eventlist_lock);
-LIST_HEAD(eeh_eventlist);
-static void eeh_event_handler(void *);
-DECLARE_WORK(eeh_event_wq, eeh_event_handler, NULL);
-
-static struct notifier_block *eeh_notifier_chain;
-
 /*
  * If a device driver keeps reading an MMIO register in an interrupt
  * handler after a slot isolation event has occurred, we assume it
  * is broken and panic.  This sets the threshold for how many read
  * attempts we allow before panicking.
 */
-#define EEH_MAX_FAILS   1000
-static atomic_t eeh_fail_count;
+#define EEH_MAX_FAILS   10

 /* RTAS tokens */
 static int ibm_set_eeh_option;
@@ -107,6 +100,10 @@
 static int eeh_error_buf_size;

 /* System monitoring statistics */
+static DEFINE_PER_CPU(unsigned long, no_device);
+static DEFINE_PER_CPU(unsigned long, no_dn);
+static DEFINE_PER_CPU(unsigned long, no_cfg_addr);
+static DEFINE_PER_CPU(unsigned long, ignored_check);
 static DEFINE_PER_CPU(unsigned long, total_mmio_ffs);
 static DEFINE_PER_CPU(unsigned long, false_positives);
 static DEFINE_PER_CPU(unsigned long, ignored_failures);
@@ -224,9 +221,9 @@
        while (*p) {
                parent = *p;
                piar = rb_entry(parent, struct pci_io_addr_range, rb_node);
-               if (alo < piar->addr_lo) {
+               if (ahi < piar->addr_lo) {
                        p = &parent->rb_left;
-               } else if (ahi > piar->addr_hi) {
+               } else if (alo > piar->addr_hi) {
                        p = &parent->rb_right;
                } else {
                        if (dev != piar->pcidev ||
@@ -245,6 +242,11 @@
        piar->pcidev = dev;
        piar->flags = flags;

+#ifdef DEBUG
+       printk (KERN_DEBUG "PIAR: insert range=[%lx:%lx] dev=%s\n",
+               alo, ahi, pci_name (dev));
+#endif
+
        rb_link_node(&piar->rb_node, parent, p);
        rb_insert_color(&piar->rb_node, &pci_io_addr_cache_root.rb_root);

@@ -268,8 +270,8 @@
        if (!(dn->eeh_mode & EEH_MODE_SUPPORTED) ||
            dn->eeh_mode & EEH_MODE_NOCHE