Re: [patch 02/17] PCI Error Recovery: Symbios SCSI base support
On Tue, Oct 02, 2007 at 03:49:26PM -0600, Matthew Wilcox wrote:
On Tue, Oct 02, 2007 at 02:38:00PM -0700, [EMAIL PROTECTED] wrote:
From: Linas Vepstas [EMAIL PROTECTED]

Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the Symbios SCSI device driver. The patch has been tested, and appears to work well.

Linas and I have been discussing the problems with this patch. I think we have a solution; we certainly have something in my tree that's acceptable to me; he'd just like to test it before it's unleashed on the world.

Matthew, your fix was a patch on top of my patch ... I assume you want to submit it that way, instead of reworking this patch? Anyway, I finally got a chance to run it yesterday, it worked fine. I'll try to make final comments in the other thread.

--linas

- To unsubscribe from this list: send the line unsubscribe linux-scsi in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:

The thing to remember is that sym2 is in transition from being a dual BSD/Linux driver to being a purely Linux driver.

I was wondering about that; couldn't tell if the split in the code was historical, or being intentionally maintained.

My gut instinct is to say ack, although prudence dictates that I should test first. Which might take a few days...

Fine by me.

I tested the patch, it worked great. It also seemed to recover much more quickly -- so quickly, in fact, that I thought something had gone wrong. I reviewed it one more time, it really does look good. A formal submission and acked by's at earliest convenience would be good.

--linas
Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:

Fine by me. Do you have the ability to produce failures on a whim on your platforms?

Yes, although it is very platform specific -- there are actually transistors in the pci bridge chip, which actually short out lines, and so, from the point of view of the rest of the chip, it did actually see a real error. It's supposed to be a very realistic test.

I've been vaguely musing a PCI device failure patch for x86, just so people can test driver failure paths.

That would be good ... I've recently agreed to accept a fedex to test someone else's card for them, which is outside my usual activities.

There's also supposed to be some PCI-X riser card out there, (never seen one) which has the ability to inject actual pci errors. It's the Agilent PCI BestX card; I got the impression they might not sell it anymore; dunno.

One guy in the lab used to brush a grounding strap across the pins; this usually got a rise out of the audience.

--linas
Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
On Mon, Oct 01, 2007 at 02:12:47PM -0600, Matthew Wilcox wrote:

I think the fundamental problem is that completions aren't really supposed to be used like this. Here's one attempt at using completions perhaps a little more the way they're supposed to be used,

Yes, that looks very good to me. I see it solves a bug that I hadn't been quite aware of. I don't understand why struct host_data is preferable to struct sym_shcb (is it because this is the structure that is naturally protected by the spinlock?)

My gut instinct is to say ack, although prudence dictates that I should test first. Which might take a few days...

although now I've written it, I wonder if we shouldn't just use a waitqueue instead.

I thought that earlier versions of the driver used waitqueues (I vaguely remember eh_wait in the code), which were later converted to completions (I also vaguely recall thinking that the new code was more elegant/simpler). I converted my patch to use the completions likewise, and, as you've clearly shown, did a rather sloppy job in the conversion.

I'm tempted to go with this patch; but if you prod, I could attempt a wait-queue based patch.

--linas
EDAC PCI error recovery (was Re: [PATCH 2/2] powerpc: MPC85xx EDAC device driver)
On Mon, Jul 30, 2007 at 03:47:05PM -0700, Doug Thompson wrote:
--- Linas Vepstas [EMAIL PROTECTED] wrote:

Also: please note that the linux kernel has a pci error recovery mechanism built in; it's used by pseries and PCI-E. I'm not clear on what any of this has to do with EDAC, which I thought was supposed to be for RAM only. (The EDAC project once talked about doing pci error recovery, but that was years ago, and there is a separate system for that, now.)

no, edac can/does harvest PCI bus errors, via polling and other hardware error detectors.

Ehh! I had no idea. A few years ago, when I was working on the PCI error recovery, I sent a number of emails to the various EDAC people and mailing lists that I could find, and never got a response. I assumed the project was dead. I guess it's not ...

But at the current time, few PCI device drivers initialize those callback functions and thus errors are lost and some IO transactions fail.

There are patches for 6 drivers in mainline (e100, e1000, ixgb, s2io, ipr, lpfc), and two more pending (sym53cxxx, tg3). So far, I've written all of them.

Over time, as drivers get updated (might take some time) then drivers can take some sort of action FOR THEMSELVES

I think I need to do more to raise awareness and interest.

Yet, there is no tracking of errors - except for a log message in the log file. There is NO meter on frequency of errors, etc. One must grep the log file and that is not a very cycle friendly mechanism.

Yeah, there was low interest in stats. There's a core set of stats in /proc/ppc64/eeh, but these are clearly arch-specific. I'd like to move away from those. Some recent patches added stats to the /sys tree, under the individual pci bridge and device nodes. Again, these are arch-specific; I'd like to move to some general/standardized presentation.

The reason I added PCI parity/error device scanning, was that when I was at Linux Networx, we had parity errors on the PCI-X bus, but didn't know the cause.
Afterwards we discovered that a simple PCI-X riser card had manufacturing problems (quality) and didn't drive lines properly; it caused parity errors.

Heh. Not unusual. I've seen/heard of cases with voltages being low, and/or ground-bounce in slots near the end. There's a whole zoo of hardware/firmware bugs that we've had to painfully crawl through and fix. That's why the IBM boxes cost big $$$; here's to hoping that customers understand why.

This feature allowed us to track nodes that were having parity problems, but we had no METER to know it. Recovery is a good thing, BUT how do you know you are having LOTS of errors/recovery events? You need a meter. EDAC provides that METER

I'm lazy. What source code should I be looking at? I'm concerned about duplication of function and proliferation of interfaces. I've got my metering data under (for example) /sys/bus/pci/devices/0001:c0:01.0/eeh_*, mostly very arch specific. The code for this is in arch/powerpc/platforms/pseries/eeh_sysfs.c

I met with Yanmin Zhang of Intel at OLS after his paper presentation on PCI Express Advanced Error Reporting in the Kernel, and we talked about this same thing. I am talking with him on having the recovery code present information into EDAC sysfs area. (hopefully, anyway)

Hmm. OK, where's that? Back when, I'd talked to Yanmin about coming up with a generic, arch-indep way of driving the recovery routines. But this wasn't exactly easy, and we were still grappling with just getting things working. Now that things are working, it's time to broaden horizons.

Can you point me to the current edac code? find . -print |grep edac is not particularly revealing at the moment.

The recovery generates log messages BUT having to periodically 'grep' the log file looking for errors is not a good use of CPU cycles. grep once for a count and then grep later for a count and then compare the counts for a delta count per unit time. ugly.

Yep. Maybe send events up to udev?
The EDAC solution is to be able to have a Listener thread in user space that can be notified (via poll()) that an event has occurred.

Hmm. OK, I'm alarmingly naive about udev, but my initial gut instinct is to pipe all such events to udev. Most of user-space has already been given the marching orders to use udev and/or hal for this kind of stuff. So this makes sense to me.

There is more than one consumer (error recover) of error events:

1) driver recovery after a transaction (which is the recovery consumer above)

I had to argue loudly for recovery in the kernel. The problem was that it was impossible to recover errors on scsi devices from userspace (since the block device and filesystems would go bonkers).

2) Management agents for health of a node
3) Maintenance agents for predictive component replacement

Yes, agreed. Care to ask your management agent friends for where they'd like to get these events from (i.e. udev, or somewhere else?) We
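The "meter" asked for in this thread (a running count plus a delta count per unit time, instead of grepping the log twice) can be sketched in a few lines of C. This is only an illustration: struct err_meter and both functions are hypothetical names, not part of EDAC, EEH, or any kernel interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device error meter; not an EDAC or EEH structure. */
struct err_meter {
	uint64_t total;        /* errors seen since the meter was created */
	uint64_t last_total;   /* value of 'total' at the previous sample */
	uint64_t last_time_s;  /* time of the previous sample, in seconds */
};

/* Called once per detected error or recovery event. */
static void err_meter_record(struct err_meter *m)
{
	assert(m != NULL);
	m->total++;
}

/*
 * Return the error rate, in errors per hour, over the window since the
 * previous sample, then start a new window. If no time has passed the
 * window is left untouched and 0 is returned.
 */
static uint64_t err_meter_sample(struct err_meter *m, uint64_t now_s)
{
	uint64_t elapsed, delta;

	assert(m != NULL);
	if (now_s <= m->last_time_s)
		return 0;

	elapsed = now_s - m->last_time_s;
	delta = m->total - m->last_total;
	m->last_total = m->total;
	m->last_time_s = now_s;
	return delta * 3600 / elapsed;
}
```

A management agent would call err_meter_sample() on a timer and compare the result against a threshold, which is the kind of frequency information a log file alone does not give cheaply.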
[PATCH]: PCI Error Recovery: Symbios SCSI device driver
Various PCI bus errors can be signaled by newer PCI controllers. This patch adds the PCI error recovery callbacks to the Symbios SCSI device driver. The patch has been tested, and appears to work well.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

Hi,

This patch has been bouncing around for a long time, and has made appearances in various -mm trees since 2.6.something-teen. However, it has never made it into mainline, and I'm starting to get concerned that it will miss 2.6.23 as well.

There was some discussion, and I think I addressed all of the various issues that came up. I'd really like to get this patch in, but am unclear on exactly who to pester at this point. Matt Wilcox seems to be looking for a job (???) and I am unable to git-clone James Bottomley's git://kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git git tree; there's some error on the server side.

Linas.

 drivers/scsi/sym53c8xx_2/sym_glue.c |  136
 drivers/scsi/sym53c8xx_2/sym_glue.h |    4 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |    6 +
 3 files changed, 146 insertions(+)

Index: linux-2.6.22-rc1/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.22-rc1.orig/drivers/scsi/sym53c8xx_2/sym_glue.c 2007-04-25 22:08:32.0 -0500
+++ linux-2.6.22-rc1/drivers/scsi/sym53c8xx_2/sym_glue.c 2007-05-14 17:31:44.0 -0500
@@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
 	unsigned long flags;
 	struct sym_hcb *np = (struct sym_hcb *)dev_id;
 
+	/* Avoid spinloop trying to handle interrupts on frozen device */
+	if (pci_channel_offline(np->s.device))
+		return IRQ_HANDLED;
+
 	if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("[");
 
 	spin_lock_irqsave(np->s.host->host_lock, flags);
@@ -726,6 +730,20 @@ static int sym_eh_handler(int op, char *
 
 	dev_warn(&cmd->device->sdev_gendev, "%s operation started.\n", opname);
 
+	/* We may be in an error condition because the PCI bus
+	 * went down. In this case, we need to wait until the
+	 * PCI bus is reset, the card is reset, and only then
+	 * proceed with the scsi error recovery. There's no
+	 * point in hurrying; take a leisurely wait.
+	 */
+#define WAIT_FOR_PCI_RECOVERY	35
+	if (pci_channel_offline(np->s.device)) {
+		int finished_reset = wait_for_completion_timeout(
+			&np->s.io_reset_wait, WAIT_FOR_PCI_RECOVERY*HZ);
+		if (!finished_reset)
+			return SCSI_FAILED;
+	}
+
 	spin_lock_irq(host->host_lock);
 	/* This one is queued in some place -> to wait for completion */
 	FOR_EACH_QUEUED_ELEMENT(&np->busy_ccbq, qp) {
@@ -1510,6 +1528,7 @@ static struct Scsi_Host * __devinit sym_
 	np->maxoffs	= dev->chip.offset_max;
 	np->maxburst	= dev->chip.burst_max;
 	np->myaddr	= dev->host_id;
+	init_completion(&np->s.io_reset_wait);
 
 	/*
 	 *  Edit its name.
@@ -1948,6 +1967,116 @@ static void __devexit sym2_remove(struct
 	attach_count--;
 }
 
+/**
+ * sym2_io_error_detected() -- called when PCI error is detected
+ * @pdev: pointer to PCI device
+ * @state: current state of the PCI slot
+ */
+static pci_ers_result_t sym2_io_error_detected(struct pci_dev *pdev,
+					enum pci_channel_state state)
+{
+	struct sym_hcb *np = pci_get_drvdata(pdev);
+
+	/* If slot is permanently frozen, turn everything off */
+	if (state == pci_channel_io_perm_failure) {
+		sym2_remove(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	init_completion(&np->s.io_reset_wait);
+	disable_irq(pdev->irq);
+	pci_disable_device(pdev);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * sym2_reset_workarounds -- hardware-specific work-arounds
+ *
+ * This routine is similar to sym_set_workarounds(), except
+ * that, at this point, we already know that the device was
+ * successfully initialized at least once before, and so most
+ * of the steps taken there are un-needed here.
+ */
+static void sym2_reset_workarounds(struct pci_dev *pdev)
+{
+	u_char revision;
+	u_short status_reg;
+	struct sym_chip *chip;
+
+	pci_read_config_byte(pdev, PCI_CLASS_REVISION, &revision);
+	chip = sym_lookup_chip_table(pdev->device, revision);
+
+	/* Work around for errant bit in 895A, in a fashion
+	 * similar to what is done in sym_set_workarounds().
+	 */
+	pci_read_config_word(pdev, PCI_STATUS, &status_reg);
+	if (!(chip->features & FE_66MHZ) && (status_reg & PCI_STATUS_66MHZ)) {
+		status_reg = PCI_STATUS_66MHZ;
+		pci_write_config_word(pdev, PCI_STATUS, status_reg);
+		pci_read_config_word(pdev, PCI_STATUS, &status_reg
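The gate that this patch adds to sym_eh_handler() (wait up to WAIT_FOR_PCI_RECOVERY seconds for the slot reset, and fail the SCSI error handler if the wait times out) can be modelled in user space as below. This is a sketch, not driver code: MODEL_HZ is an assumed tick rate, and the enum and function names are hypothetical. The wait_result argument stands in for the return value of wait_for_completion_timeout(), which is 0 on timeout and positive when the completion fired.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_HZ              250  /* assumed tick rate; the real HZ is a kernel config choice */
#define WAIT_FOR_PCI_RECOVERY 35   /* seconds, as in the patch */

/* Hypothetical outcome of the gate in front of SCSI error recovery. */
enum eh_gate { EH_GATE_PROCEED, EH_GATE_FAILED };

/* Timeout handed to the wait, in ticks (WAIT_FOR_PCI_RECOVERY*HZ in the patch). */
static long eh_wait_ticks(void)
{
	return (long)WAIT_FOR_PCI_RECOVERY * MODEL_HZ;
}

/*
 * wait_result models wait_for_completion_timeout():
 * 0 means the timeout expired, a positive value means the reset finished.
 * It is only consulted when the PCI channel is offline.
 */
static enum eh_gate eh_gate_decide(bool channel_offline, long wait_result)
{
	assert(wait_result >= 0);
	if (!channel_offline)
		return EH_GATE_PROCEED;   /* ordinary SCSI error, no PCI wait needed */
	return wait_result == 0 ? EH_GATE_FAILED : EH_GATE_PROCEED;
}
```

With the assumed tick rate the wait is 8750 ticks; only an offline channel whose reset never completes ends up in the failed state.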
Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
On Wed, May 09, 2007 at 03:26:21PM -0500, Linas Vepstas wrote:

Hi Matthew, I had been hoping these patches might make it into 2.6.22, ... this is a nag note; please forward upstream.

... should I repost the patches?

--linas
Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
Hi Matthew,

I had been hoping these patches might make it into 2.6.22, ... this is a nag note; please forward upstream.

--linas

On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:

Implement the so-called first failure data capture (FFDC) for the symbios PCI error recovery. After a PCI error event is reported, the driver requests that MMIO be enabled. Once enabled, it then reads and dumps assorted status registers, and concludes by requesting the usual reset sequence. (includes a whitespace fix for bad indentation).

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
[PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure
Implement the so-called first failure data capture (FFDC) for the symbios PCI error recovery. After a PCI error event is reported, the driver requests that MMIO be enabled. Once enabled, it then reads and dumps assorted status registers, and concludes by requesting the usual reset sequence. (includes a whitespace fix for bad indentation).

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

 drivers/scsi/sym53c8xx_2/sym_glue.c |   15 +++
 drivers/scsi/sym53c8xx_2/sym_glue.h |    1 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |   18 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.c 2007-04-20 12:52:01.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c 2007-04-20 15:25:35.0 -0500
@@ -1987,6 +1987,20 @@ static pci_ers_result_t sym2_io_error_de
 	disable_irq(pdev->irq);
 	pci_disable_device(pdev);
 
+	/* Request that MMIO be enabled, so register dump can be taken. */
+	return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+/**
+ * sym2_io_slot_dump -- Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+static pci_ers_result_t sym2_io_slot_dump (struct pci_dev *pdev)
+{
+	struct sym_hcb *np = pci_get_drvdata(pdev);
+
+	sym_dump_registers(np);
+
 	/* Request a slot reset. */
 	return PCI_ERS_RESULT_NEED_RESET;
 }
@@ -2241,6 +2255,7 @@ MODULE_DEVICE_TABLE(pci, sym2_id_table);
 
 static struct pci_error_handlers sym2_err_handler = {
 	.error_detected = sym2_io_error_detected,
+	.mmio_enabled = sym2_io_slot_dump,
 	.slot_reset = sym2_io_slot_reset,
 	.resume = sym2_io_resume,
 };
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.h 2007-04-20 12:15:07.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h 2007-04-20 15:21:31.0 -0500
@@ -270,5 +270,6 @@ void sym_xpt_async_bus_reset(struct sym_
 void sym_xpt_async_sent_bdr(struct sym_hcb *np, int target);
 int  sym_setup_data_and_start (struct sym_hcb *np, struct scsi_cmnd *csio, struct sym_ccb *cp);
 void sym_log_bus_error(struct sym_hcb *np);
+void sym_dump_registers(struct sym_hcb *np);
 
 #endif /* SYM_GLUE_H */
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c 2007-04-20 12:18:59.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c 2007-04-20 15:18:01.0 -0500
@@ -1180,10 +1180,10 @@ static void sym_log_hard_error(struct sy
 			scr_to_cpu((int) *(u32 *)(script_base + script_ofs)));
 	}
 
-        printf ("%s: regdump:", sym_name(np));
-        for (i=0; i<24;i++)
-            printf (" %02x", (unsigned)INB_OFF(np, i));
-        printf (".\n");
+	printf ("%s: regdump:", sym_name(np));
+	for (i=0; i<24;i++)
+		printf (" %02x", (unsigned)INB_OFF(np, i));
+	printf (".\n");
 
 	/*
 	 *  PCI BUS error.
@@ -1192,6 +1192,16 @@ static void sym_log_hard_error(struct sy
 		sym_log_bus_error(np);
 }
 
+void sym_dump_registers(struct sym_hcb *np)
+{
+	u_short sist;
+	u_char dstat;
+
+	sist = INW(np, nc_sist);
+	dstat = INB(np, nc_dstat);
+	sym_log_hard_error(np, sist, dstat);
+}
+
 static struct sym_chip sym_dev_table[] = {
  {PCI_DEVICE_ID_NCR_53C810, 0x0f, "810", 4, 8, 4, 64,
  FE_ERL}
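The callback order this patch relies on (error_detected asks for MMIO, mmio_enabled dumps registers and asks for a reset, slot_reset reports recovery, resume restarts traffic) can be walked through with a small state model. This is a deliberately simplified sketch of the recovery sequence, with hypothetical MODEL_ and STEP_ names standing in for the kernel's PCI_ERS_RESULT_* values and callbacks; any transition not listed here is treated as giving up, which the real recovery core does not necessarily do.

```c
#include <assert.h>

/* Hypothetical stand-ins for the PCI_ERS_RESULT_* return codes. */
enum model_result {
	MODEL_CAN_RECOVER,
	MODEL_NEED_RESET,
	MODEL_RECOVERED,
	MODEL_DISCONNECT
};

/* Which driver callback the recovery core invokes next. */
enum model_step {
	STEP_ERROR_DETECTED,
	STEP_MMIO_ENABLED,
	STEP_SLOT_RESET,
	STEP_RESUME,
	STEP_GIVE_UP
};

/* Given the callback that just ran and what it returned, pick the next one. */
static enum model_step model_next_step(enum model_step cur, enum model_result r)
{
	switch (cur) {
	case STEP_ERROR_DETECTED:
		if (r == MODEL_CAN_RECOVER)
			return STEP_MMIO_ENABLED;   /* FFDC path: MMIO first */
		if (r == MODEL_NEED_RESET)
			return STEP_SLOT_RESET;     /* pre-FFDC behaviour */
		return STEP_GIVE_UP;
	case STEP_MMIO_ENABLED:
		if (r == MODEL_NEED_RESET)
			return STEP_SLOT_RESET;     /* registers dumped, now reset */
		if (r == MODEL_RECOVERED)
			return STEP_RESUME;
		return STEP_GIVE_UP;
	case STEP_SLOT_RESET:
		if (r == MODEL_RECOVERED)
			return STEP_RESUME;
		return STEP_GIVE_UP;
	default:
		return STEP_GIVE_UP;
	}
}
```

Feeding it the return values used in the patch gives error_detected, mmio_enabled, slot_reset, resume, in that order.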
[PATCH] lpfc: avoid double-free during PCI error failure
Bino, James,

Please review, sign-off and forward upstream.

--linas

If a PCI error is detected that cannot be recovered from, there will be a double call of lpfc_pci_remove_one(), with the second call resulting in a null-pointer dereference. The first call occurs in lpfc_io_error_detected(), and the second call during pci device remove. This patch eliminates the first call; it's un-needed.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

 drivers/scsi/lpfc/lpfc_init.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git16.orig/drivers/scsi/lpfc/lpfc_init.c 2007-03-08 15:57:40.0 -0600
+++ linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c 2007-03-08 16:03:18.0 -0600
@@ -1817,10 +1817,9 @@ static pci_ers_result_t lpfc_io_error_de
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring *pring;
 
-	if (state == pci_channel_io_perm_failure) {
-		lpfc_pci_remove_one(pdev);
+	if (state == pci_channel_io_perm_failure)
 		return PCI_ERS_RESULT_DISCONNECT;
-	}
+
 	pci_disable_device(pdev);
 	/*
 	 * There may be I/Os dropped by the firmware.
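The double call described above can be reproduced in a tiny user-space model. Everything here is hypothetical (struct model_dev, the model_ functions); the only detail borrowed from the driver is that the remove routine clears the driver data pointer, so a second call finds NULL where the real code would dereference it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device and its driver data. */
struct model_dev {
	void *drvdata;     /* what pci_get_drvdata() would return */
	int null_derefs;   /* times remove ran with drvdata already gone */
};

/* Models the remove routine: tears down and clears the driver data. */
static void model_remove_one(struct model_dev *d)
{
	if (d->drvdata == NULL) {
		d->null_derefs++;      /* the real driver would oops here */
		return;
	}
	d->drvdata = NULL;             /* mirrors pci_set_drvdata(pdev, NULL) */
}

/* Before the patch: the error handler removes the device itself. */
static void model_error_detected_old(struct model_dev *d)
{
	model_remove_one(d);
}

/* After the patch: the handler only reports the failure. */
static void model_error_detected_new(struct model_dev *d)
{
	(void)d;
}

/*
 * Permanent-failure path: the handler runs, then the PCI core removes
 * the device. Returns how many would-be NULL dereferences occurred.
 */
static int model_perm_failure(void (*handler)(struct model_dev *))
{
	int token = 0;
	struct model_dev d = { &token, 0 };

	assert(handler != NULL);
	handler(&d);
	model_remove_one(&d);          /* removal by the core */
	return d.null_derefs;
}
```

The old handler yields one bad dereference and the new one yields none, which is the whole effect of dropping the call.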
[PATCH] lpfc: add PCI error recovery support
James,

Please review and forward upstream. This is a patch I'd previously submitted, and reworked by [EMAIL PROTECTED] in January. Not clear if I need to also nag James Smart (who is listed as the maintainer) for an Acked-by (which I am led to believe should be forthcoming? Ahh the joys of indirect communication!)

--linas

This patch adds PCI Error recovery support to the Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver. Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]

 drivers/scsi/lpfc/lpfc_init.c |   97 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 109 insertions(+)

Index: linux-2.6.20-git4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git4.orig/drivers/scsi/lpfc/lpfc_init.c 2007-02-09 17:22:30.0 -0600
+++ linux-2.6.20-git4/drivers/scsi/lpfc/lpfc_init.c 2007-02-14 14:12:22.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring  *pring;
 	uint32_t event_data;
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
 
 	if (phba->work_hs & HS_FFER6 ||
 	    phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,92 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring  *pring;
+
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re-establishing link.
+	 */
+	pring = &psli->ring[psli->fcp_ring];
+	lpfc_sli_abort_iocb_ring(phba, pring);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev *pdev)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	int bars = pci_select_bars(pdev, IORESOURCE_MEM);
+
+	dev_printk(KERN_INFO, &pdev->dev, "recovering from a slot reset.\n");
+	if (pci_enable_device_bars(pdev, bars)) {
+		printk(KERN_ERR "lpfc: Cannot re-enable "
+			"PCI device after reset.\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	pci_set_master(pdev);
+
+	/* Re-establishing Link */
+	spin_lock_irq(phba->host->host_lock);
+	phba->fc_flag |= FC_ESTABLISH_LINK;
+	psli->sli_flag &= ~LPFC_SLI2_ACTIVE;
+	spin_unlock_irq(phba->host->host_lock);
+
+
+	/* Take device offline; this will perform cleanup */
+	lpfc_offline(phba);
+	lpfc_sli_brdrestart(phba);
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * lpfc_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * it's OK to resume normal operation.
+ */
+static void lpfc_io_resume(struct pci_dev *pdev)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+
+	if (lpfc_online(phba) == 0) {
+		mod_timer(&phba->fc_estabtmo, jiffies + HZ * 60);
+	}
+}
+
 static struct pci_device_id lpfc_id_table[] = {
 	{PCI_VENDOR_ID_EMULEX, PCI_DEVICE_ID_VIPER,
 		PCI_ANY_ID, PCI_ANY_ID, },
@@ -1857,11 +1947,18 @@ static struct pci_device_id lpfc_id_tabl
 MODULE_DEVICE_TABLE(pci, lpfc_id_table);
 
+static struct pci_error_handlers lpfc_err_handler = {
+	.error_detected = lpfc_io_error_detected,
+	.slot_reset = lpfc_io_slot_reset,
+	.resume = lpfc_io_resume,
+};
+
 static struct pci_driver lpfc_driver = {
 	.name = LPFC_DRIVER_NAME
Re: lpfc PCIe error recovery
On Wed, Jan 10, 2007 at 04:59:39PM -0600, linas wrote:

However, on a Power4 architecture there are errors reported in upper layer (we discussed this in one of earlier emails) followed by SCSI errors. I'm trying to investigate now.

I found two distinct power4 bugs. I posted a patch for one yesterday, under the subject heading

[PATCH] Urgent: powerpc 2.6.20-rc4 dma broken on non-LPAR pseries

This affects only recent mainline kernels; it would not affect older or distro kernels.

The other patch is attached below. After some more testing, I'll submit to mainline.

--linas

Subject: [PATCH] pSeries: EEH improperly enabled for some Power4 systems

It appears that EEH is improperly enabled for some Power4 systems. On these systems, the ibm,set-eeh-option returns a value of success even when EEH is not supported on the given node. Thus, an explicit check for support is required.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]

 arch/powerpc/platforms/pseries/eeh.c |   19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6.20-rc4/arch/powerpc/platforms/pseries/eeh.c
===
--- linux-2.6.20-rc4.orig/arch/powerpc/platforms/pseries/eeh.c 2007-01-11 14:15:02.0 -0600
+++ linux-2.6.20-rc4/arch/powerpc/platforms/pseries/eeh.c 2007-01-11 15:14:39.0 -0600
@@ -748,6 +748,7 @@ struct eeh_early_enable_info {
 /* Enable eeh for the given device node. */
 static void *early_enable_eeh(struct device_node *dn, void *data)
 {
+	unsigned int rets[3];
 	struct eeh_early_enable_info *info = data;
 	int ret;
 	const char *status = get_property(dn, "status", NULL);
@@ -804,16 +805,14 @@ static void *early_enable_eeh(struct dev
 		                regs[0], info->buid_hi, info->buid_lo,
 		                EEH_ENABLE);
 
+		enable = 0;
 		if (ret == 0) {
-			eeh_subsystem_enabled = 1;
-			pdn->eeh_mode |= EEH_MODE_SUPPORTED;
 			pdn->eeh_config_addr = regs[0];
 
 			/* If the newer, better, ibm,get-config-addr-info is supported,
 			 * then use that instead. */
 			pdn->eeh_pe_config_addr = 0;
 			if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
-				unsigned int rets[2];
 				ret = rtas_call (ibm_get_config_addr_info, 4, 2, rets,
 					pdn->eeh_config_addr,
 					info->buid_hi, info->buid_lo,
@@ -821,6 +820,20 @@ static void *early_enable_eeh(struct dev
 				if (ret == 0)
 					pdn->eeh_pe_config_addr = rets[0];
 			}
+
+			/* Some older systems (Power4) allow the
+			 * ibm,set-eeh-option call to succeed even on nodes
+			 * where EEH is not supported. Verify support
+			 * explicitly. */
+			ret = read_slot_reset_state(pdn, rets);
+			if ((ret == 0) && (rets[1] == 1))
+				enable = 1;
+		}
+
+		if (enable) {
+			eeh_subsystem_enabled = 1;
+			pdn->eeh_mode |= EEH_MODE_SUPPORTED;
+
 #ifdef DEBUG
 			printk(KERN_DEBUG "EEH: %s: eeh enabled, config=%x pe_config=%x\n",
 			       dn->full_name, pdn->eeh_config_addr, pdn->eeh_pe_config_addr);
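The decision this patch introduces is a two-step check: a successful ibm,set-eeh-option call is no longer enough, the slot state query must also succeed and report support in rets[1]. A user-space sketch of just that predicate follows; eeh_model_should_enable() is a hypothetical name, and its arguments stand in for the RTAS return codes and the rets[] array used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * set_option_ret:  return code of the ibm,set-eeh-option call (0 = success)
 * reset_state_ret: return code of the slot state query (0 = success)
 * rets:            values returned by the query; the patch treats
 *                  rets[1] == 1 as "EEH supported on this node"
 */
static bool eeh_model_should_enable(int set_option_ret, int reset_state_ret,
				    const unsigned int rets[3])
{
	assert(rets != NULL);
	if (set_option_ret != 0)
		return false;          /* enabling failed outright */
	/* Power4 quirk: success above is not trusted without this check. */
	return reset_state_ret == 0 && rets[1] == 1;
}
```

On the affected Power4 systems the first call returns success but rets[1] is not 1, so the node is left with EEH disabled.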
Bug: 2.6.20 scsi/block device/elevator recursion loop
Hi,

On Thu, Jan 11, 2007 at 04:22:52PM -0500, [EMAIL PROTECTED] wrote:

This patch is present in upstream and is also present in 2.6.20. So this is a new issue.

What was the patch last time around?

It seems I'm seeing this more often than expected. The first time, the system spewed the softlockup error, but then recovered after a few minutes. This time, even after an hour, the system remained hung. It was pingable, but the console, and all ssh sessions were unresponsive. After hitting the little yellow button, I got a stack trace (below) in _spin_unlock_irqrestore, which makes me think that perhaps the system was being flooded with irq's.

I'll try to investigate further tomorrow.

--linas

Background:
kernel 2.6.20-rc4
IBM Power4 pSeries (630)
lpfc scsi (Emulex)

chsysstate -r sys -n io-raiders -o reset

io-raiders:~ #
cpu 0x0: Vector: 100 (System Reset) at [c0003ff69520]
    pc: c023d794: ._raw_spin_unlock+0xb4/0xd4
    lr: c046d5ac: ._spin_unlock_irqrestore+0x18/0x3c
    sp: c0003ff697a0
   msr: 90009032
  current = 0xc43e21f0
  paca    = 0xc0674080
    pid   = 1123, comm = kblockd/0
enter ? for help
[c0003ff69820] c046d5ac ._spin_unlock_irqrestore+0x18/0x3c
[c0003ff698b0] c021bbe0 .blk_run_queue+0xc8/0xec
[c0003ff69950] c0320728 .scsi_run_queue+0x248/0x278
[c0003ff69a00] c0321948 .scsi_queue_insert+0x88/0xa8
[c0003ff69a90] c031bc34 .scsi_dispatch_cmd+0x2b8/0x2e4
[c0003ff69b30] c0322804 .scsi_request_fn+0x2c4/0x3c0
[c0003ff69be0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff69c60] c0216d6c .elv_insert+0x240/0x268
[c0003ff69d00] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff69d90] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff69e40] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff69ec0] c0216d6c .elv_insert+0x240/0x268
[c0003ff69f60] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff69ff0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a0a0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a120] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a1c0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a250] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a300] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a380] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a420] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a4b0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a560] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a5e0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a680] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a710] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a7c0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a840] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a8e0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a970] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6aa20] c021bbac .blk_run_queue+0x94/0xec
[c0003ff6aac0] c0320728 .scsi_run_queue+0x248/0x278
[c0003ff6ab70] c0321948 .scsi_queue_insert+0x88/0xa8
[c0003ff6ac00] c031bc34 .scsi_dispatch_cmd+0x2b8/0x2e4
[c0003ff6aca0] c0322804 .scsi_request_fn+0x2c4/0x3c0
[c0003ff6ad50] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6add0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6ae70] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6af00] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6afb0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b030] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b0d0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b160] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b210] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b290] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b330] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b3c0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b470] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b4f0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b590] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b620] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b6d0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b750] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b7f0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b880] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b930] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b9b0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6ba50] c021a25c .blk_requeue_request+0x38/0x54
Re: lpfc PCIe error recovery
On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:

Hi Linas, Following is the latest lpfc driver patch we are testing in the Emulex lab for PCI error recovery. This patch looks good on a Power5 platform.

Yes, it seemed to survive a few hours of testing fine.

I did see one interesting thing, namely a softlockup. I attribute this to the fact that I'd queued up a lot of heavy file i/o, issued a sync, which typically takes more than a few seconds on the test system, and then injected the artificial PCI error. After about ten seconds, I got the softlockup, but after another 10-20 seconds, things seemed back to normal. So I don't consider this an actual error, but thought it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
    LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

However, on a Power4 architecture there are errors reported in upper layer (we discussed this in one of earlier emails) followed by SCSI errors. I'm trying to investigate now.
The patch you sent out got garbled, so I'm reposting below.

This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]

 drivers/scsi/lpfc/lpfc_init.c |   96 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c	2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c	2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring  *pring;
 	uint32_t event_data;
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
 
 	if (phba->work_hs & HS_FFER6 ||
 	    phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring  *pring;
+
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re-establishing link.
+	 */
+	pring = &psli->ring[psli->fcp_ring];
+	lpfc_sli_abort_iocb_ring(phba, pring);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev
crash on lpfc rmmod
Hi Bino,

Fiddling with the lpfc driver on 2.6.20-rc4, shortly after booting,
I attempted to rmmod the lpfc module and got a crash:

io-raiders:~ # rmmod lpfc
cpu 0x0: Vector: 300 (Data Access) at [c003c86075a0]
    pc: d08d0988: .lpfc_free_sysfs_attr+0x1c/0x58 [lpfc]
    lr: d08c458c: .lpfc_pci_remove_one+0x3c/0x278 [lpfc]
    sp: c003c8607820
   msr: 90009032
   dar: 11c0
 dsisr: 4000
  current = 0xc003bf4b4c80
  paca    = 0xc0674080
    pid   = 12977, comm = rmmod
[ 3005.329608] [ cut here ]

at which point the system locked up hard (I was expecting it to go
into xmon).

Suggestions?

--linas
[PATCH] lpfc: add PCI error recovery support
James,

Please review the patch below. Presuming that you like it,
please forward upstream.

--linas

This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]

 drivers/scsi/lpfc/lpfc_init.c |   91 ++
 1 file changed, 91 insertions(+)

Index: linux-2.6.19-git7/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.19-git7.orig/drivers/scsi/lpfc/lpfc_init.c	2006-12-06 13:31:39.0 -0600
+++ linux-2.6.19-git7/drivers/scsi/lpfc/lpfc_init.c	2006-12-06 13:33:49.0 -0600
@@ -517,6 +517,11 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli_ring  *pring;
 	uint32_t event_data;
 
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
+
 	if (phba->work_hs & HS_FFER6) {
 		/* Re-establishing Link */
 		lpfc_printf_log(phba, KERN_INFO, LOG_LINK_EVENT,
@@ -1825,6 +1830,85 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev *pdev)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring  *pring;
+
+	dev_printk(KERN_INFO, &pdev->dev, "recovering from a slot reset.\n");
+	if (pci_enable_device(pdev)) {
+		printk(KERN_ERR "lpfc: Cannot re-enable PCI device after reset.\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+
+	pci_set_master(pdev);
+
+	/* Re-establishing Link */
+	spin_lock_irq(phba->host->host_lock);
+	phba->fc_flag |= FC_ESTABLISH_LINK;
+	psli->sli_flag &= ~LPFC_SLI2_ACTIVE;
+	spin_unlock_irq(phba->host->host_lock);
+
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re-establishing link.
+	 */
+	pring = &psli->ring[psli->fcp_ring];
+	lpfc_sli_abort_iocb_ring(phba, pring);
+
+	/* Take device offline; this will perform cleanup */
+	lpfc_offline(phba);
+	lpfc_sli_brdrestart(phba);
+
+	return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * lpfc_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * it's OK to resume normal operation.
+ */
+static void lpfc_io_resume(struct pci_dev *pdev)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+
+	lpfc_online(phba);
+	mod_timer(&phba->fc_estabtmo, jiffies + HZ * 60);
+}
+
 static struct pci_device_id lpfc_id_table[] = {
 	{PCI_VENDOR_ID_EMULEX, PCI_DEVICE_ID_VIPER,
 		PCI_ANY_ID, PCI_ANY_ID, },
@@ -1885,11 +1969,18 @@ static struct pci_device_id lpfc_id_tabl
 
 MODULE_DEVICE_TABLE(pci, lpfc_id_table);
 
+static struct pci_error_handlers lpfc_err_handler = {
+	.error_detected = lpfc_io_error_detected,
+	.slot_reset = lpfc_io_slot_reset,
+	.resume = lpfc_io_resume,
+};
+
 static struct pci_driver lpfc_driver = {
 	.name		= LPFC_DRIVER_NAME,
 	.id_table	= lpfc_id_table,
 	.probe		= lpfc_pci_probe_one,
 	.remove		= __devexit_p(lpfc_pci_remove_one),
+	.err_handler	= &lpfc_err_handler,
 };
 
 static int __init
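The order in which the three callbacks in the patch above get invoked can be modelled in user space. What follows is only a sketch under stated assumptions: every name in it (`mock_dev`, `mock_err_handlers`, `mock_recover`, the `CHAN_*` and `ERS_*` enums) is a hypothetical stand-in invented for illustration, not the kernel's `struct pci_error_handlers` API, and `mock_recover` merely plays the part of the platform recovery code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the channel states and result codes. */
enum chan_state { CHAN_FROZEN, CHAN_PERM_FAILURE };
enum ers_result { ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

/* Toy device: just enough state to watch the recovery sequence. */
struct mock_dev {
	bool enabled;        /* models pci_enable_device()/pci_disable_device() */
	bool online;         /* models lpfc_online()/lpfc_offline() */
	bool removed;        /* models lpfc_pci_remove_one() */
	bool reenable_fails; /* test knob: re-enable after reset fails */
};

/* Same shape as the handler table in the patch, with mock types. */
struct mock_err_handlers {
	enum ers_result (*error_detected)(struct mock_dev *, enum chan_state);
	enum ers_result (*slot_reset)(struct mock_dev *);
	void (*resume)(struct mock_dev *);
};

static enum ers_result mock_error_detected(struct mock_dev *d,
					   enum chan_state s)
{
	if (s == CHAN_PERM_FAILURE) {
		/* Nothing left to recover: tear the device down. */
		d->removed = true;
		d->online = false;
		d->enabled = false;
		return ERS_DISCONNECT;
	}
	d->enabled = false;
	return ERS_NEED_RESET; /* ask the platform for a slot reset */
}

static enum ers_result mock_slot_reset(struct mock_dev *d)
{
	if (d->reenable_fails)
		return ERS_DISCONNECT;
	d->enabled = true;
	d->online = false; /* taken offline for cleanup before restart */
	return ERS_RECOVERED;
}

static void mock_resume(struct mock_dev *d)
{
	assert(d->enabled); /* resume only makes sense after a good reset */
	d->online = true;
}

static const struct mock_err_handlers mock_handlers = {
	.error_detected = mock_error_detected,
	.slot_reset = mock_slot_reset,
	.resume = mock_resume,
};

/* Plays the platform's role. Returns true if the device came back. */
bool mock_recover(struct mock_dev *d, enum chan_state s)
{
	if (mock_handlers.error_detected(d, s) != ERS_NEED_RESET)
		return false;
	/* The platform would reset the slot here. */
	if (mock_handlers.slot_reset(d) != ERS_RECOVERED)
		return false;
	mock_handlers.resume(d);
	return d->online;
}
```

Calling `mock_recover()` on a device with a frozen channel should walk all three steps and leave it online; a permanent failure, or a failed re-enable, stops the sequence early and `resume` is never reached.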
Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
> Linas Vepstas wrote:
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error. The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise and
> > then lower the #RST pci signal line for 1/4 second, then wait 2 seconds
> > for the PCI bus to settle, then restore the PCI config space registers
> > (BARs, interrupt line, etc) to what they used to be. Then, I call
> > sym_start_up() in an attempt to get the symbios card working again.
> >
> > And that's where I get stuck ...
> >
> > My assumption is that after the #RST, that the symbios card will sit
> > there, dumb and stupid, with no scripts running. But sometimes I find
> > that the card has done something to make the PCI error hardware trip
> > again. Typically, this means that the card attempted to DMA to some
> > address that it's not allowed to touch, or raised #SERR or possibly
> > #PERR (I can't tell which).
>
> What config registers are you restoring?

BARs, grant, latency, interrupt, cacheline size.

> Is it possible symbios does not like something in your config restore?

possibly...

> Another possibility is that asserting PCI reset is not cleanly resetting
> the card. Does PCI reset force BIST to be run on these cards? You could
> try to manually run BIST on the card after the PCI reset to see if that

I didn't see BIST in the code, but I wasn't looking for it either.
I could try that.

> helps, or you could try power cycling the slot instead of using PCI reset.

yes I could :( I'll try that next. Problem is, not all slots
are power-cyclable, only the hotplug slots are. I've discovered that,
for example, the ethernet chips are soldered to the motherboard, and
can't be power-cycled (but fortunately, those don't give me trouble).

--linas
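For reference, the "manually run BIST" suggestion amounts to setting the start bit in the BIST register of PCI config space (offset 0x0f: bit 7 = BIST capable, bit 6 = start BIST, bits 3:0 = completion code, 0 meaning pass) and polling until the device clears it. Below is a minimal user-space sketch of that handshake against a mock register. The names `mock_bist_dev`, `mock_read_bist`, `mock_write_bist` and `run_bist` are hypothetical, and nothing here is taken from the sym2 driver; a real driver would go through config-space accessors and bound the wait by time, not by a poll count.

```c
#include <assert.h>
#include <stdint.h>

#define BIST_CAPABLE 0x80u /* bit 7: device implements BIST */
#define BIST_START   0x40u /* bit 6: write 1 to start, device clears it */
#define BIST_CODE    0x0fu /* bits 3:0: completion code, 0 = pass */

/* Mock device: the BIST register plus knobs that drive the simulation. */
struct mock_bist_dev {
	uint8_t bist;        /* models the register at config offset 0x0f */
	unsigned polls_left; /* mock: reads needed before the test finishes;
			      * 0 means the self-test never completes */
	uint8_t result;      /* mock: completion code to report */
};

/* Reading the register advances the simulated self-test. */
uint8_t mock_read_bist(struct mock_bist_dev *d)
{
	assert(d);
	if ((d->bist & BIST_START) && d->polls_left > 0 &&
	    --d->polls_left == 0)
		d->bist = (uint8_t)(BIST_CAPABLE | (d->result & BIST_CODE));
	return d->bist;
}

/* Writing the start bit only has an effect on a BIST-capable device. */
void mock_write_bist(struct mock_bist_dev *d, uint8_t v)
{
	assert(d);
	if ((d->bist & BIST_CAPABLE) && (v & BIST_START))
		d->bist |= BIST_START;
}

/*
 * Returns 0 on pass, a positive device-specific code on failure,
 * -1 if the device does not implement BIST, -2 on timeout.
 */
int run_bist(struct mock_bist_dev *d, unsigned max_polls)
{
	uint8_t v = mock_read_bist(d);

	if (!(v & BIST_CAPABLE))
		return -1;

	mock_write_bist(d, BIST_START);
	for (unsigned i = 0; i < max_polls; i++) {
		v = mock_read_bist(d);
		if (!(v & BIST_START))
			return v & BIST_CODE;
	}
	return -2;
}
```

A nonzero return after the #RST would at least say whether the card itself thinks it came out of reset in a sane state, which is the question being asked above.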
Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
Hi,

There has been a running thread for a while on several mailing lists
concerning PCI bus error recovery. Very briefly, some architectures
have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
also new PCI-Express chips from Intel (and other vendors) and possibly
pa-risc and others). I've been trying to prototype error recovery.
I currently have ethernet and the IPR scsi driver working, but I am
having trouble with the symbios driver. I need help/advice ...

On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
> On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
> > I also want to do the symbios driver...
>
> FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.

My current hardware will halt all i/o to/from the symbios controller
upon detection of a PCI error. The recovery procedure that I am
currently using is to call system firmware (aka 'bios') to raise and
then lower the #RST pci signal line for 1/4 second, then wait 2 seconds
for the PCI bus to settle, then restore the PCI config space registers
(BARs, interrupt line, etc) to what they used to be. Then, I call
sym_start_up() in an attempt to get the symbios card working again.

And that's where I get stuck ...

My assumption is that after the #RST, that the symbios card will sit
there, dumb and stupid, with no scripts running. But sometimes I find
that the card has done something to make the PCI error hardware trip
again. Typically, this means that the card attempted to DMA to some
address that it's not allowed to touch, or raised #SERR or possibly
#PERR (I can't tell which).

Sometimes, I get the PCI error while the card is sitting there idly
after the #RST, but more often, I get the error in sym_chip_reset(),
immediately after the

    OUTB (nc_istat, SRST);

Any clue what this is about? Am I missing something? I'm rather
perplexed at this point, any clues/hints/suggestions are welcome.

--linas
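The save / reset / restore step described above can be sketched in user space as follows. This is an illustration only, under stated assumptions: `mock_cfg`, `cfg_snapshot`, `cfg_save`, `cfg_restore` and `mock_slot_reset` are hypothetical names and not kernel or firmware interfaces, config space is modelled as a plain array, and the 1/4 second #RST assertion and 2 second settle time are left out. The offsets are the standard type-0 header locations for the registers named in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CFG_DWORDS 64 /* 256 bytes of conventional PCI config space */

/* Byte offsets of the dwords restored after the reset. */
static const uint8_t saved_offsets[] = {
	0x04,                               /* command / status */
	0x0c,                               /* cache line size, latency timer */
	0x10, 0x14, 0x18, 0x1c, 0x20, 0x24, /* BAR0..BAR5 */
	0x3c,                               /* interrupt line/pin, min grant, max latency */
};
#define N_SAVED (sizeof(saved_offsets) / sizeof(saved_offsets[0]))

struct mock_cfg { uint32_t reg[CFG_DWORDS]; };
struct cfg_snapshot { uint32_t val[N_SAVED]; };

/* Record the registers of interest before the slot is reset. */
void cfg_save(const struct mock_cfg *c, struct cfg_snapshot *s)
{
	for (size_t i = 0; i < N_SAVED; i++) {
		assert(saved_offsets[i] / 4 < CFG_DWORDS);
		s->val[i] = c->reg[saved_offsets[i] / 4];
	}
}

/*
 * Mock of what #RST does to the device: everything the OS programmed
 * is lost, while the read-only vendor/device ID dword survives.
 */
void mock_slot_reset(struct mock_cfg *c)
{
	uint32_t id = c->reg[0];

	memset(c, 0, sizeof(*c));
	c->reg[0] = id;
}

/*
 * Write the saved values back. Real hardware treats the error bits in
 * the status half of dword 0x04 as write-one-to-clear; this mock
 * ignores that detail and just copies the dword.
 */
void cfg_restore(struct mock_cfg *c, const struct cfg_snapshot *s)
{
	for (size_t i = 0; i < N_SAVED; i++)
		c->reg[saved_offsets[i] / 4] = s->val[i];
}
```

Only after `cfg_restore()` would something like sym_start_up() be called; if a restored BAR or the command dword differs from what the card had before the error, the first access or DMA afterwards is the obvious candidate for re-tripping the error hardware.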