Re: [patch 02/17] PCI Error Recovery: Symbios SCSI base support

2007-10-04 Thread Linas Vepstas
On Tue, Oct 02, 2007 at 03:49:26PM -0600, Matthew Wilcox wrote:
 On Tue, Oct 02, 2007 at 02:38:00PM -0700, [EMAIL PROTECTED] wrote:
  From: Linas Vepstas [EMAIL PROTECTED]
  
  Various PCI bus errors can be signaled by newer PCI controllers.  This
  patch adds the PCI error recovery callbacks to the Symbios SCSI device
  driver.  The patch has been tested, and appears to work well.
 
 Linas and I have been discussing the problems with this patch.  I think
 we have a solution; we certainly have something in my tree that's
 acceptable to me; he'd just like to test it before it's unleashed on the
 world.

Matthew, your fix was a patch on top of my patch ... I assume you want
to submit it that way, instead of reworking this patch?  

Anyway, I finally got a chance to run it yesterday, it worked fine.  
I'll try to make final comments in the other thread.

--linas
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-10-04 Thread Linas Vepstas
On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:
 
 The thing to remember is that sym2 is in transition from being a dual
 BSD/Linux driver to being a purely Linux driver. 

I was wondering about that; couldn't tell if the split in the code
was historical, or being intentionally maintained.

  My gut instinct is to say ack, although prudence dictates that 
  I should test first. Which might take a few days...
 
 Fine by me.  

I tested the patch, it worked great. It also seemed to recover 
much more quickly -- so quickly, in fact, that I thought something 
had gone wrong.

I reviewed it one more time, it really does look good. A formal
submission and acked by's at earliest convenience would be good. 

--linas



Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-10-02 Thread Linas Vepstas
On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:
 
 Fine by me.  Do you have the ability to produce failures on a whim on
 your platforms?  

Yes, although it is very platform specific -- there are actually
transistors in the pci bridge chip, which actually short out lines,
and so, from the point of view of the rest of the chip, it did
actually see a real error. It's supposed to be a very realistic 
test.

 I've been vaguely musing a PCI device failure patch for
 x86, just so people can test driver failure paths.

That would be good ... I've recently agreed to accept a fedex
to test someone else's card for them, which is outside my usual
activities.

There's also supposed to be some PCI-X riser card out there, 
(never seen one) which has the ability to inject actual pci 
errors. It's the Agilent PCI BestX card; I got the impression 
they might not sell it anymore; dunno.

One guy in the lab used to brush a grounding strap across
the pins; this usually got a rise out of the audience.

--linas



Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-10-01 Thread Linas Vepstas
On Mon, Oct 01, 2007 at 02:12:47PM -0600, Matthew Wilcox wrote:
 
 I think the fundamental problem is that completions aren't really
 supposed to be used like this.  Here's one attempt at using completions
 perhaps a little more the way they're supposed to be used, 

Yes, that looks very good to me.  I see it solves a bug that
I hadn't been quite aware of. I don't understand why 
struct host_data is preferable to struct sym_shcb (is it because 
this is the structure that is naturally protected by the 
spinlock?)

My gut instinct is to say ack, although prudence dictates that 
I should test first. Which might take a few days...

 although now
 I've written it, I wonder if we shouldn't just use a waitqueue instead.

I thought that earlier versions of the driver used waitqueues (I vaguely
remember eh_wait in the code), which were later converted to 
completions (I also vaguely recall thinking that the new code was
more elegant/simpler). I converted my patch to use the completions 
likewise, and, as you've clearly shown, did a rather sloppy job in 
the conversion.

I'm tempted to go with this patch; but if you prod, I could attempt
a wait-queue based patch.
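
For readers following the completion-versus-waitqueue question, here is a minimal userspace analogy of "wait for an event, but give up after a timeout". It is a sketch under stated assumptions, not the driver's code: the `sketch_*` names are hypothetical, and POSIX threads stand in for the kernel primitives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Userspace stand-in for a completion (hypothetical, not the kernel type). */
struct completion_sketch {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
};

static void sketch_init(struct completion_sketch *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Signal that the awaited event (for example, a finished reset) happened. */
static void sketch_complete(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_broadcast(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* Wait up to timeout_ms; true if the event happened, false on timeout. */
static bool sketch_wait_timeout(struct completion_sketch *c, long timeout_ms)
{
    struct timespec deadline;
    int rc = 0;
    bool done;

    assert(timeout_ms >= 0);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&c->lock);
    /* Re-check the flag after every wakeup; stop on timeout or error. */
    while (!c->done && rc == 0)
        rc = pthread_cond_timedwait(&c->cond, &c->lock, &deadline);
    done = c->done;
    pthread_mutex_unlock(&c->lock);
    return done;
}
```

The flag is set and tested under the same lock, so a waiter cannot miss a signal. In this sketch the flag stays set once signalled, so any number of later waiters return at once.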

--linas



EDAC PCI error recovery (was Re: [PATCH 2/2] powerpc: MPC85xx EDAC device driver)

2007-08-01 Thread Linas Vepstas
On Mon, Jul 30, 2007 at 03:47:05PM -0700, Doug Thompson wrote:
 
 --- Linas Vepstas [EMAIL PROTECTED] wrote:
  Also: please note that the linux kernel has a pci error recovery
  mechanism built in; it's used by pseries and PCI-E. I'm not clear
  on what any of this has to do with EDAC, which I thought was supposed 
  to be for RAM only. (The EDAC project once talked about doing pci error 
  recovery, but that was years ago, and there is a separate system for
  that, now.)
 
 no, edac can/does harvest PCI bus errors, via polling and other hardware 
 error detectors.

Ehh! I had no idea. A few years ago, when I was working on the PCI error
recovery, I sent a number of emails to the various EDAC people and mailing 
lists that I could find, and never got a response.  I assumed the
project was dead. I guess it's not ... 

 But at the current time, few PCI device drivers initialize those callback
 functions and thus errors are lost and some IO transactions fail.

There are patches for 6 drivers in mainline (e100, e1000, ixgb, s2io,
ipr, lpfc), and two more pending (sym53cxxx, tg3).  So far, I've written 
all of them. 

 Over time, as drivers get updated (might take some time) then drivers
 can take some sort of action FOR THEMSELVES

I think I need to do more to raise awareness and interest.

 Yet, there is no tracking of errors - except for a log message in the log file.
 
 There is NO meter on frequency of errors, etc. One must grep the log file and
 that is not a very cycle friendly mechanism.

Yeah, there was low interest in stats. There's a core set of stats in
/proc/ppc64/eeh, but these are clearly arch-specific. I'd like to move
away from those.  Some recent patches added stats to the /sys tree,
under the individual pci bridge and device nodes.  Again, these are
arch-specific; I'd like to move to some general/standardized presentation.

 The reason I added PCI parity/error device scanning, was that when I was at
 Linux Networx, we had parity errors on the PCI-X bus, but didn't know the
 cause.  After we discovered that a simple PCI-X riser card had manufacturing
 problems (quality) and didn't drive lines properly, it caused parity errors.

Heh. Not unusual. I've seen/heard of cases with voltages being low,
and/or ground-bounce in slots near the end. There's a whole zoo of
hardware/firmware bugs that we've had to painfully crawl through and
fix. That's why the IBM boxes cost big $$$; here's to hoping that 
customers understand why.

 This feature allowed us to track nodes that were having parity problems, but
 we had no METER to know it.
 
 Recovery is a good thing, BUT how do you know you having LOTS of
 errors/recovery events? You need a meter. EDAC provides that METER

I'm lazy. What source code should I be looking at?  I'm concerned about
duplication of function and proliferation of interfaces. I've got my 
metering data under (for example)
/sys/bus/pci/devices/0001:c0:01.0/eeh_*, mostly very arch specific.
The code for this is in arch/powerpc/platforms/pseries/eeh_sysfs.c

 I met with Yanmin Zhang of Intel at OLS after his paper presentation on PCI
 Express Advanced Error Reporting in the Kernel, and we talked about this same
 thing. I am talking with him on having the recovery code present information
 into EDAC sysfs area. (hopefully, anyway)

Hmm. OK, where's that?  Back when, I'd talked to Yanmin about coming up 
with a generic, arch-indep way of driving the recovery routines. But
this wasn't exactly easy, and we were still grappling with just getting
things working.  Now that things are working, it's time to broaden
horizons.

Can you point me to the current edac code?
find . -print |grep edac is not particularly revealing at the moment.

 The recovery generates log messages BUT having to periodically 'grep' the log
 file looking for errors is not a good use of CPU cycles. grep once for a count
 and then grep later for a count and then compare the counts for a delta count
 per unit time. ugly.

Yep. Maybe send events up to udev?
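
The "delta count per unit time" meter is cheap once a counter is exposed directly instead of being recovered from the log. A minimal sketch, assuming a monotonically increasing error counter sampled at known times; the struct and function names are hypothetical and do not come from EDAC or EEH.

```c
#include <assert.h>
#include <stdint.h>

/* One reading of a monotonically increasing error counter (hypothetical). */
struct err_sample {
    uint64_t count;     /* total errors seen so far */
    uint64_t time_sec;  /* time of the reading, in seconds */
};

/* Errors per hour between two readings; 0 if no time has elapsed. */
static double err_rate_per_hour(struct err_sample earlier,
                                struct err_sample later)
{
    uint64_t dt;

    /* Both the counter and the clock are assumed never to go backwards. */
    assert(later.count >= earlier.count);
    assert(later.time_sec >= earlier.time_sec);

    dt = later.time_sec - earlier.time_sec;
    if (dt == 0)
        return 0.0;
    return (double)(later.count - earlier.count) * 3600.0 / (double)dt;
}
```

For example, a counter that moves from 10 to 13 over 1800 seconds gives a rate of 6 errors per hour.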

 The EDAC solution is to be able to have a Listener thread in user space that
 can be notified (via poll()) that an event has occurred.

Hmm. OK, I'm alarmingly naive about udev, but my initial gut instinct is
to pipe all such events to udev. Most of user-space has already been
given the marching orders to use udev and/or hal for this kind of stuff.
So this makes sense to me.
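
Whichever transport carries the events, the listener side usually reduces to blocking in poll() on a file descriptor. A minimal sketch of that pattern follows. It assumes one byte per event on an arbitrary descriptor; the function name and the one-byte event format are hypothetical and do not describe the EDAC or udev interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Block for up to timeout_ms waiting for one event byte on fd.
 * Returns 1 if an event was read into *event, 0 on timeout, -1 on error.
 */
static int wait_for_error_event(int fd, int timeout_ms, char *event)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(event != NULL);
    rc = poll(&pfd, 1, timeout_ms);
    if (rc <= 0)
        return rc;              /* 0 = timeout, -1 = poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;              /* hangup or error on the descriptor */
    return read(fd, event, 1) == 1 ? 1 : -1;
}
```

The listener sleeps until something arrives, which is the property the log-grepping approach lacks. A pipe is enough to try it out: write one byte to the write end and call the function on the read end.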

 There are more than one consumer (error recover) of error events:
 1) driver recovery after a transaction (which is the recovery consumer above)

I had to argue loudly for recovery in the kernel. The problem was that
it was impossible to recover errors on scsi devices from userspace (since
the block device and filesystems would go bonkers).

 2) Management agents for health of a node
 3) Maintenance agents for predictive component replacement

Yes, agreed. Care to ask your management agent friends for where they'd
like to get these events from (i.e. udev, or somewhere else?)

 We

[PATCH]: PCI Error Recovery: Symbios SCSI device driver

2007-07-02 Thread Linas Vepstas

Various PCI bus errors can be signaled by newer PCI controllers.  
This patch adds the PCI error recovery callbacks to the Symbios 
SCSI device driver.  The patch has been tested, and appears to 
work well.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]



Hi,

This patch has been bouncing around for a long time, and has made
appearances in various -mm trees since 2.6.something-teen. However,
it has never made it into mainline, and I'm starting to get concerned
that it will miss 2.6.23 as well. 

There was some discussion, and I think I addressed all of the various
issues that came up. I'd really like to get this patch in, but am unclear
on exactly who to pester at this point. Matt Wilcox seems to be looking 
for a job (???) and I am unable to git-clone James Bottomley's 
git://kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
git tree; there's some error on the server side.

Linas.

 drivers/scsi/sym53c8xx_2/sym_glue.c |  136 
 drivers/scsi/sym53c8xx_2/sym_glue.h |4 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |6 +
 3 files changed, 146 insertions(+)

Index: linux-2.6.22-rc1/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.22-rc1.orig/drivers/scsi/sym53c8xx_2/sym_glue.c   2007-04-25 
22:08:32.0 -0500
+++ linux-2.6.22-rc1/drivers/scsi/sym53c8xx_2/sym_glue.c2007-05-14 
17:31:44.0 -0500
@@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
unsigned long flags;
struct sym_hcb *np = (struct sym_hcb *)dev_id;
 
+   /* Avoid spinloop trying to handle interrupts on frozen device */
+   if (pci_channel_offline(np->s.device))
+       return IRQ_HANDLED;
+
if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("[");
 
spin_lock_irqsave(np->s.host->host_lock, flags);
@@ -726,6 +730,20 @@ static int sym_eh_handler(int op, char *
 
dev_warn(&cmd->device->sdev_gendev, "%s operation started.\n", opname);
 
+   /* We may be in an error condition because the PCI bus
+* went down. In this case, we need to wait until the
+* PCI bus is reset, the card is reset, and only then
+* proceed with the scsi error recovery.  There's no
+* point in hurrying; take a leisurely wait.
+*/
+#define WAIT_FOR_PCI_RECOVERY  35
+   if (pci_channel_offline(np->s.device)) {
+       int finished_reset = wait_for_completion_timeout(
+           &np->s.io_reset_wait, WAIT_FOR_PCI_RECOVERY*HZ);
+       if (!finished_reset)
+           return SCSI_FAILED;
+   }
+
spin_lock_irq(host->host_lock);
/* This one is queued in some place - to wait for completion */
FOR_EACH_QUEUED_ELEMENT(&np->busy_ccbq, qp) {
@@ -1510,6 +1528,7 @@ static struct Scsi_Host * __devinit sym_
np->maxoffs = dev->chip.offset_max;
np->maxburst= dev->chip.burst_max;
np->myaddr  = dev->host_id;
+   init_completion(&np->s.io_reset_wait);
 
/*
 *  Edit its name.
@@ -1948,6 +1967,116 @@ static void __devexit sym2_remove(struct
attach_count--;
 }
 
+/**
+ * sym2_io_error_detected() -- called when PCI error is detected
+ * @pdev: pointer to PCI device
+ * @state: current state of the PCI slot
+ */
+static pci_ers_result_t sym2_io_error_detected(struct pci_dev *pdev,
+ enum pci_channel_state state)
+{
+   struct sym_hcb *np = pci_get_drvdata(pdev);
+
+   /* If slot is permanently frozen, turn everything off */
+   if (state == pci_channel_io_perm_failure) {
+   sym2_remove(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   init_completion(&np->s.io_reset_wait);
+   disable_irq(pdev->irq);
+   pci_disable_device(pdev);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * sym2_reset_workarounds -- hardware-specific work-arounds
+ *
+ * This routine is similar to sym_set_workarounds(), except
+ * that, at this point, we already know that the device was
+ * successfully initialized at least once before, and so most
+ * of the steps taken there are un-needed here.
+ */
+static void sym2_reset_workarounds(struct pci_dev *pdev)
+{
+   u_char revision;
+   u_short status_reg;
+   struct sym_chip *chip;
+
+   pci_read_config_byte(pdev, PCI_CLASS_REVISION, &revision);
+   chip = sym_lookup_chip_table(pdev->device, revision);
+
+   /* Work around for errant bit in 895A, in a fashion
+* similar to what is done in sym_set_workarounds().
+*/
+   pci_read_config_word(pdev, PCI_STATUS, &status_reg);
+   if (!(chip->features & FE_66MHZ) && (status_reg & PCI_STATUS_66MHZ)) {
+       status_reg = PCI_STATUS_66MHZ;
+       pci_write_config_word(pdev, PCI_STATUS, status_reg);
+       pci_read_config_word(pdev, PCI_STATUS, &status_reg

Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-05-17 Thread Linas Vepstas
On Wed, May 09, 2007 at 03:26:21PM -0500, Linas Vepstas wrote:
 Hi Matthew,
 
 I had been hoping these patches might make it into 2.6.22,
 ... this is a nag note; please forward upstream.


... should I repost the patches? 

--linas 



Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-05-09 Thread Linas Vepstas
Hi Matthew,

I had been hoping these patches might make it into 2.6.22,
... this is a nag note; please forward upstream.

--linas

On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:
 
 Implement the so-called first failure data capture (FFDC) for the
 symbios PCI error recovery.  After a PCI error event is reported,
 the driver requests that MMIO be enabled. Once enabled, it 
 then reads and dumps assorted status registers, and concludes
 by requesting the usual reset sequence.
 
 (includes a whitespace fix for bad indentation).
 
 Signed-off-by: Linas Vepstas [EMAIL PROTECTED]


[PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-04-20 Thread Linas Vepstas

Implement the so-called first failure data capture (FFDC) for the
symbios PCI error recovery.  After a PCI error event is reported,
the driver requests that MMIO be enabled. Once enabled, it 
then reads and dumps assorted status registers, and concludes
by requesting the usual reset sequence.

(includes a whitespace fix for bad indentation).

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
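
The callback ordering described above can be followed in a small userspace simulation. This is a sketch under stated assumptions, not kernel code: the `sim_*` and `ffdc_*` names are hypothetical stand-ins for the PCI error-handler interface, and only the path from the description (error reported, MMIO enabled and registers dumped, slot reset, resume) is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the recovery result codes (hypothetical). */
enum sim_result { SIM_CAN_RECOVER, SIM_NEED_RESET, SIM_RECOVERED, SIM_DISCONNECT };

/* Simplified stand-in for a driver's table of error handlers. */
struct sim_handlers {
    enum sim_result (*error_detected)(void *dev);
    enum sim_result (*mmio_enabled)(void *dev);
    enum sim_result (*slot_reset)(void *dev);
    void (*resume)(void *dev);
};

/* Toy device that records which steps ran, in order. */
struct sim_dev { char trace[8]; size_t n; };

static void sim_mark(struct sim_dev *d, char step)
{
    assert(d->n + 1 < sizeof d->trace);
    d->trace[d->n++] = step;
    d->trace[d->n] = '\0';
}

/* Error reported: ask for MMIO to be enabled before anything else. */
static enum sim_result ffdc_error_detected(void *dev)
{ sim_mark(dev, 'E'); return SIM_CAN_RECOVER; }

/* MMIO enabled: dump registers here, then ask for the usual reset. */
static enum sim_result ffdc_mmio_enabled(void *dev)
{ sim_mark(dev, 'D'); return SIM_NEED_RESET; }

static enum sim_result ffdc_slot_reset(void *dev)
{ sim_mark(dev, 'R'); return SIM_RECOVERED; }

static void ffdc_resume(void *dev)
{ sim_mark(dev, 'U'); }

/* Drive the callbacks in the order the description gives. */
static enum sim_result sim_recover(const struct sim_handlers *h, void *dev)
{
    enum sim_result r = h->error_detected(dev);

    if (r == SIM_CAN_RECOVER)
        r = h->mmio_enabled(dev);
    if (r == SIM_NEED_RESET)
        r = h->slot_reset(dev);
    if (r == SIM_RECOVERED)
        h->resume(dev);
    return r;
}
```

Running `sim_recover()` with the four `ffdc_*` handlers leaves the trace "EDRU": error, dump, reset, resume. The register dump happens before the reset, while the state it records is still present.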


 drivers/scsi/sym53c8xx_2/sym_glue.c |   15 +++
 drivers/scsi/sym53c8xx_2/sym_glue.h |1 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |   18 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.c  
2007-04-20 12:52:01.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c   2007-04-20 
15:25:35.0 -0500
@@ -1987,6 +1987,20 @@ static pci_ers_result_t sym2_io_error_de
disable_irq(pdev->irq);
pci_disable_device(pdev);
 
+   /* Request that MMIO be enabled, so register dump can be taken. */
+   return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+/**
+ * sym2_io_slot_dump -- Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+static pci_ers_result_t sym2_io_slot_dump (struct pci_dev *pdev)
+{
+   struct sym_hcb *np = pci_get_drvdata(pdev);
+
+   sym_dump_registers(np);
+
/* Request a slot reset. */
return PCI_ERS_RESULT_NEED_RESET;
 }
@@ -2241,6 +2255,7 @@ MODULE_DEVICE_TABLE(pci, sym2_id_table);
 
 static struct pci_error_handlers sym2_err_handler = {
.error_detected = sym2_io_error_detected,
+   .mmio_enabled = sym2_io_slot_dump,
.slot_reset = sym2_io_slot_reset,
.resume = sym2_io_resume,
 };
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.h  
2007-04-20 12:15:07.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h   2007-04-20 
15:21:31.0 -0500
@@ -270,5 +270,6 @@ void sym_xpt_async_bus_reset(struct sym_
 void sym_xpt_async_sent_bdr(struct sym_hcb *np, int target);
 int  sym_setup_data_and_start (struct sym_hcb *np, struct scsi_cmnd *csio, 
struct sym_ccb *cp);
 void sym_log_bus_error(struct sym_hcb *np);
+void sym_dump_registers(struct sym_hcb *np);
 
 #endif /* SYM_GLUE_H */
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c  
2007-04-20 12:18:59.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c   2007-04-20 
15:18:01.0 -0500
@@ -1180,10 +1180,10 @@ static void sym_log_hard_error(struct sy
scr_to_cpu((int) *(u32 *)(script_base + script_ofs)));
}
 
-printf ("%s: regdump:", sym_name(np));
-for (i=0; i<24;i++)
-printf (" %02x", (unsigned)INB_OFF(np, i));
-printf (".\n");
+   printf ("%s: regdump:", sym_name(np));
+   for (i=0; i<24;i++)
+       printf (" %02x", (unsigned)INB_OFF(np, i));
+   printf (".\n");
 
/*
 *  PCI BUS error.
@@ -1192,6 +1192,16 @@ static void sym_log_hard_error(struct sy
sym_log_bus_error(np);
 }
 
+void sym_dump_registers(struct sym_hcb *np)
+{
+   u_short sist;
+   u_char dstat;
+
+   sist = INW(np, nc_sist);
+   dstat = INB(np, nc_dstat);
+   sym_log_hard_error(np, sist, dstat);
+}
+
 static struct sym_chip sym_dev_table[] = {
  {PCI_DEVICE_ID_NCR_53C810, 0x0f, 810, 4, 8, 4, 64,
  FE_ERL}


[PATCH] lpfc: avoid double-free during PCI error failure

2007-03-08 Thread Linas Vepstas

Bino, James,
Please review, sign-off and forward upstream.

--linas


If a PCI error is detected that cannot be recovered from, there
will be a double call of lpfc_pci_remove_one(), with the second call
resulting in a null-pointer dereference. The first call occurs in 
lpfc_io_error_detected(), and the second call during pci device 
remove. This patch eliminates the first call; it is unneeded.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
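
The failure mode can be reproduced in miniature. The sketch below is a userspace toy with hypothetical `toy_*` names, not the lpfc code. It assumes only what the description states: a remove routine that clears the driver data must not run twice, so the error handler reports the disconnect and leaves teardown to the normal device-remove path.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy device: drvdata is freed and cleared by the remove routine. */
struct toy_pdev { void *drvdata; int removes; };

/*
 * Teardown that clears drvdata. A second call finds NULL; this toy
 * returns -1 at the point where a real driver would dereference it.
 */
static int toy_remove(struct toy_pdev *pdev)
{
    if (pdev->drvdata == NULL)
        return -1;
    free(pdev->drvdata);
    pdev->drvdata = NULL;
    pdev->removes++;
    return 0;
}

enum toy_result { TOY_DISCONNECT, TOY_NEED_RESET };

/* After the fix: on permanent failure, report it and do not tear down. */
static enum toy_result toy_error_detected(struct toy_pdev *pdev,
                                          int perm_failure)
{
    assert(pdev != NULL);
    return perm_failure ? TOY_DISCONNECT : TOY_NEED_RESET;
}
```

With this split the remove routine runs exactly once, from the device-remove path, and the error callback never touches the driver data.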


 drivers/scsi/lpfc/lpfc_init.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git16.orig/drivers/scsi/lpfc/lpfc_init.c   2007-03-08 
15:57:40.0 -0600
+++ linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c2007-03-08 
16:03:18.0 -0600
@@ -1817,10 +1817,9 @@ static pci_ers_result_t lpfc_io_error_de
struct lpfc_sli *psli = &phba->sli;
struct lpfc_sli_ring  *pring;
 
-   if (state == pci_channel_io_perm_failure) {
-   lpfc_pci_remove_one(pdev);
+   if (state == pci_channel_io_perm_failure)
return PCI_ERS_RESULT_DISCONNECT;
-   }
+
pci_disable_device(pdev);
/*
 * There may be I/Os dropped by the firmware.


[PATCH] lpfc: add PCI error recovery support

2007-02-14 Thread Linas Vepstas

James,

Please review and forward upstream.  This is a patch I'd previously
submitted, and reworked by [EMAIL PROTECTED] in January.
Not clear if I need to also nag James Smart (who is listed as the
maintainer) for an Acked-by (which I am led to believe should be
forthcoming? Ahh the joys of indirect communication!)

--linas

This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]



 drivers/scsi/lpfc/lpfc_init.c |   97 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 109 insertions(+)

Index: linux-2.6.20-git4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git4.orig/drivers/scsi/lpfc/lpfc_init.c2007-02-09 
17:22:30.0 -0600
+++ linux-2.6.20-git4/drivers/scsi/lpfc/lpfc_init.c 2007-02-14 
14:12:22.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
struct lpfc_sli *psli = &phba->sli;
struct lpfc_sli_ring  *pring;
uint32_t event_data;
+   /* If the pci channel is offline, ignore possible errors,
+* since we cannot communicate with the pci card anyway. */
+   if (pci_channel_offline(phba->pcidev))
+       return;
 
if (phba->work_hs & HS_FFER6 ||
phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,92 @@ lpfc_pci_remove_one(struct pci_dev *pdev
pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+   pci_channel_state_t state)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   struct lpfc_sli_ring  *pring;
+
+   if (state == pci_channel_io_perm_failure) {
+   lpfc_pci_remove_one(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+   pci_disable_device(pdev);
+   /*
+* There may be I/Os dropped by the firmware.
+* Error iocb (I/O) on txcmplq and let the SCSI layer
+* retry it after re-establishing link.
+*/
+   pring = &psli->ring[psli->fcp_ring];
+   lpfc_sli_abort_iocb_ring(phba, pring);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   int bars = pci_select_bars(pdev, IORESOURCE_MEM);
+
+   dev_printk(KERN_INFO, &pdev->dev, "recovering from a slot reset.\n");
+   if (pci_enable_device_bars(pdev, bars)) {
+       printk(KERN_ERR "lpfc: Cannot re-enable "
+           "PCI device after reset.\n");
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   pci_set_master(pdev);
+
+   /* Re-establishing Link */
+   spin_lock_irq(phba->host->host_lock);
+   phba->fc_flag |= FC_ESTABLISH_LINK;
+   psli->sli_flag &= ~LPFC_SLI2_ACTIVE;
+   spin_unlock_irq(phba->host->host_lock);
+
+
+   /* Take device offline; this will perform cleanup */
+   lpfc_offline(phba);
+   lpfc_sli_brdrestart(phba);
+
+   return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * lpfc_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * it's OK to resume normal operation.
+ */
+static void lpfc_io_resume(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+
+   if (lpfc_online(phba) == 0) {
+       mod_timer(&phba->fc_estabtmo, jiffies + HZ * 60);
+   }
+}
+
 static struct pci_device_id lpfc_id_table[] = {
{PCI_VENDOR_ID_EMULEX, PCI_DEVICE_ID_VIPER,
PCI_ANY_ID, PCI_ANY_ID, },
@@ -1857,11 +1947,18 @@ static struct pci_device_id lpfc_id_tabl
 
 MODULE_DEVICE_TABLE(pci, lpfc_id_table);
 
+static struct pci_error_handlers lpfc_err_handler = {
+   .error_detected = lpfc_io_error_detected,
+   .slot_reset = lpfc_io_slot_reset,
+   .resume = lpfc_io_resume,
+};
+
 static struct pci_driver lpfc_driver = {
.name   = LPFC_DRIVER_NAME

Re: lpfc PCIe error recovery

2007-01-11 Thread Linas Vepstas
On Wed, Jan 10, 2007 at 04:59:39PM -0600, linas wrote:
 
  However, on a Power4 architecture there are errors reported
  in upper layer (we discussed this in one of earlier emails) followed 
  by SCSI errors.
 
 I'm trying to investigate now.

I found two distinct power4 bugs. I posted a patch for one yesterday,
under the subject heading 

  [PATCH] Urgent: powerpc 2.6.20-rc4 dma broken on non-LPAR pseries

This affects only recent mainline kernels; it would not affect
older or distro kernels.   

The other patch is attached below.  After some more testing,
I'll submit to mainline.

--linas


Subject: [PATCH] pSeries: EEH improperly enabled for some Power4 systems

It appears that EEH is improperly enabled for some Power4 systems.
On these systems, the ibm,set-eeh-option returns a value of success
even when EEH is not supported on the given node. Thus, an explicit
check for support is required.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED] 


 arch/powerpc/platforms/pseries/eeh.c |   19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6.20-rc4/arch/powerpc/platforms/pseries/eeh.c
===
--- linux-2.6.20-rc4.orig/arch/powerpc/platforms/pseries/eeh.c  2007-01-11 
14:15:02.0 -0600
+++ linux-2.6.20-rc4/arch/powerpc/platforms/pseries/eeh.c   2007-01-11 
15:14:39.0 -0600
@@ -748,6 +748,7 @@ struct eeh_early_enable_info {
 /* Enable eeh for the given device node. */
 static void *early_enable_eeh(struct device_node *dn, void *data)
 {
+   unsigned int rets[3];
struct eeh_early_enable_info *info = data;
int ret;
const char *status = get_property(dn, "status", NULL);
@@ -804,16 +805,14 @@ static void *early_enable_eeh(struct dev
regs[0], info->buid_hi, info->buid_lo,
EEH_ENABLE);
 
+   enable = 0;
if (ret == 0) {
-   eeh_subsystem_enabled = 1;
-   pdn->eeh_mode |= EEH_MODE_SUPPORTED;
pdn->eeh_config_addr = regs[0];
 
/* If the newer, better, ibm,get-config-addr-info is supported,
 * then use that instead. */
pdn->eeh_pe_config_addr = 0;
if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
-   unsigned int rets[2];
ret = rtas_call (ibm_get_config_addr_info, 4, 2, rets,
pdn->eeh_config_addr,
info->buid_hi, info->buid_lo,
@@ -821,6 +820,20 @@ static void *early_enable_eeh(struct dev
if (ret == 0)
pdn->eeh_pe_config_addr = rets[0];
}
+
+   /* Some older systems (Power4) allow the
+* ibm,set-eeh-option call to succeed even on nodes
+* where EEH is not supported. Verify support
+* explicitly. */
+   ret = read_slot_reset_state(pdn, rets);
+   if ((ret == 0) && (rets[1] == 1))
+   enable = 1;
+   }
+
+   if (enable) {
+   eeh_subsystem_enabled = 1;
+   pdn->eeh_mode |= EEH_MODE_SUPPORTED;
+
 #ifdef DEBUG
printk(KERN_DEBUG "EEH: %s: eeh enabled, config=%x pe_config=%x\n",
dn->full_name, pdn->eeh_config_addr,
pdn->eeh_pe_config_addr);



Bug: 2.6.20 scsi/block device/elevator recursion loop

2007-01-11 Thread Linas Vepstas
Hi,

On Thu, Jan 11, 2007 at 04:22:52PM -0500, [EMAIL PROTECTED] wrote:
 This patch is present in upstream and is also present 
 in 2.6.20. So this is a new issue.

What was the patch last time around? 

It seems I'm seeing this more often than expected. The first time,
the system spewed the softlockup error, but then recovered after 
a few minutes. This time, even after an hour, the system remained
hung. It was pingable, but the console, and all ssh sessions
were unresponsive.

After hitting the little yellow button, I got a stack trace
(below) in _spin_unlock_irqrestore, which makes me think that
perhaps the system was being flooded with irq's. I'll try 
to investigate further tomorrow.

--linas

Background:
kernel 2.6.20-rc4
IBM Power4 pSeries (630)
lpfc scsi (Emulex)

 chsysstate -r sys -n io-raiders  -o reset

io-raiders:~ # cpu 0x0: Vector: 100 (System Reset) at [c0003ff69520]
pc: c023d794: ._raw_spin_unlock+0xb4/0xd4
lr: c046d5ac: ._spin_unlock_irqrestore+0x18/0x3c
sp: c0003ff697a0
   msr: 90009032
  current = 0xc43e21f0
  paca= 0xc0674080
pid   = 1123, comm = kblockd/0
enter ? for help
[c0003ff69820] c046d5ac ._spin_unlock_irqrestore+0x18/0x3c
[c0003ff698b0] c021bbe0 .blk_run_queue+0xc8/0xec
[c0003ff69950] c0320728 .scsi_run_queue+0x248/0x278
[c0003ff69a00] c0321948 .scsi_queue_insert+0x88/0xa8
[c0003ff69a90] c031bc34 .scsi_dispatch_cmd+0x2b8/0x2e4
[c0003ff69b30] c0322804 .scsi_request_fn+0x2c4/0x3c0
[c0003ff69be0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff69c60] c0216d6c .elv_insert+0x240/0x268
[c0003ff69d00] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff69d90] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff69e40] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff69ec0] c0216d6c .elv_insert+0x240/0x268
[c0003ff69f60] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff69ff0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a0a0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a120] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a1c0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a250] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a300] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a380] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a420] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a4b0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a560] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a5e0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a680] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a710] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a7c0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a840] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a8e0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a970] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6aa20] c021bbac .blk_run_queue+0x94/0xec
[c0003ff6aac0] c0320728 .scsi_run_queue+0x248/0x278
[c0003ff6ab70] c0321948 .scsi_queue_insert+0x88/0xa8
[c0003ff6ac00] c031bc34 .scsi_dispatch_cmd+0x2b8/0x2e4
[c0003ff6aca0] c0322804 .scsi_request_fn+0x2c4/0x3c0
[c0003ff6ad50] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6add0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6ae70] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6af00] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6afb0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b030] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b0d0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b160] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b210] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b290] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b330] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b3c0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b470] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b4f0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b590] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b620] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b6d0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b750] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b7f0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b880] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b930] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b9b0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6ba50] c021a25c .blk_requeue_request+0x38/0x54

Re: lpfc PCIe error recovery

2007-01-10 Thread Linas Vepstas
On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:
 Hi Linas,
   Following is the latest lpfc driver patch we are testing in the 
 Emulex lab for PCI error recovery. This patch looks good on a Power5 
 platform. 

Yes, it seemed to survive a few hours of testing fine. I did see one
interesting thing, namely a softlockup. I attribute this to the fact
that I'd queued up a lot of heavy file i/o, issued a sync, which
typically takes more than a few seconds on the test system, and then 
injected the artificial PCI error. After about ten seconds, I got the 
softlockup, but after another 10-20 seconds, things seemed back to
normal. So I don't consider this an actual error, but thought 
it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

 However, on a Power4 architecture there are errors reported
 in upper layer (we discussed this in one of earlier emails) followed 
 by SCSI errors.

I'm trying to investigate now.

The patch you sent out got garbled, so I'm reposting below.



This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]



 drivers/scsi/lpfc/lpfc_init.c |   96 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c 2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c  2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
    struct lpfc_sli *psli = &phba->sli;
    struct lpfc_sli_ring  *pring;
    uint32_t event_data;
+   /* If the pci channel is offline, ignore possible errors,
+    * since we cannot communicate with the pci card anyway. */
+   if (pci_channel_offline(phba->pcidev))
+       return;
 
    if (phba->work_hs & HS_FFER6 ||
        phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
    pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+               pci_channel_state_t state)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   struct lpfc_sli_ring  *pring;
+
+   if (state == pci_channel_io_perm_failure) {
+       lpfc_pci_remove_one(pdev);
+       return PCI_ERS_RESULT_DISCONNECT;
+   }
+   pci_disable_device(pdev);
+   /*
+    * There may be I/Os dropped by the firmware.
+    * Error iocb (I/O) on txcmplq and let the SCSI layer
+    * retry it after re-establishing link.
+    */
+   pring = &psli->ring[psli->fcp_ring];
+   lpfc_sli_abort_iocb_ring(phba, pring);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev

crash on lpfc rmmod

2007-01-10 Thread Linas Vepstas
Hi Bino,

Fiddling with the lpfc driver on 2.6.20-rc4, shortly after 
booting, I attempted to rmmod the lpfc module and got a crash:

io-raiders:~ # rmmod lpfc
cpu 0x0: Vector: 300 (Data Access) at [c003c86075a0]
pc: d08d0988: .lpfc_free_sysfs_attr+0x1c/0x58 [lpfc]
lr: d08c458c: .lpfc_pci_remove_one+0x3c/0x278 [lpfc]
sp: c003c8607820
   msr: 90009032
   dar: 11c0
 dsisr: 4000
  current = 0xc003bf4b4c80
  paca= 0xc0674080
pid   = 12977, comm = rmmod
[ 3005.329608] [ cut here ]

at which point the system locked up hard (I was expecting it to
go into xmon).

Suggestions?

--linas



[PATCH] lpfc: add PCI error recovery support

2006-12-06 Thread Linas Vepstas

James,

Please review the patch below. Presuming that you like it,
please forward upstream.

--linas

This patch adds PCI Error recovery support to the 
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]



 drivers/scsi/lpfc/lpfc_init.c |   91 ++
 1 file changed, 91 insertions(+)

Index: linux-2.6.19-git7/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.19-git7.orig/drivers/scsi/lpfc/lpfc_init.c 2006-12-06 13:31:39.0 -0600
+++ linux-2.6.19-git7/drivers/scsi/lpfc/lpfc_init.c 2006-12-06 13:33:49.0 -0600
@@ -517,6 +517,11 @@ lpfc_handle_eratt(struct lpfc_hba * phba
    struct lpfc_sli_ring  *pring;
    uint32_t event_data;
 
+   /* If the pci channel is offline, ignore possible errors,
+    * since we cannot communicate with the pci card anyway. */
+   if (pci_channel_offline(phba->pcidev))
+       return;
+
    if (phba->work_hs & HS_FFER6) {
        /* Re-establishing Link */
        lpfc_printf_log(phba, KERN_INFO, LOG_LINK_EVENT,
@@ -1825,6 +1830,85 @@ lpfc_pci_remove_one(struct pci_dev *pdev
    pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+               pci_channel_state_t state)
+{
+   if (state == pci_channel_io_perm_failure) {
+       lpfc_pci_remove_one(pdev);
+       return PCI_ERS_RESULT_DISCONNECT;
+   }
+   pci_disable_device(pdev);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   struct lpfc_sli_ring  *pring;
+
+   dev_printk(KERN_INFO, &pdev->dev, "recovering from a slot reset.\n");
+   if (pci_enable_device(pdev)) {
+       printk(KERN_ERR "lpfc: Cannot re-enable PCI device after reset.\n");
+       return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   pci_set_master(pdev);
+
+   /* Re-establishing Link */
+   spin_lock_irq(phba->host->host_lock);
+   phba->fc_flag |= FC_ESTABLISH_LINK;
+   psli->sli_flag &= ~LPFC_SLI2_ACTIVE;
+   spin_unlock_irq(phba->host->host_lock);
+
+   /*
+    * There may be I/Os dropped by the firmware.
+    * Error iocb (I/O) on txcmplq and let the SCSI layer
+    * retry it after re-establishing link.
+    */
+   pring = &psli->ring[psli->fcp_ring];
+   lpfc_sli_abort_iocb_ring(phba, pring);
+
+   /* Take device offline; this will perform cleanup */
+   lpfc_offline(phba);
+   lpfc_sli_brdrestart(phba);
+
+   return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * lpfc_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * it's OK to resume normal operation.
+ */
+static void lpfc_io_resume(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+
+   lpfc_online(phba);
+   mod_timer(&phba->fc_estabtmo, jiffies + HZ * 60);
+}
+
 static struct pci_device_id lpfc_id_table[] = {
    {PCI_VENDOR_ID_EMULEX, PCI_DEVICE_ID_VIPER,
        PCI_ANY_ID, PCI_ANY_ID, },
@@ -1885,11 +1969,18 @@ static struct pci_device_id lpfc_id_tabl
 
 MODULE_DEVICE_TABLE(pci, lpfc_id_table);
 
+static struct pci_error_handlers lpfc_err_handler = {
+   .error_detected = lpfc_io_error_detected,
+   .slot_reset = lpfc_io_slot_reset,
+   .resume = lpfc_io_resume,
+};
+
 static struct pci_driver lpfc_driver = {
    .name       = LPFC_DRIVER_NAME,
    .id_table   = lpfc_id_table,
    .probe      = lpfc_pci_probe_one,
    .remove     = __devexit_p(lpfc_pci_remove_one),
+   .err_handler = &lpfc_err_handler,
 };
 
 static int __init
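
The three callbacks registered in lpfc_err_handler are called in a fixed
order by the platform's error recovery code: error_detected first, then
slot_reset once the slot has been reset, then resume. What follows is a
rough userspace model of that flow, not kernel code. All type and function
names are local stand-ins loosely modelled on the kernel names in the patch,
the toy device is hypothetical, and only the two error_detected outcomes
used by this patch (need reset, disconnect) are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins; the numeric values are arbitrary, not the kernel's. */
enum chan_state { CHAN_IO_FROZEN, CHAN_IO_PERM_FAILURE };
enum ers_result { ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

struct err_handlers {
    enum ers_result (*error_detected)(void *dev, enum chan_state state);
    enum ers_result (*slot_reset)(void *dev);
    void (*resume)(void *dev);
};

/* Simplified recovery driver: returns 1 if the device was resumed,
 * 0 if it ended up disconnected. */
int run_recovery(const struct err_handlers *h, void *dev,
                 enum chan_state state)
{
    assert(h != NULL && h->error_detected != NULL);
    assert(h->slot_reset != NULL && h->resume != NULL);

    if (h->error_detected(dev, state) != ERS_NEED_RESET)
        return 0;
    /* A real platform would reset the slot at this point. */
    if (h->slot_reset(dev) != ERS_RECOVERED)
        return 0;
    h->resume(dev);
    return 1;
}

/* Hypothetical device whose handlers mirror the shape of the patch. */
struct toy_dev {
    int enabled;    /* models pci_enable_device / pci_disable_device */
    int online;     /* models lpfc_online / lpfc_offline */
    int reset_ok;   /* whether re-enabling after reset succeeds */
};

static enum ers_result toy_error_detected(void *p, enum chan_state state)
{
    struct toy_dev *d = p;

    d->online = 0;
    if (state == CHAN_IO_PERM_FAILURE)
        return ERS_DISCONNECT;
    d->enabled = 0;
    return ERS_NEED_RESET;
}

static enum ers_result toy_slot_reset(void *p)
{
    struct toy_dev *d = p;

    if (!d->reset_ok)
        return ERS_DISCONNECT;
    d->enabled = 1;
    return ERS_RECOVERED;
}

static void toy_resume(void *p)
{
    struct toy_dev *d = p;

    d->online = 1;
}

const struct err_handlers toy_handlers = {
    .error_detected = toy_error_detected,
    .slot_reset     = toy_slot_reset,
    .resume         = toy_resume,
};
```

Calling run_recovery(&toy_handlers, &dev, CHAN_IO_FROZEN) on a device with
reset_ok set walks all three callbacks; a permanent failure or a failed
re-enable stops early and leaves the device offline.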


Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-31 Thread Linas Vepstas
On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
 Linas Vepstas wrote:
  
  My current hardware will halt all i/o to/from the symbios controller
  upon detection of a PCI error.  The recovery procedure that I am
  currently using is to call system firmware (aka 'bios') to raise
  and then lower the #RST pci signal line for 1/4 second, then wait 2
  seconds for the  PCI bus to settle, then restore the PCI config space
  registers (BARs, interrupt line, etc) to what they used to be. Then,
  I call sym_start_up() in an attempt to get the symbios card working
  again.  And that's where I get stuck ... 
  
  My assumption is that after the #RST, that the symbios card will sit
  there, dumb and stupid, with no scripts running.  But sometimes I find 
  that the card has done something to make the PCI error hardware trip
  again.  Typically, this means that the card attempted to DMA to some
  address that it's not allowed to touch, or raised #SERR or possibly 
  #PERR (I can't tell which). 
 
 What config registers are you restoring? 

BARs, grant, latency, interrupt, cacheline size. 
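
For reference, the registers in that list sit at fixed offsets in the
standard PCI type 0 configuration header. Below is a minimal sketch of a
selective save/restore over a 64-byte header image. It is a userspace model
only: the function and macro names are made up for illustration, and reading
"grant" as the Min_Gnt register is an assumption. In a kernel driver,
pci_save_state() and pci_restore_state() would normally be used instead.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_HDR_SIZE        64    /* standard configuration header */
#define CFG_CACHE_LINE_SIZE 0x0c
#define CFG_LATENCY_TIMER   0x0d
#define CFG_BAR0            0x10  /* six 32-bit BARs: 0x10..0x27 */
#define CFG_INTERRUPT_LINE  0x3c
#define CFG_MIN_GNT         0x3e  /* assumed meaning of "grant" */

struct cfg_range {
    uint8_t off;    /* byte offset into the header */
    uint8_t len;    /* number of bytes to copy */
};

/* The registers named in the mail, as byte ranges. */
const struct cfg_range cfg_saved[] = {
    { CFG_CACHE_LINE_SIZE,  1 },
    { CFG_LATENCY_TIMER,    1 },
    { CFG_BAR0,            24 },
    { CFG_INTERRUPT_LINE,   1 },
    { CFG_MIN_GNT,          1 },
};
#define CFG_NSAVED (sizeof(cfg_saved) / sizeof(cfg_saved[0]))

/* Copy only the listed ranges from src to dst. */
static void cfg_copy_ranges(uint8_t *dst, const uint8_t *src)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < CFG_NSAVED; i++)
        memcpy(dst + cfg_saved[i].off, src + cfg_saved[i].off,
               cfg_saved[i].len);
}

/* Snapshot the listed registers; everything else in snap is zeroed. */
void cfg_save(const uint8_t *cfg, uint8_t *snap)
{
    assert(cfg != NULL && snap != NULL);
    memset(snap, 0, CFG_HDR_SIZE);
    cfg_copy_ranges(snap, cfg);
}

/* Write the listed registers back; other bytes of cfg are left alone. */
void cfg_restore(uint8_t *cfg, const uint8_t *snap)
{
    cfg_copy_ranges(cfg, snap);
}
```

Anything outside the listed ranges is left as the reset put it, which makes
it easy to see which registers a given restore list does not cover.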

 Is it possible symbios does not
 like something in your config restore?

possibly...

 Another possibility is that asserting PCI reset is not cleanly resetting
 the card. Does PCI reset force BIST to be run on these cards? You could
 try to manually run BIST on the card after the PCI reset to see if that

I didn't see bist in the code, but I wasn't looking for it either.  I
could try that.

 helps, or you could try power cycling the slot instead of using PCI reset.

yes I could :(  I'll try that next.  Problem is, not all slots are
power-cyclable, only the hotplug slots are.  I've discovered that 
for example, the ethernet chips are soldered to the motherboard, and
can't be power-cycled (but fortunately, those don't give me trouble).


--linas


Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-21 Thread Linas Vepstas

Hi,

There has been a running thread for a while on several mailing lists 
concerning PCI bus error recovery.  Very briefly, some architectures
have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
also new PCI-Express chips from Intel (and other vendors) and possibly
pa-risc and others).  

I've been trying to prototype  error recovery.  I currently have
ethernet and the IPR scsi driver working, but I am having trouble with 
the symbios driver.  I need help/advice ... 

On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
 On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
  I also want to do the symbios driver...
 
 FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.


My current hardware will halt all i/o to/from the symbios controller
upon detection of a PCI error.  The recovery procedure that I am
currently using is to call system firmware (aka 'bios') to raise
and then lower the #RST pci signal line for 1/4 second, then wait 2
seconds for the  PCI bus to settle, then restore the PCI config space
registers (BARs, interrupt line, etc) to what they used to be. Then,
I call sym_start_up() in an attempt to get the symbios card working
again.  And that's where I get stuck ... 
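
As a rough sketch of the sequence described above, the ordering and delays
can be written down as a small table-driven plan. This is a userspace model
with made-up names, not sym2 or kernel code; the hooks that would call
firmware, restore config space, or call sym_start_up() are left to the
caller, and a dry-run backend is included so the plan can be exercised
without hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step names; they only label the stages from the mail. */
enum step {
    STEP_ASSERT_RST,     /* firmware call: raise #RST */
    STEP_DEASSERT_RST,   /* firmware call: lower #RST */
    STEP_RESTORE_CFG,    /* write back saved config space registers */
    STEP_START_UP,       /* restart the driver, e.g. sym_start_up() */
    STEP_COUNT
};

struct step_desc {
    enum step id;
    unsigned delay_after_ms;   /* wait after the step completes */
};

/* Hold #RST for 1/4 second, then let the bus settle for 2 seconds. */
const struct step_desc recovery_plan[STEP_COUNT] = {
    { STEP_ASSERT_RST,    250 },
    { STEP_DEASSERT_RST, 2000 },
    { STEP_RESTORE_CFG,     0 },
    { STEP_START_UP,        0 },
};

/* Caller-supplied backend: do_step returns 0 on success. */
struct recovery_ops {
    int (*do_step)(void *ctx, enum step id);
    void (*sleep_ms)(void *ctx, unsigned ms);
    void *ctx;
};

/* Run the plan in order and stop at the first failing step.
 * Returns the number of steps completed; STEP_COUNT means all of them. */
int run_recovery_plan(const struct recovery_ops *ops)
{
    int i;

    assert(ops != NULL && ops->do_step != NULL && ops->sleep_ms != NULL);
    for (i = 0; i < STEP_COUNT; i++) {
        if (ops->do_step(ops->ctx, recovery_plan[i].id) != 0)
            break;
        if (recovery_plan[i].delay_after_ms != 0)
            ops->sleep_ms(ops->ctx, recovery_plan[i].delay_after_ms);
    }
    return i;
}

/* Dry-run backend: records what happened instead of touching hardware. */
struct dry_run {
    int steps_done;
    unsigned slept_ms;
    int fail_at;         /* step id that should fail, or -1 for none */
};

int dry_do_step(void *ctx, enum step id)
{
    struct dry_run *d = ctx;

    if ((int)id == d->fail_at)
        return -1;
    d->steps_done++;
    return 0;
}

void dry_sleep_ms(void *ctx, unsigned ms)
{
    struct dry_run *d = ctx;

    d->slept_ms += ms;
}
```

Logging the index returned by run_recovery_plan() would show which stage
the card was in when the PCI error tripped again.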

My assumption is that after the #RST, that the symbios card will sit
there, dumb and stupid, with no scripts running.  But sometimes I find 
that the card has done something to make the PCI error hardware trip
again.  Typically, this means that the card attempted to DMA to some
address that it's not allowed to touch, or raised #SERR or possibly 
#PERR (I can't tell which). 

Sometimes, I get the PCI error while the card is sitting there idly
after the #RST, but more often, I get the error in sym_chip_reset(),
immediately after the   OUTB (nc_istat, SRST);

Any clue what this is about? Am I missing something? I'm rather
perplexed at this point, any clues/hints/suggestions are welcome.

--linas
