from:"Linas Vepstas"

Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-10-04 Thread Linas Vepstas

On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:
> 
> The thing to remember is that sym2 is in transition from being a dual
> BSD/Linux driver to being a purely Linux driver. 

I was wondering about that; couldn't tell if the split in the code
was historical, or being intentionally maintained.

> > My gut instinct is to say "ack", although prudence dictates that 
> > I should test first. Which might take a few days...
> 
> Fine by me.  

I tested the patch, it worked great. It also seemed to recover 
much more quickly -- so quickly, in fact, that I thought something 
had gone wrong.

I reviewed it one more time, and it really does look good. A formal
submission and Acked-by's at your earliest convenience would be good. 

--linas

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 02/17] PCI Error Recovery: Symbios SCSI base support

2007-10-04 Thread Linas Vepstas

On Tue, Oct 02, 2007 at 03:49:26PM -0600, Matthew Wilcox wrote:
> On Tue, Oct 02, 2007 at 02:38:00PM -0700, [EMAIL PROTECTED] wrote:
> > From: Linas Vepstas <[EMAIL PROTECTED]>
> > 
> > Various PCI bus errors can be signaled by newer PCI controllers.  This
> > patch adds the PCI error recovery callbacks to the Symbios SCSI device
> > driver.  The patch has been tested, and appears to work well.
> 
> Linas and I have been discussing the problems with this patch.  I think
> we have a solution; we certainly have something in my tree that's
> acceptable to me; he'd just like to test it before it's unleashed on the
> world.

Matthew, your fix was a patch on top of my patch ... I assume you want
to submit it that way, instead of reworking this patch?  

Anyway, I finally got a chance to run it yesterday, and it worked fine.  
I'll try to make final comments in the other thread.

--linas

Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-10-02 Thread Linas Vepstas

On Mon, Oct 01, 2007 at 07:27:30PM -0600, Matthew Wilcox wrote:
> 
> Fine by me.  Do you have the ability to produce failures on a whim on
> your platforms?  

Yes, although it is very platform specific -- there are actual
transistors in the PCI bridge chip that short out lines, so from
the point of view of the rest of the chip, it really does see a
"real" error. It's supposed to be a very realistic test.

> I've been vaguely musing a PCI device failure patch for
> x86, just so people can test driver failure paths.

That would be good ... I've recently agreed to accept a fedex
to test someone else's card for them, which is outside my usual
activities.

There's also supposed to be some PCI-X riser card out there, 
(never seen one) which has the ability to inject actual pci 
errors. It's the Agilent PCI BestX card; I got the impression 
they might not sell it anymore; dunno.

One guy in the lab used to brush a grounding strap across
the pins; this usually got a rise out of the audience.

--linas


Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-10-01 Thread Linas Vepstas

On Mon, Oct 01, 2007 at 02:12:47PM -0600, Matthew Wilcox wrote:
> 
> I think the fundamental problem is that completions aren't really
> supposed to be used like this.  Here's one attempt at using completions
> perhaps a little more the way they're supposed to be used, 

Yes, that looks very good to me.  I see it solves a bug that
I hadn't been quite aware of. I don't understand why 
struct host_data is preferable to struct sym_shcb (is it because 
this is the structure that is "naturally protected" by the 
spinlock?)

My gut instinct is to say "ack", although prudence dictates that 
I should test first. Which might take a few days...

> although now
> I've written it, I wonder if we shouldn't just use a waitqueue instead.

I thought that earlier versions of the driver used waitqueues (I vaguely
remember "eh_wait" in the code), which were later converted to 
completions (I also vaguely recall thinking that the new code was
more elegant/simpler). I converted my patch to use the completions 
likewise, and, as you've clearly shown, did a rather sloppy job in 
the conversion.

I'm tempted to go with this patch; but if you prod, I could attempt
a wait-queue based patch.

--linas


Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-09-27 Thread Linas Vepstas

On Thu, Sep 27, 2007 at 04:10:31PM -0600, Matthew Wilcox wrote:
> In the error handler, we wait_for_completion(io_reset_wait).
> In sym2_io_error_detected, we init_completion(io_reset_wait).
> Isn't it possible that we hit the error handler before we hit the
> io_error_detected path, and thus the completion wait is lost?
> Since the completion is already initialised in sym_attach(), I don't
> think we need to initialise it in sym2_io_error_detected().
> Makes sense to just delete it?

Good catch. But no ... and I had to study this a bit. Bear with me:

It is enough to call init_completion() once, and not once per use:
it initializes spinlocks, which shouldn't be initialized twice. 

But, that completion might be used multiple times when there are
multiple errors, and so, before using it a second time, one must 
set completion->done = 0.  The INIT_COMPLETION() macro does this. 

One must have completion->done = 0 before every use, as otherwise, 
wait_for_completion() won't actually wait. And since complete_all()
sets x->done += UINT_MAX/2, I'm pretty sure x->done won't be zero
the next time we use it, unless we make it so.
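
A minimal user-space sketch of the counter behaviour described above.
It is not kernel code: struct toy_completion and the toy_* helpers are
hypothetical stand-ins that model only the done field, with no
spinlock and no wait queue.

```c
#include <assert.h>
#include <limits.h>

/* Toy stand-in for struct completion: only the counter is modelled. */
struct toy_completion {
	unsigned int done;
};

/* Models the effect of INIT_COMPLETION(): clear the counter for reuse. */
static void toy_reinit(struct toy_completion *x)
{
	x->done = 0;
}

/* Models complete_all(): bump the counter far enough to release every waiter. */
static void toy_complete_all(struct toy_completion *x)
{
	x->done += UINT_MAX / 2;
}

/*
 * Models the decision taken by wait_for_completion(): returns 1 if the
 * caller would have to sleep, 0 if it would return at once. A caller
 * that does not sleep consumes one count.
 */
static int toy_would_block(struct toy_completion *x)
{
	if (x->done == 0)
		return 1;
	x->done--;
	return 0;
}
```

In this model, a waiter that arrives after toy_complete_all() never
blocks again until toy_reinit() is called, which is the reason the
counter has to be cleared before the second and later errors.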

So I need to find a place to safely call INIT_COMPLETION() again, 
after the completion has been used. At the moment, I'm stumped
as to where to do this. 

 [think ... think ... think] 

I think the race you describe above is harmless. The first time
that sym_eh_handler() will run, it will be with SYM_EH_ABORT, 
and it doesn't matter if we lose that, since the device is hosed
anyway. At some later time, it will run with SYM_EH_DEVICE_RESET
and then SYM_EH_BUS_RESET and then SYM_EH_HOST_RESET, and we won't 
miss those, since, by now, sym2_io_error_detected() will have run.

So, by my reading, I'd say that init_completion() in
sym2_io_error_detected() has to stay (although perhaps
it should be replaced by the INIT_COMPLETION() macro.)
Removing it will prevent correct operation on the second 
and subsequent errors.

--Linas


Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-09-27 Thread Linas Vepstas

On Wed, Sep 26, 2007 at 09:02:16AM -0600, Matthew Wilcox wrote:
> On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:
> > Implement the so-called "first failure data capture" (FFDC) for the
> > symbios PCI error recovery.  After a PCI error event is reported,
> > the driver requests that MMIO be enabled. Once enabled, it 
> > then reads and dumps assorted status registers, and concludes
> > by requesting the usual reset sequence.
> 
> > +   /* Request that MMIO be enabled, so register dump can be taken. */
> > +   return PCI_ERS_RESULT_CAN_RECOVER;
> > +}
> 
> I'm a little concerned by the mention of MMIO.  It's entirely possible
> for the sym2 driver to be using ioports to access the card rather than
> MMIO.  Is it simply that it can't on the platform you test on?

The comment is misleading. I've been in the bad habit of calling
it "mmio" whenever it's not DMA.

The habit is because there are two distinct enable bits in the 
pci-host bridge during error recovery: one to enable mmio/ioports, 
and the other to enable DMA. If the adapter has gone crazy, I don't 
want to enable DMA, so that it doesn't scribble to bad places. But, 
by enabling mmio/ioports, perhaps it can be finessed back into a 
semi-sane state, e.g. sane enough to perform a dump of its internal
state.
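
A rough user-space sketch of the callback order implied here, with
the two enables kept separate. Everything named toy_* or drv_* is
hypothetical; the real callbacks live in struct pci_error_handlers,
and the assumption that a reset slot comes back with both enables on
is mine, not taken from the patch.

```c
#include <assert.h>
#include <string.h>

/* Toy verdicts, loosely mirroring the kernel's pci_ers_result_t values. */
enum toy_result { TOY_CAN_RECOVER, TOY_NEED_RESET, TOY_RECOVERED, TOY_DISCONNECT };

/* The two independent bridge enables described above. */
struct toy_slot {
	int pio_enabled;   /* mmio/ioport register access allowed */
	int dma_enabled;   /* device-initiated DMA allowed */
	char trace[64];    /* order in which the callbacks ran */
};

struct toy_handlers {
	enum toy_result (*error_detected)(struct toy_slot *);
	enum toy_result (*mmio_enabled)(struct toy_slot *);
	enum toy_result (*slot_reset)(struct toy_slot *);
	void (*resume)(struct toy_slot *);
};

/* Append one step marker to the trace, never overflowing the buffer. */
static void toy_note(struct toy_slot *s, const char *step)
{
	strncat(s->trace, step, sizeof(s->trace) - strlen(s->trace) - 1);
}

static enum toy_result drv_error_detected(struct toy_slot *s)
{
	toy_note(s, "D");
	return TOY_CAN_RECOVER;          /* ask for register access first */
}

static enum toy_result drv_mmio_enabled(struct toy_slot *s)
{
	/* The register dump is only wanted while DMA is still fenced off. */
	toy_note(s, (s->pio_enabled && !s->dma_enabled) ? "M" : "!");
	return TOY_NEED_RESET;
}

static enum toy_result drv_slot_reset(struct toy_slot *s)
{
	toy_note(s, "R");
	return TOY_RECOVERED;
}

static void drv_resume(struct toy_slot *s)
{
	toy_note(s, "U");
}

/* Hypothetical recovery core: walks the callbacks in the order described. */
static enum toy_result toy_recover(struct toy_slot *s, const struct toy_handlers *h)
{
	enum toy_result r;

	s->pio_enabled = 0;              /* slot is frozen: nothing gets through */
	s->dma_enabled = 0;
	r = h->error_detected(s);
	if (r == TOY_CAN_RECOVER) {
		s->pio_enabled = 1;      /* registers only; DMA stays off */
		r = h->mmio_enabled(s);
	}
	if (r == TOY_NEED_RESET) {
		s->pio_enabled = 1;      /* assumed: reset re-enables both */
		s->dma_enabled = 1;
		r = h->slot_reset(s);
	}
	if (r == TOY_RECOVERED)
		h->resume(s);
	return r;
}

static const struct toy_handlers toy_sym2_handlers = {
	.error_detected = drv_error_detected,
	.mmio_enabled   = drv_mmio_enabled,
	.slot_reset     = drv_slot_reset,
	.resume         = drv_resume,
};
```

The point of the model is only the ordering: the dump step runs with
register access on and DMA off, and the reset comes afterwards.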

--linas

Re: [PATCH]: PCI Error Recovery: Symbios SCSI device driver

2007-08-02 Thread Linas Vepstas

On Thu, Jul 05, 2007 at 12:54:06PM -0600, Matthew Wilcox wrote:
> On Thu, Jul 05, 2007 at 11:28:38AM -0700, Andrew Morton wrote:
> > Well you've sent it a couple of times, and I've sent it in five more times
> > over the past year.  Once we were told "awaiting maintainer ack".
> > 
> > This situation is fairly stupid.  How about we make you the maintainer?
> 
> Last time I looked at it, I still wasn't comfortable with it.  I'm going
> to look at it again.

Please do. It's burning the proverbial hole in my pocket; I'd really
like to get this off my list of things I worry about.

> I'm fairly sure Linas doesn't want to be the sym2 maintainer.  It's
> still an ugly pile of junk that needs cleaning up.

Heh. I have no difficulty living with ugly code: it's actually a 
great excuse to fix things instead of doing "real work" :-)

Rather, the menagerie of hardware I have access to is constantly 
changing; I don't have a symbios card just right now, and it might 
take a few days to even find someone who does.  Hunting for one is an
incredibly unpleasant, unrewarding activity.

--linas


EDAC & PCI error recovery (was Re: [PATCH 2/2] powerpc: MPC85xx EDAC device driver)

2007-08-01 Thread Linas Vepstas

On Mon, Jul 30, 2007 at 03:47:05PM -0700, Doug Thompson wrote:
> 
> --- Linas Vepstas <[EMAIL PROTECTED]> wrote:
> > Also: please note that the linux kernel has a pci error recovery
> > mechanism built in; it's used by pseries and PCI-E. I'm not clear
> > on what any of this has to do with EDAC, which I thought was supposed 
> > to be for RAM only. (The EDAC project once talked about doing pci error 
> > recovery, but that was years ago, and there is a separate system for
> > that, now.)
> 
> no, edac can/does harvest PCI bus errors, via polling and other hardware 
> error detectors.

Ehh! I had no idea. A few years ago, when I was working on the PCI error
recovery, I sent a number of emails to the various EDAC people and mailing 
lists that I could find, and never got a response.  I assumed the
project was dead. I guess it's not ... 

> But at the current time, few PCI device drivers initialize those callback
> functions and thus errors are lost and some IO transactions fail.

There are patches for 6 drivers in mainline (e100, e1000, ixgb, s2io,
ipr, lpfc), and two more pending (sym53cxxx, tg3).  So far, I've written 
all of them. 

> Over time, as drivers get updated (might take some time) then drivers
> can take some sort of action FOR THEMSELVES

I think I need to do more to raise awareness and interest.

> Yet, there is no tracking of errors - except for a log message in the log
> file.
> 
> There is NO meter on frequency of errors, etc. One must grep the log file
> and that is not a very cycle friendly mechanism.

Yeah, there was low interest in stats. There's a core set of stats in
/proc/ppc64/eeh, but these are clearly arch-specific. I'd like to move
away from those.  Some recent patches added stats to the /sys tree,
under the individual pci bridge and device nodes.  Again, these are
arch-specific; I'd like to move to some general/standardized presentation.

> The reason I added PCI parity/error device scanning, was that when I was at
> Linux Networx, we had parity errors on the PCI-X bus, but didn't know the
> cause.  After we discovered that a simple PCI-X riser card had manufacturing
> problems (quality) and didn't drive lines properly, it caused parity errors.

Heh. Not unusual. I've seen/heard of cases with voltages being low,
and/or ground-bounce in slots near the end. There's a whole zoo of
hardware/firmware bugs that we've had to painfully crawl through and
fix. That's why the IBM boxes cost big $$$; here's to hoping that 
customers understand why.

> This feature allowed us to track nodes that were having parity problems, but
> we had no METER to know it.
> 
> Recovery is a good thing, BUT how do you know you having LOTS of
> errors/recovery events? You need a meter. EDAC provides that METER

I'm lazy. What source code should I be looking at?  I'm concerned about
duplication of function and proliferation of interfaces. I've got my 
metering data under (for example)
/sys/bus/pci/devices/0001:c0:01.0/eeh_*, mostly very arch specific.
The code for this is in arch/powerpc/platforms/pseries/eeh_sysfs.c

> I met with Yanmin Zhang of Intel at OLS after his paper presentation on PCI
> Express Advanced Error Reporting in the Kernel, and we talked about this same
> thing. I am talking with him on having the recovery code present information
> into EDAC sysfs area. (hopefully, anyway)

Hmm. OK, where's that?  Back when, I'd talked to Yanmin about coming up 
with a generic, arch-indep way of driving the recovery routines. But
this wasn't exactly easy, and we were still grappling with just getting
things working.  Now that things are working, it's time to broaden
horizons.

Can you point me to the current edac code?
find . -print |grep edac is not particularly revealing at the moment.

> The recovery generates log messages BUT having to periodically 'grep' the log
> file looking for errors is not a good use of CPU cycles. grep once for a count
> and then grep later for a count and then compare the counts for a delta count
> per unit time. ugly.

Yep. Maybe send events up to udev?
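
For illustration only, the "delta count per unit time" that the quoted
text wants can be had from two samples of a cumulative counter, with
no log grepping. The struct and function names below are hypothetical
and do not correspond to any EDAC or EEH interface.

```c
#include <assert.h>
#include <stdint.h>

/* One reading of a cumulative error counter. */
struct toy_sample {
	uint32_t count;     /* error events seen so far */
	uint32_t seconds;   /* time at which the reading was taken */
};

/*
 * Events per hour between two readings. Unsigned subtraction keeps the
 * delta correct even if the 32-bit counter wrapped once in between.
 * Returns 0 when no time has elapsed, to avoid dividing by zero.
 */
static uint32_t toy_errors_per_hour(struct toy_sample earlier,
				    struct toy_sample later)
{
	uint32_t events = later.count - earlier.count;
	uint32_t elapsed = later.seconds - earlier.seconds;

	if (elapsed == 0)
		return 0;
	return (uint32_t)(((uint64_t)events * 3600u) / elapsed);
}
```

A listener that wakes on an event and keeps the previous sample around
would only need this one subtraction per wakeup.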

> The EDAC solution is to be able to have a Listener thread in user space that
> can be notified (via poll()) that an event has occurred.

Hmm. OK, I'm alarmingly naive about udev, but my initial gut instinct is
to pipe all such events to udev. Most of user-space has already been
given the marching orders to use udev and/or hal for this kind of stuff.
So this makes sense to me.

> There are more than one consumer (error recover) of error events:
> 1) driver recovery after a transaction (which is the recovery consumer above)

I had to argue loudly for recovery in the kernel. The problem was that
it was impossible to recover

[PATCH]: PCI Error Recovery: Symbios SCSI device driver

2007-07-02 Thread Linas Vepstas


Various PCI bus errors can be signaled by newer PCI controllers.  
This patch adds the PCI error recovery callbacks to the Symbios 
SCSI device driver.  The patch has been tested, and appears to 
work well.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>



Hi,

This patch has been bouncing around for a long time, and has made
appearances in various -mm trees since 2.6.something-teen. However,
it has never made it into mainline, and I'm starting to get concerned
that it will miss 2.6.23 as well. 

There was some discussion, and I think I addressed all of the various
issues that came up. I'd really like to get this patch in, but am unclear
on exactly who to pester at this point. Matt Wilcox seems to be looking 
for a job (???) and I am unable to git-clone James Bottomley's 
git://kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6.git
git tree; there's some error on the server side.

Linas.

 drivers/scsi/sym53c8xx_2/sym_glue.c |  136 
 drivers/scsi/sym53c8xx_2/sym_glue.h |4 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |6 +
 3 files changed, 146 insertions(+)

Index: linux-2.6.22-rc1/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.22-rc1.orig/drivers/scsi/sym53c8xx_2/sym_glue.c   2007-04-25 22:08:32.0 -0500
+++ linux-2.6.22-rc1/drivers/scsi/sym53c8xx_2/sym_glue.c        2007-05-14 17:31:44.0 -0500
@@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
unsigned long flags;
struct sym_hcb *np = (struct sym_hcb *)dev_id;
 
+   /* Avoid spinloop trying to handle interrupts on frozen device */
+   if (pci_channel_offline(np->s.device))
+   return IRQ_HANDLED;
+
if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("[");
 
spin_lock_irqsave(np->s.host->host_lock, flags);
@@ -726,6 +730,20 @@ static int sym_eh_handler(int op, char *
 
dev_warn(&cmd->device->sdev_gendev, "%s operation started.\n", opname);
 
+   /* We may be in an error condition because the PCI bus
+* went down. In this case, we need to wait until the
+* PCI bus is reset, the card is reset, and only then
+* proceed with the scsi error recovery.  There's no
+* point in hurrying; take a leisurely wait.
+*/
+#define WAIT_FOR_PCI_RECOVERY  35
+   if (pci_channel_offline(np->s.device)) {
+   int finished_reset = wait_for_completion_timeout(
+   &np->s.io_reset_wait, WAIT_FOR_PCI_RECOVERY*HZ);
+   if (!finished_reset)
+   return SCSI_FAILED;
+   }
+
spin_lock_irq(host->host_lock);
/* This one is queued in some place -> to wait for completion */
FOR_EACH_QUEUED_ELEMENT(&np->busy_ccbq, qp) {
@@ -1510,6 +1528,7 @@ static struct Scsi_Host * __devinit sym_
np->maxoffs = dev->chip.offset_max;
np->maxburst= dev->chip.burst_max;
np->myaddr  = dev->host_id;
+   init_completion(&np->s.io_reset_wait);
 
/*
 *  Edit its name.
@@ -1948,6 +1967,116 @@ static void __devexit sym2_remove(struct
attach_count--;
 }
 
+/**
+ * sym2_io_error_detected() -- called when PCI error is detected
+ * @pdev: pointer to PCI device
+ * @state: current state of the PCI slot
+ */
+static pci_ers_result_t sym2_io_error_detected(struct pci_dev *pdev,
+ enum pci_channel_state state)
+{
+   struct sym_hcb *np = pci_get_drvdata(pdev);
+
+   /* If slot is permanently frozen, turn everything off */
+   if (state == pci_channel_io_perm_failure) {
+   sym2_remove(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   init_completion(&np->s.io_reset_wait);
+   disable_irq(pdev->irq);
+   pci_disable_device(pdev);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * sym2_reset_workarounds -- hardware-specific work-arounds
+ *
+ * This routine is similar to sym_set_workarounds(), except
+ * that, at this point, we already know that the device was
+ * successfully initialized at least once before, and so most
+ * of the steps taken there are un-needed here.
+ */
+static void sym2_reset_workarounds(struct pci_dev *pdev)
+{
+   u_char revision;
+   u_short status_reg;
+   struct sym_chip *chip;
+
+   pci_read_config_byte(pdev, PCI_CLASS_REVISION, &revision);
+   chip = sym_lookup_chip_table(pdev->device, revision);
+
+   /* Work around for errant bit in 895A, in a fashion
+* similar to what is done in sym_set_workarounds().
+*/
+   pci_read_config_word(pdev, PCI_STATUS, &status_reg);
+   if (!(chip->features & FE_66MHZ) && (status_reg & PCI_STATUS_66MHZ)) {
+

Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-05-17 Thread Linas Vepstas

On Wed, May 09, 2007 at 03:26:21PM -0500, Linas Vepstas wrote:
> Hi Matthew,
> 
> I had been hoping these patches might make it into 2.6.22,
> ... this is a nag note; please forward upstream.


... should I repost the patches? 

--linas 


Re: [PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-05-09 Thread Linas Vepstas

Hi Matthew,

I had been hoping these patches might make it into 2.6.22,
... this is a nag note; please forward upstream.

--linas

On Fri, Apr 20, 2007 at 03:47:20PM -0500, Linas Vepstas wrote:
> 
> Implement the so-called "first failure data capture" (FFDC) for the
> symbios PCI error recovery.  After a PCI error event is reported,
> the driver requests that MMIO be enabled. Once enabled, it 
> then reads and dumps assorted status registers, and concludes
> by requesting the usual reset sequence.
> 
> (includes a whitespace fix for bad indentation).
> 
> Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>

[PATCH 2/2]: PCI Error Recovery: Symbios SCSI First Failure

2007-04-20 Thread Linas Vepstas


Implement the so-called "first failure data capture" (FFDC) for the
symbios PCI error recovery.  After a PCI error event is reported,
the driver requests that MMIO be enabled. Once enabled, it 
then reads and dumps assorted status registers, and concludes
by requesting the usual reset sequence.

(includes a whitespace fix for bad indentation).

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>


 drivers/scsi/sym53c8xx_2/sym_glue.c |   15 +++
 drivers/scsi/sym53c8xx_2/sym_glue.h |1 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |   18 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.c  2007-04-20 12:52:01.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c       2007-04-20 15:25:35.0 -0500
@@ -1987,6 +1987,20 @@ static pci_ers_result_t sym2_io_error_de
disable_irq(pdev->irq);
pci_disable_device(pdev);
 
+   /* Request that MMIO be enabled, so register dump can be taken. */
+   return PCI_ERS_RESULT_CAN_RECOVER;
+}
+
+/**
+ * sym2_io_slot_dump -- Enable MMIO and dump debug registers
+ * @pdev: pointer to PCI device
+ */
+static pci_ers_result_t sym2_io_slot_dump (struct pci_dev *pdev)
+{
+   struct sym_hcb *np = pci_get_drvdata(pdev);
+
+   sym_dump_registers(np);
+
/* Request a slot reset. */
return PCI_ERS_RESULT_NEED_RESET;
 }
@@ -2241,6 +2255,7 @@ MODULE_DEVICE_TABLE(pci, sym2_id_table);
 
 static struct pci_error_handlers sym2_err_handler = {
.error_detected = sym2_io_error_detected,
+   .mmio_enabled = sym2_io_slot_dump,
.slot_reset = sym2_io_slot_reset,
.resume = sym2_io_resume,
 };
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.h  2007-04-20 12:15:07.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.h       2007-04-20 15:21:31.0 -0500
@@ -270,5 +270,6 @@ void sym_xpt_async_bus_reset(struct sym_
 void sym_xpt_async_sent_bdr(struct sym_hcb *np, int target);
 int  sym_setup_data_and_start (struct sym_hcb *np, struct scsi_cmnd *csio, struct sym_ccb *cp);
 void sym_log_bus_error(struct sym_hcb *np);
+void sym_dump_registers(struct sym_hcb *np);
 
 #endif /* SYM_GLUE_H */
Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_hipd.c  2007-04-20 12:18:59.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_hipd.c       2007-04-20 15:18:01.0 -0500
@@ -1180,10 +1180,10 @@ static void sym_log_hard_error(struct sy
scr_to_cpu((int) *(u32 *)(script_base + script_ofs)));
}
 
-printf ("%s: regdump:", sym_name(np));
-for (i=0; i<24;i++)
-printf (" %02x", (unsigned)INB_OFF(np, i));
-printf (".\n");
+   printf ("%s: regdump:", sym_name(np));
+   for (i=0; i<24;i++)
+   printf (" %02x", (unsigned)INB_OFF(np, i));
+   printf (".\n");
 
/*
 *  PCI BUS error.
@@ -1192,6 +1192,16 @@ static void sym_log_hard_error(struct sy
sym_log_bus_error(np);
 }
 
+void sym_dump_registers(struct sym_hcb *np)
+{
+   u_short sist;
+   u_char dstat;
+
+   sist = INW(np, nc_sist);
+   dstat = INB(np, nc_dstat);
+   sym_log_hard_error(np, sist, dstat);
+}
+
 static struct sym_chip sym_dev_table[] = {
  {PCI_DEVICE_ID_NCR_53C810, 0x0f, "810", 4, 8, 4, 64,
  FE_ERL}

[PATCH 1/2]: PCI Error Recovery: Symbios SCSI base support

2007-04-20 Thread Linas Vepstas



Hi Matthew,

After a long hiatus, I took another stab at pci error recovery 
for the symbios. This is very nearly the same patch as before, 
with only an update to enable MWI, and to support chip workarounds.
I think I've addressed all the other issues that came up. Thus,
again, I'll ask that the patch go in (for 2.6.22 of course).


To recap the only outstanding issue:

>> @@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
>> + /* Avoid spinloop trying to handle interrupts on frozen device */
>> + if (pci_channel_offline(np->s.device))
>> + return IRQ_HANDLED;
>
>Just wondering ... should we really be returning HANDLED?  What if the
>IRQ is shared?  Will the hardware de-assert the level interrupt when it
>puts the device in reset (ie is this a transitory glitch?), or do we
>have to cope with a screaming interrupt?

This routine *always* returns HANDLED anyway, so this patch does
not change semantics. For a symbios device plugged into a shared
irq line, this is a problem with or without my patch.

Yes, irq's will typically scream until handled. Yes, the device
reset will eventually clear the irq, assuming the system doesn't 
deadlock on a screaming irq. 
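
To make the shared-line concern concrete, here is a small user-space
model (not kernel code) of a dispatcher that counts how many handlers
claim an interrupt. All toy_* names are hypothetical. The assumption
being illustrated is that a line where nobody claims the interrupt is
how a stuck or spurious interrupt gets noticed, and that a handler
which always claims hides that signal.

```c
#include <assert.h>
#include <stddef.h>

enum toy_irqreturn { TOY_IRQ_NONE = 0, TOY_IRQ_HANDLED = 1 };

/* One device on the shared line; pending is set while it asserts the IRQ. */
struct toy_dev {
	int pending;
};

typedef enum toy_irqreturn (*toy_handler_t)(struct toy_dev *dev);

/* Claims the interrupt only if its own device raised it. */
static enum toy_irqreturn toy_checking_handler(struct toy_dev *dev)
{
	if (!dev->pending)
		return TOY_IRQ_NONE;
	dev->pending = 0;
	return TOY_IRQ_HANDLED;
}

/* Claims every interrupt, whether or not its device raised it. */
static enum toy_irqreturn toy_always_handled(struct toy_dev *dev)
{
	dev->pending = 0;
	return TOY_IRQ_HANDLED;
}

struct toy_action {
	toy_handler_t handler;
	struct toy_dev *dev;
};

/*
 * Toy dispatcher for one shared line: calls every registered handler
 * and returns how many of them claimed the interrupt. Zero means that
 * nobody recognised it.
 */
static int toy_dispatch(const struct toy_action *actions, size_t n)
{
	int claimed = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (actions[i].handler(actions[i].dev) == TOY_IRQ_HANDLED)
			claimed++;
	return claimed;
}
```

With two checking handlers and no device pending, the dispatcher sees
zero claims; swap one for the always-handled variant and the same
situation reports one claim, with or without a frozen-device check in
front of it.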

--linas

Here's the formal changelog entry:

Various PCI bus errors can be signaled by newer PCI controllers.  
This patch adds the PCI error recovery callbacks to the Symbios 
SCSI device driver.  The patch has been tested, and appears to 
work well.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>

--
 drivers/scsi/sym53c8xx_2/sym_glue.c |  136 
 drivers/scsi/sym53c8xx_2/sym_glue.h |4 +
 drivers/scsi/sym53c8xx_2/sym_hipd.c |6 +
 3 files changed, 146 insertions(+)

Index: linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c
===
--- linux-2.6.21-rc4-git4.orig/drivers/scsi/sym53c8xx_2/sym_glue.c  2007-04-20 12:07:38.0 -0500
+++ linux-2.6.21-rc4-git4/drivers/scsi/sym53c8xx_2/sym_glue.c       2007-04-20 12:52:01.0 -0500
@@ -657,6 +657,10 @@ static irqreturn_t sym53c8xx_intr(int ir
unsigned long flags;
struct sym_hcb *np = (struct sym_hcb *)dev_id;
 
+   /* Avoid spinloop trying to handle interrupts on frozen device */
+   if (pci_channel_offline(np->s.device))
+   return IRQ_HANDLED;
+
if (DEBUG_FLAGS & DEBUG_TINY) printf_debug ("[");
 
spin_lock_irqsave(np->s.host->host_lock, flags);
@@ -726,6 +730,20 @@ static int sym_eh_handler(int op, char *
 
dev_warn(&cmd->device->sdev_gendev, "%s operation started.\n", opname);
 
+   /* We may be in an error condition because the PCI bus
+* went down. In this case, we need to wait until the
+* PCI bus is reset, the card is reset, and only then
+* proceed with the scsi error recovery.  There's no
+* point in hurrying; take a leisurely wait.
+*/
+#define WAIT_FOR_PCI_RECOVERY  35
+   if (pci_channel_offline(np->s.device)) {
+   int finished_reset = wait_for_completion_timeout(
+   &np->s.io_reset_wait, WAIT_FOR_PCI_RECOVERY*HZ);
+   if (!finished_reset)
+   return SCSI_FAILED;
+   }
+
spin_lock_irq(host->host_lock);
/* This one is queued in some place -> to wait for completion */
FOR_EACH_QUEUED_ELEMENT(&np->busy_ccbq, qp) {
@@ -1510,6 +1528,7 @@ static struct Scsi_Host * __devinit sym_
np->maxoffs = dev->chip.offset_max;
np->maxburst= dev->chip.burst_max;
np->myaddr  = dev->host_id;
+   init_completion(&np->s.io_reset_wait);
 
/*
 *  Edit its name.
@@ -1948,6 +1967,116 @@ static void __devexit sym2_remove(struct
attach_count--;
 }
 
+/**
+ * sym2_io_error_detected() -- called when PCI error is detected
+ * @pdev: pointer to PCI device
+ * @state: current state of the PCI slot
+ */
+static pci_ers_result_t sym2_io_error_detected (struct pci_dev *pdev,
+ enum pci_channel_state state)
+{
+   struct sym_hcb *np = pci_get_drvdata(pdev);
+
+   /* If slot is permanently frozen, turn everything off */
+   if (state == pci_channel_io_perm_failure) {
+   sym2_remove(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   init_completion(&np->s.io_reset_wait);
+   disable_irq(pdev->irq);
+   pci_disable_device(pdev);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * sym2_reset_workarounds -- hardware-specific work-arounds
+ *
+ * This routine is similar to sym_set_workarounds(), except
+ * that, at this point, we already know that the device was 
+ * successfully initialized at least once before, a

[PATCH] lpfc: avoid double-free during PCI error failure

2007-03-08 Thread Linas Vepstas


Bino, James,
Please review, sign-off and forward upstream.

--linas


If a PCI error is detected that cannot be recovered from, there
will be a double call of lpfc_pci_remove_one(), with the second call
resulting in a null-pointer dereference. The first call occurs in 
lpfc_io_error_detected(), and the second call during pci device 
remove. This patch eliminates the first call; it's unneeded.
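
A user-space sketch of the call pattern, for illustration only. The
toy_* names are hypothetical, and the toy remove routine merely counts
calls instead of dereferencing anything, so the model can show the
double call without crashing. The assumption that the core removes the
device after a DISCONNECT verdict follows the description above.

```c
#include <assert.h>
#include <stddef.h>

enum toy_verdict { TOY_NEED_RESET, TOY_DISCONNECT };

struct toy_pdev {
	void *drvdata;       /* cleared by the first remove */
	int remove_calls;    /* how many times remove ran */
};

/* Stand-in for the driver's remove routine. */
static void toy_remove_one(struct toy_pdev *pdev)
{
	pdev->remove_calls++;
	pdev->drvdata = NULL;
}

/* Behaviour before the patch: the callback tears the device down itself. */
static enum toy_verdict toy_error_detected_old(struct toy_pdev *pdev, int perm_failure)
{
	if (perm_failure) {
		toy_remove_one(pdev);
		return TOY_DISCONNECT;
	}
	return TOY_NEED_RESET;
}

/* Behaviour after the patch: only report the verdict. */
static enum toy_verdict toy_error_detected_new(struct toy_pdev *pdev, int perm_failure)
{
	(void)pdev;
	return perm_failure ? TOY_DISCONNECT : TOY_NEED_RESET;
}

/*
 * Toy recovery core: on a DISCONNECT verdict the device is removed,
 * which invokes the driver's remove routine. Returns the total number
 * of remove calls seen for this device.
 */
static int toy_perm_failure(struct toy_pdev *pdev,
			    enum toy_verdict (*error_detected)(struct toy_pdev *, int))
{
	if (error_detected(pdev, 1) == TOY_DISCONNECT)
		toy_remove_one(pdev);
	return pdev->remove_calls;
}
```

In the model the old callback yields two remove calls for one
permanent failure and the new one yields exactly one.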

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>


 drivers/scsi/lpfc/lpfc_init.c |5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git16.orig/drivers/scsi/lpfc/lpfc_init.c   2007-03-08 15:57:40.0 -0600
+++ linux-2.6.20-git16/drivers/scsi/lpfc/lpfc_init.c        2007-03-08 16:03:18.0 -0600
@@ -1817,10 +1817,9 @@ static pci_ers_result_t lpfc_io_error_de
struct lpfc_sli *psli = &phba->sli;
struct lpfc_sli_ring  *pring;
 
-   if (state == pci_channel_io_perm_failure) {
-   lpfc_pci_remove_one(pdev);
+   if (state == pci_channel_io_perm_failure)
return PCI_ERS_RESULT_DISCONNECT;
-   }
+
pci_disable_device(pdev);
/*
 * There may be I/Os dropped by the firmware.

[PATCH] lpfc: add PCI error recovery support

2007-02-14 Thread Linas Vepstas


James,

Please review and forward upstream.  This is a patch I'd previously
submitted, and reworked by [EMAIL PROTECTED] in January.
Not clear if I need to also nag James Smart (who is listed as the
maintainer) for an Acked-by (which I am led to believe should be
forthcoming? Ahh the joys of indirect communication!)

--linas

This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart <[EMAIL PROTECTED]>



 drivers/scsi/lpfc/lpfc_init.c |   97 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 109 insertions(+)

Index: linux-2.6.20-git4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-git4.orig/drivers/scsi/lpfc/lpfc_init.c    2007-02-09 17:22:30.0 -0600
+++ linux-2.6.20-git4/drivers/scsi/lpfc/lpfc_init.c         2007-02-14 14:12:22.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
struct lpfc_sli *psli = &phba->sli;
struct lpfc_sli_ring  *pring;
uint32_t event_data;
+   /* If the pci channel is offline, ignore possible errors,
+* since we cannot communicate with the pci card anyway. */
+   if (pci_channel_offline(phba->pcidev))
+   return;
 
if (phba->work_hs & HS_FFER6 ||
phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,92 @@ lpfc_pci_remove_one(struct pci_dev *pdev
pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+   pci_channel_state_t state)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   struct lpfc_sli_ring  *pring;
+
+   if (state == pci_channel_io_perm_failure) {
+   lpfc_pci_remove_one(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+   pci_disable_device(pdev);
+   /*
+* There may be I/Os dropped by the firmware.
+* Error iocb (I/O) on txcmplq and let the SCSI layer
+* retry it after re-establishing link.
+*/
+   pring = &psli->ring[psli->fcp_ring];
+   lpfc_sli_abort_iocb_ring(phba, pring);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   int bars = pci_select_bars(pdev, IORESOURCE_MEM);
+
+   dev_printk(KERN_INFO, &pdev->dev, "recovering from a slot reset.\n");
+   if (pci_enable_device_bars(pdev, bars)) {
+   printk(KERN_ERR "lpfc: Cannot re-enable "
+   "PCI device after reset.\n");
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   pci_set_master(pdev);
+
+   /* Re-establishing Link */
+   spin_lock_irq(phba->host->host_lock);
+   phba->fc_flag |= FC_ESTABLISH_LINK;
+   psli->sli_flag &= ~LPFC_SLI2_ACTIVE;
+   spin_unlock_irq(phba->host->host_lock);
+
+
+   /* Take device offline; this will perform cleanup */
+   lpfc_offline(phba);
+   lpfc_sli_brdrestart(phba);
+
+   return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * lpfc_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation.
+ */
+static void lpfc_io_resume(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+
+   if (lpfc_online(phba) == 0) {
+   mod_timer(&phba->fc_estabtmo, jiffies + HZ * 60);
+   }
+}
+
 static struct pci_device_id lpfc_id_table[] = {
{PCI_VENDOR_ID_EMULEX, PCI_DEVICE_ID_VIPER,
PCI_ANY_ID, PCI_ANY_ID, },
@@ -1857,11 +1947,18 @@ static struct pci_device_id lpfc_id_tabl
 
 MODULE_DEVICE_TABLE(pci, lpfc_id_table);
 
+static struct pci_error_handlers lpfc_err_handler = {
+   .error_detected = lpf

[PATCH] adjust use of unplug in elevator code

2007-01-15 Thread Linas Vepstas


Hi Chris, Jens,
Can you look at this, and push upstream if this looks reasonable
to you? It fixes a bug I've been tripping over.

--linas


A flag was recently added to the elevator code to avoid
performing an unplug when requests are being re-queued.
The goal of this flag was to avoid a deep recursion that
can occur when re-queueing requests after a SCSI device/host
reset.  See http://lkml.org/lkml/2006/5/17/254
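
A toy model of that recursion, assuming nothing about the real block layer:
toy_request_fn and toy_requeue are hypothetical names, and the device is simply
"busy" for a fixed number of dispatch attempts. With an unplug on every
requeue, each busy attempt nests one more call; with the unplug suppressed,
the nesting stays flat and the retry happens from a later queue run.

```c
#include <assert.h>

struct toy_queue {
	int busy_left;   /* how many more dispatch attempts will hit "busy" */
	int depth;       /* current nesting of toy_request_fn */
	int max_depth;   /* deepest nesting seen so far */
	int pending;     /* requeued requests waiting for a later queue run */
};

static void toy_request_fn(struct toy_queue *q, int unplug_on_requeue);

/* Requeue a request; optionally kick the queue right away (the "unplug"). */
static void toy_requeue(struct toy_queue *q, int unplug_on_requeue)
{
	q->pending++;
	if (unplug_on_requeue) {
		q->pending--;   /* the nested run picks the request up again */
		toy_request_fn(q, unplug_on_requeue);
	}
}

/* Try to dispatch one request; if the device is busy, requeue it. */
static void toy_request_fn(struct toy_queue *q, int unplug_on_requeue)
{
	q->depth++;
	if (q->depth > q->max_depth)
		q->max_depth = q->depth;
	if (q->busy_left > 0) {
		q->busy_left--;
		toy_requeue(q, unplug_on_requeue);
	}
	q->depth--;
}

/* Run until the request is dispatched; return the deepest nesting observed. */
static int toy_run(int busy_attempts, int unplug_on_requeue)
{
	struct toy_queue q = { busy_attempts, 0, 0, 0 };

	toy_request_fn(&q, unplug_on_requeue);
	while (q.pending > 0) {   /* later, non-nested queue runs */
		q.pending--;
		toy_request_fn(&q, unplug_on_requeue);
	}
	return q.max_depth;
}
```

In this model toy_run(50, 1) nests one level per busy attempt, whereas
toy_run(50, 0) never goes deeper than a single call.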

However, that fix set the flag near the bottom of a case
statement, where an earlier break (inside an if statement) could
exit the case without ever setting the flag.
This patch sets the flag earlier in the case statement.

I re-discovered the deep recursion recently during testing;
I was told that it was a known problem, and the fix to it was
in the kernel I was testing. Indeed it was ... but it didn't
fix the bug. With the patch below, I no longer see the bug.
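
The control-flow problem is easy to reproduce in isolation. The sketch below
is not the elevator code: insert_before_fix, insert_after_fix and the
simple_path parameter are hypothetical, with simple_path standing in for the
q->ordseq == 0 test seen in the diff.

```c
#include <assert.h>

enum where { WHERE_REQUEUE, WHERE_OTHER };

/* Flag cleared at the bottom of the case: the early break skips it. */
static int insert_before_fix(enum where w, int simple_path)
{
	int unplug_it = 1;

	switch (w) {
	case WHERE_REQUEUE:
		if (simple_path)
			break;          /* leaves the case with unplug_it still 1 */
		/* ... the slower path would walk the queue here ... */
		unplug_it = 0;
		break;
	default:
		break;
	}
	return unplug_it;
}

/* Flag cleared first: every exit from the case sees unplug_it == 0. */
static int insert_after_fix(enum where w, int simple_path)
{
	int unplug_it = 1;

	switch (w) {
	case WHERE_REQUEUE:
		unplug_it = 0;
		if (simple_path)
			break;
		/* ... the slower path would walk the queue here ... */
		break;
	default:
		break;
	}
	return unplug_it;
}
```

Only the requeue case is affected; other insertion modes still unplug as
before in both variants.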

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: Jens Axboe <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>


 block/elevator.c |   11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

Index: linux-2.6.20-rc4/block/elevator.c
===
--- linux-2.6.20-rc4.orig/block/elevator.c  2007-01-15 14:16:03.0 -0600
+++ linux-2.6.20-rc4/block/elevator.c   2007-01-15 14:20:04.0 -0600
@@ -590,6 +590,12 @@ void elv_insert(request_queue_t *q, stru
 */
rq->cmd_flags |= REQ_SOFTBARRIER;
 
+   /*
+* Most requeues happen because of a busy condition,
+* don't force unplug of the queue for that case.
+*/
+   unplug_it = 0;
+
if (q->ordseq == 0) {
list_add(&rq->queuelist, &q->queue_head);
break;
@@ -604,11 +610,6 @@ void elv_insert(request_queue_t *q, stru
}
 
list_add_tail(&rq->queuelist, pos);
-   /*
-* most requeues happen because of a busy condition, don't
-* force unplug of the queue for that case.
-*/
-   unplug_it = 0;
break;
 
default:

Bug: 2.6.20 scsi/block device/elevator recursion loop

2007-01-11 Thread Linas Vepstas

Hi,

On Thu, Jan 11, 2007 at 04:22:52PM -0500, [EMAIL PROTECTED] wrote:
> This patch is present in upstream and is also present 
> in 2.6.20. So this is a new issue.

What was the patch last time around? 

It seems I'm seeing this more often than expected. The first time,
the system spewed the softlockup error, but then recovered after 
a few minutes. This time, even after an hour, the system remained
hung. It was pingable, but the console, and all ssh sessions
were unresponsive.

After hitting the little yellow button, I got a stack trace
(below) in _spin_unlock_irqrestore, which makes me think that
perhaps the system was being flooded with IRQs. I'll try
to investigate further tomorrow.

--linas

Background:
kernel 2.6.20-rc4
IBM Power4 pSeries (630)
lpfc scsi (Emulex)

 chsysstate -r sys -n io-raiders  -o reset

io-raiders:~ # cpu 0x0: Vector: 100 (System Reset) at [c0003ff69520]
pc: c023d794: ._raw_spin_unlock+0xb4/0xd4
lr: c046d5ac: ._spin_unlock_irqrestore+0x18/0x3c
sp: c0003ff697a0
   msr: 90009032
  current = 0xc43e21f0
  paca= 0xc0674080
pid   = 1123, comm = kblockd/0
enter ? for help
[c0003ff69820] c046d5ac ._spin_unlock_irqrestore+0x18/0x3c
[c0003ff698b0] c021bbe0 .blk_run_queue+0xc8/0xec
[c0003ff69950] c0320728 .scsi_run_queue+0x248/0x278
[c0003ff69a00] c0321948 .scsi_queue_insert+0x88/0xa8
[c0003ff69a90] c031bc34 .scsi_dispatch_cmd+0x2b8/0x2e4
[c0003ff69b30] c0322804 .scsi_request_fn+0x2c4/0x3c0
[c0003ff69be0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff69c60] c0216d6c .elv_insert+0x240/0x268
[c0003ff69d00] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff69d90] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff69e40] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff69ec0] c0216d6c .elv_insert+0x240/0x268
[c0003ff69f60] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff69ff0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a0a0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a120] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a1c0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a250] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a300] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a380] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a420] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a4b0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a560] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a5e0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a680] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a710] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6a7c0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6a840] c0216d6c .elv_insert+0x240/0x268
[c0003ff6a8e0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6a970] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6aa20] c021bbac .blk_run_queue+0x94/0xec
[c0003ff6aac0] c0320728 .scsi_run_queue+0x248/0x278
[c0003ff6ab70] c0321948 .scsi_queue_insert+0x88/0xa8
[c0003ff6ac00] c031bc34 .scsi_dispatch_cmd+0x2b8/0x2e4
[c0003ff6aca0] c0322804 .scsi_request_fn+0x2c4/0x3c0
[c0003ff6ad50] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6add0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6ae70] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6af00] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6afb0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b030] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b0d0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b160] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b210] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b290] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b330] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b3c0] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b470] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b4f0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b590] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b620] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b6d0] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b750] c0216d6c .elv_insert+0x240/0x268
[c0003ff6b7f0] c021a25c .blk_requeue_request+0x38/0x54
[c0003ff6b880] c0322864 .scsi_request_fn+0x324/0x3c0
[c0003ff6b930] c021ae30 .__generic_unplug_device+0x54/0x6c
[c0003ff6b9b0] c0216d6c .elv_insert+0x240/0x268
[c0003ff6ba50] c021a25c .blk_requeue_request+0x38/0x54
[c0003f

Re: lpfc PCIe error recovey

2007-01-11 Thread Linas Vepstas

On Wed, Jan 10, 2007 at 04:59:39PM -0600, linas wrote:
> 
> > However, on a Power4 architecture there are errors reported
> > in upper layer (we discussed this in one of earlier emails) followed 
> > by SCSI errors.
> 
> I'm trying to investigate now.

I found two distinct power4 bugs. I posted a patch for one yesterday,
under the subject heading 

  [PATCH] Urgent: powerpc 2.6.20-rc4 dma broken on non-LPAR pseries

This affects only recent mainline kernels; it would not affect
older or distro kernels.   

The other patch is attached below.  After some more testing,
I'll submit to mainline.

--linas


Subject: [PATCH] pSeries: EEH improperly enabled for some Power4 systems

It appears that EEH is improperly enabled for some Power4 systems.
On these systems, the ibm,set-eeh-option call returns a value of success
even when EEH is not supported on the given node. Thus, an explicit
check for support is required.
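
The decision logic can be sketched outside the kernel as follows. This is an
illustration only: struct toy_node and both functions are hypothetical, and
the fields simply model the return code of the enable request, the return
code of the follow-up state query, and the "supported" word that the patch
reads from rets[1].

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what firmware reports for one device node. */
struct toy_node {
	int set_option_rc;     /* enable request: 0 means "success" */
	int read_state_rc;     /* follow-up state query: 0 means "success" */
	unsigned int support;  /* word from the state query: 1 = EEH supported */
};

/* Old behaviour: trust the enable request alone. */
static bool eeh_enabled_old(const struct toy_node *n)
{
	return n->set_option_rc == 0;
}

/* New behaviour: also require the state query to confirm support. */
static bool eeh_enabled_new(const struct toy_node *n)
{
	if (n->set_option_rc != 0)
		return false;
	return n->read_state_rc == 0 && n->support == 1;
}
```

A node that answers "success" to the enable request but reports no support
is enabled by the old check and rejected by the new one.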

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> 


 arch/powerpc/platforms/pseries/eeh.c |   19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

Index: linux-2.6.20-rc4/arch/powerpc/platforms/pseries/eeh.c
===
--- linux-2.6.20-rc4.orig/arch/powerpc/platforms/pseries/eeh.c  2007-01-11 14:15:02.0 -0600
+++ linux-2.6.20-rc4/arch/powerpc/platforms/pseries/eeh.c   2007-01-11 15:14:39.0 -0600
@@ -748,6 +748,7 @@ struct eeh_early_enable_info {
 /* Enable eeh for the given device node. */
 static void *early_enable_eeh(struct device_node *dn, void *data)
 {
+   unsigned int rets[3];
struct eeh_early_enable_info *info = data;
int ret;
const char *status = get_property(dn, "status", NULL);
@@ -804,16 +805,14 @@ static void *early_enable_eeh(struct dev
regs[0], info->buid_hi, info->buid_lo,
EEH_ENABLE);
 
+   enable = 0;
if (ret == 0) {
-   eeh_subsystem_enabled = 1;
-   pdn->eeh_mode |= EEH_MODE_SUPPORTED;
pdn->eeh_config_addr = regs[0];
 
/* If the newer, better, ibm,get-config-addr-info is supported,
 * then use that instead. */
pdn->eeh_pe_config_addr = 0;
if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
-   unsigned int rets[2];
ret = rtas_call (ibm_get_config_addr_info, 4, 2, rets,
pdn->eeh_config_addr, 
info->buid_hi, info->buid_lo,
@@ -821,6 +820,20 @@ static void *early_enable_eeh(struct dev
if (ret == 0)
pdn->eeh_pe_config_addr = rets[0];
}
+
+   /* Some older systems (Power4) allow the
+* ibm,set-eeh-option call to succeed even on nodes
+* where EEH is not supported. Verify support
+* explicitly. */
+   ret = read_slot_reset_state(pdn, rets);
+   if ((ret == 0) && (rets[1] == 1))
+   enable = 1;
+   }
+
+   if (enable) {
+   eeh_subsystem_enabled = 1;
+   pdn->eeh_mode |= EEH_MODE_SUPPORTED;
+
 #ifdef DEBUG
printk(KERN_DEBUG "EEH: %s: eeh enabled, config=%x pe_config=%x\n",
   dn->full_name, pdn->eeh_config_addr, pdn->eeh_pe_config_addr);


crash on lpfc rmmod

2007-01-10 Thread Linas Vepstas

Hi Bino,

Fiddling with the lpfc driver on 2.6.20-rc4, shortly after 
booting, I attempted to rmmod the lpfc module and got a crash:

io-raiders:~ # rmmod lpfc
cpu 0x0: Vector: 300 (Data Access) at [c003c86075a0]
pc: d08d0988: .lpfc_free_sysfs_attr+0x1c/0x58 [lpfc]
lr: d08c458c: .lpfc_pci_remove_one+0x3c/0x278 [lpfc]
sp: c003c8607820
   msr: 90009032
   dar: 11c0
 dsisr: 4000
  current = 0xc003bf4b4c80
  paca= 0xc0674080
pid   = 12977, comm = rmmod
[ 3005.329608] [ cut here ]

at which point the system locked up hard (I was expecting it to
go into xmon).

Suggestions?

--linas


Re: lpfc PCIe error recoveyr

2007-01-10 Thread Linas Vepstas

On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:
> Hi Linas,
>   Following is the latest lpfc driver patch we are testing in the 
> Emulex lab for PCI error recovery. This patch looks good on a Power5 
> platform. 

Yes, it seemed to survive a few hours of testing fine. I did see one
interesting thing, namely a softlockup. I attribute this to the fact
that I'd queued up a lot of heavy file I/O, issued a sync, which
typically takes more than a few seconds on the test system, and then
injected the artificial PCI error. After about ten seconds, I got the
softlockup, but after another 10-20 seconds, things seemed back to
normal. So I don't consider this an actual error, but thought
it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

> However, on a Power4 architecture there are errors reported
> in upper layer (we discussed this in one of earlier emails) followed 
> by SCSI errors.

I'm trying to investigate now.

The patch you sent out got garbled, so I'm reposting below.



This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart <[EMAIL PROTECTED]>



 drivers/scsi/lpfc/lpfc_init.c |   96 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c 2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c  2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
struct lpfc_sli *psli = &phba->sli;
struct lpfc_sli_ring  *pring;
uint32_t event_data;
+   /* If the pci channel is offline, ignore possible errors,
+* since we cannot communicate with the pci card anyway. */
+   if (pci_channel_offline(phba->pcidev))
+   return;
 
if (phba->work_hs & HS_FFER6 ||
phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+   pci_channel_state_t state)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   struct lpfc_sli_ring  *pring;
+
+   if (state == pci_channel_io_perm_failure) {
+   lpfc_pci_remove_one(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+   pci_disable_device(pdev);
+   /*
+* There may be I/Os dropped by the firmware.
+* Error iocb (I/O) on txcmplq and let the SCSI layer
+* retry it after re-establishing link.
+*/
+   pring = &psli->ring[psli->fcp_ring];
+   lpfc_sli_abort_iocb_ring(phba, pring);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI de

[PATCH] lpfc: add PCI error recovery support

2006-12-06 Thread Linas Vepstas


James,

Please review the patch below. Presuming that you like it,
please forward upstream.

--linas

This patch adds PCI Error recovery support to the 
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]>
Cc: James Smart <[EMAIL PROTECTED]>



 drivers/scsi/lpfc/lpfc_init.c |   91 ++
 1 file changed, 91 insertions(+)

Index: linux-2.6.19-git7/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.19-git7.orig/drivers/scsi/lpfc/lpfc_init.c 2006-12-06 13:31:39.0 -0600
+++ linux-2.6.19-git7/drivers/scsi/lpfc/lpfc_init.c 2006-12-06 13:33:49.0 -0600
@@ -517,6 +517,11 @@ lpfc_handle_eratt(struct lpfc_hba * phba
struct lpfc_sli_ring  *pring;
uint32_t event_data;
 
+   /* If the pci channel is offline, ignore possible errors,
+* since we cannot communicate with the pci card anyway. */
+   if (pci_channel_offline(phba->pcidev))
+   return;
+
if (phba->work_hs & HS_FFER6) {
/* Re-establishing Link */
lpfc_printf_log(phba, KERN_INFO, LOG_LINK_EVENT,
@@ -1825,6 +1830,85 @@ lpfc_pci_remove_one(struct pci_dev *pdev
pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci conneection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev, 
+pci_channel_state_t state)
+{
+   if (state == pci_channel_io_perm_failure) {
+   lpfc_pci_remove_one(pdev);
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+   pci_disable_device(pdev);
+
+   /* Request a slot reset. */
+   return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+   struct lpfc_sli *psli = &phba->sli;
+   struct lpfc_sli_ring  *pring;
+
+   dev_printk(KERN_INFO, &pdev->dev, "recovering from a slot reset.\n");
+   if (pci_enable_device(pdev)) {
+   printk(KERN_ERR "lpfc: Cannot re-enable PCI device after reset.\n");
+   return PCI_ERS_RESULT_DISCONNECT;
+   }
+
+   pci_set_master(pdev);
+
+   /* Re-establishing Link */
+   spin_lock_irq(phba->host->host_lock);
+   phba->fc_flag |= FC_ESTABLISH_LINK;
+   psli->sli_flag &= ~LPFC_SLI2_ACTIVE;
+   spin_unlock_irq(phba->host->host_lock);
+
+   /*
+* There may be I/Os dropped by the firmware.
+* Error iocb (I/O) on txcmplq and let the SCSI layer
+* retry it after re-establishing link.
+*/
+   pring = &psli->ring[psli->fcp_ring];
+   lpfc_sli_abort_iocb_ring(phba, pring);
+
+   /* Take device offline; this will perform cleanup */
+   lpfc_offline(phba);
+   lpfc_sli_brdrestart(phba);
+
+   return PCI_ERS_RESULT_RECOVERED;
+}
+
+/**
+ * lpfc_io_resume - called when traffic can start flowing again.
+ * @pdev: Pointer to PCI device
+ *
+ * This callback is called when the error recovery driver tells us that
+ * its OK to resume normal operation.
+ */
+static void lpfc_io_resume(struct pci_dev *pdev)
+{
+   struct Scsi_Host *host = pci_get_drvdata(pdev);
+   struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+
+   lpfc_online(phba);
+   mod_timer(&phba->fc_estabtmo, jiffies + HZ * 60);
+}
+
 static struct pci_device_id lpfc_id_table[] = {
{PCI_VENDOR_ID_EMULEX, PCI_DEVICE_ID_VIPER,
PCI_ANY_ID, PCI_ANY_ID, },
@@ -1885,11 +1969,18 @@ static struct pci_device_id lpfc_id_tabl
 
 MODULE_DEVICE_TABLE(pci, lpfc_id_table);
 
+static struct pci_error_handlers lpfc_err_handler = {
+   .error_detected = lpfc_io_error_detected,
+   .slot_reset = lpfc_io_slot_reset,
+   .resume = lpfc_io_resume,
+};
+
 static struct pci_driver lpfc_driver = {
.name   = LPFC_DRIVER_NAME,
.id_table   = lpfc_id_table,
.probe  = lpfc_pci_probe_one,
.remove = __devexit_p(lpfc_pci_remove_one),
+   .err_handler = &lpfc_err_handler,
 };
 
 static int __init

Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-31 Thread Linas Vepstas

On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
> Linas Vepstas wrote:
> > 
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error.  The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise
> > and then lower the #RST pci signal line for 1/4 second, then wait 2
> > seconds for the  PCI bus to settle, then restore the PCI config space
> > registers (BARs, interrupt line, etc) to what they used to be. Then,
> > I call sym_start_up() in an attempt to get the symbios card working
> > again.  And that's where I get stuck ... 
> > 
> > My assumption is that after the #RST, that the symbios card will sit
> > there, dumb and stupid, with no scripts running.  But sometimes I find 
> > that the card has done something to make the PCI error hardware trip
> > again.  Typically, this means that the card attempted to DMA to some
> > address that it's not allowed to touch, or raised #SERR or possibly 
> > #PERR (I can't tell which). 
> 
> What config registers are you restoring? 

BARs, grant, latency, interrupt, cache line size.

> Is it possible symbios does not
> like something in your config restore?

possibly...

> Another possibility is that asserting PCI reset is not cleanly resetting
> the card. Does PCI reset force BIST to be run on these cards? You could
> try to manually run BIST on the card after the PCI reset to see if that

I didn't see BIST in the code, but I wasn't looking for it either.  I
could try that.

> helps, or you could try power cycling the slot instead of using PCI reset.

Yes, I could :(  I'll try that next.  The problem is that not all slots
are power-cyclable; only the hotplug slots are.  I've discovered that,
for example, the ethernet chips are soldered to the motherboard and
can't be power-cycled (but fortunately, those don't give me trouble).


--linas

Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-31 Thread Linas Vepstas

Hmm,

Got distracted by other issues, so I'm answering a week late...

On Tue, Mar 22, 2005 at 10:57:28AM -0700, Grant Grundler was heard to remark:
> On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote:
> > My current hardware will halt all i/o to/from the symbios controller
> > upon detection of a PCI error.  The recovery procedure that I am
> > currently using is to call system firmware (aka 'bios') to raise
> > and then lower the #RST pci signal line for 1/4 second, then wait 2
> > seconds for the  PCI bus to settle, then restore the PCI config space
> > registers (BARs, interrupt line, etc) to what they used to be. Then,
> > I call sym_start_up() in an attempt to get the symbios card working
> > again.  And that's where I get stuck ... 
> 
> Does this process cause a SCSI bus reset?

Don't get a chance to get that far.  Have to bring up the PCI interfaces
first, before any scsi command can be issued.

> BTW, when did sym2 get a chance to cleanup "pending" requests?

Yes, the sym2 driver has mechanisms for that.

> You want everything moved back to the "queued" state or failed
> (flush pending IO so upper layers can retry if they want).

Upper layer is the linux block device; my understanding is that it does
not retry, nor do the filesystems above that.  Passing errors upwards
seems to be pretty darned fatal.  My goal is to limit retries to the
driver.

> > Sometimes, I get the PCI error while the card is sitting there idly
> > after the #RST, but more often, I get the error in sym_chip_reset(),
> > immediately after the   OUTB (nc_istat, SRST);
> 
> Oh? Is this the driver trying to issue SCSI Reset?

No, I am trying to reinitialize the SCSI card after the PCI bus has been
reset.  This has nothing to do with SCSI bus resets, as far as I know
... 

--linas

Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-21 Thread Linas Vepstas

Hi,

There has been a running thread for a while on several mailing lists
concerning PCI bus error recovery.  Very briefly, some architectures
have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
new PCI-Express chips from Intel and other vendors, and possibly
pa-risc and others).

I've been trying to prototype  error recovery.  I currently have
ethernet and the IPR scsi driver working, but I am having trouble with 
the symbios driver.  I need help/advice ... 

On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
> On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
> > I also want to do the symbios driver...
> 
> FYI, Mathew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.

My current hardware will halt all i/o to/from the symbios controller
upon detection of a PCI error.  The recovery procedure that I am
currently using is to call system firmware (aka 'bios') to raise
and then lower the #RST pci signal line for 1/4 second, then wait 2
seconds for the PCI bus to settle, then restore the PCI config space
registers (BARs, interrupt line, etc) to what they used to be. Then,
I call sym_start_up() in an attempt to get the symbios card working
again.  And that's where I get stuck ... 
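
To make the ordering concrete, here is a userspace sketch of that sequence.
It is only a model: struct toy_pci_dev and the toy_* functions are
hypothetical, the reset is simulated by zeroing a fake register array, and the
two delays appear only as comments.

```c
#include <assert.h>
#include <string.h>

#define TOY_CFG_WORDS 16   /* first 64 bytes of config space, as 32-bit words */

struct toy_pci_dev {
	unsigned int cfg[TOY_CFG_WORDS];    /* live config registers */
	unsigned int saved[TOY_CFG_WORDS];  /* snapshot taken while healthy */
	int started;                        /* driver has restarted the chip */
};

/* Take a snapshot of config space while the device is still healthy. */
static void toy_save_config(struct toy_pci_dev *d)
{
	memcpy(d->saved, d->cfg, sizeof(d->cfg));
}

/* Model #RST: in this toy, the card forgets everything it was told. */
static void toy_assert_reset(struct toy_pci_dev *d)
{
	memset(d->cfg, 0, sizeof(d->cfg));
	d->started = 0;
}

/* Put BARs, interrupt line, etc. back to their pre-error values. */
static void toy_restore_config(struct toy_pci_dev *d)
{
	memcpy(d->cfg, d->saved, sizeof(d->cfg));
}

/* Restarting only makes sense once config space matches the snapshot. */
static int toy_start_up(struct toy_pci_dev *d)
{
	if (memcmp(d->cfg, d->saved, sizeof(d->cfg)) != 0)
		return -1;
	d->started = 1;
	return 0;
}

/* The full sequence described above; returns 0 on success. */
static int toy_recover(struct toy_pci_dev *d)
{
	toy_assert_reset(d);     /* firmware pulses #RST (1/4 second in the text) */
	/* real code would wait here for the bus to settle (2 seconds in the text) */
	toy_restore_config(d);
	return toy_start_up(d);
}
```

In the model, starting the chip before the restore step fails, which is the
ordering constraint the procedure relies on.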

My assumption is that after the #RST, that the symbios card will sit
there, dumb and stupid, with no scripts running.  But sometimes I find 
that the card has done something to make the PCI error hardware trip
again.  Typically, this means that the card attempted to DMA to some
address that it's not allowed to touch, or raised #SERR or possibly 
#PERR (I can't tell which). 

Sometimes, I get the PCI error while the card is sitting there idly
after the #RST, but more often, I get the error in sym_chip_reset(),
immediately after the   OUTB (nc_istat, SRST);

Any clue what this is about? Am I missing something? I'm rather
perplexed at this point, any clues/hints/suggestions are welcome.

--linas

