Re: [PATCH] eeh: Fixing a bug when pci structure is null

2010-02-24 Thread Mike Mason

On 2/19/2010 1:54 PM, Benjamin Herrenschmidt wrote:

On Fri, 2010-02-19 at 14:43 -0200, Breno Leitao wrote:

Hi Ben,

I'd like to ask about this patch ? Should I re-submit ?

Thanks,

Breno Leitao wrote:

During a EEH recover, the pci_dev structure can be null, mainly if an
eeh event is detected during cpi config operation. In this case, the
pci_dev will not be known (and will be null) the kernel will crash
with the following message:


It should be in -next, can you dbl check ?


I just confirmed the patch is in the -next tree.

Mike



Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH 1/3] Support for PCI Express reset type

2009-07-30 Thread Mike Mason

This is the first of three patches that implement a bit field that PCI Express 
device drivers can use to indicate they need a fundamental reset during error 
recovery.

By default, the EEH framework on powerpc does what's known as a hot reset during 
recovery of a PCI Express device.  We've found a case where the device needs a fundamental 
reset to recover properly.  The current PCI error recovery and EEH frameworks do not support 
this distinction.

The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev that 
indicates whether the device requires a fundamental reset during recovery.

These patches supersede the previously submitted patch that implemented a fundamental reset bit field. 


Please review and let me know of any concerns.

Signed-off-by: Mike Mason mm...@us.ibm.com
Signed-off-by: Richard Lary rl...@us.ibm.com


diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
--- a/include/linux/pci.h   2009-07-13 14:25:37.0 -0700
+++ b/include/linux/pci.h   2009-07-15 10:25:37.0 -0700
@@ -273,6 +273,7 @@ struct pci_dev {
unsigned intari_enabled:1;  /* ARI forwarding */
unsigned intis_managed:1;
unsigned intis_pcie:1;
+   unsigned intneeds_freset:1; /* Dev requires fundamental reset */
unsigned intstate_saved:1;
unsigned intis_physfn:1;
unsigned intis_virtfn:1;
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH 2/3] Support for PCI Express reset type

2009-07-30 Thread Mike Mason

This is the second of three patches that implement a bit field that PCI Express 
device drivers can use to indicate they need a fundamental reset during error 
recovery.

By default, the EEH framework on powerpc does what's known as a hot reset during 
recovery of a PCI Express device.  We've found a case where the device needs a fundamental 
reset to recover properly.  The current PCI error recovery and EEH frameworks do not support 
this distinction.

The attached patch updates the Documentation/PCI/pci-error-recovery.txt file 
with changes related to this new bit field, as well a few unrelated updates.

These patches supersede the previously submitted patch that implemented a fundamental reset bit field. 


Please review and let me know of any concerns.

Signed-off-by: Mike Mason mm...@us.ibm.com
Signed-off-by: Richard Lary rl...@us.ibm.com



--- a/Documentation/PCI/pci-error-recovery.txt  2009-06-09 20:05:27.0 
-0700
+++ b/Documentation/PCI/pci-error-recovery.txt  2009-07-30 13:57:15.0 
-0700
@@ -4,15 +4,17 @@
 February 2, 2006
 
  Current document maintainer:
- Linas Vepstas li...@austin.ibm.com
+ Linas Vepstas linasveps...@gmail.com
+  updated by Richard Lary rl...@us.ibm.com
+   and Mike Mason mm...@us.ibm.com on 27-Jul-2009
 
 
 Many PCI bus controllers are able to detect a variety of hardware
 PCI errors on the bus, such as parity errors on the data and address
 busses, as well as SERR and PERR errors.  Some of the more advanced
 chipsets are able to deal with these errors; these include PCI-E chipsets,
-and the PCI-host bridges found on IBM Power4 and Power5-based pSeries
-boxes. A typical action taken is to disconnect the affected device,
+and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
+pSeries boxes. A typical action taken is to disconnect the affected device,
 halting all I/O to it.  The goal of a disconnection is to avoid system
 corruption; for example, to halt system memory corruption due to DMA's
 to wild addresses. Typically, a reconnection mechanism is also
@@ -37,10 +39,11 @@ is forced by the need to handle multi-fu
 devices that have multiple device drivers associated with them.
 In the first stage, each driver is allowed to indicate what type
 of reset it desires, the choices being a simple re-enabling of I/O
-or requesting a hard reset (a full electrical #RST of the PCI card).
-If any driver requests a full reset, that is what will be done.
+or requesting a slot reset.
 
-After a full reset and/or a re-enabling of I/O, all drivers are
+If any driver requests a slot reset, that is what will be done.
+
+After a reset and/or a re-enabling of I/O, all drivers are
 again notified, so that they may then perform any device setup/config
 that may be required.  After these have all completed, a final
 resume normal operations event is sent out.
@@ -101,7 +104,7 @@ if it implements any, it must implement 
 is not implemented, the corresponding feature is considered unsupported.
 For example, if mmio_enabled() and resume() aren't there, then it
 is assumed that the driver is not doing any direct recovery and requires
-a reset. If link_reset() is not implemented, the card is assumed as
+a slot reset. If link_reset() is not implemented, the card is assumed to
 not care about link resets. Typically a driver will want to know about
 a slot_reset().
 
@@ -111,7 +114,7 @@ sequence described below.
 
 STEP 0: Error Event
 ---
-PCI bus error is detect by the PCI hardware.  On powerpc, the slot
+A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0x,
 all writes are ignored.
 
@@ -139,7 +142,7 @@ The driver must return one of the follow
  a chance to extract some diagnostic information (see
  mmio_enable, below).
- PCI_ERS_RESULT_NEED_RESET:
- Driver returns this if it can't recover without a hard
+ Driver returns this if it can't recover without a
  slot reset.
- PCI_ERS_RESULT_DISCONNECT:
  Driver returns this if it doesn't want to recover at all.
@@ -169,11 +172,11 @@ is STEP 6 (Permanent Failure).
 
  The current powerpc implementation doesn't much care if the device
  attempts I/O at this point, or not.  I/O's will fail, returning
- a value of 0xff on read, and writes will be dropped. If the device
- driver attempts more than 10K I/O's to a frozen adapter, it will
- assume that the device driver has gone into an infinite loop, and
- it will panic the kernel. There doesn't seem to be any other
- way of stopping a device driver that insists on spinning on I/O.
+ a value of 0xff on read, and writes will be dropped. If more than
+ EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
+ assumes that the device driver has gone into an infinite loop

[PATCH 3/3] Support for PCI Express reset type

2009-07-30 Thread Mike Mason

This is the third of three patches that implement a bit field that PCI Express 
device drivers can use to indicate they need a fundamental reset during error 
recovery.

By default, the EEH framework on powerpc does what's known as a hot reset during 
recovery of a PCI Express device.  We've found a case where the device needs a fundamental 
reset to recover properly.  The current PCI error recovery and EEH frameworks do not support 
this distinction.

The attached patch makes changes to EEH to utilize the new bit field.

These patches supersede the previously submitted patch that implemented a 
fundamental reset bit field.

Please review and let me know of any concerns.

Signed-off-by: Mike Mason mm...@us.ibm.com
Signed-off-by: Richard Lary rl...@us.ibm.com

diff -uNrp a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
--- a/arch/powerpc/kernel/pci_64.c  2009-07-13 14:25:24.0 -0700
+++ b/arch/powerpc/kernel/pci_64.c  2009-07-15 10:26:26.0 -0700
@@ -143,6 +143,7 @@ struct pci_dev *of_create_pci_dev(struct
dev-dev.bus = pci_bus_type;
dev-devfn = devfn;
dev-multifunction = 0; /* maybe a lie? */
+   dev-needs_freset = 0;   /* pcie fundamental reset required */

dev-vendor = get_int_prop(node, vendor-id, 0x);
dev-device = get_int_prop(node, device-id, 0x);
diff -uNrp a/arch/powerpc/platforms/pseries/eeh.c 
b/arch/powerpc/platforms/pseries/eeh.c
--- a/arch/powerpc/platforms/pseries/eeh.c  2009-06-09 20:05:27.0 
-0700
+++ b/arch/powerpc/platforms/pseries/eeh.c  2009-07-15 10:29:04.0 
-0700
@@ -744,7 +744,15 @@ int pcibios_set_pcie_reset_state(struct

 static void __rtas_set_slot_reset(struct pci_dn *pdn)
 {
-   rtas_pci_slot_reset (pdn, 1);
+   struct pci_dev *dev = pdn-pcidev;
+
+   /* Determine type of EEH reset required by device,
+* default hot reset or fundamental reset
+*/
+   if (dev-needs_freset)
+   rtas_pci_slot_reset(pdn, 3);
+   else
+   rtas_pci_slot_reset(pdn, 1);

/* The PCI bus requires that the reset be held high for at least
 * a 100 milliseconds. We wait a bit longer 'just in case'.  */___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] Hold reference to device_node during EEH event handling

2009-07-22 Thread Mike Mason

Michael Ellerman wrote:

On Thu, 2009-07-16 at 09:33 -0700, Mike Mason wrote:

Michael Ellerman wrote:

On Wed, 2009-07-15 at 14:43 -0700, Mike Mason wrote:

This patch increments the device_node reference counter when an EEH
error occurs and decrements the counter when the event has been
handled.  This is to prevent the device_node from being released until
eeh_event_handler() has had a chance to deal with the event.  We've
seen cases where the device_node is released too soon when an EEH
event occurs during a dlpar remove, causing the event handler to
attempt to access bad memory locations.

Please review and let me know of any concerns.

Taking a reference sounds sane, but ...

Signed-off-by: Mike Mason mm...@us.ibm.com 


--- a/arch/powerpc/platforms/pseries/eeh_event.c2008-10-09 
15:13:53.0 -0700
+++ b/arch/powerpc/platforms/pseries/eeh_event.c2009-07-14 
14:14:00.0 -0700
@@ -75,6 +75,14 @@ static int eeh_event_handler(void * dumm
if (event == NULL)
return 0;
 
+	/* EEH holds a reference to the device_node, so if it

+* equals 1 it's no longer valid and the event should
+* be ignored */
+   if (atomic_read(event-dn-kref.refcount) == 1) {
+   of_node_put(event-dn);
+   return 0;
+   }

That's really gross :)

Agreed.  I'll look for another way to determine if device is gone and
the event should be ignored.  Suggestions are welcome :-)


Actually, it turns out the atomic_read() isn't necessary.  I just need to take 
the reference to the device_node when the EEH error is detected and let EEH try 
to handle the error.  EEH detects the fact that the device is no longer valid, 
aborts the recovery attempt, then gives the device_node reference back.  Works 
as expected.

I'll resubmit the patch without the atomic_read().



Benh and I had a quick chat about it, and were wondering whether what
you really should be doing is taking a reference to the pci device
(perhaps as well as the device node).


EEH already does that 3 lines before the of_node_get (see below).



@@ -140,7 +149,7 @@ int eeh_send_failure_event (struct devic
if (dev)
pci_dev_get(dev);
 
-   event-dn = dn;

+   event-dn = of_node_get(dn);
event-dev = dev;



Thanks,
Mike

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] Hold reference to device_node during EEH event handling

2009-07-16 Thread Mike Mason

Michael Ellerman wrote:

On Wed, 2009-07-15 at 14:43 -0700, Mike Mason wrote:

This patch increments the device_node reference counter when an EEH
error occurs and decrements the counter when the event has been
handled.  This is to prevent the device_node from being released until
eeh_event_handler() has had a chance to deal with the event.  We've
seen cases where the device_node is released too soon when an EEH
event occurs during a dlpar remove, causing the event handler to
attempt to access bad memory locations.

Please review and let me know of any concerns.


Taking a reference sounds sane, but ...

Signed-off-by: Mike Mason mm...@us.ibm.com 


--- a/arch/powerpc/platforms/pseries/eeh_event.c2008-10-09 
15:13:53.0 -0700
+++ b/arch/powerpc/platforms/pseries/eeh_event.c2009-07-14 
14:14:00.0 -0700
@@ -75,6 +75,14 @@ static int eeh_event_handler(void * dumm
if (event == NULL)
return 0;
 
+	/* EEH holds a reference to the device_node, so if it

+* equals 1 it's no longer valid and the event should
+* be ignored */
+   if (atomic_read(event-dn-kref.refcount) == 1) {
+   of_node_put(event-dn);
+   return 0;
+   }


That's really gross :)


Agreed.  I'll look for another way to determine if device is gone and the event 
should be ignored.  Suggestions are welcome :-)



And what happens if the refcount goes to 1 just after the check? ie.
here.


/* Serialize processing of EEH events */
mutex_lock(eeh_event_mutex);
eeh_mark_slot(event-dn, EEH_MODE_RECOVERING);



cheers



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: Support for PCI Express reset type in EEH

2009-07-15 Thread Mike Mason

This patch was simultaneously submitted to Red Hat for review.  As a result of 
that review, I'm withdrawing this patch and will submit a new version shortly.

Mike

Mike Mason wrote:
By default, EEH does what's known as a hot reset during error recovery 
of a PCI Express device.  We've found a case where the device needs a 
fundamental reset to recover properly.  The current PCI error recovery 
and EEH frameworks do not support this distinction.


The attached patch (courtesy of Richard Lary) implements a reset type 
callback that can be used to determine what type of reset a device 
requires.  It is backwards compatible with all other drivers that 
implement PCI error recovery callbacks.  Only drivers that require a 
fundamental reset need to be changed.  So far we're only aware of one 
driver that has the requirement (qla2xxx).  The patch touches mostly EEH 
and pseries code, but does require a couple of minor additions to the 
overall PCI error recovery framework.


Signed-off-by: Mike Mason mm...@us.ibm.com

--- a/arch/powerpc/include/asm/ppc-pci.h2009-06-09 
20:05:27.0 -0700
+++ b/arch/powerpc/include/asm/ppc-pci.h2009-07-13 
16:12:31.0 -0700

@@ -90,7 +90,9 @@ int rtas_pci_enable(struct pci_dn *pdn,
*
* Returns a non-zero value if the reset failed.
*/
-int rtas_set_slot_reset (struct pci_dn *);
+#define HOT_RESET1
+#define FUNDAMENTAL_RESET3
+int rtas_set_slot_reset (struct pci_dn *, int reset_type);
int eeh_wait_for_slot_status(struct pci_dn *pdn, int max_wait_msecs);

/** --- a/arch/powerpc/platforms/pseries/eeh.c2009-06-09 
20:05:27.0 -0700
+++ b/arch/powerpc/platforms/pseries/eeh.c2009-07-13 
16:27:27.0 -0700

@@ -666,7 +666,7 @@ rtas_pci_enable(struct pci_dn *pdn, int
/**
* rtas_pci_slot_reset - raises/lowers the pci #RST line
* @pdn pci device node
- * @state: 1/0 to raise/lower the #RST
+ * @state: 1/3/0 to raise hot-reset/fundamental-reset/lower the #RST
*
* Clear the EEH-frozen condition on a slot.  This routine
* asserts the PCI #RST line if the 'state' argument is '1',
@@ -742,9 +742,9 @@ int pcibios_set_pcie_reset_state(struct
*  Return 0 if success, else a non-zero value.
*/

-static void __rtas_set_slot_reset(struct pci_dn *pdn)
+static void __rtas_set_slot_reset(struct pci_dn *pdn, int reset_type)
{
-rtas_pci_slot_reset (pdn, 1);
+rtas_pci_slot_reset (pdn, reset_type);

/* The PCI bus requires that the reset be held high for at least
 * a 100 milliseconds. We wait a bit longer 'just in case'.  */
@@ -766,13 +766,13 @@ static void __rtas_set_slot_reset(struct
msleep (PCI_BUS_SETTLE_TIME_MSEC);
}

-int rtas_set_slot_reset(struct pci_dn *pdn)
+int rtas_set_slot_reset(struct pci_dn *pdn, int reset_type)
{
int i, rc;

/* Take three shots at resetting the bus */
for (i=0; i3; i++) {
-__rtas_set_slot_reset(pdn);
+__rtas_set_slot_reset(pdn, reset_type);

rc = eeh_wait_for_slot_status(pdn, PCI_BUS_RESET_WAIT_MSEC);
if (rc == 0)
--- a/arch/powerpc/platforms/pseries/eeh_driver.c2009-07-13 
14:25:24.0 -0700
+++ b/arch/powerpc/platforms/pseries/eeh_driver.c2009-07-13 
16:39:16.0 -0700

@@ -115,6 +115,34 @@ static void eeh_enable_irq(struct pci_de

/* --- */
/**
+ * eeh_query_reset_type - query each device driver for reset type
+ *
+ * Query each device driver for special reset type if required
+ * merge the device driver responses. Cumulative response
+ * passed back in userdata.
+ */
+
+static int eeh_query_reset_type(struct pci_dev *dev, void *userdata)
+{
+enum pci_ers_result rc, *res = userdata;
+struct pci_driver *driver = dev-driver;
+
+if (!driver)
+return 0;
+
+if (!driver-err_handler ||
+!driver-err_handler-reset_type)
+return 0;
+
+rc = driver-err_handler-reset_type (dev);
+
+/* A driver that needs a special reset trumps all others */
+if (rc == PCI_ERS_RESULT_FUNDAMENTAL_RESET ) *res = rc;
+
+return 0;
+}
+
+/**
* eeh_report_error - report pci error to each device driver
* * Report an EEH error to each device driver, collect up and @@ -282,9 
+310,12 @@ static int eeh_report_failure(struct pci

* @pe_dn: pointer to a Partionable Endpoint device node.
*This is the top-level structure on which pci
*bus resets can be performed.
+ *
+ * reset_type: some devices may require type other than default hot reset.
*/

-static int eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus)
+static int eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus,
+ int reset_type)
{
struct device_node *dn;
int cnt, rc;
@@ -298,7 +329,7 @@ static int eeh_reset_device (struct pci_
/* Reset the pci controller. (Asserts RST#; resets config space).
 * Reconfigure bridges and devices. Don't try to bring the system
 * up if the reset failed for some reason. */
-rc = rtas_set_slot_reset

[PATCH] Support for PCI Express reset type in EEH

2009-07-15 Thread Mike Mason

By default, EEH does what's known as a hot reset during error recovery of a PCI Express 
device.  We've found a case where the device needs a fundamental reset to recover 
properly.  The current PCI error recovery and EEH frameworks do not support this distinction.

The attached patch (courtesy of Richard Lary) adds a bit field to pci_dev that 
indicates whether the device requires a fundamental reset during error 
recovery.  This bit can be checked by EEH to determine which reset type is 
required.

This patch supersedes the previously submitted patch that implemented a reset 
type callback.

Please review and let me know of any concerns.

Signed-off-by: Mike Mason mm...@us.ibm.com 


diff -uNrp a/arch/powerpc/kernel/pci_64.c b/arch/powerpc/kernel/pci_64.c
--- a/arch/powerpc/kernel/pci_64.c  2009-07-13 14:25:24.0 -0700
+++ b/arch/powerpc/kernel/pci_64.c  2009-07-15 10:26:26.0 -0700
@@ -143,6 +143,7 @@ struct pci_dev *of_create_pci_dev(struct
dev-dev.bus = pci_bus_type;
dev-devfn = devfn;
dev-multifunction = 0;  /* maybe a lie? */
+   dev-fndmntl_rst_rqd = 0;   /* pcie fundamental reset required */

dev-vendor = get_int_prop(node, vendor-id, 0x);
dev-device = get_int_prop(node, device-id, 0x);
diff -uNrp a/arch/powerpc/platforms/pseries/eeh.c 
b/arch/powerpc/platforms/pseries/eeh.c
--- a/arch/powerpc/platforms/pseries/eeh.c  2009-06-09 20:05:27.0 
-0700
+++ b/arch/powerpc/platforms/pseries/eeh.c  2009-07-15 10:29:04.0 
-0700
@@ -744,7 +744,15 @@ int pcibios_set_pcie_reset_state(struct

static void __rtas_set_slot_reset(struct pci_dn *pdn)
{
-   rtas_pci_slot_reset (pdn, 1);
+   struct pci_dev *dev = pdn-pcidev;
+
+   /* Determine type of EEH reset required by device,
+* default hot reset or fundamental reset
+*/
+   if (dev-fndmntl_rst_rqd)
+   rtas_pci_slot_reset(pdn, 3);
+   else
+   rtas_pci_slot_reset(pdn, 1);

/* The PCI bus requires that the reset be held high for at least
 * a 100 milliseconds. We wait a bit longer 'just in case'.  */
diff -uNrp a/include/linux/pci.h b/include/linux/pci.h
--- a/include/linux/pci.h   2009-07-13 14:25:37.0 -0700
+++ b/include/linux/pci.h   2009-07-15 10:25:37.0 -0700
@@ -273,6 +273,7 @@ struct pci_dev {
unsigned intari_enabled:1;  /* ARI forwarding */
unsigned intis_managed:1;
unsigned intis_pcie:1;
+   unsigned intfndmntl_rst_rqd:1; /* Dev requires fundamental reset */
unsigned intstate_saved:1;
unsigned intis_physfn:1;
unsigned intis_virtfn:1;

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


[PATCH] Hold reference to device_node during EEH event handling

2009-07-15 Thread Mike Mason

This patch increments the device_node reference counter when an EEH error 
occurs and decrements the counter when the event has been handled.  This is to 
prevent the device_node from being released until eeh_event_handler() has had a 
chance to deal with the event.  We've seen cases where the device_node is 
released too soon when an EEH event occurs during a dlpar remove, causing the 
event handler to attempt to access bad memory locations.

Please review and let me know of any concerns.

Signed-off-by: Mike Mason mm...@us.ibm.com 


--- a/arch/powerpc/platforms/pseries/eeh_event.c2008-10-09 
15:13:53.0 -0700
+++ b/arch/powerpc/platforms/pseries/eeh_event.c2009-07-14 
14:14:00.0 -0700
@@ -75,6 +75,14 @@ static int eeh_event_handler(void * dumm
if (event == NULL)
return 0;

+   /* EEH holds a reference to the device_node, so if it
+* equals 1 it's no longer valid and the event should
+* be ignored */
+   if (atomic_read(event-dn-kref.refcount) == 1) {
+   of_node_put(event-dn);
+   return 0;
+   }
+
/* Serialize processing of EEH events */
mutex_lock(eeh_event_mutex);
eeh_mark_slot(event-dn, EEH_MODE_RECOVERING);
@@ -86,6 +94,7 @@ static int eeh_event_handler(void * dumm

eeh_clear_slot(event-dn, EEH_MODE_RECOVERING);
pci_dev_put(event-dev);
+   of_node_put(event-dn);
kfree(event);
mutex_unlock(eeh_event_mutex);

@@ -140,7 +149,7 @@ int eeh_send_failure_event (struct devic
if (dev)
pci_dev_get(dev);

-   event-dn = dn;
+   event-dn = of_node_get(dn);
event-dev = dev;

/* We may or may not be called in an interrupt context */


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Support for PCI Express reset type in EEH

2009-07-14 Thread Mike Mason

By default, EEH does what's known as a hot reset during error recovery of a PCI Express 
device.  We've found a case where the device needs a fundamental reset to recover 
properly.  The current PCI error recovery and EEH frameworks do not support this distinction.

The attached patch (courtesy of Richard Lary) implements a reset type callback 
that can be used to determine what type of reset a device requires.  It is 
backwards compatible with all other drivers that implement PCI error recovery 
callbacks.  Only drivers that require a fundamental reset need to be changed.  
So far we're only aware of one driver that has the requirement (qla2xxx).  The 
patch touches mostly EEH and pseries code, but does require a couple of minor 
additions to the overall PCI error recovery framework.

Signed-off-by: Mike Mason mm...@us.ibm.com

--- a/arch/powerpc/include/asm/ppc-pci.h2009-06-09 20:05:27.0 
-0700
+++ b/arch/powerpc/include/asm/ppc-pci.h2009-07-13 16:12:31.0 
-0700
@@ -90,7 +90,9 @@ int rtas_pci_enable(struct pci_dn *pdn,
*
* Returns a non-zero value if the reset failed.
*/
-int rtas_set_slot_reset (struct pci_dn *);
+#define HOT_RESET  1
+#define FUNDAMENTAL_RESET  3
+int rtas_set_slot_reset (struct pci_dn *, int reset_type);
int eeh_wait_for_slot_status(struct pci_dn *pdn, int max_wait_msecs);

/** 
--- a/arch/powerpc/platforms/pseries/eeh.c	2009-06-09 20:05:27.0 -0700

+++ b/arch/powerpc/platforms/pseries/eeh.c  2009-07-13 16:27:27.0 
-0700
@@ -666,7 +666,7 @@ rtas_pci_enable(struct pci_dn *pdn, int
/**
* rtas_pci_slot_reset - raises/lowers the pci #RST line
* @pdn pci device node
- * @state: 1/0 to raise/lower the #RST
+ * @state: 1/3/0 to raise hot-reset/fundamental-reset/lower the #RST
*
* Clear the EEH-frozen condition on a slot.  This routine
* asserts the PCI #RST line if the 'state' argument is '1',
@@ -742,9 +742,9 @@ int pcibios_set_pcie_reset_state(struct
*  Return 0 if success, else a non-zero value.
*/

-static void __rtas_set_slot_reset(struct pci_dn *pdn)
+static void __rtas_set_slot_reset(struct pci_dn *pdn, int reset_type)
{
-   rtas_pci_slot_reset (pdn, 1);
+   rtas_pci_slot_reset (pdn, reset_type);

/* The PCI bus requires that the reset be held high for at least
 * a 100 milliseconds. We wait a bit longer 'just in case'.  */
@@ -766,13 +766,13 @@ static void __rtas_set_slot_reset(struct
msleep (PCI_BUS_SETTLE_TIME_MSEC);
}

-int rtas_set_slot_reset(struct pci_dn *pdn)
+int rtas_set_slot_reset(struct pci_dn *pdn, int reset_type)
{
int i, rc;

/* Take three shots at resetting the bus */
for (i=0; i3; i++) {
-   __rtas_set_slot_reset(pdn);
+   __rtas_set_slot_reset(pdn, reset_type);

rc = eeh_wait_for_slot_status(pdn, PCI_BUS_RESET_WAIT_MSEC);
if (rc == 0)
--- a/arch/powerpc/platforms/pseries/eeh_driver.c   2009-07-13 
14:25:24.0 -0700
+++ b/arch/powerpc/platforms/pseries/eeh_driver.c   2009-07-13 
16:39:16.0 -0700
@@ -115,6 +115,34 @@ static void eeh_enable_irq(struct pci_de

/* --- */
/**
+ * eeh_query_reset_type - query each device driver for reset type
+ *
+ * Query each device driver for special reset type if required
+ * merge the device driver responses. Cumulative response
+ * passed back in userdata.
+ */
+
+static int eeh_query_reset_type(struct pci_dev *dev, void *userdata)
+{
+   enum pci_ers_result rc, *res = userdata;
+   struct pci_driver *driver = dev-driver;
+
+   if (!driver)
+   return 0;
+
+   if (!driver-err_handler ||
+   !driver-err_handler-reset_type)
+   return 0;
+
+   rc = driver-err_handler-reset_type (dev);
+
+   /* A driver that needs a special reset trumps all others */
+   if (rc == PCI_ERS_RESULT_FUNDAMENTAL_RESET ) *res = rc;
+
+   return 0;
+}
+
+/**
* eeh_report_error - report pci error to each device driver
* 
* Report an EEH error to each device driver, collect up and 
@@ -282,9 +310,12 @@ static int eeh_report_failure(struct pci

* @pe_dn: pointer to a Partionable Endpoint device node.
*This is the top-level structure on which pci
*bus resets can be performed.
+ *
+ * reset_type: some devices may require type other than default hot reset.
*/

-static int eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus)
+static int eeh_reset_device (struct pci_dn *pe_dn, struct pci_bus *bus,
+int reset_type)
{
struct device_node *dn;
int cnt, rc;
@@ -298,7 +329,7 @@ static int eeh_reset_device (struct pci_
/* Reset the pci controller. (Asserts RST#; resets config space).
 * Reconfigure bridges and devices. Don't try to bring the system
 * up if the reset failed for some reason. */
-   rc = rtas_set_slot_reset(pe_dn);
+   rc

Re: [PATCH] Set error_state to pci_channel_io_normal in eeh_report_reset()

2009-04-21 Thread Mike Mason




Paul Mackerras wrote:

  Mike Mason writes:

  
  
diff --git a/arch/powerpc/platforms/pseries/eeh_driver.c 
b/arch/powerpc/platforms/pseries/eeh_driver.c
index 380420f..9a2a6e3 100644
--- a/arch/powerpc/platforms/pseries/eeh_driver.c
+++ b/arch/powerpc/platforms/pseries/eeh_driver.c
@@ -182,6 +182,8 @@ static void eeh_report_reset(struct pci_dev *dev, 
void *userdata)
if (!driver)
return;

+ dev-error_state = pci_channel_io_normal;
+
eeh_enable_irq(dev);

if (!driver-err_handler ||

  
  
Your mail client totally mangled the whitespace.  I fixed it up and
applied it, since the patch was small, but in future please fix your
mail client so it doesn't do that.
  

Sorry about that.

  
It sounds like this patch needs to be applied to earlier kernel
versions -- do you agree?
  

Maybe. The only drivers that care at this point are the emulex and
qlogic drivers with native eeh support. How far back would you
typically apply a patch like this? I'm submitting it for inclusion in
SLES 10  11 and RHEL 5.

Mike

  
Paul.
  




___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev

[PATCH] Set error_state to pci_channel_io_normal in eeh_report_reset()

2009-04-10 Thread Mike Mason
While adding native EEH support to Emulex and Qlogic drivers, it was 
discovered that dev-error_state was set to pci_io_channel_normal too 
late in the recovery process. These drivers rely on error_state to 
determine if they can access the device in their slot_reset callback, 
thus error_state needs to be set to pci_io_channel_norm in 
eeh_report_reset(). Below is a detailed explanation (courtesy of Richard 
Lary) as to why this is necessary.


Background:
PCI MMIO or DMA accesses to a frozen slot generate additional EEH 
errors. If the number of additional EEH errors exceeds EEH_MAX_FAILS the 
adapter will be shutdown. To avoid triggering excessive EEH errors and 
an undesirable adapter shutdown, some drivers use the 
pci_channel_offline(dev) wrapper function to return a Boolean value 
based on the value of pci_dev-error_state to determine if PCI MMIO or 
DMA accesses are safe. If the wrapper returns TRUE, drivers must not 
make PCI MMIO or DMA access to their hardware.


The pci_dev structure member error_state reflects one of three values, 
1) pci_channel_io_normal, 2) pci_channel_io_frozen, 3) 
pci_channel_io_perm_failure. Function pci_channel_offline(dev) returns 
TRUE if error_state is pci_channel_io_frozen or pci_channel_io_perm_failure.


The EEH driver sets pci_dev-error_state to pci_channel_io_frozen at the 
point where the PCI slot is frozen. Currently, the EEH driver restores 
dev-error_state to pci_channel_io_normal in eeh_report_resume() before 
calling the driver's resume callback. However, when the EEH driver calls 
the driver's slot_reset callback() from eeh_report_reset(), it 
incorrectly indicates the error state is still pci_channel_io_frozen.


Waiting until eeh_report_resume() to restore dev-error_state to 
pci_channel_io_normal is too late for Emulex and QLogic FC drivers and 
any other drivers which are designed to use common code paths in these 
two cases: i) those called after the driver's slot_reset callback() and 
ii) those called after the PCI slot is frozen but before the driver's 
slot_reset callback is called. Case i) all driver paths executed to 
reinitialize the hardware after a reset and case ii) all code paths 
executed by driver kernel threads that run asynchronous to the main 
driver thread, such as interrupt handlers and worker threads to process 
driver work queues.


Emulex and QLogic FC drivers are designed with common code paths which 
require that pci_channel_offline(dev) reflect the true state of the 
hardware. The state transitions that the hardware takes from Normal 
Operations to Slot Frozen to Reset to Normal Operations are documented 
in the Power Architectureâ„¢ Platform Requirements+ (PAPR+) in Table 75. 
PE State Control.


PAPR defines the following 3 states:

0 -- Not reset, Not EEH stopped, MMIO load/store allowed, DMA allowed 
(Normal Operations)

1 -- Reset, Not EEH stopped, MMIO load/store disabled, DMA disabled
2 -- Not reset, EEH stopped, MMIO load/store disabled, DMA disabled 
(Slot Frozen)


An EEH error places the slot in state 2 (Frozen) and the adapter driver 
is notified that an EEH error was detected. If the adapter driver 
returns PCI_ERS_RESULT_NEED_RESET, the EEH driver calls 
eeh_reset_device() to place the slot into state 1 (Reset) and 
eeh_reset_device completes by placing the slot into State 0 (Normal 
Operations). Upon return from eeh_reset_device(), the EEH driver calls 
eeh_report_reset, which then calls the adapter's slot_reset callback. At 
the time the adapter's slot_reset callback is called, the true state of 
the hardware is Normal Operations and should be accurately reflected by 
setting dev-error_state to pci_channel_io_normal.


The current implementation of EEH driver does not do so and requires the 
following patch to correct this deficiency.


Signed-off-by: Mike Mason mm...@us.ibm.com
diff --git a/arch/powerpc/platforms/pseries/eeh_driver.c 
b/arch/powerpc/platforms/pseries/eeh_driver.c

index 380420f..9a2a6e3 100644
--- a/arch/powerpc/platforms/pseries/eeh_driver.c
+++ b/arch/powerpc/platforms/pseries/eeh_driver.c
@@ -182,6 +182,8 @@ static void eeh_report_reset(struct pci_dev *dev, 
void *userdata)

if (!driver)
return;

+ dev-error_state = pci_channel_io_normal;
+
eeh_enable_irq(dev);

if (!driver-err_handler ||



___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] Only disable/enable LSI interrupts in EEH

2009-02-12 Thread Mike Mason

Michael Ellerman wrote:

On Tue, 2009-02-10 at 13:12 -0800, Mike Mason wrote:

I'm resubmitting this patch with a couple changes
suggested by Michael Ellerman.  1) the new functions
should be static, and 2) some people may object to
including unrelated formating changes.

=

The EEH code disables and enables interrupts during the
device recovery process.  This is unnecessary for MSI
and MSI-X interrupts because they are effectively disabled
by the DMA Stopped state when an EEH error occurs.  The
current code is also incorrect for MSI-X interrupts.  It
doesn't take into account that MSI-X interrupts are tracked
in a different way than LSI/MSI interrupts.  This patch 
ensures only LSI interrupts are disabled/enabled.


Signed-off-by: Mike Mason mm...@us.ibm.com
Acked-by: Linas Vepstas linasveps...@gmail.com



Looks good. Assuming you've tested it :)


Yes, it's been tested with network devices that use LSI, MSI and MSI-X 
interrupts.  All recovered fine.



Acked-by: Michael Ellerman mich...@ellerman.id.au




___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] Only disable/enable LSI interrupts in EEH

2009-02-10 Thread Mike Mason

I'm resubmitting this patch with a couple changes
suggested by Michael Ellerman.  1) the new functions
should be static, and 2) some people may object to
including unrelated formating changes.

=

The EEH code disables and enables interrupts during the
device recovery process.  This is unnecessary for MSI
and MSI-X interrupts because they are effectively disabled
by the DMA Stopped state when an EEH error occurs.  The
current code is also incorrect for MSI-X interrupts.  It
doesn't take into account that MSI-X interrupts are tracked
in a different way than LSI/MSI interrupts.  This patch 
ensures only LSI interrupts are disabled/enabled.


Signed-off-by: Mike Mason mm...@us.ibm.com
Acked-by: Linas Vepstas linasveps...@gmail.com


--- arch/powerpc/platforms/pseries/eeh_driver.c-orig2009-02-10 
07:12:31.0 -0800
+++ arch/powerpc/platforms/pseries/eeh_driver.c 2009-02-10 08:19:54.0 
-0800
@@ -76,6 +76,40 @@ static int irq_in_use(unsigned int irq)
return rc;
}

+/**
+ * eeh_disable_irq - disable interrupt for the recovering device
+ */
+static void eeh_disable_irq(struct pci_dev *dev)
+{
+   struct device_node *dn = pci_device_to_OF_node(dev);
+
+   /* Don't disable MSI and MSI-X interrupts. They are
+	 * effectively disabled by the DMA Stopped state 
+	 * when an EEH error occurs.

+   */
+   if (dev-msi_enabled || dev-msix_enabled)
+   return;
+
+   if (!irq_in_use(dev-irq))
+   return;
+
+   PCI_DN(dn)-eeh_mode |= EEH_MODE_IRQ_DISABLED;
+   disable_irq_nosync(dev-irq);
+}
+
+/**
+ * eeh_enable_irq - enable interrupt for the recovering device
+ */
+static void eeh_enable_irq(struct pci_dev *dev)
+{
+   struct device_node *dn = pci_device_to_OF_node(dev);
+
+   if ((PCI_DN(dn)-eeh_mode)  EEH_MODE_IRQ_DISABLED) {
+   PCI_DN(dn)-eeh_mode = ~EEH_MODE_IRQ_DISABLED;
+   enable_irq(dev-irq);
+   }
+}
+
/* --- */
/**
 * eeh_report_error - report pci error to each device driver
@@ -95,11 +129,8 @@ static void eeh_report_error(struct pci_
if (!driver)
return;

-   if (irq_in_use (dev-irq)) {
-   struct device_node *dn = pci_device_to_OF_node(dev);
-   PCI_DN(dn)-eeh_mode |= EEH_MODE_IRQ_DISABLED;
-   disable_irq_nosync(dev-irq);
-   }
+   eeh_disable_irq(dev);
+
if (!driver-err_handler ||
!driver-err_handler-error_detected)
return;
@@ -144,15 +175,12 @@ static void eeh_report_reset(struct pci_
{
enum pci_ers_result rc, *res = userdata;
struct pci_driver *driver = dev-driver;
-   struct device_node *dn = pci_device_to_OF_node(dev);

if (!driver)
return;

-   if ((PCI_DN(dn)-eeh_mode)  EEH_MODE_IRQ_DISABLED) {
-   PCI_DN(dn)-eeh_mode = ~EEH_MODE_IRQ_DISABLED;
-   enable_irq(dev-irq);
-   }
+   eeh_enable_irq(dev);
+
if (!driver-err_handler ||
!driver-err_handler-slot_reset)
return;
@@ -171,17 +199,14 @@ static void eeh_report_reset(struct pci_
static void eeh_report_resume(struct pci_dev *dev, void *userdata)
{
struct pci_driver *driver = dev-driver;
-   struct device_node *dn = pci_device_to_OF_node(dev);

dev-error_state = pci_channel_io_normal;

if (!driver)
return;

-   if ((PCI_DN(dn)-eeh_mode)  EEH_MODE_IRQ_DISABLED) {
-   PCI_DN(dn)-eeh_mode = ~EEH_MODE_IRQ_DISABLED;
-   enable_irq(dev-irq);
-   }
+   eeh_enable_irq(dev);
+
if (!driver-err_handler ||
!driver-err_handler-resume)
return;
@@ -205,15 +230,12 @@ static void eeh_report_failure(struct pc
if (!driver)
return;

-   if (irq_in_use (dev-irq)) {
-   struct device_node *dn = pci_device_to_OF_node(dev);
-   PCI_DN(dn)-eeh_mode |= EEH_MODE_IRQ_DISABLED;
-   disable_irq_nosync(dev-irq);
-   }
-   if (!driver-err_handler)
-   return;
-   if (!driver-err_handler-error_detected)
+   eeh_disable_irq(dev);
+
+   if (!driver-err_handler ||
+   !driver-err_handler-error_detected)
return;
+
driver-err_handler-error_detected(dev, pci_channel_io_perm_failure);
}


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH] Only disable/enable LSI interrupts in EEH

2009-02-09 Thread Mike Mason

The EEH code disables and enables interrupts during the
device recovery process.  This is unnecessary for MSI
and MSI-X interrupts because they are effectively disabled
by the DMA Stopped state when an EEH error occurs.  The 
current code is also incorrect for MSI-X interrupts.  It

doesn't take into account that MSI-X interrupts are tracked
in a different way than LSI/MSI interrupts.  This 
patch ensures only LSI interrupts are disabled/enabled.


The patch also includes a couple minor formatting fixes.


Signed-off-by: Mike Mason mm...@us.ibm.com 


--- linux-2.6.18.ppc64-orig/arch/powerpc/platforms/pseries/eeh_driver.c 
2009-01-16 10:59:59.0 -0800
+++ linux-2.6.18.ppc64/arch/powerpc/platforms/pseries/eeh_driver.c  
2009-02-07 12:29:14.0 -0800
@@ -67,7 +67,7 @@ static int irq_in_use(unsigned int irq)
{
int rc = 0;
unsigned long flags;
-   struct irq_desc *desc = irq_desc + irq;
+   struct irq_desc *desc = irq_desc + irq;

spin_lock_irqsave(desc-lock, flags);
if (desc-action)
@@ -76,6 +76,40 @@ static int irq_in_use(unsigned int irq)
return rc;
}

+/**
+ * eeh_disable_irq - disable interrupt for the recovering device
+ */
+void eeh_disable_irq(struct pci_dev *dev)
+{
+   struct device_node *dn = pci_device_to_OF_node(dev);
+
+   /* Don't disable MSI and MSI-X interrupts. They are
+	 * effectively disabled by the DMA Stopped state 
+	 * when an EEH error occurs.

+   */
+   if (dev-msi_enabled || dev-msix_enabled)
+   return;
+
+   if (!irq_in_use(dev-irq))
+   return;
+
+   PCI_DN(dn)-eeh_mode |= EEH_MODE_IRQ_DISABLED;
+   disable_irq_nosync(dev-irq);
+}
+
+/**
+ * eeh_enable_irq - enable interrupt for the recovering device
+ */
+void eeh_enable_irq(struct pci_dev *dev)
+{
+   struct device_node *dn = pci_device_to_OF_node(dev);
+
+   if ((PCI_DN(dn)-eeh_mode)  EEH_MODE_IRQ_DISABLED) {
+   PCI_DN(dn)-eeh_mode = ~EEH_MODE_IRQ_DISABLED;
+   enable_irq(dev-irq);
+   }
+}
+
/* --- */
/**
 * eeh_report_error - report pci error to each device driver
@@ -95,11 +129,8 @@ static void eeh_report_error(struct pci_
if (!driver)
return;

-   if (irq_in_use (dev-irq)) {
-   struct device_node *dn = pci_device_to_OF_node(dev);
-   PCI_DN(dn)-eeh_mode |= EEH_MODE_IRQ_DISABLED;
-   disable_irq_nosync(dev-irq);
-   }
+   eeh_disable_irq(dev);
+
if (!driver-err_handler ||
!driver-err_handler-error_detected)
return;
@@ -144,15 +175,12 @@ static void eeh_report_reset(struct pci_
{
enum pci_ers_result rc, *res = userdata;
struct pci_driver *driver = dev-driver;
-   struct device_node *dn = pci_device_to_OF_node(dev);

if (!driver)
return;

-   if ((PCI_DN(dn)-eeh_mode)  EEH_MODE_IRQ_DISABLED) {
-   PCI_DN(dn)-eeh_mode = ~EEH_MODE_IRQ_DISABLED;
-   enable_irq(dev-irq);
-   }
+   eeh_enable_irq(dev);
+
if (!driver-err_handler ||
!driver-err_handler-slot_reset)
return;
@@ -171,17 +199,14 @@ static void eeh_report_reset(struct pci_
static void eeh_report_resume(struct pci_dev *dev, void *userdata)
{
struct pci_driver *driver = dev-driver;
-   struct device_node *dn = pci_device_to_OF_node(dev);

dev-error_state = pci_channel_io_normal;

if (!driver)
return;

-   if ((PCI_DN(dn)-eeh_mode)  EEH_MODE_IRQ_DISABLED) {
-   PCI_DN(dn)-eeh_mode = ~EEH_MODE_IRQ_DISABLED;
-   enable_irq(dev-irq);
-   }
+   eeh_enable_irq(dev);
+
if (!driver-err_handler ||
!driver-err_handler-resume)
return;
@@ -205,15 +230,12 @@ static void eeh_report_failure(struct pc
if (!driver)
return;

-   if (irq_in_use (dev-irq)) {
-   struct device_node *dn = pci_device_to_OF_node(dev);
-   PCI_DN(dn)-eeh_mode |= EEH_MODE_IRQ_DISABLED;
-   disable_irq_nosync(dev-irq);
-   }
-   if (!driver-err_handler)
-   return;
-   if (!driver-err_handler-error_detected)
+   eeh_disable_irq(dev);
+
+   if (!driver-err_handler ||
+   !driver-err_handler-error_detected)
return;
+
driver-err_handler-error_detected(dev, pci_channel_io_perm_failure);
}



___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] Don't panic when EEH_MAX_FAILS is exceeded

2008-07-21 Thread Mike Mason

Here's a repost of the patch with the suggested changes.

This patch changes the EEH_MAX_FAILS action from panic to printing an 
error message.  Panicking under under this condition is too harsh.  
Although performance will be affected and the device may not recover, 
the system is still running, which at the very least should allow for a 
more graceful shutdown. The patch also removes the msleep() within a 
spinlock, which can lead to a deadlock and is not recommended.


Signed-off-by: Mike Mason [EMAIL PROTECTED]
Acked-by: Linas Vepstas [EMAIL PROTECTED]

--- powerpc.git/arch/powerpc/platforms/pseries/eeh.c2008-07-18 
08:51:42.0 -0700
+++ powerpc.git-new/arch/powerpc/platforms/pseries/eeh.c2008-07-21 
03:25:43.0 -0700
@@ -75,9 +75,9 @@
 */

/* If a device driver keeps reading an MMIO register in an interrupt
- * handler after a slot isolation event has occurred, we assume it
- * is broken and panic.  This sets the threshold for how many read
- * attempts we allow before panicking.
+ * handler after a slot isolation event, it might be broken.
+ * This sets the threshold for how many read attempts we allow
+ * before printing an error message.
 */
#define EEH_MAX_FAILS   210

@@ -470,6 +470,7 @@ int eeh_dn_check_failure(struct device_n
unsigned long flags;
struct pci_dn *pdn;
int rc = 0;
+   const char *location;

total_mmio_ffs++;

@@ -509,18 +510,15 @@ int eeh_dn_check_failure(struct device_n
rc = 1;
if (pdn-eeh_mode  EEH_MODE_ISOLATED) {
pdn-eeh_check_count ++;
-   if (pdn-eeh_check_count = EEH_MAX_FAILS) {
-   printk (KERN_ERR EEH: Device driver ignored %d bad reads, 
panicing\n,
-   pdn-eeh_check_count);
+   if (pdn-eeh_check_count % EEH_MAX_FAILS == 0) {
+   location = of_get_property(dn, ibm,loc-code, NULL);
+   printk (KERN_ERR EEH: %d reads ignored for recovering 
device at 
+   location=%s driver=%s pci addr=%s\n,
+   pdn-eeh_check_count, location,
+   dev-driver-name, pci_name(dev));
+   printk (KERN_ERR EEH: Might be infinite loop in %s 
driver\n,
+   dev-driver-name);
dump_stack();
-   msleep(5000);
-   
-   /* re-read the slot reset state */
-   if (read_slot_reset_state(pdn, rets) != 0)
-   rets[0] = -1;   /* reset state unknown */
-
-   /* If we are here, then we hit an infinite loop. Stop. 
*/
-   panic(EEH: MMIO halt (%d) on device:%s\n, rets[0], 
pci_name(dev));
}
goto dn_unlock;
}


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


[PATCH] Don't panic when EEH_MAX_FAILS is exceeded

2008-07-20 Thread Mike Mason

This patch changes the EEH_MAX_FAILS action from panic to printing an error 
message.  Panicking under under this condition is too harsh.  Although 
performance will be affected and the device may not recover, the system is 
still running, which at the very least, should allow for a more graceful 
shutdown.  The panic() is now wrapped in a DEBUG statement for development 
purposes.  The patch also removes the msleep() within a spinlock, which is not 
allowed.

Signed-off-by: Mike Mason [EMAIL PROTECTED] 


--- powerpc.git/arch/powerpc/platforms/pseries/eeh.c2008-07-18 
08:51:42.0 -0700
+++ powerpc.git-new/arch/powerpc/platforms/pseries/eeh.c2008-07-18 
13:26:37.0 -0700
@@ -75,9 +75,9 @@
 */

/* If a device driver keeps reading an MMIO register in an interrupt
- * handler after a slot isolation event has occurred, we assume it
- * is broken and panic.  This sets the threshold for how many read
- * attempts we allow before panicking.
+ * handler after a slot isolation event, it might be broken.
+ * This sets the threshold for how many read attempts we allow
+ * before printing an error message.
 */
#define EEH_MAX_FAILS   210

@@ -470,6 +470,7 @@
unsigned long flags;
struct pci_dn *pdn;
int rc = 0;
+   const char *location;

total_mmio_ffs++;

@@ -509,18 +510,24 @@
rc = 1;
if (pdn-eeh_mode  EEH_MODE_ISOLATED) {
pdn-eeh_check_count ++;
-   if (pdn-eeh_check_count = EEH_MAX_FAILS) {
-   printk (KERN_ERR EEH: Device driver ignored %d bad reads, 
panicing\n,
-   pdn-eeh_check_count);
+   if (pdn-eeh_check_count % EEH_MAX_FAILS == 0) {
+   location = (char *) of_get_property(dn, ibm,loc-code, 
NULL);
+   printk (KERN_ERR EEH: %d reads ignored for recovering 
device at 
+   location=%s driver=%s pci addr=%s\n,
+   pdn-eeh_check_count, location,
+   dev-driver-name, pci_name(dev));
+   printk (KERN_ERR EEH: Might be infinite loop in %s 
driver\n,
+   dev-driver-name);
+#ifdef DEBUG
dump_stack();
-   msleep(5000);
-   
+
/* re-read the slot reset state */
if (read_slot_reset_state(pdn, rets) != 0)
rets[0] = -1;   /* reset state unknown */

/* If we are here, then we hit an infinite loop. Stop. 
*/
panic(EEH: MMIO halt (%d) on device:%s\n, rets[0], 
pci_name(dev));
+#endif
}
goto dn_unlock;
}


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev


Re: [PATCH] Restore PERR/SERR bit settings during EEH device recovery

2008-07-08 Thread Mike Mason

Linas Vepstas wrote:

2008/7/7 Mike Mason [EMAIL PROTECTED]:

The following patch restores the PERR and SERR bits in the PCI
command register during an EEH device recovery.
We have found at least one case (an Agilent test card) where the
PERR/SERR bits are set to 1 by firmware at boot time, but are
not restored to 1 during EEH recovery.


Any chance they should be zero, and were accidentally set to 1?
In which case, you'd need an else clause, below.


I suppose it's possible.  I'll add your suggestion and resubmit.

Mike




The patch fixes the
Agilent card problem.  It has been tested on several other EEH-enabled cards
with no regressions.

Signed-off-by: Mike Mason [EMAIL PROTECTED]

--- linux-2.6.26-rc9/arch/powerpc/platforms/pseries/eeh.c   2008-07-07
16:06:57.0 -0700
+++ linux-2.6.26-rc9-new/arch/powerpc/platforms/pseries/eeh.c   2008-07-07
16:11:10.0 -0700
@@ -812,6 +812,7 @@
static inline void __restore_bars (struct pci_dn *pdn)
{
   int i;
+   u32 cmd;

   if (NULL==pdn-phb) return;
   for (i=4; i10; i++) {
@@ -832,6 +833,15 @@

   /* max latency, min grant, interrupt pin and line */
   rtas_write_config(pdn, 15*4, 4, pdn-config_space[15]);
+
+   /* Restore PERR  SERR bits, some devices require it,
+  don't touch the other command bits */
+   rtas_read_config(pdn, PCI_COMMAND, 4, cmd);
+   if (pdn-config_space[1]  PCI_COMMAND_PARITY)
+   cmd |= PCI_COMMAND_PARITY;


else cmd = ~PCI_COMMAND_PARITY;


+   if (pdn-config_space[1]  PCI_COMMAND_SERR)
+   cmd |= PCI_COMMAND_SERR;


else cmd = ~PCI_COMMAND_SERR;


+   rtas_write_config(pdn, PCI_COMMAND, 4, cmd);
}


Other than that, I'll add an

Acked-by: Linas Vepstas [EMAIL PROTECTED]

--linas


___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev