RE: lpfc PCIe error recovery

2007-01-11 Thread Bino . Sebastian
It looks like there is recursion in the stack trace:
scsi_request_fn is being called recursively.

-bino

-----Original Message-----
From: Linas Vepstas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 10, 2007 6:00 PM
To: Sebastian, Bino
Cc: Smart, James; [EMAIL PROTECTED]; Barry, Laurie; [EMAIL PROTECTED];
Papadimitriou, Vaios; [EMAIL PROTECTED];
[EMAIL PROTECTED]; linux-scsi@vger.kernel.org
Subject: Re: lpfc PCIe error recovery


On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:
> Hi Linas,
>   Following is the latest lpfc driver patch we are testing in the
> Emulex lab for PCI error recovery. This patch looks good on a Power5
> platform.

Yes, it seemed to survive a few hours of testing fine. I did see one
interesting thing, namely a softlockup. I attribute this to the fact
that I'd queued up a lot of heavy file I/O, issued a sync, which
typically takes more than a few seconds on the test system, and then
injected the artificial PCI error. After about ten seconds, I got the
softlockup, but after another 10-20 seconds, things seemed back to
normal. So I don't consider this an actual error, but thought
it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

> However, on a Power4 architecture there are errors reported
> in the upper layer (we discussed this in one of our earlier emails),
> followed by SCSI errors.

I'm trying to investigate now.

The patch you sent out got garbled, so I'm reposting below.



This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]



 drivers/scsi/lpfc/lpfc_init.c |   96 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===================================================================
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c 2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c  2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring  *pring;
 	uint32_t event_data;
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
 
 	if (phba->work_hs & HS_FFER6 ||
 	    phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring  *pring;
+
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re

Re: lpfc PCIe error recovery

2007-01-10 Thread Linas Vepstas
On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:
> Hi Linas,
>   Following is the latest lpfc driver patch we are testing in the
> Emulex lab for PCI error recovery. This patch looks good on a Power5
> platform.

Yes, it seemed to survive a few hours of testing fine. I did see one
interesting thing, namely a softlockup. I attribute this to the fact
that I'd queued up a lot of heavy file I/O, issued a sync, which
typically takes more than a few seconds on the test system, and then
injected the artificial PCI error. After about ten seconds, I got the
softlockup, but after another 10-20 seconds, things seemed back to
normal. So I don't consider this an actual error, but thought
it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

> However, on a Power4 architecture there are errors reported
> in the upper layer (we discussed this in one of our earlier emails),
> followed by SCSI errors.

I'm trying to investigate now.

The patch you sent out got garbled, so I'm reposting below.



This patch adds PCI Error recovery support to the
Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver.
Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]



 drivers/scsi/lpfc/lpfc_init.c |   96 ++
 drivers/scsi/lpfc/lpfc_sli.c  |   12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===================================================================
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c 2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c  2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring  *pring;
 	uint32_t event_data;
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
 
 	if (phba->work_hs & HS_FFER6 ||
 	    phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
 
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring  *pring;
+
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re-establishing link.
+	 */
+	pring = &psli->ring[psli->fcp_ring];
+	lpfc_sli_abort_iocb_ring(phba, pring);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev