RE: lpfc PCIe error recovery
It looks like there is a recursion in the stack trace. scsi_request_fn is called recursively.

-bino

-----Original Message-----
From: Linas Vepstas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 10, 2007 6:00 PM
To: Sebastian, Bino
Cc: Smart, James; [EMAIL PROTECTED]; Barry, Laurie; [EMAIL PROTECTED]; Papadimitriou, Vaios; [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-scsi@vger.kernel.org
Subject: Re: lpfc PCIe error recovery

On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:

Hi Linas,

Following is the latest lpfc driver patch we are testing in the Emulex lab for PCI error recovery. This patch looks good on a Power5 platform.

Yes, it seemed to survive a few hours of testing fine. I did see one interesting thing, namely a softlockup. I attribute this to the fact that I'd queued up a lot of heavy file i/o, issued a sync, which typically takes more than a few seconds on the test system, and then injected the artificial PCI error. After about ten seconds, I got the softlockup, but after another 10-20 seconds, things seemed back to normal. So I don't consider this an actual error, but thought it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
    LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

However, on a Power4 architecture there are errors reported in upper layer (we discussed this in one of earlier emails) followed by SCSI errors.

I'm trying to investigate now.

The patch you sent out got garbled, so I'm reposting below.

This patch adds PCI Error recovery support to the Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver. Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]

 drivers/scsi/lpfc/lpfc_init.c | 96 ++
 drivers/scsi/lpfc/lpfc_sli.c | 12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c	2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c	2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring *pring;
 	uint32_t event_data;
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
 	if (phba->work_hs & HS_FFER6 || phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring *pring;
+
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re
Re: lpfc PCIe error recovery
On Tue, Jan 09, 2007 at 10:00:09AM -0500, [EMAIL PROTECTED] wrote:

Hi Linas,

Following is the latest lpfc driver patch we are testing in the Emulex lab for PCI error recovery. This patch looks good on a Power5 platform.

Yes, it seemed to survive a few hours of testing fine. I did see one interesting thing, namely a softlockup. I attribute this to the fact that I'd queued up a lot of heavy file i/o, issued a sync, which typically takes more than a few seconds on the test system, and then injected the artificial PCI error. After about ten seconds, I got the softlockup, but after another 10-20 seconds, things seemed back to normal. So I don't consider this an actual error, but thought it was interesting.

The actual stack trace was

BUG: soft lockup detected on CPU#2!
Call Trace:
[C253D470] [C000F8C8] .show_stack+0x68/0x1b0 (unreliable)
[C253D510] [C008E770] .softlockup_tick+0xec/0x124
[C253D5B0] [C006957C] .run_local_timers+0x1c/0x30
[C253D630] [C0023C18] .timer_interrupt+0xb8/0x4a4
[C253D710] [C0003578] decrementer_common+0xf8/0x100
--- Exception: 901 at .local_irq_restore+0x3c/0x40
    LR = ._spin_unlock_irqrestore+0x24/0x3c
[C253DA00] [C046D574] ._spin_unlock_irqrestore+0x18/0x3c (unreliable)
[C253DA90] [C031BBA0] .scsi_dispatch_cmd+0x25c/0x2e4
[C253DB30] [C03227CC] .scsi_request_fn+0x2c4/0x3c0
[C253DBE0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DC60] [C0216D34] .elv_insert+0x240/0x268
[C253DD00] [C021A224] .blk_requeue_request+0x38/0x54
[C253DD90] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253DE40] [C021ADF8] .__generic_unplug_device+0x54/0x6c
[C253DEC0] [C0216D34] .elv_insert+0x240/0x268
[C253DF60] [C021A224] .blk_requeue_request+0x38/0x54
[C253DFF0] [C032282C] .scsi_request_fn+0x324/0x3c0
[C253E0A0] [C021ADF8] .__generic_unplug_device+0x54/0x6c
etc.

However, on a Power4 architecture there are errors reported in upper layer (we discussed this in one of earlier emails) followed by SCSI errors.

I'm trying to investigate now.
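
For anyone reading the trace: the repeating frames are scsi_request_fn -> blk_requeue_request -> elv_insert -> __generic_unplug_device -> scsi_request_fn. Below is a rough userspace sketch of that shape, not kernel code. Every toy_* name is made up for illustration, and the assumption that each refused dispatch leads straight to a requeue and an unplug is only a simplification of what the trace suggests. It just shows how nesting grows by one frame per refused dispatch, while a loop-based retry stays flat. Whether this nesting is what actually triggered the soft lockup is not established here.

```c
#include <assert.h>

/*
 * Toy userspace model of the call chain seen in the trace.
 * All toy_* names are hypothetical; none of this is kernel code.
 */
struct toy_queue {
	int busy_left;   /* dispatch attempts that will still be refused */
	int depth;       /* current nesting of toy_request_fn() */
	int max_depth;   /* deepest nesting seen so far */
	int dispatched;  /* set once the request finally goes out */
};

static void toy_request_fn(struct toy_queue *q);

/* Stands in for __generic_unplug_device(): re-run the request function. */
static void toy_unplug(struct toy_queue *q)
{
	toy_request_fn(q);
}

/* Stands in for blk_requeue_request() -> elv_insert(): requeue, then unplug. */
static void toy_requeue(struct toy_queue *q)
{
	toy_unplug(q);
}

/* Stands in for scsi_request_fn(): try to dispatch, requeue on refusal. */
static void toy_request_fn(struct toy_queue *q)
{
	q->depth++;
	if (q->depth > q->max_depth)
		q->max_depth = q->depth;

	if (q->busy_left > 0) {
		q->busy_left--;
		toy_requeue(q);		/* re-enters toy_request_fn() */
	} else {
		q->dispatched = 1;
	}
	q->depth--;
}

/* Recursive variant: returns the deepest nesting reached. */
int toy_recursive_max_depth(int refusals)
{
	struct toy_queue q = { refusals, 0, 0, 0 };

	assert(refusals >= 0);
	toy_request_fn(&q);
	assert(q.dispatched);
	return q.max_depth;
}

/* Iterative variant: retry in a loop, so nesting never exceeds one frame. */
int toy_iterative_max_depth(int refusals)
{
	struct toy_queue q = { refusals, 0, 0, 0 };

	assert(refusals >= 0);
	q.depth = q.max_depth = 1;
	while (q.busy_left > 0)
		q.busy_left--;		/* retry without re-entering */
	q.dispatched = 1;
	q.depth = 0;
	return q.max_depth;
}
```

With three refused dispatches the recursive variant nests four frames deep, and the loop variant stays at one. In the real driver the number of refusals would depend on how long the PCI channel stays offline, which this sketch does not model.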

The patch you sent out got garbled, so I'm reposting below.

This patch adds PCI Error recovery support to the Emulex Lightpulse Fibrechannel (lpfc) SCSI device driver. Lightly tested at this point, works.

Signed-off-by: Linas Vepstas [EMAIL PROTECTED]
Signed-off-by: [EMAIL PROTECTED]
Cc: James Smart [EMAIL PROTECTED]

 drivers/scsi/lpfc/lpfc_init.c | 96 ++
 drivers/scsi/lpfc/lpfc_sli.c | 12 +
 2 files changed, 108 insertions(+)

Index: linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c
===
--- linux-2.6.20-rc4.orig/drivers/scsi/lpfc/lpfc_init.c	2007-01-10 12:30:01.0 -0600
+++ linux-2.6.20-rc4/drivers/scsi/lpfc/lpfc_init.c	2007-01-10 12:34:27.0 -0600
@@ -518,6 +518,10 @@ lpfc_handle_eratt(struct lpfc_hba * phba
 	struct lpfc_sli *psli = &phba->sli;
 	struct lpfc_sli_ring *pring;
 	uint32_t event_data;
+	/* If the pci channel is offline, ignore possible errors,
+	 * since we cannot communicate with the pci card anyway. */
+	if (pci_channel_offline(phba->pcidev))
+		return;
 	if (phba->work_hs & HS_FFER6 || phba->work_hs & HS_FFER5) {
@@ -1797,6 +1801,91 @@ lpfc_pci_remove_one(struct pci_dev *pdev
 	pci_set_drvdata(pdev, NULL);
 }
+/**
+ * lpfc_io_error_detected - called when PCI error is detected
+ * @pdev: Pointer to PCI device
+ * @state: The current pci connection state
+ *
+ * This function is called after a PCI bus error affecting
+ * this device has been detected.
+ */
+static pci_ers_result_t lpfc_io_error_detected(struct pci_dev *pdev,
+				pci_channel_state_t state)
+{
+	struct Scsi_Host *host = pci_get_drvdata(pdev);
+	struct lpfc_hba *phba = (struct lpfc_hba *)host->hostdata;
+	struct lpfc_sli *psli = &phba->sli;
+	struct lpfc_sli_ring *pring;
+
+	if (state == pci_channel_io_perm_failure) {
+		lpfc_pci_remove_one(pdev);
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	pci_disable_device(pdev);
+	/*
+	 * There may be I/Os dropped by the firmware.
+	 * Error iocb (I/O) on txcmplq and let the SCSI layer
+	 * retry it after re-establishing link.
+	 */
+	pring = &psli->ring[psli->fcp_ring];
+	lpfc_sli_abort_iocb_ring(phba, pring);
+
+	/* Request a slot reset. */
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+
+/**
+ * lpfc_io_slot_reset - called after the pci bus has been reset.
+ * @pdev: Pointer to PCI device
+ *
+ * Restart the card from scratch, as if from a cold-boot.
+ */
+static pci_ers_result_t lpfc_io_slot_reset(struct pci_dev