Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
Grant Grundler wrote:
>>> You want everything moved back to the queued state or failed (flush pending IO so upper layers can retry if they want).
>>
>> Upper layer is the linux block device; my understanding is that it does not retry, nor do the filesystems above that. Passing errors upwards seems to be pretty darned fatal. My goal is to limit retries to the driver.
>
> That's a bad idea. Been there, done that. Upper layers can be a lot smarter about retries than the driver ever could be. While the driver knows more about the transport and why something might fail, upper layers will know alternate paths to the same devices or to the same data on different devices. Upper layers also set the recovery policy for particular storage.
>
> Trying to do recovery transparently in the drivers is also going to mess up other high-level SW like Service Guard or LifeKeeper. They want to know when a path has failed, log it, and make sure someone gets sent to service the HW if thresholds are exceeded.
>
> Let higher layers like dm, VxFS, LVM worry about recovery.

The sym2 driver should fail everything back with DID_ERROR. In most cases, the scsi midlayer will retry if the upper layer allows retries, and you will get the behavior you desire. If retries are not allowed, as for a tape device, the command will get failed back to the upper layer driver.

--
Brian King
eServer Storage I/O
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
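For readers following the thread, the "fail everything back with DID_ERROR" suggestion can be illustrated with a small userspace model. This is only a sketch: `struct mock_cmd`, `flush_pending()` and `midlayer_disposition()` are hypothetical names made up for illustration, not sym2 or midlayer code. The only details taken from the kernel are the DID_ERROR value (0x07) and the placement of the host byte in bits 16..23 of the result word. The retry rule is a simplification of what is described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-byte status codes, same values as in the Linux SCSI headers. */
#define DID_OK    0x00
#define DID_ERROR 0x07

/* Hypothetical stand-in for a queued command; NOT the kernel's scsi_cmnd. */
struct mock_cmd {
    uint32_t result;   /* host byte lives in bits 16..23 */
    int retries;       /* attempts made so far */
    int allowed;       /* retries the upper layer permits (0 for e.g. tape) */
};

enum disposition { CMD_COMPLETE, CMD_RETRY, CMD_FAIL_UP };

/* Driver side: after a PCI error, fail every outstanding command
 * back with DID_ERROR instead of retrying inside the driver. */
static size_t flush_pending(struct mock_cmd *cmds, size_t n)
{
    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        cmds[i].result = (uint32_t)DID_ERROR << 16;
    return n;
}

/* Midlayer side (simplified assumption): retry while the upper layer's
 * retry budget lasts, otherwise hand the failure to the upper layer. */
static enum disposition midlayer_disposition(struct mock_cmd *cmd)
{
    if (((cmd->result >> 16) & 0xff) == DID_OK)
        return CMD_COMPLETE;
    if (cmd->retries < cmd->allowed) {
        cmd->retries++;
        cmd->result = 0;   /* cleared before the command is reissued */
        return CMD_RETRY;
    }
    return CMD_FAIL_UP;
}
```

In this model a disk command with a retry budget is requeued, while a command with `allowed == 0` (the tape case) goes straight back to the upper layer with the DID_ERROR host byte intact, so the retry policy stays outside the driver.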
Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
> Linas Vepstas wrote:
>> My current hardware will halt all i/o to/from the symbios controller upon detection of a PCI error. The recovery procedure that I am currently using is to call system firmware (aka 'bios') to raise and then lower the #RST pci signal line for 1/4 second, then wait 2 seconds for the PCI bus to settle, then restore the PCI config space registers (BARs, interrupt line, etc) to what they used to be. Then, I call sym_start_up() in an attempt to get the symbios card working again. And that's where I get stuck ...
>>
>> My assumption is that after the #RST, the symbios card will sit there, dumb and stupid, with no scripts running. But sometimes I find that the card has done something to make the PCI error hardware trip again. Typically, this means that the card attempted to DMA to some address that it's not allowed to touch, or raised #SERR or possibly #PERR (I can't tell which).
>
> What config registers are you restoring?

BARs, grant, latency, interrupt, cacheline size.

> Is it possible symbios does not like something in your config restore?

possibly...

> Another possibility is that asserting PCI reset is not cleanly resetting the card. Does PCI reset force BIST to be run on these cards? You could try to manually run BIST on the card after the PCI reset to see if that

I didn't see bist in the code, but I wasn't looking for it either. I could try that.

> helps, or you could try power cycling the slot instead of using PCI reset.

yes I could :( I'll try that next. Problem is, not all slots are power-cyclable, only the hotplug slots are. I've discovered that, for example, the ethernet chips are soldered to the motherboard, and can't be power-cycled (but fortunately, those don't give me trouble).

--linas
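As an illustration of the config-space save and restore discussed above, here is a minimal userspace sketch that treats config space as a 256-byte little-endian array. The register offsets are the standard PCI header offsets; `struct saved_cfg`, `cfg_save()` and `cfg_restore()` are hypothetical names, and nothing here is taken from the actual ppc64 EEH code. Two assumptions are worth flagging: "grant" is taken to mean Min_Gnt, which is normally read-only and is therefore left out, and the command register is restored last so the device does not decode addresses or master the bus before its BARs are valid again.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_CFG_SIZE        256   /* conventional PCI config space */
#define PCI_COMMAND         0x04  /* 16-bit */
#define PCI_CACHE_LINE_SIZE 0x0c  /* 8-bit */
#define PCI_LATENCY_TIMER   0x0d  /* 8-bit */
#define PCI_BASE_ADDRESS_0  0x10  /* six 32-bit BARs, 0x10..0x27 */
#define PCI_INTERRUPT_LINE  0x3c  /* 8-bit */

/* Hypothetical container for the registers named in the thread. */
struct saved_cfg {
    uint16_t command;
    uint8_t cache_line, latency, irq_line;
    uint32_t bar[6];
};

/* Config space is little-endian regardless of host byte order. */
static uint16_t rd16(const uint8_t *c, int off)
{
    return (uint16_t)(c[off] | (c[off + 1] << 8));
}

static void wr16(uint8_t *c, int off, uint16_t v)
{
    c[off] = (uint8_t)(v & 0xff);
    c[off + 1] = (uint8_t)(v >> 8);
}

static uint32_t rd32(const uint8_t *c, int off)
{
    return (uint32_t)c[off] | ((uint32_t)c[off + 1] << 8) |
           ((uint32_t)c[off + 2] << 16) | ((uint32_t)c[off + 3] << 24);
}

static void wr32(uint8_t *c, int off, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        c[off + i] = (uint8_t)((v >> (8 * i)) & 0xff);
}

/* Snapshot the registers while the device is still healthy. */
static void cfg_save(const uint8_t *cfg, struct saved_cfg *s)
{
    assert(cfg && s);
    s->command = rd16(cfg, PCI_COMMAND);
    s->cache_line = cfg[PCI_CACHE_LINE_SIZE];
    s->latency = cfg[PCI_LATENCY_TIMER];
    s->irq_line = cfg[PCI_INTERRUPT_LINE];
    for (int i = 0; i < 6; i++)
        s->bar[i] = rd32(cfg, PCI_BASE_ADDRESS_0 + 4 * i);
}

/* Write them back after reset: BARs first, command register last. */
static void cfg_restore(uint8_t *cfg, const struct saved_cfg *s)
{
    assert(cfg && s);
    for (int i = 0; i < 6; i++)
        wr32(cfg, PCI_BASE_ADDRESS_0 + 4 * i, s->bar[i]);
    cfg[PCI_CACHE_LINE_SIZE] = s->cache_line;
    cfg[PCI_LATENCY_TIMER] = s->latency;
    cfg[PCI_INTERRUPT_LINE] = s->irq_line;
    wr16(cfg, PCI_COMMAND, s->command);
}
```

Whether the restore order matters for the symbios card in particular is not established in this thread; the ordering above is just the conservative choice.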
Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
Linas Vepstas wrote:
> Hi,
>
> There has been a running thread for a while on several mailing lists concerning PCI bus error recovery. Very briefly, some architectures have PCI error recovery mechanisms built into them (e.g. IBM PowerPC, also new PCI-Express chips from Intel (and other vendors) and possibly pa-risc and others). I've been trying to prototype error recovery. I currently have ethernet and the IPR scsi driver working, but I am having trouble with the symbios driver. I need help/advice ...
>
> On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
>> On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
>>> I also want to do the symbios driver...
>>
>> FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.
>
> My current hardware will halt all i/o to/from the symbios controller upon detection of a PCI error. The recovery procedure that I am currently using is to call system firmware (aka 'bios') to raise and then lower the #RST pci signal line for 1/4 second, then wait 2 seconds for the PCI bus to settle, then restore the PCI config space registers (BARs, interrupt line, etc) to what they used to be. Then, I call sym_start_up() in an attempt to get the symbios card working again. And that's where I get stuck ...
>
> My assumption is that after the #RST, the symbios card will sit there, dumb and stupid, with no scripts running. But sometimes I find that the card has done something to make the PCI error hardware trip again. Typically, this means that the card attempted to DMA to some address that it's not allowed to touch, or raised #SERR or possibly #PERR (I can't tell which).

What config registers are you restoring? Is it possible symbios does not like something in your config restore? Another possibility is that asserting PCI reset is not cleanly resetting the card. Does PCI reset force BIST to be run on these cards? You could try to manually run BIST on the card after the PCI reset to see if that helps, or you could try power cycling the slot instead of using PCI reset.

-Brian

> Sometimes, I get the PCI error while the card is sitting there idly after the #RST, but more often, I get the error in sym_chip_reset(), immediately after the OUTB (nc_istat, SRST);
>
> Any clue what this is about? Am I missing something? I'm rather perplexed at this point, any clues/hints/suggestions are welcome.
>
> --linas

--
Brian King
eServer Storage I/O
IBM Linux Technology Center
Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote:
> My current hardware will halt all i/o to/from the symbios controller upon detection of a PCI error. The recovery procedure that I am currently using is to call system firmware (aka 'bios') to raise and then lower the #RST pci signal line for 1/4 second, then wait 2 seconds for the PCI bus to settle, then restore the PCI config space registers (BARs, interrupt line, etc) to what they used to be. Then, I call sym_start_up() in an attempt to get the symbios card working again. And that's where I get stuck ...

Does this process cause a SCSI bus reset?

SCSI devices will continue *forever* to send status back to the host on IOs that have completed. At least that's what I remember from working on this 8 years ago. Issuing a SCSI Bus Reset or Bus Device Reset (BDR) will quiesce the devices. I'm asking because it's possible the sym2 driver isn't expecting anything from any device at that point.

BTW, when did sym2 get a chance to clean up pending requests? You want everything moved back to the queued state or failed (flush pending IO so upper layers can retry if they want).

> My assumption is that after the #RST, the symbios card will sit there, dumb and stupid, with no scripts running. But sometimes I find that the card has done something to make the PCI error hardware trip again. Typically, this means that the card attempted to DMA to some address that it's not allowed to touch, or raised #SERR or possibly #PERR (I can't tell which).

PCI Reset typically only affects the PCI-facing parts of a chip, e.g. some LAN PHYs don't get reset and need to be manually reset. I'm skeptical sym2 will (or should) issue a SCSI Bus reset when PCI Reset is asserted. Think multi-initiator.

> Sometimes, I get the PCI error while the card is sitting there idly after the #RST, but more often, I get the error in sym_chip_reset(), immediately after the OUTB (nc_istat, SRST);

Oh? Is this the driver trying to issue SCSI Reset?

> Any clue what this is about? Am I missing something? I'm rather perplexed at this point, any clues/hints/suggestions are welcome.

Sorry - I'm no expert on 53c8xx chips. Hope the above helps.

grant
Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]
Hi,

There has been a running thread for a while on several mailing lists concerning PCI bus error recovery. Very briefly, some architectures have PCI error recovery mechanisms built into them (e.g. IBM PowerPC, also new PCI-Express chips from Intel (and other vendors) and possibly pa-risc and others). I've been trying to prototype error recovery. I currently have ethernet and the IPR scsi driver working, but I am having trouble with the symbios driver. I need help/advice ...

On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
> On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
>> I also want to do the symbios driver...
>
> FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.

My current hardware will halt all i/o to/from the symbios controller upon detection of a PCI error. The recovery procedure that I am currently using is to call system firmware (aka 'bios') to raise and then lower the #RST pci signal line for 1/4 second, then wait 2 seconds for the PCI bus to settle, then restore the PCI config space registers (BARs, interrupt line, etc) to what they used to be. Then, I call sym_start_up() in an attempt to get the symbios card working again. And that's where I get stuck ...

My assumption is that after the #RST, the symbios card will sit there, dumb and stupid, with no scripts running. But sometimes I find that the card has done something to make the PCI error hardware trip again. Typically, this means that the card attempted to DMA to some address that it's not allowed to touch, or raised #SERR or possibly #PERR (I can't tell which).

Sometimes, I get the PCI error while the card is sitting there idly after the #RST, but more often, I get the error in sym_chip_reset(), immediately after the OUTB (nc_istat, SRST);

Any clue what this is about? Am I missing something? I'm rather perplexed at this point, any clues/hints/suggestions are welcome.

--linas