Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-04-01 Thread Brian King
Grant Grundler wrote:
You want everything moved back to the queued state or failed
(flush pending IO so upper layers can retry if they want).

Upper layer is the linux block device; my understanding is that it does
not retry, nor do the filesystems above that.  Passing errors upwards
seems to be pretty darned fatal.  My goal is to limit retries to the
driver.
 
 
 That's a bad idea. Been there done that.
 
 Upper layers can be a lot smarter about retries than the driver ever
 could be. While the driver knows more about the transport and why
 something might fail, upper layers will know alternate paths
 to the same devices or to the same data on different devices.
 Upper layers also set the recovery policy for particular storage.
 
 Trying to do recovery transparently in the drivers is also going to
 mess up other high level SW like Service Guard or LifeKeeper.
 They want to know when a path has failed, log it, and make sure
 someone gets sent to service the HW if thresholds are exceeded.
 
 Let higher layers like dm, VxFS, LVM worry about recovery.

The sym2 driver should fail everything back with DID_ERROR.
In most cases, the scsi midlayer will retry if the upper layer allows
retries and you will get the behavior you desire. If retries are not
allowed, like for a tape device, the command will get failed back to the
upper layer driver.
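
As a rough illustration of the split described above, here is a toy
user-space model. It is not sym2 or midlayer code. The value of DID_ERROR
(0x07) and the host byte living in bits 16-23 of the result follow the
kernel's SCSI headers; struct toy_cmd, toy_fail_pending() and
toy_midlayer_disposition() are hypothetical names made up for this sketch,
and the retry rule is a simplification of what the midlayer really does.

```c
#include <assert.h>
#include <stddef.h>

#define DID_ERROR 0x07                          /* host byte code, as in the kernel headers */
#define HOST_BYTE(result) (((result) >> 16) & 0xff)

/* Hypothetical outcomes of the toy "midlayer" decision. */
enum toy_disposition { TOY_DONE, TOY_RETRY, TOY_FAIL_UP };

/* Hypothetical stand-in for a SCSI command. */
struct toy_cmd {
    int result;    /* completion status; host byte in bits 16-23 */
    int retries;   /* retries used so far */
    int allowed;   /* retries the upper layer permits (0 for e.g. tape) */
};

/* Driver side: after a PCI error, fail every outstanding command back. */
static void toy_fail_pending(struct toy_cmd *cmds, size_t n)
{
    assert(cmds != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        cmds[i].result = DID_ERROR << 16;
}

/* Midlayer side: retry while the budget lasts, otherwise pass the error up. */
static enum toy_disposition toy_midlayer_disposition(struct toy_cmd *cmd)
{
    if (cmd->result == 0)
        return TOY_DONE;
    if (HOST_BYTE(cmd->result) == DID_ERROR && cmd->retries < cmd->allowed) {
        cmd->retries++;
        return TOY_RETRY;
    }
    return TOY_FAIL_UP;
}
```

The point of the sketch is that the driver makes no retry decision at all:
a command with a retry budget is retried, while one with allowed == 0 goes
straight back to the upper layer driver.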

-- 
Brian King
eServer Storage I/O
IBM Linux Technology Center
-
To unsubscribe from this list: send the line unsubscribe linux-scsi in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-31 Thread Linas Vepstas
On Tue, Mar 22, 2005 at 11:38:36AM -0600, Brian King was heard to remark:
 Linas Vepstas wrote:
  
  My current hardware will halt all i/o to/from the symbios controller
  upon detection of a PCI error.  The recovery procedure that I am
  currently using is to call system firmware (aka 'bios') to raise
  and then lower the #RST pci signal line for 1/4 second, then wait 2
  seconds for the  PCI bus to settle, then restore the PCI config space
  registers (BARs, interrupt line, etc) to what they used to be. Then,
  I call sym_start_up() in an attempt to get the symbios card working
  again.  And that's where I get stuck ... 
  
  My assumption is that after the #RST, the symbios card will sit
  there, dumb and stupid, with no scripts running.  But sometimes I find
  that the card has done something to make the PCI error hardware trip
  again.  Typically, this means that the card attempted to DMA to some
  address that it's not allowed to touch, or raised #SERR or possibly
  #PERR (I can't tell which). 
 
 What config registers are you restoring? 

BARs, grant, latency, interrupt, cacheline size.
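
For reference, a minimal sketch of saving and restoring those registers. It
works on a plain 256-byte array standing in for config space, so it can be
run without hardware; a real driver would go through the kernel's
config-space accessors instead. The offsets are the standard type 0 header
offsets. struct toy_saved_cfg and the toy_* functions are hypothetical.
Min_Gnt is left out on the assumption that it is read-only, and the command
register is added (it is not in the list above) and written last so that
decoding is only re-enabled once the BARs are valid.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Standard PCI configuration-space offsets (type 0 header). */
#define PCI_COMMAND          0x04
#define PCI_CACHE_LINE_SIZE  0x0c
#define PCI_LATENCY_TIMER    0x0d
#define PCI_BASE_ADDRESS_0   0x10   /* six 32-bit BARs occupy 0x10..0x27 */
#define PCI_INTERRUPT_LINE   0x3c

#define TOY_BAR_BYTES        24     /* 6 BARs * 4 bytes */
#define TOY_CFG_SIZE         256

/* Hypothetical container for the registers worth restoring after #RST. */
struct toy_saved_cfg {
    uint8_t command[2];
    uint8_t cache_line_size;
    uint8_t latency_timer;
    uint8_t bars[TOY_BAR_BYTES];
    uint8_t interrupt_line;
};

/* Copy the interesting registers out of (simulated) config space. */
static void toy_save_cfg(const uint8_t *cfg, struct toy_saved_cfg *s)
{
    assert(cfg != NULL && s != NULL);
    memcpy(s->command, cfg + PCI_COMMAND, sizeof s->command);
    s->cache_line_size = cfg[PCI_CACHE_LINE_SIZE];
    s->latency_timer = cfg[PCI_LATENCY_TIMER];
    memcpy(s->bars, cfg + PCI_BASE_ADDRESS_0, sizeof s->bars);
    s->interrupt_line = cfg[PCI_INTERRUPT_LINE];
}

/* Write them back after the reset, command register last. */
static void toy_restore_cfg(uint8_t *cfg, const struct toy_saved_cfg *s)
{
    assert(cfg != NULL && s != NULL);
    memcpy(cfg + PCI_BASE_ADDRESS_0, s->bars, sizeof s->bars);
    cfg[PCI_CACHE_LINE_SIZE] = s->cache_line_size;
    cfg[PCI_LATENCY_TIMER] = s->latency_timer;
    cfg[PCI_INTERRUPT_LINE] = s->interrupt_line;
    memcpy(cfg + PCI_COMMAND, s->command, sizeof s->command);
}
```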

 Is it possible symbios does not
 like something in your config restore?

possibly...

 Another possibility is that asserting PCI reset is not cleanly resetting
 the card. Does PCI reset force BIST to be run on these cards? You could
 try to manually run BIST on the card after the PCI reset to see if that

I didn't see BIST in the code, but I wasn't looking for it either.  I
could try that.

 helps, or you could try power cycling the slot instead of using PCI reset.

yes I could :(  I'll try that next.  Problem is, not all slots are
power-cyclable, only the hotplug slots are.  I've discovered that
for example, the ethernet chips are soldered to the motherboard, and
can't be power-cycled (but fortunately, those don't give me trouble).


--linas


Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-22 Thread Brian King
Linas Vepstas wrote:
 Hi,
 
 There has been a running thread for a while on several mailing lists 
 concerning PCI bus error recovery.  Very briefly, some architectures
 have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
 also new PCI-Express chips from Intel (and other vendors) and possibly
 pa-risc and others).  
 
 I've been trying to prototype  error recovery.  I currently have
 ethernet and the IPR scsi driver working, but I am having trouble with 
 the symbios driver.  I need help/advice ... 
 
 On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
 
On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:

I also want to do the symbios driver...

FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.
 
 
 
 My current hardware will halt all i/o to/from the symbios controller
 upon detection of a PCI error.  The recovery procedure that I am
 currently using is to call system firmware (aka 'bios') to raise
 and then lower the #RST pci signal line for 1/4 second, then wait 2
 seconds for the  PCI bus to settle, then restore the PCI config space
 registers (BARs, interrupt line, etc) to what they used to be. Then,
 I call sym_start_up() in an attempt to get the symbios card working
 again.  And that's where I get stuck ... 
 
 My assumption is that after the #RST, the symbios card will sit
 there, dumb and stupid, with no scripts running.  But sometimes I find
 that the card has done something to make the PCI error hardware trip
 again.  Typically, this means that the card attempted to DMA to some
 address that it's not allowed to touch, or raised #SERR or possibly
 #PERR (I can't tell which). 

What config registers are you restoring? Is it possible symbios does not
like something in your config restore?

Another possibility is that asserting PCI reset is not cleanly resetting
the card. Does PCI reset force BIST to be run on these cards? You could
try to manually run BIST on the card after the PCI reset to see if that
helps, or you could try power cycling the slot instead of using PCI reset.
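
A minimal sketch of what "manually run BIST" could look like, assuming the
standard BIST register layout at config offset 0x0f (bit 7 = capable,
bit 6 = start, bits 3:0 = completion code, 0 meaning pass). The device here
is simulated so the logic can be exercised without hardware; struct toy_dev
and every toy_* function are hypothetical. Real code would sleep between
polls and give up after the 2 seconds the PCI spec allows for BIST to
complete. Whether the symbios chips implement BIST at all is not something
this sketch assumes.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define PCI_BIST            0x0f   /* config-space offset of the BIST register */
#define PCI_BIST_CODE_MASK  0x0f   /* completion code, 0 = passed */
#define PCI_BIST_START      0x40   /* write 1 to start; device clears when done */
#define PCI_BIST_CAPABLE    0x80   /* device implements BIST */

/* Hypothetical simulated device: finishes BIST after a few polls. */
struct toy_dev {
    uint8_t bist;             /* current BIST register contents */
    int polls_until_done;     /* polls the simulated test takes */
    uint8_t completion_code;  /* code reported when it finishes */
};

static uint8_t toy_read_bist(struct toy_dev *d)
{
    if ((d->bist & PCI_BIST_START) && d->polls_until_done-- <= 0)
        d->bist = PCI_BIST_CAPABLE | (d->completion_code & PCI_BIST_CODE_MASK);
    return d->bist;
}

static void toy_write_bist(struct toy_dev *d, uint8_t v)
{
    if (d->bist & PCI_BIST_CAPABLE)
        d->bist |= (uint8_t)(v & PCI_BIST_START);
}

/* Returns 0 on pass, a positive completion code on failure,
 * -1 if BIST is not supported, -2 if it did not finish in time. */
static int toy_run_bist(struct toy_dev *d, int max_polls)
{
    assert(d != NULL && max_polls > 0);
    if (!(toy_read_bist(d) & PCI_BIST_CAPABLE))
        return -1;
    toy_write_bist(d, PCI_BIST_START);
    for (int i = 0; i < max_polls; i++) {
        uint8_t v = toy_read_bist(d);
        if (!(v & PCI_BIST_START))
            return v & PCI_BIST_CODE_MASK;
        /* real code: sleep here, bounded by 2 seconds in total */
    }
    return -2;
}
```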

-Brian

 
 Sometimes, I get the PCI error while the card is sitting there idly
 after the #RST, but more often, I get the error in sym_chip_reset(),
 immediately after the   OUTB (nc_istat, SRST);
 
 Any clue what this is about? Am I missing something? I'm rather
 perplexed at this point, any clues/hints/suggestions are welcome.
 
 --linas
 
 


-- 
Brian King
eServer Storage I/O
IBM Linux Technology Center


Re: Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-22 Thread Grant Grundler
On Mon, Mar 21, 2005 at 05:10:28PM -0600, Linas Vepstas wrote:
 My current hardware will halt all i/o to/from the symbios controller
 upon detection of a PCI error.  The recovery procedure that I am
 currently using is to call system firmware (aka 'bios') to raise
 and then lower the #RST pci signal line for 1/4 second, then wait 2
 seconds for the  PCI bus to settle, then restore the PCI config space
 registers (BARs, interrupt line, etc) to what they used to be. Then,
 I call sym_start_up() in an attempt to get the symbios card working
 again.  And that's where I get stuck ... 

Does this process cause a SCSI bus reset?
SCSI devices will continue *forever* to send status back to the host
on IOs that have completed. At least that's what I remember from
working on this 8 years ago. Issuing a SCSI Bus Reset or
Bus Device Reset (BDR) will quiesce the devices.

I'm asking because it's possible the sym2 driver isn't expecting
anything from any device at that point.

BTW, when did sym2 get a chance to clean up pending requests?
You want everything moved back to the queued state or failed
(flush pending IO so upper layers can retry if they want).

 My assumption is that after the #RST, the symbios card will sit
 there, dumb and stupid, with no scripts running.  But sometimes I find
 that the card has done something to make the PCI error hardware trip
 again.  Typically, this means that the card attempted to DMA to some
 address that it's not allowed to touch, or raised #SERR or possibly
 #PERR (I can't tell which). 

PCI Reset typically only affects PCI facing parts of a chip.
e.g. some LAN PHYs don't get reset and need to be manually reset.
I'm skeptical sym2 will (or should) issue a SCSI Bus reset when
PCI Reset is asserted. Think multi-initiator.

 Sometimes, I get the PCI error while the card is sitting there idly
 after the #RST, but more often, I get the error in sym_chip_reset(),
 immediately after the   OUTB (nc_istat, SRST);

Oh? Is this the driver trying to issue SCSI Reset?

 Any clue what this is about? Am I missing something? I'm rather
 perplexed at this point, any clues/hints/suggestions are welcome.

Sorry - I'm no expert on 53c8xx chips. Hope the above helps.

grant


Symbios PCI error recovery [Was: Re: [PATCH/RFC] ppc64: EEH + SCSI recovery (IPR only)]

2005-03-21 Thread Linas Vepstas

Hi,

There has been a running thread for a while on several mailing lists 
concerning PCI bus error recovery.  Very briefly, some architectures
have PCI error recovery mechanisms built into them (e.g. IBM PowerPC,
also new PCI-Express chips from Intel (and other vendors) and possibly
pa-risc and others).  

I've been trying to prototype  error recovery.  I currently have
ethernet and the IPR scsi driver working, but I am having trouble with 
the symbios driver.  I need help/advice ... 

On Fri, Feb 25, 2005 at 11:36:09PM -0700, Grant Grundler was heard to remark:
 On Wed, Feb 23, 2005 at 07:31:37PM -0600, Linas Vepstas wrote:
  I also want to do the symbios driver...
 
 FYI, Matthew Wilcox maintains the sym2 driver in cvs.parisc-linux.org.


My current hardware will halt all i/o to/from the symbios controller
upon detection of a PCI error.  The recovery procedure that I am
currently using is to call system firmware (aka 'bios') to raise
and then lower the #RST pci signal line for 1/4 second, then wait 2
seconds for the  PCI bus to settle, then restore the PCI config space
registers (BARs, interrupt line, etc) to what they used to be. Then,
I call sym_start_up() in an attempt to get the symbios card working
again.  And that's where I get stuck ... 
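
The sequence above, written out as a sketch so the ordering and the delays
are explicit. Only the order of steps and the two delays (1/4 second of
#RST, 2 seconds of settling) come from the description; the ops table, the
trace recorder and all toy_*/trace_* names are hypothetical, and
restart_driver merely stands in for the call to sym_start_up(). None of
this is real firmware or kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical platform hooks. */
struct toy_recovery_ops {
    void (*set_reset)(void *ctx, int asserted);  /* drive the slot's #RST line */
    void (*delay_ms)(void *ctx, unsigned ms);
    int  (*restore_config)(void *ctx);           /* BARs, interrupt line, etc.; 0 = ok */
    int  (*restart_driver)(void *ctx);           /* stands in for sym_start_up() */
};

/* Run the recovery steps in order; returns 0 on success. */
static int toy_recover(const struct toy_recovery_ops *ops, void *ctx)
{
    assert(ops != NULL);
    ops->set_reset(ctx, 1);      /* assert #RST ...          */
    ops->delay_ms(ctx, 250);     /* ... for 1/4 second       */
    ops->set_reset(ctx, 0);      /* release #RST             */
    ops->delay_ms(ctx, 2000);    /* let the PCI bus settle   */
    if (ops->restore_config(ctx) != 0)
        return -1;
    return ops->restart_driver(ctx);
}

/* Trace-recording hooks so the ordering can be checked without hardware. */
struct toy_trace {
    char log[16];        /* one letter per step, NUL-terminated */
    size_t n;
    unsigned total_ms;
};

static void toy_note(struct toy_trace *t, char c)
{
    if (t->n + 1 < sizeof t->log) {
        t->log[t->n++] = c;
        t->log[t->n] = '\0';
    }
}

static void trace_set_reset(void *ctx, int asserted)
{
    toy_note(ctx, asserted ? 'R' : 'r');
}

static void trace_delay_ms(void *ctx, unsigned ms)
{
    struct toy_trace *t = ctx;
    t->total_ms += ms;
    toy_note(t, 'd');
}

static int trace_restore_config(void *ctx) { toy_note(ctx, 'c'); return 0; }
static int trace_restart_driver(void *ctx) { toy_note(ctx, 's'); return 0; }

static const struct toy_recovery_ops toy_trace_ops = {
    trace_set_reset, trace_delay_ms, trace_restore_config, trace_restart_driver
};
```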

My assumption is that after the #RST, the symbios card will sit
there, dumb and stupid, with no scripts running.  But sometimes I find
that the card has done something to make the PCI error hardware trip
again.  Typically, this means that the card attempted to DMA to some
address that it's not allowed to touch, or raised #SERR or possibly
#PERR (I can't tell which). 

Sometimes, I get the PCI error while the card is sitting there idly
after the #RST, but more often, I get the error in sym_chip_reset(),
immediately after the   OUTB (nc_istat, SRST);

Any clue what this is about? Am I missing something? I'm rather
perplexed at this point, any clues/hints/suggestions are welcome.

--linas
