A while ago I asked

> Since my mpt(4) controller loses one of its attached discs every few weeks,
> needing a reboot and a twenty-hour RAID reconstruction, I'm thinking about
> switching to some mpii(4)-based SAS controller.
>
> Does someone use mpii(4) in production? Is this ready to put 250 people's
> home and mail dirs on?

And since no-one answered, I guess it's not ready yet and I'd better stick to mpt.

In the meantime, Brian Buhrow has kindly provided me with some patches that apparently improve things for him. However, I'm not sure whether these patches do the right thing; in places, I'm quite sure they don't do the right thing for me.

So I'm trying to tackle this myself (based on Brian's patch) and am making some progress. I need some advice from people with a better knowledge of LSI's MPI/MPT (does anyone have docs on that?) and a better understanding of the scsipi/driver interaction (is there documentation on this?).

The sequence of events on a failure in my case is

1. I get SAS link down/link up events
2. I get mpt timeouts on that disc
3. RAIDframe fails the disc
4. Occasionally, I have silent FFS corruption

After that, I need a reboot to get the SAS channel working again, followed by a twenty-hour RAID reconstruction.

I suspect the root cause for the link loss is a hardware issue. Unfortunately, I've been unsuccessful in tracking it down, so I currently have to live with it and work around it as best I can.

The timeout presumably originates from the link loss. The inability to recover must be either an MPT firmware bug or a deficiency in mpt(4) or both. The FFS corruption may be connected to improper timeout handling or be a separate issue.

I'm trying to simulate the failure by (on an identical machine) running dd and un-plugging and re-inserting the disc. With an unpatched kernel, I get the same symptoms: timeout and a stuck SAS channel.

It seems to be possible to recover by, on the timeout, resetting (mpt_soft_reset()) and re-initializing (mpt_init()) the IOC and returning all current commands to the scsipi layer. Is there a less intrusive way to reset just the one SAS channel of the MPT?

Now, what's the correct way of resetting/initializing the IOC and returning everything to scsipi?
I guess the correct order is to reset (which leaves the IOC in the stopped state), then to set xs->error and call scsipi_done(xs) on all pending operations, and then to init the IOC (which empties the request queue).

First question: what's the appropriate xs->error? XS_TIMEOUT seems to work, but doesn't seem correct (except for the original timed-out request, of course). Is there some XS_NEVER_MIND_JUST_TRY_AGAIN code?

Second question: When repeatedly calling scsipi_done(), can it happen that scsipi tries to re-queue these requests before I return? I would then lose them when re-initializing the IOC.

Third question: Do I need to care about xs->xs_callout?

Or is returning everything to scsipi simply the wrong approach? Any comments or better ideas to recover?
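
To make the proposed order (reset, hand everything back, then init) concrete, here is a minimal, self-contained sketch in C. It deliberately does not use the real mpt(4) or scsipi code: every mock_* name, the struct layouts and the MOCK_XS_RETRY_LATER code are hypothetical stand-ins invented for illustration, and the error code for the non-timed-out requests is left as a parameter because which value is appropriate is exactly the open question. It also does not model scsipi re-queueing a request from inside the completion call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real scsipi/mpt definitions. */
enum mock_xs_error {
	MOCK_XS_NOERROR = 0,
	MOCK_XS_TIMEOUT,	/* plays the role of XS_TIMEOUT */
	MOCK_XS_RETRY_LATER	/* placeholder for a "just try again" code */
};

struct mock_xfer {
	int  error;		/* plays the role of xs->error */
	bool done;		/* set once handed back to the upper layer */
};

#define MOCK_MAX_PENDING 8

struct mock_ioc {
	bool running;
	struct mock_xfer *pending[MOCK_MAX_PENDING];
	size_t npending;
};

/* Stand-in for mpt_soft_reset(): leaves the IOC stopped. */
void mock_soft_reset(struct mock_ioc *ioc) { ioc->running = false; }

/* Stand-in for scsipi_done(): only records that the request was returned. */
void mock_done(struct mock_xfer *xs) { xs->done = true; }

/* Stand-in for mpt_init(): empties the request queue and restarts the IOC. */
void mock_init(struct mock_ioc *ioc) { ioc->npending = 0; ioc->running = true; }

/*
 * Recovery in the order described above:
 *   1. reset the IOC,
 *   2. set an error on every pending request and hand it back,
 *   3. re-initialize the IOC.
 * Returns the number of requests handed back.
 */
size_t mock_recover(struct mock_ioc *ioc, const struct mock_xfer *timed_out,
    int other_error)
{
	size_t i, completed = 0;

	mock_soft_reset(ioc);
	assert(!ioc->running);	/* nothing may be in flight from here on */

	for (i = 0; i < ioc->npending; i++) {
		struct mock_xfer *xs = ioc->pending[i];

		/* Only the request that really timed out reports a timeout. */
		xs->error = (xs == timed_out) ? MOCK_XS_TIMEOUT : other_error;
		mock_done(xs);
		completed++;
	}

	mock_init(ioc);		/* queue is emptied only after everything is returned */
	return completed;
}
```

Calling mock_recover(&ioc, &timed_out_xfer, MOCK_XS_RETRY_LATER) on an IOC with two pending requests hands both back, marks only the timed-out one with the timeout code, and leaves the IOC running with an empty queue. In the real driver, anything re-queued by the upper layer between steps 2 and 3 would be dropped by step 3, which is the concern raised in the second question.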
