Re: [fm-discuss] FMA stuff for nexus/framework drivers?

Garrett D'Amore Tue, 15 Jan 2008 20:55:46 -0800

Steve Hanson wrote:
> Cynthia McGuire wrote:
>>
>>
>> Garrett D'Amore wrote:
>>> Thanks for the good advice that folks have given me.  I still have a 
>>> few more questions.
>>>
>>> 1) Many errors are likely to occur or be detected at hotplug time.  
>>> That is, when the SDcard is first inserted.  Generally, this means 
>>> that the given SDcard will not be initialized, and cannot be 
>>> accessed.  What are the expectations for FMA here?  I clearly can't 
>>> use the SDcard device itself in the topology, because it doesn't 
>>> exist (although the slot does exist).  Apart from the fact that the 
>>> administrator never got a chance to use the card in the first place, 
>>> there really isn't any loss of service.  (The service was never 
>>> delivered in the first place.)  Is FMA still the right answer here?  
>>> (Note that a lot of the detectable errors might just be someone 
>>> trying to use a card that is not supported by the slot, so it's not 
>>> really a fault so much as just user-error.)
>>
>> I suppose these are not faults in the sense that there is broken 
>> hardware but you may want to send an 'alert'.  An 'alert' is defined 
>> in the phase I of the Sensor Abstraction project.  An alert event 
>> doesn't indict something broken but rather alerts the admin to 
>> something out of range.  In any case, I think you can model the 
>> topology with slots containing cards much like we do for disk drives 
>> and their bays.  The fault or alert event could then point to the 
>> slot rather than a card that doesn't exist.
>>
> Generally a driver's attach routine should be able to distinguish 
> between failure due to an administrator error (e.g. unrecognised 
> device ID) and a real hardware fault (device hang, parity error on 
> bus). I think the latter case is potentially quite likely - if there 
> is a hard fault on the card it is quite likely to show up during 
> attach. If your driver can detect genuine hardware faults during 
> attach it should report them and they should be diagnosed to faults so 
> that the appropriate service action can be raised.
>
> The other case Cindi mentions (raising an "alert" for cases where there 
> is no hardware fault) is part of an upcoming project, so I guess 
> that's more for the future.
>
> There is certainly a problem with devices that fail to attach not 
> being in the topology.  I've been discussing this with Vikram to see 
> if we can fix this (if the node got as far as the init state it may 
> still be possible to detect it).


Well, in the most likely case, I won't have even allocated a node yet.  
The most obvious and common failures will be while I'm still trying to 
identify the card as a valid device.  The low-level hardware 
initialization happens *way* before the DDI is ever notified, at least 
in my case.
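
For illustration, a minimal sketch of how insertion-time failures might be 
sorted into "user error" versus "real hardware fault", and reported against 
the slot when no node exists yet.  Every name here (sd_init_result_t, 
sd_classify_init_failure, and so on) is hypothetical and not part of any 
existing framework; it is plain C with no DDI or FMA calls, just to show 
the decision.

```c
#include <stdbool.h>

/* Hypothetical outcomes of identifying a newly inserted card. */
typedef enum {
	SD_INIT_OK,
	SD_INIT_UNSUPPORTED_CARD,	/* card the slot cannot support: user error */
	SD_INIT_NO_RESPONSE,		/* card hung during identification */
	SD_INIT_SIGNAL_ERROR		/* protocol/signaling error on the bus */
} sd_init_result_t;

/* Hypothetical report kinds: nothing, an "alert", or a fault. */
typedef enum {
	SD_REPORT_NONE,
	SD_REPORT_ALERT,
	SD_REPORT_FAULT
} sd_report_kind_t;

/* Hypothetical topology targets for the report. */
typedef enum {
	SD_TARGET_SLOT,
	SD_TARGET_CARD
} sd_report_target_t;

/* Decide whether an init result is worth an alert or a fault report. */
sd_report_kind_t
sd_classify_init_failure(sd_init_result_t r)
{
	switch (r) {
	case SD_INIT_OK:
		return (SD_REPORT_NONE);
	case SD_INIT_UNSUPPORTED_CARD:
		return (SD_REPORT_ALERT);
	case SD_INIT_NO_RESPONSE:
	case SD_INIT_SIGNAL_ERROR:
		return (SD_REPORT_FAULT);
	}
	return (SD_REPORT_FAULT);	/* unknown results are treated as faults */
}

/* With no node allocated for the card, only the slot can be named. */
sd_report_target_t
sd_report_target(bool node_allocated)
{
	return (node_allocated ? SD_TARGET_CARD : SD_TARGET_SLOT);
}
```

The point of keeping this in one small function is that the classification 
stays trivial and the real diagnosis is left to the diagnosis software.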

>>>
>>> 2) The most reasonable response to most of the errors that the 
>>> SDcard framework can detect is simply to offline the failing card.  
>>> I don't think I want to wait until some userland agent does this -- 
>>> I'd feel a lot better if the offline/retire action took place in the 
>>> kernel, as quickly as possible.  (Mostly because I don't want the 
>>> framework then trying to continue to access the failing device.)  
>>
>> The FMA does permit immediate error handling following detection of 
>> an error when the system or user data may be compromised.  For 
>> example, a hardened driver may want to discontinue using a particular 
>> device instance after detecting a fatal error.  This is preferable in 
>> some situations to a panic.  Post-diagnosis, agents can decide 
>> whether or not the error handling action was correct.  For example, 
>> the diagnosis software could determine that the wrong device was 
>> offlined and make a correction.
>>
> You probably ought to read Vikram's IO retire spec (PSARC 2007/290). 
> He has a number of mechanisms for isolating a device such as 
> "fencing", which maybe you could use?

I'll take a look at it.  Thanks for the reference.
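
For illustration, a minimal sketch of the in-kernel part of this: the 
first detected error marks the slot faulted and is returned in-band, and 
every later command is refused without touching the hardware until the 
slot is reset.  SDA_EIO and SDA_EFAULTED are the error names mentioned in 
this thread; their numeric values, SDA_OK, the sd_slot_t type, and both 
functions are assumptions made up for the example, and the hw_ok argument 
merely stands in for the result of a real hardware transaction.

```c
#include <stdbool.h>

/* Error codes; the names SDA_EIO/SDA_EFAULTED come from the thread,
 * the values and SDA_OK are assumed. */
typedef enum {
	SDA_OK = 0,
	SDA_EIO,
	SDA_EFAULTED
} sda_err_t;

/* Hypothetical per-slot state. */
typedef struct {
	bool faulted;
} sd_slot_t;

/*
 * Submit one command.  hw_ok simulates whether the hardware transaction
 * succeeded.  Once the slot is faulted, no further access is attempted.
 */
sda_err_t
sd_slot_submit(sd_slot_t *slot, bool hw_ok)
{
	if (slot->faulted)
		return (SDA_EFAULTED);	/* fenced off: do not touch the card */
	if (!hw_ok) {
		slot->faulted = true;	/* immediate in-kernel isolation */
		return (SDA_EIO);	/* synchronous failure to the child */
	}
	return (SDA_OK);
}

/* Clear the faulted state, e.g. after reinsertion or a slot reset. */
void
sd_slot_reset(sd_slot_t *slot)
{
	slot->faulted = false;
}
```

Nothing here decides what was actually broken; it only stops further 
access, which leaves the diagnosis to be done later.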

    -- Garrett
>
> Steve
>
>> The key thing is that you not embed complex diagnosis in your 
>> framework or driver.  Try to separate what needs to happen 
>> immediately and what can wait until diagnosis gives a clear picture 
>> of the problem.
>>
>>> So, if the framework
>>> does this, what kind of topology should I report against?  The slot, 
>>> or the card itself?
>>>
>>
>> The thing that's broken, which sounds like it is the card.
>>
>>> 3) That leads to the next course, which is how to handle recovery.  
>>> My gut feeling is that the recovery action for errors should be:
>>>
>>>    a) the user removes and reinserts the card (or a different card)
>>>    b) the user uses cfgadm -x reset-slot to reset the slot and the card
>>>
>>
>> These sound like possible repair actions that you will describe in 
>> your knowledge articles.
>>
>>> Note that I don't think automated recovery action in fmad is 
>>> necessarily a good idea.
>>>
>>
>> That's fine, although you may need to disable the IO retire agent 
>> from taking its default actions.
>>
>>> 4) SDcard, as a bus, doesn't have the notion of DMA or bus mapping.  
>>> So access handle checking makes little sense to me.  But I'm 
>>> imagining that the errors that can be detected (e.g. a 
>>> protocol/signaling error) might need to be reported to child 
>>> drivers.  But then again, the recovery action is generally to just 
>>> report a synchronous failure to the child (e.g. SDA_EIO or 
>>> somesuch).  If I've done that, do I also need to go thru the trouble 
>>> of propagating these errors to child nodes?  (Generally the child 
>>> node is going to be taken offline anyway, although it may refuse 
>>> the associated ddi-detach, but if it continues to try to perform 
>>> I/O, right now I wind up returning a generic SDA_EFAULTED error, 
>>> indicating that the slot is in a faulted state and IO is not possible.)
>>
>> It depends if you want to permit the child instances to report any 
>> errors of their own.  That's the purpose of the error reporting chain 
>> in PCI and the DDI DMA routines.  Because errors and controllers 
>> cross interface boundaries, providing an error reporting chain 
>> permits those errors to be reported  before the device is taken 
>> offline.  I don't really know enough about the technology to say 
>> which is the best approach.
>>
>>>
>>> 5) Of course, SD slot controllers are themselves on busses which 
>>> have DMA and registers, so the parent slot driver will be checking 
>>> access handles, detecting PCI bus errors, etc.  How, if at all, 
>>> would these be reported to the child driver?  Again, the child 
>>> driver has no access handles itself.  I'm kind of thinking that just 
>>> returning errors synchronously (in response to commands), combined 
>>> with an ereport posted upstream from the slot, is adequate.  But am I 
>>> missing something?
>>
>> Passing error information in-band via the command work should work 
>> just fine.
>>
>>>
>>> Thoughts?  Am I making sense?  Am I understanding things clearly?
>>
>> Yes, it sounds like you're on the right track!
>>
>>>
>>> Note that I think a lot of these similar issues would show up if FMA 
>>> was ever applied to e.g. USB.
>>
>> Absolutely.
>>
>> Cindi
>>
>
