Steve Hanson wrote:
> Cynthia McGuire wrote:
>>
>>
>> Garrett D'Amore wrote:
>>> Thanks for the good advice that folks have given me. I still have a
>>> few more questions.
>>>
>>> 1) Many errors are likely to occur or be detected at hotplug time.
>>> That is, when the SDcard is first inserted. Generally, this means
>>> that the given SDcard will not be initialized, and cannot be
>>> accessed. What are the expectations for FMA here? I clearly can't
>>> use the SDcard device itself in the topology, because it doesn't
>>> exist (although the slot does exist). Apart from the fact that the
>>> administrator never got a chance to use the card in the first place,
>>> there really isn't any loss of service. (The service was never
>>> delivered in the first place.) Is FMA still the right answer here?
>>> (Note that a lot of the detectable errors might just be someone
>>> trying to use a card that is not supported by the slot, so it's not
>>> really a fault so much as just user-error.)
>>
>> I suppose these are not faults in the sense that there is broken
>> hardware but you may want to send an 'alert'. An 'alert' is defined
>> in phase I of the Sensor Abstraction project. An alert event
>> doesn't indict something broken but rather alerts the admin to
>> something out of range. In any case, I think you can model the
>> topology with slots containing cards much like we do for disk drives
>> and their bays. The fault or alert event could then point to the
>> slot rather than a card that doesn't exist.
>>
> Generally a driver's attach routine should be able to distinguish
> between failure due to an administrator error (e.g. unrecognised
> deviceid) and a real hardware fault (device hang, parity error on
> bus). I think the latter case is potentially quite likely - if there
> is a hard fault on the card it is quite likely to show up during
> attach. If your driver can detect genuine hardware faults during
> attach it should report them and they should be diagnosed to faults so
> that the appropriate service action can be raised.
>
> The other case Cindi mentions (raising an "alert" for cases where there
> is no hardware fault) is part of an upcoming project, so I guess
> that's more for the future.
>
> There is certainly a problem with devices that fail to attach not
> being in the topology. I've been discussing this with Vikram to see
> if we can fix this (if the node got as far as the init state it may
> still be possible to detect it).
Well, in the most likely case, I won't have even allocated a node yet.
The most obvious and common failures will be while I'm still trying to
identify the card as a valid device. The low-level hardware
initialization happens *way* before the DDI is ever notified, at least
in my case.
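
To make the distinction above concrete, here is a rough sketch of how
identification-time failures might be sorted into "user error" and
"possible hardware fault" before any node exists. Every name in it
(id_result_t, report_kind_t, classify_identify_failure) is hypothetical
and made up for illustration; it is not taken from the SDcard framework
or from the FMA interfaces, and the mapping itself is only an assumption
based on the examples given in this thread.

```c
/*
 * Hypothetical outcomes of identifying a newly inserted card.
 * These names are illustrative only, not a real framework API.
 */
typedef enum {
	ID_OK,			/* card identified successfully */
	ID_UNSUPPORTED_CARD,	/* slot cannot drive this card: user error */
	ID_CARD_HANG,		/* card stopped responding: likely hardware */
	ID_BUS_PARITY_ERROR	/* parity error on the bus: likely hardware */
} id_result_t;

/* What, if anything, gets reported against the slot. */
typedef enum {
	REPORT_NONE,	/* nothing to report */
	REPORT_ALERT,	/* no broken hardware, just notify the admin */
	REPORT_FAULT	/* post an error report so it can be diagnosed */
} report_kind_t;

/*
 * Decide how an identification failure is reported. Because the card
 * has no node yet, the report would name the slot, not the card.
 */
report_kind_t
classify_identify_failure(id_result_t result)
{
	switch (result) {
	case ID_OK:
		return (REPORT_NONE);
	case ID_UNSUPPORTED_CARD:
		return (REPORT_ALERT);
	case ID_CARD_HANG:
	case ID_BUS_PARITY_ERROR:
		return (REPORT_FAULT);
	}
	/* Unknown outcomes are treated conservatively as faults. */
	return (REPORT_FAULT);
}
```

The only point of the sketch is that the classification can be a simple
table-like decision made at detection time, leaving any real diagnosis
to the diagnosis software.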
>>>
>>> 2) The most reasonable response to most of the errors that the
>>> SDcard framework can detect is simply to offline the failing card.
>>> I don't think I want to wait until some userland agent does this --
>>> I'd feel a lot better if the offline/retire action took place in the
>>> kernel, as quickly as possible. (Mostly because I don't want the
>>> framework then trying to continue to access the failing device.)
>>
>> The FMA does permit immediate error handling following detection of
>> an error when the system or user data may be compromised. For
>> example, a hardened driver may want to discontinue using a particular
>> device instance after detecting a fatal error. This is preferable in
>> some situations to a panic. Post-diagnosis, agents can decide
>> whether or not the error handling action was correct. For example,
>> the diagnosis software could determine that the wrong device was
>> offlined and make a correction.
>>
> You probably ought to read Vikram's IO retire spec (PSARC 2007/290).
> He has a number of mechanisms for isolating a device such as
> "fencing", which maybe you could use?
I'll take a look at it. Thanks for the reference.
-- Garrett
>
> Steve
>
>> The key thing is that you not embed complex diagnosis in your
>> framework or driver. Try to separate what needs to happen
>> immediately and what can wait until diagnosis gives a clear picture
>> of the problem.
>>
>>> So, if the framework
>>> does this, what kind of topology should I report against? The slot,
>>> or the card itself?
>>>
>>
>> The thing that's broken, which sounds like it is the card.
>>
>>> 3) That leads to the next course, which is how to handle recovery.
>>> My gut feeling is that the recovery action for errors should be:
>>>
>>> a) the user removes and reinserts the card (or a different card)
>>> b) the user uses cfgadm -x reset-slot to reset the slot and the card
>>>
>>
>> These sound like possible repair actions that you will describe in
>> your knowledge articles.
>>
>>> Note that I don't think automated recovery action in fmad is
>>> necessarily a good idea.
>>>
>>
>> That's fine, although you may need to disable the IO retire agent
>> from taking its default actions.
>>
>>> 4) SDcard, as a bus, doesn't have the notion of DMA or bus mapping.
>>> So access handle checking makes little sense to me. But I'm
>>> imagining that the errors that can be detected (e.g. a
>>> protocol/signaling error) might need to be reported to child
>>> drivers. But then again, the recovery action is generally to just
>>> report a synchronous failure to the child (e.g. SDA_EIO or
>>> somesuch). If I've done that, do I also need to go thru the trouble
>>> of propagating these errors to child nodes? (Generally the child
>>> node is going to be taken offline anyway, although it may refuse
>>> the associated ddi-detach, but if it continues to try to perform
>>> I/O, right now I wind up returning a generic SDA_EFAULTED error,
>>> indicating that the slot is in a faulted state and IO is not possible.)
>>
>> It depends if you want to permit the child instances to report any
>> errors of their own. That's the purpose of the error reporting chain
>> in PCI and the DDI DMA routines. Because errors and controllers
>> cross interface boundaries, providing an error reporting chain
>> permits those errors to be reported before the device is taken
>> offline. I don't really know enough about the technology to say
>> which is the best approach.
>>
>>>
>>> 5) Of course, SD slot controllers are themselves on busses which
>>> have DMA and registers, so the parent slot driver will be checking
>>> access handles, detecting PCI bus errors, etc. How, if at all,
>>> would these be reported to the child driver? Again, the child
>>> driver has no access handles itself. I'm kind of thinking that just
>>> returning errors synchronously (in response to commands), combined
>>> with an ereport posted upstream from the slot, is adequate. But am I
>>> missing something?
>>
>> Passing error information in-band via the command path should work
>> just fine.
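
As a rough illustration of the in-band approach discussed in points 4
and 5, the sketch below models a slot that returns SDA_EIO for the
command that hit the error and SDA_EFAULTED for everything issued
afterwards. SDA_EIO and SDA_EFAULTED are the names used in this thread;
SDA_OK, slot_t, slot_submit and the hw_error flag are hypothetical, and
the numeric values are arbitrary. The real framework may look nothing
like this.

```c
#include <stdbool.h>

/* Error codes returned synchronously to child drivers (values assumed). */
typedef enum {
	SDA_OK = 0,	/* hypothetical success code */
	SDA_EIO,	/* this command failed with an I/O error */
	SDA_EFAULTED	/* slot is in a faulted state, no I/O possible */
} sda_err_t;

/* Minimal stand-in for per-slot state. */
typedef struct {
	bool faulted;	/* set once an error has been detected */
} slot_t;

/*
 * Submit a command on behalf of a child driver. The hw_error flag
 * stands in for whatever the slot driver detects (protocol error,
 * access handle failure on the parent bus, and so on).
 */
sda_err_t
slot_submit(slot_t *slot, bool hw_error)
{
	/* Once faulted, refuse all further I/O until the slot is reset. */
	if (slot->faulted)
		return (SDA_EFAULTED);

	if (hw_error) {
		/* The slot driver would post its ereport upstream here. */
		slot->faulted = true;
		return (SDA_EIO);
	}
	return (SDA_OK);
}
```

With this shape the child driver needs no access handles of its own: it
sees the failure as the return value of its own command, and any later
attempt at I/O fails fast until the card is reinserted or the slot is
reset.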
>>
>>>
>>> Thoughts? Am I making sense? Am I understanding things clearly?
>>
>> Yes, it sounds like you're on the right track!
>>
>>>
>>> Note that I think a lot of these similar issues would show up if FMA
>>> was ever applied to e.g. USB.
>>
>> Absolutely.
>>
>> Cindi
>>
>
_______________________________________________
fm-discuss mailing list