On Thu, 2010-06-17 at 18:38 -0400, Eric Schrock wrote:
> On Jun 17, 2010, at 6:13 PM, Garrett D'Amore wrote:
> > 
> > So how do you diagnose the situation where someone trips over a cable,
> > or where the drive was bumped and detached from the cable?  I guess I'm
> > OK with the idea that these are in a REMOVED state, but I'd like the
> > messaging to say something besides "the administrator has removed the
> > device" or somesuch (which is what it says now).  Clearly that's not
> > what happened.
> 
> Are you requesting that we diagnose the difference between tripping over a 
> cable and intentionally unplugging it?  That's clearly beyond any software's 
> ability to diagnose.

I guess it depends.  If you have a way to indicate intent -- such as by
issuing a command first -- then you can diagnose it, can't you?  (I
thought this was what cfgadm was all about.)
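
A minimal sketch of that idea in Python, purely for illustration: the
RemovalTracker class, its method names, and the message strings are all
hypothetical and do not correspond to cfgadm or to any FMA interface.  It
records an announced intent and uses it to pick the message when a
removal event arrives later.

```python
class RemovalTracker:
    """Hypothetical tracker of devices an admin has announced for removal."""

    def __init__(self):
        self._announced = set()

    def announce_removal(self, device):
        # Stand-in for an explicit "I am about to pull this device" command.
        self._announced.add(device)

    def on_removed(self, device):
        # Pick the message based on whether intent was recorded beforehand.
        if device in self._announced:
            self._announced.discard(device)
            return f"{device}: removed by the administrator"
        return f"{device}: removed unexpectedly (no removal was announced)"


tracker = RemovalTracker()
tracker.announce_removal("c1t0d0")
print(tracker.on_removed("c1t0d0"))  # announced beforehand
print(tracker.on_removed("c1t1d0"))  # never announced
```

With something like this, only the unannounced case would need to be
surfaced to the administrator as a possible problem.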

Perhaps the model here is that nobody ever needs to issue such commands
-- that it's reasonable to go around yanking drives from systems in the
datacenter willy-nilly.  I hope not.

> 
> On the SS7000 series, you get an alert that the enclosure has been detached 
> from the system.  The fru-monitor code (generalization of the disk-monitor) 
> that generates this sysevent has not yet been pushed to ON.
> 
> > a) when a unit is removed, a spare is recruited to replace it if one is
> > available.  (I.e. zfs-retire needs to work.)
> 
> This is handled by the REMOVED state, as zfs-retire subscribes to 
> resource.removed.

Yes, I saw that.

> 
> > b) ideally, this should be logged/handled in some manner asynchronously,
> > so that if such an event has occurred, it does not come as a surprise to
> > the administrator 2 weeks after the fact when the *2nd* unit dies or is
> > removed.
> 
> These are logged as alerts in the SS7000.  The first-class notion of a 
> Solaris alert is not new, and has been proposed in the past as part of FMA 
> work.  The FMA team is currently working on a project that will introduce 
> some of the underlying infrastructure to formalize alerts in Solaris.  These 
> events (the primitives are not called alerts) represent formalized things of 
> interest that are not directly related to a fault or defect.  That, along 
> with the ability to diagnose a defect over extended periods of removal, is 
> the correct way to represent this situation.
> 
> > It's that last point "b" that makes me feel less good about "REMOVED".
> > The current code seems to assume that removal is always intentional, and
> > therefore no further notification is needed.  But when a disk stops
> > answering SCSI commands, it may indicate an unplanned device failure.
> 
> There are many, many failure modes that can be distinguished just fine from 
> physical device removal.  For example, you can have a PHY up but the attached 
> device completely unresponsive, but you know there is a device there.  Or you 
> can look at the SES data to determine physical presence.  Converting all 
> hotplug events into faults is too broad a brush here.

Many of these failure modes depend on having a suitable enclosure.
While this may be fine for the SS7000, there are other users of ZFS that
don't have that ability.
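
To make the dependence concrete, here is a rough Python sketch.  The
classify() function, its parameters, and the state names are assumptions
made for illustration, not anything in ZFS or FMA.  ses_present is None
when there is no enclosure data to consult.

```python
from typing import Optional


def classify(phy_up: bool, responds: bool, ses_present: Optional[bool]) -> str:
    """Hypothetical classifier; ses_present is None without enclosure data."""
    if phy_up and responds:
        return "healthy"
    if phy_up and not responds:
        # Link is up, so something is attached, but it does not answer.
        return "unresponsive"
    # The PHY is down from here on.
    if ses_present is True:
        return "present-but-offline"
    if ses_present is False:
        return "removed"
    # No enclosure services: removal and failure look the same.
    return "ambiguous"


print(classify(True, False, None))    # PHY up, device silent
print(classify(False, False, False))  # enclosure reports the slot empty
print(classify(False, False, None))   # no enclosure data available
```

The last case is the one I'm worried about: without SES data, a dead
drive and a pulled drive cannot be told apart in this model.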

I guess the fact that the SS7000 code isn't kept up to date in ON means
that we may wind up having to do our own thing here... it's a bit
unfortunate, but ok.

The point is that, for now, we have a real problem: devices that fail
in any of a number of ways don't have *any* indication of the failure
reported.  So *that* is what we need to fix.

> 
> > One other thought -- I think ZFS should handle this in a manner such
> > that the behavior appears to the administrator to be the same,
> > regardless of whether I/O was occurring on the unit or not.
> > 
> > An interesting question is what happens if I yank a drive while there
> > are outstanding commands pending?  Those commands should time out at the
> > HBA, but will it report them as CMD_DEV_GONE, or will it report an error
> > causing a fault to be flagged?
> 
> This is detected as device removal.  There is a timeout associated with I/O 
> errors in zfs-diagnosis that gives some grace period to detect removal before 
> declaring a disk faulted.
> 

Ok.
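
As I understand the description, the grace period works roughly like the
Python sketch below.  This is an assumption-laden illustration, not the
zfs-diagnosis code: diagnose(), its parameters, and the 10-second default
are made up, since the actual timeout value isn't stated here.

```python
def diagnose(error_time, removal_time, grace_seconds=10.0):
    """Hypothetical: classify one I/O error as 'REMOVED' or 'FAULTED'.

    error_time: when the I/O error was seen, in seconds.
    removal_time: when a removal event was seen, or None if none arrived.
    grace_seconds: assumed grace period; the real value is not given here.
    """
    # A removal seen before the error, or within the grace period after it,
    # explains the error, so the disk is not declared faulted.
    if removal_time is not None and removal_time - error_time <= grace_seconds:
        return "REMOVED"
    return "FAULTED"


print(diagnose(0.0, 3.0))   # removal noticed inside the grace period
print(diagnose(0.0, None))  # no removal event ever arrived
print(diagnose(0.0, 30.0))  # removal noticed too late
```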

        - Garrett

> - Eric
> 
> --
> Eric Schrock, Fishworks                        http://blogs.sun.com/eschrock
> 
> 


