On Tue, 25 Nov 2008, Ross Smith wrote:

> I disagree Bob, I think this is a very different function to that
> which FMA provides.
>
> As far as I know, FMA doesn't have access to the big picture of pool
> configuration that ZFS has, so why shouldn't ZFS use that information
> to increase the reliability of the pool while still using FMA to
> handle device failures?

If FMA does not currently have knowledge of the redundancy model but 
needs it to make well-informed decisions, then it should be updated to 
incorporate this information.

FMA sees all the hardware in the system, including devices used for 
UFS and other types of filesystems, and even tape devices.  It sees 
the hardware at a much more detailed level than ZFS does; ZFS only 
sees an abstracted view of it.  If an HBA or part of the backplane 
fails, FMA should be able to pinpoint the failing area (at least as 
far out as it can see based on available paths), whereas all ZFS 
knows is that it is having difficulty getting there from here.
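As a rough illustration (this is from memory of a recent OpenSolaris 
build, so exact paths may differ on your release), you can see the 
topology the fault manager works with, which goes well below anything 
a vdev label describes:

   # /usr/lib/fm/fmd/fmtopo
     (walks the full hardware topology that fmd knows about, from
      the motherboard down through HBAs to individual disk bays)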

> The flip side of the argument is that ZFS already checks the data
> returned by the hardware.  You might as well say that FMA should deal
> with that too since it's responsible for all hardware failures.

If bad data is returned, then I assume that FMA's error statistics 
counters get pegged as well.
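If memory serves, the checksum failures that ZFS detects are posted 
into the fault manager's error log as ereport.fs.zfs.* events, so you 
can check whether that telemetry is actually flowing with something 
like:

   # fmdump -eV
     (dumps the raw error telemetry, including any
      ereport.fs.zfs.checksum events posted by ZFS)
   # fmstat
     (shows per-module statistics, e.g. for the zfs-diagnosis
      module, if I have the module name right)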

> The role of ZFS is to manage the pool, availability should be part and
> parcel of that.

Too much complexity tends to clog up the works and keep other areas of 
ZFS from being enhanced expediently.  ZFS would soon become a chunk of 
source code that no mortal could understand, and as such it would be 
put under "maintenance" with no hope of moving forward and no ability 
to address new requirements.

A rational system really does not want to have multiple brains. 
Otherwise some parts of the system will think that a device is fine 
while other parts believe that it has failed, and none of us want to 
deal with an insane system like that.  There is also the matter of 
fault isolation.  If a drive cannot be reached, is it because the 
drive failed, because an HBA supporting multiple drives failed, or 
because a cable got pulled?  This sort of information is extremely 
important for large, reliable systems.
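That isolation is exactly what the fault manager's diagnosis gives 
you.  As a sketch (again from memory, so take the details with a 
grain of salt), once fmd has diagnosed a problem you would look at:

   # fmadm faulty
     (lists the diagnosed faults and the suspect FRU, e.g. a
      particular disk or HBA, rather than just "the pool cannot
      reach a vdev")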

Bob
======================================
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
