Hey Jeff,

Good to hear there's work going on to address this.

What did you guys think of my idea of ZFS supporting a "waiting for a
response" status for disks as an interim solution that allows the pool
to continue operation while it's waiting for FMA or the driver to
fault the drive?

I do appreciate that it's hard to come up with a definitive "it's dead
Jim" answer, and I agree that long term the FMA approach will pay
dividends.  But I still feel this is a good short term solution, and
one that would also complement your long term plans.

My justification for this is that it seems to me that you can split
disk behavior into two states:
- returns data ok
- doesn't return data ok

And for the state where it's not returning data, you can again split
that in two:
- returns wrong data
- doesn't return data

The first of these is already covered by ZFS with its checksums (with
FMA doing the extra work to fault drives), so it's just the second
that needs immediate attention, and for the life of me I can't think
of any situation that a simple timeout wouldn't catch.
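
To make that concrete, here's a minimal sketch of the check I have in
mind (assumed names and types throughout; none of this is real ZFS
code):

    #include <stdint.h>

    /*
     * Hypothetical sketch only.  A device that has neither returned
     * data nor an error within the timeout falls into the "doesn't
     * return data" bucket and is marked WAITING, letting the pool
     * carry on without it while FMA deliberates.
     */
    typedef enum {
        VDEV_RESP_HEALTHY,  /* returns data ok */
        VDEV_RESP_WAITING,  /* no response yet; pool continues without it */
        VDEV_RESP_FAULTED   /* FMA (or the fail delay) has given up on it */
    } vdev_resp_state_t;

    static vdev_resp_state_t
    classify_io(int64_t issued_ns, int64_t now_ns, int64_t timeout_ns)
    {
        /*
         * Wrong data is already caught by checksums; this only
         * covers the "no answer at all" case.
         */
        return ((now_ns - issued_ns) > timeout_ns ?
            VDEV_RESP_WAITING : VDEV_RESP_HEALTHY);
    }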

Personally I'd love to see two parameters, allowing this behavior to
be turned on if desired, and allowing timeouts to be configured:

zfs-auto-device-timeout
zfs-auto-device-timeout-fail-delay

The first enables the feature and sets the maximum time ZFS will wait
for a response from a device before putting it in a "waiting" status.
The second would be optional: the maximum time ZFS will wait before
faulting the device outright (at which point it's replaced by a hot
spare).
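
Purely as an illustration (these tunables don't exist today; the names
and defaults here are my own invention), they could be ordinary kernel
globals, adjustable from /etc/system like other zfs module tunables:

    #include <stdint.h>

    /*
     * Hypothetical tunables, not real ZFS variables.  Values in
     * milliseconds.  On Solaris they could be set from /etc/system,
     * e.g. "set zfs:zfs_auto_device_timeout = 5000".
     */
    int64_t zfs_auto_device_timeout = 0;           /* 0 = feature off */
    int64_t zfs_auto_device_timeout_fail_delay = 120000;
                            /* ~2 minutes before the hot spare kicks in */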

The reason I think this will work well with the FMA work is that you
can implement this now and have a real improvement in ZFS
availability.  Then, as the other work starts bringing better modeling
for drive timeouts, the parameters can either be removed or set
automatically by ZFS.

Long term I guess there's also the potential to remove the second
setting if you felt FMA, etc. ever got reliable enough, but personally I
would always want to have the final fail delay set.  I'd maybe set it
to a long value such as 1-2 minutes, to give FMA, etc. a fair chance to
find the fault.  But I'd be much happier knowing that the system will
*always* be able to replace a faulty device within a minute or two, no
matter what the FMA system finds.

The key thing is that you're not faulting devices early, so FMA is
still vital.  The idea is purely to let ZFS keep the pool active by
removing the need for the entire pool to wait on the FMA diagnosis.

As I said before, the driver and firmware are only aware of a single
disk, and I would imagine that FMA has the same limitation - it's
only going to be looking at a single item and trying to determine
whether it's faulty or not.  Because of that, FMA is going to be
designed to be very careful to avoid false positives, and will likely
take its time to reach an answer in some situations.

ZFS, however, has the benefit of knowing more about the pool, and in the
vast majority of situations it should be possible for ZFS to read or
write from other devices while it's waiting for an 'official' verdict
on any one suspect component.
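
For example (again just a sketch, reusing the vdev_resp_state_t enum
from the snippet above), a mirror read could simply be steered to any
healthy child rather than stalling behind the unresponsive one:

    /*
     * Hypothetical sketch, not real ZFS code.  A mirror read can be
     * satisfied by any healthy child, so skip children that are merely
     * waiting on a response instead of stalling the whole pool.
     */
    static int
    mirror_pick_child(const vdev_resp_state_t *children, int nchildren)
    {
        int i;

        for (i = 0; i < nchildren; i++) {
            if (children[i] == VDEV_RESP_HEALTHY)
                return (i);     /* issue the read here */
        }
        return (-1);            /* no healthy child: must block or fail */
    }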

Ross


On Tue, Nov 25, 2008 at 8:37 AM, Jeff Bonwick <[EMAIL PROTECTED]> wrote:
> I think we (the ZFS team) all generally agree with you.  The current
> nevada code is much better at handling device failures than it was
> just a few months ago.  And there are additional changes that were
> made for the FishWorks (a.k.a. Amber Road, a.k.a. Sun Storage 7000)
> product line that will make things even better once the FishWorks team
> has a chance to catch its breath and integrate those changes into nevada.
> And then we've got further improvements in the pipeline.
>
> The reason this is all so much harder than it sounds is that we're
> trying to provide increasingly optimal behavior given a collection of
> devices whose failure modes are largely ill-defined.  (Is the disk
> dead or just slow?  Gone or just temporarily disconnected?  Does this
> burst of bad sectors indicate catastrophic failure, or just localized
> media errors?)  The disks' SMART data is notoriously unreliable, BTW.
> So there's a lot of work underway to model the physical topology of
> the hardware, gather telemetry from the devices, the enclosures,
> the environmental sensors etc, so that we can generate an accurate
> FMA fault diagnosis and then tell ZFS to take appropriate action.
>
> We have some of this today; it's just a lot of work to complete it.
>
> Oh, and regarding the original post -- as several readers correctly
> surmised, we weren't faking anything, we just didn't want to wait
> for all the device timeouts.  Because the disks were on USB, which
> is a hotplug-capable bus, unplugging the dead disk generated an
> interrupt that bypassed the timeout.  We could have waited it out,
> but 60 seconds is an eternity on stage.
>
> Jeff
>
> On Mon, Nov 24, 2008 at 10:45:18PM -0800, Ross wrote:
>> But that's exactly the problem Richard:  AFAIK.
>>
>> Can you state, absolutely and categorically, that there is no failure mode out
>> there (caused by hardware faults, or bad drivers) that will lock a drive up
>> for hours?  You can't, obviously, which is why we keep saying that ZFS
>> should have this kind of timeout feature.
>>
>> For once I agree with Miles, I think he's written a really good writeup of 
>> the problem here.  My simple view on it would be this:
>>
>> Drives are only aware of themselves as an individual entity.  Their job is 
>> to save & restore data to themselves, and drivers are written to minimise 
>> any chance of data loss.  So when a drive starts to fail, it makes complete 
>> sense for the driver and hardware to be very, very thorough about trying to 
>> read or write that data, and to only fail as a last resort.
>>
>> I'm not at all surprised that drives take 30 seconds to timeout, nor that 
>> they could slow a pool for hours.  That's their job.  They know nothing else 
>> about the storage, they just have to do their level best to do as they're 
>> told, and will only fail if they absolutely can't store the data.
>>
>> The raid controller on the other hand (Netapp / ZFS, etc) knows all about 
>> the pool.  It knows if you have half a dozen good drives online, it knows if 
>> there are hot spares available, and it *should* also know how quickly the 
>> drives under its care usually respond to requests.
>>
>> ZFS is perfectly placed to spot when a drive is starting to fail, and to 
>> take the appropriate action to safeguard your data.  It has far more 
>> information available than a single drive ever will, and should be designed 
>> accordingly.
>>
>> Expecting the firmware and drivers of individual drives to control the 
>> failure modes of your redundant pool is just crazy imo.  You're throwing 
>> away some of the biggest benefits of using multiple drives in the first 
>> place.
>