Wow, some great comments on here now, even a few people agreeing with me which 
is nice :D

I'll happily admit I don't have the in-depth understanding of storage many of
you guys have, but since the idea doesn't seem pie-in-the-sky crazy, I'm going 
to try to write up all my current thoughts on how this could work after reading 
through all the replies.

1. Track disk response times
- ZFS should track the average response time of each disk.
- This should be used internally for performance tweaking, so faster disks are 
favoured for reads.  This works particularly well for lopsided mirrors (there's 
a rough sketch of the idea after this list).
- I'd like to see this information (and the number of timeouts) in the output 
of zpool status, so administrators can see if any one device is performing 
badly.
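
To make that first point a bit more concrete, here's roughly what I'm 
picturing, written as a standalone C sketch.  It has nothing to do with the 
actual ZFS source, and all the names (dev_lat_t, lat_update, faster_side) are 
made up for the example: a smoothed average of service times per device, plus 
a timeout counter that zpool status could report.

    /*
     * Illustrative only; not actual ZFS code.  A per-device moving average
     * of I/O service times, the sort of statistic point 1 asks for.
     */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct dev_lat {
        uint64_t avg_ns;    /* smoothed average service time */
        uint64_t timeouts;  /* requests that exceeded the timeout */
    } dev_lat_t;

    /* Weight new samples at 1/8 so one slow I/O doesn't swing the average. */
    static void
    lat_update(dev_lat_t *dl, uint64_t sample_ns)
    {
        if (dl->avg_ns == 0)
            dl->avg_ns = sample_ns;  /* first sample seeds the average */
        else
            dl->avg_ns = (dl->avg_ns * 7 + sample_ns) / 8;
    }

    /* For a mirror, bias reads towards the side with the lower average. */
    static int
    faster_side(const dev_lat_t *a, const dev_lat_t *b)
    {
        return (a->avg_ns <= b->avg_ns ? 0 : 1);
    }

    int
    main(void)
    {
        dev_lat_t local = { 0, 0 }, remote = { 0, 0 };

        lat_update(&local, 200000);     /* 0.2ms local disk */
        lat_update(&remote, 15000000);  /* 15ms across the wire */
        printf("read from side %d\n", faster_side(&local, &remote));
        return (0);
    }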

2. New parameters
- ZFS should gain two new parameters.
  - A timeout value for the pool.
  - An option to enable that timeout for writes too (off by default).
- Still to be decided is whether that timeout is set manually, or automatically 
based on the information gathered in 1 (one way the automatic option could work 
is sketched after this list).
- Do we need a pool timeout (based on the timeout of the slowest device in the 
pool), or will individual device timeouts work better?
- I've decided that having this off by default for writes is probably better 
for ZFS.  It addresses some people's concerns about writing to a degraded pool, 
and puts data integrity ahead of availability, which seems to fit better with 
ZFS' goals.  I'd still like it myself for writes; I can live with a pool 
running degraded for 2 minutes while the problem is diagnosed.
- With that said, could the write timeout default to on when you have a slog 
device?  After all, the data is safely committed to the slog, and should remain 
there until it's written to all devices.  Bob, you seemed the most concerned 
about writes, would that be enough redundancy for you to be happy to have this 
on by default?  If not, I'd still be ok having it off by default; we could 
maybe just include a note in the evil tuning guide suggesting that it be 
turned on by anybody who has a separate slog device.
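
And here's one way the 'automatic' option could fall out of the statistics 
from point 1: derive the timeout as a multiple of the device's smoothed 
average, clamped to a floor so an idle SSD doesn't end up with a silly 
sub-millisecond timeout.  Again it's just a sketch; the names, the 10x 
multiplier and the 500ms floor are all invented, not values I'm proposing.

    /*
     * Illustrative only.  The two new tunables from point 2, plus one way
     * the timeout could be derived automatically from the point 1 stats.
     */
    #include <stdio.h>
    #include <stdint.h>

    #define MSEC_NS 1000000ULL  /* nanoseconds per millisecond */

    typedef struct pool_timeout_cfg {
        uint64_t timeout_ns;     /* 0 = derive automatically */
        int      apply_to_writes; /* off by default */
    } pool_timeout_cfg_t;

    static uint64_t
    effective_timeout(const pool_timeout_cfg_t *cfg, uint64_t dev_avg_ns)
    {
        if (cfg->timeout_ns != 0)
            return (cfg->timeout_ns);  /* set manually by the admin */

        /* Automatic: 10x the device's average, but never below 500ms. */
        uint64_t t = dev_avg_ns * 10;
        return (t < 500 * MSEC_NS ? 500 * MSEC_NS : t);
    }

    int
    main(void)
    {
        pool_timeout_cfg_t cfg = { 0, 0 };  /* automatic, reads only */

        printf("timeout for a 5ms-average disk: %llu ms\n",
            (unsigned long long)(effective_timeout(&cfg, 5 * MSEC_NS) /
            MSEC_NS));
        return (0);
    }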

3. How it would work
- If a read times out for any device, ZFS should immediately issue reads to all 
other devices holding that data.  The first response back will be used (there's 
a toy sketch of this flow after the list).
- Timeouts should be logged so the information can be used by administrators or 
FMA to help diagnose failing drives, but they should not count as a device 
failure on their own.
- Some thought is needed as to how this algorithm works on busy pools.  When 
reads are queuing up, we need to avoid false positives and avoid adding extra 
load on the pool.  Could the timeout, instead of being checked against the 
response time of an individual request, be used to check whether no responses 
at all have been received from a device for that length of time?  That still 
sounds reasonable for finding stuck devices, and should still work reliably on 
a busy pool.
- For reads, the pool does not need to go degraded, the device is simply 
flagged as "WAITING".
- When enabled for writes, these will be going to all devices, so there are no 
alternate devices to try.  This means any write timeout will be used to put the 
pool into a degraded mode.  This should be considered a temporary state with 
the drive in "WAITING" status, as while the pool itself is degraded (due to 
missing the writes for that drive), the drive is not yet offline.  At this 
point the system is simply keeping itself running while waiting for a proper 
error response from either the drive or from FMA.  If the drive eventually 
returns the missing response, it can be resilvered with any data it missed.  If 
the drive doesn't return a response, FMA should eventually fault it, and the 
drive can be taken offline and replaced with a hot spare.  At all times the 
administrator can see what is going on using zpool status, with the appropriate 
pool and drive status visible.
- Please bear in mind that although I'm using the word 'degraded' above, this 
is not necessarily the case for dual parity pools, I just don't know the proper 
term to use for a dual parity raid set where a single drive has failed.
- If this is just a one off glitch and the device comes back online, the 
resilver shouldn't take long as ZFS just needs to send the data that was missed 
(which will still be stored in the ZIL).
- If many devices time out at once due to a bad controller, a pulled cable, a 
power failure, etc., all the affected devices will be flagged as "WAITING".  If 
too many have gone for the pool to stay operational, ZFS should switch the 
entire pool to the 'wait' state while it waits for FMA etc. to return a proper 
response, after which it should react according to the failmode property set 
for the pool.
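
Here's the read half of that behaviour boiled down to a toy sketch (again, 
made-up names, not real ZFS code): the timed-out device gets flagged WAITING, 
the event is logged, and the read is reissued to every other copy, with the 
first answer winning.

    /*
     * Illustrative only; not actual ZFS code.  A toy version of the read
     * flow in point 3.  On a busy pool the same check could instead compare
     * "time since this device last answered anything" against the timeout,
     * as discussed above.
     */
    #include <stdio.h>

    typedef enum { DEV_ONLINE, DEV_WAITING, DEV_FAULTED } dev_state_t;

    typedef struct toy_dev {
        const char *name;
        dev_state_t state;
    } toy_dev_t;

    static void
    handle_read_timeout(toy_dev_t *slow, toy_dev_t *copies, int ncopies)
    {
        slow->state = DEV_WAITING;  /* flagged, but not counted as failed */
        printf("%s: read timeout logged for the admin/FMA, retrying elsewhere\n",
            slow->name);

        /* Fire the same read at every other copy; first answer back wins. */
        for (int i = 0; i < ncopies; i++)
            printf("  reissuing read to %s\n", copies[i].name);
    }

    int
    main(void)
    {
        toy_dev_t slow = { "c1t0d0", DEV_ONLINE };
        toy_dev_t others[] = {
            { "c1t1d0", DEV_ONLINE },
            { "c2t0d0", DEV_ONLINE },
        };

        handle_read_timeout(&slow, others, 2);
        return (0);
    }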

4. Food for thought
- While I like nico's idea for lop sided mirrors, I'm not sure any tweaking is 
needed.  I was thinking about whether these timeouts could improve performance 
for such a mirror, but I think a better option there is simply to use plenty of 
local write cache.  A ton of flash memory for the ZIL would be ideal, but 
if it's a particularly lopsided mirror you could use something as simple as a 
pair of locally mirrored drives.  Reads should be biased to the local devices 
anyway, so you're making the best possible use of the slow half by caching all 
writes and then streaming everything to the remote device.  Of course, things 
will slow down when the ZIL fills, as it now has to write to both halves 
together, but if that happens you probably need to rethink your solution anyway 
as your slow mirror isn't keeping up with your demands.
- I see all this as working on top of the B_FAILFAST modes mentioned in the 
thread.  I don't see that it has to impact ZFS' current fault management at 
all.  It's an extra layer that sits on top, simply with the aim of keeping the 
pool responsive while the lower level fault management does its job.  Of 
course, it's also able to provide extra information for things like FMA, but 
it's not intended as a way of diagnosing faults itself.
- Thinking about this further, if anything this seems to me like it should 
improve ZFS' reliability for writes.  It puts the pool into a temporary 
'degraded' state much faster if there is a doubt over whether any writes have 
been committed to disk.  It will still be queuing writes for that device, 
exactly as it would have done before, but the pool and the administrator now 
have earlier information as to the status.  If a second device enters the 
"WAITING" state, the whole pool immediately switches to 'wait' mode until a 
proper diagnosis happens (there's a toy sketch of this reaction after this 
list).  This happens much earlier than would be the case with FMA alone doing 
the diagnosis, and means that ZFS stops accepting data much earlier while it 
waits to see what the problem is.  For async writes it reduces the amount of 
data that has been accepted by ZFS but not committed to storage.  For sync 
writes I don't see that it has much effect; these writes would have been 
waiting for the bad device anyway.
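
And one last toy sketch for the pool-level reaction: count the devices stuck 
in the (hypothetical) WAITING state against the redundancy of the vdev, and 
switch the whole pool to 'wait' once that redundancy is exhausted.  The 
stricter rule I mentioned just above, going to 'wait' as soon as a second 
device is WAITING, would simply change the comparison.  All the names here are 
invented for the example.

    /*
     * Illustrative only; not actual ZFS code.  Counting devices stuck in
     * the hypothetical WAITING state against the vdev's redundancy, and
     * blocking the pool once that redundancy is exhausted.
     */
    #include <stdio.h>

    typedef enum { POOL_ONLINE, POOL_DEGRADED, POOL_WAIT } pool_state_t;

    /*
     * nparity is 1 for a two-way mirror or raidz1, 2 for raidz2, and so on:
     * the number of missing devices the vdev can tolerate and still serve
     * data.
     */
    static pool_state_t
    pool_reaction(int waiting_devs, int nparity)
    {
        if (waiting_devs == 0)
            return (POOL_ONLINE);
        if (waiting_devs <= nparity)
            return (POOL_DEGRADED);  /* keep running, queue the writes */
        return (POOL_WAIT);          /* block new I/O until FMA decides */
    }

    int
    main(void)
    {
        const char *names[] = { "ONLINE", "DEGRADED", "WAIT" };

        /* raidz2: two stuck drives degrade it, a third blocks the pool */
        for (int stuck = 0; stuck <= 3; stuck++)
            printf("%d waiting -> %s\n", stuck,
                names[pool_reaction(stuck, 2)]);
        return (0);
    }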

Well, I think that's everything.  To me it seems to address most of the 
problems raised here and there should be performance benefits to using it in 
some situations.  It will definitely have a beneficial effect in ensuring that 
ZFS maintains a good pool response time for any kind of failure, and it looks 
to me like it would work well for any kind of storage device.

I'm looking forward to seeing what holes can be knocked in these ideas with the 
next set of replies :)

Ross