But if it were just the difference between a 5min freeze when a drive
fails and a 1min freeze, I don't see that anyone would care---both are
bad enough to trip upper-layer application timeouts in iSCSI
connections and load balancers, but neither is disastrous.

But it's not.  ZFS doesn't immediately offline the drive after one
read error.  Some people find it doesn't offline the drive at all,
until they notice which drive is taking multiple seconds to complete
commands and offline it manually.  So you have 1-5 minute freezes
several times a day, every time the slowly-failing drive hits a latent
sector error.

I'm saying the works:notworks comparison is not between TLER-broken
and non-TLER-broken.  I think the TLER fans are taking advantage of
people's binary debating bias to imply that TLER is the ``works OK''
case and non-TLER is ``broken: dont u see it's 5x slower.''  There are
three cases to compare for any given failure mode: TLER-failed,
non-TLER-failed, and working.  The proper comparison is therefore
between a successful read (7ms) and an unsuccessful read (7000ms * <n>
cargo-cult retries put into various parts of the stack to work around
some scar someone has on their knee from some weird thing an FC switch
once did in 1999).
If you give a drive enough retries on a sector with a read error, sometimes it can get the data back. I once had a project with an 80GB Maxtor IDE drive that I needed to get all the files off of. One file (a ZIP archive) was sitting over a sector with a read error.

I found that I could get what appeared to be partial data from the sector using Ontrack EasyRecovery, but the data read back from the 512-byte sector was slightly different each time. I manually repeated this a few times and got it down to only a few bytes out of the 512 that differed on each re-read attempt. Looking at those further, I figured it was actually only a few bits of each of those bytes that were changing, and I could narrow that down as well by looking at how frequently each value came back across reads. I knew the ZIP file carried a CRC32 for the correct byte sequence, and figured I could write a brute-force recovery for the remaining uncertain bytes.

I didn't end up writing the code to do that, because I found something else: GNU ddrescue. It can image a drive with as many automatic retries as you like, including infinite retries. I didn't need the drive right away, so I started up ddrescue and let it go after the drive over a whole weekend. There was only one sector on the whole drive that ddrescue was working to recover... the one under that file. About two days later it finished reading, and when I mounted the drive image I was able to open the ZIP file. The CRC check passed, which confirmed that after days of re-read attempts the drive had finally returned that last sector correctly.

It was really slow, but I had nothing to lose, and just wanted to see what would happen. I've tried it since on other bad sectors with varying results. Sometimes a couple hundred or thousand retries will get a lucky break and recover the sector. Sometimes not.
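For what it's worth, the brute force I had in mind would have looked something like the sketch below. It's only an illustration of the idea, not code I actually wrote: it assumes the uncertain byte offsets and their candidate values have already been narrowed down from the repeated reads, and that the correct CRC32 of the surrounding byte run is known (from the ZIP entry, in my case).

# Sketch of the CRC32 brute-force idea (illustrative, not what I ran).
# Assumes: `sector` is the best-guess 512 bytes, `candidates` maps each
# uncertain byte offset to the values seen across re-reads, and
# `expected_crc` is the CRC32 the whole byte run should have.

import itertools
import zlib

def brute_force_sector(prefix, sector, suffix, candidates, expected_crc):
    offsets = sorted(candidates)
    for combo in itertools.product(*(candidates[o] for o in offsets)):
        trial = bytearray(sector)
        for off, val in zip(offsets, combo):
            trial[off] = val
        if zlib.crc32(prefix + bytes(trial) + suffix) == expected_crc:
            return bytes(trial)   # this combination checks out
    return None                   # nothing matched; need more re-reads

With only a few uncertain bits the search space is tiny; with several whole bytes unknown it blows up toward 256^n combinations, which is part of why just letting the drive retry all weekend turned out to be the easier path.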


The unsuccessful read is thousands of times slower than normal
performance.  It doesn't make your array seem 5x slower during the
fail like the false TLER vs non-TLER comparison makes it seem.  It
makes your array seem entirely frozen.  The actual speed doesn't
matter: it's FROZEN.  Having TLER does not make FROZEN any faster than
FROZEN.
I agree.

The story here sounds great, so I can see why it spreads so well:
``during drive failures, the array drags performance a little, maybe
5x, until you locate the drive and replace it.  However, if you have
used +1 MAGICAL DRIVES OF RECKONING, the dragging is much reduced!
Unfortunately +1 magical drives are only appropriate for ENTERPRISE
use while at home we use non-magic drives, but you get what you pay
for.''  That all sounds fair, reasonable, and like good fun gameplay.
Unfortunately ZFS isn't a video game: it just fucking freezes.

    bh> The difference is that a fast fail with ZFS relies on ZFS to
    bh> fix the problem rather than degrading the array.

OK, but the decision of ``degrading the array'' means ``not sending
commands to the slowly-failing drive any more''.

which is actually the correct decision, the wrong course being to
continue sending commands there and ``patiently waiting'' for them to
fail instead of re-issuing them to redundant drives, even when waiting
thousands of standard deviations outside the mean request time.  TLER
or not, a failing drive will poison the array by making reads
thousands of times slower.
I agree. This is the behavior all RAID-type devices should have, whether hardware RAID, Linux RAID, or ZFS. If a drive is slow to respond, stop sending it read commands as long as there is enough redundancy remaining to compute the data. ZFS should have no problem with this, even though I understand it needs to read across multiple devices to verify checksums. If you have some number of devices with N levels of redundancy, and the number of still-working devices is at least the minimum needed for data integrity, there is no reason for reads to slow down (other than the time to compute the data).

The main reasons I can think of to slow down reads, even when there is enough redundancy remaining to be fast, are:
1.) ease of implementation (one less design case)
2.) squeaky wheel policy (array is slow...figure out why and then fix it rather than limping along and failing completely later)

As for writes, that's more complex, as you have a device that is still halfway alive. Maybe writes just get cached longer until the slow drive gets them onto media. (But don't block the rest of the system in the meantime.)
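To make the read side of that concrete, here's a rough sketch of the dispatch decision I mean. It's only my illustration, not how ZFS or any actual RAID layer implements it, and the device names, latency numbers, and slow_factor threshold are all made up.

# Illustrative sketch only -- not ZFS or md code.  Models a redundancy
# group where any `min_needed` of the devices can reconstruct a stripe.
# A device whose recent latency is wildly above the group's norm gets
# skipped, unless skipping it would drop below the redundancy floor.

from dataclasses import dataclass

@dataclass
class Device:
    name: str
    recent_latency_ms: float   # ~7 for a healthy read, ~7000 for a dying drive
    healthy: bool = True

def pick_read_targets(devices, min_needed, slow_factor=50):
    """Return the devices to issue this stripe's reads to."""
    usable = [d for d in devices if d.healthy]
    typical = sorted(d.recent_latency_ms for d in usable)[len(usable) // 2]
    fast = [d for d in usable if d.recent_latency_ms < typical * slow_factor]
    if len(fast) >= min_needed:
        return fast[:min_needed]     # enough fast devices: leave the laggard alone
    return usable[:min_needed]       # no choice: redundancy floor forces us to wait

# Example: a raidz1-style group of 4 where any 3 devices suffice.
group = [Device("d0", 7), Device("d1", 7), Device("d2", 7000), Device("d3", 8)]
print([d.name for d in pick_read_targets(group, min_needed=3)])   # d0, d1, d3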


And ZFS or HW, fail or degrade, the problem is still fixed for the
upper layers.  You make it sound like ``degrading the array'' means
the upper layers got an error for the HW controller and got good data
for ZFS.  Not so.  If anything, the thread above ZFS gave up waiting
on read() for ``fixed'' data to come back and got killed by request
timeout, or the user pressed ^Z^Z^C^C^C^C^C^\^\^acpkill -9 vi

The filesystem should elegantly tolerate slow drives when they're part of a redundant array.

