On Fri, 13 Apr 2012, Ed W wrote:
On 13/04/2012 13:33, Stan Hoeppner wrote:
What I meant wasn't the drive throwing uncorrectable read errors, but
the drives returning different data that each thinks is correct, or
both may have sent the correct data but one of them got corrupted
on the fly. After reading the articles posted, maybe the correct term
would be the controller receiving silently corrupted data, say due to a
bad cable on one.
This simply can't happen.  What articles are you referring to?  If the
author is stating what you say above, he simply doesn't know what he's
talking about.
It quite clearly can??!

I totally agree with Ed here. Drives certainly can, and sometimes really do, return different data without reporting errors. Also, data can get corrupted on any of the buses or chips it passes through.

The math about 1 in 10^15 or 10^16 unrecoverable bit errors and all that is not only about array sizes. It also applies to the sheer amount of data being transferred.
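To put a rough number on that (back-of-the-envelope only, using the 1-in-10^15-bits figure drive spec sheets typically quote):

    # expected number of unrecoverable bit errors when reading 5 TB once
    echo '5 * 10^12 * 8 / 10^15' | bc -l        # -> 0.04
    # chance of hitting at least one if you do that, say, weekly for a year
    echo '1 - (1 - 0.04)^52' | bc -l            # -> ~0.88

The exact numbers are illustrative only; the point is that routine data movement alone racks up the exposure, independent of how big the array is.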

I've seen silent corruption on a few systems myself. (Luckily, only three times in a couple of years.) Those systems were only in the 2-5 TB size category, substantially smaller than the 67 TB claimed elsewhere. Yet, statistically, that's well within normal probability levels.

Linux mdraid only reads one mirror as long as the drives don't return an error. That's easy to check: read speeds are way beyond what a single drive can deliver. If the kernel had to read all mirrors (possibly more than two), compare them, and make a decision based on that comparison, things would be horribly slow. Hardware raid typically takes exactly the same approach. This goes for Areca, 3ware and LSI, which cover most of the regular (i.e. non-SAN) professional hardware raid setups.
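A quick way to see it for yourself (device names are just examples; assumes sysstat's iostat is installed):

    # watch the member disks while reading from the array
    iostat -dx 1 sda sdb md0 &
    dd if=/dev/md0 of=/dev/null bs=1M count=4096 iflag=direct
    # if every read were issued to both mirrors and compared, the combined
    # throughput of sda+sdb would be roughly double that of md0 -- it isn't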

If you don't believe it, don't take my word for it; test it for yourself. Cleanly power down a raid1 array, take the individual drives, put them into a simple desktop machine, and write different data to both using some raw disk writing tool like dd. Then put the drives back into the raid1 array, power it up, and re-read the information. You'll see that data from both drives gets intermixed, as parts of the reads come from one disk and parts from the other. Only when you order the raid array to do a verification pass will it start screaming and yelling. At least, I hope it will...
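For the record, the experiment looks roughly like this (DESTRUCTIVE, scratch disks only; device names are hypothetical):

    # with the array stopped and both members attached to another box:
    dd if=/dev/zero    of=/dev/sdX bs=1M seek=2048 count=64    # member 1
    dd if=/dev/urandom of=/dev/sdY bs=1M seek=2048 count=64    # member 2
    # reassemble and start the array, then read that area back a few times:
    dd if=/dev/md0 bs=1M skip=2048 count=64 iflag=direct | md5sum
    # the checksum may change between runs depending on which member
    # serviced the read (member offsets are shifted by the metadata data
    # offset, so the regions won't line up exactly -- this is only a
    # sketch); a 'check' pass, as shown further down, will definitely
    # flag the mismatches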


But as explained elsewhere, silent corruption can occur in numerous places. If you don't have an explicit checksumming/checking mechanism, there are indeed cases that will haunt you unless you do regular scrubbing, or at least regular verification runs. Heck, that's why Linux mdadm comes with cron jobs to do just that, and hardware raid controllers have similar scheduling capabilities.
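On Linux md that boils down to something like this (md0 is just an example):

    # trigger a scrub/verification pass
    echo check > /sys/block/md0/md/sync_action
    # follow its progress
    cat /proc/mdstat
    # when it's done, a non-zero count here means the mirrors disagree somewhere
    cat /sys/block/md0/md/mismatch_cnt
    # Debian-style installs ship /etc/cron.d/mdadm, which runs checkarray
    # for you once a month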

Of course, scrubbing/verification is not going to magically protect you from all problems. But you will at least get notified when it detects one.


If the controller compares the two sectors from the drives, it may be
able to tell us something is wrong, but there isn't any way for it to
know which one of the sectors was a good read and which wasn't, or is
there?
Yes it can, and it does.

No, it definitely does not!! At least not with Linux software raid, and I don't believe commodity hardware controllers do it either! (You would be able to tell, because the disk I/O would be doubled.)

Obviously there is no way to tell which version of a story is correct if you are not biased to believe one of the storytellers and distrust the other. You would have to add a checksum layer for that. (And hope the checksum isn't the part that got corrupted!)


To answer the questions
you're asking will require me to teach you the basics of hardware
signaling protocols, SCSI, SATA, Fiber Channel, and Ethernet
transmission error detection protocols, disk drive firmware error
recovery routines, etc, etc, etc.

I'm quite familiar with the basics of these protocols. I'm also quite familiar with the flaws in several implementations of "seemingly straightforward protocols". More often than not, there's a pressing need to get new devices onto the market before the competition has something similar and you lose your advantage. All too often, this results in suboptimal implementations of all those fine protocols and algorithms. And let's face it: flaws in error recovery routines often don't surface until someone actually needs those routines. As long as drives (or any other device) are functioning as expected, everything is all right. But as soon as something starts to get flaky, error recovery has to kick in, and it may just as well fail to do the right thing.

Just consider the real-world analogy of politicians. They do or say something stupid every once in a while, and error recovery (a.k.a. damage control) has to kick in. But even those well-trained professionals, with decades of experience in the political arena, sometimes simply fail to do the right thing. They may have overlooked some pesky detail, or they may take actions that don't have the expected outcome, because... indeed, things work differently in damage-control mode, and the only law you can trust is physics: you always go down when you can't stay on your feet.

With hard drives, raid controllers, mainboards, data buses, it's exactly the same. If _something_ isn't working as it should, how should we know which part of it we _can_ trust?


In closing, I'll simply say this:  If hardware, whether a mobo-down SATA
chip or a $100K SGI SAN RAID controller, allowed silent data corruption
or transmission errors to occur, there would be no storage industry, and
we'd all still be using pen and paper.  The questions you're asking were
solved by hardware and software engineers decades ago.  You're fretting
and asking about things that were solved decades ago.

Isn't it just "worked around" by adding more layers of checksumming and throwing more redundancy into the mix? Don't believe the "storage industry" just because they tell you it's OK. It simply is not OK. You might want to talk to people in the data and computing cluster business about their opinion of "storage industry professionals"...

Timo's suggestion to add checksums to mailboxes/metadata could help to (at least) report these types of failures. Re-reading from different storage, when available, could also recover the data that got corrupted, but I'm not sure what the best way to handle these situations would be. If you know there is a corruption problem on one of your storage locations, you might want to switch it to read-only ASAP. Automagically trying to recover might not be the best thing to do. Given all kinds of different use cases, I think that should at least be configurable :-P
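Until something like that exists in the software itself, you can approximate it crudely at the file level; a sketch only, assuming a quiescent Maildir under an example path:

    # build a checksum manifest of the mail store
    find /srv/mail -type f -print0 | xargs -0 sha256sum > /root/mail.sha256
    # verify it later; --quiet prints only the files that no longer match
    # (legitimate deliveries/expunges show up too, so only compare data
    # that should not have changed)
    sha256sum -c --quiet /root/mail.sha256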



--
Maarten
