On Sun, 3 Feb 2008, Igor Sobrado wrote:

>> I have had 3 e7k100's fail with read errors in the same general
>> area of the disk.
>
> How odd!  The only way to get symptoms like the one you describe
> is getting the surface temperature of the disk platters higher
> than the Curie point.  It should not happen.

No.  Drives of the same model are zoned in the same way.  It is not
uncommon for certain areas of the disk to be particularly prone to
errors for a particular model.  For example, as the head moves towards
the end of the disk (the inner radius), the sectors are spaced more
closely, and right before the density switch to the next zone, the
drive might be a bit marginal.
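
To make the zoning argument concrete, here is a rough Python sketch.  The
zone radii and sectors-per-track values are invented for illustration and
do not describe the e7k100 or any real drive; the function names are
hypothetical too.  It only shows that, within one zone, a fixed sector
count packed onto a shorter inner track gives a higher linear density,
which then drops back at the switch to the next zone.

```python
import math

# Hypothetical zone table: (outer_radius_mm, inner_radius_mm, sectors_per_track).
# These numbers are made up for illustration, not taken from any datasheet.
ZONES = [
    (30.0, 25.0, 900),
    (25.0, 20.0, 750),
    (20.0, 15.0, 600),
]


def linear_density(sectors_per_track, radius_mm, bytes_per_sector=512):
    """Return user bits per millimetre along a track of the given radius."""
    circumference_mm = 2 * math.pi * radius_mm
    return sectors_per_track * bytes_per_sector * 8 / circumference_mm


def density_profile(zones):
    """Return (density at outer edge, density at inner edge) for each zone."""
    return [
        (linear_density(spt, outer), linear_density(spt, inner))
        for outer, inner, spt in zones
    ]


if __name__ == "__main__":
    for i, (outer_d, inner_d) in enumerate(density_profile(ZONES)):
        print(f"zone {i}: outer {outer_d:.0f} bits/mm, inner {inner_d:.0f} bits/mm")
```

In this toy model the density climbs towards the inner edge of every zone
and falls again at each zone boundary, so the tracks just before a density
switch are the most tightly packed ones.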

> What about the new solid-state drives?
> Are solid-state disks a better alternative on Soekris
> computers than traditional drives?

Traditional flash drives (including CF) have had real problems with
write endurance and write speed.  A new breed of flash drives is
coming on the market; they use significant microprocessors and RAM
buffers (backed with supercaps and firehose dump to flash) to get over
the speed problem for bursty workloads, and they use wear leveling to
extend the write endurance.  Today, such drives are beginning to
heavily push into the laptop market.  Search for recent press releases
from Toshiba for an example.  Also, there are boutique vendors such as
STec who ship flash drives with ATA/SATA/SAS/FC interfaces that
have fabulously good performance specs (but aren't cheap).

We are at an inflection point, where a transition from disk to flash,
and then from flash to newer solid-state technologies collectively
termed "storage-class memory", is occurring.  Things are changing.  For
now, disks are the undisputed cost/capacity leaders; flash is pushing
in where good speed, good reliability and low power consumption are
important.  For real reliability, I would use CF for now, until the
newer wear-levelling flash drives become affordable.

Hint: I run a consumer grade 2.5" drive in my Soekris, and I have good
backups and a half dozen spare drives sitting in the drawer.  If it
fails, replacement and recovery would only take me a few hours.  I haven't 
had the patience to set up my 5501 for CF boot, but it would be a good 
idea (with the spinning disk being only for throwaway data).

> > This is good reading...
> >     http://research.google.com/archive/disk_failures.pdf
> > There are a few more independent studies done. Do not have the urls at
> > hand.

Several studies came out of NetApp recently; they are also excellent.

> On this report the authors write that "[the figure 4] shows that failures
> do not increase when the average temperature increases.  In fact, there
> is a clear trend showing that lower temperatures are associated with
> higher failure rates.  Only at very high temperatures is there a slight
> reversal of this trend."  For the authors, slight increase means only 1%
> at temperatures higher than 45 C.
> 
> The authors end saying that "We can conclude that at moderate temperature
> ranges it is likely that there are other effects which affect failure
> rates much more strongly than temperatures do."

All this is true, for the temperature ranges they studied.

> So, temperature is not a source of disk failures.

WRONG!  For the disks that were mounted in Google's data center, there
was no clear correlation between higher temperatures and failure
rates, but this does NOT extrapolate to the general case.  DO NOT USE
THE GOOGLE DATA AS AN EXCUSE TO RUN YOUR DISKS HOT, you'll be sorry.

The storage research community has analyzed the Google paper in much
detail, but most of those discussions are not public.  The Google data
is hard to compare with individual deskside computers, the typical
Soekris box, or regular data centers, because many of the Google disks
are already running much cooler than usual (few disks outside the
extremely well-cooled mega-datacenters that the likes of Google and
Livermore use run at temperatures as low as 20 to 25 degrees).  Part
of the Google data is actually explained by running the disks cooler;
the ideal temperature FOR THE GENERATION OF DISKS THAT DOMINATE THEIR
SAMPLE seems to be 30-40 degrees.  If you discuss this with the
experts from Seagate, Hitachi, Maxtor etc. you'll find that the
probable cause is spindle lubrication being a tad too viscous.
Furthermore, the Google data set is difficult to analyze, because it
contains many generations of disk drives (none or few are really
modern, or 2.5" disks), and the change from generation to generation
is highly correlated with changes in data center environment (Google's
data centers have become larger and better cooled), and the way disks
are mounted.

>  There are other sources
> of failure like the number of power cycles of the drive and power-on hours
> (in my humble opinion, POH is more important on 2.5" drives than on 3.5"
> ones.)

This is indeed true.  And don't forget vibration, which non-enterprise
disks (commonly ATA and SATA disks) really hate too.  But don't take
the Google data too literally.

If you are interested in this topic: FAST (the file system and storage
conference, where the Google paper was presented) is coming up in San
Jose at the end of February.  I'll be there.  The program shows that
NetApp and CMU will present a new paper on disk errors.  Expect
serious discussions of this topic in the hallways.

--
Ralph Becker-Szendy    [EMAIL PROTECTED]               (408)395-1435
735 Sunset Ridge Road; Los Gatos, CA 95033