Re: [GENERAL] SSDD reliability

2011-05-18 Thread Craig Ringer

On 05/19/2011 08:57 AM, Martin Gainty wrote:

what is this talk about replicating your primary database to secondary
nodes in the cloud...


slow.

You'd have to do async replication with unbounded slave lag.

It'd also be very easy to get to the point where the load on the master 
meant that the slave could never, ever catch up because there just 
wasn't enough bandwidth.
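
To put rough numbers on that "never catch up" point (the figures below are hypothetical, not anything measured in this thread), the backlog is just the WAL generation rate minus the link rate, accumulated over time; a minimal Python sketch:

# Hypothetical illustration: a master generating WAL faster than the WAN
# link can ship it falls behind without bound.

def backlog_gb(wal_mb_per_s, link_mb_per_s, hours):
    """Replication backlog after `hours`, in GB; <= 0 means the link keeps up."""
    return (wal_mb_per_s - link_mb_per_s) * hours * 3600 / 1024.0

if __name__ == "__main__":
    # e.g. 20 MB/s of WAL on the master vs. a 10 MB/s link to the cloud
    for h in (1, 8, 24):
        print("after %2d h: %7.1f GB behind" % (h, backlog_gb(20, 10, h)))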


--
Craig Ringer



Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-18 Thread Toby Corkindale

On 19/05/11 10:50, mark wrote:

Note 1:
I have seen an array that was powered on continuously for about six
years, which killed half the disks when it was finally powered down,
left to cool for a few hours, then started up again.




Recently we rebooted about 6 machines that had uptimes of 950+ days.
Last time fsck had run on the file systems was 2006.

When stuff gets that old and has been on-line and under heavy load all
that time, you actually get paranoid about reboots. In my newly
reaffirmed opinion, at that stage reboots are at best a crap shoot. We
lost several hours more to that gamble than we had budgeted for. HP is
getting more of their gear back this month than in a usual one.


I worked at one place, years ago, which had an odd policy: they had
automated hard resets hit all their servers every Friday night.

I thought they were mad at the time!

But it does mean that people design and test the systems so that they
can survive unattended resets reliably. (No one wants to get a support
call at 11pm on Friday because their server didn't come back up.)


It still seems a bit messed up though - even if Friday night is a
low-use period, it still means causing a small amount of disruption to
customers - especially if a developer or sysadmin messed up, and a
server *doesn't* come back up.




Re: [GENERAL] SSDD reliability

2011-05-18 Thread Martin Gainty

what is this talk about replicating your primary database to secondary nodes in 
the cloud.. or is cloud computing still marketing hype?

Martin 

Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-18 Thread mark
> Note 1:
> I have seen an array that was powered on continuously for about six
> years, which killed half the disks when it was finally powered down,
> left to cool for a few hours, then started up again.
> 


Recently we rebooted about 6 machines that had uptimes of 950+ days. 
Last time fsck had run on the file systems was 2006. 

When stuff gets that old and has been on-line and under heavy load all
that time, you actually get paranoid about reboots. In my newly
reaffirmed opinion, at that stage reboots are at best a crap shoot. We
lost several hours more to that gamble than we had budgeted for. HP is
getting more of their gear back this month than in a usual one.

Maybe that is just life with HP.


-M




Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-11 Thread Toby Corkindale
BTW, I saw a news article today about a brand of SSD that was claiming
to have the price effectiveness of MLC-type chips, but with a lifetime
of 4TB/day of writes over 5 years.


http://www.storagereview.com/anobit_unveils_genesis_mlc_enterprise_ssds

which also links to:
http://www.storagereview.com/sandforce_and_ibm_promote_virtues_mlcbased_ssds_enterprise

which is a similar tech - much improved erase-cycle-counts on MLC.

No doubt this'll be common in all SSDs in a year or so then!



Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-05 Thread Toby Corkindale

On 05/05/11 18:36, Florian Weimer wrote:

* Greg Smith:


Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for
mechanical drives are around 2% during their first year, spiking to 5%
afterwards.  I suspect that Intel's numbers are actually much better
than the other manufacturers' here, so an SSD from anyone else can
easily be less reliable than a regular hard drive still.


I'm a bit concerned with usage-dependent failures.  Presumably, two SSDs
in a RAID-1 configuration wear down in the same way, and it would
be rather inconvenient if they failed at the same point.  With hard
disks, this doesn't seem to happen; even bad batches fail pretty much
randomly.


Actually I think it'll be the same as with hard disks,
i.e. a batch of drives with sequential serial numbers will have a fairly
similar average lifetime, but they won't pop their clogs all on the same
day. (Unless there is an outside influence - see Note 1.)


The wearing-out of SSDs is not as exact as people seem to think. If the
drive is rated for 10,000 erase cycles, that is meant to be a MINIMUM:
most blocks will survive more cycles than that, and a small number may
die sooner. I guess it's a probability curve, engineered such that 95%
or some other high percentage of blocks will outlast that count. (And
SSDs have reserved blocks which are brought in to take over from failing
blocks, invisibly to the end user, since the drive can always still read
from a block that is failing to erase.)
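
As a rough illustration of where that curve sits (all inputs below are assumptions for illustration, not the specs of any drive discussed here), a back-of-the-envelope endurance estimate in Python:

capacity_gb = 160          # assumed drive size
rated_cycles = 10000       # the MINIMUM erase-cycle rating discussed above
write_amplification = 2.0  # assumed controller overhead
host_writes_gb_day = 50    # assumed workload

total_host_writes_tb = capacity_gb * rated_cycles / write_amplification / 1024.0
years = (total_host_writes_tb * 1024.0 / host_writes_gb_day) / 365.0
print("~%.0f TB of host writes, roughly %.0f years at %d GB/day"
      % (total_host_writes_tb, years, host_writes_gb_day))
# Wear levelling spreads those erases over all blocks and the spares absorb
# the early failures, so the real lifetime is a distribution around a figure
# like this rather than a hard cliff.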


Note 1:
I have seen an array that was powered on continuously for about six 
years, which killed half the disks when it was finally powered down, 
left to cool for a few hours, then started up again.




Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-05 Thread Greg Smith

On 05/04/2011 08:31 PM, David Boreham wrote:
Here's my best theory at present: the failures ARE caused by cell 
wear-out, but the SSD firmware is buggy in so far as it fails to boot 
up and respond to host commands due to the wear-out state. So rather 
than the expected outcome (SSD responds but has read-only behavior), 
it appears to be (and is) dead. At least to my mind, this is a more 
plausible explanation for the reported failures vs. the alternative 
(SSD vendors are uniquely clueless at making basic electronics 
subassemblies), especially considering the difficulty in testing the 
firmware under all possible wear-out conditions.


One question worth asking is: in the cases you were involved in, was 
manufacturer failure analysis performed (and if so what was the 
failure cause reported?).


Unfortunately not.  Many of the people I deal with, particularly the 
ones with budgets to be early SSD adopters, are not the sort to return 
things that have failed to the vendor.  In some of these shops, if the 
data can't be securely erased first, it doesn't leave the place.  The 
idea that some trivial fix at the hardware level might bring the drive 
back to life, data intact, is terrifying to many businesses when drives 
fail hard.


Your bigger point, that this could just as easily be software failures due 
to unexpected corner cases rather than hardware issues, is both a fair 
one to raise and even more scary.


Intel claims their Annual Failure Rate (AFR) on their SSDs in IT 
deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for 
mechanical drives are around 2% during their first year, spiking to 5% 
afterwards.  I suspect that Intel's numbers are actually much better 
than the other manufacturers' here, so an SSD from anyone else can 
easily be less reliable than a regular hard drive still.


Hmm, this is speculation I don't support (that non-Intel vendors have a
10x worse early failure rate). The entire industry uses very similar
processes (often the same factories). One rogue vendor with a bad
process... sure, but all of them??




I was postulating that you only have to be 4X as bad as Intel to reach 
2.4%, and then be worse than a mechanical drive for early failures.  If 
you look at http://labs.google.com/papers/disk_failures.pdf you can see 
there's a 5:1 ratio in first-year AFR just between light and heavy usage 
on the drive.  So a 4:1 ratio between best and worst manufacturer for 
SSD seemed possible.  Plenty of us have seen particular drive models 
that were much more than 4X as bad as average ones among regular hard 
drives.
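
The arithmetic behind that, using the figures quoted in this thread (the 4x multiplier is the hypothetical part):

intel_ssd_afr = 0.006   # Intel's claimed 0.6% AFR
hdd_year1_afr = 0.02    # ~2% first-year AFR for mechanical drives

hypothetical_vendor_afr = 4 * intel_ssd_afr
print("4x Intel's AFR = %.1f%%, vs. %.0f%% for a first-year mechanical drive"
      % (hypothetical_vendor_afr * 100, hdd_year1_afr * 100))
# 2.4% > 2%: a vendor merely four times worse than Intel is already less
# reliable than a typical mechanical drive in its first year, and the ~5:1
# spread by duty cycle in the Google study makes a 4:1 vendor spread plausible.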


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [GENERAL] SSDD reliability

2011-05-05 Thread Scott Marlowe
On Thu, May 5, 2011 at 1:54 PM, Greg Smith  wrote:

> I think your faith in PC component manufacturing is out of touch with the
> actual field failure rates for this stuff, which is produced with enormous
> cost-cutting pressure driving tolerances to the bleeding edge in many cases.
> The equipment of the 80's and 90's you were referring to ran slower and was
> more expensive, so better-quality components could be justified.  The trend
> at the board and component level has for a long time been toward cheap over
> good in almost every case.

Modern CAD tools make this more and more of an issue.  You can be in
a circuit design program, right-click on a component and pick from a
dozen other components with lower tolerances, and get a SPICE
simulation that says initial production-line failure rates will go
from 0.01% to 0.02%.  Multiply that by 100 components and it seems
like a small change.  But all it takes is one misstep and you've got a
board with a theoretical production-line failure rate of 0.05 that's
really 0.08; the first-year failure rate goes from 0.5% to 2 or 3%,
and the $2.00 you saved on all components on the board, times 1M units,
goes right out the window.
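
A quick sketch of that compounding, using the per-component rates from the paragraph above:

def board_failure_rate(per_component_rate, n_components=100):
    # Probability that at least one of n independent components is bad.
    return 1 - (1 - per_component_rate) ** n_components

for rate in (0.0001, 0.0002):   # 0.01% vs. 0.02% per component
    print("%.2f%% per part -> %.2f%% per board"
          % (rate * 100, board_failure_rate(rate) * 100))
# 0.01% -> ~1.00% and 0.02% -> ~1.98% of boards bad: each "negligible"
# per-part change roughly doubles the boards lost, before any field failures.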

BTW, things that fail due to weird and unforeseen circumstances were
often referred to as P.O.M.-dependent (phase of the moon), because
they'd often cluster around certain operating conditions that were not
obvious until you collected and collated a large enough data set.  Like
hard drives that have abnormally high failure rates at altitudes above
4500 ft, etc.  They seem fine until you order 1,000 for your Denver data
center and they all start failing.  It could be anything like that:
SSDs that operate fine until they're in an environment with constant
relative humidity below 15%, and boom, they start failing like mad.
It's impossible to test for all conditions in the field, and it's quite
possible that environmental factors affect some of these SSDs we've
heard about.  More research is necessary to say why someone would see
such clustering, though.



Re: [GENERAL] SSDD reliability

2011-05-05 Thread Greg Smith

On 05/05/2011 10:35 AM, David Boreham wrote:

On 5/5/2011 8:04 AM, Scott Ribe wrote:


Actually, any of us who really tried could probably come up with a 
dozen examples--more if we've been around for a while. Original 
design cutting corners on power regulation; final manufacturers 
cutting corners on specs; component manufacturers cutting corners on 
specs or selling outright counterfeit parts...


These are excellent examples of failure causes for electronics, but
they are not counter-examples. They're unrelated to the discussion
about SSD early lifetime hard failures.


That's really optimistic.  For all we know, these problems are the
latest incarnation of something like the bulging-capacitor plague of
about five years ago: some part unique to the SSDs, other than the
flash cells, that there's a giant bad batch of.


I think your faith in PC component manufacturing is out of touch with 
the actual field failure rates for this stuff, which is produced with 
enormous cost-cutting pressure driving tolerances to the bleeding edge 
in many cases.  The equipment of the 80's and 90's you were referring to 
ran slower and was more expensive, so better-quality components could be 
justified.  The trend at the board and component level has for a long 
time been toward cheap over good in almost every case.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Re: [GENERAL] SSDD reliability

2011-05-05 Thread David Boreham

On 5/5/2011 8:04 AM, Scott Ribe wrote:


Actually, any of us who really tried could probably come up with a dozen 
examples--more if we've been around for a while. Original design cutting 
corners on power regulation; final manufacturers cutting corners on specs; 
component manufacturers cutting corners on specs or selling outright 
counterfeit parts...


These are excellent examples of failure causes for electronics, but they are
not counter-examples. They're unrelated to the discussion about SSD
early lifetime hard failures.





Re: [GENERAL] SSDD reliability

2011-05-05 Thread Scott Ribe
On May 4, 2011, at 9:34 PM, David Boreham wrote:

> So ok, yeah...I said that chips don't just keel over and die mid-life
> and you came up with the one counterexample in the history of
> the industry

Actually, any of us who really tried could probably come up with a dozen 
examples--more if we've been around for a while. Original design cutting 
corners on power regulation; final manufacturers cutting corners on specs; 
component manufacturers cutting corners on specs or selling outright 
counterfeit parts...

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice







Re: [GENERAL] SSDD reliability

2011-05-05 Thread David Boreham

On 5/4/2011 11:50 PM, Toby Corkindale wrote:


In what way has the SMART read failed?
(I get the relevant values out successfully myself, and have Munin
graph them.)

Mis-parse :) It was my _attempts_ to read SMART that failed.
Specifically, I was able to read a table of numbers from the drive, but
none of the numbers looked particularly useful or likely to be a "time
to live" number. Similar to traditional drives, where you get a table
of numbers that are either zero or random, that you look at saying
"Huh?", all of which are flagged as "failing". Perhaps I'm using the
wrong SMART-grokking tools?
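
For what it's worth, a minimal sketch of the kind of parsing involved, assuming smartmontools is installed; the attribute names below are examples and vary by vendor (which is part of the problem), though Intel drives of this era usually expose ID 233 "Media_Wearout_Indicator" as a normalized value counting down from 100:

import subprocess

# Attribute names differ between vendors; these are examples, not a complete list.
WEAR_ATTRS = ("Media_Wearout_Indicator", "Wear_Leveling_Count", "SSD_Life_Left")

def wear_remaining(device="/dev/sda"):
    """Return the normalized value of a recognised wear attribute, or None."""
    out = subprocess.check_output(["smartctl", "-A", device]).decode()
    for line in out.splitlines():
        cols = line.split()
        # attribute rows look like: ID# NAME FLAG VALUE WORST THRESH TYPE ...
        if len(cols) >= 10 and cols[1] in WEAR_ATTRS:
            return int(cols[3])   # normalized VALUE, roughly "% life left"
    return None

if __name__ == "__main__":
    left = wear_remaining()
    print("no recognisable wear attribute found" if left is None
          else "wear indicator (normalized): %d" % left)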





I do have to wonder if this Portman Wills guy was somehow Doing It 
Wrong to get a 100% failure rate over eight disks..



There are people out there who are especially highly charged.
So if he didn't wear out the drives, the next most likely cause I'd 
suspect is that he ESD zapped them.






Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-05 Thread David Boreham

On 5/5/2011 2:36 AM, Florian Weimer wrote:


I'm a bit concerned with usage-dependent failures.  Presumably, two SSDs
in a RAID-1 configuration wear down in the same way, and it would
be rather inconvenient if they failed at the same point.  With hard
disks, this doesn't seem to happen; even bad batches fail pretty much
randomly.

FWIW this _can_ happen with traditional drives: we had a bunch of WD 300G
VelociRaptor drives that had a firmware bug related to a 32-bit counter
roll-over. This happened at exactly the same time for all drives in a
machine (because the counter counted time since power-up). Needless to
say, this was quite frustrating!





Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-05 Thread Florian Weimer
* Greg Smith:

> Intel claims their Annual Failure Rate (AFR) on their SSDs in IT
> deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for
> mechanical drives are around 2% during their first year, spiking to 5%
> afterwards.  I suspect that Intel's numbers are actually much better
> than the other manufacturers' here, so an SSD from anyone else can
> easily be less reliable than a regular hard drive still.

I'm a bit concerned with usage-dependent failures.  Presumably, two SSDs
in a RAID-1 configuration wear down in the same way, and it would
be rather inconvenient if they failed at the same point.  With hard
disks, this doesn't seem to happen; even bad batches fail pretty much
randomly.

-- 
Florian Weimer
BFK edv-consulting GmbH   http://www.bfk.de/
Kriegsstraße 100  tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99



Re: [GENERAL] SSDD reliability

2011-05-04 Thread Toby Corkindale

On 05/05/11 03:31, David Boreham wrote:

On 5/4/2011 11:15 AM, Scott Ribe wrote:


Sigh... Step 2: paste link in ;-)




To be honest, like the article author, I'd be happy with 300+ days to
failure, IF the drives provide an accurate predictor of impending doom.
That is, if I can be notified "this drive will probably quit working in
30 days", then I'd arrange to cycle in a new drive.
The performance benefits vs rotating drives are for me worth this hassle.

OTOH if the drive says it is just fine and happy, then suddenly quits
working, that's bad.

Given the physical characteristics of the cell wear-out mechanism, I
think it should be possible to provide a reasonably accurate remaining
lifetime estimate, but so far my attempts to read this information via
SMART have failed, for the drives we have in use here.


In what way has the SMART read failed?
(I get the relevant values out successfully myself, and have Munin graph 
them.)



FWIW I have a server with 481 days uptime, and 31 months operating that
has an el-cheapo SSD for its boot/OS drive.


Likewise, I have a server with a first-gen SSD (Kingston 60GB) that has 
been running constantly for over a year, without any hiccups. It runs a 
few small websites and a few email lists, all of which interact with 
PostgreSQL databases. Lifetime writes to the disk are close to 
three-quarters of a terabyte, and despite its lack of TRIM support, the 
performance is still pretty good.
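
Quick arithmetic on those figures (assuming roughly a year of uptime, since the post only says "over a year"):

drive_gb, written_gb, days = 60.0, 750.0, 365
print("%.1f full-drive writes, about %.1f GB/day on average"
      % (written_gb / drive_gb, written_gb / days))
# ~12.5 complete overwrites and ~2 GB/day: even at a few thousand rated
# erase cycles per block, that workload barely dents the endurance budget,
# which fits with the drive still behaving well.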


I'm pretty happy!

I note that the comments on that blog post above include:

"I have shipped literally hundreds of Intel G1 and G2 SSDs to my 
customers and never had a single in the field failure (save for one 
drive in a laptop where the drive itself functioned fine but one of the 
contacts on the SATA connector was actually flaky, probably from 
vibrational damage from a lot of airplane flights, and one DOA drive). I 
think you just got unlucky there."


I do have to wonder if this Portman Wills guy was somehow Doing It Wrong 
to get a 100% failure rate over eight disks..




Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-04 Thread Scott Marlowe
On Wed, May 4, 2011 at 9:34 PM, David Boreham  wrote:
> On 5/4/2011 9:06 PM, Scott Marlowe wrote:
>>
>> Most of it is.  But certain parts are fairly new, i.e. the
>> controllers.  It is quite possible that all these various failing
>> drives share some long-term (~1 year) degradation issue like the 3Gb/s
>> SATA ports on the early Sandy Bridge Intel chipsets.  If that's the case,
>> then drives just up and dying makes some sense.
>
> That Intel SATA port circuit issue was an extraordinarily rare screwup.
>
> So ok, yeah...I said that chips don't just keel over and die mid-life
> and you came up with the one counterexample in the history of
> the industry :)  When I worked in the business in the 80's and 90's
> we had a few things like this happen, but they're very rare and
> typically don't escape into the wild (as Intel's pretty much didn't).
> If a similar problem affected SSDs, they would have been recalled
> and lawsuits would be underway.

Not necessarily.  If there's a chip that has a 15% failure rate
instead of the predicted <1%, it might not fail often enough for people
to have noticed, since a user with a typically small sample might think
he just got a bit unlucky, etc.  Nvidia made GPUs that overheated and
died by the thousand, but took 1 to 2 years to die.  There WAS a
lawsuit, and now, to settle it, they're offering to buy everybody who
got stuck with the broken GPUs a nice single-core $279 Compaq
computer, even if they bought a $4,000 workstation with one of those
dodgy GPUs.

There are a lot of possibilities as to why some folks are seeing high
failure rates; it'd be nice to know the cause.  But we can't assume
it's not an inherent problem with some part in them any more than we
can assume that it is.
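
To illustrate the small-sample point (the failure rates here are the hypothetical ones from above):

def p_at_least_one(afr, n_drives):
    # Chance of seeing one or more failures in the first year.
    return 1 - (1 - afr) ** n_drives

for afr in (0.01, 0.15):
    for n in (2, 8, 50):
        print("AFR %3.0f%%, %2d drives -> P(>=1 failure) = %3.0f%%"
              % (afr * 100, n, p_at_least_one(afr, n) * 100))
# A single failure among a handful of drives is consistent with both a 1%
# part and a 15% part, so an individual buyer can't tell bad luck from a bad
# design; only a vendor's aggregate return data can.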



Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-04 Thread David Boreham

On 5/4/2011 9:06 PM, Scott Marlowe wrote:

Most of it is.  But certain parts are fairly new, i.e. the
controllers.  It is quite possible that all these various failing
drives share some long-term (~1 year) degradation issue like the 3Gb/s
SATA ports on the early Sandy Bridge Intel chipsets.  If that's the case,
then drives just up and dying makes some sense.


That Intel SATA port circuit issue was an extraordinarily rare screwup.

So ok, yeah...I said that chips don't just keel over and die mid-life
and you came up with the one counterexample in the history of
the industry :)  When I worked in the business in the 80's and 90's
we had a few things like this happen, but they're very rare and
typically don't escape into the wild (as Intel's pretty much didn't).
If a similar problem affected SSDs, they would have been recalled
and lawsuits would be underway.

SSDs are just not that different from anything else.
No special voodoo technology (besides the Flash devices themselves).





Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-04 Thread Scott Marlowe
On Wed, May 4, 2011 at 6:31 PM, David Boreham  wrote:
>
> this). The technology and manufacturing processes are common across many
> different types of product. They either all work, or they all fail.

Most of it is.  But certain parts are fairly new, i.e. the
controllers.  It is quite possible that all these various failing
drives share some long-term (~1 year) degradation issue like the 3Gb/s
SATA ports on the early Sandy Bridge Intel chipsets.  If that's the case,
then drives just up and dying makes some sense.



Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-04 Thread David Boreham

On 5/4/2011 6:02 PM, Greg Smith wrote:

On 05/04/2011 03:24 PM, David Boreham wrote:
So if someone says that SSDs have "failed", I'll assume that they
suffered from Flash cell wear-out unless there is compelling proof to
the contrary.


I've been involved in four recovery situations similar to the one 
described in that Coding Horror article, and zero of them were flash 
wear-out issues.  The telling sign is that the device should fail to 
read-only mode if it wears out.  That's not what I've seen happen 
though; what reports from the field are saying is that sudden, 
complete failures are the more likely event.


Sorry to harp on this (last time I promise), but I somewhat do know what 
I'm talking about, and I'm quite motivated to get to the bottom of this 
"SSDs fail, but not for the reason you'd suspect" syndrome (because we 
want to deploy SSDs in production soon).


Here's my best theory at present: the failures ARE caused by cell 
wear-out, but the SSD firmware is buggy in so far as it fails to boot up 
and respond to host commands due to the wear-out state. So rather than 
the expected outcome (SSD responds but has read-only behavior), it 
appears to be (and is) dead. At least to my mind, this is a more 
plausible explanation for the reported failures vs. the alternative (SSD 
vendors are uniquely clueless at making basic electronics 
subassemblies), especially considering the difficulty in testing the 
firmware under all possible wear-out conditions.


One question worth asking is: in the cases you were involved in, was 
manufacturer failure analysis performed (and if so what was the failure 
cause reported?).


The environment inside a PC of any sort, desktop or particularly 
portable, is not a predictable environment.  Just because the drives 
should be less prone to heat and vibration issues doesn't mean 
individual components can't slide out of spec because of them.  And 
hard drive manufacturers have a giant head start at working out 
reliability bugs in that area.  You can't design that sort of issue 
out of a new product in advance; all you can do is analyze returns 
from the field, see what you screwed up, and do another design rev to 
address it.

That's not really how it works (I've been the guy responsible for this 
for 10 years in a prior career, so I feel somewhat qualified to argue 
about this). The technology and manufacturing processes are common 
across many different types of product. They either all work, or they 
all fail. In fact, I'll eat my keyboard if SSDs are not manufactured on 
the exact same production lines as regular disk drives, DRAM modules, 
and so on (manufacturing tends to be contracted to high-volume factories 
that make all kinds of things on the same lines). The only thing 
different about SSDs vs. any other electronics you'd come across is the 
Flash devices themselves. However, those are used in extraordinarily 
high volumes all over the place, and if there were a failure mode with 
the incidence suggested by these stories, I suspect we'd be reading 
about it on the front page of the WSJ.




Intel claims their Annual Failure Rate (AFR) on their SSDs in IT 
deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for 
mechanical drives are around 2% during their first year, spiking to 5% 
afterwards.  I suspect that Intel's numbers are actually much better 
than the other manufacturers' here, so an SSD from anyone else can 
easily be less reliable than a regular hard drive still.


Hmm, this is speculation I don't support (that non-Intel vendors have a 
10x worse early failure rate). The entire industry uses very similar 
processes (often the same factories). One rogue vendor with a bad 
process... sure, but all of them??


For the benefit of anyone reading this who may have a failed SSD: all 
the tier-1 manufacturers have departments dedicated to the analysis of 
product that fails in the field. With some persistence, you can usually 
get them to take a failed unit and put it through the FA process (and 
tell you why it failed). For example, here's a job posting for someone 
who would do this work:

http://www.internmatch.com/internships/4620/intel/ssd-failure-analysis-intern-592345

I'd encourage you to at least try to get your failed devices into the 
failure-analysis pile. If units are not returned, the manufacturer never 
finds out what broke, and therefore can't fix the problem.









Re: Fwd: Re: [GENERAL] SSDD reliability

2011-05-04 Thread Greg Smith

On 05/04/2011 03:24 PM, David Boreham wrote:
So if someone says that SSDs have "failed", I'll assume that they
suffered from Flash cell wear-out unless there is compelling proof to
the contrary.


I've been involved in four recovery situations similar to the one 
described in that Coding Horror article, and zero of them were flash 
wear-out issues.  The telling sign is that the device should fail to 
read-only mode if it wears out.  That's not what I've seen happen 
though; what reports from the field are saying is that sudden, complete 
failures are the more likely event.


The environment inside a PC of any sort, desktop or particularly 
portable, is not a predictable environment.  Just because the drives 
should be less prone to heat and vibration issues doesn't mean 
individual components can't slide out of spec because of them.  And hard 
drive manufacturers have a giant head start at working out reliability 
bugs in that area.  You can't design that sort of issue out of a new 
product in advance; all you can do is analyze returns from the field, 
see what you screwed up, and do another design rev to address it.


The idea that these new devices, which are extremely complicated and 
based on hardware that hasn't been manufactured in volume before, should 
be expected to have high reliability is an odd claim.  I assume that any 
new electronics gadget has an extremely high failure rate during its 
first few years of volume production, particularly from a new 
manufacturer of that product.


Intel claims their Annual Failure Rate (AFR) on their SSDs in IT 
deployments (not OEM ones) is 0.6%.  Typical measured AFR rates for 
mechanical drives are around 2% during their first year, spiking to 5% 
afterwards.  I suspect that Intel's numbers are actually much better 
than the other manufacturers' here, so an SSD from anyone else can easily 
be less reliable than a regular hard drive still.


--
Greg Smith   2ndQuadrant US   g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books




Fwd: Re: [GENERAL] SSDD reliability

2011-05-04 Thread David Boreham



No problem with that, for a first step. ***BUT*** the failures in this article 
and
many others I've read about are not in high-write db workloads, so they're not 
write wear,
they're just crappy electronics failing.


As a (lapsed) electronics design engineer, I'm suspicious of the notion that
a subassembly consisting of solid state devices surface-mounted on FR4 
substrate will fail
except in very rare (and of great interest to the manufacturer) circumstances.
And especially suspicious that one product category (SSD) happens to have a much
higher failure rate than all others.

Consider that an SSD is much simpler (just considering the electronics) than a 
traditional
disk drive, and subject to less vibration and heat.
Therefore one should see disk drives failing at the same (or a higher) rate.
Even if the owner is highly statically charged, you'd expect them to destroy
all categories of electronics at roughly the same rate (rather than just SSDs).

So if someone says that SSDs have "failed", I'll assume that they suffered from 
Flash cell
wear-out unless there is compelling proof to the contrary.







Re: [GENERAL] SSDD reliability

2011-05-04 Thread Scott Ribe
On May 4, 2011, at 11:31 AM, David Boreham wrote:

> To be honest, like the article author, I'd be happy with 300+ days to 
> failure, IF the drives provide an accurate predictor of impending doom.

No problem with that, for a first step. ***BUT*** the failures in this article 
and many others I've read about are not in high-write db workloads, so they're 
not write wear, they're just crappy electronics failing.

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice







Re: [GENERAL] SSDD reliability

2011-05-04 Thread David Boreham

On 5/4/2011 11:15 AM, Scott Ribe wrote:


Sigh... Step 2: paste link in ;-)




To be honest, like the article author, I'd be happy with 300+ days to 
failure, IF the drives provide an accurate predictor of impending doom.
That is, if I can be notified "this drive will probably quit working in 
30 days", then I'd arrange to cycle in a new drive.

The performance benefits vs rotating drives are for me worth this hassle.

OTOH if the drive says it is just fine and happy, then suddenly quits 
working, that's bad.


Given the physical characteristics of the cell wear-out mechanism, I 
think it should be possible to provide a reasonably accurate remaining 
lifetime estimate, but so far my attempts to read this information via 
SMART have failed, for the drives we have in use here.


FWIW I have a server with 481 days uptime, and 31 months operating  that 
has an el-cheapo SSD for its boot/OS drive.






Re: [GENERAL] SSDD reliability

2011-05-04 Thread Scott Ribe
On May 4, 2011, at 10:50 AM, Greg Smith wrote:

> Your link didn't show up on this.

Sigh... Step 2: paste link in ;-)




-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice







[GENERAL] SSDD reliability

2011-05-04 Thread Scott Ribe
Yeah, on that subject, anybody else see this:

<>

Absolutely pathetic.

-- 
Scott Ribe
scott_r...@elevated-dev.com
http://www.elevated-dev.com/
(303) 722-0567 voice




