Re: [OpenIndiana-discuss] Recommendations for fast storage

2013-04-20 Thread Richard Elling
comment below…

On Apr 18, 2013, at 5:17 AM, Edward Ned Harvey (openindiana) 
 wrote:

>> From: Timothy Coalson [mailto:tsc...@mst.edu]
>> 
>> Did you also compare the probability of bit errors causing data loss
>> without a complete pool failure?  2-way mirrors, when one device
>> completely
>> dies, have no redundancy on that data, and the copy that remains must be
>> perfect or some data will be lost.  
> 
> I had to think about this comment for a little while to understand what you 
> were saying, but I think I got it.  I'm going to rephrase your question:
> 
> If one device in a 2-way mirror becomes unavailable, then the remaining 
> device has no redundancy.  So if a bit error is encountered on the (now 
> non-redundant) device, then it's an uncorrectable error.  Question is, did I 
> calculate that probability?
> 
> Answer is, I think so.  Modelling the probability of drive failure (either 
> complete failure or data loss) is very complex and non-linear.  Also 
> dependent on the specific model of drive in question, and the graphs are 
> typically not available.  So what I did was to start with some MTBDL graphs 
> that I assumed to be typical, and then assume every data-loss event meant 
> complete drive failure.  Already I'm simplifying the model beyond reality, 
> but the simplification focuses on worst case, and treats every bit error as 
> complete drive failure.  This is why I say "I think so," to answer your 
> question.  
> 
> Then, I didn't want to embark on a mathematician's journey of derivatives and 
> integrals over some non-linear failure rate graphs, so I linearized...  I 
> forget now (it was like 4-6 years ago) but I would have likely seen that 
> drives were unlikely to fail in the first 2 years, and about 50% likely to 
> fail after 3 years, and nearly certain to fail after 5 years, so I would have 
> likely modeled that as a linearly increasing probability of failure up 
> to 4 years, where a 100% failure rate is assumed at 4 years.
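
Expressed as a rough sketch, the linearization above amounts to treating the
drive as having a roughly constant failure rate over its service life. All the
numbers below (the 4-year wear-out horizon, the resilver windows, the vdev
widths) are illustrative assumptions, not the figures from the original analysis:

from math import comb

HOURS_PER_YEAR = 24 * 365.0
WEAR_OUT_YEARS = 4.0   # assume failure is certain by year 4 (the linear ramp)

def p_drive_fails(window_hours):
    """Probability that a single drive fails during a window of given length,
    using the linearized rate of 1/WEAR_OUT_YEARS per drive-year."""
    rate_per_hour = 1.0 / (WEAR_OUT_YEARS * HOURS_PER_YEAR)
    return min(1.0, rate_per_hour * window_hours)

def p_at_least(k, n, p):
    """Binomial tail: probability that at least k of n drives fail,
    each independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 2-way mirror: losing the one surviving disk during a 12-hour resilver is fatal.
p_mirror = p_at_least(1, 1, p_drive_fails(12))

# 8-disk raidz2: losing two of the 7 surviving disks during a week-long
# resilver is fatal (bit errors are a separate question).
p_raidz2 = p_at_least(2, 7, p_drive_fails(7 * 24))

print(f"mirror, 12-hour resilver window: {p_mirror:.2e}")
print(f"raidz2, 1-week resilver window : {p_raidz2:.2e}")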

This technique shows a good appreciation of the expected lifetime of components.
Some of the more sophisticated models use a Weibull distribution, and this works
particularly well for computing devices. The problem for designers is that the
Weibull model parameters are not publicly published by the vendors. You need some
time in the field to collect them, so this is impractical for systems designers.
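
For what it's worth, the Weibull reliability function itself is trivial to
evaluate once the parameters are known; the shape and scale values below are
placeholders only, since, as noted, the real ones have to be fitted from field
data:

from math import exp

def weibull_reliability(t_years, shape, scale_years):
    """R(t) = exp(-(t/scale)**shape): probability a device survives to age t.
    shape < 1 models infant mortality, shape > 1 models wear-out."""
    return exp(-((t_years / scale_years) ** shape))

# Placeholder parameters, for illustration only.
for age in (1, 2, 3, 4, 5):
    print(age, "yr:", round(weibull_reliability(age, shape=1.5, scale_years=4.0), 3))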

At the end of the day, we have two practical choices:
1. Prepare for planned obsolescence and replacement of devices when the
   expected lifetime metric is reached. The best proxy for HDD expected
   lifetime is the warranty period, and you'll often notice that enterprise
   drives have a better spec than consumer drives -- you tend to get what
   you pay for.

2. Measure your environment very carefully and take proactive action when
   the system begins to display signs of age-related wear-out. This is a
   good idea in all cases, but the techniques are not widely adopted… yet.

> Yes, this modeling introduces inaccuracy, but that inaccuracy is in the 
> noise.  Maybe in the first 2 years, I'm 25% off in my estimates to the 
> positive, and after 4 years I'm 25% off in the negative, or something like 
> that.  But when the results show 10^-17 probability for one configuration and 
> 10^-19 probability for a different configuration, then the 25% error is 
> irrelevant.  It's easy to see which configuration is more probable to fail, 
> and it's also easy to see they're both well within acceptable limits for most 
> purposes (especially if you have good backups.)
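
A quick bit of arithmetic makes the point about noise concrete; the 10^-17 and
10^-19 figures are the ones from the paragraph above, with the 25% error applied
in the least favorable direction:

p_config_a = 1e-17           # estimated data-loss probability, configuration A
p_config_b = 1e-19           # estimated data-loss probability, configuration B

worst_a = p_config_a * 0.75  # skew both estimates so the gap narrows
worst_b = p_config_b * 1.25
print(worst_b / worst_a)     # ~0.017: configuration B still wins by roughly 60x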

For reliability measurements, this is not a bad track record. There are lots of
other environmental and historical factors that impact real life. As an analogy,
for humans, early death tends to be dominated by accidents rather than chronic
health conditions. For example, children tend to die in automobile accidents,
while octogenarians tend to die from heart attacks, organ failure, or cancer --
different failure modes as a function of age.
 -- richard

>> Also, as for time to resilver, I'm guessing that depends largely on where
>> bottlenecks are (it has to read effectively all of the remaining disks in
>> the vdev either way, but can do so in parallel, so ideally it could be the
>> same speed), 
> 
> No.  The big factor for resilver time is (a) the number of operations that 
> need to be performed, and (b) the number of operations per second.
> 
> If you have one big vdev making up a pool, then the number of operations to 
> be performed is equal to the number of objects in the pool.  The number of 
> operations per second is limited by the worst case random seek time for any 
> device in the pool.  If you have an all-SSD pool, then it's equal to the 
> performance of a single disk.  If you have an all-HDD pool, then with an 
> increasing number 
> of devices in your vdev, you approach 50% of the IOPS of a single device.
> 
> If
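
A back-of-the-envelope version of the resilver arithmetic described above; the
block count and IOPS figures are assumptions for illustration only:

def resilver_hours(num_blocks, per_device_iops, vdev_efficiency=0.5):
    """Estimate resilver time as (operations to perform) / (operations per
    second).  A wide HDD vdev approaches ~50% of a single device's IOPS;
    pass vdev_efficiency=1.0 for a mirror or an all-SSD vdev."""
    effective_iops = per_device_iops * vdev_efficiency
    return num_blocks / effective_iops / 3600.0

# e.g. 50 million blocks to walk at ~100 random IOPS per HDD in a wide raidz,
# versus 6 million blocks behind a single-disk-limited mirror resilver.
print(round(resilver_hours(50_000_000, 100), 1), "hours")        # ~277.8
print(round(resilver_hours(6_000_000, 100, 1.0), 1), "hours")    # ~16.7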

Re: [OpenIndiana-discuss] vdev reliability was: Recommendations for fast storage

2013-04-20 Thread Richard Elling
Terminology warning below…

On Apr 18, 2013, at 3:46 AM, Sebastian Gabler  wrote:

> Am 18.04.2013 03:09, schrieb openindiana-discuss-requ...@openindiana.org:
>> Message: 1
>> Date: Wed, 17 Apr 2013 13:21:08 -0600
>> From: Jan Owoc
>> To: Discussion list for OpenIndiana
>>  
>> Subject: Re: [OpenIndiana-discuss] Recommendations for fast storage
>> Message-ID:
>>  
>> Content-Type: text/plain; charset=UTF-8
>> 
>> On Wed, Apr 17, 2013 at 12:57 PM, Timothy Coalson  wrote:
>>> >On Wed, Apr 17, 2013 at 7:38 AM, Edward Ned Harvey (openindiana) <
>>> >openindi...@nedharvey.com> wrote:
>>> >
 >> You also said the raidz2 will offer more protection against failure,
 >> because you can survive any two disk failures (but no more.)  I would argue
 >> this is incorrect (I've done the probability analysis before).  Mostly
 >> because the resilver time in the mirror configuration is 8x to 16x faster
 >> (there's 1/8 as much data to resilver, and IOPS is limited by a single
 >> disk, not the "worst" of several disks, which introduces another factor up
 >> to 2x, increasing the 8x as high as 16x), so the smaller resilver window
 >> means lower probability of "concurrent" failures on the critical vdev.
 >> We're talking about 12 hours versus 1 week, actual result of my machines
 >> in production.
 >>
>>> >
>>> >Did you also compare the probability of bit errors causing data loss
>>> >without a complete pool failure?  2-way mirrors, when one device completely
>>> >dies, have no redundancy on that data, and the copy that remains must be
>>> >perfect or some data will be lost.  On the other hand, raid-z2 will still
>>> >have available redundancy, allowing every single block to have a bad read
>>> >on any single component disk, without losing data.  I haven't done the math
>>> >on this, but I seem to recall some papers claiming that this is the more
>>> >likely route to lost data on modern disks, by comparing bit error rate and
>>> >capacity.  Of course, a second outright failure puts raid-z2 in a much
>>> >worse boat than 2-way mirrors, which is a reason for raid-z3, but this may
>>> >already be a less likely case.
>> Richard Elling wrote a blog post about "mean time to data loss" [1]. A
>> few years later he graphed out a few cases for typical values of
>> resilver times [2].
>> 
>> [1]https://blogs.oracle.com/relling/entry/a_story_of_two_mttdl
>> [2]http://blog.richardelling.com/2010/02/zfs-data-protection-comparison.html
>> 
>> Cheers,
>> Jan
> 
> Notably, Richard's posted models do not include BER. Nevertheless, it's an 
> important factor.

BER is the term most often used in networks, where the corruption is transient.
For permanent data faults, the equivalent is the unrecoverable read error rate
(UER), also expressed as a failure rate per bit. My models clearly consider this.
Unfortunately, terminology consistency between vendors has been slow in coming,
but Seagate and WD seem to be converging on "Non-recoverable read errors per bits
read" while Toshiba seems to be ignoring the problem, or at least can't seem to
list it on their datasheets :-(

> From the back of my mind it will impact reliability in different ways in ZFS:
> 
> - Bit error in metadata (zfs should save us by metadata redundancy)
> - Bit error in full stripe data
> - Bit error in parity data

These aren't interesting from a system design perspective. To enhance the model
to deal with this, we just need to determine what percentage of the overall space
contains copied data. There is no general answer, but for most systems it will be
a small percentage of the total, as compared to data. In this respect, the models
are worst-case, which is what we want to use for design evaluations.
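
As a sketch of that enhancement (the probability and the occupancy fraction
below are assumptions, purely for illustration):

# Weight an unrecoverable-read probability by how much of the pool actually
# holds single-copy data; errors landing in free space, or in metadata that
# has extra copies, lose nothing.
p_uer_full_read       = 0.06   # assumed chance of a bad sector over a full read
frac_single_copy_data = 0.70   # assumed share of the vdev holding single-copy data

print("worst case (published models):", p_uer_full_read)
print("weighted by data occupancy   :", round(p_uer_full_read * frac_single_copy_data, 3))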

NB, traditional RAID systems don't know what is data and what is not, so they
could run into uncorrectable errors in blocks that don't actually contain data.
This becomes more important for those systems which use a destructive scrub, as
opposed to ZFS's read-only scrub. Hence, some studies have shown that scrubbing
can propagate errors in non-ZFS RAID arrays.

> 
> AFAIK, a bit error in parity or stripe data can be especially dangerous
> when it is raised during resilvering and there is only one layer of
> redundancy left. OTOH, BER issues scale with vdev size, not with rebuild
> time. So, I think that Tim actually made a valid point about a
> systematically weak point of 2-way mirrors or raidz1 in vdevs that are
> large in comparison to the BER rating of their member drives. Consumer
> drives have a BER of 1:10^14..10^15; enterprise drives start at 1:10^16.
> I do not think that ZFS will have better resilience against rot of parity
> data than conventional RAID. At best, block-level checksums can help raise
> an error, so you at least know that something went wrong. But recovery of
> the data will probably not be possible. So, in my opinion BER is an issue
> under ZFS as anywhere
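
The scaling Sebastian describes falls straight out of the published error-rate
specs. A minimal sketch, where the 8 TB figure is an assumed amount of surviving
data read back during a rebuild:

from math import expm1

def p_unrecoverable_read(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable read error while reading the
    given amount of data: 1 - (1 - p)^bits, computed as 1 - exp(-bits * p)."""
    bits = bytes_read * 8
    return -expm1(-bits * errors_per_bit)

TB = 10**12
for label, rate in (("consumer   1 in 10^14", 1e-14),
                    ("consumer   1 in 10^15", 1e-15),
                    ("enterprise 1 in 10^16", 1e-16)):
    print(label, round(p_unrecoverable_read(8 * TB, rate), 4))

# Roughly 0.47, 0.06, and 0.006 respectively -- which is why the UER rating
# matters more and more as vdevs get larger.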

Re: [OpenIndiana-discuss] building a new box soon- HDD concerns and recommendations for virtual serving

2013-04-20 Thread Carl Brewer

On 19/04/2013 11:29 AM, Carl Brewer wrote:

On 18/04/2013 11:08 AM, Jay Heyl wrote:


One thing I would recommend is trying to use the ashift=12 setting to force
the use of 4k blocks. I ran into problems because my initial pools were
created with 512-byte blocks. When I bought some spare drives I couldn't
use them because they were advanced format with 4k blocks and zfs won't mix
block sizes on the same vdev. Had I used 4k blocks when I initially set
everything up I wouldn't have had this problem with the new drives.


How can I check what they are at the moment?


Like this :

root@hostie:~# zdb | egrep 'ashift| name'
name: 'rpool'
ashift: 12
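
As an aside, ashift is the base-2 logarithm of the sector size ZFS uses for the
vdev, so a quick sanity check (purely illustrative) is:

ashift = 12
print(2 ** ashift)   # 4096 -> 4 KiB sectors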



And as I understand it, the 12 means 4k blocks, good, right? :)

Carl





Re: [OpenIndiana-discuss] OpenIndiana-discuss Digest, Vol 33, Issue 36

2013-04-20 Thread Justin Warwick
Thanks very much for your replies, gentlemen. zdb -l indeed shows the disk
to be in a pool "rpool" with c2d0, which led me to realize I had made a
mistake: I was looking at the wrong disk.

Some background information: the initial configuration was two pools, "rpool"
and "media", each a simple two-disk mirror. I added in two more disks, and
while doing so, it is possible that some of the original four were
reconnected to different SATA ports on the motherboard, though it is my
understanding that ZFS does not care so much about disk device paths,
because it uses GUIDs stored in the disk labels instead (which is why I
wasn't more careful at the time). Am I mistaken on that?

My confusion partly stems from the disk device paths having changed; here
is output from before I physically added in the new disks:

jrw@valinor:~$ zpool status
  pool: media
 state: ONLINE
  scan: none requested
config:

NAMESTATE READ WRITE CKSUM
media   ONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c2d1ONLINE   0 0 0
c3d1ONLINE   0 0 0

errors: No known data errors

  pool: rpool
 state: ONLINE
  scan: scrub repaired 9.50K in 0h4m with 0 errors on Mon Jun 18 12:07:30
2012
config:

NAMESTATE READ WRITE CKSUM
rpool   ONLINE   0 0 0
  mirror-0  ONLINE   0 0 0
c2d0s0  ONLINE   0 0 0
c3d0s0  ONLINE   0 0 0

errors: No known data errors

I am afraid that I stepped away from the problem for a while, without good
notes, so some points are fuzzy (and apparently my bash HISTSIZE is the
default 500), but I am pretty sure I just did a zpool detach, intending to
reorganize or maybe to address some perceived weird behavior (which was in
fact ZFS trying to deal with my mistakes). In any case it does seem to fit
your described scenario, Jim.

Thanks for pointing out the SATA/IDE problem. I was wondering about that.
Most of my experience is with plain old Sun-brand SPARC boxes, so the absence
of a slice number on some of the disks was a mystery. I have no other fancy
requirements like dual-boot. These are all identical, new Seagate Momentus
XT drives. Maybe there is a BIOS setting to change.

  -- Justin


>
> --
>
> Message: 5
> Date: Fri, 19 Apr 2013 09:30:07 -0400
> From: George Wilson 
> To: openindiana-discuss@openindiana.org
> Subject: Re: [OpenIndiana-discuss] zpool vdev membership misreported
> Message-ID: <517146df.7040...@delphix.com>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Justin,
>
> More than likely c3d1 contains a label from an old pool. If you run 'zdb
> -l /dev/dsk/c3d1s0' you should be able to tell if one exists. If it does
> contain information for an old pool, then using the '-f' option when attaching
> will solve your problem and relabel the device with the new pool
> information.
>
> You could also do a 'zpool import' to see if c3d1 shows up with
> information about an exported pool.
>
> Thanks,
> George
>
> On 4/19/13 5:20 AM, Justin Warwick wrote:
> > jrw@valinor:~$ pfexec zpool attach media c2d1 c3d1
> > invalid vdev specification
> > use '-f' to override the following errors:
> > /dev/dsk/c3d1s0 is part of active ZFS pool rpool. Please see zpool(1M).
> >
> > Yet I do not see c3d1 in zpool status.  Initially I had 4 disks total, all
> > Seagate SATA drives, two separate plain two-disk mirrors. I added a couple
> > more disks, but haven't yet added them to a pool. I noticed that the device
> > identifiers changed (which I did not expect). I broke the mirror on the
> > "media" pool (I can't remember why I did that). Since that time I have
> > been getting that message, which seems to mean the disk is somehow halfway
> > stuck in the mirror. Should I just issue the -f override? Am I asking for
> > trouble if I do?
> >
> >
> > jrw@valinor:~$ zpool status
> >pool: media
> >   state: ONLINE
> >scan: scrub repaired 0 in 3h38m with 0 errors on Sun Feb 17 04:44:29
> 2013
> > config:
> >
> >  NAMESTATE READ WRITE CKSUM
> >  media   ONLINE   0 0 0
> >c2d1  ONLINE   0 0 0
> >
> > errors: No known data errors
> >
> >pool: rpool
> >   state: ONLINE
> >scan: scrub in progress since Fri Apr 19 01:47:19 2013
> >  27.1M scanned out of 75.9G at 3.87M/s, 5h34m to go
> >  0 repaired, 0.03% done
> > config:
> >
> >  NAMESTATE READ WRITE CKSUM
> >  rpool   ONLINE   0 0 0
> >mirror-0  ONLINE   0 0 0
> >  c2d0s0  ONLINE   0 0 0
> >  c3d0s0  ONLINE   0 0 0
> >
> > errors: No known data errors
> > jrw@valinor:~$ echo | pfexec format
> > Searching for disks...done
> >
> >
> > AVAILABLE DISK SELECTIONS:
> > 0. c2d0 
> >/pci@0,0/pci-ide@11/ide@0/cmdk@0,0
> > 1. c2d