Re: [zfs-discuss] Raidz vdev size... again.

2009-04-29 Thread Bob Friesenhahn

On Tue, 28 Apr 2009, Richard Elling wrote:


I suppose if you could freeze the media to 0K, then it would not decay.
But that isn't the world I live in :-).  There is a whole Journal devoted
to things magnetic, with lots of studies of interesting compounds.  But
from a practical perspective, it is worth noting that some magnetic tapes
have a rated shelf life of 8-10 years while enterprise-class backup tapes
are only rated at 30 years.  Most disks have an expected operational life
of 5 years or so.  As Tim notes, it is a good idea to plan for migrating
important data to newer devices over time.


I am definitely a fan of migrating data.  As far as media degradation 
goes, perhaps much of the concern is the stability of the base stock 
(e.g. plastic) or disk drive mechanism and heads, and not the ability 
of the magnetic stuff to maintain its magnetism.


However, even the planet earth has an average shelf-life of 10,000 
years, after which the poles may suddenly be reversed (the compass 
points in the opposite direction).


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Richard Elling

Bob Friesenhahn wrote:

On Tue, 28 Apr 2009, Richard Elling wrote:


Yes and there is a very important point here.
There are 2 different sorts of scrubbing: read and rewrite.
ZFS (today) does read scrubbing, which does not reset the decay
process. Some RAID arrays also do rewrite scrubs which does reset
the decay process.  The problem with rewrite scrubbing is that you


I am not convinced that there is a "decay" process.  There is 
considerable magnetic hysteresis involved.  It seems most likely that 
corruption happens all of a sudden, and involves more than one or two 
bits.  More often than not we hear of a number of sectors failing at 
one time.


I suppose if you could freeze the media to 0K, then it would not decay.
But that isn't the world I live in :-).  There is a whole Journal devoted
to things magnetic, with lots of studies of interesting compounds.  But
from a practical perspective, it is worth noting that some magnetic tapes
have a rated shelf life of 8-10 years while enterprise-class backup tapes
are only rated at 30 years.  Most disks have an expected operational life
of 5 years or so.  As Tim notes, it is a good idea to plan for migrating
important data to newer devices over time.
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Tim
On Tue, Apr 28, 2009 at 11:12 PM, Bob Friesenhahn <
bfrie...@simple.dallas.tx.us> wrote:

> On Tue, 28 Apr 2009, Tim wrote:
>
>  I'll stick with the 3 year life cycle of the system followed by a hot
>> migration to new storage, thank you very much.
>>
>
> Once again there is a fixation on the idea that computers gradually degrade
> over time and that simply replacing the hardware before the expiration date
> (like a bottle of milk) will save the data.  I recently took an old Sun
> system out of service that was approaching 12 years on the same disks with
> no known read errors.  The Sun before that one was taken out of service
> after 11 years with no known read errors.  Lucky me.
>
> Various papers I have read suggest that degradation is in fits and bursts
> and contrary to what one would expect based on vendor specifications.
>
> Bob
>


I don't recall saying anything about a computer wearing out.  When net-new
and faster/more space is cheaper than maintenance renewal, I'll stick with
net-new.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Bob Friesenhahn

On Tue, 28 Apr 2009, Tim wrote:

I'll stick with the 3 year life cycle of the system followed by a 
hot migration to new storage, thank you very much.


Once again there is a fixation on the idea that computers gradually 
degrade over time and that simply replacing the hardware before the 
expiration date (like a bottle of milk) will save the data.  I 
recently took an old Sun system out of service that was approaching 12 
years on the same disks with no known read errors.  The Sun before 
that one was taken out of service after 11 years with no known read 
errors.  Lucky me.


Various papers I have read suggest that degradation is in fits and 
bursts and contrary to what one would expect based on vendor 
specifications.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Bob Friesenhahn

On Tue, 28 Apr 2009, Richard Elling wrote:


Yes and there is a very important point here.
There are 2 different sorts of scrubbing: read and rewrite.
ZFS (today) does read scrubbing, which does not reset the decay
process. Some RAID arrays also do rewrite scrubs which does reset
the decay process.  The problem with rewrite scrubbing is that you


I am not convinced that there is a "decay" process.  There is 
considerable magnetic hysteresis involved.  It seems most likely that 
corruption happens all of a sudden, and involves more than one or two 
bits.  More often than not we hear of a number of sectors failing at 
one time.


Do you have a reference to research results which show that a gradual 
"decay" process is a significant factor?


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Bob Friesenhahn

On Tue, 28 Apr 2009, Miles Nordin wrote:


* it'd be harmful to do this on SSD's.  it might also be a really
  good idea to do it on SSD's.  who knows yet.


SSDs can be based on many types of technologies, and not just those 
that wear out.



  * it may be wasteful to do read/rewrite on an ordinary magnetic
drive because if you just do a read, the drive should notice a
decaying block and rewrite it without being told specifically,
maybe.  though from netapp's paper, they say they disable many of


Does the drive have the capability to detect when a sector is written 
to the wrong track?  In order for it to detect that, the expected 
location would have to be written into the sector.



In the end, though, I bet we may end up with this feature on ZFS in
the disguise of a ``defragmenter''.  If the defragmenter will promise
to rewrite every block to a new spot, not just the ones it pleases,
this will do the job of your ``write scrub'' and also solve the drive
caching problem.


It seems doubtful that bulk re-writing of data will improve data 
integrity.  Writing is dangerous.


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Tim
On Tue, Apr 28, 2009 at 4:52 PM, Richard Elling wrote:

>
> Well done!  Of course Hitachi doesn't use consumer-grade disks in
> their arrays...
>
> I'll also confess that I did set a bit of a math trap here :-)  The trap is
> that if you ever have to recover data from tape/backup, then you'll
> have no chance of making 5-9s when using large volumes.  Suppose
> you have a really nice backup system that can restore 10TBytes in
> 10 hours.  To achieve 5-9s you'd need to make sure that you never
> have to restore from backups for the next 114 years.  Since the
> expected lifetime of a disk is << 114 years, you'll have a poor
> chance of making it. So the problem really boils down to how sure
> you can be that you won't have an unrecoverable read during the
> expected lifetime of your system. Studies have shown [1] that you
> are much more likely to see this than you'd expect. The way to
> solve that problem is to use double parity to further reduce this
> probability.  Or, more simply, BAARF.
>
> [1] http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
>
> -- richard



Your *trap* assumes COMPLETE data loss.  I don't know what world you live
in, but the one I live in doesn't require a restore of 10TB of data when
*ONE* block is bad.  You've also assumed that the useful life of the data is
114 years, also false in the majority of primary disk systems.  Then there's
the little issue of you ignoring parity when you quote "a disk drive's
life".  I'll stick with the 3 year life cycle of the system followed by a
hot migration to new storage, thank you very much.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread David Magda

On Apr 28, 2009, at 18:02, Richard Elling wrote:


Kees Nuyt wrote:





Some high availability storage systems overcome this decay
by not just reading, but also writing all blocks during a
scrub. In those systems, scrubbing is done semi-continuously
in the background, not on user/admin demand.


Yes and there is a very important point here.
There are 2 different sorts of scrubbing: read and rewrite.
ZFS (today) does read scrubbing, which does not reset the decay
process. Some RAID arrays also do rewrite scrubs which does reset
the decay process.  The problem with rewrite scrubbing is that you
really want to be sure the data is correct before you rewrite.   
Neither
is completely foolproof, so it is still a good idea to have  
backups :-)


Hopefully bp relocate will make it into Solaris at some point, so  
when a scrub gets kicked off we'll be able to have that (at least as  
an option, if not by default).


Mac OS 10.5 auto-defrags in the background (given certain criteria are  
met), but HFS+ doesn't have checksums, so there's a bit of a risk of  
creating errors.


Combine the two and you have a fairly robust defrag system.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Richard Elling

Kees Nuyt wrote:

On Mon, 27 Apr 2009 18:25:42 -0700, Richard Elling
 wrote:

  

The concern with large drives is unrecoverable reads during resilvering.
One contributor to this is superparamagnetic decay, where the bits are
lost over time as the medium tries to revert to a more steady state.
To some extent, periodic scrubs will help repair these while the disks
are otherwise still good. At least one study found that this can occur
even when scrubs are done, so there is an open research opportunity
to determine the risk and recommend scrubbing intervals. 



Some high availability storage systems overcome this decay
by not just reading, but also writing all blocks during a
scrub. In those systems, scrubbing is done semi-continuously
in the background, not on user/admin demand.
  


Yes and there is a very important point here.
There are 2 different sorts of scrubbing: read and rewrite.
ZFS (today) does read scrubbing, which does not reset the decay
process. Some RAID arrays also do rewrite scrubs which does reset
the decay process.  The problem with rewrite scrubbing is that you
really want to be sure the data is correct before you rewrite.  Neither
is completely foolproof, so it is still a good idea to have backups :-)
-- richard
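
As a concrete illustration of the read-scrub half of this, a periodic scrub
can be driven by nothing more than a cron job around the stock zpool
commands. This is only a minimal sketch: the pool name "tank" and the use of
Python are assumptions, not anything prescribed above.

#!/usr/bin/env python3
"""Start a ZFS read scrub and report pool status.

Minimal sketch: run from cron (e.g. weekly) against your own pool name.
A read scrub verifies checksums and repairs from redundancy where needed,
but it does not rewrite otherwise-healthy blocks.
"""
import subprocess
import sys

POOL = "tank"  # hypothetical pool name -- substitute your own

def main() -> int:
    # Kick off the scrub, then print progress and error counts.
    subprocess.run(["zpool", "scrub", POOL], check=True)
    status = subprocess.run(["zpool", "status", POOL],
                            capture_output=True, text=True, check=True)
    print(status.stdout)
    return 0

if __name__ == "__main__":
    sys.exit(main())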

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Richard Elling

Tim wrote:



On Mon, Apr 27, 2009 at 8:25 PM, Richard Elling 
mailto:richard.ell...@gmail.com>> wrote:



I do not believe you can achieve five 9s with current consumer disk
drives for an extended period, say >1 year.


Just to pipe up, while very few vendors can pull it off, we've seen 
five 9's with Hitachi gear using SATA.


Well done!  Of course Hitachi doesn't use consumer-grade disks in
their arrays...

I'll also confess that I did set a bit of a math trap here :-)  The trap is
that if you ever have to recover data from tape/backup, then you'll
have no chance of making 5-9s when using large volumes.  Suppose
you have a really nice backup system that can restore 10TBytes in
10 hours.  To achieve 5-9s you'd need to make sure that you never
have to restore from backups for the next 114 years.  Since the
expected lifetime of a disk is << 114 years, you'll have a poor
chance of making it. So the problem really boils down to how sure
you can be that you won't have an unrecoverable read during the
expected lifetime of your system. Studies have shown [1] that you
are much more likely to see this than you'd expect. The way to
solve that problem is to use double parity to further reduce this
probability.  Or, more simply, BAARF.

[1] http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
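
For anyone who wants to check the arithmetic behind that trap, here is a
back-of-the-envelope version (the 10-hour restore is the figure from the
paragraph above; the rest is plain availability math):

# Five 9s = 99.999% uptime, i.e. a downtime budget of ~5.26 minutes per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60

downtime_budget = (1 - 0.99999) * MINUTES_PER_YEAR   # ~5.26 min/year
restore_minutes = 10 * 60                            # the 10-hour restore above

# Years of restore-free operation needed for one full restore to fit
# inside the cumulative five-9s downtime budget.
years_needed = restore_minutes / downtime_budget
print(f"budget {downtime_budget:.2f} min/yr; one restore consumes "
      f"~{years_needed:.0f} years of it")            # ~114 years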

-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Miles Nordin
> "kn" == Kees Nuyt  writes:

kn> Some high availability storage systems overcome this decay by
kn> not just reading, but also writing all blocks during a
kn> scrub. 

sounds like a good idea but harder in the ZFS model where the software
isn't the proprietary work of the only permitted integrator.

 * it'd be harmful to do this on SSD's.  it might also be a really
   good idea to do it on SSD's.  who knows yet.

 * optimizing the overall system depends on intimate knowledge of, and
   control over the release binding of, drive firmware and its
   errata/quirks/decisions

   * it may be wasteful to do read/rewrite on an ordinary magnetic
 drive because if you just do a read, the drive should notice a
 decaying block and rewrite it without being told specifically,
 maybe.  though from netapp's paper, they say they disable many of
 these features in their SCSI drives, including bad block
 remapping, and delegate them to the layer of their own software
 right above the drive

   * there's an ``offline self test'' in SMART where the drive is
 supposed to scrub itself, possibly including badblock remapping
 and marginal sector rewriting.  If this feature worked it could
 possibly accomplish scrubs with better QoS (less interference to
 real read/writes) and no controller-to-storage bandwidth wastage,
 compared to actually reading and rewriting through the
 controller, or possibly several layers above the controller
 through fanouts and such.

   * drives with caches may suppress overwrites to sectors containing
 what the cache says is already in those sectors.  I guess I heard
 on this list that SCSI has commands to ignore the cache for read
 and other commands to bypass it for write, but not SATA, or the
 commands could be broken because no one else uses them.  You have
 to have some business relationship with the drive company before
 they will admit what their proprietary firmware really does, much
 less alter it to your wishes, even if your wish is merely that it
 complies, or behaves like it did yesterday.  Every tiny piece of
 software that remains proprietary eventually turns into a blob
 that does someone else's bidding and fucks with you.

In the end, though, I bet we may end up with this feature on ZFS in
the disguise of a ``defragmenter''.  If the defragmenter will promise
to rewrite every block to a new spot, not just the ones it pleases,
this will do the job of your ``write scrub'' and also solve the drive
caching problem.

kn> In those systems, scrubbing is done semi-continuously in the
kn> background, not on user/admin demand.

which ones?  name names. :) I thought netapp's two papers said they
are doing it ``every Sunday'' or something.

but, yeah, asking the admin to initiate it manually means if it makes
the array uselessly slow you blame the admin rather than the software
stack.  linux ubifs (NAND flash) scrubs are also mandatory/unsupervised.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Kees Nuyt
On Mon, 27 Apr 2009 18:25:42 -0700, Richard Elling
 wrote:

>The concern with large drives is unrecoverable reads during resilvering.
>One contributor to this is superparamagnetic decay, where the bits are
>lost over time as the medium tries to revert to a more steady state.
>To some extent, periodic scrubs will help repair these while the disks
>are otherwise still good. At least one study found that this can occur
>even when scrubs are done, so there is an open research opportunity
>to determine the risk and recommend scrubbing intervals. 

Some high availability storage systems overcome this decay
by not just reading, but also writing all blocks during a
scrub. In those systems, scrubbing is done semi-continuously
in the background, not on user/admin demand.
-- 
  (  Kees Nuyt
  )
c[_]
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Fajar A. Nugraha
On Tue, Apr 28, 2009 at 9:42 AM, Scott Lawson
 wrote:
>> Mainstream Solaris 10 gets a port of ZFS from OpenSolaris, so its
>> features are fewer and later.  As time ticks away, fewer features
>> will be back-ported to Solaris 10.  Meanwhile, you can get a production
>> support  agreement for OpenSolaris.
>
> Sure if you want to run it on x86. I believe sometime in 2009 we will see a
> SPARC release
> for Opensolaris. I understand that it is to be the next OpenSolaris release,
> but I wouldn't hold
> my breath.

It's already available for SPARC (http://genunix.org/). Just not in
installer or Live DVD format (which should be available for the 2009.06
release).

Regards,

Fajar
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Blake
On Tue, Apr 28, 2009 at 10:08 AM, Tim  wrote:
>
>
> On Mon, Apr 27, 2009 at 8:25 PM, Richard Elling 
> wrote:
>>
>> I do not believe you can achieve five 9s with current consumer disk
>> drives for an extended period, say >1 year.
>
> Just to pipe up, while very few vendors can pull it off, we've seen five 9's
> with Hitachi gear using SATA.

Can you specify the hardware?

I've recently switched to LSI SAS1068E controllers and am swimmingly
happy.  (That's my $.02 - controllers (not surprisingly) affect the
niceness of a software RAID solution like ZFS quite a bit - maybe even
more than the actual drives...?)
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-28 Thread Tim
On Mon, Apr 27, 2009 at 8:25 PM, Richard Elling wrote:

>
> I do not believe you can achieve five 9s with current consumer disk
> drives for an extended period, say >1 year.
>

Just to pipe up, while very few vendors can pull it off, we've seen five 9's
with Hitachi gear using SATA.

--Tim
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson



Richard Elling wrote:

Some history below...

Scott Lawson wrote:


Michael Shadle wrote:

On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson
 wrote:

 
If possible though you would be best to let the 3ware controller 
expose
the 16 disks as a JBOD  to ZFS and create a RAIDZ2 within Solaris 
as you

will then
gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, 
just

that it
is not the optimal size. Longer rebuild times for larger vdev's in 
a zpool

(although this
is proportional to how full the pool is.). Two parity disks gives you
greater cover in
the event of a drive failing in a large vdev stripe.



Hmm, this is a bit disappointing to me. I would have dedicated only 2
disks out of 16 then to a single large raidz2 instead of two 8 disk
raidz2's (meaning 4 disks went to parity)

  
No I was referring to a single RAIDZ2 vdev of 16 disks in your pool. 
So you would
lose ~2 disks to parity effectively. The larger the stripe, 
potentially the slower the rebuild.
If you had multiple vdevs in a pool that were smaller stripes you 
would get less performance
degradation by virtue of IO isolation. Of course here you lose pool 
capacity. With
smaller vdevs, you could also potentially just use RAIDZ and not 
RAIDZ2 and then you would

have the equivalent size pool still with two parity disks. 1 per vdev.


A few years ago, Sun introduced the X4500 (aka Thumper) which had 48
disks in the chassis.  Of course, the first thing customers did was to 
make
a single-level 46 or 48 disk raidz set.  The second thing they did was 
complain
that the resulting performance sucked.  So the "solution" was to try 
and put
some sort of practical limit into the docs to help people not hurt 
themselves.

After much research (down at the pub? :-) the recommendation you see in
the man page was the consensus.  It has absolutely nothing to do with
correctness of design or implementation.  It has everything to do with
setting expectations of "goodness."
Sure, I understand this. I was a beta tester for the J4500 because I 
prefer SPARC systems mostly
for Solaris. Certainly for these large disk systems the preferred layout 
of around 5-6 drives per vdev
is what I use on my assortment of *4500 series devices. My production 
J4500's with 48 x 1 TB drives
yield around 31 TB usable. A 10 Gig-attached T5520 will pretty much 
saturate the 3Gb/s SAS HBA connecting

it to the J4500. ;)

Being that this is a home NAS for Michael serving large contiguous files 
with fairly low random access
requirements, most likely I would imagine that these rules of thumb can 
be relaxed a little. As you
state they are a rule of thumb for generic loads. This list does appear 
to be attracting people
wanting to use ZFS for home and capacity tends to be the biggest 
requirement over performance.


As I always advise people: test with *your* workload, as *your* 
requirements may be different
from the next man's. If you favor capacity over performance then a larger 
vdev of a dozen or so  disks
will work 'OK' in my experience. (I do routinely get referred to Sun 
customers in NZ as a site that

actually uses ZFS in production and doesn't just play with it.)

I have tested the aforementioned thumpers with just this sort of config 
myself with varying results

on varying workloads. Video servers, Sun Email etc etc... Long time ago now.

I also have hardware backed RAID 6's consisting of 16 drives in 6000 
series storage on Crystal firmware
which work just fine in the hardware RAID world. (where I want capacity 
over speed). This is real world
production class stuff. Works just fine. I have ZFS overlaid on top of 
this as well.


But it is good that we are emphasizing the trade offs that any config 
has. Everyone can learn from these

sorts of discussions. ;)




One thing you haven't mentioned is the drive type and size that you 
are planning to use as this
greatly influences what people here would recommend. RAIDZ2 is built 
for big, slow SATA
disks as reconstruction times in large RAIDZ's and RAIDZ2's increase 
the risk of vdev failure
significantly due to the time taken to resilver to a replacement 
drive. Hot spares are your friend!


The concern with large drives is unrecoverable reads during resilvering.
One contributor to this is superparamagnetic decay, where the bits are
lost over time as the medium tries to revert to a more steady state.
To some extent, periodic scrubs will help repair these while the disks
are otherwise still good. At least one study found that this can occur
even when scrubs are done, so there is an open research opportunity
to determine the risk and recommend scrubbing intervals.  To a lesser
extent, hot spares can help reduce the hours it may take to physically
repair the failed drive.

+1



I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus nervous.
  
You did also state that this is a system to be used for backups? 

Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Bob Friesenhahn

On Mon, 27 Apr 2009, Michael Shadle wrote:


I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus nervous.


Baby Jesus might not be particularly nervous but if your drives don't 
perform consistently, then there will be more chance of performance 
loss.  With raidz and raidz2 the drives need to operate in 
synchronicity so a balky drive (e.g. longer seek times than the others 
or slow transfers) will hurt performance.  Naturally, the more money 
you try to save, the more chance you have of a balky drive.  If you 
find one, you can replace it and replace drives until there are no 
more laggards left.


From what I have heard, there are limits to how many drives can be 
included in a ZFS block write so an individual write is not likely to 
span all of the drives, which makes finding the balky drives more 
interesting.
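
A simplified way to see that last point (a sketch only: it assumes
512-byte sectors and ignores raidz skip/padding sectors, so treat the
numbers as approximate):

import math

SECTOR = 512

def disks_touched(block_bytes: int, width: int, parity: int) -> int:
    """Approximate number of disks a single raidz block lands on."""
    data_disks = width - parity
    data_sectors = math.ceil(block_bytes / SECTOR)
    # Data columns actually used, plus the parity columns written alongside.
    return min(data_sectors, data_disks) + parity

for size in (512, 4 * 1024, 128 * 1024):
    n = disks_touched(size, width=16, parity=2)
    print(f"{size:6d}-byte block in a 16-wide raidz2 touches {n} of 16 disks")

Small blocks land on only a few disks, so a single balky drive slows some
reads and not others, which is exactly what makes it hard to pinpoint.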


Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Richard Elling

Some history below...

Scott Lawson wrote:


Michael Shadle wrote:

On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson
 wrote:

 

If possible though you would be best to let the 3ware controller expose
the 16 disks as a JBOD  to ZFS and create a RAIDZ2 within Solaris as 
you

will then
gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, 
just

that it
is not the optimal size. Longer rebuild times for larger vdev's in a 
zpool

(although this
is proportional to how full the pool is.). Two parity disks gives you
greater cover in
the event of a drive failing in a large vdev stripe.



Hmm, this is a bit disappointing to me. I would have dedicated only 2
disks out of 16 then to a single large raidz2 instead of two 8 disk
raidz2's (meaning 4 disks went to parity)

  
No I was referring to a single RAIDZ2 vdev of 16 disks in your pool. 
So you would
lose ~2 disks to parity effectively. The larger the stripe, 
potentially the slower the rebuild.
If you had multiple vdevs in a pool that were smaller stripes you 
would get less performance
degradation by virtue of IO isolation. Of course here you lose pool 
capacity. With
smaller vdevs, you could also potentially just use RAIDZ and not 
RAIDZ2 and then you would

have the equivalent size pool still with two parity disks. 1 per vdev.


A few years ago, Sun introduced the X4500 (aka Thumper) which had 48
disks in the chassis.  Of course, the first thing customers did was to make
a single-level 46 or 48 disk raidz set.  The second thing they did was 
complain
that the resulting performance sucked.  So the "solution" was to try and 
put
some sort of practical limit into the docs to help people not hurt 
themselves.

After much research (down at the pub? :-) the recommendation you see in
the man page was the consensus.  It has absolutely nothing to do with
correctness of design or implementation.  It has everything to do with
setting expectations of "goodness."



One thing you haven't mentioned is the drive type and size that you 
are planning to use as this
greatly influences what people here would recommend. RAIDZ2 is built 
for big, slow SATA
disks as reconstruction times in large RAIDZ's and RAIDZ2's increase 
the risk of vdev failure
significantly due to the time taken to resilver to a replacement 
drive. Hot spares are your friend!


The concern with large drives is unrecoverable reads during resilvering.
One contributor to this is superparamagnetic decay, where the bits are
lost over time as the medium tries to revert to a more steady state.
To some extent, periodic scrubs will help repair these while the disks
are otherwise still good. At least one study found that this can occur
even when scrubs are done, so there is an open research opportunity
to determine the risk and recommend scrubbing intervals.  To a lesser
extent, hot spares can help reduce the hours it may take to physically
repair the failed drive.
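
To get a feel for the unrecoverable-read risk, a crude independent-bit-error
model is enough (a sketch; the 1e-14 bits-per-error figure is a typical
consumer drive spec sheet number used here as an assumption, and real
failures tend to cluster rather than arrive independently):

def p_ure(read_bytes: float, bit_error_rate: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading
    read_bytes back from the surviving disks during a resilver."""
    bits = read_bytes * 8
    return 1 - (1 - bit_error_rate) ** bits

for tb in (1, 5, 10):
    print(f"read {tb:2d} TB during resilver -> P(>=1 URE) ~ {p_ure(tb * 1e12):.2f}")

# With a 1e-14 spec, reading ~10 TB already gives roughly a 55% chance of
# hitting one URE, which is the argument for double parity and regular scrubs.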


I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus nervous.
  
You did also state that this is a system to be used for backups? So 
availability is five 9's?


I do not believe you can achieve five 9s with current consumer disk
drives for an extended period, say >1 year.



Are you planning on using Open Solaris or mainstream Solaris 10? 
Mainstream Solaris
10 is more conservative and is capable of being placed under a support 
agreement if need

be.


Mainstream Solaris 10 gets a port of ZFS from OpenSolaris, so its
features are fewer and later.  As time ticks away, fewer features
will be back-ported to Solaris 10.  Meanwhile, you can get a production
support  agreement for OpenSolaris.
http://www.sun.com/service/opensolaris/index.jsp
-- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson



Michael Shadle wrote:

On Mon, Apr 27, 2009 at 5:32 PM, Scott Lawson
 wrote:

  

One thing you haven't mentioned is the drive type and size that you are
planning to use as this
greatly influences what people here would recommend. RAIDZ2 is built for
big, slow SATA
disks as reconstruction times in large RAIDZ's and RAIDZ2's increase the
risk of vdev failure
significantly due to the time taken to resilver to a replacement drive. Hot
spares are your friend!



Well these are Seagate 1.5TB SATA disks. So.. big slow disks ;)

  

Then RAIDZ2 is your friend! The resilver time on a large RAIDZ2 stripe on
these would take a significant amount of time. The probability of 
another drive failing
during this rebuild time is quite high. I have in my time seen numerous 
double disk failures

in  hardware backed RAID5's  resulting in complete volume failure.

You did also state that this is a system to be used for backups? So
availability is five 9's?

Are you planning on using Open Solaris or mainstream Solaris 10? Mainstream
Solaris
10 is more conservative and is capable of being placed under a support
agreement if need
be.



Nope. Home storage (DVDs, music, etc) - I'd be fine with mainstream
Solaris, the only reason I went with SXCE was for the in-kernel CIFS,
which I wound up not using anyway due to some weird bug.
  
I have a v240 at home with a 12 bay D1000 chassis with 11 x 300GB SCSIs 
in a RAIDZ2 with 1 hot spare. Makes a great NAS for me. Mostly for photos 
and music, so the capacity is fine. Speed is very quick as these are 10K 
drives. I have a printing business on the side where we store customer 
images on this and have gigabit to all the Macs that we use for Photoshop. 
The assurance that RAIDZ2 gives me allows me to sleep comfortably. 
(coupled with daily snapshots ;))
sleep comfortably. (coupled with daily snapshots ;))

I use S10 10/08 with Samba for my network clients. Runs like a charm.

--
___


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Michael Shadle
On Mon, Apr 27, 2009 at 5:32 PM, Scott Lawson
 wrote:

> One thing you haven't mentioned is the drive type and size that you are
> planning to use as this
> greatly influences what people here would recommend. RAIDZ2 is built for
> big, slow SATA
> disks as reconstruction times in large RAIDZ's and RAIDZ2's increase the
> risk of vdev failure
> significantly due to the time taken to resilver to a replacement drive. Hot
> spares are your friend!

Well these are Seagate 1.5TB SATA disks. So.. big slow disks ;)

> You did also state that this is a system to be used for backups? So
> availability is five 9's?
>
> Are you planning on using Open Solaris or mainstream Solaris 10? Mainstream
> Solaris
> 10 is more conservative and is capable of being placed under a support
> agreement if need
> be.

Nope. Home storage (DVDs, music, etc) - I'd be fine with mainstream
Solaris, the only reason I went with SXCE was for the in-kernel CIFS,
which I wound up not using anyway due to some weird bug.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson


Michael Shadle wrote:

On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson
 wrote:

  

If possible though you would be best to let the 3ware controller expose
the 16 disks as a JBOD  to ZFS and create a RAIDZ2 within Solaris as you
will then
gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, just
that it
is not the optimal size. Longer rebuild times for larger vdev's in a zpool
(although this
is proportional to how full the pool is.). Two parity disks gives you
greater cover in
the event of a drive failing in a large vdev stripe.



Hmm, this is a bit disappointing to me. I would have dedicated only 2
disks out of 16 then to a single large raidz2 instead of two 8 disk
raidz2's (meaning 4 disks went to parity)

  
No I was referring to a single RAIDZ2 vdev of 16 disks in your pool. So 
you would
lose ~2 disks to parity effectively. The larger the stripe, potentially 
the slower the rebuild.
If you had multiple vdevs in a pool that were smaller stripes you would 
get less performance
degradation by virtue of IO isolation. Of course here you lose pool 
capacity. With
smaller vdevs, you could also potentially just use RAIDZ and not RAIDZ2 
and then you would

have the equivalent size pool still with two parity disks. 1 per vdev.
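
To put rough numbers on that trade-off, here is a quick comparison for 16
drives (a sketch assuming the 1.5 TB drives mentioned elsewhere in the
thread, and ignoring ZFS metadata and reservation overhead):

DISK_TB = 1.5   # assumed drive size

def usable_tb(vdevs: int, width: int, parity: int) -> float:
    """Raw usable capacity of a pool built from identical raidz vdevs."""
    return vdevs * (width - parity) * DISK_TB

layouts = {
    "1 x 16-disk raidz2": (1, 16, 2),   # 2 parity disks total
    "2 x  8-disk raidz2": (2, 8, 2),    # 4 parity disks total
    "2 x  8-disk raidz1": (2, 8, 1),    # 2 parity disks total, 1 per vdev
}
for name, args in layouts.items():
    print(f"{name}: {usable_tb(*args):.1f} TB usable")

The single wide raidz2 and the pair of raidz1 vdevs come out at the same raw
capacity; the difference is how the redundancy is distributed and how long a
rebuild takes.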

One thing you haven't mentioned is the drive type and size that you are 
planning to use as this
greatly influences what people here would recommend. RAIDZ2 is built for 
big, slow SATA
disks as reconstruction times in large RAIDZ's and RAIDZ2's increase the 
risk of vdev failure
significantly due to the time taken to resilver to a replacement drive. 
Hot spares are your friend!

I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus nervous.
  
You did also state that this is a system to be used for backups? So 
availability is five 9's?


Are you planning on using Open Solaris or mainstream Solaris 10? 
Mainstream Solaris
10 is more conservative and is capable of being placed under a support 
agreement if need

be.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Michael Shadle
On Mon, Apr 27, 2009 at 4:51 PM, Scott Lawson
 wrote:

> If possible though you would be best to let the 3ware controller expose
> the 16 disks as a JBOD  to ZFS and create a RAIDZ2 within Solaris as you
> will then
> gain the full benefits of ZFS. Block self healing etc etc.
>
> There isn't an issue in using a larger amount of disks in a RAIDZ2, just
> that it
> is not the optimal size. Longer rebuild times for larger vdev's in a zpool
> (although this
> is proportional to how full the pool is.). Two parity disks gives you
> greater cover in
> the event of a drive failing in a large vdev stripe.

Hmm, this is a bit disappointing to me. I would have dedicated only 2
disks out of 16 then to a single large raidz2 instead of two 8 disk
raidz2's (meaning 4 disks went to parity)

I was still operating under the impression that vdevs larger than 7-8
disks typically make baby Jesus nervous.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Scott Lawson

Leon,

RAIDZ2 is roughly equivalent to RAID6: ~2 disks of parity data, allowing a 
double drive failure while still keeping the pool available.

If possible though you would be best to let the 3ware controller expose
the 16 disks as a JBOD  to ZFS and create a RAIDZ2 within Solaris as you 
will then

gain the full benefits of ZFS. Block self healing etc etc.

There isn't an issue in using a larger amount of disks in a RAIDZ2, just 
that it
is not the optimal size. Longer rebuild times for larger vdev's in a 
zpool (although this
is proportional to how full the pool is.). Two parity disks gives you 
greater cover in

the event of a drive failing in a large vdev stripe.
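
A back-of-the-envelope feel for why rebuild time tracks how full the pool is
(a sketch; the 70 MB/s sustained rate is an assumption, and a real resilver
also competes with normal pool I/O):

DISK_TB = 1.5      # replacement drive size
RATE_MB_S = 70.0   # assumed sustained sequential rate

def resilver_hours(pool_fullness: float) -> float:
    """Rough lower bound: data to reconstruct onto the new disk / write rate."""
    data_bytes = DISK_TB * 1e12 * pool_fullness
    return data_bytes / (RATE_MB_S * 1e6) / 3600

for full in (0.25, 0.50, 0.90):
    print(f"pool {int(full * 100):2d}% full -> ~{resilver_hours(full):.1f} h "
          f"to resilver one {DISK_TB} TB disk")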

/Scott

Leon Meßner wrote:

Hi,
I'm new to the list so please bear with me. This isn't an OpenSolaris-related
problem but I hope it's still the right list to post to.

I'm on the way to moving a backup server to using ZFS-based storage, but I
don't want to spend too many drives on parity (the 16 drives are attached
to a 3ware raid controller so I could also just use raid6 there).

I want to be able to sustain two parallel drive failures so I need
raidz2. The man page of zpool says the recommended vdev size is
somewhere between 3-9 drives (for raidz). Is this just for getting the
best performance or are there stability issues?

There won't be anything like heavy multi-user IO on this machine so
couldn't I just put all 16 drives in one raidz2 and have all the benefits
of ZFS without sacrificing 2 extra drives to parity (compared to raid6)?

Thanks in Advance,
Leon
  



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  


--
___


Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services Private Bag 94006 Manukau
City Auckland New Zealand

Phone  : +64 09 968 7611
Fax: +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz

http://www.manukau.ac.nz




perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'

 


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Raidz vdev size... again.

2009-04-27 Thread Leon Meßner
Hi,
I'm new to the list so please bear with me. This isn't an OpenSolaris-related
problem but I hope it's still the right list to post to.

I'm on the way to moving a backup server to using ZFS-based storage, but I
don't want to spend too many drives on parity (the 16 drives are attached
to a 3ware raid controller so I could also just use raid6 there).

I want to be able to sustain two parallel drive failures so I need
raidz2. The man page of zpool says the recommended vdev size is
somewhere between 3-9 drives (for raidz). Is this just for getting the
best performance or are there stability issues?

There won't be anything like heavy multi-user IO on this machine so
couldn't I just put all 16 drives in one raidz2 and have all the benefits
of ZFS without sacrificing 2 extra drives to parity (compared to raid6)?

Thanks in Advance,
Leon


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss