Re: [zfs-discuss] Why RAID 5 stops working in 2009

2008-07-03 Thread Mike Gerdts
On Thu, Jul 3, 2008 at 3:09 PM, Aaron Blew <[EMAIL PROTECTED]> wrote:
> My take is that since RAID-Z creates a stripe for every block
> (http://blogs.sun.com/bonwick/entry/raid_z), it should be able to
> rebuild the bad sectors on a per block basis.  I'd assume that the
> likelihood of having bad sectors on the same places of all the disks
> is pretty low since we're only reading the sectors related to the
> block being rebuilt.  It also seems that fragmentation would work in
> your favor here since the stripes would be distributed across more of
> the platter(s), hopefully protecting you from a wonky manufacturing
> defect that causes UREs on the same place on the disk.
>
> -Aaron

The per-block statement above is important - ZFS will only rebuild the
blocks that actually hold data.  A 100 TB pool with 1 GB in use will
resilver 1 GB.  As such, rebuild time is a function of the amount of data
rather than the size of the RAID device.  A periodic zpool scrub will
likely turn up read errors before you have a drive failure AND unrelated
read errors.
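
Scheduling that periodic scrub is a one-line cron job.  A minimal sketch,
with "tank" standing in for your pool name:

  # root's crontab: scrub every Sunday at 03:00
  0 3 * * 0 /usr/sbin/zpool scrub tank
  # afterwards, "zpool status -v tank" shows what (if anything) was repaired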

Since ZFS merges the volume management and file system layers, such an
uncorrectable read turns into ZFS saying "file /a/b/c is corrupt - you
need to restore it" rather than traditional RAID5 saying "this 12 TB
volume is corrupt - restore it".  ZFS already makes multiple copies of
metadata, so if you were "lucky" and the corruption happened to hit
metadata, it should be able to get a working copy from elsewhere.
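
If you want the same treatment for user data, not just metadata, the copies
property can be raised per filesystem.  A sketch, with "tank/data" standing in
for a real filesystem (it only affects blocks written after the change):

  zfs set copies=2 tank/data
  zfs get copies tank/data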

Of course, raidz2 further decreases your chances of losing data.  I
would highly recommend reading Richard Elling's comments in this area.
 For example:

http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
http://opensolaris.org/jive/thread.jspa?threadID=65564#255257

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Poor read/write performance when using ZFS iSCSI target

2008-07-03 Thread Cody Campbell
Greetings,

I want to take advantage of the iSCSI target support in the latest release
(snv_91) of OpenSolaris, and I'm running into some performance problems when
reading/writing from/to my target. I'm including as much detail as I can, so
bear with me here...

I've built an x86 OpenSolaris server (Intel Xeon, running snv_91) with a zpool
of 15 750GB SATA disks, on which I've created a ZFS volume and exported it as
an iSCSI target by setting the shareiscsi=on property.

My problem is that when I connect to this target from any initiator (tested
with both Linux 2.6 and OpenSolaris snv_91 on SPARC and x86), the read/write
speed is dreadful (~3 megabytes/second!). When I test read/write performance
locally against the backing pool, I get excellent speeds. The same is true when
I use services such as NFS and FTP to move files between other hosts on the
network and the volume I am exporting as a target: I get the near-Gigabit
speeds I expect, which has me thinking this isn't a network problem (I've
already disabled the Nagle algorithm, if you're wondering). It's not until I
add the iSCSI target to the stack that the speeds go south, so I am concerned
that I may be missing something in the configuration of the target.
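
Roughly the comparison I have been running looks like this (a sketch, not my
exact commands; the file name and the initiator-side device name below are
placeholders):

target_host:~ # dd if=/dev/zero of=/pool0/ddtest bs=1024k count=4096
initiator_host:~ # dd if=/dev/zero of=/dev/rdsk/cXtYd0s2 bs=1024k count=1024

The first (local) run is fast; the second, against the iSCSI LUN, crawls along
at roughly 3 MB/s.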

Below are some details pertaining to my configuration. 

OpenSolaris iSCSI Target Host:

target_host:~ # zpool status pool0
  pool: pool0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
        spares
          c1t6d0    AVAIL

errors: No known data errors

target_host:~ # zfs get all pool0/vol0
NAME        PROPERTY         VALUE                  SOURCE
pool0/vol0  type             volume                 -
pool0/vol0  creation         Wed Jul  2 18:16 2008  -
pool0/vol0  used             5T                     -
pool0/vol0  available        7.92T                  -
pool0/vol0  referenced       34.2G                  -
pool0/vol0  compressratio    1.00x                  -
pool0/vol0  reservation      none                   default
pool0/vol0  volsize          5T                     -
pool0/vol0  volblocksize     8K                     -
pool0/vol0  checksum         on                     default
pool0/vol0  compression      off                    default
pool0/vol0  readonly         off                    default
pool0/vol0  shareiscsi       on                     local
pool0/vol0  copies           1                      default
pool0/vol0  refreservation   5T                     local
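
(For reference, the volume was created along these lines; this is a
reconstruction rather than the exact command, and note that volblocksize can
only be set at creation time:)

target_host:~ # zfs create -V 5T -o volblocksize=8K pool0/vol0
target_host:~ # zfs set shareiscsi=on pool0/vol0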


target_host:~ # iscsitadm list target -v pool0/vol0

Target: pool0/vol0
    iSCSI Name: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
    Alias: pool0/vol0
    Connections: 1
        Initiator:
            iSCSI Name: iqn.1986-03.com.sun:01:0003ba681e7f.486c0829
            Alias: unknown
    ACL list:
    TPGT list:
        TPGT: 1
    LUN information:
        LUN: 0
            GUID: 01304865b1b42a00486c29d2
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size: 5.0T
            Backing store: /dev/zvol/rdsk/pool0/vol0
            Status: online


OpenSolaris iSCSI Initiator Host:


initiator_host:~ # iscsiadm list target -vS iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
Target: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
        Alias: pool0/vol0
        TPGT: 1
        ISID: 402a
        Connections: 1
        CID: 0
          IP address (Local): 192.168.4.2:63960
          IP address (Peer): 192.168.4.3:3260
          Discovery Method: SendTargets
          Login Parameters (Negotiated):
                Data Sequence In Order: yes
                Data PDU In Order: yes
                Default Time To Retain: 20
                Default Time To Wait: 2
                Error Recovery Level: 0
                First Burst Length: 65536
                Immediate Data: yes
                Initial Ready To Transfer (R2T): yes
                Max Burst Length: 262144
                Max Outstanding R2T: 1

[zfs-discuss] Kota, Sudha is out of the office.

2008-07-03 Thread Sudha_Kota

I will be out of the office starting  07/03/2008 and will not return until
07/07/2008.

Please contact George Mederos, Shawn Luft or Bernard Wu for Unix Support.


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why RAID 5 stops working in 2009

2008-07-03 Thread Aaron Blew
My take is that since RAID-Z creates a stripe for every block
(http://blogs.sun.com/bonwick/entry/raid_z), it should be able to
rebuild the bad sectors on a per-block basis.  I'd assume that the
likelihood of having bad sectors in the same places on all the disks
is pretty low, since we're only reading the sectors related to the
block being rebuilt.  It also seems that fragmentation would work in
your favor here, since the stripes would be distributed across more of
the platter(s), hopefully protecting you from a wonky manufacturing
defect that causes UREs in the same place on each disk.

-Aaron


On Thu, Jul 3, 2008 at 12:24 PM, Jim <[EMAIL PROTECTED]> wrote:
> Anyone here read the article "Why RAID 5 stops working in 2009" at 
> http://blogs.zdnet.com/storage/?p=162
>
> Does RAIDZ have the same chance of unrecoverable read error as RAID5 in Linux 
> if the RAID has to be rebuilt because of a faulty disk?  I imagine so because 
> of the physical constraints that plague our hds.  Granted, the chance of 
> failure in my case shouldn't be nearly as high as I will most likely recruit 
> four or three 750gb drives- not in the order of 10tb.
>
> With my opensolaris NAS, I will be scrubbing every week (consumer grade 
> drives[every month for enterprise-grade]) as recommended in the ZFS best 
> practices guide.  If I "zpool status" and I see that the scrub is 
> increasingly fixing errors, would that mean that the disk is in fact headed 
> towards failure or perhaps that the natural expansion of disk usage is to 
> blame?
>
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Bob Friesenhahn
On Thu, 3 Jul 2008, Richard Elling wrote:
>
> nit: SATA disks are single port, so you would need a SAS 
> implementation to get multipathing to the disks.  This will not 
> significantly impact the overall availability of the data, however. 
> I did an availability analysis of thumper to show this. 
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Richard,

It seems that the "Thumper" system (with 48 SATA drives) has been 
pretty well analyzed now.  Is it possible for you to perform a similar
analysis of the new Sun Fire X4240 with its 16 SAS drives?  SAS drives
are usually faster than SATA drives and it is possible to multipath 
them (maybe not in this system?).

This system seems ideal for ZFS and should work great as a 
medium-sized data server or database server.  Maybe someone can run 
benchmarks on one and report the results here?
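
Even a crude sequential test would be interesting as a first pass.  Something
like the following sketch would do (pool and file names made up):

  # write then re-read a 16 GB file; the export/import empties the ARC
  dd if=/dev/zero of=/tank/bench bs=1024k count=16384
  zpool export tank && zpool import tank
  dd if=/tank/bench of=/dev/null bs=1024k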

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,  http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Richard Elling
Miles Nordin wrote:
>> "djm" == Darren J Moffat <[EMAIL PROTECTED]> writes:
>> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes:
>> 
>
>djm> Why are you planning on using RAIDZ-2 rather than mirroring ?
>
> isn't MTDL sometimes shorter for mirroring than raidz2?  I think that
> is the biggest point of raidz2, is it not?
>   

Yes.  For some MTTDL models, a 3-way mirror is roughly equivalent to a
3-disk raidz2 set, with the mirror being slightly better because you do
not require both of the other two disks to be functional during
reconstruction.  As the number of disks in the set increases, the MTTDL
goes down, so a 4-disk raidz2 will have lower MTTDL than a 3-disk
mirror.  Somewhere I have graphs which show this...
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
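
For reference, the usual back-of-the-envelope expressions behind those graphs
(assuming independent failures, a repair time MTTR, and N disks in the set)
are roughly:

\[
\mathrm{MTTDL}_{\text{raidz2}} \approx \frac{\mathrm{MTBF}^{3}}{N(N-1)(N-2)\,\mathrm{MTTR}^{2}}
\qquad
\mathrm{MTTDL}_{\text{3-way mirror}} \approx \frac{\mathrm{MTBF}^{3}}{6\,\mathrm{MTTR}^{2}}
\]

With N = 3 the two denominators coincide, which is why the 3-disk raidz2 and
the 3-way mirror come out close; each disk added to the raidz2 set only
shrinks its MTTDL.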

> bf> The probability of three disks independently dying during the
> bf> resilver
>
> The thing I never liked about MTDL models is their assuming disk
> failures are independent events.  It seems likely to get a bad batch
> of disks if you buy a single model from a single manufacturer, and buy
> all the disks at the same time.  They may have consecutive serial
> numbers, ship in the same box, u.s.w.
>   

You are correct in that the models assume independent failures.  Common
failures for nominally independent devices (e.g. vintages) can be modeled
using an adjusted MTBF.  For example, we sometimes see a vintage where the
MTBF is statistically significantly different from that of other vintages.
These can be difficult to predict, and any such predictions may not help
you make decisions.  Somewhere I talk about that...
http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent

> You can design around marginal power supplies that feed a bank of
> disks with excessive ripple voltage, cause them all to write
> marginally readable data, and later make you think the disks all went
> bad at once.  or use long fibre cables to put chassis in different
> rooms with separate aircon.  or tell yourself other strange disaster
> stories and design around them.  But fixing the lack of diversity in
> manufacturing and shipping seems hard.
>   

My favorite is the guy who zip-ties the fiber in a tight wad at the back
of the rack.  Fiber (and copper) cables have a minimum bend radius
specification.  In fiber cables, small cracks can occur which, over time,
become larger and cause attenuation.

If you are really interested in diversity, you need to copy the data
someplace far, far away, as many of the Katrina survivors learned.
But even that might not be enough diversity...
http://blogs.sun.com/relling/entry/diversity_revisited
http://blogs.sun.com/relling/entry/diversity_in_your_connections
> For my low-end stuff, I have been buying the two sides of mirrors from
> two companies, but I don't know how workable that is for people trying
> to look ``professional''.  It's also hard to do with raidz since there
> are so few hard drive brands left.  
>   

I agree, and do the same.

> Retailers ought to charge an extra markup for ``aging'' the drives
> for you like cheese, and maintian several color-coded warehouses in
> which to do the aging: ``sell me 10 drives that were aged for six
> months in the Green warehouse.''
>   

I just looked at our field data for disks through last month and
would say that aging won't buy you any assurance.  We are
seeing excellent and improving reliability.  Mind you, we are
selling enterprise-class disks from the top bins :-)

Meanwhile, thanks Miles for being a setup guy :-)
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Why RAID 5 stops working in 2009

2008-07-03 Thread Jim
Has anyone here read the article "Why RAID 5 stops working in 2009" at
http://blogs.zdnet.com/storage/?p=162

Does RAIDZ have the same chance of an unrecoverable read error as RAID5 on
Linux if the RAID has to be rebuilt because of a faulty disk?  I imagine so,
because of the physical constraints that plague our hard disks.  Granted, the
chance of failure in my case shouldn't be nearly as high, since I will most
likely use three or four 750 GB drives, not something on the order of 10 TB.

With my OpenSolaris NAS, I will be scrubbing every week (consumer-grade
drives; every month for enterprise-grade), as recommended in the ZFS Best
Practices Guide.  If I run "zpool status" and see that scrubs are increasingly
fixing errors, would that mean that the disk is in fact headed towards
failure, or perhaps that the natural growth of disk usage is to blame?
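
Concretely, is something like this a sensible early-warning check after each
weekly scrub (pool name made up)?

  # look at the scrub summary and the per-device error counters
  zpool status -v tank
  # reset the counters so next week's numbers start from zero
  zpool clear tank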
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Henrik Johansen
[Richard Elling] wrote:
> Don Enrique wrote:
>> Hi,
>>
>> I am looking for some best practice advice on a project that i am working on.
>>
>> We are looking at migrating ~40TB backup data to ZFS, with an annual data 
>> growth of
>> 20-25%.
>>
>> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 
>> vdevs ( 7 + 2 )
>> with one hotspare per 10 drives and just continue to expand that pool as 
>> needed.
>>
>> Between calculating the MTTDL and performance models i was hit by a rather 
>> scary thought.
>>
>> A pool comprised of X vdevs is no more resilient to data loss than the 
>> weakest vdev since loss
>> of a vdev would render the entire pool unusable.
>>   
>
> Yes, but a raidz2 vdev using enterprise class disks is very reliable.

That's nice to hear.

>> This means that i potentially could loose 40TB+ of data if three disks 
>> within the same RAIDZ-2
>> vdev should die before the resilvering of at least one disk is complete. 
>> Since most disks
>> will be filled i do expect rather long resilvering times.
>>
>> We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project 
>> with as much hardware
>> redundancy as we can get ( multiple controllers, dual cabeling, I/O 
>> multipathing, redundant PSUs,
>> etc.)
>>   
>
> nit: SATA disks are single port, so you would need a SAS implementation
> to get multipathing to the disks.  This will not significantly impact the
> overall availability of the data, however.  I did an availability  
> analysis of
> thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Yeah, I read your blog. Very informative indeed. 

I am using SAS HBA cards and SAS enclosures with SATA disks so I should
be fine.

>> I could use multiple pools but that would make data management harder which 
>> in it self is a lengthy
>> process in our shop.
>>
>> The MTTDL figures seem OK so how much should i need to worry ? Anyone having 
>> experience from
>> this kind of setup ?
>>   
>
> I think your design is reasonable.  We'd need to know the exact
> hardware details to be able to make more specific recommendations.
> -- richard

Well, my choice of hardware is kind of limited by two things:

1. We are a 100% Dell shop.
2. We already have lots of enclosures that I would like to reuse for this
   project.

The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards, and the enclosures are
Dell MD1000 disk arrays.

>

-- 
Med venlig hilsen / Best Regards

Henrik Johansen
[EMAIL PROTECTED]


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4200/J4400 Array

2008-07-03 Thread Richard Elling
Albert Chin wrote:
> On Thu, Jul 03, 2008 at 01:43:36PM +0300, Mertol Ozyoney wrote:
>   
>> You are right that J series do not have nvram onboard. However most Jbods
>> like HPS's MSA series have some nvram. 
>> The idea behind not using nvram on the Jbod's is 
>>
>> -) There is no use to add limited ram to a JBOD as disks already have a lot
>> of cache.
>> -) It's easy to design a redundant Jbod without nvram. If you have nvram and
>> need redundancy you need to design more complex HW and more complex firmware
>> -) Bateries are the first thing to fail 
>> -) Servers already have too much ram
>> 
>
> Well, if the server attached to the J series is doing ZFS/NFS,
> performance will increase with zfs:zfs_nocacheflush=1. But, without
> battery-backed NVRAM, this really isn't "safe". So, for this usage case,
> unless the server has battery-backed NVRAM, I don't see how the J series
> is good for ZFS/NFS usage.
>
>   

The zfs_nocacheflush problem should be mostly gone as the fix was
implemented in b74.   We really expect that this recommendation will
disappear, except in its viral form.
http://bugs.opensolaris.org/view_bug.do?bug_id=6462690

You really don't want to set this when using common, magnetic disks
in a JBOD (J-series means JBOD) because there is no non-volatile cache.

For good ZFS+NFS performance under attribute-creating-intensive
loads using JBODs, we recommend using a slog.
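
Attaching the slog is a one-liner once you have a suitable low-latency,
non-volatile device (pool and device names below are made up):

  zpool add mypool log c4t0d0
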
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Miles Nordin
> "djm" == Darren J Moffat <[EMAIL PROTECTED]> writes:
> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes:

   djm> Why are you planning on using RAIDZ-2 rather than mirroring ?

isn't MTTDL sometimes shorter for mirroring than raidz2?  I think that
is the biggest point of raidz2, is it not?

bf> The probability of three disks independently dying during the
bf> resilver

The thing I never liked about MTTDL models is their assumption that disk
failures are independent events.  It seems likely you'll get a bad batch
of disks if you buy a single model from a single manufacturer, and buy
all the disks at the same time.  They may have consecutive serial
numbers, ship in the same box, and so on.

You can design around marginal power supplies that feed a bank of
disks with excessive ripple voltage, cause them all to write
marginally readable data, and later make you think the disks all went
bad at once.  Or use long fibre cables to put chassis in different
rooms with separate aircon.  Or tell yourself other strange disaster
stories and design around them.  But fixing the lack of diversity in
manufacturing and shipping seems hard.

For my low-end stuff, I have been buying the two sides of mirrors from
two companies, but I don't know how workable that is for people trying
to look ``professional''.  It's also hard to do with raidz since there
are so few hard drive brands left.  

Retailers ought to charge an extra markup for ``aging'' the drives
for you like cheese, and maintain several color-coded warehouses in
which to do the aging: ``sell me 10 drives that were aged for six
months in the Green warehouse.''


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4200/J4400 Array

2008-07-03 Thread Albert Chin
On Thu, Jul 03, 2008 at 01:43:36PM +0300, Mertol Ozyoney wrote:
> You are right that J series do not have nvram onboard. However most Jbods
> like HPS's MSA series have some nvram. 
> The idea behind not using nvram on the Jbod's is 
> 
> -) There is no use to add limited ram to a JBOD as disks already have a lot
> of cache.
> -) It's easy to design a redundant Jbod without nvram. If you have nvram and
> need redundancy you need to design more complex HW and more complex firmware
> -) Bateries are the first thing to fail 
> -) Servers already have too much ram

Well, if the server attached to the J series is doing ZFS/NFS, performance
will increase with zfs:zfs_nocacheflush=1.  But without battery-backed NVRAM
this really isn't "safe".  So, unless the server itself has battery-backed
NVRAM, I don't see how the J series is a good fit for this ZFS/NFS use case.
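
For the archives, that tunable goes in /etc/system; shown here only for
completeness, not as a recommendation:

  * Disable ZFS cache-flush requests -- only safe when every device in
  * the pool has a non-volatile write cache
  set zfs:zfs_nocacheflush = 1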

-- 
albert chin ([EMAIL PROTECTED])
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Richard Elling
Don Enrique wrote:
> Hi,
>
> I am looking for some best practice advice on a project that i am working on.
>
> We are looking at migrating ~40TB backup data to ZFS, with an annual data 
> growth of
> 20-25%.
>
> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 
> vdevs ( 7 + 2 )
> with one hotspare per 10 drives and just continue to expand that pool as 
> needed.
>
> Between calculating the MTTDL and performance models i was hit by a rather 
> scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than the 
> weakest vdev since loss
> of a vdev would render the entire pool unusable.
>   

Yes, but a raidz2 vdev using enterprise-class disks is very reliable.

> This means that i potentially could loose 40TB+ of data if three disks within 
> the same RAIDZ-2
> vdev should die before the resilvering of at least one disk is complete. 
> Since most disks
> will be filled i do expect rather long resilvering times.
>
> We are using 750 GB Seagate (Enterprise Grade) SATA disks for this project 
> with as much hardware
> redundancy as we can get ( multiple controllers, dual cabeling, I/O 
> multipathing, redundant PSUs,
> etc.)
>   

nit: SATA disks are single-port, so you would need a SAS implementation
to get multipathing to the disks.  This will not significantly impact the
overall availability of the data, however.  I did an availability analysis
of Thumper to show this.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

> I could use multiple pools but that would make data management harder which 
> in it self is a lengthy
> process in our shop.
>
> The MTTDL figures seem OK so how much should i need to worry ? Anyone having 
> experience from
> this kind of setup ?
>   

I think your design is reasonable.  We'd need to know the exact
hardware details to be able to make more specific recommendations.
 -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Chris Cosby
I'm going down a bit of a different path with my reply here. I know that all
shops and their need for data are different, but hear me out.

1) You're backing up 40TB+ of data, increasing at 20-25% per year. That's
insane. Perhaps it's time to look at your backup strategy not from a hardware
perspective, but from a data retention perspective. Do you really need that
much data backed up? There has to be some way to get the volume down. If
not, you're at 100TB in just slightly over 4 years (assuming the 25% growth
factor). If your data is critical, my recommendation is to go find another
job and let someone else have that headache.

2) 40TB of backups is, at the best possible price, fifty 1TB drives (allowing
for parity and spares), roughly $12,500 for raw drive hardware. Enclosures add
some money, as do cables and such. For mirroring, ninety 1TB drives is $22,500
for the raw drives. In my world, I know yours is different, but the difference
between a $100,000 solution and a $75,000 solution is pretty negligible. The
short description here: you can afford to do mirrors. Really, you can. Any of
the parity solutions out there, whatever your strategy, is going to cause you
more trouble than you're ready to deal with.

I know these aren't complete solutions for you; it's just the stuff that was in
my head. The best possible solution, if you really need this kind of volume, is
to create something that never has to resilver. Use some nifty combination
of hardware and ZFS, like a couple of somethings that each export 20TB per
container as a single volume, and mirror those with ZFS for its end-to-end
checksumming and ease of management.

That's my considerably more than $0.02

On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn <
[EMAIL PROTECTED]> wrote:

> On Thu, 3 Jul 2008, Don Enrique wrote:
> >
> > This means that i potentially could loose 40TB+ of data if three
> > disks within the same RAIDZ-2 vdev should die before the resilvering
> > of at least one disk is complete. Since most disks will be filled i
> > do expect rather long resilvering times.
>
> Yes, this risk always exists.  The probability of three disks
> independently dying during the resilver is exceedingly low. The chance
> that your facility will be hit by an airplane during resilver is
> likely higher.  However, it is true that RAIDZ-2 does not offer the
> same ease of control over physical redundancy that mirroring does.
> If you were to use 10 independent chassis and split the RAIDZ-2
> uniformly across the chassis then the probability of a similar
> calamity impacting the same drives is driven by rack or facility-wide
> factors (e.g. building burning down) rather than shelf factors.
> However, if you had 10 RAID arrays mounted in the same rack and the
> rack falls over on its side during resilver then hope is still lost.
>
> I am not seeing any options for you here.  ZFS RAIDZ-2 is about as
> good as it gets and if you want everything in one huge pool, there
> will be more risk.  Perhaps there is a virtual filesystem layer which
> can be used on top of ZFS which emulates a larger filesystem but
> refuses to split files across pools.
>
> In the future it would be useful for ZFS to provide the option to not
> load-share across huge VDEVs and use VDEV-level space allocators.
>
> Bob
> ==
> Bob Friesenhahn
> [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
> GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>



-- 
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Bob Friesenhahn
On Thu, 3 Jul 2008, Don Enrique wrote:
>
> This means that i potentially could loose 40TB+ of data if three 
> disks within the same RAIDZ-2 vdev should die before the resilvering 
> of at least one disk is complete. Since most disks will be filled i 
> do expect rather long resilvering times.

Yes, this risk always exists.  The probability of three disks 
independently dying during the resilver is exceedingly low. The chance 
that your facility will be hit by an airplane during resilver is 
likely higher.  However, it is true that RAIDZ-2 does not offer the 
same ease of control over physical redundancy that mirroring does. 
If you were to use 10 independent chassis and split the RAIDZ-2 
uniformly across the chassis then the probability of a similar 
calamity impacting the same drives is driven by rack or facility-wide 
factors (e.g. building burning down) rather than shelf factors. 
However, if you had 10 RAID arrays mounted in the same rack and the 
rack falls over on its side during resilver then hope is still lost.

I am not seeing any options for you here.  ZFS RAIDZ-2 is about as 
good as it gets and if you want everything in one huge pool, there 
will be more risk.  Perhaps there is a virtual filesystem layer which 
can be used on top of ZFS which emulates a larger filesystem but 
refuses to split files across pools.

In the future it would be useful for ZFS to provide the option to not 
load-share across huge VDEVs and use VDEV-level space allocators.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,  http://www.GraphicsMagick.org/

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Don Enrique
> Don Enrique wrote:
> > Now, my initial plan was to create one large pool comprised of X
> > RAIDZ-2 vdevs ( 7 + 2 ) with one hotspare per 10 drives and just
> > continue to expand that pool as needed.
> > 
> > Between calculating the MTTDL and performance models i was hit by a
> > rather scary thought.
> > 
> > A pool comprised of X vdevs is no more resilient to data loss than
> > the weakest vdev since loss of a vdev would render the entire pool
> > unusable.
> > 
> > This means that i potentially could loose 40TB+ of data if three
> > disks within the same RAIDZ-2 vdev should die before the resilvering
> > of at least one disk is complete. Since most disks will be filled i
> > do expect rather long resilvering times.
> 
> Why are you planning on using RAIDZ-2 rather than mirroring ?

Mirroring would increase the cost significantly and is not within the budget of 
this project. 

> -- 
> Darren J Moffat
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discu
> ss
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Large zpool design considerations

2008-07-03 Thread Darren J Moffat
Don Enrique wrote:
> Now, my initial plan was to create one large pool comprised of X RAIDZ-2 
> vdevs ( 7 + 2 )
> with one hotspare per 10 drives and just continue to expand that pool as 
> needed.
> 
> Between calculating the MTTDL and performance models i was hit by a rather 
> scary thought.
> 
> A pool comprised of X vdevs is no more resilient to data loss than the 
> weakest vdev since loss
> of a vdev would render the entire pool unusable.
> 
> This means that i potentially could loose 40TB+ of data if three disks within 
> the same RAIDZ-2
> vdev should die before the resilvering of at least one disk is complete. 
> Since most disks
> will be filled i do expect rather long resilvering times.

Why are you planning on using RAIDZ-2 rather than mirroring?

-- 
Darren J Moffat
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Large zpool design considerations

2008-07-03 Thread Don Enrique
Hi,

I am looking for some best-practice advice on a project that I am working on.

We are looking at migrating ~40TB of backup data to ZFS, with an annual data
growth of 20-25%.

Now, my initial plan was to create one large pool comprised of X RAIDZ-2 vdevs
(7 + 2) with one hot spare per 10 drives, and just continue to expand that pool
as needed.
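
Each increment would look roughly like the sketch below (disk names are just
placeholders):

  # first 7+2 RAIDZ-2 vdev plus its hot spare
  zpool create backup raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 \
      c2t5d0 c2t6d0 c2t7d0 c3t0d0 spare c3t1d0
  # each later expansion adds another 9-disk RAIDZ-2 vdev
  zpool add backup raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 \
      c4t5d0 c4t6d0 c4t7d0 c5t0d0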

While working through the MTTDL and performance models, I was hit by a rather
scary thought.

A pool comprised of X vdevs is no more resilient to data loss than the weakest
vdev, since the loss of a single vdev would render the entire pool unusable.

This means that I could potentially lose 40TB+ of data if three disks within
the same RAIDZ-2 vdev die before the resilvering of at least one of them is
complete. Since most disks will be nearly full, I expect rather long
resilvering times.

We are using 750 GB Seagate (enterprise-grade) SATA disks for this project,
with as much hardware redundancy as we can get (multiple controllers, dual
cabling, I/O multipathing, redundant PSUs, etc.).

I could use multiple pools, but that would make data management harder, which
in itself is a lengthy process in our shop.

The MTTDL figures seem OK, so how much do I really need to worry? Does anyone
have experience with this kind of setup?

/Don E.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4200/J4400 Array

2008-07-03 Thread James C. McPherson
Mertol Ozyoney wrote:
> Hi;
> 
> You are right that J series do not have nvram onboard. However most Jbods
> like HPS's MSA series have some nvram. 
> The idea behind not using nvram on the Jbod's is 
> 
> -) There is no use to add limited ram to a JBOD as disks already have a lot
> of cache.
> -) It's easy to design a redundant Jbod without nvram. If you have nvram and
> need redundancy you need to design more complex HW and more complex firmware

which translates to higher costs for hardware and higher costs
for the software (firmware and upwards) required to manage it.

> -) Bateries are the first thing to fail

That's why high-end arrays like those from HDS tend to have
enough battery storage to keep disks spinning for nearly
72 hours. Redundancy costs!

> -) Servers already have too much ram

Not when the users are looking at flash-heavy websites ;-)




James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4200/J4400 Array

2008-07-03 Thread Mertol Ozyoney
You should be able to buy them today. GA should be next week
Mertol 



Mertol Ozyoney 
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email [EMAIL PROTECTED]



-Original Message-
From: Tim [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 02, 2008 9:45 PM
To: [EMAIL PROTECTED]; Ben B.; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] J4200/J4400 Array

So when are they going to release msrp?



On 7/2/08, Mertol Ozyoney <[EMAIL PROTECTED]> wrote:
> Availibilty may depend on where you are located but J4200 and J4400 are
> available for most regions.
> Those equipment is engineered to go well with Sun open storage components
> like ZFS.
> Besides price advantage, J4200 and J4400 offers unmatched bandwith to
hosts
> or to stacking units.
>
> You can get the price from your sun account manager
>
> Best regards
> Mertol
>
>
>
> Mertol Ozyoney
> Storage Practice - Sales Manager
>
> Sun Microsystems, TR
> Istanbul TR
> Phone +902123352200
> Mobile +905339310752
> Fax +90212335
> Email [EMAIL PROTECTED]
>
>
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Ben B.
> Sent: Wednesday, July 02, 2008 2:49 PM
> To: zfs-discuss@opensolaris.org
> Subject: [zfs-discuss] J4200/J4400 Array
>
> Hi,
>
> According to the Sun Handbook, there is a new array :
> SAS interface
> 12 disks SAS or SATA
>
> ZFS could be used nicely with this box.
>
> There is an another version called
> J4400 with 24 disks.
>
> Doc is here :
> http://docs.sun.com/app/docs/coll/j4200
>
> Does someone know price and availability for these products ?
>
> Best Regards,
> Ben
>
>
> This message posted from opensolaris.org
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] J4200/J4400 Array

2008-07-03 Thread Mertol Ozyoney
Hi;

You are right that the J series does not have NVRAM onboard. However, most
JBODs, like HP's MSA series, do have some NVRAM.
The reasoning behind not putting NVRAM in the JBODs is:

-) There is no point in adding a limited amount of RAM to a JBOD, as the disks
already have a lot of cache.
-) It's easy to design a redundant JBOD without NVRAM. If you have NVRAM and
need redundancy, you need to design more complex HW and more complex firmware.
-) Batteries are the first thing to fail.
-) Servers already have plenty of RAM.

Best regards




Mertol Ozyoney 
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email [EMAIL PROTECTED]



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Albert Chin
Sent: Wednesday, July 02, 2008 9:04 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] J4200/J4400 Array

On Wed, Jul 02, 2008 at 04:49:26AM -0700, Ben B. wrote:
> According to the Sun Handbook, there is a new array :
> SAS interface
> 12 disks SAS or SATA
> 
> ZFS could be used nicely with this box.

Doesn't seem to have any NVRAM storage on board, so seems like JBOD.

> There is an another version called
> J4400 with 24 disks.
> 
> Doc is here :
> http://docs.sun.com/app/docs/coll/j4200

-- 
albert chin ([EMAIL PROTECTED])
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool upgrade -v

2008-07-03 Thread Boyd Adamson
"Walter Faleiro" <[EMAIL PROTECTED]> writes:

> Hi,
> I reinstalled our Solaris 10 box using the latest update available.
> However I could not upgrade the zpool
>
> bash-3.00# zpool upgrade -v
> This system is currently running ZFS version 4.
>
> The following versions are supported:
>
> VER  DESCRIPTION
> ---  
>  1   Initial ZFS version
>  2   Ditto blocks (replicated metadata)
>  3   Hot spares and double parity RAID-Z
>  4   zpool history
>
> For more information on a particular version, including supported releases,
> see:
>
> http://www.opensolaris.org/os/community/zfs/version/N
>
> Where 'N' is the version number.
>
> bash-3.00# zpool upgrade -a
> This system is currently running ZFS version 4.
>
> All pools are formatted using this version.
>
>
> The sun docs said to use zpool upgrade -a. Looks like I have missed something.

Not that I can see. Not every Solaris update includes a change to the
zpool version number. Zpool version 4 is the most recent version on
Solaris 10 releases.

HTH,

Boyd
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS configuration for VMware

2008-07-03 Thread Ross
Regarding the error checking, as others suggested, you're best off buying two
devices and mirroring them.  ZFS has great error checking, so why not use it :D
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on
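
Mirroring the slog is straightforward too; a sketch, with invented device
names:

  zpool add tank log mirror c3t0d0 c3t1d0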
 
And regarding the memory loss after the battery runs down, that's no different
from any hardware RAID controller with battery-backed cache, which is exactly
how this should be seen.  ZFS clears the ZIL on a clean shutdown, so the only
time you need to worry about battery life is if you have a sudden power
failure, and in that situation I'd much rather have my data being written to
the iRAM than to disk: with the greater speed, there's a far greater chance of
the system having had time to finish its writes, and a far better chance that I
can power it on again and have ZFS recover all my data.
 
I do agree the iRAM looks like a fringe product, but to me it's a fringe 
product that works very well for ZFS if you can fit it in your chassis.
 
Btw, your wishlist is pretty much a word-for-word description of the high-end
model of the 'hyperdrive'.  It supports up to eight 2GB ECC DDR chips, it's got
a 6-hour backup battery (with optional external power too), and it supports
copying the data to a laptop or compact flash disk on power failure.  The only
downside for me is the price: around £1,700 to get your hands on a 16GB one:
http://www.hyperossystems.co.uk/

Ross
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zpool upgrade -v

2008-07-03 Thread Walter Faleiro
Hi,
I reinstalled our Solaris 10 box using the latest update available.
However I could not upgrade the zpool

bash-3.00# zpool upgrade -v
This system is currently running ZFS version 4.

The following versions are supported:

VER  DESCRIPTION
---  
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history

For more information on a particular version, including supported releases,
see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.

bash-3.00# zpool upgrade -a
This system is currently running ZFS version 4.

All pools are formatted using this version.


The Sun docs said to use zpool upgrade -a. Looks like I have missed
something.


--Walter

On Fri, Jun 13, 2008 at 7:55 PM, Al Hopper <[EMAIL PROTECTED]> wrote:

> On Fri, Jun 13, 2008 at 4:48 PM, Dick Hoogendijk <[EMAIL PROTECTED]> wrote:
> > I have a disk on ZFS created by snv_79b (sxde4) and one on ZFS created
> > by snv_90 (sxce). I wonder, how do I know a ZFS version has to be
> > upgraded or not? I.e. are the ZFS versions of sxde and sxce the same?
> > How do I verify that?
>
> Hi Dick (from the solaris on x86 list),
>
> - First off, and you may already know this, you can upgrade - but it's
> a one-way ticket.  You can't change your mind and go "backwards", as
> in, down-grade to a previous release.  And, what if you want to
> restore a snapshot to a box running an older release of ZFS...
>
> - Secondly, you're not *required* to upgrade.  If there is even a 1 in
> a 1,000,000 chance that you might want to use the pool with a previous
> release of *olaris - *don't* do it!  And this includes moving the pool
> over to a different *olaris release - which is a requirement that
> cannot always be foreseen.
>
> - 3rd, in many cases, there is no "loss of features" by not upgrading.
>  Again - I say - in most cases - not in all cases.
>
> To examine the current version or to upgrade it, please read (latest
> version of) the ZFS admin guide doc #  817-2271.  Look at "zpool get
> version poolName" and "zpool upgrade"
>
> Recommendation (based on personal experience) leave the ondisk format
> at the "SXDE" default version for now.
>
> Regards,
>
> > --
> > Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
> > ++ http://nagual.nl/ + SunOS sxce snv90 ++
> >
> > ___
> > zfs-discuss mailing list
> > zfs-discuss@opensolaris.org
> > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> >
>
>
>
> --
> Al Hopper Logical Approach Inc,Plano,TX [EMAIL PROTECTED]
>  Voice: 972.379.2133 Timezone: US CDT
> OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
> http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
> ___
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
>
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss