Re: [zfs-discuss] Why RAID 5 stops working in 2009
On Thu, Jul 3, 2008 at 3:09 PM, Aaron Blew <[EMAIL PROTECTED]> wrote:
> My take is that since RAID-Z creates a stripe for every block
> (http://blogs.sun.com/bonwick/entry/raid_z), it should be able to
> rebuild the bad sectors on a per-block basis. I'd assume that the
> likelihood of having bad sectors in the same places on all the disks
> is pretty low since we're only reading the sectors related to the
> block being rebuilt. It also seems that fragmentation would work in
> your favor here since the stripes would be distributed across more of
> the platter(s), hopefully protecting you from a wonky manufacturing
> defect that causes UREs in the same place on the disk.
>
> -Aaron

The per-block statement above is important - ZFS will only rebuild the
blocks that actually hold data. A 100 TB pool with 1 GB in use will
rebuild 1 GB. As such, resilver time is more a factor of the amount of
data than of the size of the RAID device.

A periodic "zpool scrub" will likely turn up read errors before you hit
a drive failure AND unrelated read errors at the same time. Since ZFS
merges the volume management and file system layers, an uncorrectable
read turns into ZFS saying "file /a/b/c is corrupt - you need to
restore it" rather than traditional RAID-5 saying "this 12 TB volume is
corrupt - restore it". ZFS already makes multiple copies of metadata,
so if you were "lucky" and the corruption happened to hit metadata, it
should be able to get a working copy from elsewhere. Of course, raidz2
further decreases your chances of losing data.

I would highly recommend reading Richard Elling's comments in this
area.
For example:

http://blogs.sun.com/relling/entry/raid_recommendations_space_vs_mttdl
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance
http://blogs.sun.com/relling/entry/a_story_of_two_mttdl
http://opensolaris.org/jive/thread.jspa?threadID=65564#255257

--
Mike Gerdts
http://mgerdts.blogspot.com/
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
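The periodic scrub mentioned above is easy to automate from cron. A
minimal sketch, assuming a pool named "tank" (substitute your own pool
name; the schedule is arbitrary):

```shell
# Kick off a scrub by hand:
zpool scrub tank

# Check the result; -x prints only unhealthy pools, so silence is good news:
zpool status -x
zpool status -v tank    # per-device error counters and scrub progress

# Weekly scrub, Sundays at 03:00, via root's crontab:
#   0 3 * * 0 /usr/sbin/zpool scrub tank
```

If the CKSUM counters climb from one scrub to the next, that is exactly
the early warning described above - you find the marginal disk before a
failure plus an unrelated read error can combine.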
[zfs-discuss] Poor read/write performance when using ZFS iSCSI target
Greetings,

I want to take advantage of the iSCSI target support in the latest
release (snv_91) of OpenSolaris, and I'm running into some performance
problems when reading/writing from/to my target. I'm including as much
detail as I can, so bear with me here...

I've built an x86 OpenSolaris server (Intel Xeon running snv_91) with a
zpool of 15 750 GB SATA disks, from which I've created and exported a
ZFS volume with the shareiscsi=on property set to generate an iSCSI
target.

My problem is that when I connect to this target from any initiator
(tested with both Linux 2.6 and OpenSolaris snv_91, SPARC and x86), the
read/write speed is dreadful (~3 megabytes/second!). When I test
read/write performance locally against the backing pool, I get
excellent speeds. The same is true when I use services such as NFS and
FTP to move files between other hosts on the network and the volume I
am exporting as a target. Doing that, I have achieved the near-Gigabit
speeds I expect, which has me thinking this isn't a network problem of
some sort (I've already disabled the Nagle algorithm, if you're
wondering). It's not until I add the iSCSI target to the stack that the
speeds go south, so I am concerned that I may be missing something in
the configuration of the target. Below are some details pertaining to
my configuration.
OpenSolaris iSCSI Target Host:

target_host:~ # zpool status pool0
  pool: pool0
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool0       ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t0d0  ONLINE       0     0     0
            c0t1d0  ONLINE       0     0     0
            c0t2d0  ONLINE       0     0     0
            c0t3d0  ONLINE       0     0     0
            c0t4d0  ONLINE       0     0     0
            c0t5d0  ONLINE       0     0     0
            c0t6d0  ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            c0t7d0  ONLINE       0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  ONLINE       0     0     0
            c1t2d0  ONLINE       0     0     0
            c1t3d0  ONLINE       0     0     0
            c1t4d0  ONLINE       0     0     0
            c1t5d0  ONLINE       0     0     0
        spares
          c1t6d0    AVAIL

errors: No known data errors

target_host:~ # zfs get all pool0/vol0
NAME        PROPERTY        VALUE                  SOURCE
pool0/vol0  type            volume                 -
pool0/vol0  creation        Wed Jul  2 18:16 2008  -
pool0/vol0  used            5T                     -
pool0/vol0  available       7.92T                  -
pool0/vol0  referenced      34.2G                  -
pool0/vol0  compressratio   1.00x                  -
pool0/vol0  reservation     none                   default
pool0/vol0  volsize         5T                     -
pool0/vol0  volblocksize    8K                     -
pool0/vol0  checksum        on                     default
pool0/vol0  compression     off                    default
pool0/vol0  readonly        off                    default
pool0/vol0  shareiscsi      on                     local
pool0/vol0  copies          1                      default
pool0/vol0  refreservation  5T                     local

target_host:~ # iscsitadm list target -v pool0/vol0
Target: pool0/vol0
    iSCSI Name: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
    Alias: pool0/vol0
    Connections: 1
        Initiator:
            iSCSI Name: iqn.1986-03.com.sun:01:0003ba681e7f.486c0829
            Alias: unknown
    ACL list:
    TPGT list:
        TPGT: 1
    LUN information:
        LUN: 0
            GUID: 01304865b1b42a00486c29d2
            VID: SUN
            PID: SOLARIS
            Type: disk
            Size: 5.0T
            Backing store: /dev/zvol/rdsk/pool0/vol0
            Status: online

OpenSolaris iSCSI Initiator Host:

initiator_host:~ # iscsiadm list target -vS iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
Target: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
    Alias: pool0/vol0
    TPGT: 1
    ISID: 402a
    Connections: 1
        CID: 0
          IP address (Local): 192.168.4.2:63960
          IP address (Peer): 192.168.4.3:3260
        Discovery Method: SendTargets
        Login Parameters (Negotiated):
            Data Sequence In Order: yes
            Data PDU In Order: yes
            Default Time To Retain: 20
            Default Time To Wait: 2
            Error Recovery Level: 0
            First Burst Length: 65536
            Immediate Data: yes
            Initial Ready To Transfer (R2T): yes
            Max Burst Length: 262144
            Max Outstanding R2T: 1
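One configuration detail that stands out above is the 8K volblocksize
on a 5 TB zvol. Whether that is actually the cause of the slowdown is a
guess, but it is cheap to test with a scratch volume. volblocksize can
only be set at creation time (the -b flag), so changing it means
re-creating the zvol:

```shell
# WARNING: destroys the volume and its contents -- use a scratch volume
# for testing, not one holding data you care about.
zfs destroy pool0/vol0
zfs create -b 64K -V 5T pool0/vol0   # -b sets volblocksize at creation
zfs set shareiscsi=on pool0/vol0
```

If throughput from the initiator changes materially with a larger block
size, that points at read-modify-write amplification between the
initiator's I/O size and the zvol's block size.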
[zfs-discuss] Kota, Sudha is out of the office.
I will be out of the office starting 07/03/2008 and will not return
until 07/07/2008. Please contact George Mederos, Shawn Luft or Bernard
Wu for Unix support.
Re: [zfs-discuss] Why RAID 5 stops working in 2009
My take is that since RAID-Z creates a stripe for every block
(http://blogs.sun.com/bonwick/entry/raid_z), it should be able to
rebuild the bad sectors on a per-block basis. I'd assume that the
likelihood of having bad sectors in the same places on all the disks is
pretty low since we're only reading the sectors related to the block
being rebuilt. It also seems that fragmentation would work in your
favor here since the stripes would be distributed across more of the
platter(s), hopefully protecting you from a wonky manufacturing defect
that causes UREs in the same place on the disk.

-Aaron

On Thu, Jul 3, 2008 at 12:24 PM, Jim <[EMAIL PROTECTED]> wrote:
> Anyone here read the article "Why RAID 5 stops working in 2009" at
> http://blogs.zdnet.com/storage/?p=162
>
> Does RAIDZ have the same chance of unrecoverable read error as RAID-5
> on Linux if the RAID has to be rebuilt because of a faulty disk? I
> imagine so, because of the physical constraints that plague our hard
> drives. Granted, the chance of failure in my case shouldn't be nearly
> as high, as I will most likely recruit three or four 750 GB drives -
> not on the order of 10 TB.
>
> With my OpenSolaris NAS, I will be scrubbing every week
> (consumer-grade drives; every month for enterprise-grade) as
> recommended in the ZFS Best Practices Guide. If I run "zpool status"
> and see that the scrub is increasingly fixing errors, would that mean
> that the disk is in fact headed towards failure, or perhaps that the
> natural expansion of disk usage is to blame?
Re: [zfs-discuss] Large zpool design considerations
On Thu, 3 Jul 2008, Richard Elling wrote:
>
> nit: SATA disks are single port, so you would need a SAS
> implementation to get multipathing to the disks. This will not
> significantly impact the overall availability of the data, however.
> I did an availability analysis of thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Richard,

It seems that the "Thumper" system (with 48 SATA drives) has been
pretty well analyzed now. Is it possible for you to perform a similar
analysis of the new Sun Fire X4240 with its 16 SAS drives? SAS drives
are usually faster than SATA drives, and it is possible to multipath
them (maybe not in this system?). This system seems ideal for ZFS and
should work great as a medium-sized data server or database server.

Maybe someone can run benchmarks on one and report the results here?

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Large zpool design considerations
Miles Nordin wrote:
>> "djm" == Darren J Moffat <[EMAIL PROTECTED]> writes:
>> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes:
>
>    djm> Why are you planning on using RAIDZ-2 rather than mirroring ?
>
> isn't MTTDL sometimes shorter for mirroring than raidz2? I think that
> is the biggest point of raidz2, is it not?

Yes. For some MTTDL models, a 3-way mirror is roughly equivalent to a
3-disk raidz2 set, with the mirror being slightly better because you do
not require both of the other two disks to be functional during
reconstruction. As the number of disks in the set increases, the MTTDL
goes down, so a 4-disk raidz2 will have lower MTTDL than a 3-disk
mirror. Somewhere I have graphs which show this...
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_performance

>    bf> The probability of three disks independently dying during the
>    bf> resilver
>
> The thing I never liked about MTTDL models is their assuming disk
> failures are independent events. It seems likely to get a bad batch
> of disks if you buy a single model from a single manufacturer, and
> buy all the disks at the same time. They may have consecutive serial
> numbers, ship in the same box, and so on.

You are correct in that the models assume independent failures. Common
failures for independent devices (e.g. vintages) can be modeled using
an adjusted MTBF. For example, we sometimes see a vintage where the
MTBF is statistically significantly different from other vintages.
These can be difficult to predict, and any such predictions may not
help you make decisions. Somewhere I talk about that...
http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent

> You can design around marginal power supplies that feed a bank of
> disks with excessive ripple voltage, cause them all to write
> marginally readable data, and later make you think the disks all went
> bad at once. Or use long fibre cables to put chassis in different
> rooms with separate aircon. Or tell yourself other strange disaster
> stories and design around them. But fixing the lack of diversity in
> manufacturing and shipping seems hard.

My favorite is the guy who zip-ties the fiber in a tight wad at the
back of the rack. Fiber (and copper) cables have a minimum bend radius
specification. In fiber cables, small cracks can occur which, over
time, become larger and cause attenuation. If you are really interested
in diversity, you need to copy the data someplace far, far away, as
many of the Katrina survivors learned. But even that might not be
enough diversity...
http://blogs.sun.com/relling/entry/diversity_revisited
http://blogs.sun.com/relling/entry/diversity_in_your_connections

> For my low-end stuff, I have been buying the two sides of mirrors
> from two companies, but I don't know how workable that is for people
> trying to look ``professional''. It's also hard to do with raidz
> since there are so few hard drive brands left.

I agree, and do the same.

> Retailers ought to charge an extra markup for ``aging'' the drives
> for you like cheese, and maintain several color-coded warehouses in
> which to do the aging: ``sell me 10 drives that were aged for six
> months in the Green warehouse.''

I just looked at our field data for disks through last month and would
say that aging won't buy you any assurance. We are seeing excellent and
improving reliability. Mind you, we are selling enterprise-class disks
from the top bins :-)

Meanwhile, thanks Miles for being a setup guy :-)
-- richard
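The mirror-vs-raidz2 comparison above can be sketched numerically. This
is only the simple independent-failure model being discussed (not
Richard's full analysis), and the inputs are made up for illustration:
MTBF of 1,000,000 hours and a 24-hour resilver:

```shell
# MTTDL under the independent-failure model, in hours:
#   2-way mirror:  MTBF^2 / (2 * MTTR)
#   N-disk raidz2: MTBF^3 / (N*(N-1)*(N-2) * MTTR^2)
awk 'BEGIN {
  mtbf = 1e6; mttr = 24; n = 9            # n=9 models a 7+2 raidz2 set
  mirror = mtbf^2 / (2 * mttr)
  raidz2 = mtbf^3 / (n * (n-1) * (n-2) * mttr^2)
  printf "2-way mirror MTTDL:  %.3e hours\n", mirror
  printf "9-disk raidz2 MTTDL: %.3e hours\n", raidz2
}'
```

With these inputs the wide raidz2 still comes out well ahead of a 2-way
mirror, which is why the interesting part of the thread is the
common-cause failures (bad vintages, shared power, shared shipping
boxes) that this model ignores.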
[zfs-discuss] Why RAID 5 stops working in 2009
Anyone here read the article "Why RAID 5 stops working in 2009" at
http://blogs.zdnet.com/storage/?p=162

Does RAIDZ have the same chance of unrecoverable read error as RAID-5
on Linux if the RAID has to be rebuilt because of a faulty disk? I
imagine so, because of the physical constraints that plague our hard
drives. Granted, the chance of failure in my case shouldn't be nearly
as high, as I will most likely recruit three or four 750 GB drives -
not on the order of 10 TB.

With my OpenSolaris NAS, I will be scrubbing every week (consumer-grade
drives; every month for enterprise-grade) as recommended in the ZFS
Best Practices Guide. If I run "zpool status" and see that the scrub is
increasingly fixing errors, would that mean that the disk is in fact
headed towards failure, or perhaps that the natural expansion of disk
usage is to blame?
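The arithmetic behind the ZDNet article is easy to reproduce. Assuming
the spec-sheet URE rate of one error per 10^14 bits and independent
errors, the chance of at least one URE while reading D bytes during a
rebuild is approximately 1 - exp(-8*D*1e-14):

```shell
# P(at least one URE) for two rebuild sizes: Jim's 3 surviving 750 GB
# drives (~2.25e12 bytes to read) vs the article's ~10 TB case.
awk 'BEGIN {
  rate = 1e-14                            # URE probability per bit read
  split("2.25e12 1e13", bytes, " ")
  label[1] = "3 x 750 GB rebuild"; label[2] = "10 TB rebuild"
  for (i = 1; i <= 2; i++) {
    p = 1 - exp(-8 * bytes[i] * rate)
    printf "%-18s  P(URE) ~= %.3f\n", label[i], p
  }
}'
```

The smaller array faces roughly a 1-in-6 chance per full-read rebuild,
the 10 TB case better than even odds. ZFS does not escape those odds;
as noted elsewhere in the thread, what changes is the blast radius - a
URE during a raidz resilver costs you one identifiable file, not the
whole volume.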
Re: [zfs-discuss] Large zpool design considerations
[Richard Elling] wrote:
> Don Enrique wrote:
>> Hi,
>>
>> I am looking for some best practice advice on a project that I am
>> working on.
>>
>> We are looking at migrating ~40 TB of backup data to ZFS, with an
>> annual data growth of 20-25%.
>>
>> Now, my initial plan was to create one large pool comprised of X
>> RAIDZ-2 vdevs (7 + 2) with one hot spare per 10 drives and just
>> continue to expand that pool as needed.
>>
>> Between calculating the MTTDL and performance models I was hit by a
>> rather scary thought.
>>
>> A pool comprised of X vdevs is no more resilient to data loss than
>> the weakest vdev, since loss of a vdev would render the entire pool
>> unusable.
>
> Yes, but a raidz2 vdev using enterprise class disks is very reliable.

That's nice to hear.

>> This means that I potentially could lose 40 TB+ of data if three
>> disks within the same RAIDZ-2 vdev should die before the resilvering
>> of at least one disk is complete. Since most disks will be filled I
>> do expect rather long resilvering times.
>>
>> We are using 750 GB Seagate (Enterprise Grade) SATA disks for this
>> project with as much hardware redundancy as we can get (multiple
>> controllers, dual cabling, I/O multipathing, redundant PSUs, etc.)
>
> nit: SATA disks are single port, so you would need a SAS
> implementation to get multipathing to the disks. This will not
> significantly impact the overall availability of the data, however.
> I did an availability analysis of thumper to show this.
> http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

Yeah, I read your blog. Very informative indeed. I am using SAS HBA
cards and SAS enclosures with SATA disks, so I should be fine.

>> I could use multiple pools, but that would make data management
>> harder, which in itself is a lengthy process in our shop.
>>
>> The MTTDL figures seem OK, so how much should I need to worry?
>> Anyone having experience with this kind of setup?
>
> I think your design is reasonable. We'd need to know the exact
> hardware details to be able to make more specific recommendations.
> -- richard

Well, my choice of hardware is kind of limited by two things:

1. We are a 100% Dell shop.
2. We already have lots of enclosures that I would like to reuse for
   this project.

The HBA cards are SAS 5/E (LSI SAS1068 chipset) cards; the enclosures
are Dell MD1000 disk arrays.

--
Med venlig hilsen / Best Regards

Henrik Johansen
[EMAIL PROTECTED]
Re: [zfs-discuss] J4200/J4400 Array
Albert Chin wrote:
> On Thu, Jul 03, 2008 at 01:43:36PM +0300, Mertol Ozyoney wrote:
>
>> You are right that the J series do not have NVRAM onboard. However,
>> most JBODs, like HP's MSA series, have some NVRAM.
>> The idea behind not using NVRAM in the JBODs is:
>>
>> -) There is no use in adding limited RAM to a JBOD, as disks already
>>    have a lot of cache.
>> -) It's easy to design a redundant JBOD without NVRAM. If you have
>>    NVRAM and need redundancy, you need to design more complex HW and
>>    more complex firmware.
>> -) Batteries are the first thing to fail.
>> -) Servers already have too much RAM.
>
> Well, if the server attached to the J series is doing ZFS/NFS,
> performance will increase with zfs:zfs_nocacheflush=1. But without
> battery-backed NVRAM, this really isn't "safe". So, for this usage
> case, unless the server has battery-backed NVRAM, I don't see how the
> J series is good for ZFS/NFS usage.

The zfs_nocacheflush problem should be mostly gone, as the fix was
implemented in b74. We really expect that this recommendation will
disappear, except in its viral form.
http://bugs.opensolaris.org/view_bug.do?bug_id=6462690

You really don't want to set this when using common, magnetic disks in
a JBOD (the J in J-series means JBOD) because there is no non-volatile
cache. For good ZFS+NFS performance under attribute-creating-intensive
loads using JBODs, we recommend using a slog.
-- richard
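The slog recommendation translates into a one-line pool change. A
sketch, with hypothetical device names standing in for the NVRAM or SSD
log devices:

```shell
# Add a dedicated ZIL (slog) device to an existing pool:
zpool add pool0 log c2t0d0

# Better: mirror the slog so a log-device failure is survivable.
zpool add pool0 log mirror c2t0d0 c2t1d0
```

Synchronous NFS writes (attribute creation, small-file workloads) then
land on the fast log device instead of waiting on cache flushes to the
JBOD's spinning disks, without giving up the safety that
zfs_nocacheflush sacrifices.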
Re: [zfs-discuss] Large zpool design considerations
> "djm" == Darren J Moffat <[EMAIL PROTECTED]> writes:
> "bf" == Bob Friesenhahn <[EMAIL PROTECTED]> writes:

   djm> Why are you planning on using RAIDZ-2 rather than mirroring ?

isn't MTTDL sometimes shorter for mirroring than raidz2? I think that
is the biggest point of raidz2, is it not?

   bf> The probability of three disks independently dying during the
   bf> resilver

The thing I never liked about MTTDL models is their assuming disk
failures are independent events. It seems likely to get a bad batch of
disks if you buy a single model from a single manufacturer, and buy all
the disks at the same time. They may have consecutive serial numbers,
ship in the same box, and so on.

You can design around marginal power supplies that feed a bank of disks
with excessive ripple voltage, cause them all to write marginally
readable data, and later make you think the disks all went bad at once.
Or use long fibre cables to put chassis in different rooms with
separate aircon. Or tell yourself other strange disaster stories and
design around them. But fixing the lack of diversity in manufacturing
and shipping seems hard.

For my low-end stuff, I have been buying the two sides of mirrors from
two companies, but I don't know how workable that is for people trying
to look ``professional''. It's also hard to do with raidz since there
are so few hard drive brands left.

Retailers ought to charge an extra markup for ``aging'' the drives for
you like cheese, and maintain several color-coded warehouses in which
to do the aging: ``sell me 10 drives that were aged for six months in
the Green warehouse.''
Re: [zfs-discuss] J4200/J4400 Array
On Thu, Jul 03, 2008 at 01:43:36PM +0300, Mertol Ozyoney wrote:
> You are right that the J series do not have NVRAM onboard. However,
> most JBODs, like HP's MSA series, have some NVRAM.
> The idea behind not using NVRAM in the JBODs is:
>
> -) There is no use in adding limited RAM to a JBOD, as disks already
>    have a lot of cache.
> -) It's easy to design a redundant JBOD without NVRAM. If you have
>    NVRAM and need redundancy, you need to design more complex HW and
>    more complex firmware.
> -) Batteries are the first thing to fail.
> -) Servers already have too much RAM.

Well, if the server attached to the J series is doing ZFS/NFS,
performance will increase with zfs:zfs_nocacheflush=1. But without
battery-backed NVRAM, this really isn't "safe". So, for this usage
case, unless the server has battery-backed NVRAM, I don't see how the
J series is good for ZFS/NFS usage.

--
albert chin ([EMAIL PROTECTED])
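For reference, the tunable under discussion is normally set in
/etc/system and takes effect after a reboot. Shown only to make the
discussion concrete - as the rest of the thread makes clear, do not
enable it unless something non-volatile sits behind the write cache:

```
* /etc/system fragment -- disables ZFS cache-flush requests to disks.
* UNSAFE without battery-backed cache: a power loss can lose or corrupt
* writes the disks had only in volatile cache.
set zfs:zfs_nocacheflush = 1
```

A safer route to the same NFS performance is a dedicated slog device,
which keeps cache flushes enabled for the data disks.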
Re: [zfs-discuss] Large zpool design considerations
Don Enrique wrote:
> Hi,
>
> I am looking for some best practice advice on a project that I am
> working on.
>
> We are looking at migrating ~40 TB of backup data to ZFS, with an
> annual data growth of 20-25%.
>
> Now, my initial plan was to create one large pool comprised of X
> RAIDZ-2 vdevs (7 + 2) with one hot spare per 10 drives and just
> continue to expand that pool as needed.
>
> Between calculating the MTTDL and performance models I was hit by a
> rather scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than
> the weakest vdev, since loss of a vdev would render the entire pool
> unusable.

Yes, but a raidz2 vdev using enterprise class disks is very reliable.

> This means that I potentially could lose 40 TB+ of data if three
> disks within the same RAIDZ-2 vdev should die before the resilvering
> of at least one disk is complete. Since most disks will be filled I
> do expect rather long resilvering times.
>
> We are using 750 GB Seagate (Enterprise Grade) SATA disks for this
> project with as much hardware redundancy as we can get (multiple
> controllers, dual cabling, I/O multipathing, redundant PSUs, etc.)

nit: SATA disks are single port, so you would need a SAS implementation
to get multipathing to the disks. This will not significantly impact
the overall availability of the data, however. I did an availability
analysis of thumper to show this.
http://blogs.sun.com/relling/entry/zfs_raid_recommendations_space_vs

> I could use multiple pools, but that would make data management
> harder, which in itself is a lengthy process in our shop.
>
> The MTTDL figures seem OK, so how much should I need to worry? Anyone
> having experience with this kind of setup?

I think your design is reasonable. We'd need to know the exact hardware
details to be able to make more specific recommendations.
-- richard
Re: [zfs-discuss] Large zpool design considerations
I'm going down a bit of a different path with my reply here. I know
that all shops and their needs for data are different, but hear me out.

1) You're backing up 40 TB+ of data, increasing at 20-25% per year.
That's insane. Perhaps it's time to look at your backup strategy not
from a hardware perspective, but from a data retention perspective. Do
you really need that much data backed up? There has to be some way to
get the volume down. If not, you're at 100 TB in just slightly over 4
years (assuming the 25% growth factor). If your data is critical, my
recommendation is to go find another job and let someone else have that
headache.

2) 40 TB of backups is, at the best possible price, 50 x 1 TB drives
(allowing for spares and such) - $12,500 for raw drive hardware.
Enclosures add some money, as do cables and such. For mirroring, 90 x
1 TB drives is $22,500 for the raw drives. In my world - I know yours
is different - the difference between a $100,000 solution and a $75,000
solution is pretty negligible. The short version: you can afford to do
mirrors. Really, you can. Any of the parity solutions out there, no
matter what your strategy, is going to cause you more trouble than
you're ready to deal with.

I know these aren't solutions for you, it's just the stuff that was in
my head. The best possible solution, if you really need this kind of
volume, is to create something that never has to resilver. Use some
nifty combination of hardware and ZFS, like a couple of somethings that
each export 20 TB per container as a single volume, and mirror those
with ZFS for its end-to-end checksumming and ease of management.

That's my considerably more than $0.02.

On Thu, Jul 3, 2008 at 11:56 AM, Bob Friesenhahn
<[EMAIL PROTECTED]> wrote:
> On Thu, 3 Jul 2008, Don Enrique wrote:
>>
>> This means that I potentially could lose 40 TB+ of data if three
>> disks within the same RAIDZ-2 vdev should die before the resilvering
>> of at least one disk is complete. Since most disks will be filled I
>> do expect rather long resilvering times.
>
> Yes, this risk always exists. The probability of three disks
> independently dying during the resilver is exceedingly low. The
> chance that your facility will be hit by an airplane during the
> resilver is likely higher. However, it is true that RAIDZ-2 does not
> offer the same ease of control over physical redundancy that
> mirroring does. If you were to use 10 independent chassis and split
> each RAIDZ-2 uniformly across the chassis, then the probability of a
> similar calamity impacting the same drives is driven by rack or
> facility-wide factors (e.g. the building burning down) rather than
> shelf factors. However, if you had 10 RAID arrays mounted in the same
> rack and the rack falls over on its side during a resilver, then hope
> is still lost.
>
> I am not seeing any other options for you here. ZFS RAIDZ-2 is about
> as good as it gets, and if you want everything in one huge pool,
> there will be more risk. Perhaps there is a virtual filesystem layer
> which can be used on top of ZFS which emulates a larger filesystem
> but refuses to split files across pools.
>
> In the future it would be useful for ZFS to provide the option to not
> load-share across huge VDEVs and to use VDEV-level space allocators.
>
> Bob

--
chris -at- microcozm -dot- net
=== Si Hoc Legere Scis Nimium Eruditionis Habes
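The drive-count arithmetic above is easy to check. Assuming the
~$250-per-1 TB street price implied by Chris's figures and 40 TB of
usable space (spares and enclosures excluded):

```shell
# Raw drive counts and cost for 40 TB usable: raidz2 (7+2) vs 2-way mirrors.
awk 'BEGIN {
  usable_tb = 40; per_drive = 250         # assumed $/1TB drive
  vdevs  = int((usable_tb + 6) / 7)       # raidz2 7+2: 7 TB usable per vdev
  raidz2 = vdevs * 9                      # 9 drives per vdev
  mirror = usable_tb * 2                  # 2 drives per usable TB
  printf "raidz2 7+2: %d drives, $%d\n", raidz2, raidz2 * per_drive
  printf "mirrors:    %d drives, $%d\n", mirror, mirror * per_drive
}'
```

Before spares, the raw-drive gap is roughly $6,500 on a project this
size, which is the basis of the "you can afford mirrors" argument; what
mirrors buy with that money is short, low-risk resilvers.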
Re: [zfs-discuss] Large zpool design considerations
On Thu, 3 Jul 2008, Don Enrique wrote:
>
> This means that I potentially could lose 40 TB+ of data if three
> disks within the same RAIDZ-2 vdev should die before the resilvering
> of at least one disk is complete. Since most disks will be filled I
> do expect rather long resilvering times.

Yes, this risk always exists. The probability of three disks
independently dying during the resilver is exceedingly low. The chance
that your facility will be hit by an airplane during the resilver is
likely higher. However, it is true that RAIDZ-2 does not offer the same
ease of control over physical redundancy that mirroring does. If you
were to use 10 independent chassis and split each RAIDZ-2 uniformly
across the chassis, then the probability of a similar calamity
impacting the same drives is driven by rack or facility-wide factors
(e.g. the building burning down) rather than shelf factors. However, if
you had 10 RAID arrays mounted in the same rack and the rack falls over
on its side during a resilver, then hope is still lost.

I am not seeing any other options for you here. ZFS RAIDZ-2 is about as
good as it gets, and if you want everything in one huge pool, there
will be more risk. Perhaps there is a virtual filesystem layer which
can be used on top of ZFS which emulates a larger filesystem but
refuses to split files across pools.

In the future it would be useful for ZFS to provide the option to not
load-share across huge VDEVs and to use VDEV-level space allocators.

Bob
==
Bob Friesenhahn
[EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Large zpool design considerations
> Don Enrique wrote:
>> Now, my initial plan was to create one large pool comprised of X
>> RAIDZ-2 vdevs (7 + 2) with one hot spare per 10 drives and just
>> continue to expand that pool as needed.
>>
>> Between calculating the MTTDL and performance models I was hit by a
>> rather scary thought.
>>
>> A pool comprised of X vdevs is no more resilient to data loss than
>> the weakest vdev, since loss of a vdev would render the entire pool
>> unusable.
>>
>> This means that I potentially could lose 40 TB+ of data if three
>> disks within the same RAIDZ-2 vdev should die before the resilvering
>> of at least one disk is complete. Since most disks will be filled I
>> do expect rather long resilvering times.
>
> Why are you planning on using RAIDZ-2 rather than mirroring ?

Mirroring would increase the cost significantly and is not within the
budget of this project.

> --
> Darren J Moffat
Re: [zfs-discuss] Large zpool design considerations
Don Enrique wrote:
> Now, my initial plan was to create one large pool comprised of X
> RAIDZ-2 vdevs (7 + 2) with one hot spare per 10 drives and just
> continue to expand that pool as needed.
>
> Between calculating the MTTDL and performance models I was hit by a
> rather scary thought.
>
> A pool comprised of X vdevs is no more resilient to data loss than
> the weakest vdev, since loss of a vdev would render the entire pool
> unusable.
>
> This means that I potentially could lose 40 TB+ of data if three
> disks within the same RAIDZ-2 vdev should die before the resilvering
> of at least one disk is complete. Since most disks will be filled I
> do expect rather long resilvering times.

Why are you planning on using RAIDZ-2 rather than mirroring ?

--
Darren J Moffat
[zfs-discuss] Large zpool design considerations
Hi,

I am looking for some best practice advice on a project that I am
working on.

We are looking at migrating ~40 TB of backup data to ZFS, with an
annual data growth of 20-25%.

Now, my initial plan was to create one large pool comprised of X
RAIDZ-2 vdevs (7 + 2) with one hot spare per 10 drives and just
continue to expand that pool as needed.

Between calculating the MTTDL and performance models I was hit by a
rather scary thought.

A pool comprised of X vdevs is no more resilient to data loss than the
weakest vdev, since loss of a vdev would render the entire pool
unusable.

This means that I potentially could lose 40 TB+ of data if three disks
within the same RAIDZ-2 vdev should die before the resilvering of at
least one disk is complete. Since most disks will be filled I do expect
rather long resilvering times.

We are using 750 GB Seagate (Enterprise Grade) SATA disks for this
project with as much hardware redundancy as we can get (multiple
controllers, dual cabling, I/O multipathing, redundant PSUs, etc.)

I could use multiple pools, but that would make data management harder,
which in itself is a lengthy process in our shop.

The MTTDL figures seem OK, so how much should I need to worry? Anyone
having experience with this kind of setup?

/Don E.
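The layout described above - 7+2 raidz2 vdevs with one hot spare per
ten drives, grown over time - looks roughly like this. Device names are
hypothetical:

```shell
# First 10 drives: one 9-disk raidz2 vdev plus a hot spare.
zpool create backup \
  raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t8d0 \
  spare c0t9d0

# Later expansion: add another 7+2 vdev (and its spare) to the same pool.
zpool add backup \
  raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c1t8d0 \
  spare c1t9d0
```

Note that each "zpool add" makes the new vdev another stripe member of
the one pool, which is exactly the property worrying the poster: the
pool is only as durable as its weakest vdev.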
Re: [zfs-discuss] J4200/J4400 Array
Mertol Ozyoney wrote:
> Hi;
>
> You are right that the J series do not have NVRAM onboard. However, most
> JBODs, like HP's MSA series, have some NVRAM.
> The idea behind not using NVRAM on the JBODs is:
>
> -) There is no use adding limited RAM to a JBOD, as the disks already have
> a lot of cache.
> -) It's easy to design a redundant JBOD without NVRAM. If you have NVRAM
> and need redundancy you need to design more complex HW and more complex
> firmware

which translates to higher costs for the hardware and higher costs for the software (firmware and upwards) required to manage it.

> -) Batteries are the first thing to fail

That's why high-end arrays like those from HDS tend to have enough battery storage to keep the disks spinning for nearly 72 hours. Redundancy costs!

> -) Servers already have too much RAM

Not when the users are looking at flash-heavy websites ;-)

James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp http://www.jmcp.homeunix.com/blog
Re: [zfs-discuss] J4200/J4400 Array
You should be able to buy them today. GA should be next week.

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email [EMAIL PROTECTED]

-----Original Message-----
From: Tim [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 02, 2008 9:45 PM
To: [EMAIL PROTECTED]; Ben B.; zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] J4200/J4400 Array

So when are they going to release MSRP?

On 7/2/08, Mertol Ozyoney <[EMAIL PROTECTED]> wrote:
> Availability may depend on where you are located, but the J4200 and J4400
> are available for most regions.
> This equipment is engineered to go well with Sun open storage components
> like ZFS.
> Besides the price advantage, the J4200 and J4400 offer unmatched bandwidth
> to hosts or to stacking units.
>
> You can get the price from your Sun account manager.
>
> Best regards
> Mertol
>
> Mertol Ozyoney
> Storage Practice - Sales Manager
>
> Sun Microsystems, TR
> Istanbul TR
> Phone +902123352200
> Mobile +905339310752
> Fax +90212335
> Email [EMAIL PROTECTED]
>
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Ben B.
> Sent: Wednesday, July 02, 2008 2:49 PM
> To: zfs-discuss@opensolaris.org
> Subject: [zfs-discuss] J4200/J4400 Array
>
> Hi,
>
> According to the Sun Handbook, there is a new array:
> SAS interface
> 12 disks SAS or SATA
>
> ZFS could be used nicely with this box.
>
> There is another version, called the J4400, with 24 disks.
>
> Doc is here:
> http://docs.sun.com/app/docs/coll/j4200
>
> Does someone know price and availability for these products?
>
> Best Regards,
> Ben
>
> This message posted from opensolaris.org
Re: [zfs-discuss] J4200/J4400 Array
Hi;

You are right that the J series do not have NVRAM onboard. However, most JBODs, like HP's MSA series, have some NVRAM. The idea behind not using NVRAM on the JBODs is:

-) There is no use adding limited RAM to a JBOD, as the disks already have a lot of cache.
-) It's easy to design a redundant JBOD without NVRAM. If you have NVRAM and need redundancy, you need to design more complex HW and more complex firmware.
-) Batteries are the first thing to fail.
-) Servers already have too much RAM.

Best regards
Mertol

Mertol Ozyoney
Storage Practice - Sales Manager

Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +90212335
Email [EMAIL PROTECTED]

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Albert Chin
Sent: Wednesday, July 02, 2008 9:04 PM
To: zfs-discuss@opensolaris.org
Subject: Re: [zfs-discuss] J4200/J4400 Array

On Wed, Jul 02, 2008 at 04:49:26AM -0700, Ben B. wrote:
> According to the Sun Handbook, there is a new array:
> SAS interface
> 12 disks SAS or SATA
>
> ZFS could be used nicely with this box.

Doesn't seem to have any NVRAM storage on board, so seems like JBOD.

> There is another version, called the J4400, with 24 disks.
>
> Doc is here:
> http://docs.sun.com/app/docs/coll/j4200

--
albert chin ([EMAIL PROTECTED])
Re: [zfs-discuss] zpool upgrade -v
"Walter Faleiro" <[EMAIL PROTECTED]> writes:
> Hi,
> I reinstalled our Solaris 10 box using the latest update available.
> However I could not upgrade the zpool.
>
> bash-3.00# zpool upgrade -v
> This system is currently running ZFS version 4.
>
> The following versions are supported:
>
> VER  DESCRIPTION
> ---  --------------------------------------------------------
>  1   Initial ZFS version
>  2   Ditto blocks (replicated metadata)
>  3   Hot spares and double parity RAID-Z
>  4   zpool history
>
> For more information on a particular version, including supported
> releases, see:
>
> http://www.opensolaris.org/os/community/zfs/version/N
>
> Where 'N' is the version number.
>
> bash-3.00# zpool upgrade -a
> This system is currently running ZFS version 4.
>
> All pools are formatted using this version.
>
> The Sun docs said to use zpool upgrade -a. Looks like I have missed
> something.

Not that I can see. Not every Solaris update includes a change to the zpool version number. Zpool version 4 is the most recent version on Solaris 10 releases.

HTH,
Boyd
Re: [zfs-discuss] ZFS configuration for VMware
Regarding the error checking, as others suggested, you're best off buying two devices and mirroring them. ZFS has great error checking, why not use it :D
http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

And regarding the memory loss after the battery runs down, that's no different to any hardware RAID controller with battery-backed cache, which is exactly how this should be seen. ZFS clears the ZIL on a clean shutdown; the only time you need to worry about battery life is if you have a sudden power failure, and in that situation I'd much rather have my data being written to the iRAM than to disk: with the greater speed, there's a far greater chance of the system having had time to finish its writes, and a far better chance that I can power it on again and have ZFS recover all my data.

I do agree the iRAM looks like a fringe product, but to me it's a fringe product that works very well for ZFS if you can fit it in your chassis.

Btw, your wishlist is pretty much a word-for-word description of the high-end model of the 'hyperdrive'. It supports up to eight 2GB ECC DDR chips, it's got a 6-hour backup battery (with optional external power too), and it supports copying the data to a laptop or compact flash disk on power fail. The only downside for me is the price: around £1,700 to get your hands on a 16GB one:
http://www.hyperossystems.co.uk/

Ross

This message posted from opensolaris.org
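For reference, adding a mirrored pair of dedicated log devices as suggested above is a short zpool operation. This is a sketch only: 'tank' and the cXtYdZ device names are placeholders for your pool and the two NVRAM-backed devices, and it assumes a zpool version recent enough to support separate log devices.

```shell
# Add a mirrored pair of separate log devices (slog) to an existing pool.
# 'tank' and the device names below are hypothetical placeholders.
zpool add tank log mirror c2t0d0 c2t1d0

# Confirm the mirrored 'logs' vdev now appears in the pool layout.
zpool status tank
```

With the log mirrored, the loss of a single slog device (a dead battery included) does not take the in-flight synchronous writes with it.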
Re: [zfs-discuss] zpool upgrade -v
Hi,
I reinstalled our Solaris 10 box using the latest update available. However I could not upgrade the zpool.

bash-3.00# zpool upgrade -v
This system is currently running ZFS version 4.

The following versions are supported:

VER  DESCRIPTION
---  --------------------------------------------------------
 1   Initial ZFS version
 2   Ditto blocks (replicated metadata)
 3   Hot spares and double parity RAID-Z
 4   zpool history

For more information on a particular version, including supported releases, see:

http://www.opensolaris.org/os/community/zfs/version/N

Where 'N' is the version number.

bash-3.00# zpool upgrade -a
This system is currently running ZFS version 4.

All pools are formatted using this version.

The Sun docs said to use zpool upgrade -a. Looks like I have missed something.

--Walter

On Fri, Jun 13, 2008 at 7:55 PM, Al Hopper <[EMAIL PROTECTED]> wrote:
> On Fri, Jun 13, 2008 at 4:48 PM, Dick Hoogendijk <[EMAIL PROTECTED]> wrote:
> > I have a disk on ZFS created by snv_79b (sxde4) and one on ZFS created
> > by snv_90 (sxce). I wonder, how do I know whether a ZFS version has to be
> > upgraded or not? I.e. are the ZFS versions of sxde and sxce the same?
> > How do I verify that?
>
> Hi Dick (from the Solaris on x86 list),
>
> - First off, and you may already know this, you can upgrade - but it's
> a one-way ticket. You can't change your mind and go "backwards", as
> in, downgrade to a previous release. And, what if you want to
> restore a snapshot to a box running an older release of ZFS...
>
> - Secondly, you're not *required* to upgrade. If there is even a 1 in
> a 1,000,000 chance that you might want to use the pool with a previous
> release of *olaris - *don't* do it! And this includes moving the pool
> over to a different *olaris release - which is a requirement that
> cannot always be foreseen.
>
> - 3rd, in many cases, there is no "loss of features" by not upgrading.
> Again - I say - in most cases - not in all cases.
>
> To examine the current version or to upgrade it, please read the (latest
> version of the) ZFS admin guide, doc # 817-2271. Look at "zpool get
> version poolName" and "zpool upgrade".
>
> Recommendation (based on personal experience): leave the on-disk format
> at the "SXDE" default version for now.
>
> Regards,
>
> > Dick Hoogendijk -- PGP/GnuPG key: 01D2433D
> > ++ http://nagual.nl/ + SunOS sxce snv90 ++
>
> --
> Al Hopper  Logical Approach Inc, Plano, TX  [EMAIL PROTECTED]
> Voice: 972.379.2133  Timezone: US CDT
> OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
> http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/