Re: [zfs-discuss] 350TB+ storage solution
On Wed, 18 May 2011, Chris Mosetick wrote:
> to go in the packing dept. I still love their prices!

There's a reason for that: you don't get what you don't pay for!

--
Rich Teer, Publisher
Vinylphile Magazine
www.vinylphilemag.com
Re: [zfs-discuss] 350TB+ storage solution
> The drives I just bought were half packed in white foam then wrapped
> in bubble wrap. Not all edges were protected with more than bubble
> wrap.

Same here for me. I purchased 10 x 2TB Hitachi 7200rpm SATA disks from Newegg.com in March. The majority of each drive was protected in white foam; however, roughly half an inch at each end of every drive was only protected by bubble wrap. A small batch of three disks I ordered in February (testing for the larger order) was packed similarly, and I've already had to RMA one of those drives. Newegg is moving in the right direction, but they still have a ways to go in the packing dept. I still love their prices!

-Chris
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16 at 21:55, Edward Ned Harvey wrote:
>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
>> boun...@opensolaris.org] On Behalf Of Paul Kraus
>>
>> All drives have a very high DOA rate according to Newegg. The
>> way they package drives for shipping is exactly how Seagate
>> specifically says NOT to pack them here
>
> 8 months ago, newegg says they've changed this practice.
> http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167

The drives I just bought were half packed in white foam then wrapped
in bubble wrap. Not all edges were protected with more than bubble
wrap.

--eric

--
Eric D. Mudama
edmud...@bounceswoosh.org
Re: [zfs-discuss] 350TB+ storage solution
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Paul Kraus
>
> All drives have a very high DOA rate according to Newegg. The
> way they package drives for shipping is exactly how Seagate
> specifically says NOT to pack them here

8 months ago, newegg says they've changed this practice.
http://www.facebook.com/media/set/?set=a.438146824167.223805.5585759167
Re: [zfs-discuss] 350TB+ storage solution
On 2011-05-16 9:14, Richard Elling wrote:
> On May 15, 2011, at 10:18 AM, Jim Klimov wrote:
>> Hi, Very interesting suggestions, as I'm contemplating a Supermicro-based
>> server for my work as well, but probably on a lower budget, as a backup
>> store for an aging Thumper (not as its superior replacement). Still, I
>> have a couple of questions regarding your raidz layout recommendation.
>>
>> On one hand, I've read that as current drives get larger (while their
>> random IOPS/MBPS don't grow nearly as fast with new generations), it is
>> becoming more and more reasonable to use RAIDZ3 with 3 redundancy drives,
>> at least for vdevs made of many disks - a dozen or so. When a drive
>> fails, you still have two redundant parities, and with a resilver window
>> expected to be in the hours-if-not-days range, I would want that airbag,
>> to say the least. You know, failures rarely come one by one ;)
>
> Not to worry. If you add another level of redundancy, the data protection
> is improved by orders of magnitude. If the resilver time increases, the
> effect on data protection is reduced by a relatively small divisor. To get
> some sense of this, the MTBF is often 1,000,000,000 hours and there are
> only 24 hours in a day.

If MTBFs were real, we'd never see disks failing within a year ;) Problem is, these values seem to be determined in an ivory-tower lab. An expensive-vendor edition of a drive running in a cooled data center with shock absorbers and other nice features does often live a lot longer than a similar OEM enterprise or consumer drive running in an apartment with varying weather around it, often overheating, and randomly vibrating along with a dozen other disks rotating in the same box.

The ramble about expensive-vendor drive editions comes from my memory of some forum or blog discussion which I can't point to now, which suggested that vendors like Sun do not charge 5x-10x the price of the same label of OEM drive just for a nice corporate logo stamped onto the disk. Vendors were said to burn in the drives in their labs for something like half a year or a year before putting the survivors on the market. This implies that some of the drives did not survive the burn-in period, and indeed the MTBF for the remaining ones is higher, because "infancy death" due to manufacturing problems soon after arrival at the end customer is unlikely for these particular tested devices. The long burn-in times were also said to be part of the reason why vendors never sell the biggest disks available on the market (does any vendor sell 3TB under their own brand already? Sun-Oracle? IBM? HP?). This may be obscured as a "certification process" which occasionally takes about as long - to see if the newest and greatest disks die within a year or so.

Another implied idea in that discussion was that the vendors can influence OEMs in the choice of components, an example in the thread being different grades of steel for the ball bearings. Such choices can drive the price up with good reason - disks like that are more expensive to produce - but they also increase reliability. In fact, I've had very few Sun disks break in the boxes I've managed over 10 years; all I can remember now were two or three 2.5" 72Gb Fujitsus with a Sun brand, and we still have another dozen of those running after several years. So yes, I can believe that Big Vendor Brand disks can boast huge MTBFs and prove that with a track record, and such drives are often replaced not because of a breakdown, but rather as a precaution and because of "moral aging", such as low speed and small capacity.

But for the rest of us (like home ZFS users) such MTBF numbers are as fantastic as the Big Vendor prices, and unachievable for any number of reasons, starting with the use of cheaper and potentially worse hardware to begin with, and non-"orchard" conditions of running the machines... I do have some 5-year-old disks running in computers daily and still alive, but I have about as many which died young, sometimes even within the warranty period ;)

>> On another hand, I've recently seen many recommendations that in a RAIDZ*
>> drive set, the number of data disks should be a power of two - so that
>> ZFS blocks/stripes, and those of its users (like databases) which are
>> inclined to use 2^N-sized blocks, can often be accessed in a single IO
>> burst across all drives, and not in "one and one-quarter IOs" on the
>> average, which might delay IOs to other stripes while some of the disks
>> in a vdev are busy processing leftovers of a previous request, and
>> others are waiting for their peers.
>
> I've never heard of this and it doesn't pass the sniff test. Can you cite
> a source?

I was trying to find an "authoritative" link today but failed. I know I've read this many times over the past couple of months, but it may still be an "urban legend" or even FUD, retold many times... In fact, today I came across old posts from Jeff Bonwick, where he explains the disk usage and "ZFS striping", which is not like usual RAID striping. If th
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16 at 14:29, Paul Kraus wrote: I have stopped buying drives (and everything else) from Newegg as they cannot be bothered to properly pack items. It is worth the extra $5 per drive to buy them from CDW (who uses factory approved packaging). Note that I made this change 5 or so years ago and Newegg may have changed their packaging since then. NewEgg packaging is exactly what you describe, unchanged in the last few years. Most recent newegg drive purchase was last week for me. --eric -- Eric D. Mudama edmud...@bounceswoosh.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 2:35 PM, Krunal Desai wrote: > An order of 6 the 5K3000 drives for work-related purposes shipped in a > Styrofoam holder of sorts that was cut in half for my small number of > drives (is this what 20 pks come in?). No idea what other packaging > was around them (shipping and receiving opened the packages). Yes, the 20 packs I have seen are a big box with a foam insert with 2 columns of 10 'slots' that hold a drive in anti-static plastic. P.S. I buy from CDW (and previously from Newegg) for home not work. Work tends to buy from Sun/Oracle via a reseller. I can't afford new Sun/Oracle for home use. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
Actually it is 100 or less, i.e. a 10 msec delay. -- Garrett D'Amore On May 16, 2011, at 11:13 AM, "Richard Elling" wrote: > On May 16, 2011, at 10:31 AM, Brandon High wrote: >> On Mon, May 16, 2011 at 8:33 AM, Richard Elling >> wrote: >>> As a rule of thumb, the resilvering disk is expected to max out at around >>> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect >>> the throttles or broken data path. >> >> My system was doing far less than 80 IOPS during resilver when I >> recently upgraded the drives. The older and newer drives were both 5k >> RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to >> be super fast. >> >> The worst resilver was 50 hours, the best was about 20 hours. This was >> just my home server, which is lightly used. The clients (2-3 CIFS >> clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS >> clients) are mostly idle and don't do a lot of writes. >> >> Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things >> up a bit, which suggests that the default values may be too >> conservative for some environments. > > I am more inclined to change the hires_tick value. The "delays" are in > units of clock ticks. For Solaris, the default clock tick is 10ms, that I will > argue is too large for modern disk systems. What this means is that when > the resilver, scrub, or memory throttle causes delays, the effective IOPS is > driven to 10 or less. Unfortunately, these values are guesses and are > probably suboptimal for various use cases. OTOH, the prior behaviour of > no resilver or scrub throttle was also considered a bad thing. > -- richard > > ___ > zfs-discuss mailing list > zfs-discuss@opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
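For anyone who wants to experiment with the throttles discussed above, here is a minimal sketch of how these tunables are typically set on an OpenSolaris-era kernel. The names match the ones mentioned in this thread (zfs_resilver_delay, zfs_resilver_min_time_ms, hires_tick), but the values are purely illustrative, defaults vary between builds, and loosening the throttle lets a resilver compete harder with application I/O - check your build's defaults before changing anything.

  * /etc/system - applied at next boot; values shown are examples only
  set zfs:zfs_resilver_delay = 0
  set zfs:zfs_resilver_min_time_ms = 5000
  * finer-grained clock ticks, per the hires_tick discussion above
  set hires_tick = 1

  # or poke a live kernel with mdb (reverts at reboot); 0t means decimal
  echo zfs_resilver_delay/W0t0 | mdb -kw
  echo zfs_resilver_min_time_ms/W0t5000 | mdb -kw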
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 2:29 PM, Paul Kraus wrote:
> What Newegg was doing is buying drives in the 20-pack from the
> manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and
> then stuffed in a box. No clamshell. I realized *something* was up
> when _every_ drive I looked at had a much higher report of DOA (or
> early failure) at the Newegg reviews than made any sense (and compared
> to other site's reviews).

I picked up a single 5K3000 last week, have not powered it on yet, but it came in a pseudo-OEM box with clamshells. I remember getting bubble-wrapped single drives from Newegg, and more than a fair share of those drives suffered early deaths or never powered on in the first place. No complaints about Amazon: Seagate drives came in Seagate OEM boxes with free shipping via Prime (probably not practical for you enterprise/professional guys, but nice for home users).

An order of 6 of the 5K3000 drives for work-related purposes shipped in a Styrofoam holder of sorts that was cut in half for my small number of drives (is this what 20-packs come in?). No idea what other packaging was around them (shipping and receiving opened the packages).
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 1:20 PM, Brandon High wrote: > The 1TB and 2TB are manufactured in China, and have a very high > failure and DOA rate according to Newegg. All drives have a very high DOA rate according to Newegg. The way they package drives for shipping is exactly how Seagate specifically says NOT to pack them here http://www.seagate.com/ww/v/index.jsp?locale=en-US&name=what-to-pack&vgnextoid=5c3a8bc90bf03210VgnVCM101a48090aRCRD I have stopped buying drives (and everything else) from Newegg as they cannot be bothered to properly pack items. It is worth the extra $5 per drive to buy them from CDW (who uses factory approved packaging). Note that I made this change 5 or so years ago and Newegg may have changed their packaging since then. What Newegg was doing is buying drives in the 20-pack from the manufacturer and packing them individually WRAPPED IN BUBBLE WRAP and then stuffed in a box. No clamshell. I realized *something* was up when _every_ drive I looked at had a much higher report of DOA (or early failure) at the Newegg reviews than made any sense (and compared to other site's reviews). This is NOT to say that the drives in question really don't have a QC issue, just that the reports via Newegg are biased by Newegg's packing / shipping practices. -- {1-2-3-4-5-6-7-} Paul Kraus -> Senior Systems Architect, Garnet River ( http://www.garnetriver.com/ ) -> Sound Coordinator, Schenectady Light Opera Company ( http://www.sloctheater.org/ ) -> Technical Advisor, RPI Players ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 1:20 PM, Brandon High wrote: > The 1TB and 2TB are manufactured in China, and have a very high > failure and DOA rate according to Newegg. > > The 3TB drives come off the same production line as the Ultrastar > 5K3000 in Thailand and may be more reliable. Thanks for the heads up, I was thinking about 5K3000s to finish out my build (currently have Barracuda LPs). I do wonder how much of that DOA is due to newegg HDD packaging/shipping, however. --khd ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On May 16, 2011, at 10:31 AM, Brandon High wrote: > On Mon, May 16, 2011 at 8:33 AM, Richard Elling > wrote: >> As a rule of thumb, the resilvering disk is expected to max out at around >> 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect >> the throttles or broken data path. > > My system was doing far less than 80 IOPS during resilver when I > recently upgraded the drives. The older and newer drives were both 5k > RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to > be super fast. > > The worst resilver was 50 hours, the best was about 20 hours. This was > just my home server, which is lightly used. The clients (2-3 CIFS > clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS > clients) are mostly idle and don't do a lot of writes. > > Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things > up a bit, which suggests that the default values may be too > conservative for some environments. I am more inclined to change the hires_tick value. The "delays" are in units of clock ticks. For Solaris, the default clock tick is 10ms, that I will argue is too large for modern disk systems. What this means is that when the resilver, scrub, or memory throttle causes delays, the effective IOPS is driven to 10 or less. Unfortunately, these values are guesses and are probably suboptimal for various use cases. OTOH, the prior behaviour of no resilver or scrub throttle was also considered a bad thing. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 8:33 AM, Richard Elling wrote: > As a rule of thumb, the resilvering disk is expected to max out at around > 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect > the throttles or broken data path. My system was doing far less than 80 IOPS during resilver when I recently upgraded the drives. The older and newer drives were both 5k RPM drives (WD10EADS and Hitachi 5K3000 3TB) so I don't expect it to be super fast. The worst resilver was 50 hours, the best was about 20 hours. This was just my home server, which is lightly used. The clients (2-3 CIFS clients, 3 mostly idle VBox instances using raw zvols, and 2-3 NFS clients) are mostly idle and don't do a lot of writes. Adjusting zfs_resilver_delay and zfs_resilver_min_time_ms sped things up a bit, which suggests that the default values may be too conservative for some environments. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Sat, May 14, 2011 at 11:20 PM, John Doe wrote: >> 171 Hitachi 7K3000 3TB > I'd go for the more environmentally friendly Ultrastar 5K3000 version - with > that many drives you wont mind the slower rotation but WILL notice a > difference in power and cooling cost A word of caution - The Hitachi Deskstar 5K3000 drives in 1TB and 2TB are different than the 3TB. The 1TB and 2TB are manufactured in China, and have a very high failure and DOA rate according to Newegg. The 3TB drives come off the same production line as the Ultrastar 5K3000 in Thailand and may be more reliable. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
Following are some thoughts, if it's not too late:

> 1 SuperMicro 847E1-R1400LPB

I guess you meant the 847E16-R1400LPB; the SAS1 version makes no sense.

> 1 SuperMicro H8DG6-F

Not the best choice, see below why.

> 171 Hitachi 7K3000 3TB

I'd go for the more environmentally friendly Ultrastar 5K3000 version - with that many drives you won't mind the slower rotation but WILL notice a difference in power and cooling cost.

> 1 LSI SAS 9202-16e

This is really only a very expensive gadget, to be honest; there's really no point to it - especially true when you start looking for the necessary cables, which use a connector that's still in a "draft" specification... Stick to the excellent LSI SAS9200-8e, of which you will need at least 3 in your setup, one to connect each of the 3 JBODs. With them filled with fast drives like you chose, you will need two links (one for the front and one for the back backplane), as daisy-chaining the backplanes together would oversaturate a single link. If you want to take advantage of the dual expanders on your JBOD backplanes for additional redundancy in case of expander or controller failure, you will need 6 of those LSI SAS9200-8e - this is where your board isn't ideal, as it has a 3/1/2 PCIe x16/x8/x4 configuration while you'd need 6 PCIe x8, something the X8DTH-6F will provide, as well as the onboard LSI SAS2008-based HBA for the two backplanes in the server case.

> 1 LSI SAS 9211-4i
> 2 OCZ 64GB SSD Vertex 3
> 2 OCZ 256GB SSD Vertex 3

If these are meant to be connected together and used as ZIL+L2ARC, then I'd STRONGLY urge you to get the following instead:

1x LSI MegaRAID SAS 9265-8i
1x LSI FastPath licence
4-8x 120GB or 240GB Vertex 3 Max IOPS Edition, whatever suits the budget

This solution allows you to push around 400k IOPS to the cache, more than likely way more than the stated application of the system will need.

> 1 Neterion X3120SR0001

I don't know this card personally, but since it's not listed as supported (http://www.sun.com/io_technologies/nic/NIC1.html) I'd be careful.

> My question is what is the optimum way of dividing
> these drives across vdevs?

I would do 14 x 12-drive raidz2 + 3 spares = 140 x 3TB = ~382TiB usable. This would allow for a logical mapping of drives to vdevs, giving you in each case 2 vdevs in the front and 1 in the back, with the 9-drive blocks in the back of the JBODs used as 3 x 4/4/1, giving the remaining 2 x 12-drive vdevs plus one spare per case.

> I could also go with 2TB drives and add an extra 45
> JBOD chassis. This would significantly decrease cost,
> but I'm running a gauntlet by getting very close to
> minimum useable space.
>
> 12 x 18 drive raidz2

I would never do vdevs that large; it's just an accident waiting to happen!

Hopefully these recommendations help you with your project. In any case, it's huge - the biggest system I worked on (which I actually have at home, go figure) only has a bit over 100TB, in the following configuration:

6 x 12-drive raidz2 of Hitachi 5K3000 2TB
3 Norco 4224 with an HP SAS Expander in each
Supermicro X8DTi-LN4F with 3x LSI SAS9200-8e

So yeah, I based my thoughts on my own system, but considering that it's been running smoothly for a while now (and that I had a very similar setup with smaller drives and older controllers before), I'm confident in my suggestions.

Regards from Switzerland,
voyman
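Purely as an illustration of the 14 x 12-drive raidz2 + 3 hot spares layout suggested above, here is a rough sketch of the zpool syntax involved. Only the first two of the fourteen raidz2 vdevs are written out, and the c*t*d* device names are hypothetical placeholders - real names depend on the HBAs, enclosures and multipathing in use - so treat this as a shape, not a recipe.

  # first two of fourteen raidz2 vdevs, 12 disks each, plus 3 hot spares;
  # the remaining twelve vdevs follow the same pattern
  zpool create tank \
    raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 \
           c1t6d0 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 \
    raidz2 c1t12d0 c1t13d0 c1t14d0 c1t15d0 c1t16d0 c1t17d0 \
           c1t18d0 c1t19d0 c1t20d0 c1t21d0 c1t22d0 c1t23d0 \
    spare c4t42d0 c4t43d0 c4t44d0

  # sanity-check the resulting geometry and usable space
  zpool status tank
  zfs list tank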
Re: [zfs-discuss] 350TB+ storage solution
On May 16, 2011, at 5:02 AM, Sandon Van Ness wrote: > On 05/15/2011 09:58 PM, Richard Elling wrote: >>> In one of my systems, I have 1TB mirrors, 70% full, which can be >>> sequentially completely read/written in 2 hrs. But the resilver took 12 >>> hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, >>> using 10 disks + 3 parity, and a usage pattern similar to mine, your >>> resilver time would have been minimum 10 days, >> bollix >> >>> likely approaching 20 or 30 >>> days. (Because you wouldn't get 2-3 weeks of consecutive idle time, and the >>> random access time for a raidz approaches 2x the random access time of a >>> mirror.) >> totally untrue >> >>> BTW, the reason I chose 10+3 disks above was just because it makes >>> calculation easy. It's easy to multiply by 10. I'm not suggesting using >>> that configuration. You may notice that I don't recommend raidz for most >>> situations. I endorse mirrors because they minimize resilver time (and >>> maximize performance in general). Resilver time is a problem for ZFS, which >>> they may fix someday. >> Resilver time is not a significant problem with ZFS. Resilver time is a much >> bigger problem with traditional RAID systems. In any case, it is bad systems >> engineering to optimize a system for best resilver time. >> -- richard > > Actually I have seen resilvers take a very long time (weeks) on > solaris/raidz2 when I almost never see a hardware raid controller take more > than a day or two. In one case i thrashed the disks absolutely as hard as I > could (hardware controller) and finally was able to get the rebuild to take > almost 1 week.. Here is an example of one right now: > > pool: raid3060 > state: ONLINE > status: One or more devices is currently being resilvered. The pool will > continue to function, possibly in a degraded state. > action: Wait for the resilver to complete. > scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go > config: I have seen worse cases, but the root cause was hardware failures that are not reported by zpool status. Have you checked the health of the disk transports? Hint: fmdump -e Also, what zpool version is this? There were improvements made in the prefetch and the introduction of throttles last year. One makes it faster, the other intentionally slows it down. As a rule of thumb, the resilvering disk is expected to max out at around 80 IOPS for 7,200 rpm disks. If you see less than 80 IOPS, then suspect the throttles or broken data path. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
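A quick sketch of how one might actually check the two things Richard suggests - the resilvering disk's IOPS against the ~80 IOPS rule of thumb, and the health of the disk transports - on a Solaris-derived system. The commands are stock; the pool name below is the one from the quoted output and would of course differ on your system.

  # per-device IOPS every 10 seconds while the resilver runs;
  # the resilvering disk's r/s + w/s is what the ~80 IOPS figure refers to
  iostat -xn 10

  # error telemetry that zpool status does not surface
  fmdump -e | tail -20
  iostat -En | grep Errors

  # and the resilver progress itself
  zpool status raid3060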
Re: [zfs-discuss] 350TB+ storage solution
> From: Sandon Van Ness [mailto:san...@van-ness.com] > > ZFS resilver can take a very long time depending on your usage pattern. > I do disagree with some things he said though... like a 1TB drive being > able to be read/written in 2 hours? I seriously doubt this. Just reading > 1 TB in 2 hours means an average speed of over 130 megabytes/sec. 1Gbit/sec sustainable sequential disk speed is not uncommon these days, and it is in fact the performance of the disks in the system in question. SATA 7.2krpm disks... Not even special disks. Just typical boring normal disks. > Definitely no way to be that fast with reading *and* writing 1TB of data > to the drive. I guess if you count reading from one and writing to the > other. 3 hours is a much more likely figure and best case. No need to read & write from the same drive. You can read from one drive and write to the other simultaneously at full speed. If there is any performance difference between read & write on these drives, it's not measurable. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
> From: Richard Elling [mailto:richard.ell...@gmail.com] > > > In one of my systems, I have 1TB mirrors, 70% full, which can be > > sequentially completely read/written in 2 hrs. But the resilver took 12 > > hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, > > using 10 disks + 3 parity, and a usage pattern similar to mine, your > > resilver time would have been minimum 10 days, > > bollix > > Resilver time is not a significant problem with ZFS. Resilver time is a much > bigger problem with traditional RAID systems. In any case, it is bad systems > engineering to optimize a system for best resilver time. Because RE seems to be emotionally involved with ZFS resilver times, I don't believe it's going to be productive for me to try addressing his off-hand comments. Instead, I'm only going to say this much: In my system mentioned above, a complete disk can be copied to another complete disk, sequentially, in 131 minutes. But during idle time it took 12 hours because ZFS resilver only does the used parts of disk, in essentially random order. So ZFS resilver often takes many times longer than a complete hardware-based complete disk resilver. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
I have to agree. ZFS needs a more intelligent scrub/resilver algorithm, which can 'sequentialise' the process. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. Giovanni Tirloni wrote: On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote: Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2 when I almost never see a hardware raid controller take more than a day or two. In one case i thrashed the disks absolutely as hard as I could (hardware controller) and finally was able to get the rebuild to take almost 1 week.. Here is an example of one right now: pool: raid3060 state: ONLINE status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state. action: Wait for the resilver to complete. scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go config: Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks are maxed out on IOPS while still reading 1-2MB/s and there rarely is any writes. I've written about it before here (and provided data). My only guess is that fragmentation is a real problem in a scrub/resilver situation but whenever the conversation changes to point weaknesses in ZFS we start seeing "that is not a problem" comments. With the 7000s appliance I've heard that the 900hr estimated resilver time was "normal" and "everything is working as expected". Can't help but think there is some walled garden syndrome floating around. -- Giovanni Tirloni ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On Mon, May 16, 2011 at 9:02 AM, Sandon Van Ness wrote:
>
> Actually I have seen resilvers take a very long time (weeks) on
> solaris/raidz2 when I almost never see a hardware raid controller take
> more than a day or two. In one case i thrashed the disks absolutely as
> hard as I could (hardware controller) and finally was able to get the
> rebuild to take almost 1 week.. Here is an example of one right now:
>
> pool: raid3060
> state: ONLINE
> status: One or more devices is currently being resilvered. The pool will
> continue to function, possibly in a degraded state.
> action: Wait for the resilver to complete.
> scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
> config:
>

Resilver has been a problem with RAIDZ volumes for a while. I've routinely seen it take >300 hours and sometimes >600 hours with 13TB pools at 80%. All disks are maxed out on IOPS while still reading 1-2MB/s, and there are rarely any writes. I've written about it before here (and provided data).

My only guess is that fragmentation is a real problem in a scrub/resilver situation, but whenever the conversation turns to pointing out weaknesses in ZFS we start seeing "that is not a problem" comments. With the 7000s appliance I've heard that the 900hr estimated resilver time was "normal" and "everything is working as expected". Can't help but think there is some walled garden syndrome floating around.

--
Giovanni Tirloni
Re: [zfs-discuss] 350TB+ storage solution
On 05/15/2011 09:58 PM, Richard Elling wrote:
>> In one of my systems, I have 1TB mirrors, 70% full, which can be
>> sequentially completely read/written in 2 hrs. But the resilver took 12
>> hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB
>> disks, using 10 disks + 3 parity, and a usage pattern similar to mine,
>> your resilver time would have been minimum 10 days,
>
> bollix
>
>> likely approaching 20 or 30 days. (Because you wouldn't get 2-3 weeks of
>> consecutive idle time, and the random access time for a raidz approaches
>> 2x the random access time of a mirror.)
>
> totally untrue
>
>> BTW, the reason I chose 10+3 disks above was just because it makes
>> calculation easy. It's easy to multiply by 10. I'm not suggesting using
>> that configuration. You may notice that I don't recommend raidz for most
>> situations. I endorse mirrors because they minimize resilver time (and
>> maximize performance in general). Resilver time is a problem for ZFS,
>> which they may fix someday.
>
> Resilver time is not a significant problem with ZFS. Resilver time is a
> much bigger problem with traditional RAID systems. In any case, it is bad
> systems engineering to optimize a system for best resilver time.
> -- richard

Actually I have seen resilvers take a very long time (weeks) on solaris/raidz2, when I almost never see a hardware RAID controller take more than a day or two. In one case I thrashed the disks absolutely as hard as I could (hardware controller) and finally was able to get the rebuild to take almost 1 week. Here is an example of one right now:

pool: raid3060
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scrub: resilver in progress for 224h54m, 52.38% done, 204h30m to go
config:

ZFS resilver can take a very long time depending on your usage pattern. I do disagree with some things he said, though... like a 1TB drive being able to be read/written in 2 hours? I seriously doubt this. Just reading 1TB in 2 hours means an average speed of over 130 megabytes/sec. Only really new 1TB drives will even hit that type of speed at the beginning of the drive, and the average would be much closer to around 100 MB/sec at the end of the drive. Also, that is the best-case scenario. I know 1TB drives (when they first came out) took around 4-5 hours to do a complete read of all data on the disk at full speed. There's definitely no way to be that fast with reading *and* writing 1TB of data to the drive - I guess if you count reading from one and writing to the other. 3 hours is a much more likely figure, and best case.
Re: [zfs-discuss] 350TB+ storage solution
On Sun, May 15, 2011 at 10:14 PM, Richard Elling wrote: > On May 15, 2011, at 10:18 AM, Jim Klimov wrote: >> In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) >> or 18 (16+2) disks - the latter being mentioned in the original post. > > A similar theory was disproved back in 2006 or 2007. I'd be very surprised if > there was a reliable way to predict the actual use patterns in advance. > Features > like compression and I/O coalescing improve performance, but make the old > "rules of thumb" even more obsolete. I thought that having data disks that were a power of two was still recommended, due to the way that ZFS splits records/blocks in a raidz vdev. Or are you responding to some other point? -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
On May 15, 2011, at 10:18 AM, Jim Klimov wrote: > Hi, Very interesting suggestions as I'm contemplating a Supermicro-based > server for my work as well, but probably in a lower budget as a backup store > for an aging Thumper (not as its superior replacement). > > Still, I have a couple of questions regarding your raidz layout > recommendation. > > On one hand, I've read that as current drives get larger (while their random > IOPS/MBPS don't grow nearly as fast with new generations), it is becoming > more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for > vdevs made of many disks - a dozen or so. When a drive fails, you still have > two redundant parities, and with a resilver window expected to be in hours if > not days range, I would want that airbag, to say the least. You know, > failures rarely come one by one ;) Not to worry. If you add another level of redundancy, the data protection is improved by orders of magnitude. If the resilver time increases, the effect on data protection is reduced by a relatively small divisor. To get some sense of this, the MTBF is often 1,000,000,000 hours and there are only 24 hours in a day. > On another hand, I've recently seen many recommendations that in a RAIDZ* > drive set, the number of data disks should be a power of two - so that ZFS > blocks/stripes and those of of its users (like databases) which are inclined > to use 2^N-sized blocks can be often accessed in a single IO burst across all > drives, and not in "one and one-quarter IO" on the average, which might delay > IOs to other stripes while some of the disks in a vdev are busy processing > leftovers of a previous request, and others are waiting for their peers. I've never heard of this and it doesn't pass the sniff test. Can you cite a source? > In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) > or 18 (16+2) disks - the latter being mentioned in the original post. A similar theory was disproved back in 2006 or 2007. I'd be very surprised if there was a reliable way to predict the actual use patterns in advance. Features like compression and I/O coalescing improve performance, but make the old "rules of thumb" even more obsolete. So, protect your data and if the performance doesn't meet your expectation, then you can make adjustments. -- richard > ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
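To make the MTBF figure concrete, here is a back-of-the-envelope conversion to an annualized failure rate. The numbers are illustrative only - they assume a constant (exponential) failure rate and ignore infant mortality and wear-out, which is exactly the gap between datasheet MTBF and field experience that the rest of this thread argues about. Published drive specs are more commonly in the 1-1.5 million hour range than 10^9 hours.

  AFR ~= 8760 hours/year / MTBF
  MTBF = 1,000,000 h     -> AFR ~= 0.9% per drive per year
  MTBF = 1,000,000,000 h -> AFR ~= 0.0009% per drive per year

  So with the 171 drives proposed for this build, at a 1,000,000 h MTBF you
  would still expect roughly 171 x 0.009 ~= 1.5 drive failures per year -
  which is why the redundancy level and the resilver window matter even when
  the per-drive numbers look enormous.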
Re: [zfs-discuss] 350TB+ storage solution
On May 15, 2011, at 8:01 PM, Edward Ned Harvey wrote: >> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- >> boun...@opensolaris.org] On Behalf Of Jim Klimov >> >> On one hand, I've read that as current drives get larger (while their > random >> IOPS/MBPS don't grow nearly as fast with new generations), it is becoming >> more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least >> for vdevs made of many disks - a dozen or so. When a drive fails, you > still >> have two redundant parities, and with a resilver window expected to be in >> hours if not days range, I would want that airbag, to say the least. > > This is both an underestimation of the time required, and a sort of backward > logic... > > In all of the following, I'm assuming you're creating a pool whose primary > storage is hard drives, not SSDs or similar. > > The resilver time scales linearly with the number of slabs (blocks) in the > degraded vdev, and depends on your usage patterns, which determine how > randomly your data got scattered throughout the vdev upon writes. In all of my studies of resilvering, I have never seen a linear correlation nor a correlation to the number of blocks. Can you share your data, or is this another guess? > I assume > your choice of raid type will not determine your usage patterns. So if you > create a big vdev (raidz3) as opposed to a bunch of smaller ones (mirrors) > the resilver time is longer for the large vdev. > > Also, even in the best case scenario (mirrors) assuming you have a pool > that's reasonably full (say, 50% to 70%) the resilver time is likely to take > several times longer than a complete sequential read/write of the entire > disk. Several times isn't significant wrt data protection. 10x to 100x is significant. > In one of my systems, I have 1TB mirrors, 70% full, which can be > sequentially completely read/written in 2 hrs. But the resilver took 12 > hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, > using 10 disks + 3 parity, and a usage pattern similar to mine, your > resilver time would have been minimum 10 days, bollix > likely approaching 20 or 30 > days. (Because you wouldn't get 2-3 weeks of consecutive idle time, and the > random access time for a raidz approaches 2x the random access time of a > mirror.) totally untrue > BTW, the reason I chose 10+3 disks above was just because it makes > calculation easy. It's easy to multiply by 10. I'm not suggesting using > that configuration. You may notice that I don't recommend raidz for most > situations. I endorse mirrors because they minimize resilver time (and > maximize performance in general). Resilver time is a problem for ZFS, which > they may fix someday. Resilver time is not a significant problem with ZFS. Resilver time is a much bigger problem with traditional RAID systems. In any case, it is bad systems engineering to optimize a system for best resilver time. -- richard ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss- > boun...@opensolaris.org] On Behalf Of Jim Klimov > > On one hand, I've read that as current drives get larger (while their random > IOPS/MBPS don't grow nearly as fast with new generations), it is becoming > more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least > for vdevs made of many disks - a dozen or so. When a drive fails, you still > have two redundant parities, and with a resilver window expected to be in > hours if not days range, I would want that airbag, to say the least. This is both an underestimation of the time required, and a sort of backward logic... In all of the following, I'm assuming you're creating a pool whose primary storage is hard drives, not SSDs or similar. The resilver time scales linearly with the number of slabs (blocks) in the degraded vdev, and depends on your usage patterns, which determine how randomly your data got scattered throughout the vdev upon writes. I assume your choice of raid type will not determine your usage patterns. So if you create a big vdev (raidz3) as opposed to a bunch of smaller ones (mirrors) the resilver time is longer for the large vdev. Also, even in the best case scenario (mirrors) assuming you have a pool that's reasonably full (say, 50% to 70%) the resilver time is likely to take several times longer than a complete sequential read/write of the entire disk. In one of my systems, I have 1TB mirrors, 70% full, which can be sequentially completely read/written in 2 hrs. But the resilver took 12 hours of idle time. Supposing you had a 70% full pool of raidz3, 2TB disks, using 10 disks + 3 parity, and a usage pattern similar to mine, your resilver time would have been minimum 10 days, likely approaching 20 or 30 days. (Because you wouldn't get 2-3 weeks of consecutive idle time, and the random access time for a raidz approaches 2x the random access time of a mirror.) BTW, the reason I chose 10+3 disks above was just because it makes calculation easy. It's easy to multiply by 10. I'm not suggesting using that configuration. You may notice that I don't recommend raidz for most situations. I endorse mirrors because they minimize resilver time (and maximize performance in general). Resilver time is a problem for ZFS, which they may fix someday. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] 350TB+ storage solution
Hi, Very interesting suggestions, as I'm contemplating a Supermicro-based server for my work as well, but probably on a lower budget, as a backup store for an aging Thumper (not as its superior replacement).

Still, I have a couple of questions regarding your raidz layout recommendation.

On one hand, I've read that as current drives get larger (while their random IOPS/MBPS don't grow nearly as fast with new generations), it is becoming more and more reasonable to use RAIDZ3 with 3 redundancy drives, at least for vdevs made of many disks - a dozen or so. When a drive fails, you still have two redundant parities, and with a resilver window expected to be in the hours-if-not-days range, I would want that airbag, to say the least. You know, failures rarely come one by one ;)

On another hand, I've recently seen many recommendations that in a RAIDZ* drive set, the number of data disks should be a power of two - so that ZFS blocks/stripes, and those of its users (like databases) which are inclined to use 2^N-sized blocks, can often be accessed in a single IO burst across all drives, and not in "one and one-quarter IOs" on the average, which might delay IOs to other stripes while some of the disks in a vdev are busy processing leftovers of a previous request, and others are waiting for their peers.

In case of RAIDZ2 this recommendation leads to vdevs sized 6 (4+2), 10 (8+2) or 18 (16+2) disks - the latter being mentioned in the original post. Did you consider this aspect, or test whether the theoretical warnings are valid?

Thanks,
//Jim Klimov
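Purely to illustrate the arithmetic behind the recommendation Jim describes (whether it matters with ZFS's dynamic striping is exactly what gets questioned elsewhere in this thread), here is how a 128 KiB record would divide across the data disks of the raidz2 widths mentioned, assuming 512-byte sectors and ignoring parity, padding and compression:

  10-wide raidz2 ->  8 data disks -> 131072 B / 8  = 16384 B = 32 sectors per disk
  12-wide raidz2 -> 10 data disks -> 131072 B / 10 = 25.6 sectors per disk,
                                     i.e. 26 sectors on some disks and 25 on others
  18-wide raidz2 -> 16 data disks -> 131072 B / 16 =  8192 B = 16 sectors per disk

With a power-of-two data-disk count every column gets the same whole number of sectors; with other widths some columns carry one extra sector. Whether that uneven split is measurable in practice is the open question here.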