Re: [zfs-discuss] triple-parity: RAID-Z3
Hey Bob,

> MTTDL analysis shows that given normal environmental conditions, the
> MTTDL of RAID-Z2 is already much longer than the life of the computer
> or the attendant human. Of course sometimes one encounters unusual
> conditions where additional redundancy is desired.

To what analysis are you referring? Today the absolute fastest you can
resilver a 1TB drive is about 4 hours. Real-world speeds might be half
that. In 2010 we'll have 3TB drives, meaning it may take a full day to
resilver. The odds of hitting a latent bit error are already reasonably
high, especially with a large pool that's infrequently scrubbed. What,
then, are the odds of a second drive failing in the 24 hours it takes
to resilver?

> I do think that it is worthwhile to be able to add another parity
> disk to an existing raidz vdev, but I don't know how much work that
> entails.

It entails a bunch of work:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

Matt Ahrens is working on a key component, after which it should all be
possible.

> Zfs development seems to be overwhelmed with marketing-driven
> requirements lately and it is time to get back to brass tacks and
> make sure that the parts already developed are truly enterprise-grade.

While I don't disagree that the focus for ZFS should be ensuring
enterprise-class reliability and performance, let me assure you that
requirements are driven by the market and not by marketing.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
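To put rough numbers on that exposure argument, here is a back-of-envelope
sketch; the failure rate, bit error rate, and resilver window below are
illustrative assumptions, not measured values:

  # Rough, illustrative math for the resilver-exposure argument above.
  # All inputs are assumptions for the sake of example, not measurements.

  drives = 8                  # surviving drives in the raidz group
  afr = 0.05                  # assumed 5% annual failure rate per drive
  resilver_hours = 24         # assumed resilver window for a 3TB drive
  ber = 1e-15                 # assumed unrecoverable-error rate per bit
  drive_bits = 3e12 * 8       # bits read per 3TB drive during resilver

  # Probability that at least one more drive fails during the window.
  p_fail_hour = afr / (365 * 24)
  p_second_failure = 1 - (1 - p_fail_hour * resilver_hours) ** drives

  # Expected latent bit errors encountered while reconstructing.
  expected_bit_errors = ber * drive_bits * drives

  print(p_second_failure)     # ~0.1% chance per resilver in this example
  print(expected_bit_errors)  # ~0.2 bad bits expected across the group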
Re: [zfs-discuss] triple-parity: RAID-Z3
> > which gap? 'RAID-Z should mind the gap on writes'?
>
> I believe this is in reference to the RAID-5 write hole, described
> here:
>
>   http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5_performance

It's not.

> So I'm not sure what the 'RAID-Z should mind the gap on writes'
> comment is getting at either. Clarification?

I'm planning to write a blog post describing this, but the basic
problem is that RAID-Z, by virtue of supporting variable stripe writes
(the insight that allows us to avoid the RAID-5 write hole), must round
the number of sectors up to a multiple of nparity+1. This means that we
may have sectors that are effectively skipped. ZFS generally lays down
data in large contiguous streams, but these skipped sectors can stymie
both ZFS's write aggregation and the hard drive's ability to group I/Os
and write them quickly.

Jeff Bonwick added some code to mind these gaps on reads. The key
insight there is that if we're going to read 64K, say, with a 512-byte
hole in the middle, we might as well do one big read rather than two
smaller reads, and just throw out the data that we don't care about.

Of course, doing this for writes is a bit trickier, since we can't just
blithely write over gaps as those might contain live data on the disk.
To solve this we push the knowledge of those skipped sectors down to
the I/O aggregation layer in the form of 'optional' I/Os, purely for
the purpose of coalescing writes into larger chunks.

I hope that's clear; if it's not, stay tuned for the aforementioned
blog post.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
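To illustrate the rounding rule, here is a small sketch of how skip
sectors arise; it mirrors the rule described above rather than the
actual ZFS allocation code:

  # Illustration of RAID-Z skip sectors: allocation sizes are rounded
  # up to a multiple of nparity+1 sectors, so small writes leave gaps.
  # This mirrors the rule described above; it is not the real ZFS code.

  def raidz_asize(data_sectors, nparity):
      """Sectors allocated for a write of data_sectors data sectors."""
      total = data_sectors + nparity          # data plus parity sectors
      rounded = -(-total // (nparity + 1)) * (nparity + 1)  # round up
      return rounded, rounded - total         # (allocated, skipped)

  for nparity in (1, 2, 3):
      for data in (1, 2, 3, 8):
          alloc, skip = raidz_asize(data, nparity)
          print(f"raidz{nparity}: {data} data -> {alloc} sectors, "
                f"{skip} skipped")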
Re: [zfs-discuss] triple-parity: RAID-Z3
> Don't hear about triple-parity RAID that often:
>
>   Author: Adam Leventhal
>   Repository: /hg/onnv/onnv-gate
>   Latest revision: 17811c723fb4f9fce50616cb740a92c8f6f97651
>   Total changesets: 1
>   Log message:
>   6854612 triple-parity RAID-Z
>
>   http://mail.opensolaris.org/pipermail/onnv-notify/2009-July/009872.html
>   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6854612
>
> (Via Blog O' Matty.)
>
> Would be curious to see performance characteristics.

I just blogged about triple-parity RAID-Z (raidz3):

  http://blogs.sun.com/ahl/entry/triple_parity_raid_z

As for performance, on the system I was using (a max config Sun Storage
7410), I saw about a 25% improvement to 1GB/s for a streaming write
workload. YMMV, but I'd be interested in hearing your results.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
Re: [zfs-discuss] triple-parity: RAID-Z3
> > Don't hear about triple-parity RAID that often:
>
> I agree completely. In fact, I have wondered (probably in these
> forums) why we don't bite the bullet and make a generic raidzN, where
> N is any number >= 0.

I agree, but raidzN isn't simple to implement and it's potentially
difficult to get it to perform well. That said, it's something I intend
to bring to ZFS in the next year or so.

> If memory serves, the second parity is calculated using Reed-Solomon,
> which implies that any number of parity devices is possible.

True; it's a degenerate case.

> In fact, get rid of mirroring, because it clearly is a variant of
> raidz with two devices. Want three-way mirroring? Call that raidz2
> with three devices. The truth is that a generic raidzN would roll up
> everything: striping, mirroring, parity raid, double parity, etc.
> into a single format with one parameter.

That's an interesting thought, but there are some advantages to calling
out mirroring, for example, as its own vdev type. As has been pointed
out, reading from either side of a mirror involves no computation,
whereas reading from a RAID-Z 1+2, for example, would involve more
computation. This would complicate the calculus of balancing read
operations over the mirror devices.

> Let's not stop there, though. Once we have any number of parity
> devices, why can't I add a parity device to an array? That should be
> simple enough with a scrub to set the parity. In fact, what is to
> stop me from removing a parity device? Once again, I think the code
> would make this rather easy.

With RAID-Z, stripes can be of variable width, meaning that, say, a
single row in a 4+2 configuration might have two stripes of 1+2. In
other words, there might not be enough space in the new parity device.
I did write up the steps that would be needed to support RAID-Z
expansion; you can find it here:

  http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z

> Ok, back to the real world. The one downside to triple parity is that
> I recall the code discovered the corrupt block by excluding it from
> the stripe, reconstructing the stripe and comparing that with the
> checksum. In other words, for a given cost of X to compute a stripe
> and a number P of corrupt blocks, the cost of reading a stripe is
> approximately X^P. More corrupt blocks would radically slow down the
> system. With raidz2, the maximum number of corrupt blocks would be
> two, putting a cap on how costly the read can be.

Computing the additional parity of triple-parity RAID-Z is slightly
more expensive, but not much -- it's just bitwise operations.
Recovering from a read failure is identical (and performs identically)
to raidz1 or raidz2 until you actually have sustained three failures.
In that case, performance is slower as more computation is involved --
but aren't you just happy to get your data back?

If there is silent data corruption, then and only then can you
encounter the O(n^3) algorithm that you alluded to, but only as a last
resort. If we don't know which drives failed, we try to reconstruct
your data by assuming that one drive, then two drives, then three
drives are returning bad data. For raidz1, this was a linear operation;
for raidz2, quadratic; now raidz3 is N-cubed. There's really no way
around it. Fortunately, with proper scrubbing, encountering data
corruption in one stripe on three different drives is highly unlikely.

Adam

-- 
Adam Leventhal, Fishworks            http://blogs.sun.com/ahl
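To make the linear/quadratic/cubic point concrete, here is a sketch of
that last-resort search; reconstruct() and checksum_ok() are
hypothetical stand-ins for ZFS's actual reconstruction and
block-checksum routines:

  # Sketch of the combinatorial search described above for silent data
  # corruption: with no failed-drive information, try reconstructing
  # with every set of 1, then 2, ... up to nparity drives assumed bad,
  # and accept the first reconstruction whose checksum verifies.
  from itertools import combinations

  def recover(drives, nparity, reconstruct, checksum_ok):
      for nbad in range(1, nparity + 1):          # 1, 2, then 3 drives
          for bad in combinations(range(len(drives)), nbad):
              data = reconstruct(drives, bad)     # assume 'bad' are bad
              if checksum_ok(data):
                  return data, bad
      return None, None   # unrecoverable: more than nparity drives bad

  # For n drives the loop tries C(n,1) + C(n,2) + C(n,3) combinations,
  # hence the linear / quadratic / cubic costs mentioned above.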
Re: [zfs-discuss] zfs IO scheduler
tester writes:

> Hello,
>
> Trying to understand the ZFS IO scheduler; because of its async
> nature it is not very apparent. Can someone give a short explanation
> for each of these stack traces and for their frequency? This is the
> command:
>
>   dd if=/dev/zero of=/test/test1/trash count=1 bs=1024k; sync
>
> No other IO is happening to the test pool. The OS is on a zfs pool
> (rpool). I don't see any zio_vdev_io_start in any of the function
> stacks, any idea why?

I assume because of tail calls. If you trace zio_vdev_io_start() you
see it being called, but (looking at the source) it then tail-calls
vdev_mirror_io_start() and so disappears from the stack.

-r

> dtrace -n 'io:::start { @a[stack()] = count(); }'
> dtrace: description 'io:::start ' matched 6 probes
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     20
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_queue_io_done+0x84
>   zfs`vdev_disk_io_done+0x4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     31
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_mirror_io_start+0x1b4
>   zfs`zio_execute+0x74
>   zfs`vdev_mirror_io_start+0x1b4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     34
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_mirror_io_start+0x1b4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     45
>
>   genunix`bdev_strategy+0x44
>   zfs`vdev_disk_io_start+0x2a8
>   zfs`zio_execute+0x74
>   zfs`vdev_queue_io_done+0x9c
>   zfs`vdev_disk_io_done+0x4
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     53
Re: [zfs-discuss] Speeding up resilver on x4500
Stuart Anderson writes:

> On Jun 21, 2009, at 10:21 PM, Nicholas Lee wrote:
>
>> On Mon, Jun 22, 2009 at 4:24 PM, Stuart Anderson
>> <ander...@ligo.caltech.edu> wrote:
>>
>>> However, it is a bit disconcerting to have to run with reduced data
>>> protection for an entire week. While I am certainly not going back
>>> to UFS, it seems like it should be at least theoretically possible
>>> to do this several orders of magnitude faster, e.g., what if every
>>> block on the replacement disk had its RAIDZ2 data recomputed from
>>> the degraded [...]
>>
>> Maybe this is also saying that for large disk sets a single RAIDZ2
>> provides a false sense of security.
>
> This configuration is with 3 large RAIDZ2 devices, but I have more
> recently been building thumper/thor systems with a larger number of
> smaller RAIDZ2's.

Thanks. 170M small files reconstructed in 1 week over 3 raid-z groups
is 93 files/sec per raid-z group. That is not too far from expectations
for 7.2K RPM drives (were they?).

I don't see orders of magnitude improvements on this; however, this CR
(integrated in snv_109) might give the workload a boost:

  6801507 ZFS read aggregation should not mind the gap

This will enable more read aggregation to occur during a resilver. We
could also contemplate enabling the vdev prefetch code for data during
a resilver. Otherwise, limiting the # of small objects per raid-z
group, as you're doing now, seems wise to me.

-r
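For the record, the arithmetic behind Roch's estimate:

  # Checking the resilver-rate arithmetic quoted above.
  files = 170e6               # files resilvered
  seconds = 7 * 24 * 3600     # one week
  groups = 3                  # raid-z groups resilvering in parallel

  per_group = files / seconds / groups
  print(round(per_group))     # ~94 files/sec per raid-z group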
Re: [zfs-discuss] zio_assess
zio_assess went away with SPA 3.0:

  6754011 SPA 3.0: lock breakup, i/o pipeline refactoring, device
  failure handling

You now have:

  zio_vdev_io_assess(zio_t *zio)

Yes, it's one of the last stages of the I/O pipeline (see zio_impl.h).

-r

tester writes:

> Hi,
>
> What does zio_assess do? Is it a stage of the pipeline? I see quite a
> few of these stacks in a 5-second window. I tried to search
> src.opensolaris and did not find any reference. Thanks for any help.
>
>   zfs`zio_assess+0x58
>   zfs`zio_execute+0x74
>   genunix`taskq_thread+0x1a4
>   unix`thread_start+0x4
>     1604
>
> Thanks
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Thanks for the feedback, George. I hope we get the tools soon.

At home I have now blown the ZFS pool away and am creating a HW RAID-5
set :-( Hopefully in the future, when the tools are there, I will
return to ZFS.

To All: The ECC discussion was very interesting as I had never
considered it that way! I will be buying ECC memory for my home
machine!!

Again, many many thanks to all who have replied; it has been a very
interesting and informative discussion for me.

Best regards,

Russel
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Hi.

Good to know! But how do we deal with that on older systems, which
don't have the patch applied, once it is out?

Thanks, Alexander

On Tuesday, July 21, 2009, George Wilson <george.wil...@sun.com> wrote:
> Russel wrote:
>> OK. So do we have an zpool import --txg 56574 mypoolname or help to
>> do it (script?)
>>
>> Russel
>
> We are working on the pool rollback mechanism and hope to have that
> soon. The ZFS team recognizes that not all hardware is created equal
> and thus the need for this mechanism. We are using the following CR
> as the tracker for this work:
>
>   6667683 need a way to rollback to an uberblock from a previous txg
>
> Thanks,
> George

-- 
Alexander
[[ http://zensursula.net ]]
[ Soc. = http://twitter.com/alexs77 | http://www.plurk.com/alexs77 ]
[ Mehr = http://zyb.com/alexws77 ]
[ Chat = Jabber: alexw...@jabber80.com | Google Talk: a.sk...@gmail.com ]
[ Mehr = AIM: alexws77 ]
[ $[ $RANDOM % 6 ] = 0 ] && rm -rf / || echo 'CLICK!'
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40
I don't mean to be offensive Russel, but if you do ever return to ZFS,
please promise me that you will never, ever, EVER run it virtualized on
top of NTFS (a.k.a. worst file system ever) in a production
environment. Microsoft Windows is a horribly unreliable operating
system in situations where things like protecting against data
corruption are important. Microsoft knows this, which is why they
secretly run much of Microsoft.com, their www advertisement campaigns,
and the Microsoft Updates web sites on Akamai Linux in the data center
across the hall from the data center where I work, and the invulnerable
file system behind Microsoft's cloud that secretly runs on Akamai's
content delivery system is none other than ZFS's long lost brother...
Netapp WAFL!

The first time I started to catch on to this was when the Project
Mojave advertisement campaign started and lots of people were nmap
scanning the site and noticing that it was running Apache on Linux:

http://openmanifesto.blogspot.com/2008/07/mss-blunder-with-mojave-experiment-uses.html

Eventually Microsoft realized they messed up and started to edit the
header strings like they usually do to make it look like IIS:

https://lists.mayfirst.org/pipermail/nosi-discussion/2008-August/000417.html

although you could still figure it out if you were smart enough by
using telnet like this:

http://news.netcraft.com/archives/2003/08/17/wwwmicrosoftcom_runs_linux_up_to_a_point_.html

but the cat was already out of the bag.

I did some investigating over a year ago and talked to some of my long
time friends who were senior Akamai techs, and one of them eventually
gave me a guided tour after hours and gave me a quick look at the
Netapp WAFL setup and explained how Microsoft Windows updates actually
work. Very cool! These Akamai guys are like the Wizard of Oz for the
Internet, running everything behind the curtains there. Whenever
Microsoft Updates are down - tell an Akamai tech! Everything will start
working fine within 5 minutes of you telling them (sure beats calling
in to Microsoft Tech Support in Mumbai, India). Is apple.com or itunes
running slow? Tell an Akamai tech and it'll be fixed immediately.
Cnn.com down? Jcpenny.com down? Yup. Tell an Akamai tech and it comes
right back up. It's very rare that they have a serious problem like
this one:

http://www.theregister.co.uk/2004/06/15/akamai_goes_postal/

in which case 25% of the internet (including google, yahoo, and lycos)
usually goes down with them.

So my question to you Russel is: if Microsoft can't even rely on NTFS
to run their own important infrastructure (they obviously have a Netapp
WAFL dependency), what hope can your 10TB pool possibly have? What
you're doing is the equivalent of building a 100-story-tall skyscraper
out of titanium and then making the bottom-most ground floor and
basement foundation out of glue and popsicle sticks, and then when the
entire building starts to collapse, you call the titanium metal
fabrication corporation, blame them for the problem, and tell them that
they are obligated to help you glue your popsicle sticks back together
because it's all their fault that the building collapsed! Not very fair
IMHO.

In the future, keep in mind that (as far as I understand it) the only
way to get the 100% full benefits of ZFS checksum protection is to run
it on bare metal with no virtualization. If you're going to virtualize
something, virtualize Microsoft Windows and Linux inside of OpenSolaris.
I'm running ZFS in production with my OpenSolaris operating system
zpool mirrored three times over on 3 different drives, and I've never
had a problem with it. I even created a few simulated power outages to
test my setup, and pulling the plug while twelve different users were
uploading multiple files into 12 different Solaris zones definitely
didn't faze the zpool at all. It just boots right back up and
everything works.

The thing is, though, it only seems to work when you're not running it
virtualized on top of a closed-source proprietary file system that's
made out of glue and popsicle sticks.

Just my 2 cents. I could be wrong though.
Re: [zfs-discuss] zpool import is trying to tell me something...
Maybe I should have posted the zdb -l output. Having seen another
thread which suggests that I might be looking at the most recent txg
being damaged, I went to get my pool's txg counter:

  hydra# zdb -l /dev/dsk/c3t10d0s0 | grep txg
      txg=10168474
      txg=10168474
      txg=6324561
      txg=6324561

All of the disks are like this (indeed, the only thing that differs
about their zdb -l output is their own guids, as expected). Staring at
reams of od -x output, it appears that I have txgs 10168494 through
10168621 in L0 and L1. L2 and L3 appear to have not been updated in
some time!

L0 and L1 are both version=14 and have a hostname field; L2 and L3 are
both version=3 and do not. All four labels appear to describe the same
array (guids and devices and all). The uberblocks in L2 and L3 seem to
contain txgs 6319346 through 6319473. That's, uh... funny.

A little bit of dtrace and time travel back to vdev.c as of snv_115
later, I find that the immediate cause is that vdev_raidz_open is
yielding an asize of 1280262012928, but that when vdev_open called
vdev_raidz_open, it thought the asize was 1280272498688. (Thus
vdev_open and vdev_root_open return EINVAL, and the import fails.)

That is, the array is actually one megabyte smaller than it thought...
which works out to 256K per disk, which is exactly the size of a label
pair and might explain why L2 and L3 are stale.

Help?
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
Once these bits are available in OpenSolaris then users will be able to
upgrade rather easily. This would allow you to take a liveCD running
these bits and recover older pools.

Do you currently have a pool which needs recovery?

Thanks,
George

Alexander Skwar wrote:
> Hi.
>
> Good to know! But how do we deal with that on older systems, which
> don't have the patch applied, once it is out?
>
> Thanks, Alexander
> [...]
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40 days work
> To All: The ECC discussion was very interesting as I had never
> considered it that way! I will be buying ECC memory for my home
> machine!!

You have to make sure your mainboard, chipset and/or CPU support it;
otherwise any ECC modules will just work like regular modules. The
mainboard needs to have the necessary lanes to either the chipset that
supports ECC (in the case of Intel) or the CPU (in the case of AMD).

I think all Xeon chipsets do ECC, as do various consumer ones (I only
know of X38/X48; there are also some 9xx ones that do). For consumer
boards, it's hard to figure out which actually do support it. I have an
X48-DQ6 mainboard from Gigabyte, which does it.

Regards,
-mg
[zfs-discuss] zfs mirroring question
I am running basic mirroring in a server setup. When I pull out a hard
drive and put it back in, it won't detect it and resilver it until I
reboot the system. Is there a way to force it to detect it and resilver
it in real time? Thank you.

Dan
[zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
We have a Thumper that we got at a good price from the Sun Educational
Grant program (thank you Sun!) but it came populated with 500GB drives.
The box will be used as a virtual tape library and general purpose
NFS/iSCSI/Samba file server for users' stuff. Probably, in about two
years, we will want to reload it with whatever the big 1TB-class drive
of the day is. This gives me a problem with respect to planning for the
future, since currently one can't shrink a zpool. I can think of a few
approaches:

1) Initial configuration with two zpools. This lets us do the upgrade
just before utilization hits 50%. We can migrate everyone off pool 1,
destroy it, upgrade it, and either repeat the process for pool 2 or
join the pools together.

2) Replace with new, bigger disks, and slice them in half. Use one
slice to rejoin the existing pool, and the second slice to start a new
pool.

3) Unlikely: Mirror the existing zpool with some kind of external vdev.
I've tested this - I actually mirrored a physical disk with an NFS vdev
once, and to my amazement it worked. Unfortunately the Thumper is the
biggest box we have right now; we don't have any other devices with
18+TB of space.

3 1/2) Tape, like failure, is always an option.

Either way with 1 or 2 we're stuck with two pools on the same host, but
since I have 40+ disks to spread the IO over, I'm not too worried.

Option 4) If I just replace the 500GB disks one by one with 1TB disks
in an existing single zpool, will the zpool magically have twice as
much space when I am done replacing the very last disk? I don't have
any way to test this. In the past I have been able to do this with
*some* RAID5 array controllers.

If you've been through this drill, let us know how you handled it.

Thanks in advance,

-W Sanders
 St Marys College of CA
Re: [zfs-discuss] zfs mirroring question
Daniel S wrote:
> I am running basic mirroring in a server setup. When I pull out a
> hard drive and put it back in, it won't detect it and resilver it
> until I reboot the system. Is there a way to force it to detect it
> and resilver it in real time?

More info on your hardware is required - in particular, what type of
disks these are and how they are attached, e.g. IDE, SATA, SAS, USB,
FC, iSCSI...

I'm assuming since you said basic mirroring you don't have any hot
spares configured that would have kicked in.

-- 
Darren J Moffat
Re: [zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
Hi--

With 40+ drives, you might consider two pools anyway. If you want to
use a ZFS root pool, something like this:

- Mirrored ZFS root pool (2 x 500 GB drives)
- Mirrored ZFS non-root pool for everything else

Mirrored pools are flexible and provide good performance. See this site
for more tips:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Option 4 below is your best option. Depending on the Solaris release,
ZFS will see the expanded space. If not, see this section:

http://www.solarisinternals.com/wiki/index.php/ZFS_Troubleshooting_Guide#Changing_Disk_Capacity_Sizes

Cindy

On 07/22/09 10:31, W Sanders wrote:
> We have a Thumper that we got at a good price from the Sun
> Educational Grant program (thank you Sun!) but it came populated with
> 500GB drives.
> [...]
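The reason the space only shows up after the last disk is swapped is
that a vdev's usable size is bounded by its smallest member. A toy
model (raidz2 geometry and drive counts assumed for illustration):

  # Toy model of option 4: a vdev can only use as much of each disk as
  # its smallest member offers, so capacity jumps only once the last
  # 500GB disk in the vdev has been replaced by a 1TB one.

  def raidz2_usable_tb(disks_tb):
      return (len(disks_tb) - 2) * min(disks_tb)  # raidz2: 2 parity

  vdev = [0.5] * 8                    # eight 500GB disks
  for i in range(len(vdev)):
      vdev[i] = 1.0                   # replace one disk with a 1TB drive
      print(i + 1, "replaced ->", raidz2_usable_tb(vdev), "TB usable")
  # Stays at 3.0 TB until the 8th replacement, then jumps to 6.0 TB.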
Re: [zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
4. Yes :-D

While you can't shrink, you can already replace drives with bigger
ones, and ZFS does increase the size at the end (although I think it
needs an unmount/mount right now).

However, even though you can simply pull one drive and replace it with
a bigger one, that does degrade your array. So instead, depending on
your needs, I'd suggest something like creating one pool of a bunch of
raid-z2 vdevs, with 2-4 drives allocated as hot spares. That allows you
in the future to replace the spare drives with new 2TB drives, then
boot and run a 'zpool replace <old disk> <new disk>' for each of the
spares. That will switch the drives to the bigger size without
degrading the array. Then when that finishes, remove the replaced
drives (which are the new spares), and repeat.

The reason I suggest up to 4 spares is that it's likely to take some
time to resilver, and even doing 4 at once you'll need to do this 12
times to upgrade a Thumper. So if you are planning to upgrade,
sacrificing that space now is probably a worthwhile investment.

Sun have confirmed that 2TB drives will be supported, and probably 4TB
ones too. I've also tested this out myself (although just with a single
1TB drive) on a Thumper.
Re: [zfs-discuss] When will shrink / evict be coming? With respect to drive upgrades ...
Thanks! Rats, we're running GA u7 and not OpenSolaris for now:

  # zpool set autoexpand=on pool
  cannot set property for 'pool': invalid property 'autoexpand'

(My pool is, in fact, named "pool".)

We're not in production yet, but I eventually have to install Veritas
NetBackup on this thing (please feel free to pity me), and I don't know
if they are supporting OpenSolaris yet.

-w
[zfs-discuss] SSD's and ZFS...
I've started reading up on this, and I know I have a lot more reading
to do, but I've already got some questions... :)

I'm not sure yet that it will help for my purposes, but I was
considering buying 2 SSD's for mirrored boot devices anyway. My main
question is: can a pair of, say, 60GB SSD's be shared for both the root
pool and as an SSD ZIL? Can the installer be configured to make the
slice for the root pool something less than the whole disk, leaving
another slice for the ZIL? Or would a zvol in the root pool be a better
idea?

I doubt 60GB will leave enough space, but would doing this for the
L2ARC be useful also?

-Kyle
Re: [zfs-discuss] SSD's and ZFS...
I can't speak to whether it's a good idea or not, but I also wanted to
do this and it was rather difficult. The problem is that the
OpenSolaris installer doesn't let you set up slices on a device to
install to. The two ways I came up with were:

1) Use the automated installer to do everything, because it has the
option to configure slices before installing files. This requires
learning a lot about the AI just to configure slices before installing.

2) - install like normal on one drive
   - set up drive #2 with the partition map that you want to have
   - zpool replace drive #1 with drive #2 with the altered partition map
   - set up drive #1 with the new partition map
   - zpool attach drive #1
   - install grub on both drives

Even though approach #2 probably sounds more difficult, I ended up
doing it that way and set up a root slice on each, a slog slice on
each, and 2 independent swap slices.

I would also like to hear if there's any other way to make this easier,
or any problems with my approach that I might have overlooked.
[zfs-discuss] virtualization, alignment and zfs variable stripes
One of the things that commonly comes up in the server virtualization
world is making sure that all of the storage elements are aligned. This
is because there are often so many levels of abstraction, each using
their own block size, that without any tuning they'll usually overlap
and can cause 2 or even 3 times the I/O in some cases to read what
would be just one block. I guess this was also a common thing in the
SAN world many years back.

Let's say I have a simple-ish setup that uses vmware files for virtual
disks on an NFS share from zfs. I'm wondering how zfs' variable block
size comes into play? Does it make the alignment problem go away? Does
it make it worse? Or should we perhaps be creating filesystems with a
fixed block size for virtualization workloads?
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
The i7 doesn't support ECC even if the motherboard supports it; you
need a Xeon W3500, which costs the same as an i7, to support ECC.
Re: [zfs-discuss] virtualization, alignment and zfs variable stripes
On Wed, 22 Jul 2009, t. johnson wrote:
> Let's say I have a simple-ish setup that uses vmware files for
> virtual disks on an NFS share from zfs. I'm wondering how zfs'
> variable block size comes into play? Does it make the alignment
> problem go away? Does it make it worse? Or should we perhaps be

My understanding is that zfs uses fixed block sizes except for the tail
block of a file, or if the filesystem has compression enabled.

Zfs's large blocks can definitely cause performance problems if the
system has insufficient memory to cache the blocks which are accessed,
or if only part of the block is updated.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Another user loses his pool (10TB) in this case and 40
>>>>> "aym" == Anon Y Mous <no-re...@opensolaris.org> writes:
>>>>> "mg" == Mario Goebbels <m...@tomservo.cc> writes:

   aym> I don't mean to be offensive Russel, but if you do ever return
   aym> to ZFS, please promise me that you will never, ever, EVER run
   aym> it virtualized on top of NTFS

he said he was using raw disk devices IIRC. and once again, the host
did not crash, only the guest, so even if it were NTFS rather than raw
disks, the integrity characteristics of NTFS would have been irrelevant
since the host was always shut down cleanly.

   aym> the only way to get the 100% full benefits of ZFS checksum
   aym> protection is to run it on bare metal with no virtualization.

bullshit. That makes no sense at all. First, why should virtualization
have anything to do with checksums? Obviously checksums go straight
through it. The suspected problem lies elsewhere. Second,
virtualization is serious business. Problems need to be found and
fixed. At this point, you've become so aggressive with that broom,
anyone can see there's obviously an elephant under the rug.

   aym> I'm running ZFS in production with my OpenSolaris operating
   aym> system zpool mirrored three times over on 3 different drives,
   aym> and I've never had a problem with it.

The idea of collecting other people's problem reports is to figure out
what's causing problems before one hits you. I hear this type of thing
all the time: ``The number of problems I've had is so close to zero, it
is zero, so by extrapolation nobody else can be having any real
problems, because if I scale out my own experience the expected number
of problems in the entire world is zero.''---wtf? clearly bogus!

    mg> You have to make sure your mainboard, chipset and/or CPU
    mg> support it, otherwise any ECC modules will just work like
    mg> regular modules.

also, scrubbing is sometimes enabled separately from plain ECC. Without
scrubbing the ECC can still correct errors, but won't do so until some
actual thread reads the flipped bit, which is probably okay, but shrug.
I vaguely remember something about an idle scrub thread in solaris
where the CPU itself does the scrubbing? but at least on AMD platforms,
the memory and cache controllers will do scrubbing themselves using
only memory bandwidth, without using CPU cycles, if you ask.

On AMD you can use this script on Linux to control scrub speed and ECC
enablement if your BIOS does not support it. The script does appear to
do something on Phenom II, but I haven't tried the 10-ohm resistor test
the author suggests. I think it should be adaptable to Solaris.

  http://hyvatti.iki.fi/~jaakko/sw/

now if only we could get 4GB ECC unbuffered DDR3 for similar prices to
non-ECC. :(
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
Have you considered running your script with ZFS prefetching disabled
altogether, to see if the results are consistent between runs?

Brad

Brad Diggs
Senior Directory Architect
Virtualization Architect
xVM Technology Lead
Sun Microsystems, Inc.
Phone x52957/+1 972-992-0002
Mail bradley.di...@sun.com
Blog http://TheZoneManager.com
Blog http://BradDiggs.com

On Jul 15, 2009, at 9:59 AM, Bob Friesenhahn wrote:

> On Wed, 15 Jul 2009, Ross wrote:
>
>> Yes, that makes sense. For the first run, the pool has only just
>> been mounted, so the ARC will be empty, with plenty of space for
>> prefetching.
>
> I don't think that this hypothesis is quite correct. If you use
> 'zpool iostat' to monitor the read rate while reading a large
> collection of files with total size far larger than the ARC, you
> will see that there is no fall-off in read performance once the ARC
> becomes full.
>
> The performance problem occurs when there is still metadata cached
> for a file but the file data has since been expunged from the cache.
> The implication here is that zfs speculates that the file data will
> be in the cache if the metadata is cached, and this results in a
> cache miss as well as disabling the file read-ahead algorithm. You
> would not want to do read-ahead on data that you already have in a
> cache.
>
> Recent OpenSolaris seems to take a 2X performance hit rather than
> the 4X hit that Solaris 10 takes. This may be due to improvement of
> existing algorithm function performance (optimizations) rather than
> a related design improvement.
>
>> I wonder if there is any tuning that can be done to counteract
>> this? Is there any way to tell ZFS to bias towards prefetching
>> rather than preserving data in the ARC? That may provide better
>> performance for scripts like this, or for random access workloads.
>
> Recent zfs development focus has been on how to keep prefetch from
> damaging applications like database, where prefetch causes more data
> to be read than is needed. Since OpenSolaris now apparently includes
> an option setting which blocks file data caching and prefetch, this
> seems to open the door for use of more aggressive prefetch in the
> normal mode.
>
> In summary, I agree with Richard Elling's hypothesis (which is the
> same as my own).
>
> Bob
[zfs-discuss] Best Approach
I have (2) of the following boxes, exactly matching:

(2) Super Micro X7DBN Motherboard
(16) GB of RAM (8GB in each box)
(4) 1.6GHz Intel XEON Quad-Core LGA771
(2) Super Micro 2U RM (12 Bay Chassis)
(2) Super Micro AOC 8-port SATA Controller

I'd like to ZFS-replicate this box to the other; is this practical, for
one, and/or what is the best method? It will be strictly nothing but a
backup storage box via NFS and/or iSCSI.
[zfs-discuss] L2ARC support in Solaris 10 (Update 8?)
Hi All,

Can anyone shed some light on whether L2ARC support will be included in
the next Solaris 10 update? Or if it is included in a kernel patch over
and above the standard kernel patch rev that ships in 05/09 (AKA U7)?

The reason I ask is that I have standardised on S10 here and am not
keen to deploy OpenSolaris in production. (Just another platform and
patching system to document and maintain. I don't want to debate this
here. It's the way it is.)

I am currently speccing some x4240's with SSD's for some upgraded Squid
proxy caches that will be handling caching duties for around 40-60
megabits/s. Large disk caches and L1ARC for squid will make these
systems really fly. (These are replacing two v240's that are getting a
little long in the tooth and won't keep up with the bandwidth jump.)

The plan is to have a couple of x4240's with dual quad-core processors,
16 GB RAM and 6 x 146 GB 10K SAS drives plus 1 x 32 GB SSD as L2ARC. I
can add this later if support is not available at build time but is
roadmapped for U8.

ZFS config will be a pair of 146 GB drives mirrored as boot drives (and
possibly access logging) and then a RAIDZ1 of 4 drives for max capacity
(the data is disposable as it is purely cached object data).
Compression will be enabled on the disk cache RAIDZ1 to increase
performance of cached data read from disk (seeing as I have many CPU
cycles to burn in these systems ;) ).

I am hoping that these systems will have an L1ARC of around 10GB, an
L2ARC of 32GB and a cache volume of ~420GB RAIDZ plus compression. We
may add more drives or RAIDZ's as we tweak the Squid cached object
size. We are hoping to cache objects up to around 100 MB.

Any comments on either the system configuration and/or L2ARC support
are invited from the list.

Thanks,

Scott.

-- 
_______________________________________________
Scott Lawson
Systems Architect
Manukau Institute of Technology
Information Communication Technology Services
Private Bag 94006, Manukau City, Auckland, New Zealand

Phone  : +64 09 968 7611
Fax    : +64 09 968 7641
Mobile : +64 27 568 7611

mailto:sc...@manukau.ac.nz
http://www.manukau.ac.nz

perl -e 'print $i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
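A quick sanity check on that ~420GB figure (raidz1 rule of thumb, raw
rather than formatted gigabytes):

  # raidz1 usable capacity is roughly (n-1) drives' worth.
  drives, size_gb = 4, 146
  print((drives - 1) * size_gb)  # 438 GB raw, i.e. the ~420GB volume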
Re: [zfs-discuss] Why is Solaris 10 ZFS performance so terrible?
On Wed, 22 Jul 2009, Roch wrote:
> Hi Bob, did you consider running the 2 runs with
>
>   echo zfs_prefetch_disable/W0t1 | mdb -kw
>
> and seeing if performance is constant between the 2 runs (and low)?
> That would help clear up the cause a bit. Sorry, I'd do it for you,
> but since you have the setup etc...
>
> Revert with:
>
>   echo zfs_prefetch_disable/W0t0 | mdb -kw
>
> -r

I see that if I update my test script so that prefetch is disabled
before the first cpio is executed, the read performance of the first
cpio reported by 'zpool iostat' is similar to what has been normal for
the second cpio case (i.e. 32MB/second). This seems to indicate that
prefetch is entirely disabled if the file has ever been read before.

However, there is a new wrinkle in that the second cpio completes twice
as fast with prefetch disabled, even though 'zpool iostat' indicates
the same consistent throughput. The difference goes away if I triple
the number of files.

With 3000 8.2MB files:

  Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
  14443520 blocks

  real    3m41.61s
  user    0m0.44s
  sys     0m8.12s

  Doing second 'cpio -C 131072 -o > /dev/null'
  14443520 blocks

  real    1m50.12s
  user    0m0.42s
  sys     0m7.21s

Now if I increase the number of files to 9000 8.2MB files:

  Doing initial (unmount/mount) 'cpio -C 131072 -o > /dev/null'
  144000768 blocks

  real    35m51.47s
  user    0m4.46s
  sys     1m20.11s

  Doing second 'cpio -C 131072 -o > /dev/null'
  144000768 blocks

  real    35m22.41s
  user    0m4.40s
  sys     1m14.22s

Notice that with 3X the files, the throughput is dramatically reduced
and the time is the same for both cases.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
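As a cross-check on those timings (cpio reports 512-byte blocks):

  # Converting the cpio timings above into throughput.
  def mib_per_s(blocks, minutes, seconds):
      bytes_read = blocks * 512          # cpio counts 512-byte blocks
      elapsed = minutes * 60 + seconds
      return bytes_read / elapsed / 2**20

  print(mib_per_s(14443520, 3, 41.61))    # 1st run, 3000 files: ~32 MB/s
  print(mib_per_s(14443520, 1, 50.12))    # 2nd run, 3000 files: ~64 MB/s
  print(mib_per_s(144000768, 35, 51.47))  # 9000 files: ~33 MB/s, both runs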
Re: [zfs-discuss] L2ARC support in Solaris 10 (Update 8?)
On Thu, 23 Jul 2009, Scott Lawson wrote:
> The plan is to have a couple of x4240's with dual quad-core
> processors, 16 GB RAM and 6 x 146 GB 10K SAS drives plus 1 x 32 GB
> SSD as L2ARC. I can add this later if support is not available at
> build time but is roadmapped for U8.

I suggest maxing out your server RAM capacity before worrying about
adding an L2ARC. The reason why is that RAM is full speed and contains
the L1ARC. The only reason to do otherwise is if you can't afford it.

Bob
-- 
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Re: [zfs-discuss] Understanding SAS/SATA Backplanes and Connectivity
On Fri, 17 Jul 2009 14:16:32 -0400
Miles Nordin <car...@ivy.net> wrote:

>>>>>> "rl" == Rob Logan <r...@logan.com> writes:
>
>    rl> Is there some magic that load balances the 4 SAS ports as
>    rl> this shows up as one scsi-bus?
>
> The LSI card is not SATA framework. I've the impression drive
> enumeration and topology is handled by the proprietary firmware on
> the card, so it's likely there isn't any explicit support for SAS
> expanders inside solaris's binary mpt driver at all.

There kinda is - mpt(7d) detects SAS expanders as SCSI Enclosure
Services devices (which is what the spec says), and passes the
enumeration off to ses(7d) or sgen(7d), depending on what you've got as
a device alias for scsiclass,0d. We also (in NV since build 81, S10
Update 6) detect and correctly handle Serial Management Protocol
instances, which SAS expanders hook into. The SAS HBA chip passes SMP
frames to and from the expander.

> If you have x86 I think you can explore topology using the bootup
> Blue Screens of Setup, but I don't have anything with SAS expander
> to test it.

Yes, that's correct; the bluescreenofsetup allows you to do some
minimal viewing of the config.

> I think the SAS standard itself has a concept of ``wide ports'' like
> infiniband or PCIe, so I would speculate the 4 pairs are treated as
> lanes rather than ports.

mpt(7d) bundles the phys and only shows one controller for internal and
one controller for external connections - on a physical hba basis.

cheers,
James C. McPherson
--
Senior Kernel Software Engineer, Solaris
Sun Microsystems
http://blogs.sun.com/jmcp   http://www.jmcp.homeunix.com/blog
Kernel Conference Australia - http://au.sun.com/sunnews/events/2009/kernel
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
Good news; the manual for the M4N78-VM mentions ECC and gives the
following BIOS options: disabled/basic/good/super/maxi/user. Unsure
what these mean, but that's a start.
Re: [zfs-discuss] Motherboard for home zfs/solaris file server
Found this:

  ECC Mode [Disabled]
  Disables or sets the DRAM ECC mode that allows the hardware to report
  and correct memory errors. Set this item to [Basic], [Good] or [Max]
  to allow ECC mode auto-adjustment. Set this item to [Super] to adjust
  the DRAM BG Scrub sub-item manually. You may also adjust all
  sub-items by setting this item to [User].
  Configuration options: [Disabled] [Basic] [Good] [Super] [Max] [User]

I would have thought the checksum was either good or not. Apparently
it's not so simple. Now about that unique PCIe-16 slot?
Re: [zfs-discuss] virtualization, alignment and zfs variable stripes
Hmm.. I guess that's what I've heard as well. I do run compression and
believe a lot of others would as well. So then, it seems to me that if
I have guests that run a filesystem formatted with 4k blocks, for
example, I'm inevitably going to have this overlap when using ZFS
network storage? So if A were zfs blocks and B were virtualized guest
blocks, I think it might look like this with compression on?

    | B1 | B2 | B3 | B4 |
  | A1 | A2 | A3 | A4 |

So if the guest OS wants blocks B2 or B4, it actually has to read 2
blocks from the underlying zfs storage?
Re: [zfs-discuss] virtualization, alignment and zfs variation stripes
On Thu, Jul 23, 2009 at 12:29 PM, thomas <no-re...@opensolaris.org> wrote:
> Hmm.. I guess that's what I've heard as well. I do run compression
> and believe a lot of others would as well. So then, it seems to me
> that if I have guests that run a filesystem formatted with 4k blocks,
> for example, I'm inevitably going to have this overlap when using ZFS
> network storage?
>
> So if the guest OS wants blocks B2 or B4, it actually has to read 2
> blocks from the underlying zfs storage?

AFAIK if you use a zvol and set the zfs volblocksize to be the same as
the fs block size on the virtualized system (which is 4k by default for
a several-GB disk/partition with ext3/ntfs), every virtualized block
read should correspond to one zfs block read. If you set compression
on, the actual bytes read from the storage will not always be 4k
though; it can be less, depending on how compressible the data is.

-- 
Fajar
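A tiny model of the alignment effect under discussion; the offsets and
sizes below are illustrative:

  # How many backend (zvol) blocks a guest block read touches, given
  # the guest's block size, the zvol volblocksize, and any offset skew
  # introduced by partition tables etc. Illustrative model only.

  def backend_blocks(guest_off, guest_bs, volblocksize):
      first = guest_off // volblocksize
      last = (guest_off + guest_bs - 1) // volblocksize
      return last - first + 1

  # Aligned: 4K guest blocks on 4K volblocksize -> 1 backend read each.
  print(backend_blocks(8192, 4096, 4096))        # 1

  # Skewed by a 512-byte partition offset -> every read straddles two.
  print(backend_blocks(8192 + 512, 4096, 4096))  # 2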