Re: [zfs-discuss] about btrfs and zfs
Or, if you absolutely must run linux for the operating system, see: http://zfsonlinux.org/ On Oct 17, 2011, at 8:55 AM, Freddie Cash wrote: If you absolutely must run Linux on your storage server, for whatever reason, then you probably won't be running ZFS. For the next year or two, it would probably be safer to run software RAID (md), with LVM on top, with XFS or Ext4 on top. It's not the easiest setup to manage, but it would be safer than btrfs. ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] I'm back!
Warm welcomes back. So what's neXt? - Mike DeMan On Sep 2, 2011, at 6:30 PM, Erik Trimble wrote: Hi folks. I'm now no longer at Oracle, and the past couple of weeks have been a bit of a mess for me as I disentangle myself from it. I apologize to those who may have tried to contact me during August, as my @oracle.com email is no longer being read by me, and I didn't have a lot of extra time to devote to things like making sure my email subscription lists pointed to my personal email. I've done that now. I now have a free(er) hand to do some work in IllumOS (hopefully, in ZFS in particular), so I'm looking forward to getting back into the swing of things. And, hopefully, not being too much of a PITA. :-) -Erik Trimble tr...@netdemons.com
Re: [zfs-discuss] Zil on multiple usb keys
+1 on the below, and in addition... ...compact flash, like USB sticks, is not designed to deal with very many writes. Commonly it is used to store a bootable image that maybe once a year gets an upgrade. Basically, if you try to use those devices for a ZIL, even if they are mirrored, you should be prepared to have one die and be replaced very, very regularly. Performance is generally going to be pretty bad as well - USB sticks are not made to be written to rapidly. They are entirely different animals than SSDs. I would not be surprised (but would be curious to know, if you still move forward on this) if you find performance even worse trying to do this. On Jul 18, 2011, at 1:54 AM, Fajar A. Nugraha wrote: First of all, using USB disks for permanent storage is a bad idea. Go for e-sata instead (http://en.wikipedia.org/wiki/Serial_ata#eSATA). It
Re: [zfs-discuss] Have my RMA... Now what??
Always pre-purchase one extra drive to have on hand. When you get it, confirm it was not dead-on-arrival by hooking it up via external USB to a workstation and running whatever your favorite tools are to validate it is okay. Then put it back in its original packaging, and put a label on it saying what it is, and that it is a spare for the box(es) of the XYZ disk system. When a drive fails, use that one off the shelf to do your replacement immediately, then deal with the RMA, paperwork, and snailmail to get the bad drive replaced. Also, depending on how many disks you have in your array, keeping multiple spares can be a good idea as well, to cover another disk dying while waiting on that replacement. In my opinion, the above goes whether you have your disk system configured with a hot spare or not. And the technique is applicable to both personal/home and commercial use if your data is important. - Mike On May 28, 2011, at 9:30 AM, Brian wrote: I have a raidz2 pool with one disk that seems to be going bad, several errors are noted in iostat. I have an RMA for the drive, however - now I am wondering how I proceed. I need to send the drive in and then they will send me one back. If I had the drive on hand, I could do a zpool replace. Do I do a zpool offline? zpool detach? Once I get the drive back and put it in the same drive bay... Is it just a zpool replace device?
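For Brian's procedural question, a hedged sketch of the usual sequence - the pool name and device name here are hypothetical, and your layout will differ:

```shell
# Hypothetical pool 'tank' and failing disk 'c1t5d0' - substitute your own names.
# 1. Take the failing disk offline so ZFS stops issuing I/O to it:
zpool offline tank c1t5d0
# 2. Physically swap in the on-hand spare, then start the resilver.
#    When the new disk sits in the old one's bay, the one-argument form works:
zpool replace tank c1t5d0
# 3. Watch the resilver and check for errors before trusting the pool again:
zpool status -v tank
```

zpool detach is only for removing a disk from a mirror, so it doesn't apply to a raidz2; offline followed by replace is the path here.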
Re: [zfs-discuss] Have my RMA... Now what??
Yes, particularly if you have older drives with 512-byte sectors and then buy a newer drive that seems the same, but is not, because it has 4K sectors. It looks like it works, and will work, but performance drops. On May 28, 2011, at 4:59 PM, Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D. wrote: yes good idea, another thing to keep in mind: technology changes so fast that by the time you want a replacement, maybe the HDD doesn't exist any more or the supplier has changed, so the drives are not exactly like your original drive On 5/28/2011 6:05 PM, Michael DeMan wrote: Always pre-purchase one extra drive to have on hand. [...] On May 28, 2011, at 9:30 AM, Brian wrote: I have a raidz2 pool with one disk that seems to be going bad [...] Is it just a zpool replace device?
Re: [zfs-discuss] best migration path from Solaris 10
I think on this, the big question is going to be whether Oracle continues to release ZFS updates under the CDDL after their commercial releases. Overall, in the past it has obviously and necessarily been the case that FreeBSD has been a '2nd class citizen'. Moving forward, that 2nd-class idea becomes very mutable - and ironically it becomes more so in regards to dealing with organizations that have longevity. Moving forward... If Oracle continues to release critical ZFS feature sets under the CDDL to the community, then: A) They are no longer pre-releasing those features to OpenSolaris B) FreeBSD gets them at the same time. If Oracle does not continue to release ZFS feature sets under the CDDL, then the game changes. Pick your choice of operating systems - one that has a history of surviving for nearly two decades on its own with community support, or the 'green leaf off the dead tree' that just decided to jump into the willy-nilly world without direct/giant corporate support. The 2nd-class citizen issue for FreeBSD disappears either way. The only question left would be the remaining cruft of legal disposition. I could for instance see NetApp or somebody try to sue ixSystems, but I have a really, really rough time seeing Oracle/LarryEllison suing the FreeBSD foundation or something. Oh yeah - plus BTRFS on the horizon? Honestly - I am not here to start a flame war - I am asking these questions because businesses both big and small need to know what to do. My hunch is, we all have to wait and see if Oracle releases ZFS updates after Solaris 11, and if so, whether that is a subset of functionality or full functionality. - mike On Mar 19, 2011, at 11:54 PM, Fajar A. Nugraha wrote: On Sun, Mar 20, 2011 at 4:05 AM, Pawel Jakub Dawidek p...@freebsd.org wrote: On Fri, Mar 18, 2011 at 06:22:01PM -0700, Garrett D'Amore wrote: Newer versions of FreeBSD have newer ZFS code. Yes, we are at v28 at this point (the latest open-source version).
That said, ZFS on FreeBSD is kind of a 2nd class citizen still. [...] That's actually not true. There are more FreeBSD committers working on ZFS than on UFS. How is the performance of ZFS under FreeBSD? Is it comparable to that in Solaris, or still slower due to some needed compatibility layer? -- Fajar
Re: [zfs-discuss] best migration path from Solaris 10
I think we all feel the same pain with Oracle's purchase of Sun. FreeBSD that has commercial support for ZFS maybe? Not here quite yet, but it is something being looked at by an F500 that I am currently on contract with. www.freenas.org, www.ixsystems.com. Not saying this would be the right solution by any means, but for that 'corporate barrier', sometimes the option to get both the hardware and ZFS from the same place, with support, helps out. - mike On Mar 18, 2011, at 2:56 PM, Paul B. Henson wrote: We've been running Solaris 10 for the past couple of years, primarily to leverage zfs to provide storage for about 40,000 faculty, staff, and students as well as about 1000 groups. Access is provided via NFSv4, CIFS (by samba), and http/https (including a local module allowing filesystem acl's to be respected via web access). This has worked reasonably well barring some ongoing issues with scalability (approximately a 2 hour reboot window on an x4500 with ~8000 zfs filesystems, complete breakage of live upgrade) and acl/chmod interaction madness. We were just about to start working on a cutover to OpenSolaris (for the in-kernel CIFS server, and quicker access to new features/developments) when Oracle finished assimilating Sun and killed off the OpenSolaris distribution. We've been sitting pat for a while to see how things ended up shaking out, and at this point want to start reevaluating our best migration option to move forward from Solaris 10. There's really nothing else available that is comparable to zfs (perhaps btrfs someday in the indefinite future, but who knows when that day might come), so our options would appear to be Solaris 11 Express, Nexenta (either NexentaStor or NexentaCore), and OpenIndiana (FreeBSD is occasionally mentioned as a possibility, but I don't really see that as suitable for our enterprise needs). 
Solaris 11 is the official successor to OpenSolaris, has commercial support, and the backing of a huge corporation which historically has contributed the majority of Solaris forward development. However, that corporation is Oracle, and frankly, I don't like doing business with Oracle. With no offense intended to the no doubt numerous talented and goodhearted people that might work there, Oracle is simply evil. We've dealt with Oracle for a long time (in addition to their database itself, we're a PeopleSoft shop) and a positive interaction with them is quite rare. Since they took over Sun, costs on licensing, support contracts, and hardware have increased dramatically, at least in the cases where we've actually been able to get a quote. Arguably, we are not their target market, and they make that quite clear ;). There's also been significant brain drain of prior Sun employees since the takeover, so while they might still continue to contribute the most money into Solaris development, they might not be the future source of the most innovation. Given our needs, and our budget, I really don't consider this a viable option. Nexenta, on the other hand, seems to be the kind of company I'd like to deal with. Relatively small, nimble, with a ton of former Sun zfs talent working for them, and what appears to be actual consideration for the needs of their customers. I think I'd more likely get my needs addressed through Nexenta, they've already started work on adding aclmode back and I've had some initial discussion with one of their engineers on the possibility of adding additional options such as denying or ignoring attempted chmod updates on objects with acls. It looks like they only offer commercial support for NexentaStor, not NexentaCore. Commercial support isn't a strict requirement, a sizable chunk of our infrastructure runs on a non-commercial linux distribution and open source software, but it can make management happier.
NexentaStor seems positioned as a storage appliance, which isn't really what we need. I'm not particularly interested in a web gui or cli interface that hides the underlying complexity of the operating system and zfs, on the contrary, I want full access to the guts :). We have our zfs deployment integrated into our identity management system, which automatically provisions, destroys, and maintains filespace for our user/groups, as well as providing an API for end-users and administrators to manage quotas and other attributes. We also run apache with some custom modules. I still need to investigate further, but I'm not even sure if NexentaStor provides access into the underlying OS or encapsulates everything and only allows control through its own administrative functionality. NexentaCore is more of the raw operating system we're probably looking for, but with only community-based support. Given that NexentaCore and OpenIndiana are now both going to be based off of the illumos core, I'm not quite certain what's going to distinguish them.
Re: [zfs-discuss] [OpenIndiana-discuss] best migration path from Solaris 10
Hi David, Caught your note about bonnie - actually doing some testing myself over the weekend. All on older hardware for fun - dual Opteron 285 with 16GB RAM. The disk system is off a pair of SuperMicro SATA cards, with a combination of WD enterprise and Seagate ES 1TB drives. No ZIL, no L2ARC, no tuning at all from a base FreeNAS install. 10 drives total; I'm going to be running tests as below, mostly curious about IOPS and to sort out a little debate with a co-worker. - all 10 in one raidz2 (running now) - 5 x 2-way mirrors - 2 x 5-disk raidz1 Script is as below - if folks would find the data I collect useful at all, let me know and I will post it publicly somewhere.

freenas# cat test.sh
#!/bin/sh
# Basic test for file I/O. We run lots and lots of the traditional
# 'bonnie' tool at 50GB file size, starting one every minute. Resulting
# data should give us a good work mixture in the middle given all the different
# tests that bonnie runs, 100 instances running at the same time, and at different
# stages of their processing.
MAX=100
COUNT=0
FILESYSTEM=testrz2
LOG=${FILESYSTEM}.log
date > ${LOG}
echo "Test with file system named ${FILESYSTEM} and configuration of..." >> ${LOG}
zpool status >> ${LOG}
# DEMAN grab zfs and regular dev iostats every 10 minutes during test
zpool iostat -v 600 >> ${LOG} &
iostat -w 600 ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7 ada8 ada9 >> ${LOG}.iostat &
while [ $COUNT -le $MAX ]; do
    echo "kicking off bonnie"
    bonnie -d /mnt/${FILESYSTEM} -s 5 &
    sleep 60
    COUNT=$((COUNT+1))
done

On Mar 18, 2011, at 3:26 PM, David Brodbeck wrote: I'm in a similar position, so I'll be curious what kinds of responses you get. I can give you a thumbnail sketch of what I've looked at so far: I evaluated FreeBSD, and ruled it out because I need NFSv4, and FreeBSD's NFSv4 support is still in an early stage. The NFS stability and performance just isn't there yet, in my opinion.
Nexenta Core looked promising, but locked up in bonnie++ NFS testing with our RedHat nodes, so its stability is a bit of a question mark for me. I haven't gotten the opportunity to thoroughly evaluate OpenIndiana, yet. It's only available as a DVD ISO, and my test machine currently has only a CD-ROM drive. Changing that is on my to-do list, but other things keep slipping in ahead of it. For now I'm running OpenSolaris, with a locally-compiled version of Samba. (The OpenSolaris Samba package is very old and has several unpatched security holes, at this point.) -- David Brodbeck System Administrator, Linguistics University of Washington
Re: [zfs-discuss] best migration path from Solaris 10
ZFSv28 is in HEAD now and will be out in 8.3. ZFS + HAST in 9.x means being able to cluster off different hardware. In regards to OpenSolaris and Indiana - can somebody clarify the relationship there? It was clear with OpenSolaris that the latest/greatest ZFS would always be available since it was a guinea-pig product for cost conscious folks and served as an excellent area for Sun to get marketplace feedback and bug fixes done before rolling updates into full Solaris. To me it seems that Open Indiana is basically a green branch off of a dead tree - if I am wrong, please enlighten me. On Mar 18, 2011, at 6:16 PM, Roy Sigurd Karlsbakk wrote: I think we all feel the same pain with Oracle's purchase of Sun. FreeBSD that has commercial support for ZFS maybe? Fbsd currently has a very old zpool version, not suitable for running with SLOGs, since if you lose it, you may lose the pool, which isn't very amusing... Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Jan 7, 2011, at 6:13 AM, David Magda wrote: On Fri, January 7, 2011 01:42, Michael DeMan wrote: Then - there is the other side of things. The 'black swan' event. At some point, given percentages on a scenario like the example case above, one simply has to make the business justification case internally at their own company about whether to go SHA-256 only or Fletcher+Verification? Add Murphy's Law to the 'black swan event' and of course the only data that is lost is that .01% of your data that is the most critical? The other thing to note is that by default (with de-dupe disabled), ZFS uses Fletcher checksums to prevent data corruption. Add also the fact that all other file systems don't have any checksums, and simply rely on the fact that disks have a bit error rate of (at best) 10^-16. Agreed - but I think it is still missing the point of what the original poster was asking about. In all honesty I think the debate is a business decision - the highly improbable vs. certainty. Somebody somewhere must have written this stuff up, along with simple use cases? Perhaps even a new acronym? MTBC - mean time before collision? And even with the 'certainty' factor being the choice - other things like human error come into play and are far riskier?
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
At the end of the day this issue essentially is about mathematical improbability versus certainty? To be quite honest, I too am skeptical about using de-dupe just based on SHA256. In prior posts it was asked that the potential adopter of the technology provide the mathematical reason to NOT use SHA-256 only. However, if Oracle believes that it is adequate to do that, would it be possible for somebody to provide: (A) The theoretical documents and associated mathematics specific to say one simple use case? (A1) Total data size is 1PB (lets say the zpool is 2PB to not worry about that part of it). (A2) Daily, 10TB of data is updated, 1TB of data is deleted, and 1TB of data is 'new'. (A3) Out of the dataset, 25% of the data is capable of being de-duplicated (A4) Between A2 and A3 above, the 25% rule from A3 also applies to everything in A2. I think the above would be a pretty 'soft' case for justifying the claim that SHA-256 works? I would presume some kind of simple scenario like this was run mathematically by somebody inside Oracle/Sun long ago when first proposing that ZFS be funded internally at all? Then - there is the other side of things. The 'black swan' event. At some point, given percentages on a scenario like the example case above, one simply has to make the business justification case internally at their own company about whether to go SHA-256 only or Fletcher+Verification? Add Murphy's Law to the 'black swan event' and of course the only data that is lost is that .01% of your data that is the most critical? Not trying to be aggressive or combative here at all against people's opinions and understandings of it all - I would just like to see some hard information about it all - it must exist somewhere already?
Thanks, - Mike On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote: From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps Perhaps (Sha256+NoVerification) would work 99.99% of the time. But append 50 more 9's on there: 99.999...% See below. I have been told that the checksum value returned by Sha256 is almost guaranteed to be unique. In fact, if Sha256 fails in some case, we have a bigger problem such as memory corruption, etc. Essentially, adding verification to sha256 is overkill. Someone please correct me if I'm wrong. I assume ZFS dedup matches both the blocksize and the checksum, right? A simple checksum collision (which is astronomically unlikely) is still not sufficient to produce corrupted data. It's even more unlikely than that. Using the above assumption, here's how you calculate the probability of corruption if you're not using verification: Suppose every single block in your whole pool is precisely the same size (which is unrealistic in the real world, but I'm trying to calculate worst case.) Suppose the block is 4K, which is, again, unrealistically worst case. Suppose your dataset is purely random or sequential ... with no duplicated data ... which is unrealistic because if your data is like that, then why in the world are you enabling dedupe? But again, assuming worst case scenario... At this point we'll throw in some evil clowns, spit on a voodoo priestess, and curse the heavens for some extra bad luck. If you have astronomically infinite quantities of data, then your probability of corruption approaches 100%. With infinite data, eventually you're guaranteed to have a collision. So the probability of corruption is directly related to the total amount of data you have, and the new question is: For anything Earthly, how near are you to 0% probability of collision in reality? Suppose you have 128TB of data. That is ... you have 2^35 unique 4k blocks of uniformly sized data.
Then the probability you have any collision in your whole dataset is (sum(1 thru 2^35))*2^-256 Note: sum of integers from 1 to N is (N*(N+1))/2 Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35 Note: (N*(N+1))/2 in this case = 2^69 + 2^34 So the probability of data corruption in this case, is 2^-187 + 2^-222 ~= 5.1E-57 + 1.5E-67 ~= 5.1E-57 In other words, even in the absolute worst case, cursing the gods, running without verification, using data that's specifically formulated to try and cause errors, on a dataset that I bet is larger than what you're doing, ... Before we go any further ... The total number of bits stored on all the storage in the whole planet is a lot smaller than the total number of molecules in the planet. There are an estimated 8.87 * 10^49 molecules in planet Earth. The probability of a collision in your worst-case unrealistic dataset as described, is even 100 million times less likely than randomly finding a single specific molecule in the whole planet Earth by pure luck.
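Ned's worst-case arithmetic above can be checked mechanically. A quick sketch using awk for the floating-point math - the 128TB / 4K-block / 2^-256 figures are taken straight from his setup, nothing else assumed:

```shell
# Birthday-style bound: probability that ANY pair of 2^35 blocks shares a
# SHA-256 checksum, assuming uniformly random 256-bit hashes.
awk 'BEGIN {
  n     = 2^35                 # 128TB of data as unique 4K blocks
  pairs = n * (n + 1) / 2      # sum of integers 1..n, as in the post
  p     = pairs * 2^-256       # chance that some pair collides
  printf "collision probability ~= %.1e\n", p
}'
```

This prints roughly 5.1e-57, matching the 2^-187 + 2^-222 figure derived by hand above.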
Re: [zfs-discuss] [RFC] Backup solution
On Oct 8, 2010, at 4:33 AM, Edward Ned Harvey wrote: From: Peter Jeremy [mailto:peter.jer...@alcatel-lucent.com] Sent: Thursday, October 07, 2010 10:02 PM On 2010-Oct-08 09:07:34 +0800, Edward Ned Harvey sh...@nedharvey.com wrote: If you're going raidz3, with 7 disks, then you might as well just make mirrors instead, and eliminate the slow resilver. There is a difference in reliability: raidzN means _any_ N disks can fail, whereas mirror means one disk in each mirror pair can fail. With a mirror, Murphy's Law says that the second disk to fail will be the pair of the first disk :-). Maybe. But in reality, you're just guessing the probability of a single failure, the probability of multiple failures, and the probability of multiple failures within the critical time window and critical redundancy set. The probability of a 2nd failure within the critical time window is smaller whenever the critical time window is decreased, and the probability of that failure being within the critical redundancy set is smaller whenever your critical redundancy set is smaller. So if raidz2 takes twice as long to resilver than a mirror, and has a larger critical redundancy set, then you haven't gained any probable resiliency over a mirror. Although it's true with mirrors, it's possible for 2 disks to fail and result in loss of pool, I think the probability of that happening is smaller than the probability of a 3-disk failure in the raidz2. How much longer does a 7-disk raidz2 take to resilver as compared to a mirror? According to my calculations, it's in the vicinity of 10x longer. This article has been posted elsewhere, is about 10 months old, but is a good read: http://queue.acm.org/detail.cfm?id=1670144 Really, there should be a ballpark / back of the napkin formula to be able to calculate this? I've been curious about this too, so here goes a 1st cut... DR = disk reliability, in terms of chance of the disk dying in any given time period, say any given hour? 
DFW = disk full write - time to write every sector on the disk. This will vary depending on system load, but is still an input item that can be determined by some testing.
RSM = resilver time for a mirror of two of the given disks
RSZ1 = resilver time for a raidz1 vdev of the given disks
RSZ2 = resilver time for a raidz2 vdev of the given disks

chances of losing all data in a mirror: DLM = RSM * DR
chances of losing all data in a raidz1: DLRZ1 = RSZ1 * DR
chances of losing all data in a raidz2: DLRZ2 = RSZ2 * DR * DR

Now, for the above, I'll make some other assumptions... Let's just guess at a 1-year MTBF for our disks, and for purposes here, just flat-line that at a constant failure chance per hour throughout the year. Let's presume rebuilding a mirror takes one hour. Let's presume that a 7-disk raidz1 takes 24 times longer to rebuild one disk than a mirror - I think this would be a 'safe' ratio to the benefit of the mirror. Let's presume that a 7-disk raidz2 takes 72 times longer to rebuild one disk than a mirror - this should be 'safe' and again benefit the mirror.

DR for a one-hour period = 1 / (24 hours * 365 days) = .000114 - the chance a disk might die in any given hour.
DLM = one hour * DR = .000114
DLRZ1 = 24 hours * (.000114 * 6) (x6 because there are six more drives in the pool, and any one of them could fail)
DLRZ2 = 72 hours * (.000114 * 6 disks) * (.000114 * 5 disks) = a much tinier chance of losing all that data.

A better way to think about it, maybe: based on our 1-year flat-line MTBF for disks, to figure out how much faster the mirror must rebuild for reliability to be the same as a raidz2...
DLM = DLRZ2
.000114 * 1 hour = X hours * (.000114 * 6 disks) * (.000114 * 5 disks)
1 = X * (.000114 * 6) * 5
X = 1 / .00342 ~= 300 hours

So, the mirror would have to resilver roughly three hundred times faster than the raidz2 (1 / .00342) in order for the raidz2 to offer only the same level of reliability in regards to the chances of losing the entire vdev due to additional disk failures during a resilver. The governing thing here is that O(2) level of reliability based on expected chances of failure of additional disks during any given moment in time, vs. O(1) for mirrors and raidz1. Note that the above is O(2) for raidz2 and O(1) for mirror/raidz1, because we are working on the assumption we have already lost one disk. With raidz3, we would have 1 / (.000114 * 4 disks remaining in pool), or about 2,000 times more reliability? Now, the above does not include things like proper statistics that the chances of that 2nd and 3rd disk failing (even correlations) may be higher than our 'flat-line' %/hr. based on 1-year MTBF, or stuff like if all the disks were purchased in the same lots and at the same time, so their chances of failing around the same time is higher, etc.
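The break-even point in the back-of-the-napkin derivation above can be computed mechanically. A sketch following the post's own DLM = DLRZ2 formulas - using exactly 1/8760 per hour for DR, which lands at 292 rather than the rounded ~300:

```shell
# Solve DLM = DLRZ2 for the raidz2 resilver time X, per the post's formulas:
#   DLM   = 1 hour * DR                 (mirror: one surviving disk at risk)
#   DLRZ2 = X hours * (6*DR) * (5*DR)   (raidz2: two more disks must die)
awk 'BEGIN {
  dr = 1 / (24 * 365)                    # chance a disk dies in any given hour (flat 1-year MTBF)
  x  = (1 * dr) / ((6 * dr) * (5 * dr))  # simplifies to 1 / (30 * dr)
  printf "break-even raidz2 resilver time: %.0f hours\n", x
}'
```

This prints 292 hours - i.e. under these flat-line assumptions the raidz2 resilver can run roughly 300 times slower than the mirror's before the two layouts carry equal risk.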
Re: [zfs-discuss] TLER and ZFS
Can you give us release numbers that confirm that this is 'automatic'? It is my understanding that the last available public release of OpenSolaris does not do this. On Oct 5, 2010, at 8:52 PM, Richard Elling wrote: ZFS already aligns the beginning of data areas to 4KB offsets from the label. For modern OpenSolaris and Solaris implementations, the default starting block for partitions is also aligned to 4KB.
Re: [zfs-discuss] TLER and ZFS
I'm not sure on the TLER issues by themselves, but after the nightmares I have gone through dealing with the 'green' drives, which have both the TLER issue and the IntelliPower head-parking issues, I would just stay away from it all entirely and pay extra for the 'RAID Edition' drives. Just out of curiosity, I took a peek at newegg. Western Digital RE3 WD1002FBYS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal Hard Drive - Bare Drive is only $129, vs. $89 for the 'regular' Black drives. A 45% higher price, but it is my understanding that the 'RAID Edition' ones are also physically constructed for longer life, lower vibration levels, etc. On Oct 5, 2010, at 1:30 PM, Roy Sigurd Karlsbakk wrote: Hi all I just discovered WD Black drives are rumored not to allow TLER to be set. Does anyone know how much performance impact the lack of TLER might have on a large pool? Choosing Enterprise drives will cost about 60% more, and on a large install, that means a lot of money... Vennlige hilsener / Best regards roy -- Roy Sigurd Karlsbakk (+47) 97542685 r...@karlsbakk.net http://blogg.karlsbakk.net/ -- I all pedagogikk er det essensielt at pensum presenteres intelligibelt. Det er et elementært imperativ for alle pedagoger å unngå eksessiv anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller eksisterer adekvate og relevante synonymer på norsk.
Re: [zfs-discuss] TLER and ZFS
On Oct 5, 2010, at 1:47 PM, Roy Sigurd Karlsbakk wrote: Western Digital RE3 WD1002FBYS 1TB 7200 RPM SATA 3.0Gb/s 3.5" Internal Hard Drive - Bare Drive is only $129, vs. $89 for the 'regular' black drives. A 45% higher price, but it is my understanding that the 'RAID Edition' ones are also physically constructed for longer life, lower vibration levels, etc. Well, here it's about 60% up, and for 150 drives, that makes a wee difference... Vennlige hilsener / Best regards roy Understood on 1.6 times the cost, especially for a quantity of 150 drives. I think (and if I am wrong, somebody else correct me) that if you are using commodity controllers, which seem to generally be fine for ZFS, then if a drive times out trying to constantly re-read a bad sector, it could stall out reads on the entire pool. On the other hand, if the drives are exported as JBOD from a RAID controller, I would think the RAID controller itself would just mark the drive as bad and offline it quickly based on its own internal algorithms. The above is also relevant to the anticipated usage. For instance, if it is some sort of backup machine, then delays due to some reads stalling out without TLER are perhaps not a big deal. If it is for more of an up-front production use, that could be intolerable.
Re: [zfs-discuss] TLER and ZFS
On Oct 5, 2010, at 2:47 PM, casper@sun.com wrote: I've seen several important features when selecting a drive for a mirror: TLER (the ability of the drive to timeout a command) sector size (native vs virtual) power use (specifically at home) performance (mostly for work) price I've heard scary stories about a mismatch of the native sector size and unaligned Solaris partitions (4K sectors, unaligned cylinder). Yes, avoiding the 4K sector sizes is a huge issue right now too - another item I forgot on the reasons to absolutely avoid those WD 'green' drives. Three good reasons to avoid WD 'green' drives for ZFS... - TLER issues - IntelliPower head park issues - 4K sector size issues ...they are an absolute nightmare. The WD 1TB 'enterprise' drives are still 512-byte sector size and safe to use; who knows though, maybe they just started shipping with 4K sector size as I write this e-mail? Another annoying thing with the whole 4K sector size is what happens when you need to replace drives next year, or the year after? That part of the whole 4K sector migration has me more worried than what to buy today. Given the choice, I would prefer to buy 4K sector size now, but operating system support is still limited. Does anybody know if there are any vendors shipping 4K sector drives that have a jumper option to make them 512 size? WD has a jumper, but it is there explicitly to work with Windows XP, and is not a real way to dumb down the drive to 512. I would presume that any vendor shipping 4K sector size drives now, with a jumper to make it 'real' 512, would be supporting that over the long run? I would be interested, and probably others would be too, in what the original poster finally decides on this? - Mike ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] TLER and ZFS
Hi, and thanks upfront for the valuable information. On Oct 5, 2010, at 4:12 PM, Peter Jeremy wrote: Another annoying thing with the whole 4K sector size is what happens when you need to replace drives next year, or the year after? About the only mitigation needed is to ensure that any partitioning is based on multiples of 4KB. I agree, but to be quite honest, I have no clue how to do this with ZFS. It seems that it should be covered under the regular tuning documentation. http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide Is it going to be the case that basic information about how to deal with common scenarios like this is no longer publicly available, with Oracle simply keeping it 'close to the vest', so that the relevant information is only available to those who choose to research it themselves, or to those with certain levels of support contracts from Oracle? To put it another way: does the community that uses ZFS need to fork 'ZFS Best Practices' and 'ZFS Evil Tuning' to ensure that they remain reasonably up to date? Sorry for the somewhat hostile tone in the above, but the changes w/ the merger have demoralized a lot of folks, I think. - Mike ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
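Peter's "multiples of 4KB" mitigation can at least be checked mechanically. A minimal sketch, assuming partition start addresses are expressed in 512-byte LBA sectors (as most partitioning tools report them): a start is 4K-aligned exactly when it is divisible by 8. The start values below are illustrative examples, not taken from any real label.

```shell
# Report whether a partition start LBA (in 512-byte sectors) is 4K-aligned.
is_4k_aligned() {
    if [ $(( $1 % 8 )) -eq 0 ]; then
        echo aligned
    else
        echo misaligned
    fi
}

is_4k_aligned 2048   # a common modern default start
is_4k_aligned 63     # the old DOS-era cylinder-aligned start
```

A 4K-native drive presented with a misaligned partition has to do a read-modify-write for every 4K of I/O that straddles two physical sectors, which is where the scary performance stories come from.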
Re: [zfs-discuss] Opteron 6100? Does it work with opensolaris?
I agree on the motherboard and peripheral chipset issue. This and the last generation of AMD quad/six core motherboards all seem to use the AMD SP56x0/SP5100 chipset, for which I can't find much information about support under either OpenSolaris or FreeBSD. Another issue is the LSI SAS2008 SAS controller chipset, which is frequently offered as an onboard option for many motherboards as well and still seems to be somewhat of a work in progress in regards to being 'production ready'. On May 11, 2010, at 3:29 PM, Brandon High wrote: On Tue, May 11, 2010 at 5:29 AM, Thomas Burgess wonsl...@gmail.com wrote: I'm specifically looking at this motherboard: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182230 I'd be more concerned that the motherboard and its attached peripherals are unsupported than the processor. Solaris can handle 12 cores with no problems. -B -- Brandon High : bh...@freaks.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] SSD best practices
By the way, I would like to chip in about how informative this thread has been, at least for me, despite (and actually because of) the strong opinions in some of the posts about the issues involved. From what I gather, there is still an interesting failure possibility with ZFS, although probably rare. In the case where a zil (aka slog) device fails, AND the zpool.cache information is not available, basically folks are toast? In addition, the zpool.cache itself exhibits the following behaviors (and I could be totally wrong, this is why I ask): A. It is not written to frequently, i.e., it is not a performance impact unless new zfs file systems (pardon me if I have the incorrect terminology) are not being fabricated and supplied to the underlying operating system. B. The current implementation stores that cache file on the zil device, so if for some reason that device is totally lost (along with said .cache file), it is nigh impossible to recover the entire pool it correlates with. possible solutions: 1. Why not have an option to mirror that darn cache file (like to the root file system of the boot device, at least as an initial implementation) no matter what intent log devices are present? Presuming that most folks at least want enough redundancy that their machine will boot, and if it boots, then they have a shot at recovery of the balance of the associated (zfs) directly attached storage; with my other presumptions above, there is little reason not to offer a feature like this? Respectfully, - mike On Apr 18, 2010, at 10:10 PM, Richard Elling wrote: On Apr 18, 2010, at 7:02 PM, Don wrote: If you have a pair of heads talking to shared disks with ZFS- what can you do to ensure the second head always has a current copy of the zpool.cache file? By definition, the zpool.cache file is always up to date. I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't import the pool on my second head. I'd rather not have multiple failures, either. 
But the information needed in the zpool.cache file for reconstructing a missing (as in destroyed) top-level vdev is easily recovered from a backup or snapshot. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
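The "mirror that darn cache file" idea in point 1 above can be approximated today with nothing more than a periodic copy. A minimal sketch, assuming the usual Solaris path /etc/zfs/zpool.cache and a backup destination of your own choosing (both passed as arguments, so nothing here is hard-wired):

```shell
# Copy zpool.cache to a second location whenever it has changed.
# Usage: backup_cache /etc/zfs/zpool.cache /var/backups/zpool.cache
backup_cache() {
    cache=$1
    backup=$2
    [ -f "$cache" ] || return 0                 # nothing to do if no cache yet
    if ! cmp -s "$cache" "$backup" 2>/dev/null; then
        cp "$cache" "$backup"                   # differs, or backup missing: copy
    fi
}
```

Run from cron, this keeps a recoverable copy on the boot disk without touching ZFS internals. It is a workaround, not the integrated feature proposed above, and it only helps if the boot disk survives whatever took out the pool.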
[zfs-discuss] newbie, WAS: Re: SSD best practices
Also, pardon my typos, and my lack of re-titling my subject to note that it is a fork from the original topic. Corrections in text that I noticed after finally sorting out getting on the mailing list are below... On Apr 19, 2010, at 3:26 AM, Michael DeMan wrote: By the way, I would like to chip in about how informative this thread has been, at least for me, despite (and actually because of) the strong opinions on some of the posts about the issues involved. From what I gather, there is still an interesting failure possibility with ZFS, although probably rare. In the case where a zil (aka slog) device fails, AND the zpool.cache information is not available, basically folks are toast? In addition, the zpool.cache itself exhibits the following behaviors (and I could be totally wrong, this is why I ask): A. It is not written to frequently, i.e., it is not a performance impact unless new zfs file systems (pardon me if I have the incorrect terminology) are not being fabricated and supplied to the underlying operating system. The above 'are not being fabricated' should be 'are regularly being fabricated' B. The current implementation stores that cache file on the zil device, so if for some reason, that device is totally lost (along with said .cache file), it is nigh impossible to recover the entire pool it correlates with. The above, 'on the zil device', should say 'on the fundamental zfs file system itself, or a zil device if one is provisioned' possible solutions: 1. Why not have an option to mirror that darn cache file (like to the root file system of the boot device at least as an initial implementation) no matter what intent log devices are present? Presuming that most folks at least want enough redundancy that their machine will boot, and if it boots - then they have a shot at recovery of the balance of the associated (zfs) directly attached storage, and with my other presumptions above, there is little reason do not to offer a feature like this? 
Missing final sentence: The vast majority of problems with computer and network reliability are typically related to human error. The more '9s' that can be intrinsically provided by the systems themselves, the better this is mitigated. Respectfully, - mike On Apr 18, 2010, at 10:10 PM, Richard Elling wrote: On Apr 18, 2010, at 7:02 PM, Don wrote: If you have a pair of heads talking to shared disks with ZFS- what can you do to ensure the second head always has a current copy of the zpool.cache file? By definition, the zpool.cache file is always up to date. I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't import the pool on my second head. I'd rather not have multiple failures, either. But the information needed in the zpool.cache file for reconstructing a missing (as in destroyed) top-level vdev is easily recovered from a backup or snapshot. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] newbie, WAS: Re: SSD best practices
By the way, I would like to chip in about how informative this thread has been, at least for me, despite (and actually because of) the strong opinions on some of the posts about the issues involved. From what I gather, there is still an interesting failure possibility with ZFS, although probably rare. In the case where a zil (aka slog) device fails, AND the zpool.cache information is not available, basically folks are toast? In addition, the zpool.cache itself exhibits the following behaviors (and I could be totally wrong, this is why I ask): assumptions: A. It is not written to frequently, i.e., it is not a performance impact unless new zfs file systems (pardon me if I have the incorrect terminology) are not being fabricated and supplied to the underlying operating system. B. The current implementation stores that cache file on the zil file system, so if for some reason that device is totally lost, it is nigh impossible to recover the entire pool it correlates with. possible solutions: 1. Why not have an option to mirror that darn cache file, like to the root file system of the boot device at least? Presuming that most folks at least want enough redundancy that their machine will boot, and if it boots, have a shot at recovery of the associated directly attached storage; with my other presumptions above, there is little reason not to offer a feature like this? Respectfully, - mike On Apr 18, 2010, at 10:10 PM, Richard Elling wrote: On Apr 18, 2010, at 7:02 PM, Don wrote: If you have a pair of heads talking to shared disks with ZFS- what can you do to ensure the second head always has a current copy of the zpool.cache file? By definition, the zpool.cache file is always up to date. I'd prefer not to lose the ZIL, fail over, and then suddenly find out I can't import the pool on my second head. I'd rather not have multiple failures, either. 
But the information needed in the zpool.cache file for reconstructing a missing (as in destroyed) top-level vdev is easily recovered from a backup or snapshot. -- richard ZFS storage and performance consulting at http://www.RichardElling.com ZFS training on deduplication, NexentaStor, and NAS performance Las Vegas, April 29-30, 2010 http://nexenta-vegas.eventbrite.com ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] freeNAS moves to Linux from FreeBSD
Actually it appears that FreeNAS is forking, with planned support for both Linux (we can only speculate on the preferred backing file system) and FreeBSD with ZFS as the preferred backing file system. In regards to advocacy for using OpenSolaris vs. FreeBSD, I'm all ears if anybody is bold enough to clutter up this mailing list with it. A quick start from my perspective (and this is in no way complete) would be: Basically, I have a need for a modern file system with snapshots, both for internal purposes and to support vmware instances. De-duplication is a nice idea, but given our size, the balance between risk and dollars makes it easier to just have more disk space. Args for FreeBSD + ZFS: - Limited budget - We are familiar with managing FreeBSD. - We are familiar with tuning FreeBSD. - Licensing model Args against FreeBSD + ZFS: - Stability (?) - Possibly performance (although we have limited needs for CIFS) Args for OpenSolaris + ZFS: - Stability Args against OpenSolaris + ZFS: - Hardware compatibility - Lack of knowledge for tuning and associated costs for training staff to learn 'yet one more operating system' they need to support. - Licensing model On Dec 6, 2009, at 6:28 PM, Gary Gendel wrote: The only reason I thought this news would be of interest is that the discussions had some interesting comments. Basically, there is a significant outcry because zfs was going away. I saw NexentaOS and EON mentioned several times as the path to go. Seems that there is some opportunity for OpenSolaris advocacy in this arena while the topic is hot. Gary -- This message posted from opensolaris.org ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] mirroring ZIL device
Hi All, Sorry if this question is already addressed in the documentation, but I am still unclear about some details of ZIL devices. I am looking at provisioning some network attached storage with ZFS on the back end. In the interests of the 'inexpensive' part of the acronym 'RAID', I am looking at using SATA drives for the primary storage, striped over multiple mirrors. I will be putting the operating system on a separate set of traditionally mirrored 15K SAS disks (formatted UFS2). In the interests of performance, given the SATA drives for the main storage, I am considering provisioning a separate ZIL device, again on 15K SAS. For the ZIL device, my understanding is that if it fails, everything still 'just works'. Meanwhile, I tend to be a bit paranoid at times and am curious about having the ZIL device be mirrored. I suppose with hardware RAID, it would be obvious that I could just mirror two 15K SAS drives that way, export it out as a single device to the operating system, and I am covered. My question(s) are: #1. Is it worthwhile to allocate a fast ZIL device to help out with writes to the sluggish SATA main storage drives? It seems to me that this kind of case might be exactly what the ZIL is for? #2. Is it possible for me to mirror those two 15K SAS drives I want to use for the ZIL directly within ZFS? From the documentation, it seems not, and if I want to go this way, I'll want hardware RAID1. #3. Does anybody have any field/production information about just using low-end (but newer generation) SATA SSDs for the ZIL device? My concern here is obviously about performance degradation over time, since a ZIL device sees a write-heavy workload. (Again, small budget, I can't afford any of the fancy SSD stuff.) Thanks, - Mike ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
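On question #2: my understanding (please correct me if wrong) is that reasonably current zpool versions do accept a mirrored log vdev directly, no hardware RAID1 required. A sketch of the admin commands, with placeholder pool and device names (tank, c2t0d0, c2t1d0 are made up for illustration; these need root and real devices, so verify against your zpool man page first):

```sh
# Attach a mirrored pair of 15K SAS disks as the pool's separate log device.
zpool add tank log mirror c2t0d0 c2t1d0

# Verify: the mirrored log should show up under a "logs" section.
zpool status tank
```

If this works on your platform, it also sidesteps the hardware-RAID layer entirely, which keeps ZFS in charge of detecting a failing log-side disk.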