Re: [zfs-discuss] Raidz vdev size... again.
Bob Friesenhahn wrote:
> On Tue, 28 Apr 2009, Richard Elling wrote:
>> Yes and there is a very important point here. There are 2 different sorts of scrubbing: read and rewrite. ZFS (today) does read scrubbing, which does not reset the decay process. Some RAID arrays also do rewrite scrubs, which do reset the decay process. The problem with rewrite scrubbing is that you
>
> I am not convinced that there is a "decay" process. There is considerable magnetic hysteresis involved. It seems most likely that corruption happens all of a sudden, and involves more than one or two bits. More often than not we hear of a number of sectors failing at one time.

I suppose if you could freeze the media to 0K, then it would not decay. But that isn't the world I live in :-). There is a whole Journal devoted to things magnetic, with lots of studies of interesting compounds. But from a practical perspective, it is worth noting that some magnetic tapes have a rated shelf life of 8-10 years while enterprise-class backup tapes are rated at 30 years. Most disks have an expected operational life of 5 years or so. As Tim notes, it is a good idea to plan for migrating important data to newer devices over time.

-- richard
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, Apr 28, 2009 at 11:12 PM, Bob Friesenhahn <bfrie...@simple.dallas.tx.us> wrote:
> On Tue, 28 Apr 2009, Tim wrote:
>> I'll stick with the 3 year life cycle of the system followed by a hot migration to new storage, thank you very much.
>
> Once again there is a fixation on the idea that computers gradually degrade over time and that simply replacing the hardware before the expiration date (like a bottle of milk) will save the data. I recently took an old Sun system out of service that was approaching 12 years on the same disks with no known read errors. The Sun before that one was taken out of service after 11 years with no known read errors. Lucky me.
>
> Various papers I have read suggest that degradation is in fits and bursts and contrary to what one would expect based on vendor specifications.
>
> Bob

I don't recall saying anything about a computer wearing out. When net-new and faster/more space is cheaper than maintenance renewal, I'll stick with net-new.

--Tim
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, 28 Apr 2009, Tim wrote:
> I'll stick with the 3 year life cycle of the system followed by a hot migration to new storage, thank you very much.

Once again there is a fixation on the idea that computers gradually degrade over time and that simply replacing the hardware before the expiration date (like a bottle of milk) will save the data. I recently took an old Sun system out of service that was approaching 12 years on the same disks with no known read errors. The Sun before that one was taken out of service after 11 years with no known read errors. Lucky me.

Various papers I have read suggest that degradation is in fits and bursts and contrary to what one would expect based on vendor specifications.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, 28 Apr 2009, Richard Elling wrote:
> Yes and there is a very important point here. There are 2 different sorts of scrubbing: read and rewrite. ZFS (today) does read scrubbing, which does not reset the decay process. Some RAID arrays also do rewrite scrubs, which do reset the decay process. The problem with rewrite scrubbing is that you

I am not convinced that there is a "decay" process. There is considerable magnetic hysteresis involved. It seems most likely that corruption happens all of a sudden, and involves more than one or two bits. More often than not we hear of a number of sectors failing at one time.

Do you have a reference to research results which show that a gradual "decay" process is a significant factor?

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, 28 Apr 2009, Miles Nordin wrote:
> * it'd be harmful to do this on SSDs. it might also be a really good idea to do it on SSDs. who knows yet.

SSDs can be based on many types of technologies, and not just those that wear out.

> * it may be wasteful to do read/rewrite on an ordinary magnetic drive because if you just do a read, the drive should notice a decaying block and rewrite it without being told specifically, maybe. though from netapp's paper, they say they disable many of

Does the drive have the capability to detect when a sector is written to the wrong track? In order for it to detect that, the expected location would have to be written into the sector.

> In the end, though, I bet we may end up with this feature on ZFS in the disguise of a ``defragmenter''. If the defragmenter will promise to rewrite every block to a new spot, not just the ones it pleases, this will do the job of your ``write scrub'' and also solve the drive caching problem.

It seems doubtful that bulk re-writing of data will improve data integrity. Writing is dangerous.

Bob
--
Bob Friesenhahn
bfrie...@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, Apr 28, 2009 at 4:52 PM, Richard Elling wrote:
> Well done! Of course Hitachi doesn't use consumer-grade disks in their arrays...
>
> I'll also confess that I did set a bit of a math trap here :-) The trap is that if you ever have to recover data from tape/backup, then you'll have no chance of making 5-9s when using large volumes. Suppose you have a really nice backup system that can restore 10TBytes in 10 hours. To achieve 5-9s you'd need to make sure that you never have to restore from backups for the next 114 years. Since the expected lifetime of a disk is << 114 years, you'll have a poor chance of making it. So the problem really boils down to how sure you can be that you won't have an unrecoverable read during the expected lifetime of your system. Studies have shown [1] that you are much more likely to see this than you'd expect. The way to solve that problem is to use double parity to further reduce this probability. Or, more simply, BAARF.
>
> [1] http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf
>
> -- richard

Your *trap* assumes COMPLETE data loss. I don't know what world you live in, but the one I live in doesn't require a restore of 10TB of data when *ONE* block is bad. You've also assumed that the useful life of the data is 114 years, also false in the majority of primary disk systems. Then there's the little issue with you ignoring parity when you quote "a disk drive's life".

I'll stick with the 3 year life cycle of the system followed by a hot migration to new storage, thank you very much.

--Tim
Re: [zfs-discuss] Raidz vdev size... again.
On Apr 28, 2009, at 18:02, Richard Elling wrote:
> Kees Nuyt wrote:
>> Some high availability storage systems overcome this decay by not just reading, but also writing all blocks during a scrub. In those systems, scrubbing is done semi-continuously in the background, not on user/admin demand.
>
> Yes and there is a very important point here. There are 2 different sorts of scrubbing: read and rewrite. ZFS (today) does read scrubbing, which does not reset the decay process. Some RAID arrays also do rewrite scrubs, which do reset the decay process. The problem with rewrite scrubbing is that you really want to be sure the data is correct before you rewrite. Neither is completely foolproof, so it is still a good idea to have backups :-)

Hopefully bp relocate will make it into Solaris at some point, so when a scrub gets kicked off we'll be able to have that (at least as an option, if not by default).

Mac OS X 10.5 auto-defrags in the background (given certain criteria are met), but HFS+ doesn't have checksums, so there's a bit of a risk of creating errors. Combine the two and you have a fairly robust defrag system.
Re: [zfs-discuss] Raidz vdev size... again.
Kees Nuyt wrote:
> On Mon, 27 Apr 2009 18:25:42 -0700, Richard Elling wrote:
>> The concern with large drives is unrecoverable reads during resilvering. One contributor to this is superparamagnetic decay, where the bits are lost over time as the medium tries to revert to a more steady state. To some extent, periodic scrubs will help repair these while the disks are otherwise still good. At least one study found that this can occur even when scrubs are done, so there is an open research opportunity to determine the risk and recommend scrubbing intervals.
>
> Some high availability storage systems overcome this decay by not just reading, but also writing all blocks during a scrub. In those systems, scrubbing is done semi-continuously in the background, not on user/admin demand.

Yes and there is a very important point here. There are 2 different sorts of scrubbing: read and rewrite. ZFS (today) does read scrubbing, which does not reset the decay process. Some RAID arrays also do rewrite scrubs, which do reset the decay process. The problem with rewrite scrubbing is that you really want to be sure the data is correct before you rewrite. Neither is completely foolproof, so it is still a good idea to have backups :-)

-- richard
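[Editor's sketch] The difference between a read scrub and a rewrite scrub, and the point that data should be verified before it is rewritten, can be illustrated with a toy model. This is only a sketch under stated assumptions: `Block`, `scrub`, the two-way mirror, and the `writes` counter are hypothetical names invented for illustration, not ZFS or RAID-array internals.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Block:
    data: bytes
    checksum: str      # checksum recorded when the block was first written
    writes: int = 1    # how many times the media copy has been (re)written


def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub(blocks, mirror, rewrite=False):
    """Verify every block; repair bad ones from the mirror copy.

    With rewrite=True, blocks that verify are also written back,
    but only after the checksum has confirmed the data is correct.
    Returns (repaired, rewritten) counts.
    """
    repaired = rewritten = 0
    for i, blk in enumerate(blocks):
        if sha256(blk.data) != blk.checksum:
            good = mirror[i]
            if sha256(good.data) != blk.checksum:
                raise IOError(f"block {i}: no valid copy, restore from backup")
            blk.data = good.data   # repair is itself a write
            blk.writes += 1
            repaired += 1
        elif rewrite:
            blk.writes += 1        # rewrite scrub: refresh known-good data
            rewritten += 1
    return repaired, rewritten
```

In this model a read scrub (`rewrite=False`) only writes blocks that failed verification, while a rewrite scrub refreshes every block that passed. Neither helps when both copies are bad, which is the case where backups matter.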
Re: [zfs-discuss] Raidz vdev size... again.
Tim wrote:
> On Mon, Apr 27, 2009 at 8:25 PM, Richard Elling <richard.ell...@gmail.com> wrote:
>> I do not believe you can achieve five 9s with current consumer disk drives for an extended period, say >1 year.
>
> Just to pipe up, while very few vendors can pull it off, we've seen five 9's with Hitachi gear using SATA.

Well done! Of course Hitachi doesn't use consumer-grade disks in their arrays...

I'll also confess that I did set a bit of a math trap here :-) The trap is that if you ever have to recover data from tape/backup, then you'll have no chance of making 5-9s when using large volumes. Suppose you have a really nice backup system that can restore 10TBytes in 10 hours. To achieve 5-9s you'd need to make sure that you never have to restore from backups for the next 114 years. Since the expected lifetime of a disk is << 114 years, you'll have a poor chance of making it. So the problem really boils down to how sure you can be that you won't have an unrecoverable read during the expected lifetime of your system. Studies have shown [1] that you are much more likely to see this than you'd expect. The way to solve that problem is to use double parity to further reduce this probability. Or, more simply, BAARF.

[1] http://www.cs.wisc.edu/adsl/Publications/corruption-fast08.pdf

-- richard
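[Editor's sketch] The 114-year figure above follows from simple arithmetic: five 9s allows a downtime fraction of 1 - 0.99999, so a single 10-hour restore has to be spread over 10 / 0.00001 = 1,000,000 hours. A minimal check, assuming a 365.25-day year and treating the restore as the only outage; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 24 * 365.25


def years_between_outages(availability: float, outage_hours: float) -> float:
    """Span in years over which one outage of `outage_hours` still meets `availability`."""
    allowed_fraction = 1.0 - availability      # e.g. 0.00001 for five 9s
    return outage_hours / allowed_fraction / HOURS_PER_YEAR


# One 10-hour restore under a five-9s target.
print(round(years_between_outages(0.99999, 10.0)))  # prints 114
```

The same function shows how quickly the target relaxes: at three 9s (0.999), one 10-hour outage per roughly 1.1 years is already within budget.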
Re: [zfs-discuss] Raidz vdev size... again.
> "kn" == Kees Nuyt writes:

kn> Some high availability storage systems overcome this decay by
kn> not just reading, but also writing all blocks during a
kn> scrub.

sounds like a good idea but harder in the ZFS model where the software isn't the proprietary work of the only permitted integrator.

 * it'd be harmful to do this on SSDs. it might also be a really good idea to do it on SSDs. who knows yet.

 * optimizing the overall system depends on intimate knowledge of, and control over the release binding of, drive firmware and its errata/quirks/decisions

 * it may be wasteful to do read/rewrite on an ordinary magnetic drive because if you just do a read, the drive should notice a decaying block and rewrite it without being told specifically, maybe. though from netapp's paper, they say they disable many of these features in their SCSI drives, including bad block remapping, and delegate them to the layer of their own software right above the drive

 * there's an ``offline self test'' in SMART where the drive is supposed to scrub itself, possibly including badblock remapping and marginal sector rewriting. If this feature worked it could possibly accomplish scrubs with better QoS (less interference to real read/writes) and no controller-to-storage bandwidth wastage, compared to actually reading and rewriting through the controller, or possibly several layers above the controller through fanouts and such.

 * drives with caches may suppress overwrites to sectors containing what the cache says is already in those sectors. I guess I heard on this list that SCSI has commands to ignore the cache for read and other commands to bypass it for write, but not SATA, or the commands could be broken because no one else uses them. You have to have some business relationship with the drive company before they will admit what their proprietary firmware really does, much less alter it to your wishes, even if your wish is merely that it complies, or behaves like it did yesterday. Every tiny piece of software that remains proprietary eventually turns into a blob that does someone else's bidding and fucks with you.

In the end, though, I bet we may end up with this feature on ZFS in the disguise of a ``defragmenter''. If the defragmenter will promise to rewrite every block to a new spot, not just the ones it pleases, this will do the job of your ``write scrub'' and also solve the drive caching problem.

kn> In those systems, scrubbing is done semi-continuously in the
kn> background, not on user/admin demand.

which ones? name names. :)

I thought netapp's two papers said they are doing it ``every Sunday'' or something. but, yeah, asking the admin to initiate it manually means if it makes the array uselessly slow you blame the admin rather than the software stack. linux ubifs (NAND flash) scrubs are also mandatory/unsupervised.
Re: [zfs-discuss] Raidz vdev size... again.
On Mon, 27 Apr 2009 18:25:42 -0700, Richard Elling wrote:
>The concern with large drives is unrecoverable reads during resilvering.
>One contributor to this is superparamagnetic decay, where the bits are
>lost over time as the medium tries to revert to a more steady state.
>To some extent, periodic scrubs will help repair these while the disks
>are otherwise still good. At least one study found that this can occur
>even when scrubs are done, so there is an open research opportunity
>to determine the risk and recommend scrubbing intervals.

Some high availability storage systems overcome this decay by not just reading, but also writing all blocks during a scrub. In those systems, scrubbing is done semi-continuously in the background, not on user/admin demand.

--
( Kees Nuyt )
c[_]
Re: [zfs-discuss] Question about ZFS Incremental Send/Receive
> I feel like I understand what tar is doing, but I'm curious about what is it that ZFS is looking at that makes it a "successful" incremental send? That is, not send the entire file again. Does it have to do with how the application (tar in this example) does a file open, fopen(), and what mode is used? i.e. open for read, open for write, open for append. Or is it looking at a file system header, or checksum? I'm just trying to explain some observed behavior we're seeing during our testing.
>
> My proof of concept is to remote replicate these "container files", which are created by a 3rd party application.

ZFS knows which blocks were written since the first snapshot was taken. Filenames or the type of open are not important. If you open a file and rewrite all blocks in that file with the same content, all of those blocks will be sent. If you rewrite only 5 blocks, only 5 blocks are sent (plus the metadata that was updated).

The way it works is that all blocks have a time stamp. Blocks with a time stamp newer than the first snapshot will be sent.
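[Editor's sketch] The rule described above (send every block whose time stamp is newer than the first snapshot) can be modelled in a few lines. As far as I understand it, the "time stamp" ZFS records per block is a birth transaction group number rather than a wall-clock time. Everything below is a hypothetical toy: `ToyFS`, the 4-byte block size, and the method names are invented for illustration and are not the ZFS API. It also assumes appends are block-aligned, whereas a real `tar uvf` would additionally rewrite the block holding the end-of-archive marker.

```python
BLOCK = 4  # toy block size in bytes; real record sizes are far larger


class ToyFS:
    """One file as a list of (birth_txg, bytes) blocks. Hypothetical model only."""

    def __init__(self):
        self.txg = 0       # current transaction group counter
        self.blocks = []

    def _split(self, data):
        return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

    def rewrite(self, data):
        """Recreate the file: every block is written again (like `tar cvf`)."""
        self.txg += 1
        self.blocks = [(self.txg, chunk) for chunk in self._split(data)]

    def append(self, data):
        """Extend the file: only the new blocks are written (like `tar uvf`)."""
        self.txg += 1
        self.blocks += [(self.txg, chunk) for chunk in self._split(data)]

    def snapshot(self):
        """A snapshot is represented by the txg at which it was taken."""
        return self.txg

    def incremental(self, since):
        """Blocks born after snapshot `since`; these are what would be sent."""
        return [chunk for birth, chunk in self.blocks if birth > since]


# Case 1: append after the snapshot; only the new block is sent.
fs = ToyFS()
fs.rewrite(b"AAAABBBB")
snap = fs.snapshot()
fs.append(b"CCCC")
print(len(fs.incremental(snap)))   # prints 1

# Case 2: recreate the file with the same leading content; all blocks are sent.
fs = ToyFS()
fs.rewrite(b"AAAABBBB")
snap = fs.snapshot()
fs.rewrite(b"AAAABBBBCCCC")
print(len(fs.incremental(snap)))   # prints 3
```

The second case matches the `tar cvf` observation in the question: the content of the first blocks is unchanged, but they were written again after the snapshot, so they count as new.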
Re: [zfs-discuss] storage & zilstat assistance
bfrie...@simple.dallas.tx.us said:
> Your IOPS don't seem high. You are currently using RAID-5, which is a poor choice for a database. If you use ZFS mirrors you are going to unleash a lot more IOPS from the available spindles.

RAID-5 may be poor for some database loads, but it's perfectly adequate for this one (small data warehouse, sequential writes, and so far mostly sequential reads as well). So far the RAID-5 LUN has not been a problem, and it doesn't look like the low IOPS are because of the hardware; rather, the database/application just isn't demanding more. Please correct me if I've come to the wrong conclusion here.

> I am not familiar with zilstat. Presumably the '93' is actually 930 ops/second?

I think you answered your question in your second post. But for others, the "93" is the total ops over the reporting interval. In this case, the interval was 10 seconds, so 9.3 ops/sec.

> I have a 2540 here, but a very fast version with 12 300GB 15K RPM SAS drives arranged as six mirrors (2540 is configured like a JBOD). While I don't run a database, I have run an IOPS benchmark with random writers (8K blocks) and see a peak of 3708 ops/sec. With a SATA model you are not likely to see half of that.

Thanks for the 2540 numbers you posted. There's a SAS 2530 here with the same 300GB 15kRPM drives, and as you said, it's fast. But it looks so far like the SATA model, even with less than half the IOPS, will be more than enough for our workload.

I'm pretty convinced that the SATA 2540 will be sufficient. What I'm not sure of is if the cheaper J4200 without SSD would be sufficient. I.e., are we generating enough synchronous traffic that lack of NVRAM cache will cause problems?

One thing zilstat doesn't make obvious (to me) is the latency effects of a separate log/ZIL device. I guess I could force our old array's cache into write-through mode and see what happens to the numbers. Judging by our experience with NFS servers using this same array, I'm reluctant to try.

Thanks and regards,

Marion
[zfs-discuss] Question about ZFS Incremental Send/Receive
I'm using ZFS snapshots and send and receive for a proof of concept, and I'd like to better understand how the incremental feature works.

Consider this example:

 1. create a tar file using tar -cvf of 10 image files
 2. ZFS snapshot the filesystem that contains this tar file
 3. Use ZFS send and receive and ssh to replicate this file system on another (remote) system
 4. add 3 more image files to the tar file using tar -uvf
 5. ZFS snapshot the same file system
 6. Repeat step 3 above but this time do an incremental on the zfs send
 7. observing the network traffic (iftop) I see that only the incremental data is transferred between the systems.

This is my goal, to NOT have to resend the entire tar, or container file, over the network for each incremental.

If I repeat the above experiment, but instead do a "tar cvf" at step 4, and just add more image files each time, i.e.

 step 1: tar cvf container01.tar file01 file02 file03 file04 file05
 step 4: tar cvf container01.tar file01 file02 file03 file04 file05 file06 file07 file08

I see the amount of data equivalent to the entire container01.tar get transferred over the network. This is not the behavior I want.

In the second experiment above, what is it about ZFS that's catching the fact that it is a "new" file? I used tar in my experiments just because I'm familiar with it and it's on my Solaris 10 VMs. Does the "tar uvf" do an open() with the append flag, so ZFS somehow knows about that? What got changed when I did the "tar cvf" the second time, writing to the same file name, but instead with more files?

I feel like I understand what tar is doing, but I'm curious about what is it that ZFS is looking at that makes it a "successful" incremental send? That is, not send the entire file again. Does it have to do with how the application (tar in this example) does a file open, fopen(), and what mode is used? i.e. open for read, open for write, open for append. Or is it looking at a file system header, or checksum? I'm just trying to explain some observed behavior we're seeing during our testing.

My proof of concept is to remote replicate these "container files", which are created by a 3rd party application.

Thanks in advance,

Pat
Re: [zfs-discuss] can zfs create return with no error code before the mount takes place?
On Mon, Apr 27, 2009 at 6:54 PM, Robert Milkowski wrote:
> Hello Alastair,
>
> Monday, April 27, 2009, 10:18:50 PM, you wrote:
>
> Seems or did you confirm it with mount or df command?
>
> Do you mount it manually then?
>
> http://milek.blogspot.com

This is a sample of one of the failures. Clearly the filesystem gets created, but the subsequent mount fails: copying files from the skeleton directory to the expected home directory creates a file where the system expects a mountpoint directory.

> sstephe3,11210,400,facstaff,Stephen V Stephenson
> mkdir: Failed to make directory "/web/u1/s/s/sstephe3/public_html"; No such file or directory
> mkdir: Failed to make directory "/web/u1/s/s/sstephe3/htaccess"; No such file or directory
> cp: /home/u1/s/s/sstephe3 not found
> chown: /web/u1/s/s/sstephe3: No such file or directory
> chmod: WARNING: /home/u1/s/s/sstephe3: Execute permission required for set-ID on execution
> chmod: WARNING: can't access /home/u1/s/s/sstephe3/.profile
> chmod: WARNING: can't access /home/u1/s/s/sstephe3/*
> ln: cannot create /nfs4/home/u1/s/s/sstephe3/web: Not a directory
> ln: cannot create /nfs4/home/u1/s/s/sstephe3: File exists
> chmod: WARNING: can't access /nfs4/web/u1/s/s/sstephe3
> chmod: WARNING: can't access /web/u1/s/s/sstephe3
> ln: cannot create /nfs4/home/u1/s/s/sstephe3: File exists
> cannot mount 'u1/sstephe3': Not a directory
> filesystem successfully created, but not mounted

Now this does not happen every time, but it happens with increasing frequency as the number of created filesystems increases. I cannot readily confirm this as I have blown away the pools, but I think I end up with only about 4500 filesystems out of the 8200 or so that should be created once the script ends.

Another datum that may be relevant is that the block devices in the pools are actually iSCSI volumes served from several Linux openfiler systems over a private GigE network using jumbo frames.

I just updated the system to snv_111a, so at some point I'll recreate the pools and try again; however, we are in the middle of a building move so I might not get to it until later in the week.

I really appreciate any pointers.

Regards,

Alastair
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, Apr 28, 2009 at 9:42 AM, Scott Lawson wrote:
>> Mainstream Solaris 10 gets a port of ZFS from OpenSolaris, so its features are fewer and later. As time ticks away, fewer features will be back-ported to Solaris 10. Meanwhile, you can get a production support agreement for OpenSolaris.
>
> Sure if you want to run it on x86. I believe sometime in 2009 we will see a SPARC release for OpenSolaris. I understand that it is to be the next OpenSolaris release, but I wouldn't hold my breath.

It's already available for SPARC (http://genunix.org/). Just not in installer or Live DVD format (which should be available for the 2009.06 release).

Regards,

Fajar
Re: [zfs-discuss] Raidz vdev size... again.
On Tue, Apr 28, 2009 at 10:08 AM, Tim wrote:
> On Mon, Apr 27, 2009 at 8:25 PM, Richard Elling wrote:
>> I do not believe you can achieve five 9s with current consumer disk drives for an extended period, say >1 year.
>
> Just to pipe up, while very few vendors can pull it off, we've seen five 9's with Hitachi gear using SATA.

Can you specify the hardware?

I've recently switched to LSI SAS1068E controllers and am swimmingly happy. (That's my $.02 - controllers (not surprisingly) affect the niceness of a software RAID solution like ZFS quite a bit - maybe even more than the actual drives...?)
Re: [zfs-discuss] Raidz vdev size... again.
On Mon, Apr 27, 2009 at 8:25 PM, Richard Elling wrote:
> I do not believe you can achieve five 9s with current consumer disk drives for an extended period, say >1 year.

Just to pipe up, while very few vendors can pull it off, we've seen five 9's with Hitachi gear using SATA.

--Tim