Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Thu, 06 Jan 2011 22:42:15 PST Michael DeMan wrote:
> To be quite honest, I too am skeptical about using de-dupe just based on
> SHA256. In prior posts it was asked that the potential adopter of the
> technology provide the mathematical reason to NOT use SHA-256 only.
> However, if Oracle believes that it is adequate to do that, would it be
> possible for somebody to provide:
>
> (A) The theoretical documents and associated mathematics specific to say
> one simple use case?

See http://en.wikipedia.org/wiki/Birthday_problem -- in particular see
section 5.1 and the probability table of section 3.4.

> On Jan 6, 2011, at 10:05 PM, Edward Ned Harvey wrote:
>
> >> I have been told that the checksum value returned by Sha256 is almost
> >> guaranteed to be unique. In fact, if Sha256 fails in some case, we
> >> have a bigger problem such as memory corruption, etc. Essentially,
> >> adding verification to sha256 is an overkill.

Agreed.

> > Someone please correct me if I'm wrong.

OK :-)

> > Suppose you have 128TB of data. That is ... you have 2^35 unique 4k
> > blocks of uniformly sized data. Then the probability you have any
> > collision in your whole dataset is (sum(1 thru 2^35))*2^-256
> > Note: sum of integers from 1 to N is (N*(N+1))/2
> > Note: 2^35 * (2^35+1) = 2^35 * 2^35 + 2^35 = 2^70 + 2^35
> > Note: (N*(N+1))/2 in this case = 2^69 + 2^34
> > So the probability of data corruption in this case, is
> > 2^-187 + 2^-222 ~= 5.1E-57 + 1.5E-67
> >
> > ~= 5.1E-57

I believe this is wrong. See the wikipedia article referenced above.

    p(n, d) = 1 - d! / (d^n * (d-n)!)

In your example n = 2^35, d = 2^256. If you extrapolate the 256 bits row
of the probability table of section 3.1, it is somewhere between 10^-48
and 10^-51.

This may be easier to grasp: to get a 50% probability of a collision with
sha256, you need 4*10^38 blocks. For a probability similar to disk error
rates (10^-15), you need 1.5*10^31 blocks.
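For anyone who wants to plug in their own numbers, here is a small Python
sketch of the usual birthday-bound approximation, n ~ sqrt(2*d*ln(1/(1-p)))
blocks for collision probability p over a space of d hash values (the
function name and the choice of approximation are mine); the two calls
just reproduce the 4*10^38 and 1.5*10^31 figures quoted above.

import math

def blocks_for_probability(p: float, hash_bits: int) -> float:
    """Approximate number of random blocks needed before the probability
    of at least one hash collision reaches p (birthday approximation)."""
    d = 2.0 ** hash_bits
    return math.sqrt(-2.0 * d * math.log1p(-p))

# Blocks needed for a 50% chance of a SHA-256 collision
print(blocks_for_probability(0.5, 256))     # ~4e38
# Blocks needed to reach a 1e-15 (disk error rate) collision probability
print(blocks_for_probability(1e-15, 256))   # ~1.5e31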
Re: [zfs-discuss] A few questions
On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble wrote:
>
> The problem boils down to this:
>
> When ZFS does a resilver, it walks the METADATA tree to determine what
> order to rebuild things from. That means, it resilvers the very first
> slab ever written, then the next oldest, etc. The problem here is that
> slab "age" has nothing to do with where that data physically resides on
> the actual disks. If you've used the zpool as a WORM device, then, sure,
> there should be a strict correlation between increasing slab age and
> locality on the disk. However, in any reasonable case, files get
> deleted regularly. This means that the probability is high that a slab
> B, written immediately after slab A, WON'T be physically near slab A.
>
> In the end, the problem is that using metadata order, while reducing the
> total amount of work to do in the resilver (as you only resilver live
> data, not every bit on the drive), increases the physical inefficiency
> for each slab. That is, seek time between cylinders begins to dominate
> your slab reconstruction time. In RAIDZ, this problem is magnified by
> both the much larger average vdev size vs mirrors, and the necessity
> that all drives containing a slab's information return that data before
> the corrected data can be written to the resilvering drive.
>
> Thus, current ZFS resilvering tends to be seek-time limited, NOT
> throughput limited. This is really the "fault" of the underlying media,
> not ZFS. For instance, if you have a raidZ of SSDs (where seek time is
> negligible, but throughput isn't), they resilver really, really fast.
> In fact, they resilver at the maximum write throughput rate. However,
> HDs are severely seek-limited, so that dominates HD resilver time.

You guys may be interested in a solution I used in a totally different
situation. There an identical tree data structure had to be maintained on
every node of a distributed system. When a new node was added, it needed
to be initialized with an identical copy before it could be put in
operation. But this had to be done while the rest of the system was
operational, and there might even be updates from a central node during
the `mirroring' operation. Some of these updates could completely change
the tree! Starting at the root was not going to work, since a subtree
that was being copied might stop existing in the middle and its space be
reused! In a way this is a similar problem (but worse!). I needed
something foolproof and simple.

My algorithm started copying sequentially from the start. If N blocks had
already been copied when an update came along, updates of any block with
block# > N were ignored (since the sequential copy would get to them
eventually). Updates of any block# <= N were queued up (a further update
of the same block would overwrite the old update, to reduce work).
Periodically they would be flushed out to the new node. This was paced so
as to not affect normal operation much.

I should think a variation would work for active filesystems. You
sequentially read some amount of data from all the disks from which data
for the new disk is to be prepared, and write it out sequentially. Each
time, read enough data so that reading time dominates any seek time.
Handle concurrent updates as above. If you dedicate N% of time to
resilvering, the total time to complete the resilver will be 100/N times
the sequential read time of the whole disk. (For example: 1TB disk,
100MBps io speed, 25% for resilver => under 12 hours.) How much worse
this gets depends on the amount of updates during resilvering.
At the time of resilvering your FS is more likely to be near full than
near empty, so I wouldn't worry about optimizing the mostly-empty FS case.

Bakul
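Purely as an illustration of the copy-while-live scheme described above
(not anything from ZFS -- the class and names are mine, and this
single-threaded toy ignores the locking a real implementation would
need), a Python sketch:

class SequentialMirror:
    """Toy model: copy blocks sequentially while the source stays live.
    'source' and 'target' are indexable block stores keyed by block#."""

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.copied_upto = -1      # highest block number already copied (N)
        self.pending = {}          # block# -> latest data, for blocks <= N

    def on_update(self, blkno, data):
        """Called for every live write while the mirror is being built."""
        if blkno <= self.copied_upto:
            # Already copied: queue it, keeping only the newest version.
            self.pending[blkno] = data
        # else: the sequential pass will pick up the new data anyway.

    def flush_pending(self):
        for blkno, data in self.pending.items():
            self.target[blkno] = data
        self.pending.clear()

    def run(self, nblocks, flush_every=1024):
        for blkno in range(nblocks):
            self.target[blkno] = self.source[blkno]
            self.copied_upto = blkno
            if blkno % flush_every == flush_every - 1:
                self.flush_pending()   # paced so normal work isn't hurt much
        self.flush_pending()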
Re: [zfs-discuss] Yager on ZFS
> The 45 byte score is the checksum of the top of the tree, isn't that
> right?

Yes. Plus an optional label.

> ZFS snapshots and clones save a lot of space, but the
> 'content-hash == address' trick means you could potentially save
> much more.

Especially if you carry around large files (disk images, databases) that
change.

> Though I'm still not sure how well it scales up -
> Bigger working set means you need longer (more expensive) hashes
> to avoid a collision, and even then its not guaranteed.
> When i last looked they were still using SHA-160
> and I ran away screaming at that point :)

You need 2^80 blocks for a 50%+ probability that a pair will have the
same SHA-160 hash (by the birthday paradox). Crypto attacks are not
relevant. For my personal use I am willing to live with these odds until
my backups cross 2^40 distinct blocks (greater than 8 Petabytes)!
Re: [zfs-discuss] Yager on ZFS
> I have budget constraints, so I can use only user-level storage.
>
> until I discovered zfs I used subversion and git, but none of them is
> designed to manage gigabytes of data, some to be versioned, some to be
> unversioned.
>
> I can't afford silent data corruption and, if the final response is
> "*now* there is no *real* opensource software alternative to zfs
> automatic checksumming and simple snapshotting" I'll be a happy solaris
> user (for data storage), a happy linux user (for everyday work), and an
> unhappy offline windows user (for some video-related activity I can't
> do with linux).

Note that I don't wish to argue for/against zfs/billtodd, but the comment
above about "no *real* opensource software alternative to zfs automatic
checksumming and simple snapshotting" caught my eye. There is an open
source alternative for archiving that works quite well.

venti has been available for a few years now. It runs on *BSD, linux,
macOS & plan9 (its native os). It uses strong crypto checksums, stored
separately from the data (stored in the pointer blocks), so you get a
similar guarantee against silent data corruption as ZFS. You can back up
a variety of filesystems (ufs, hfs, ext2fs, fat) or use it to back up a
file tree. Each backup results in a single 45 byte "score" containing the
checksum of the root pointer block. Using this score you can retrieve the
entire backup. Further, it stores only one copy of a data block
regardless of what files or which backup it may belong to. In effect
every "full backup" is an incremental backup (only changed data blocks
and changed or new ptr blocks are stored). So it is really an "archival"
server. You don't take snapshots but you do a backup. However, you can
nfs mount a venti and all your backups will show up under their own
directories. Ideally you'd store a venti on RAID storage. You can even
copy one venti to another, you can store its arenas on CDs or DVDs, and
so on.

It is not as fast as ZFS, nor anywhere near as easy to use, and its
intended use is not the same as ZFS (it is not a primary filesystem). But
for what it does, it is not bad at all! Unlike ZFS, it fits best where
you have a fast filesystem for speed critical use, venti for backups and
RAID for redundancy.

Google for "venti sean dorward". If interested, go to
http://swtch.com/plan9port/ and pick up plan9port (a collection of
programs from plan9, not just venti). See
http://swtch.com/plan9port/man/man8/index.html for how to use venti.

> I think for every fully digital person their own data are vital, and
> almost everyone would reply "NONE" to your question "what level of risk
> is the user willing to tolerate".

NONE is not possible. It is a question of how much risk you are willing
to tolerate for what cost. Thankfully, these days you have a variety of
choices and a much, much lower cost for a given degree of risk compared
to just a few years ago!
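To make the score/dedup idea concrete, here is a toy content-addressed
store in Python. It only illustrates the principle (a block's hash is its
address, so identical blocks are stored once and a file rolls up to a
single root score); it is not venti's actual on-disk or network format,
and all the names here are made up.

import hashlib

class ToyVenti:
    """Toy content-addressed block store: address == SHA-1 of the block."""

    def __init__(self):
        self.blocks = {}                      # score (hex digest) -> bytes

    def put(self, data: bytes) -> str:
        score = hashlib.sha1(data).hexdigest()
        # Writing the same content twice stores it only once.
        self.blocks.setdefault(score, data)
        return score

    def get(self, score: str) -> bytes:
        return self.blocks[score]

    def put_file(self, data: bytes, blocksize: int = 8192) -> str:
        """Store a file as data blocks plus one pointer block of scores;
        the returned root score is enough to retrieve the whole file."""
        scores = [self.put(data[i:i + blocksize])
                  for i in range(0, len(data), blocksize)]
        return self.put("\n".join(scores).encode())

    def get_file(self, root_score: str) -> bytes:
        ptr_block = self.get(root_score).decode()
        return b"".join(self.get(s) for s in ptr_block.split("\n"))

store = ToyVenti()
v1 = store.put_file(b"A" * 20000 + b"B" * 20000)
v2 = store.put_file(b"A" * 20000 + b"C" * 20000)   # shares unchanged blocks
assert store.get_file(v1) == b"A" * 20000 + b"B" * 20000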
Re: [zfs-discuss] An Academic Sysadmin's Lament for ZFS ?
> why are you providing disk space to students?
>
> When you solve this problem, the quota problem is moot.
>
> NB. I managed a large University network for several years, and
> am fully aware of the costs involved. I do not believe that the
> 1960s timeshare model will survive in such environments.

So are you saying you don't believe the network is the computer? :-)
Re: [zfs-discuss] Resilvering speed?
> > It seems to me that once you copy meta data, you can indeed
> > copy all live data sequentially.
>
> I don't see this, given the top down strategy. For instance, if I
> understand the transactional update process, you can't commit the
> metadata until the data is in place.
>
> Can you explain in more detail your reasoning?

First, realize that this is just a thought experiment -- I haven't read
much source code in any detail as yet and it is entirely likely that what
I am suggesting can not work or that there is no need for it! With that
caveat:

http://blogs.sun.com/bonwick/entry/smokin_mirrors talks about top-down
resilvering. That is, copy the root blocks first (for example the
uberblock), then the blocks they point to, and so on [1]. A major goal
here is to minimize data loss in case of a second failure. Completely
losing a metadata block means you can't access anything it points to, so
metadata blocks are far more precious than data blocks.

This is different from a normal update transaction, where the
copy-on-write proceeds bottom up -- which is what you are talking about.
A major goal for a normal update is to ensure that at all times a
consistent filesystem structure is seen by all.

All I was suggesting is that once all the metadata is copied (or
"resilvered"), switch to sequential copying to maximize performance. This
does make checking the validity of a data block more complicated. So
instead of copying data of file1 and then file2 and so on, just copy
blocks in the most efficient order, save their checksums and periodically
validate a whole bunch. In fact, since metadata is read first, you can
roughly figure out which metadata blocks will be needed when, to check
data block validity (because you know where data blocks are stored).

> > Given that a vast majority
> > of disk blocks in use will typically contain data, this is a
> > winning strategy from a performance point of view and still
> > allows you to retrieve a fair bit of data in case of a second
> > disk failure (checksumming will catch a case where good
> > metadata points to as yet uncopied data block). If amount of
> > live data is > 50% of disk space you may as well do a disk
> > copy, perhaps skipping over already copied meta data.
> >
> > Not only that, you can even start using the disk being
> > resilvered right away for writes. The new write will be
> > either to a) an already copied block
>
> How can that be, under a COW regime?

I was talking about resilvering, not a normal update. Copy on write
happens only for a normal update. I was speculating that you can do
normal updates during resilvering. Not sure if this is clear to anyone!

-- bakul

[1] Top down resilvering seems very much like a copying garbage
collector. That similarity makes me wonder if the physical layout can be
rearranged in some way for more efficient access to data -- the idea is
to resilver and compactify at the same time on one of the mirrors and
then make it the master and resilver the other mirrors. Nah... probably
not worth the hassle. [Again, I suspect no one else understands what I am
talking about :-)]
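A rough sketch of the "copy in the most efficient order, save checksums,
validate in batches" idea -- again just a thought-experiment companion in
Python, with the expected_sums map standing in for whatever the metadata
pass would really provide:

import hashlib

def sequential_resilver(src_dev, dst_dev, expected_sums, nblocks,
                        blocksize=128 * 1024, batch=1024):
    """Copy blocks in plain disk order, remembering each block's checksum,
    and verify them against metadata-supplied checksums in batches.
    'expected_sums' maps block# -> sha256 hex digest; blocks with no entry
    are treated as holes/free space and skipped for validation.
    Returns the list of block numbers that failed to verify."""
    bad = []
    pending = {}                              # block# -> observed digest
    for blkno in range(nblocks):
        src_dev.seek(blkno * blocksize)
        data = src_dev.read(blocksize)
        dst_dev.seek(blkno * blocksize)
        dst_dev.write(data)
        pending[blkno] = hashlib.sha256(data).hexdigest()
        if len(pending) >= batch:             # periodic validation pass
            bad.extend(b for b, d in pending.items()
                       if b in expected_sums and expected_sums[b] != d)
            pending.clear()
    bad.extend(b for b, d in pending.items()
               if b in expected_sums and expected_sums[b] != d)
    return bad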
Re: [zfs-discuss] Resilvering speed?
> Robert Milkowski wrote:
> > Hello Mario,
> >
> > Wednesday, May 9, 2007, 5:56:18 PM, you wrote:
> >
> > MG> I've read that it's supposed to go at full speed, i.e. as fast as
> > MG> possible. I'm doing a disk replace and what zpool reports kind of
> > MG> surprises me. The resilver goes on at 1.6MB/s. Did resilvering get
> > MG> throttled at some point between the builds, or is my ATA
> > MG> controller having bigger issues?
> >
> > Lot of small files perhaps? What kind of protection have you used?
>
> Good question. Remember that resilvering is done in time order and from
> the top-level metadata down, not by sequentially blasting bits. Jeff
> Bonwick describes this as top-down resilvering.
> http://blogs.sun.com/bonwick/entry/smokin_mirrors
>
> From a MTTR and performance perspective this means that ZFS recovery
> time is a function of the amount of space used, where it is located (!),
> and the validity of the surviving or regenerated data. The big win is
> the amount of space used, as most file systems are not full.
> -- richard

It seems to me that once you copy metadata, you can indeed copy all live
data sequentially. Given that a vast majority of disk blocks in use will
typically contain data, this is a winning strategy from a performance
point of view and still allows you to retrieve a fair bit of data in case
of a second disk failure (checksumming will catch a case where good
metadata points to an as yet uncopied data block). If the amount of live
data is > 50% of disk space, you may as well do a disk copy, perhaps
skipping over already copied metadata.

Not only that, you can even start using the disk being resilvered right
away for writes. The new write will be either to a) an already copied
block or b) an as yet uncopied block. In case a) there is nothing more to
do. In case b) the copied-from block will have the new data, so in both
cases the right thing happens. Any potential window between reading a
copied-from block and writing to the copied-to block can be closed with
careful coding/locking.

If a second disk fails during copying, the current strategy doesn't buy
you much in almost any case. You really don't want to go through a
zillion files looking for survivors. If you have a backup, you will
restore from that rather than look through the debris. Not to mention you
have made the window of a potentially catastrophic failure much larger if
resilvering is significantly slower.

Comments?
Re: [zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
> Pawel Jakub Dawidek wrote:
> > This is what I see on Solaris (hole is 4GB):
> >
> > # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=128k
> > real 23.7
> > # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=128k
> > real 21.2
> >
> > # /usr/bin/time dd if=/ufs/hole of=/dev/null bs=4k
> > real 31.4
> > # /usr/bin/time dd if=/zfs/hole of=/dev/null bs=4k
> > real 7:32.2
>
> This is probably because the time to execute this on ZFS is dominated by
> per-systemcall costs, rather than per-byte costs. You are doing 32x more
> system calls with the 4k blocksize, and it is taking 20x longer.
>
> That said, I could be wrong, and yowtch, that's much slower than I'd like!

You missed my earlier post where I showed that accessing a holey file
takes much longer than accessing a regular data file for blocksizes of 4k
and below. I will repeat the most dramatic difference:

                    ZFS                  UFS2
             Elapsed   System     Elapsed   System
md5 SPACY     210.01    77.46      337.51    25.54
md5 HOLEY     856.39   801.21       82.11    28.31

(times in seconds)

I used md5 because all but a couple of syscalls are for reading the file
(with a buffer of 1K). dd would make an equal number of calls for
writing. For both file systems and both cases the file size is the same,
but SPACY has 10GB allocated while HOLEY was created with
"truncate -s 10G HOLEY".

Look at the system times. On UFS2 the system time is a little bit higher
for the HOLEY case because it has to clear a block. On ZFS it is over 10
times higher! Something is very wrong.
[zfs-discuss] ZFS vs UFS2 overhead and may be a bug?
[Originally reported for ZFS on FreeBSD, but Pawel Jakub Dawidek says
this problem also exists on Solaris, hence this email.]

Summary: on ZFS, the overhead of reading a hole seems far worse than
actually reading from a disk. Small buffers are used to make this
overhead more visible.

I ran the following script on both ZFS and UFS2 filesystems. [Note that
on FreeBSD cat uses a 4k buffer and md5 uses a 1k buffer. On Solaris you
can replace them with dd with the respective buffer sizes for this test
and you should see similar results.]

$ dd </dev/zero bs=1m count=10240 >SPACY  # 10G zero bytes allocated
$ truncate -s 10G HOLEY                   # no space allocated

$ time dd <SPACY >/dev/null bs=1m         # A1
$ time dd <HOLEY >/dev/null bs=1m         # A2
$ time cat SPACY >/dev/null               # B1
$ time cat HOLEY >/dev/null               # B2
$ time md5 SPACY                          # C1
$ time md5 HOLEY                          # C2

I have summarized the results below.

                       ZFS                   UFS2
                Elapsed   System      Elapsed   System     Test
dd SPACY bs=1m   110.26    22.52       340.38    19.11      A1
dd HOLEY bs=1m    22.44    22.41        24.24    24.13      A2
cat SPACY        119.64    33.04       342.77    17.30      B1
cat HOLEY        222.85   222.08        22.91    22.41      B2
md5 SPACY        210.01    77.46       337.51    25.54      C1
md5 HOLEY        856.39   801.21        82.11    28.31      C2

(times in seconds)

A1, A2: Numbers are more or less as expected. When doing large reads,
reading from "holes" takes far less time than reading from a real disk.
We also see that the UFS2 disk is about 3 times slower for sequential
reads.

B1, B2: UFS2 numbers are as expected, but the ZFS numbers for the HOLEY
file are much too high. Why should *not* going to a real disk cost more?
We also see that UFS2 handles holey files 10 times more efficiently than
ZFS!

C1, C2: Again the UFS2 numbers and the C1 numbers for ZFS are as
expected, but the C2 numbers for ZFS are very high. md5 uses BLKSIZ
(== 1k) size reads and does hardly any other system calls. For ZFS each
syscall takes 76.4 microseconds while UFS2 syscalls are 2.7 us each!
zpool iostat shows there is no IO to the real disk, so this implies that
for the HOLEY case zfs read calls have a significantly higher overhead,
or there is a bug.

Basically the C tests just confirm what we find in the B tests.
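For reference, the 76.4 us and 2.7 us figures in C2 are just the measured
system times divided by the number of 1k reads a 10G file requires; a
quick check of the arithmetic (my calculation, reproducing the numbers
above):

file_bytes = 10 * 2**30            # the 10G test file
bufsize = 1024                     # md5 reads 1 KB at a time
nreads = file_bytes // bufsize     # ~10.5 million read() syscalls

print(801.21 / nreads * 1e6)       # ZFS  HOLEY: ~76.4 us per read call
print(28.31 / nreads * 1e6)        # UFS2 HOLEY: ~2.7  us per read call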