Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>
> ~= 5.1E-57

Bah. My math is wrong. I was never very good at P&S. I'll ask someone at work tomorrow to look at it and show me the folly. Wikipedia has it right, but I can't evaluate numbers to the few-hundredth power in any calculator that I have handy.
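For what it's worth, bc(1) does arbitrary-precision arithmetic and can evaluate this directly. A quick sketch, using the M*(M+1)/(2N) approximation with M = 2^35 and N = 2^256 (the rough magnitude, not the exact digits, is what matters here):

    # Evaluate 2^35 * (2^35 + 1) / (2 * 2^256) to 70 decimal places
    echo 'scale=70; (2^35 * (2^35 + 1)) / (2 * 2^256)' | bc -l
    # Prints roughly 5.1E-57, i.e. 56 zeros after the decimal point
    # before the first significant digit.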
Re: [zfs-discuss] Size of incremental stream
No compression, no dedup. I also forgot to mention it's on snv_134.
Re: [zfs-discuss] Size of incremental stream
On 01/11/11 11:40 AM, fred wrote:
> Hello,
>
> I'm having a weird issue with my incremental setup. Here is the filesystem as it shows up with zfs list:
>
> NAME             USED  AVAIL  REFER  MOUNTPOINT
> Data/FS1         771M  16.1T   116M  /Data/FS1
> Data/f...@05    10.3G      -  1.93T  -
> Data/f...@06    14.7G      -  1.93T  -
> Data/f...@07        0      -  1.93T  -
>
> Every day, I sync this filesystem remotely with:
>
>   zfs send -I X Y | ssh b...@blah zfs receive Z
>
> Now I'm having a hard time transferring @06 to @07, so I tried to copy the stream directly on the local filesystem, only to find that the size of the stream was more than 50G! Does anyone know why my stream is so much bigger than the actual snapshot size (14.7G)? I don't have this problem on my other filesystems.

Compression?

--
Ian.
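One quick sanity check (a sketch; the snapshot names below are reconstructed from the listing above and may need adjusting) is to measure the incremental stream locally and see how well it compresses, which also hints at whether compressing the transport (e.g. ssh -C) would help:

    # Size of the incremental stream, without writing it anywhere
    zfs send -I Data/FS1@06 Data/FS1@07 | wc -c

    # Same stream through gzip, to see how compressible it is on the wire
    zfs send -I Data/FS1@06 Data/FS1@07 | gzip -c | wc -c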
[zfs-discuss] Size of incremental stream
Hello,

I'm having a weird issue with my incremental setup. Here is the filesystem as it shows up with zfs list:

NAME             USED  AVAIL  REFER  MOUNTPOINT
Data/FS1         771M  16.1T   116M  /Data/FS1
Data/f...@05    10.3G      -  1.93T  -
Data/f...@06    14.7G      -  1.93T  -
Data/f...@07        0      -  1.93T  -

Every day, I sync this filesystem remotely with:

  zfs send -I X Y | ssh b...@blah zfs receive Z

Now I'm having a hard time transferring @06 to @07, so I tried to copy the stream directly on the local filesystem, only to find that the size of the stream was more than 50G! Does anyone know why my stream is so much bigger than the actual snapshot size (14.7G)? I don't have this problem on my other filesystems.

Thanks
Re: [zfs-discuss] pool metadata corrupted - any options?
- Original Message -
> Running "zpool status -x" gives the results below. Do I have any
> options besides restoring from tape?
>
> David
>
> $ zpool status -x
...

This may be a little off-topic, but using 20 drives in a single VDEV - isn't that a little more than recommended?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
r...@karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for every pedagogue to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
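For illustration only (a sketch of the usual guidance, not a recovery step; the pool name is a placeholder and the device names are simply reused from David's output), the same 20 drives would more commonly be arranged as two narrower raidz2 vdevs in one pool:

    # Two 10-disk raidz2 vdevs rather than one 20-disk vdev
    zpool create tank \
        raidz2 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 c0t10d0 c0t11d0 c0t12d0 c0t13d0 \
        raidz2 c0t14d0 c0t15d0 c0t16d0 c0t17d0 c0t18d0 c0t19d0 c0t20d0 c0t21d0 c0t22d0 c0t23d0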
Re: [zfs-discuss] ZFS root backup/"disaster" recovery, and moving root pool
Hi Karl,

I would keep your mirrored root pool separate on the smaller disks, as you have it set up now.

You can move your root pool; it's easy enough. You can even attach larger disks to the root pool and detach the smaller disks, or replace them outright.

You can't currently boot from snapshots; you must boot from a BE. Root pool recovery is generally a matter of restoring root pool snapshots, so if you store those remotely, you should be covered. This process is described in the ZFS Admin Guide and the ZFS troubleshooting wiki.

Combining your root pool with a ZIL and L2ARC on faster disks is not worth the headaches that can occur when trying to manage all 3 on the same disk. For example, you might decide to reinstall and accidentally clobber the contents of the ZIL for your data pool. Don't share disks for pool components or across pools; it keeps management and recovery simple.

Thanks,

Cindy

On 01/10/11 10:58, Karl Wagner wrote:
> Hi everyone
>
> I am currently testing Solaris 11 Express. I currently have a root pool on a mirrored pair of small disks, and a data pool consisting of 2 mirrored pairs of 1.5TB drives.
>
> I have enabled auto snapshots on my root pool, and plan to archive the daily snapshots onto my data pool. I was wondering how easy it would be, in the case of a root pool failure (i.e. both disks giving up the ghost), to restore these backups to a new disk? Or even if it would be possible to boot from the latest snapshot, somehow?
>
> In a related topic, how easy is it to move a root pool? I am considering getting a pair of SSDs to use for ZIL, L2ARC and root pool, but am rather worried it will be quite a painful process to move the root pool onto them. The plan is to use 16GB or so for rpool, mirrored, then divide the rest between L2ARC and a mirrored ZIL, on 64GB SSDs.
>
> Cheers in advance
> Karl
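As a rough sketch of the attach/detach approach (device names and slices here are placeholders; on x86 the new disk also needs boot blocks via installgrub, on SPARC via installboot), moving the root pool to a larger disk might look like:

    # Attach the new, larger disk to the existing root pool mirror
    zpool attach rpool c0t0d0s0 c0t2d0s0

    # Wait for the resilver to complete, then install the boot blocks (x86/GRUB)
    zpool status rpool
    installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c0t2d0s0

    # Detach the old, smaller disk
    zpool detach rpool c0t0d0s0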
[zfs-discuss] ZFS root backup/"disaster" recovery, and moving root pool
Hi everyone

I am currently testing Solaris 11 Express. I currently have a root pool on a mirrored pair of small disks, and a data pool consisting of 2 mirrored pairs of 1.5TB drives.

I have enabled auto snapshots on my root pool, and plan to archive the daily snapshots onto my data pool. I was wondering how easy it would be, in the case of a root pool failure (i.e. both disks giving up the ghost), to restore these backups to a new disk? Or even if it would be possible to boot from the latest snapshot, somehow?

In a related topic, how easy is it to move a root pool? I am considering getting a pair of SSDs to use for ZIL, L2ARC and root pool, but am rather worried it will be quite a painful process to move the root pool onto them. The plan is to use 16GB or so for rpool, mirrored, then divide the rest between L2ARC and a mirrored ZIL, on 64GB SSDs.

Cheers in advance

Karl
Re: [zfs-discuss] pool metadata corrupted - any options?
Hi David,

You might try importing this pool on an Oracle Solaris Express system, where a pool recovery feature is available that might be able to bring this pool back (it rolls back to a previous transaction). If that fails, you could import this pool using the read-only option to at least recover your data.

What events led up to this corruption?

Thanks,

Cindy

On 01/08/11 11:57, David Stein wrote:
> Running "zpool status -x" gives the results below. Do I have any options besides restoring from tape?
>
> David
>
> $ zpool status -x
>   pool: home
>  state: FAULTED
> status: The pool metadata is corrupted and the pool cannot be opened.
> action: Destroy and re-create the pool from a backup source.
>    see: http://www.sun.com/msg/ZFS-8000-72
>  scrub: none requested
> config:
>
>         NAME        STATE     READ WRITE CKSUM
>         home        FAULTED      0     0     1  corrupted data
>           raidz2    ONLINE       0     0     6
>             c0t10d0 ONLINE       0     0     0
>             c0t11d0 ONLINE       0     0     0
>             c0t12d0 ONLINE       0     0     0
>             c0t13d0 ONLINE       0     0     0
>             c0t14d0 ONLINE       0     0     0
>             c0t15d0 ONLINE       0     0     0
>             c0t16d0 ONLINE       0     0     0
>             c0t17d0 ONLINE       0     0     0
>             c0t18d0 ONLINE       0     0     0
>             c0t19d0 ONLINE       0     0     0
>             c0t20d0 ONLINE       0     0     0
>             c0t21d0 ONLINE       0     0     1
>             c0t22d0 ONLINE       0     0     0
>             c0t23d0 ONLINE       0     0     0
>             c0t2d0  ONLINE       0     0     0
>             c0t3d0  ONLINE       0     0     0
>             c0t4d0  ONLINE       0     0     0
>             c0t5d0  ONLINE       0     0     0
>             c0t6d0  ONLINE       0     0     0
>             c0t7d0  ONLINE       0     0     0
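A minimal sketch of those two suggestions, assuming the pool is not currently imported and that the system running the commands is new enough to have both the recovery and read-only import options:

    # Attempt recovery by discarding the last few transactions
    zpool import -F home

    # If that fails, import read-only just to copy the data off
    zpool import -o readonly=on home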
Re: [zfs-discuss] problem adding second MD1000 enclosure to LSI 9200-16e
As a follow-up, I tried a SuperMicro enclosure (SC847E26-RJBOD1). I have 3 sets of 15 drives. I got the same results when I loaded the second set of drives (15 to 30).

Then I tried changing the LSI 9200's BIOS setting for max INT 13 drives from 24 (the default) to 15. From then on, the SuperMicro enclosure worked fine, even with all 45 drives, and no kernel hangs.

I suspect that the BIOS setting would have worked with >1 MD1000 enclosure, but I never tested the MD1000s after I had the SuperMicro enclosure running. I'm not sure if the kernel hang with max INT 13 = 24 was a hardware problem or a Solaris bug.

- Rob

> I have 15x SAS drives in a Dell MD1000 enclosure, attached to an LSI 9200-16e. This has been working well. The system is booting off of internal drives, on a Dell SAS 6ir.
>
> I just tried to add a second storage enclosure, with 15 more SAS drives, and I got a lockup during Loading Kernel. I got the same results whether I daisy-chained the enclosures or plugged them both directly into the LSI 9200. When I removed the second enclosure, it booted up fine.
>
> I also have an LSI MegaRAID 9280-8e I could use, but I don't know if there is a way to pass the drives through without creating RAID0 virtual drives for each drive, which would complicate replacing disks. The 9280 boots up fine, and the system can see new virtual drives.
>
> Any suggestions? Is there some sort of boot procedure to get the system to recognize the second enclosure without locking up? Is there a special way to configure one of these LSI boards?
Re: [zfs-discuss] ZFS on emcpower0a and labels
Hi David,

I don't know whether my info is still helpful, but here it is anyway. I had the same problem and solved it using the "format -e" command. When you then enter the label option, you will get two choices:

    format> label
    [0] SMI Label
    [1] EFI Label
    Specify Label type[0]:

Choose zero and your disk will be a "SUN" disk again.

Grtz,

Philip.
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of David Magda
>
> Knowing exactly how the math (?) works is not necessary, but understanding

Understanding the math is not necessary, but it is pretty easy. And unfortunately it becomes kind of necessary, because even when you tell somebody the odds of a collision are a zillion times smaller than the odds of our Sun exploding and destroying Earth, they still don't believe you.

The explanation of the math, again, is described in the Wikipedia article "Birthday Problem," or stated a little more simply here:

Given a finite pool of N items, pick one at random and return it to the pool. Pick another one. The odds of it being the same as the first are 1/N. Pick another one. The odds of it being the same as the first are 1/N, and the odds of it being the same as the 2nd are 1/N, so the odds of it matching any of the prior picks are (to a first approximation) 2/N. Pick another one. The odds of it being the same as any previous pick are roughly 3/N.

If you repeatedly draw M items out of the pool (plus the first draw), returning them each time, then the odds of any draw matching any other draw are approximately:

    P = 1/N + 2/N + 3/N + ... + M/N
    P = ( sum(1 to M) ) / N

Note: if you google for "sum of positive integers," you'll find sum(1 to M) = M * (M+1) / 2, so:

    P = M * (M+1) / (2N)

In the context of hash collisions in a zpool, M would be the number of data blocks in your zpool, and N would be all the possible hashes. A SHA-256 hash has 256 bits, so N = 2^256.

I described an excessively large worst-case zpool in my other email, which had 2^35 data blocks in it. So M = 2^35, and the probability of any block hash colliding with any other hash in that case is:

    2^35 * (2^35 + 1) / (2 * 2^256) = ( 2^70 + 2^35 ) * 2^-257 = 2^-187 + 2^-222 ~= 5.1E-57

There are an estimated 8.87E49 atoms in planet Earth ( http://pages.prodigy.net/jhonig/bignum/qaearth.html ). The probability of a collision in your worst-case unrealistic dataset as described is over a million times less likely than randomly finding a single specific atom in the whole planet Earth by pure luck.
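For reference, the same estimate written compactly (this is just the sum-bound approximation used above, not the exact birthday-problem expression):

    P \approx \frac{M(M+1)}{2N}
      = \frac{2^{35}\,(2^{35}+1)}{2 \cdot 2^{256}}
      = 2^{-187} + 2^{-222}
      \approx 5.1 \times 10^{-57},
    \qquad M = 2^{35},\ N = 2^{256}.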
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
> From: Pawel Jakub Dawidek [mailto:p...@freebsd.org]
>
> Well, I find it quite reasonable. If your block is referenced 100 times,
> it is probably quite important.

If your block is referenced 1 time, it is probably quite important. Hence redundancy in the pool.

> There are many corruption possibilities
> that can destroy your data. Imagine memory error, which corrupts
> io_offset in write zio structure and corrupted io_offset points at your
> deduped block referenced 100 times. It will be overwritten and
> redundancy won't help you.

All of the corruption scenarios which allow you to fail despite pool redundancy also allow you to fail despite copies+N.

> Note, that deduped data is not alone
> here. Pool-wide metadata are stored 'copies+2' times (but no more than
> three) and dataset-wide metadata are stored 'copies+1' times (but no
> more than three), so by default pool metadata have three copies and
> dataset metadata have two copies, AFAIR. When you lose root node of a
> tree, you lose all your data, are you really, really sure only one copy
> is enough?

Interesting. But no. There is not only one copy as long as you have pool redundancy.
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Peter Taps
>
> I haven't looked at the link that talks about the probability of collision.
> Intuitively, I still wonder how the chances of collision can be so low. We are
> reducing a 4K block to just 256 bits. If the chances of collision are so low,
> *theoretically* it is possible to reconstruct the original block from the 256-bit
> signature by using a simple lookup. Essentially, we would now have world's
> best compression algorithm irrespective of whether the data is text or
> binary. This is hard to digest.

BTW, at work we do a lot of theoretical mathematics, and one day a few months ago I was given the challenge to explore the concept of using a hashing algorithm as a form of compression, exactly as you said. The conclusion was: you can't reverse-hash in order to reconstruct unknown original data, but you can do it (theoretically) if you have enough additional information about what constitutes valid original data.

If you have a huge lookup table of all the possible original data blocks, then the hash can only be used to narrow the field to 2^(N-M) of them as possible candidates, and some additional technique is necessary to figure out precisely which one of those is the original data block. (N is the length of the data block in bits, and M is the length of the hash, in bits.) Hashing discards some of the original data. In fact, random data is generally uncompressible, so if you try to compress random data and end up with something smaller than the original, you can rest assured you're not able to reconstruct it.

However, if you know something about the original... For example, if you know the original is a valid text document written in English, then in all likelihood there is only one possible original block fitting that description and yielding the end hash result. Even if there is more than one original block which looks like valid English text and produces the same end hash, it is easy to choose which one is correct based on context: since you presumably know the previous block and the subsequent block, you just choose the intermediate block which seamlessly continues to produce valid English grammar at the junctions with adjacent blocks. This technique can be applied to most types of clearly structured original data, but it cannot be applied to unstructured or unknown original data. So at best, hashing could be a special-case form of compression.

To decompress would require near-infinite compute hours, or a large lookup table to scan all the possible sets of inputs and find one which produces the end hash. So besides the fact that hashing is at best a specific form of compression requiring additional auxiliary information, it's also impractical. To get this down to something reasonable, I considered using a 48MB lookup table for a 24-bit block of data (that's 2^24 entries of 24 bits each), or a 16GB lookup table for a 32-bit block of data (2^32 entries of 32 bits each). Well, in order to get a compression ratio worth talking about, the hash size would have to be 3 bits or smaller. That's a pretty big lookup table to decompress 3 bits into 24 or 32... And let's face it: 9:1 compression isn't stellar for a text document.

And the final nail in the coffin was: in order for this technique to be viable, as mentioned, the original data must be structured.
For any set of structured original data, all the information which is necessary for the reverse-hash to identify valid data from the lookup table could instead have been used to create a specialized compression algorithm which is equal to or better than the reverse-hash. So reverse-hash decompression is actually the worst-case algorithm for all the data types it's capable of working on.

But yes, you're right, it's theoretically possible for specific cases, just not for the general case.
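A simple counting argument makes the "2^(N-M) candidates" point above concrete. For a 4 KB block (N = 32768 bits) and a 256-bit hash (M = 256), each hash value corresponds on average to

    \frac{2^{N}}{2^{M}} = 2^{\,N-M} = 2^{32768-256} = 2^{32512}

possible original blocks, so the hash alone can never single out the original without a great deal of side information about what valid data looks like.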
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Mon, January 10, 2011 02:41, Eric D. Mudama wrote:
> On Sun, Jan 9 at 22:54, Peter Taps wrote:
>> Thank you all for your help. I am the OP.
>>
>> I haven't looked at the link that talks about the probability of
>> collision. Intuitively, I still wonder how the chances of collision
>> can be so low. We are reducing a 4K block to just 256 bits. If the
>> chances of collision are so low, *theoretically* it is possible to
>> reconstruct the original block from the 256-bit signature by using a
>> simple lookup. Essentially, we would now have world's best
>> compression algorithm irrespective of whether the data is text or
>> binary. This is hard to digest.
>
> "simple" lookup isn't so simple when there are 2^256 records to
> search, however, fundamentally your understanding of hashes is
> correct.

[...]

It should also be noted that ZFS itself can "only" address 2^128 bytes (not even 4K 'records'), and supposedly to fill those 2^128 bytes it would take as much energy as it would take to boil the Earth's oceans:

http://blogs.sun.com/bonwick/entry/128_bit_storage_are_you

So recording and looking up 2^256 records would be quite an accomplishment. It's a lot of data.

If the OP wants to know why the chances are so low, he'll have to learn a bit about hash functions (which is what SHA-256 is):

http://en.wikipedia.org/wiki/Hash_function
http://en.wikipedia.org/wiki/Cryptographic_hash_function

Knowing exactly how the math (?) works is not necessary, but understanding the principles would be useful if one wants to have a general picture as to why SHA-256 doesn't need a verification step, and why it was chosen as one of the ZFS (dedupe) checksum options.
[zfs-discuss] cannot iterate filesystems: I/O error
Hi,

After a node panic I have an issue importing one of my zpools:

# zpool import dmysqlb2
cannot iterate filesystems: I/O error

So I tried to list the zfs filesystems:

# zfs list -r dmysqlb2
cannot iterate filesystems: I/O error
NAME                 USED  AVAIL  REFER  MOUNTPOINT
dmysqlb2            15.5G  43.0G    18K  none
dmysq...@20101130       0      -    18K  -

There should also be dmysqlb2/etc and dmysqlb2/var and their snapshots.

# zpool status -xv
  pool: dmysqlb2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME                                   STATE     READ WRITE CKSUM
        dmysqlb2                               ONLINE       0     0     5
          mirror                               ONLINE       0     0    20
            c4t600A0B80002ACF5A0D584AD5EEFFd0  ONLINE       0     0    20
            c4t600A0B80002ACF32157F4ACDF206d0  ONLINE       0     0    20

errors: Permanent errors have been detected in the following files:

        dmysqlb2/var:<0x0>

Output from zdb:

# zdb dmysqlb2
    version=15
    name='dmysqlb2'
    state=0
    txg=1350126
    pool_guid=258476501669044711
    hostid=2207000451
    hostname='wega'
    vdev_tree
        type='root'
        id=0
        guid=258476501669044711
        children[0]
            type='mirror'
            id=0
            guid=1727969291773901682
            metaslab_array=14
            metaslab_shift=29
            ashift=9
            asize=64411140096
            is_log=0
            children[0]
                type='disk'
                id=0
                guid=3314848442807482804
                path='/dev/dsk/c4t600A0B80002ACF5A0D584AD5EEFFd0s0'
                devid='id1,s...@n600a0b80002acf5a0d584ad5eeff/a'
                phys_path='/scsi_vhci/s...@g600a0b80002acf5a0d584ad5eeff:a'
                whole_disk=1
                DTL=35
            children[1]
                type='disk'
                id=1
                guid=15321902971336355296
                path='/dev/dsk/c4t600A0B80002ACF32157F4ACDF206d0s0'
                devid='id1,s...@n600a0b80002acf32157f4acdf206/a'
                phys_path='/scsi_vhci/s...@g600a0b80002acf32157f4acdf206:a'
                whole_disk=1
                DTL=36
WARNING: can't open objset for dmysqlb2/var

Uberblock

        magic = 00bab10c
        version = 15
        txg = 1350153
        guid_sum = 2176453133877232877
        timestamp = 1294658859 UTC = Mon Jan 10 12:27:39 2011

Dataset mos [META], ID 0, cr_txg 4, 10.7M, 92 objects

Metaslabs:
        vdev    offset    spacemap    free
        ----    ------    --------    ----
        vdev 0  offset 0      spacemap 17  free 237M
        vdev 0  offset 2000   spacemap 19  free 139M
        vdev 0  offset 4000   spacemap 23  free 87.8M
        vdev 0  offset 6000   spacemap 32  free 264M
        vdev 0  offset 8000   spacemap 34  free 262M
        vdev 0  offset a000   spacemap 37  free 258M
        vdev 0  offset c000   spacemap 40  free 273M
        vdev 0  offset e000   spacemap 64  free 238M
        vdev 0  offset 1      spacemap 65  free 274M
        vdev 0  offset 12000  spacemap 66  free 55.1M
        vdev 0  offset 14000  spacemap 67  free 212K
        vdev 0  offset 16000  spacemap 69  free 155K
        vdev 0  offset 18000  spacemap 71  free 270M
        vdev 0  offset 1a000  spacemap 75  free 91.1M
        vdev 0  offset 1c000  spacemap 77  free 158M
        vdev 0  offset 1e000  spacemap 80  free 251M
        vdev 0  offset 2      spacemap 82  free 260M
        vdev 0  offset 22000  spacemap 92  free 283M
        vdev 0  offset 24000  spacemap 93  free 279M
        vdev 0  offset 26000  spacemap 94  free 50.4M
        vdev 0  offset 28000  spacemap 95  free 136M
        vdev 0  offset 2a000  spacemap 0   free 512M
        vdev 0  offset 2c000  spacemap 0   free 512M
        vdev 0  offset 2e000  spacemap 16  free 59.3M
        vdev 0  offset 3
Re: [zfs-discuss] Migrating zpool to new drives with 4K Sectors
Actually, it is not my blog ;)

To answer your question: you first need to create a new vdev that is 4K-aligned, unfortunately. I am not aware of any other means to accomplish what you seek.
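A rough sketch of the migration itself, once a 4K-aligned destination pool exists (the pool and snapshot names here are made up, and how you get ashift=12 on the new vdevs depends on your platform and ZFS build):

    # Snapshot the old pool recursively, then replicate the whole hierarchy,
    # properties and snapshots included, into the new 4K-aligned pool
    zfs snapshot -r oldtank@migrate
    zfs send -R oldtank@migrate | zfs receive -F newtank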
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Sat, Jan 08, 2011 at 12:59:17PM -0500, Edward Ned Harvey wrote:
> Has anybody measured the cost of enabling or disabling verification?

Of course there is no easy answer. :)

Let me explain how verification works exactly, first. You try to write a block. You see that the block is already in the dedup table (it is already referenced). You read the block (maybe it is in the ARC or in the L2ARC). You compare the read block with what you want to write.

Based on the above:

1. If you have dedup on, but your blocks are not deduplicable at all, you will pay no price for verification, as there will be no need to compare anything.

2. If your data is highly deduplicable, you will verify often. Now it depends whether the data you need to read fits into your ARC/L2ARC or not. If it can be found in the ARC, the impact will be small. If your pool is very large and you can't count on ARC help, each write will be turned into a read.

Also note an interesting property of dedup: if your data is highly deduplicable, you can actually improve performance by avoiding data writes (and just increasing the reference count).

Let me show you three degenerate tests to compare the options. I'm writing 64GB of zeros to a pool with dedup turned off, with dedup turned on, and with dedup+verification turned on (I use the SHA256 checksum everywhere):

# zpool create -O checksum=sha256 tank ada{0,1,2,3}
# time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank'
      254,11 real         0,07 user        40,80 sys

# zpool create -O checksum=sha256 -O dedup=on tank ada{0,1,2,3}
# time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank'
      154,60 real         0,05 user        37,10 sys

# zpool create -O checksum=sha256 -O dedup=sha256,verify tank ada{0,1,2,3}
# time sh -c 'dd if=/dev/zero of=/tank/zero bs=1m count=65536; sync; zpool export tank'
      173,43 real         0,02 user        38,41 sys

As you can see, in the second and third tests the data is of course in the ARC, so the difference here is only because of data comparison (no extra reads are needed), and verification is 12% slower. This is of course a silly test, but as you can see, dedup (even with verification) is much faster than the no-dedup case; this data is highly deduplicable, though :)

# zpool list
NAME   SIZE  ALLOC  FREE   CAP       DEDUP  HEALTH  ALTROOT
tank   149G  8,58M  149G    0%  524288.00x  ONLINE  -

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
p...@freebsd.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
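If you want to flip verification on or off for an existing dataset rather than at pool creation time, a minimal sketch (the dataset name is made up; only blocks written after the change are affected):

    # Dedup with SHA-256 plus byte-for-byte verification
    zfs set dedup=sha256,verify tank/data

    # Dedup on SHA-256 alone, trusting the hash
    zfs set dedup=sha256 tank/data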
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On 01/08/11 05:59 PM, Edward Ned Harvey wrote:
> Has anybody measured the cost of enabling or disabling verification?

The cost of disabling verification is an infinitesimally small number multiplied by possibly all your data. Basically lim->0 times lim->infinity. This can only be evaluated on a case-by-case basis, and there's no use in making any more generalizations for or against it.

> The benefit of disabling verification would presumably be faster performance. Has anybody got any measurements, or even calculations or vague estimates or clueless guesses, to indicate how significant this is? How much is there to gain by disabling verification?

Exactly my point, and there isn't one answer which fits all environments. In the testing I'm doing, so far enabling/disabling verification doesn't make any noticeable difference, so I'm sticking with verify. But I have enough memory, and such a workload, that I see few physical reads going on.

--
Robert Milkowski
http://milek.blogspot.com
Re: [zfs-discuss] (Fletcher+Verification) versus (Sha256+No Verification)
On Sun, Jan 09, 2011 at 07:27:52PM -0500, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Pawel Jakub Dawidek
> >
> > Dedupditto doesn't work exactly that way. You can have at most 3 copies
> > of your block. Dedupditto minimal value is 100. The first copy is
> > created on first write, the second copy is created on dedupditto
> > references and the third copy is created on 'dedupditto * dedupditto'
> > references. So once you reach 10,000 references of your block ZFS will
> > create three physical copies, not earlier and never more than three.
>
> What is the point of dedupditto? If there is a block on disk, especially on
> a pool with redundancy so it can safely be assumed good now and for the
> future... Why store the multiples? Even if it is a maximum of 3, I
> presently only see the sense in a maximum of 1.

Well, I find it quite reasonable. If your block is referenced 100 times, it is probably quite important. There are many corruption possibilities that can destroy your data. Imagine a memory error which corrupts io_offset in a write zio structure, and the corrupted io_offset points at your deduped block referenced 100 times. It will be overwritten and redundancy won't help you. You will be able to detect the corruption on read, but it will be too late.

Note that deduped data is not alone here. Pool-wide metadata are stored 'copies+2' times (but no more than three) and dataset-wide metadata are stored 'copies+1' times (but no more than three), so by default pool metadata have three copies and dataset metadata have two copies, AFAIR. When you lose the root node of a tree, you lose all your data; are you really, really sure only one copy is enough?

--
Pawel Jakub Dawidek                       http://www.wheelsystems.com
p...@freebsd.org                           http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
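For reference, dedupditto is a pool-level property; a minimal sketch of setting the threshold described above (the pool name is a placeholder):

    # Keep an extra copy of a deduped block once it reaches 100 references,
    # and a third copy at 100 * 100 = 10,000 references
    zpool set dedupditto=100 tank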