Most of this, while well documented, seems to boil down to the same alarmist notion that had people trying to ban cell phones in gas stations. The possibility that something untoward COULD happen does NOT mean it WILL happen. To date I don't know of a single gas pump explosion or car fire that was traced to cell phone usage at the pump. Oddly enough, though, no one monitors gas pumps to be sure users aren't re-entering their vehicles, and fires HAVE been traced to static electricity caused by that.
If odds are so important, it seems it would be important to worry about the odds that your data center, your offsite storage location and your Disaster Recovery site will all be taken out at the same time.

I also suggest the argument is flawed because it seems to imply that only the cksum is stored and not the actual data. It is the original compressed data AND the cksum that result in the restore, not the cksum alone.

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of bob944
Sent: Wednesday, September 26, 2007 4:03 AM
To: veritas-bu@mailman.eng.auburn.edu
Subject: Re: [Veritas-bu] Tapeless backup environments?

cpreston:
> >Simplistically, it checksums the "block" and looks in a table of
> >checksums-of-"blocks"-that-it-already-stores to see if the identical
> ><ahem, anyone see a hole here?> data already lives there.
>
> To what hole do you refer?

The idea that N bits of data can unambiguously be represented by fewer than N bits. Anyone who claims to the contrary might as well knock out perpetual motion, antigravity and faster-than-light travel while they're on a roll.

> I see one in your simplistic example, but
> not in what actually happens (which require a much longer technical
> explanation).

Hence my introduction that began with "[s]implistically." But throw in all the "much longer technical explanation" you like, any process which compares a reduction-of-data to another reduction-of-data will sooner or later return "foo" when what was originally stored was "bar."

cpreston:
> There are no products in the market that rely solely on a checksum to
> identify redundant data. There are a few that rely solely on a 160-bit
> hash, which is significantly larger than a checksum (typically 12-16

No importa. The length of the checksum/hash/fingerprint and the sophistication of its algorithm only affect how frequently--not whether--the incorrect answer is generated.

> [...]
> The ability to forcibly create a hash collision means
> absolutely nothing in the context of deduplication.

Of course it does. Most examples in the literature concern storing crafted-data-pattern-A ("pay me one dollar") in order for the data to be read later as something different ("pay me one million dollars"). It can't have escaped your attention that every day, some yahoo crafts another buffer-or-stack overflow exploit; some of them are brilliant. The notion that the bad guys will never figure out a way to plant a silent data-change based on checksum/hash/fingerprint collisions is, IMO, naive.

> What matters is the chance that two
> random chunks would have a hash collision. With a 128-bit and 160-bit
> key space, the odds of that happening are 1 in 2128 with MD5, and 1 in
> 2160 with SHA-1. That's 1038 and 1048, respectively. If you

Grasshopper, the wisdom is not in the numbers, it is in remembering that HTML will not paste into ASCII well. But I suspect you mean "one in 2^128" or similar. Those are impressive, and dare I guess, vendor-supplied, numbers. And they're meaningless.

We do not care about the odds that a particular block "the quick brown fox jumps over the lazy dog" checksums/hashes/fingerprints to the same value as another particular block "now is the time for all good men to come to the aid of their party." Of _course_ that will be astronomically unlikely, and with sufficient hand-waving (to quote your article: "the odds of a hash collision with two random chunks are roughly 1,461,501,637,330,900,000,000,000,000 times greater than the number of bytes in the known computing universe") these totally meaningless numbers can seem important. They're not.

What _is_ important? To me, it's important that if I read back any of the N terabytes of data I might store this week, I get the same data that was written, not a silently changed version because the checksum/hash/fingerprint of one block that I wrote collides with another checksum/hash/fingerprint.
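
For a sense of scale on the two kinds of odds being argued here (one particular pair of blocks versus any pair anywhere in the store), the following is a rough sketch in Python of the standard birthday-bound estimate. It is not taken from either poster or from any vendor. It assumes hash outputs behave like uniformly random values and that chunks are a fixed 8 KiB; the function names and the 1 PiB store size are hypothetical, chosen only for illustration. It says nothing about deliberately crafted collisions, which is the separate concern raised above.

```python
import math


def collision_probability(num_chunks: int, hash_bits: int) -> float:
    """Birthday-bound estimate of the chance that at least two of
    num_chunks distinct chunks share the same hash value.

    Assumption: hash outputs are uniformly random and independent,
    which holds only for non-adversarial data.
    """
    pairs = num_chunks * (num_chunks - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2.0 ** hash_bits)


def chunks_in(store_bytes: int, chunk_bytes: int = 8 * 1024) -> int:
    """Number of fixed-size chunks in a store (8 KiB is an assumed size)."""
    return store_bytes // chunk_bytes


if __name__ == "__main__":
    PIB = 2 ** 50  # hypothetical store size: 1 PiB of unique data
    n = chunks_in(PIB)
    for name, bits in (("MD5", 128), ("SHA-1", 160)):
        p = collision_probability(n, bits)
        print(f"{name}: {n} chunks, P(any collision) ~ {p:.3e}")
```

The estimate grows with the square of the number of chunks, so the per-pair figure of one in 2^128 understates the whole-store risk, yet the result stays small for stores of this size. Whether "small" is acceptable is exactly the point in dispute in this thread.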
I can NOT have that happen to any block--in a file clerk's .pst, a directory inode or the finance database. "Probably, it won't happen" is not acceptable.

> Let's compare those odds with the odds of an unrecoverable
> read error on a typical disk--approximately 1 in 100 trillion

Bogus comparison. In this straw man, that 1/100,000,000,000,000 read error a) probably doesn't affect anything because of the higher-level RAID array it's in and b) if it does, there's an error, a we-could-not-read-this-data, you-can't-proceed, stop, fail, get-it-from-another-source error--NOT a silent changing of the data from foo to bar on every read with no indication that it isn't the data that was written.

> If you want to talk about the odds of something bad happening and not
> knowing it, keep using tape. Everyone who has worked with tape for any
> length of time has experienced a tape drive writing something that it
> then couldn't read.

That's not news, and why we've been making copies of data for, oh, 50 years or so.

> Compare that to successful deduplication disk
> restores. According to Avamar Technologies Inc. (recently acquired by
> EMC Corp.), none of its customers has ever had a failed restore.

Now _there's_ an unbiased source.

_______________________________________________
Veritas-bu maillist - Veritas-bu@mailman.eng.auburn.edu
http://mailman.eng.auburn.edu/mailman/listinfo/veritas-bu

----------------------------------
CONFIDENTIALITY NOTICE: This e-mail may contain privileged or confidential information and is for the sole use of the intended recipient(s). If you are not the intended recipient, any disclosure, copying, distribution, or use of the contents of this information is prohibited and may be unlawful. If you have received this electronic transmission in error, please reply immediately to the sender that you have received the message in error, and delete it. Thank you.