Tino Schwarze wrote at about 11:35:46 +0200 on Thursday, June 4, 2009:
> Hi there,
>
> (I already felt like I was going to look dumb or anxious by writing what
> I wrote...)
>
> On Wed, Jun 03, 2009 at 01:09:38PM -0400, Jeffrey J. Kosowsky wrote:
> > Tino Schwarze wrote at about 18:39:26 +0200 on Wednesday, June 3, 2009:
> > > > > I recently heard about lessfs, which runs on top of FUSE to provide
> > > > > a file system that does block-level de-duplication. See:
> > > > >
> > > > > http://www.lessfs.com
> > > > > https://sourceforge.net/project/showfiles.php?group_id=257120
> > > > > http://tokyocabinet.sourceforge.net/index.html
> > > > >
> > > > > The actual storage is several very large (sparse?) files on any
> > > > > file system(s) of your choice. It should provide all the benefits
> > > > > you expect: no issues of local limitations on hardlink counts,
> > > > > meta-data etc, and the database files can be copied or rsynced.
> > > > > I'm corresponding with the author to see if some additional useful
> > > > > features could be added.
> > >
> > > Well, we've already got MD4 checksums of file blocks. And if I
> > > understand everything correctly, we DO GET collisions, therefore the
> > > hash chains.
> >
> > First, the hash chains are based on *partial* file *md5* (not md4)
> > sums.
> >
> > Second, the collisions only occur because the hash is only done on the
> > first and eighth (or last for small files) 128K block. So, obviously
> > you will have collisions for large files that have the same first and
> > eighth block.
>
> That was the first flaw of my thoughts... So I would have to scan my
> pool and compare first and eighth 128k block (e.g. 0-128k and 1M-1M128k
> or is it 896k-1M?) for matches? Maybe I'll try that, out of sheer
> curiosity (if I find the time to script it).
>
> > > Of course, this is for 256k blocks, IIRC. And "only" 128 bit hashes.
> > > But I don't like the idea of relying on probabilities. I've got enough
> > > uncertainties by flaky hardware, bugs etc.
> >
> > We rely on probabilities in all aspects of life. Nothing is certain.
>
> I know that. Sometimes I'm paranoid - I just like to get rid of
> probabilities (=uncertainties) where possible.

But that is what I mean by your not understanding probability. If you
believe in math and physics, then EVERYTHING in life is a probability.
There is a real probability that anything will break (including
yourself) at any moment. For electronic devices such as hard drives this
probability is well-modeled for the most common failures. According to
quantum mechanics there is a (truly infinitesimal) probability that you
will spontaneously transform into a monkey.

The point is that what you describe, "I just like to get rid of
probabilities (=uncertainties) where possible", is impossible; at best
you can reduce the probability of an adverse event. My point is that if
a collision is trillions and trillions of times less likely than more
mundane things such as hardware failure, then you are much better off
worrying about reducing those risks than worrying about collisions.
Worrying about collisions is analogous to worrying about the quantum
mechanical uncertainty principle at the macro scale. Spitting in the
ocean is more effective than worrying about the incremental adverse risk
of collisions with a 192-bit hash on 256K blocks.

> > It all depends on the probability... I would much prefer to take the
> > risk of a mathematically known infinitesimal probability (of the order
> > of md5 hash collisions) than what most people in life take for granted
> > as "absolute" fact. At least with a mathematically modeled system you
> > know the risk which is more than most of us know about most other
> > elements of our systems.
> >
> > > I won't trust such a file system for backup data.
> >
> > Making blanket statements like that shows a lack of understanding of
> > probability vs. certainty in the world.
>
> Well, I just said, *I* won't trust such a file system. It's just a
> gut feeling. Something which isn't logical or anything.

OK - if you don't believe in logic, then I can't argue with you. You
might as well use Feng Shui to improve your data reliability.
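
To put a rough number on the collision risk for a 192-bit hash on 256K
blocks, here is a minimal sketch using the standard birthday
approximation (probability ~ number of block pairs / number of possible
hash values). This is only an illustration, not necessarily the
calculation from the earlier post: the 1 PiB pool size, the assumption
that every block is distinct, and the assumption that the hash behaves
like a uniformly random function are all assumptions made here, and the
function name is hypothetical.

```python
from fractions import Fraction


def collision_probability(pool_bytes, block_bytes, hash_bits):
    """Birthday-bound estimate of at least one hash collision.

    Assumes every block is distinct and the hash is uniformly random,
    so the estimate is pairs_of_blocks / 2**hash_bits. This is an upper
    bound that is very tight when the result is tiny.
    """
    n = pool_bytes // block_bytes      # number of blocks in the pool
    pairs = n * (n - 1) // 2           # number of distinct block pairs
    return Fraction(pairs, 2 ** hash_bits)


PIB = 2 ** 50                          # assumed pool size: 1 PiB
BLOCK = 256 * 1024                     # 256K blocks, as discussed above
blocks = PIB // BLOCK
p = collision_probability(PIB, BLOCK, 192)

print(f"blocks in pool:        {blocks}")
print(f"collision probability: {float(p):.2e}")
```

Under these assumptions the pool holds 2^32 blocks and the estimate
comes out on the order of 10^-39, which is dozens of orders of magnitude
below any plausible hardware failure rate.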

> > If for example, the probability of a collision is many orders of
> > magnitude less than the probability of you losing all your backups
> > then I wouldn't worry about it. It all depends on the probability...
>
> The bad thing about probabilities is that they don't tell you anything
> about what will happen, just about what might happen. Even if the
> probability is very, very, very, very small, it doesn't mean it will
> not instantly happen the next second. It's just very unlikely.

True - but that is fundamental to all life. The protons in your body
could all simultaneously decay the next second -- possible though a bit
unlikely ;)

As I proved in my earlier post, the chance of a collision on even a
petabyte-sized pool is about 1 in 10^38. Considering that the ocean has
a volume of about 1.4 x 10^25 ml, spitting in the ocean (assuming 10 ml
of spit) would have an effect of about 1 part in 1.4 x 10^24. So the
chance of a collision is, in a loose sense, about 100 trillion times
smaller than the effect of spitting in the ocean.

For your own sake and certainly for the sake of those you work for,
please take a course in probability and absorb its meaning. You simply
cannot make good decisions in life in general or in protecting the
resources of your company if you cannot distinguish between 1 in 10^38
risks vs. 1 in 100 risks.

_______________________________________________
BackupPC-users mailing list
BackupPC-users@lists.sourceforge.net
List:    https://lists.sourceforge.net/lists/listinfo/backuppc-users
Wiki:    http://backuppc.wiki.sourceforge.net
Project: http://backuppc.sourceforge.net/