Hmm, you might want to look at Andrew Tridgell's thesis (yes, Andrew of Samba fame), as he had to solve this very question in order to select an algorithm to use inside rsync.

--dave

Darren J Moffat wrote:
> [EMAIL PROTECTED] wrote:
>
>>[EMAIL PROTECTED] wrote on 07/08/2008 03:08:26 AM:
>>
>>>>Does anyone know a tool that can look over a dataset and give
>>>>duplication statistics? I'm not looking for something incredibly
>>>>efficient but I'd like to know how much it would actually benefit our
>>>
>>>Check out the following blog..:
>>>
>>>http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool
>>
>>Just want to add, while this is ok to give you a ballpark dedup number --
>>fletcher2 is notoriously collision prone on real data sets. It is meant to
>>be fast at the expense of collisions. This issue can show much more dedup
>>possible than really exists on large datasets.
>
> Doing this using sha256 as the checksum algorithm would be much more
> interesting. I'm going to try that now and see how it compares with
> fletcher2 for a small contrived test.
>

--
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
[EMAIL PROTECTED]              |                      -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
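
For anyone who wants to try the comparison discussed above on their own data, here is a rough sketch of the idea, not the script from the blog post and not anything ZFS itself runs. It cuts data into fixed-size blocks, checksums each block, and reports total blocks over unique checksums as an estimated dedup ratio. The function names, the 128 KiB default block size, and the byte-sum checksum are all assumptions made for illustration; in particular, `byte_sum_digest` is a deliberately weak stand-in that shows how a collision-prone checksum overstates dedup, and it is not fletcher2.

```python
import hashlib
from collections import Counter


def split_blocks(data, block_size=128 * 1024):
    """Cut a bytes object into fixed-size blocks (the last one may be shorter).

    The 128 KiB default is an assumption for illustration only.
    """
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def sha256_digest(block):
    """Strong checksum: accidental collisions are not a practical concern."""
    return hashlib.sha256(block).digest()


def byte_sum_digest(block):
    """Deliberately weak stand-in checksum (NOT fletcher2): sum of bytes mod 256.

    Any two blocks containing the same bytes in a different order collide.
    """
    return sum(block) % 256


def dedup_stats(blocks, digest):
    """Return (total_blocks, unique_checksums, estimated_dedup_ratio)."""
    counts = Counter(digest(b) for b in blocks)
    total = sum(counts.values())
    unique = len(counts)
    ratio = total / unique if unique else 1.0
    return total, unique, ratio


if __name__ == "__main__":
    # Tiny contrived input: b"ab" really does repeat, b"ba" only looks like it
    # does to the weak checksum.
    blocks = split_blocks(b"abbaabcd", block_size=2)
    for name, fn in (("sha256", sha256_digest), ("byte-sum", byte_sum_digest)):
        total, unique, ratio = dedup_stats(blocks, fn)
        print(f"{name}: {total} blocks, {unique} unique, ratio {ratio:.2f}")
```

On that contrived input the weak checksum reports fewer unique blocks than sha256 does, which is the same effect described above: checksum collisions get counted as duplicates, so the estimate shows more dedup than really exists.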