Ah, right. I'm sorry to say I hadn't TFA yet and wasn't thinking of block-level deduplication. But though block-level compares might be possible, the distributed nature of swift would likely make this quite complicated. It might make sense for KVM to compare memory pages, but I imagine that locality of data (all on the same host) figured heavily in that implementation decision.
I am, however, confused by your assertion that rsync 'relies on comparison to ensure that the match is real, and not just a hash collision'. >From my reading, rsync computes an MD5 hash and a weaker (and easier to >compute) 'rolling checksum' on the recipient data, and then the sender also >computes the computationally cheap 'checksum' and will only bother computing >the MD5 hash when the checksums match. So rsync is optimizing hash >generation, but is still relying on hash collision rather than 'comparison'. >Or am I missing something? Regardless, I enjoyed reading Paulo's post and learning about the role deduplication might play in balancing storage and retrieval costs in swift. Cheers, Maru On 2012-03-10, at 1:13 PM, andi abes wrote: > Maybe a happy path exists, between efficiency and correctness ;) I > think the Rsync is probably a good comparison to the use case at hand > (it identifies identical blocks between the source and target, and > only sends deltas of the wire). > It combines a quick has to identify candidates that might be > duplicates, but relies on comparison to ensure that the match is real, > and not just a hash collision. > > See the source of all knowledge: > http://en.wikipedia.org/wiki/Rsync#Algorithm > > > > > > > On Sat, Mar 10, 2012 at 1:15 PM, Maru Newby <mne...@internap.com> wrote: >> Hi Joe, >> >> There's one huge difference between page deduplication and object >> deduplication: Page size is small and predictable, whereas object size is >> not. Given this, full compares would not be a good way to implement >> performant object deduplication in swift. >> >> Thanks, >> >> >> Maru >> >> >> On 2012-03-10, at 9:57 AM, Joe Gordon wrote: >> >> Paulo, Caitlin, >> >> >> Can SHA-1 collisions be generated? If so can you point me to the article? >> >> Also why compare hashes in the first place? Linux 'Kenel Samepage Merging', >> which does page deduplication for KVM, does a full compare to be safe [1]. >> Even if collisions can't be generated, what are the odds of a collision >> (for SHA-1 and SHA-256) happening by chance when using Swift at scale? >> >> >> best, >> Joe Gordon >> >> >> >> >> [1] http://www.linux-kvm.com/sites/default/files/KvmForum2008_KSM.pdf >> >> >> On Fri, Mar 9, 2012 at 4:44 PM, Caitlin Bestler >> <caitlin.best...@nexenta.com> wrote: >>> >>> Paulo, >>> >>> >>> >>> I believe you’ll find that we’re thinking along the same lines. Please >>> review my proposal at http://etherpad.openstack.org/P9MMYSWE6U >>> >>> >>> >>> One quick observation is that SHA-1 is totally inadequate for >>> fingerprinting objects in a public object store. An attacker could easily >>> >>> predict the fingerprint of content likely to be posted, generate alternate >>> content that had the same SHA-1 fingerprint and pre-empt >>> >>> the signature. For example: an ISO of an open source OS distribution. If I >>> get my false content with the same fingerprint into the >>> >>> repository first then everyone who downloads that ISO will get my altered >>> copy. >>> >>> >>> >>> SHA-256 is really needed to make this type of attack infeasible. >>> >>> >>> >>> I also think that distributed deduplication works very well with object >>> versioning. Your comments on the proposal cited above >>> >>> would be great to hear. >>> >>> >>> >>> From: openstack-bounces+caitlin.bestler=nexenta....@lists.launchpad.net >>> [mailto:openstack-bounces+caitlin.bestler=nexenta....@lists.launchpad.net] >>> On Behalf Of Paulo Ricardo Motta Gomes >>> Sent: Thursday, March 08, 2012 1:19 PM >>> To: openstack@lists.launchpad.net >>> >>> >>> Subject: [Openstack] Enabling data deduplication on Swift >>> >>> >>> >>> Hello everyone, >>> >>> >>> >>> I'm a student of the European Master in Distributed Computing (EMDC) >>> currently working on my master thesis on distributed content-addressable >>> storage/deduplication. >>> >>> >>> >>> I'm happy to announce I will be contributing the outcome of my thesis work >>> to OpenStack by enabling both object-level and block-level deduplication >>> functionality on Swift >>> (https://answers.launchpad.net/swift/+question/156862). >>> >>> >>> >>> I have written a detailed blog post where I describe the initial >>> architecture of my >>> solution: >>> http://paulormg.com/2012/03/05/enabling-deduplication-in-a-distributed-object-storage/ >>> >>> >>> >>> Feedback from the OpenStack/Swift community would be very appreciated. >>> >>> >>> >>> Cheers, >>> >>> >>> >>> Paulo >>> >>> >>> >>> -- >>> European Master in Distributed Computing - www.kth.se/emdc >>> Royal Institute of Technology - KTH >>> >>> Instituto Superior Técnico - IST >>> >>> http://paulormg.com >>> >>> >>> _______________________________________________ >>> Mailing list: https://launchpad.net/~openstack >>> Post to : openstack@lists.launchpad.net >>> Unsubscribe : https://launchpad.net/~openstack >>> More help : https://help.launchpad.net/ListHelp >>> >> >> _______________________________________________ >> Mailing list: https://launchpad.net/~openstack >> Post to : openstack@lists.launchpad.net >> Unsubscribe : https://launchpad.net/~openstack >> More help : https://help.launchpad.net/ListHelp >> >> >> >> _______________________________________________ >> Mailing list: https://launchpad.net/~openstack >> Post to : openstack@lists.launchpad.net >> Unsubscribe : https://launchpad.net/~openstack >> More help : https://help.launchpad.net/ListHelp >> _______________________________________________ Mailing list: https://launchpad.net/~openstack Post to : openstack@lists.launchpad.net Unsubscribe : https://launchpad.net/~openstack More help : https://help.launchpad.net/ListHelp