Ah, right.  I'm sorry to say I hadn't TFA yet and wasn't thinking of 
block-level deduplication.  But though block-level compares might be possible, 
the distributed nature of swift would likely make this quite complicated.  It 
might make sense for KVM to compare memory pages, but I imagine that locality 
of data (all on the same host) figured heavily in that implementation decision.

I am, however, confused by your assertion that rsync 'relies on comparison to 
ensure that the match is real, and not just a hash collision'.  

>From my reading, rsync computes an MD5 hash and a weaker (and easier to 
>compute) 'rolling checksum' on the recipient data, and then the sender also 
>computes the computationally cheap 'checksum' and will only bother computing 
>the MD5 hash when the checksums match.  So rsync is optimizing hash 
>generation, but is still relying on hash collision rather than 'comparison'.  
>Or am I missing something?

Regardless, I enjoyed reading Paulo's post and learning about the role 
deduplication might play in balancing storage and retrieval costs in swift.

Cheers,


Maru


On 2012-03-10, at 1:13 PM, andi abes wrote:

> Maybe a happy path exists, between efficiency and correctness ;) I
> think the Rsync is probably a good comparison to the use case at hand
> (it identifies identical blocks between the source and target, and
> only sends deltas of the wire).
> It combines a quick has to identify candidates that might be
> duplicates, but relies on comparison to ensure that the match is real,
> and not just a hash collision.
> 
> See the source of all knowledge:
> http://en.wikipedia.org/wiki/Rsync#Algorithm
> 
> 
> 
> 
> 
> 
> On Sat, Mar 10, 2012 at 1:15 PM, Maru Newby <mne...@internap.com> wrote:
>> Hi Joe,
>> 
>> There's one huge difference between page deduplication and object
>> deduplication:  Page size is small and predictable, whereas object size is
>> not.  Given this, full compares would not be a good way to implement
>> performant object deduplication in swift.
>> 
>> Thanks,
>> 
>> 
>> Maru
>> 
>> 
>> On 2012-03-10, at 9:57 AM, Joe Gordon wrote:
>> 
>> Paulo, Caitlin,
>> 
>> 
>> Can SHA-1 collisions be generated?  If so can you point me to the article?
>> 
>> Also why compare hashes in the first place?  Linux 'Kenel Samepage Merging',
>> which does page deduplication for KVM, does a full compare to be safe [1].
>>  Even if collisions can't be generated, what are the odds of a collision
>> (for SHA-1 and SHA-256) happening by chance when using Swift at scale?
>> 
>> 
>> best,
>> Joe Gordon
>> 
>> 
>> 
>> 
>> [1] http://www.linux-kvm.com/sites/default/files/KvmForum2008_KSM.pdf
>> 
>> 
>> On Fri, Mar 9, 2012 at 4:44 PM, Caitlin Bestler
>> <caitlin.best...@nexenta.com> wrote:
>>> 
>>> Paulo,
>>> 
>>> 
>>> 
>>> I believe you’ll find that we’re thinking along the same lines. Please
>>> review my proposal at http://etherpad.openstack.org/P9MMYSWE6U
>>> 
>>> 
>>> 
>>> One quick observation is that SHA-1 is totally inadequate for
>>> fingerprinting objects in a public object store. An attacker could easily
>>> 
>>> predict the fingerprint of content likely to be posted, generate alternate
>>> content that had the same SHA-1 fingerprint and pre-empt
>>> 
>>> the signature. For example: an ISO of an open source OS distribution. If I
>>> get my false content with the same fingerprint into the
>>> 
>>> repository first then everyone who downloads that ISO will get my altered
>>> copy.
>>> 
>>> 
>>> 
>>> SHA-256 is really needed to make this type of attack infeasible.
>>> 
>>> 
>>> 
>>> I also think that distributed deduplication works very well with object
>>> versioning. Your comments on the proposal cited above
>>> 
>>> would be great to hear.
>>> 
>>> 
>>> 
>>> From: openstack-bounces+caitlin.bestler=nexenta....@lists.launchpad.net
>>> [mailto:openstack-bounces+caitlin.bestler=nexenta....@lists.launchpad.net]
>>> On Behalf Of Paulo Ricardo Motta Gomes
>>> Sent: Thursday, March 08, 2012 1:19 PM
>>> To: openstack@lists.launchpad.net
>>> 
>>> 
>>> Subject: [Openstack] Enabling data deduplication on Swift
>>> 
>>> 
>>> 
>>> Hello everyone,
>>> 
>>> 
>>> 
>>> I'm a student of the European Master in Distributed Computing (EMDC)
>>> currently working on my master thesis on distributed content-addressable
>>> storage/deduplication.
>>> 
>>> 
>>> 
>>> I'm happy to announce I will be contributing the outcome of my thesis work
>>> to OpenStack by enabling both object-level and block-level deduplication
>>> functionality on Swift
>>> (https://answers.launchpad.net/swift/+question/156862).
>>> 
>>> 
>>> 
>>> I have written a detailed blog post where I describe the initial
>>> architecture of my
>>> solution: 
>>> http://paulormg.com/2012/03/05/enabling-deduplication-in-a-distributed-object-storage/
>>> 
>>> 
>>> 
>>> Feedback from the OpenStack/Swift community would be very appreciated.
>>> 
>>> 
>>> 
>>> Cheers,
>>> 
>>> 
>>> 
>>> Paulo
>>> 
>>> 
>>> 
>>> --
>>> European Master in Distributed Computing - www.kth.se/emdc
>>> Royal Institute of Technology - KTH
>>> 
>>> Instituto Superior Técnico - IST
>>> 
>>> http://paulormg.com
>>> 
>>> 
>>> _______________________________________________
>>> Mailing list: https://launchpad.net/~openstack
>>> Post to     : openstack@lists.launchpad.net
>>> Unsubscribe : https://launchpad.net/~openstack
>>> More help   : https://help.launchpad.net/ListHelp
>>> 
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~openstack
>> Post to     : openstack@lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~openstack
>> More help   : https://help.launchpad.net/ListHelp
>> 
>> 
>> 
>> _______________________________________________
>> Mailing list: https://launchpad.net/~openstack
>> Post to     : openstack@lists.launchpad.net
>> Unsubscribe : https://launchpad.net/~openstack
>> More help   : https://help.launchpad.net/ListHelp
>> 


_______________________________________________
Mailing list: https://launchpad.net/~openstack
Post to     : openstack@lists.launchpad.net
Unsubscribe : https://launchpad.net/~openstack
More help   : https://help.launchpad.net/ListHelp

Reply via email to