Re: Fast hash of NSData?

Graham Cox Mon, 02 Dec 2013 09:30:08 -0800

On 2 Dec 2013, at 3:43 pm, Roland King <r...@rols.org> wrote:

> I keep not seeing the problem here .. you have two NSData's, you want to know 
> if they are the same, at the time you store them you store the data and some 
> fast-but-not-necessarily-very-good hash. So 'are they the same' becomes
> 
> is the underlying data the the same length? - no - not the same
> do they have the same rapid hash? - no - not the same
> actually compare the data byte by byte if the length and hash are the same to 
> work out if they are the same. rare case. 
> 
> You always get the right answer and even a fairly cruddy hash collides rarely 
> enough to make a full data test unlikely. 
> 
> reductio ad absurdum leads to not even bothering to hash them at all, 
> comparing the length and if it's the same, just comparing the data.


Essentially that’s correct, but the original use case means that some care is 
needed to keep things moving at a reasonable clip.

Let’s say I have an array of data blocks I’m using, and the user adds a new 
one. I need to know whether I’ve seen this data before. I could search the 
list, doing a comparison with each one, and
afterwards I know if I’ve seen it before or not, but the worst case is the most 
common one - I haven’t seen it before, but I’ve just spent ages doing a 
potentially massive byte compare to check.

The hash here is both a quick way to check whether I might have seen it before 
and to serve as a key to the data that yields that hash. I mean, that’s how a 
hash is typically used anyway. I compute the hash and I can go straight away to 
the data that matches and ignore all others, because it’s keyed by the hash. 
It’s only at that point I need to do an extra compare to check whether it’s a 
hash collision or a genuine match. (And the argument has become, is this final 
compare really needed? Looks unlikely, but doing it has a low cost, so why not, 
appears to be the answer).

> Reductio to even more absurdum says if two chunks of data are the same 
> length, just store the both of 'em. 

Except that no longer solves the original problem, which is to make sure that 
adding two identical blocks of data (images, as it happens) only yields one 
copy of said data in the eventual file stream for that document. If I just 
store them without bothering to check for equality, adding the same image 100 
times adds the same data 100 times to the file. It seems pretty obvious you’d 
want to avoid that.

—Graham


_______________________________________________

Cocoa-dev mailing list (Cocoa-dev@lists.apple.com)

Please do not post admin requests or moderator comments to the list.
Contact the moderators at cocoa-dev-admins(at)lists.apple.com

Help/Unsubscribe/Update your Subscription:
https://lists.apple.com/mailman/options/cocoa-dev/archive%40mail-archive.com

This email sent to arch...@mail-archive.com

Re: Fast hash of NSData?

Reply via email to