Apr 14, 2007, at 12:15 PM, David Barrett wrote:

David -- Can you give a quick, layman's-terms overview of how you do the
Rabin fingerprinting?  I'm not familiar with the technique.  Do you
establish some "goal" (like "break this file up into about n roughly-equal
pieces") and somehow it comes up with an optimal separator?

The idea is to use some feature of the content to determine block boundaries, so that if you insert or remove a bit of data before that feature, the boundary still remains in the same place relative to other content.

To do so, imagine running a sliding-window hash function along the data:

thequickbrownfoxjumpedoverthelazydog12345679thisisbabble
\_______/ ---> move hash this way

shifting the window that you hash right by one byte each time. A typical window size might be, say, 40 bytes of data. Whenever the low-order K bits of the hash are zero, you say "that's a block boundary!" So if you want blocks to average about 16KB, then you'd need the 14 lowest bits of the hash to be zero (2^14 == 16384).

So in the example above, you'd hash:

thequickb
 hequickbr
  equickbro
  ...

and so on. Let's say that the block boundary was at "lazydog12". Now, if we inserted the word "hi" at the start of the string, there would *still* be a block boundary at "lazydog12", so all chunks following that one would remain the same. That's why it's nice for handling insertions and deletions.

This can break down on, say, long strings of zeros or other identical content, so you also have to set a max and min length for the block size.
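As a sketch, here's what such a chunker might look like in Python. The parameters are illustrative assumptions on my part (SHA-1 standing in for the rolling hash, and made-up min/max bounds) -- not what LBFS actually uses -- but the boundary rule and the min/max clamping are the ones described above:

```python
import hashlib

WINDOW = 40             # sliding-window size, as in the example above
MASK = (1 << 14) - 1    # low 14 bits zero -> ~16KB average chunk
MIN_CHUNK = 2 * 1024    # clamp pathological inputs (e.g. long runs of zeros)
MAX_CHUNK = 64 * 1024

def chunk_boundaries(data):
    """Return end offsets of content-defined chunks.

    Re-hashes each 40-byte window from scratch for clarity; a real
    implementation would use a rolling hash such as Rabin's.
    """
    boundaries = []
    start = 0
    for i in range(WINDOW, len(data) + 1):
        length = i - start
        if length < MIN_CHUNK:
            continue                      # enforce the minimum chunk size
        window = data[i - WINDOW:i]
        h = int.from_bytes(hashlib.sha1(window).digest()[:4], "big")
        if (h & MASK) == 0 or length >= MAX_CHUNK:
            boundaries.append(i)          # content-defined (or forced) boundary
            start = i
    if start < len(data):
        boundaries.append(len(data))      # whatever is left is the last chunk
    return boundaries
```

Because each boundary depends only on the 40 bytes inside the window, prepending "hi" to the input should shift every content-defined boundary by two bytes while leaving the chunk *contents* after the first such boundary unchanged -- which is exactly the insertion-resistance property.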

The reason it's called Rabin fingerprinting is that Rabin developed a hash function that you can compute very efficiently in a sliding-window manner (you shift the new byte in and the old byte out).
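To illustrate the shift-in/shift-out update, here's a simplified Rabin-Karp-style polynomial hash (my own stand-in -- Rabin's actual fingerprint works with polynomial division over GF(2) by an irreducible polynomial, but the O(1) update has the same shape; the base and modulus below are arbitrary):

```python
B = 257                      # base (arbitrary choice for illustration)
M = (1 << 31) - 1            # modulus (arbitrary prime)
W = 40                       # window size, matching the example above
POW_OUT = pow(B, W - 1, M)   # precomputed B^(W-1) mod M

def full_hash(window):
    """Hash a window from scratch: sum of b_i * B^(W-1-i), mod M."""
    h = 0
    for byte in window:
        h = (h * B + byte) % M
    return h

def roll(h, outgoing, incoming):
    """O(1) update: drop the oldest byte, shift, append the newest byte."""
    h = (h - outgoing * POW_OUT) % M   # remove the byte leaving the window
    return (h * B + incoming) % M      # shift everything and add the new byte
```

So you compute full_hash(data[:W]) once, then advance one byte at a time with roll(h, data[i], data[i + W]) instead of rehashing all 40 bytes at every position.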

I *think* that this use first showed up in the LBFS (Low Bandwidth File System) paper:
http://pdos.csail.mit.edu/papers/lbfs:sosp01/

There are many similarities to rsync and to earlier similarity-detection algorithms, but the one from LBFS is really nice when the sender wants to decide how to chunk the file in advance, as in a p2p system.


Also, on the topic of similarity, have you determined why 15% of movie
translations are identical? At best I'd imagine the portions that have no dialog could be the same (and maybe that accounts for 15% right there), but
due to audio/video interleaving the rest would seem to have very low
similarity.

The amount of similarity depends on the chunk size.

 [Video .....................] [Audio] [Video...............] [Audio]

If your chunk size is small enough that a chunk contains only video data, then that chunk is likely to be found in both files. If the chunk spans both video and audio data, it won't.

We found that with an average (important word) 16KB chunk size, only *some* of our chunks contained only video data and hence could be found in another file. However, when we took the chunk size down to 1KB, we found 80% similarity, which is closer to what you'd expect given that an AVI contains more bytes of video than it does of audio.

We are looking at building an AVI-specific chunker that understands the audio/video frame boundaries, but in typical academic fashion, I'm pushing my colleagues to see if we can detect "good" boundaries automatically instead of needing to build in file-type specific mechanisms. But we'll see - we're just starting on that.

(Also, note that many of the video files we found were fairly unique, so we didn't help them much. Cross-file similarity probably depends on the different language encodings having been produced from the exact same source material with the exact same encoder settings, etc.)

  -Dave



_______________________________________________
p2p-hackers mailing list
[EMAIL PROTECTED]
http://lists.zooko.com/mailman/listinfo/p2p-hackers
