That's a pretty cool method. Is it actually a Rabin fingerprint? I noticed 
the number of bottom bits (M = 13) is the same as what they give in the 
Wikipedia 
article <https://en.wikipedia.org/wiki/Rabin_fingerprint>.

On Friday, July 31, 2020 at 4:33:43 AM UTC-4 ian wrote:

> Quoting o lu (2020-07-31 01:37:47)
>
> > How does Perkeep know where to cut media files when deduping?
>
> The key idea is: after each byte, compute a (non-cryptographic) hash of
> the most recent N bytes seen so far. If the bottom M bits of the hash
> are zero, split there, otherwise keep going. IIRC perkeep chooses N = 64
> and M = 13.
>
> For an intuition of why this works well, imagine the case where some
> characters are inserted in the middle of a large file, which was
> previously stored in several chunks. All of the chunks before the
> location of the insertion will be the same. The insertion itself will
> result in not re-using the chunk that contained that section of the
> file, but because the split point is chosen based on the content at that
> point, the new chunk will probably end at the same place the old one
> did, and then for subsequent chunks we'll have "synchronized" with the
> old copy, allowing us to re-use the old chunks after the modification as
> well.
>
> The particular hash function used is designed to make incrementally
> hashing a sliding window like this efficient, but for the purposes of
> choosing a "good" split point you could use any hash function, it would
> just be slower. The code for the hash function is in the package
> go4.org/rollsum
>
> The full implementation takes into account some other details -- things
> like enforcing a maximum & minimum chunk size -- but the above is where
> the magic comes from. The actual code that does the splitting is in
> writeFileChunks in pkg/schema/filewriter.go, if you're interested in all
> the gory details.
>
> I have kicking around on my hard drive a WIP library that just provides
> the splitting algorithm, without being entangled with the rest of the
> file upload logic. I should finish that up and publish it...
>
> Hope this is useful,
>
> -Ian
>

-- 
You received this message because you are subscribed to the Google Groups 
"Perkeep" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/perkeep/71ebcc93-bbed-4e16-bfc1-1ceb2d7678f3n%40googlegroups.com.

Reply via email to