That's a pretty cool method. Is it actually a Rabin fingerprint? I noticed the number of bottom bits (M = 13) is the same as what they give in the Wikipedia article <https://en.wikipedia.org/wiki/Rabin_fingerprint>.
On Friday, July 31, 2020 at 4:33:43 AM UTC-4 ian wrote: > Quoting o lu (2020-07-31 01:37:47) > > > How does Perkeep know where to cut media files when deduping? > > The key idea is: after each byte, compute a (non-cryptographic) hash of > the most recent N bytes seen so far. If the bottom M bits of the hash > are zero, split there, otherwise keep going. IIRC perkeep chooses N = 64 > and M = 13. > > For an intuition of why this works well, imagine the case where some > characters are inserted in the middle of a large file, which was > previously stored in several chunks. All of the chunks before the > location of the insertion will be the same. The insertion itself will > result in not re-using the chunk that contained that section of the > file, but because the split point is chosen based on the content at that > point, the new chunk will probably end at the same place the old one > did, and then for subsequent chunks we'll have "synchronized" with the > old copy, allowing us to re-use the old chunks after the modification as > well. > > The particular hash function used is designed to make incrementally > hashing a sliding window like this efficient, but for the purposes of > choosing a "good" split point you could use any hash function, it would > just be slower. The code for the hash function is in the package > go4.org/rollsum > > The full implementation takes into account some other details -- things > like enforcing a maximum & minimum chunk size -- but the above is where > the magic comes from. The actual code that does the splitting is in > writeFileChunks in pkg/schema/filewriter.go, if you're interested in all > the gory details. > > I have kicking around on my hard drive a WIP library that just provides > the splitting algorithm, without being entangled with the rest of the > file upload logic. I should finish that up and publish it... > > Hope this is useful, > > -Ian > -- You received this message because you are subscribed to the Google Groups "Perkeep" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/perkeep/71ebcc93-bbed-4e16-bfc1-1ceb2d7678f3n%40googlegroups.com.
