You can modify the algorithm I proposed to find groups of records that
are likely to have duplicate chunks. Simply record only a fraction of the
hashes, something like: keep a hash only if
md5(concat(word1,word2,...,word20)) % 32 = 0.
Disk usage for this table will be maybe 60 bytes per record; if your
average word is 8 bytes (counting whitespace), the disk space you'll
need is about 25% of the data size.
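
A minimal sketch of that sampling step, in Python rather than SQL (the
function name, the space-joined chunks, and the constants are just my
illustration of the idea, not anything already in your schema):

import hashlib

CHUNK_WORDS = 20   # window size from the post
SAMPLE_MOD = 32    # keep roughly one hash in 32

def sampled_chunk_hashes(text):
    """Yield md5 digests of 20-word windows, keeping about 1/32 of them."""
    words = text.split()
    for i in range(len(words) - CHUNK_WORDS + 1):
        chunk = " ".join(words[i:i + CHUNK_WORDS])
        digest = hashlib.md5(chunk.encode("utf-8")).digest()
        # Sampling on the hash value itself means both copies of a
        # duplicate chunk are either kept or skipped together.
        if int.from_bytes(digest, "big") % SAMPLE_MOD == 0:
            yield digest.hex()

Storing (record_id, hash) rows only for the sampled windows is what keeps
the table at roughly a quarter of the data size.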
After the groups of records are found, you can do brute-force indexing
within each group to find the duplicate chunks.
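
To make that brute-force step concrete, here is a rough continuation of
the same sketch (again illustrative Python, reusing sampled_chunk_hashes
from above; candidate_groups and duplicate_chunks are hypothetical helper
names, not an existing tool):

from collections import defaultdict

def candidate_groups(records):
    """records: dict of record_id -> text.
    Return groups of ids that share at least one sampled hash."""
    by_hash = defaultdict(set)
    for rec_id, text in records.items():
        for h in sampled_chunk_hashes(text):
            by_hash[h].add(rec_id)
    return [ids for ids in by_hash.values() if len(ids) > 1]

def duplicate_chunks(text_a, text_b, chunk_words=20):
    """Brute-force check of one candidate pair: compare every
    20-word window of one record against every window of the other."""
    def windows(text):
        words = text.split()
        return {" ".join(words[i:i + chunk_words])
                for i in range(len(words) - chunk_words + 1)}
    return windows(text_a) & windows(text_b)

Only the records inside one group need to be compared pairwise, so the
expensive part stays small.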

On 8/15/05, Gerald Taylor <[EMAIL PROTECTED]> wrote:
> Thanks for your answer.  It would certainly work, provided I have
> enough disk space to do that.  I thought of something like
> that, but was hoping I could leverage fulltext and just
> record the fulltext result between each record
> and each other record.  Then I could group all records that
> highly correlate and maybe do a much smaller-scale version of
> the brute-force indexing thing that you are proposing, i.e. only
> do it on a group of records that we already know have a high
> correlation, i.e. a high probability of sharing a chunk in common.
> Then when done I can throw away that data
> and do another group.  What do you think?  Processing cycles I have,
> but easy disk space I don't.

-- 
Alexey Polyakov
