You can modify the algorithm I proposed to find groups of records that are likely to have duplicate chunks. Simply record only a fraction of the hashes, something like: keep a hash only if md5(concat(word1,word2,...,word20)) % 32 = 0. Each stored hash row will take maybe 60 bytes, and since roughly one 20-word window in 32 qualifies you store about one row per 32 words of text; if your average word is 8 bytes (counting whitespace), the disk space you'll need is about 25% of the data size. After groups of records are found, you can do brute-force indexing within each group to find the duplicate chunks.
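In case it helps, here's a rough Python sketch of what I mean. The function names, the space-joined concat, and holding results in memory are just for illustration; in practice you'd write the kept hashes into a MySQL table keyed by record id and group with a query:

    import hashlib
    import re
    from collections import defaultdict

    SHINGLE = 20   # words per chunk, as in the md5(concat(...)) example
    MODULUS = 32   # keep roughly 1 hash out of every 32

    def sampled_fingerprints(text):
        """Hash every 20-word window, keep only ~1/32 of the hashes."""
        words = re.findall(r"\S+", text)
        kept = set()
        for i in range(len(words) - SHINGLE + 1):
            chunk = " ".join(words[i:i + SHINGLE])
            h = int(hashlib.md5(chunk.encode("utf-8")).hexdigest(), 16)
            if h % MODULUS == 0:
                kept.add(h)
        return kept

    def candidate_groups(records):
        """records: iterable of (record_id, text), e.g. fetched from MySQL.

        Returns {fingerprint: [record_id, ...]} restricted to fingerprints
        shared by two or more records -- these are the groups worth
        brute-force indexing afterwards.
        """
        by_hash = defaultdict(list)
        for rec_id, text in records:
            for h in sampled_fingerprints(text):
                by_hash[h].append(rec_id)
        return {h: ids for h, ids in by_hash.items() if len(ids) > 1}

Records that never share a sampled hash are very unlikely to share a 20-word chunk, so the expensive full indexing only runs on the small groups this returns.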
On 8/15/05, Gerald Taylor <[EMAIL PROTECTED]> wrote:
> Thanks for your answer. It would certainly work provided I have
> enough disk space to do that. I thought of something like
> that but was hoping I could leverage fulltext and just
> record the fulltext result between each record
> and each other record. Then I could group all records that
> highly correlate and maybe do a much smaller-scale version of
> the brute-force indexing thing that you are proposing, i.e. only
> do it on a group of records that we already know have a high
> correlation, i.e. a high probability of sharing a chunk in common.
> Then when done I can throw away that data
> and do another group. What do you think? Processing cycles I have
> but easy disk space I don't.

--
Alexey Polyakov