You can modify the algorithm I proposed to find groups of records that
are likely to have duplicate chunks. Simply record only a fraction of
the hashes, for example those where md5(concat(word1,word2,...,word20)) % 32 = 0.
Disk usage for this table will be maybe 60 bytes per record, assuming
your average word is 8 bytes.
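A minimal sketch of that sampling idea in Python (the 20-word window and the mod-32 filter come from the answer; everything else, including the function name, is just illustrative):

```python
import hashlib

WINDOW = 20      # words per chunk, as in the answer
SAMPLE_MOD = 32  # keep roughly 1/32 of the chunk hashes

def sampled_chunk_hashes(text):
    """Yield MD5 hashes of 20-word windows, keeping only ~1/32 of them."""
    words = text.split()
    for i in range(len(words) - WINDOW + 1):
        chunk = "".join(words[i:i + WINDOW])            # concat(word1, ..., word20)
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        if int(digest, 16) % SAMPLE_MOD == 0:           # md5(...) % 32 = 0
            yield digest
```

You would store (record_id, hash) rows in the table, then group by hash: records sharing a sampled hash are candidates for having duplicate chunks.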
Thanks for your answer. It would certainly work, provided there is
enough disk space to do that. I had thought of something like
that, but was hoping I could leverage fulltext and just
record the fulltext match score between each record
and each other record. Then I could group all records that
correlate highly.
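For the fulltext idea, the grouping step might look like the following sketch, assuming some pairwise score function is available (here a hypothetical `score` callable, e.g. one MATCH ... AGAINST query per pair):

```python
from itertools import combinations

def group_correlated(records, score, threshold):
    """Union-find grouping of records whose pairwise score exceeds threshold.

    records: dict mapping record id -> text
    score: hypothetical function (text_a, text_b) -> similarity value
    """
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if score(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)      # merge the two groups

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())
```

Note that comparing every record against every other record is O(n²) in the number of records, which is the main cost the hash-sampling table avoids.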