best practices for finding duplicate chunks

2005-08-14 Thread Gerald Taylor
I just revived a database that was on a version 3.23 server and moved it to a 4.1. There are big fields of TEXT-based data. They have a way of compressing the amount of TEXT data by identifying common subchunks, putting them in a subchunk table, and replacing them with a marker inside the ...
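
A minimal sketch of how such a subchunk scheme could work, assuming a hypothetical chunk dictionary and a {{chunk:N}} marker format (both illustrative, not taken from the actual application):

    import re

    # Hypothetical stand-ins for a subchunk table and the markers placed in
    # the TEXT columns; the real schema and marker syntax are assumptions.
    chunks = {}  # chunk_id -> common subchunk text
    MARKER_RE = re.compile(r"\{\{chunk:(\d+)\}\}")

    def register_chunk(text):
        """Store a common subchunk once and return its id."""
        for cid, existing in chunks.items():
            if existing == text:
                return cid
        cid = len(chunks) + 1
        chunks[cid] = text
        return cid

    def compress(document, common_subchunks):
        """Replace each occurrence of a common subchunk with a marker."""
        for sub in common_subchunks:
            document = document.replace(sub, "{{chunk:%d}}" % register_chunk(sub))
        return document

    def expand(document):
        """Rebuild the original TEXT by substituting the markers back."""
        return MARKER_RE.sub(lambda m: chunks[int(m.group(1))], document)

    # Two records sharing one large boilerplate passage end up storing the
    # passage only once, in the chunk dictionary.
    boilerplate = "standard clause repeated across many records " * 5
    a = compress("record one: " + boilerplate + "tail A", [boilerplate])
    b = compress("record two: " + boilerplate + "tail B", [boilerplate])
    assert expand(a).endswith("tail A") and len(chunks) == 1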

Re: best practices for finding duplicate chunks

2005-08-14 Thread Gerald Taylor
Thanks for your answer. It would certainly work, provided there is enough disk space to do that. I had thought of something like that, but was hoping I could leverage fulltext and just record the fulltext result between each record and each other record. Then I could group all records that highly correlate ...
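
A rough sketch of that pairwise fulltext idea, assuming a hypothetical table docs(id INT PRIMARY KEY, body TEXT, FULLTEXT(body)) and a standard PEP 249 database connection such as MySQLdb; the table and column names are made up for illustration:

    def correlated_pairs(conn, min_score=5.0):
        """Score each record's body against every other record with
        MATCH ... AGAINST and keep the pairs above a relevance threshold.
        This compares every record against every other, so it is only
        workable for modest table sizes."""
        cur = conn.cursor()
        cur.execute("SELECT id, body FROM docs")
        pairs = []
        for doc_id, body in cur.fetchall():
            # Use this record's own text as the fulltext query against the rest.
            cur.execute(
                "SELECT id, MATCH(body) AGAINST (%s) AS score "
                "FROM docs WHERE id > %s HAVING score > %s",
                (body, doc_id, min_score))
            for other_id, score in cur.fetchall():
                pairs.append((doc_id, other_id, score))
        return pairs

Records that appear together in many high-scoring pairs could then be grouped and inspected for shared chunks.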

Re: best practices for finding duplicate chunks

2005-08-14 Thread Alexey Polyakov
You can modify the algorithm I proposed to find groups of records that are likely to have duplicate chunks. Simply record only a fraction of the hashes, something like: only those where md5(concat(word1,word2,...,word20))%32=0. Disk usage for this table will be maybe 60 bytes per record if your average word is 8 bytes ...
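
A small sketch of that sampling idea in Python. The 20-word window and the %32 filter come from the message above; the exact word concatenation and how the md5 value is reduced modulo 32 are assumptions:

    import hashlib
    from collections import defaultdict

    WINDOW = 20    # words per hashed chunk, as proposed
    KEEP_MOD = 32  # keep roughly one hash in 32

    def sampled_hashes(text):
        """Hash every run of 20 consecutive words and keep only the hashes
        whose value is 0 modulo 32, so each record stores very few hashes."""
        words = text.split()
        kept = set()
        for i in range(len(words) - WINDOW + 1):
            digest = hashlib.md5(" ".join(words[i:i + WINDOW]).encode()).hexdigest()
            if int(digest, 16) % KEEP_MOD == 0:
                kept.add(digest)
        return kept

    def candidate_groups(records):
        """records is a list of (record_id, text); returns groups of ids that
        share at least one sampled hash, i.e. likely duplicate-chunk groups."""
        by_hash = defaultdict(set)
        for rec_id, text in records:
            for h in sampled_hashes(text):
                by_hash[h].add(rec_id)
        return [ids for ids in by_hash.values() if len(ids) > 1]

Any group returned this way can then be compared in full to extract the actual shared chunks; storing only about one hash in 32 is what keeps the per-record disk footprint small.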