best practices for finding duplicate chunks

2005-08-14 Thread Gerald Taylor
I just revived a database that was on a version 3.23 server and moved it to a 4.1. There are big fields of TEXT-based data. They have a way of compressing the amount of TEXT data by identifying common subchunks, putting them in a subchunk table, and replacing them with a marker inside the ...
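
A minimal sketch of how such a subchunk scheme could work, assuming a hypothetical chunk dictionary and a {{chunk:N}} marker format (both illustrative, not taken from the actual application):

    import re

    # Hypothetical stand-ins for a subchunk table and the markers placed in
    # the TEXT columns; the real schema and marker syntax are assumptions.
    chunks = {}  # chunk_id -> common subchunk text
    MARKER_RE = re.compile(r"\{\{chunk:(\d+)\}\}")

    def register_chunk(text):
        """Store a common subchunk once and return its id."""
        for cid, existing in chunks.items():
            if existing == text:
                return cid
        cid = len(chunks) + 1
        chunks[cid] = text
        return cid

    def compress(document, common_subchunks):
        """Replace each occurrence of a common subchunk with a marker."""
        for sub in common_subchunks:
            document = document.replace(sub, "{{chunk:%d}}" % register_chunk(sub))
        return document

    def expand(document):
        """Rebuild the original TEXT by substituting the markers back."""
        return MARKER_RE.sub(lambda m: chunks[int(m.group(1))], document)

    # Two records sharing one large boilerplate passage end up storing the
    # passage only once, in the chunk dictionary.
    boilerplate = "standard clause repeated across many records " * 5
    a = compress("record one: " + boilerplate + "tail A", [boilerplate])
    b = compress("record two: " + boilerplate + "tail B", [boilerplate])
    assert expand(a).endswith("tail A") and len(chunks) == 1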

Re: best practices for finding duplicate chunks

2005-08-14 Thread Gerald Taylor
Thanks for your answer. It would certainly work, provided there is enough disk space to do that. I had thought of something like that, but was hoping I could leverage fulltext and just record the fulltext result between each record and each other record. Then I could group all records that highly correlate ...
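
A rough sketch of that pairwise fulltext idea, assuming a hypothetical table docs(id INT PRIMARY KEY, body TEXT, FULLTEXT(body)) and a standard PEP 249 database connection such as MySQLdb; the table and column names are made up for illustration:

    def correlated_pairs(conn, min_score=5.0):
        """Score each record's body against every other record with
        MATCH ... AGAINST and keep the pairs above a relevance threshold.
        This compares every record against every other, so it is only
        workable for modest table sizes."""
        cur = conn.cursor()
        cur.execute("SELECT id, body FROM docs")
        pairs = []
        for doc_id, body in cur.fetchall():
            # Use this record's own text as the fulltext query against the rest.
            cur.execute(
                "SELECT id, MATCH(body) AGAINST (%s) AS score "
                "FROM docs WHERE id > %s HAVING score > %s",
                (body, doc_id, min_score))
            for other_id, score in cur.fetchall():
                pairs.append((doc_id, other_id, score))
        return pairs

Records that appear together in many high-scoring pairs could then be grouped and inspected for shared chunks.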

Re: best practices for finding duplicate chunks

2005-08-14 Thread Alexey Polyakov
You can modify the algorithm I proposed to find groups of records that are likely to have duplicate chunks. Simply record only a fraction of the hashes, something like: only those where md5(concat(word1,word2,...,word20))%32=0. Disk usage for this table will be maybe 60 bytes per record if your average word is 8 bytes ...
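
A small sketch of that sampling idea in Python. The 20-word window and the %32 filter come from the message above; the exact word concatenation and how the md5 value is reduced modulo 32 are assumptions:

    import hashlib
    from collections import defaultdict

    WINDOW = 20    # words per hashed chunk, as proposed
    KEEP_MOD = 32  # keep roughly one hash in 32

    def sampled_hashes(text):
        """Hash every run of 20 consecutive words and keep only the hashes
        whose value is 0 modulo 32, so each record stores very few hashes."""
        words = text.split()
        kept = set()
        for i in range(len(words) - WINDOW + 1):
            digest = hashlib.md5(" ".join(words[i:i + WINDOW]).encode()).hexdigest()
            if int(digest, 16) % KEEP_MOD == 0:
                kept.add(digest)
        return kept

    def candidate_groups(records):
        """records is a list of (record_id, text); returns groups of ids that
        share at least one sampled hash, i.e. likely duplicate-chunk groups."""
        by_hash = defaultdict(set)
        for rec_id, text in records:
            for h in sampled_hashes(text):
                by_hash[h].add(rec_id)
        return [ids for ids in by_hash.values() if len(ids) > 1]

Any group returned this way can then be compared in full to extract the actual shared chunks; storing only about one hash in 32 is what keeps the per-record disk footprint small.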