I just revived a database that was on a version 3.23 server and moved it
to a 4.1 server. It has big fields of TEXT-based data. Is there a way of
compressing the amount of TEXT data by identifying common subchunks,
putting them in a subchunk table, and replacing them with a marker
inside the original text?
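The marker scheme described above could be sketched roughly as follows. This is a hypothetical illustration, not an existing MySQL feature: the subchunk table is a plain dict here, and the marker format (`\x01<id>\x02`) is an arbitrary choice for the sketch.

```python
def compress(text, subchunk_table):
    """Replace each known common subchunk with a marker \x01<id>\x02.

    subchunk_table maps an integer id to the subchunk text; in the
    database this would be a separate table keyed by id.
    """
    for cid, sub in subchunk_table.items():
        text = text.replace(sub, f"\x01{cid}\x02")
    return text

def decompress(text, subchunk_table):
    """Expand markers back into their original subchunks."""
    for cid, sub in subchunk_table.items():
        text = text.replace(f"\x01{cid}\x02", sub)
    return text
```

A real implementation would need to guard against marker bytes occurring in the stored text and against subchunks that overlap; this only shows the table-plus-marker idea.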
Thanks for your answer. It would certainly work, provided I have
enough disk space to do it. I had thought of something like
that, but I was hoping I could leverage fulltext and just
record the fulltext result between each record
and every other record. Then I could group all records that
highly correlate.
You can modify the algorithm I proposed to find groups of records that
are likely to share duplicate chunks. Simply record only a fraction of
the hashes, e.g. keep a hash only when
md5(concat(word1,word2,...,word20)) % 32 = 0.
Disk usage for this table will be maybe 60 bytes per record, if your
average word is 8 bytes.
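The sampled-hash idea above could look like this in Python (the original suggestion was SQL against MySQL; the function names and the 20-word shingle / mod-32 parameters here just mirror the example in the text):

```python
import hashlib
from collections import defaultdict

def sampled_hashes(text, shingle=20, keep_mod=32):
    """Hash every run of `shingle` consecutive words, but keep only the
    roughly 1/keep_mod of hashes whose integer value is 0 mod keep_mod."""
    words = text.split()
    kept = set()
    for i in range(len(words) - shingle + 1):
        h = hashlib.md5(" ".join(words[i:i + shingle]).encode()).hexdigest()
        if int(h, 16) % keep_mod == 0:
            kept.add(h)
    return kept

def candidate_groups(records, **kw):
    """records: dict id -> text. Map each sampled hash to the record ids
    containing it; ids sharing a hash likely share a duplicated chunk."""
    by_hash = defaultdict(set)
    for rec_id, text in records.items():
        for h in sampled_hashes(text, **kw):
            by_hash[h].add(rec_id)
    return [ids for ids in by_hash.values() if len(ids) > 1]
```

Because only ~1/32 of hashes are kept, a shared chunk has to be long enough to contain at least one sampled shingle to be detected; the table stays small at the cost of missing short duplicates.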