You can modify the algorithm I proposed to find groups of records that
are likely to have duplicate chunks. Simply record only a fraction of
the hashes, for example those where md5(concat(word1,word2,...,word20)) % 32 = 0.
Disk usage for this table will be maybe 60 bytes per record, assuming
your average word is 8 bytes.
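A minimal sketch of that sampling idea in Python (the 20-word window and the mod-32 filter come from the answer; everything else, including the function name, is just illustrative):

```python
import hashlib

WINDOW = 20      # words per chunk, as in the answer
SAMPLE_MOD = 32  # keep roughly 1/32 of the chunk hashes

def sampled_chunk_hashes(text):
    """Yield MD5 hashes of 20-word windows, keeping only ~1/32 of them."""
    words = text.split()
    for i in range(len(words) - WINDOW + 1):
        chunk = "".join(words[i:i + WINDOW])            # concat(word1, ..., word20)
        digest = hashlib.md5(chunk.encode("utf-8")).hexdigest()
        if int(digest, 16) % SAMPLE_MOD == 0:           # md5(...) % 32 = 0
            yield digest
```

You would store (record_id, hash) rows in the table, then group by hash: records sharing a sampled hash are candidates for having duplicate chunks.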
Thanks for your answer. It would certainly work, provided there is
enough disk space to do that. I had thought of something like
that, but was hoping I could leverage fulltext and just
record the fulltext match score between each record
and each other record. Then I could group all records that
correlate highly.
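For the fulltext idea, the grouping step might look like the following sketch, assuming some pairwise score function is available (here a hypothetical `score` callable, e.g. one MATCH ... AGAINST query per pair):

```python
from itertools import combinations

def group_correlated(records, score, threshold):
    """Union-find grouping of records whose pairwise score exceeds threshold.

    records: dict mapping record id -> text
    score: hypothetical function (text_a, text_b) -> similarity value
    """
    parent = {rid: rid for rid in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(records, 2):
        if score(records[a], records[b]) >= threshold:
            parent[find(a)] = find(b)      # merge the two groups

    groups = {}
    for rid in records:
        groups.setdefault(find(rid), []).append(rid)
    return list(groups.values())
```

Note that comparing every record against every other record is O(n²) in the number of records, which is the main cost the hash-sampling table avoids.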