Re: best practices for finding duplicate chunks

2005-08-14 Thread Gerald Taylor

Thanks for your answer.  It would certainly work, provided I had
enough disk space to do it.  I had thought of something like that,
but was hoping I could leverage fulltext and just record the fulltext
relevance score between each record and every other record.  Then I
could group all records that correlate highly and do a much
smaller-scale version of the brute-force indexing you are proposing,
i.e. only run it on a group of records that we already know have a
high correlation, and therefore a high probability of sharing a chunk
in common.  When that group is done I can throw away the temporary
data and move on to the next group.  What do you think?  Processing
cycles I have; spare disk space I don't.
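
Something along these lines is what I had in mind; it's only a sketch,
and the docs(id, body) table, the FULLTEXT index, the score threshold,
and the MySQLdb driver are all just assumptions for illustration, not
my real schema:

# Sketch: score every record against every other record with MySQL
# fulltext and keep the high-scoring pairs for later grouping.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

cur.execute("SELECT id, body FROM docs")
docs = cur.fetchall()

pairs = []
for doc_id, body in docs:
    # Use the first few hundred characters as the query string; the whole
    # body would be an unwieldy argument to AGAINST().
    query = body[:500]
    cur.execute(
        "SELECT id, MATCH(body) AGAINST (%s) AS score "
        "FROM docs WHERE id > %s HAVING score > 5 ORDER BY score DESC",
        (query, doc_id))
    for other_id, score in cur.fetchall():
        pairs.append((doc_id, other_id, score))   # threshold of 5 is arbitrary

# 'pairs' now holds the candidate groups of correlated records; only
# these would get the brute-force chunk comparison.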

Alexey Polyakov wrote:

There's no easy way to do it, I think. But if spending a few hours
(days?) programming is OK with you, I'd suggest something like this:
1) create a table (let's call it hashes) with three columns: hash,
doc_id, pos_id (doc_id is an identifier for records from the table
with the big text chunks)
2) retrieve a record from the big table. Calculate a hash of the
first 20 words of the text concatenated together, and insert it into
the hashes table with 1 as pos_id. Then calculate the hash of the 20
words starting at the 2nd word, and insert it with 2 as pos_id.
Repeat, sliding the window one word at a time, until you reach the
end of the text.
3) Repeat 2) for all records of the big table (see the Python sketch below)
4) Now you have all data needed for identifying those duplicate chunks.
select hash, count(distinct doc_id) as c from hashes group by hash having c > 1;
will return all hashes for 20-word chunks that are found in 2 or more documents.
select doc_id from hashes where hash=some_value; 
will return documents that contain this chunk.
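
Here's a rough Python sketch of steps 1-3; the big_table/doc_id/body
names and the MySQLdb driver are only assumptions, so adapt it to your
real schema:

# Slide a 20-word window over each document and store one hash per
# window position in the hashes table.
import md5                      # Python 2.x; use hashlib.md5 on newer versions
import MySQLdb

WINDOW = 20

conn = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS hashes (
                   hash   CHAR(32) NOT NULL,
                   doc_id INT      NOT NULL,
                   pos_id INT      NOT NULL,
                   KEY (hash), KEY (doc_id))""")

cur.execute("SELECT doc_id, body FROM big_table")
for doc_id, body in cur.fetchall():
    words = body.split()
    # one hash per position: words[0:20], words[1:21], ...
    for pos in range(len(words) - WINDOW + 1):
        chunk = " ".join(words[pos:pos + WINDOW])
        cur.execute("INSERT INTO hashes (hash, doc_id, pos_id) VALUES (%s, %s, %s)",
                    (md5.new(chunk).hexdigest(), doc_id, pos + 1))  # pos_id is 1-based
conn.commit()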

select h1.pos_id, h2.pos_id from hashes h1, hashes h2 where
h1.doc_id=doc1 and h2.doc_id=doc2 and h1.hash=h2.hash order by
h1.pos_id;
will return word positions for duplicate text in two documents.
For example, the last query returns:
156 587
157 588
...
193 624
It means that you can take words 156-212 from doc1 (the last matching
window starts at word 193 and covers 20 words, so the duplicate run
ends at word 212), insert that chunk into the subchunks table, and
replace words 156-212 in doc1 and words 587-643 in doc2 with a marker.
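
And if you want to automate that span arithmetic, a little helper like
this would do it (just a sketch; it assumes the (h1.pos_id, h2.pos_id)
rows come back sorted by h1.pos_id, as in the query above):

# Turn consecutive matching window positions into whole word ranges.
# Each matching pos_id covers 20 words, so a run of matches from
# p_first to p_last means words p_first .. p_last+19 are duplicated.
WINDOW = 20

def runs(pairs):
    # pairs: list of (pos_in_doc1, pos_in_doc2) tuples, sorted by the first
    out = []
    start1, start2 = pairs[0]
    prev1, prev2 = pairs[0]
    for p1, p2 in pairs[1:]:
        if p1 == prev1 + 1 and p2 == prev2 + 1:
            prev1, prev2 = p1, p2
            continue
        out.append(((start1, prev1 + WINDOW - 1), (start2, prev2 + WINDOW - 1)))
        start1, start2 = prev1, prev2 = p1, p2
    out.append(((start1, prev1 + WINDOW - 1), (start2, prev2 + WINDOW - 1)))
    return out

# the example above: positions 156..193 in doc1 line up with 587..624 in doc2
pairs = [(p, p + 431) for p in range(156, 194)]
print runs(pairs)          # -> [((156, 212), (587, 643))]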


Yeah, it looks ugly, and it will take a lot of space for temporary data.
But in the end you'll have all duplicate chunks of 20 or more words
properly identified.

On 8/14/05, Gerald Taylor [EMAIL PROTECTED] wrote:


I just revived a database that was on a version 3.23 server and moved it
to a 4.1 server.  There are big fields of TEXT-based data.  There is a
scheme for compressing the amount of TEXT data by identifying common
subchunks, putting them in a subchunk table, and replacing them in the
main text with a marker that pulls in the subchunk whenever the parent
chunk is requested.  This subchunking seems to have been done rather
ad hoc, because I've noticed the database still has quite a few
duplicated chunks from one record to another.  The client does not want
to buy another drive to store data (even though he really should for
other reasons anyway, but who cares what I think), so he wants it
compressed, and oh well, I look on it as an opportunity for some
housecleaning.  Now that we have 4.1, what is the best practice for
automatically finding common subchunks, factoring them out, and then
replacing the original parent text with a version that has each chunk
cut out and a marker inserted?  The hard part is finding them,
obviously.  The rest is easy.
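
To give an idea, the markers behave roughly like this (the [[CHUNK:n]]
syntax and the subchunks table name here are made up for illustration,
not the real conventions in the database):

# A parent chunk stores a marker such as [[CHUNK:42]] where shared text
# used to be; on retrieval the marker is replaced by the row from the
# subchunks table.
import re
import MySQLdb

MARKER = re.compile(r"\[\[CHUNK:(\d+)\]\]")

def expand(cur, text):
    # replace every [[CHUNK:n]] marker with the stored subchunk text
    def pull(match):
        cur.execute("SELECT body FROM subchunks WHERE id = %s", (match.group(1),))
        row = cur.fetchone()
        if row:
            return row[0]
        return match.group(0)
    return MARKER.sub(pull, text)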


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Re: best practices for finding duplicate chunks

2005-08-14 Thread Alexey Polyakov
You can modify the algorithm I proposed to find groups of records that
are likely to have duplicate chunks. Simply record only a fraction of
the hashes, something like: store a hash only if
md5(concat(word1,word2,...,word20)) % 32 = 0.
Disk usage for this table will be maybe 60 bytes per record; if your
average word is 8 bytes (counting whitespace), the disk space you'll
need is about 25% of the data size.
After groups of records are found, you can do the brute-force indexing
to find the duplicate chunks.
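
The only change from the earlier sketch is to skip hashes that fall
outside the sample, roughly like this (again just a sketch, reusing the
assumed names):

# Sampled variant: keep roughly 1 hash in 32 by storing a window's hash
# only when its value is 0 modulo 32.
import md5

WINDOW = 20
SAMPLE = 32

def sampled_hashes(words):
    # yields (pos_id, hexdigest) for the ~1/32 of window positions kept
    for pos in range(len(words) - WINDOW + 1):
        digest = md5.new(" ".join(words[pos:pos + WINDOW])).hexdigest()
        if int(digest, 16) % SAMPLE == 0:        # md5(...) % 32 = 0
            yield pos + 1, digest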

On 8/15/05, Gerald Taylor [EMAIL PROTECTED] wrote:
 Thanks for your answer.  It would certainly work, provided I had
 enough disk space to do it.  I had thought of something like that,
 but was hoping I could leverage fulltext and just record the fulltext
 relevance score between each record and every other record.  Then I
 could group all records that correlate highly and do a much
 smaller-scale version of the brute-force indexing you are proposing,
 i.e. only run it on a group of records that we already know have a
 high correlation, and therefore a high probability of sharing a chunk
 in common.  When that group is done I can throw away the temporary
 data and move on to the next group.  What do you think?  Processing
 cycles I have; spare disk space I don't.

-- 
Alexey Polyakov

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]