Re: Comparing text documents

Gordan Bobic Sat, 01 Dec 2001 08:43:17 -0800

On Saturday 01 Dec 2001 10:41, Ovanes Manucharyan wrote:
> I have lots of text documents archived in a mysql database. I need
> to compare these documents to see which ones are similar (either they
> may have a small formatting difference or have a word or two different).
> Is there a function/method in mysql that will easily accomplish this?
> If not could you suggest web sites/literature on this topic. (Lexical
> analyis?)


IMHO, this is not really something that should be handled by the database 
itself, as it is too complicated. It would, however, probably be most 
efficiently dealt with at the database end.

However, I have been running into the same issue myself, and here is 
something  have come up with.

Strip down the documents you want to compare at insert time. Make them all 
lower case, compress white space, erase all non-alpha characters 
(punctuation, etc). Split the document into two-word groups, with a sliding 
2-word window. That should give you n-1 elements, where n = number of words 
in the document.

In Perl you could handle this efficiently using a hash. Increment the value 
of a hash for each element by one, for each element in the array you've split 
the document into.

For example.

Document = "This is a test message."
StrippedDocument = "this is a test message"
Array = ("this is", "is a", "a test", "test message")
Hash = ("this is" => 1, "is a" => 1, "a test" => 1, "test message" => 1)

Save the contents of this has to a separate field for quick and easy 
retrieval (not sure what the best way to do this is yet, still working on it).

Then, at each insert, compare the new document with all the others. If you 
can constrain the cross-check test to begin with (e.g. by author, source, 
etc), that will speed things up massively.

Then just compare the hashes in some way, for example, by checking whether 
the number of two-word construct counts exceeds some similarity threshold 
that you can define yourself.

You could then store this similarity cross-list in another table.

It may be an idea to implement it as a stored procedure, but AFAIK, MySQL 
can't do that (please correct me if I am wrong). If you combine this with 
triggers, you could intercept INSERT and UPDATE queries on the relevant 
tablea, to make sure that things are always kept in sync.

I am not sure how this could be handled in MySQL, but in PostgreSQL you could 
use PL/Perl stored procedures and set up triggers. That is how the PostgreSQL 
Full Text Index contrib used to work. It is likely, though, that this will be 
very slow, especially on a big table...

You could probably handle some of this at the application level, too, but in 
any case, the INSERT time would go through the roof because you would have to 
compare every new record against every existing record (again, constraints in 
this comparison would be a very beneficial thing).

This is not a prticularly scientiffic approach - it is a 
not-too-well-thought-out pragmatic one, but I only ran into the same problem 
myself yesterday...

Regards.

Gordan

---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Re: Comparing text documents

Reply via email to