On Saturday 01 Dec 2001 10:41, Ovanes Manucharyan wrote:
> I have lots of text documents archived in a mysql database. I need
> to compare these documents to see which ones are similar (either they
> may have a small formatting difference or have a word or two different).
> Is there a function/method in mysql that will easily accomplish this?
> If not could you suggest web sites/literature on this topic. (Lexical
> analysis?)
IMHO, this is not really something the database itself should handle, as it is
too complicated for built-in SQL functions. The work would, however, probably
be most efficient if done as close to the database end as possible.
However, I have been running into the same issue myself, and here is
something I have come up with.
Strip down the documents you want to compare at insert time. Make them all
lower case, compress white space, erase all non-alpha characters
(punctuation, etc). Split the document into two-word groups, with a sliding
2-word window. That should give you n-1 elements, where n = number of words
in the document.
In Perl you could handle this efficiently with a hash: for each element in the
array you have split the document into, use that two-word group as the hash
key and increment its value by one (a rough sketch follows the example below).
For example:
Document = "This is a test message."
StrippedDocument = "this is a test message"
Array = ("this is", "is a", "a test", "test message")
Hash = ("this is" => 1, "is a" => 1, "a test" => 1, "test message" => 1)
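For what it is worth, here is a rough Perl sketch of that step (the function
name and the exact regexes are just my illustration, not a finished
implementation):

#!/usr/bin/perl -w
use strict;

# Normalise a document (lower case, alpha characters only, compressed white
# space) and count its sliding two-word groups in a hash.
sub bigram_counts {
    my ($text) = @_;
    $text = lc($text);
    $text =~ s/[^a-z\s]//g;     # erase punctuation and other non-alpha characters
    $text =~ s/\s+/ /g;         # compress white space
    $text =~ s/^ //;
    $text =~ s/ $//;
    my @words = split / /, $text;
    my %counts;
    for my $i (0 .. $#words - 1) {
        $counts{"$words[$i] $words[$i+1]"}++;
    }
    return %counts;             # n-1 keys for n words, fewer if a group repeats
}

my %hash = bigram_counts("This is a test message.");
# %hash now holds ("this is" => 1, "is a" => 1, "a test" => 1, "test message" => 1)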
Save the contents of this hash to a separate field for quick and easy
retrieval (I am not sure what the best way to do this is yet, still working on
it; one possible serialisation is sketched below).
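One purely hypothetical option would be to flatten the hash into a single TEXT
column, one "group<TAB>count" pair per line, and rebuild it on the way back
out:

# Illustration only - I have not settled on a format yet. Flatten the hash
# into a string that fits in a TEXT column, and rebuild it on retrieval.
sub freeze_counts {
    my (%counts) = @_;
    return join("\n", map { "$_\t$counts{$_}" } sort keys %counts);
}

sub thaw_counts {
    my ($frozen) = @_;
    return map { split(/\t/, $_, 2) } split(/\n/, $frozen);
}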
Then, at each insert, compare the new document with all the others. If you
can constrain the cross-check to a sensible subset to begin with (e.g. by
author, source, etc.), that will speed things up massively.
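For instance, assuming DBI/DBD::mysql and made-up table and column names, you
might only pull back the stored hashes for documents by the same author:

use strict;
use DBI;

# Fetch only the candidate documents that share an author with the new one,
# rather than scanning the whole table. All names here are illustrative.
my $new_author = "some author";   # placeholder for whatever constraint you use
my $dbh = DBI->connect("DBI:mysql:database=archive", "user", "password",
                       { RaiseError => 1 });
my $candidates = $dbh->selectall_arrayref(
    "SELECT id, bigram_counts FROM documents WHERE author = ?",
    undef, $new_author);
for my $row (@$candidates) {
    my ($id, $frozen) = @$row;
    # thaw_counts($frozen) and compare against the new document's hash here
}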
Then just compare the hashes in some way - for example, by counting how many
two-word group occurrences the two documents share and checking whether that
exceeds a similarity threshold that you can define yourself.
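As one example of "some way" (just a guess at a sensible metric, not a
recommendation), you could sum the counts the two documents share and divide
by the size of the smaller one:

# Illustrative similarity test: sum the two-word group counts the documents
# share (taking the smaller count where a group appears in both), divide by
# the total count of the smaller document, and compare against a threshold.
sub similar_enough {
    my ($new, $old, $threshold) = @_;   # $new and $old are hash references
    my ($shared, $total_new, $total_old) = (0, 0, 0);
    $total_new += $_ for values %$new;
    $total_old += $_ for values %$old;
    for my $key (keys %$new) {
        next unless exists $old->{$key};
        $shared += $new->{$key} < $old->{$key} ? $new->{$key} : $old->{$key};
    }
    my $smaller = $total_new < $total_old ? $total_new : $total_old;
    return 0 unless $smaller;
    return ($shared / $smaller) >= $threshold;
}

# e.g. similar_enough(\%new_counts, \%old_counts, 0.8)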
You could then store this similarity cross-list in another table.
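Something along these lines would do it, again with made-up table and column
names:

use strict;
use DBI;

# Record that two documents were found to be similar, in a hypothetical
# cross-list table: similar_docs(doc_id, similar_doc_id, score).
my $dbh = DBI->connect("DBI:mysql:database=archive", "user", "password",
                       { RaiseError => 1 });
my ($new_id, $old_id, $score) = (42, 17, 0.93);   # placeholder values
$dbh->do("INSERT INTO similar_docs (doc_id, similar_doc_id, score)
          VALUES (?, ?, ?)", undef, $new_id, $old_id, $score);
$dbh->disconnect;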
It may be an idea to implement this as a stored procedure, but AFAIK MySQL
can't do that (please correct me if I am wrong). If you could combine it with
triggers, you could intercept INSERT and UPDATE queries on the relevant
tables, to make sure that things are always kept in sync.
I am not sure how this could be handled in MySQL, but in PostgreSQL you could
use PL/Perl stored procedures and set up triggers. That is how the PostgreSQL
Full Text Index contrib used to work. It is likely, though, that this will be
very slow, especially on a big table...
You could probably handle some of this at the application level, too, but
either way the INSERT time would go through the roof, because you would have
to compare every new record against every existing record (again, constraining
that comparison would be very beneficial).
This is not a particularly scientific approach - it is a
not-too-well-thought-out pragmatic one, but I only ran into the same problem
myself yesterday...
Regards.
Gordan