On Saturday 01 Dec 2001 10:41, Ovanes Manucharyan wrote:
> I have lots of text documents archived in a mysql database. I need
> to compare these documents to see which ones are similar (either they
> may have a small formatting difference or have a word or two different).
> Is there a function/method in mysql that will easily accomplish this?
> If not could you suggest web sites/literature on this topic. (Lexical
> analysis?)
IMHO, this is not really something that should be handled by the database itself, as it is too complicated, although it would probably be dealt with most efficiently at the database end. I have been running into the same issue myself, and here is something I have come up with.

Strip down the documents you want to compare at insert time: make them all lower case, compress white space, and erase all non-alpha characters (punctuation, etc.). Then split the document into two-word groups with a sliding 2-word window. That should give you n-1 elements, where n = number of words in the document. In Perl you could handle this efficiently using a hash: for each element in the array you have split the document into, increment that element's value in the hash by one.

For example:

Document         = "This is a test message."
StrippedDocument = "this is a test message"
Array            = ("this is", "is a", "a test", "test message")
Hash             = ("this is" => 1, "is a" => 1,
                    "a test" => 1, "test message" => 1)

Save the contents of this hash to a separate field for quick and easy retrieval (I am not sure what the best way to do this is yet - still working on it).

Then, at each insert, compare the new document with all the others. If you can constrain the cross-check to begin with (e.g. by author, source, etc.), that will speed things up massively. Then just compare the hashes in some way, for example by checking whether the number of two-word constructs the two documents have in common exceeds some similarity threshold that you can define yourself. You could then store this similarity cross-list in another table.

It may be an idea to implement it as a stored procedure, but AFAIK MySQL can't do that (please correct me if I am wrong). If you combine this with triggers, you could intercept INSERT and UPDATE queries on the relevant tables, to make sure that things are always kept in sync. I am not sure how this could be handled in MySQL, but in PostgreSQL you could use PL/Perl stored procedures and set up triggers - that is how the PostgreSQL Full Text Index contrib used to work. It is likely, though, that this will be very slow, especially on a big table...

You could probably handle some of this at the application level, too, but in any case the INSERT time would go through the roof, because you would have to compare every new record against every existing record (again, constraining this comparison would be a very beneficial thing).

This is not a particularly scientific approach - it is a not-too-well-thought-out pragmatic one - but I only ran into the same problem myself yesterday...

Regards,

Gordan
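P.S. In case it is useful, here is a rough Perl sketch of the stripping, shingling and comparison steps described above. The function names and the similarity measure (shared shingles divided by the size of the smaller hash) are just my own choices for illustration, nothing standard:

#!/usr/bin/perl
use strict;
use warnings;

# Build a hash of two-word shingle counts from a document.
sub shingle_counts {
    my ($doc) = @_;
    my $text = lc $doc;           # make it all lower case
    $text =~ s/[^a-z\s]//g;       # erase non-alpha characters
    $text =~ s/\s+/ /g;           # compress white space
    $text =~ s/^ //;
    $text =~ s/ $//;
    my @words = split / /, $text;
    my %counts;
    # Sliding 2-word window: n-1 elements for n words
    for my $i (0 .. $#words - 1) {
        $counts{"$words[$i] $words[$i+1]"}++;
    }
    return %counts;
}

# Crude similarity score between two shingle hashes: the number of
# shingles they share, divided by the size of the smaller hash.
sub similarity {
    my ($h1, $h2) = @_;
    my $shared  = grep { exists $h2->{$_} } keys %$h1;
    my $smaller = (keys %$h1 < keys %$h2) ? scalar keys %$h1
                                          : scalar keys %$h2;
    return $smaller ? $shared / $smaller : 0;
}

my %doc1 = shingle_counts("This is a test message.");
my %doc2 = shingle_counts("This is a Test message!");
print similarity(\%doc1, \%doc2), "\n";   # prints 1 (identical once stripped)

# One crude way to save the hash into a text field might be something like:
#   my $packed = join ',', map { "$_=$doc1{$_}" } sort keys %doc1;

At insert time you would run similarity() between the new document's hash and each candidate record that survives your constraint (author, source, etc.), and store any pairs that score above your threshold in the cross-list table.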