> > On Mon, 7 Jan 2002, Arne Mueller wrote:
> > I wonder whether one can use the full text indexes in mysql to find out
> > what words in a document are likely to be relevant key words.
> >
> > . . .
> >
> > I'd be nice to have a command like this:
> >
> > select keywords(10.0) from MyDocs where DocId = 666;
> >
> > . . .
>
> Isn't function MATCH what you want? Example from the manual:
>
> mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
> +------------------------------+-------------------------------+--------+
> | a                            | b                             | x      |
> +------------------------------+-------------------------------+--------+
> | MySQL has now support        | for full-text search          | 0.3834 |
> | Full-text indexes            | are called collections        | 0.3834 |
> | Only MyISAM tables           | support collections           | 0.7668 |
> | Function MATCH ... AGAINST() | is used to do a search        |      0 |
> | Full-text search in MySQL    | implements vector space model |      0 |
> +------------------------------+-------------------------------+--------+
> 5 rows in set (0.00 sec)
>
> The function MATCH matches a natural language query AGAINST a text
> collection (which is simply the columns that are covered by a FULLTEXT
> index). For every row in a table it returns relevance - a similarity
> measure between the text in that row (in the columns that are part of the
> collection) and the query. When it is used in a WHERE clause (see example
> above) the rows returned are automatically sorted with relevance
> decreasing.

The problem is that I don't know the expression for the 'AGAINST' part.
Given a document, I'd like to know what it is about without reading it.
Using MATCH ... AGAINST to extract the most relevant keywords from a
single document, I'd have to do something like this:

  foreach word in Document with DocId = N, do:
      SELECT MATCH text_column AGAINST (word) FROM table WHERE DocId = N;
      if relevance of match > 0.5, do:
          remember this word as a relevant keyword
  print all keywords for Document with DocId = N

But I guess this is far too slow, as it needs one query per word in the
document.

Basically I have to implement this myself, using a table for the
documents, a table for the words, and a table that links a document with
its words. Each word in the word table will have a counter that tells me
how often this word occurs in all documents of the document table, and
the linker table (that links the docs with the words) contains a counter
column for the frequency of a word in that particular document. From this
I can extract the most relevant words for each document. The overall
frequencies in the word table have to be updated every time a new
document is inserted, but this is ok for me.

The relevance of the word 'mysql' in document X could be calculated as
the frequency of 'mysql' in X (stored in the linker table) divided by the
frequency of 'mysql' in all documents (stored in the word table). The
closer this number is to 1, the better the score ...

Has anyone here implemented such a text mining database? I'd be
interested in your solutions and experience.

Thanks a lot for comments,

Arne

--
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)207 2693405 | fax :+44-(0)207-269-3534
email : [EMAIL PROTECTED] | http://www.bmm.icnet.uk
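
The three-table scheme described above (documents, words, and a linker table with per-document counts) could be sketched roughly as follows. This is only an illustration under several assumptions: Python's built-in sqlite3 module stands in for MySQL so the example runs without a server, words are split on whitespace only, and the table names (docs, words, doc_words), column names (total, freq) and function names (build_index, keywords) are all hypothetical, not anything from the thread.

```python
import sqlite3
from collections import Counter


def build_index(docs):
    """Load {doc_id: text} into three tables: docs, words, doc_words."""
    db = sqlite3.connect(":memory:")  # stand-in for a MySQL database
    db.executescript("""
        CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE words (word TEXT PRIMARY KEY, total INTEGER NOT NULL);
        CREATE TABLE doc_words (
            doc_id INTEGER,
            word   TEXT,
            freq   INTEGER NOT NULL,
            PRIMARY KEY (doc_id, word)
        );
    """)
    for doc_id, text in docs.items():
        db.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, text))
        # Naive tokenizer (assumption): lowercase, split on whitespace.
        for word, n in Counter(text.lower().split()).items():
            # Linker table: frequency of this word in this document.
            db.execute("INSERT INTO doc_words VALUES (?, ?, ?)",
                       (doc_id, word, n))
            # Word table: keep the overall frequency up to date on insert.
            db.execute("INSERT OR IGNORE INTO words VALUES (?, 0)", (word,))
            db.execute("UPDATE words SET total = total + ? WHERE word = ?",
                       (n, word))
    return db


def keywords(db, doc_id, min_score=0.5):
    """Return (word, score) pairs with score = freq in doc / freq overall."""
    rows = db.execute("""
        SELECT dw.word, CAST(dw.freq AS REAL) / w.total AS score
        FROM doc_words dw JOIN words w ON w.word = dw.word
        WHERE dw.doc_id = ? AND CAST(dw.freq AS REAL) / w.total > ?
        ORDER BY score DESC, dw.word
    """, (doc_id, min_score))
    return rows.fetchall()


sample = {
    1: "mysql fulltext index mysql search",
    2: "search engine index",
    3: "mysql manual",
}
db = build_index(sample)
print(keywords(db, 1))
```

On this toy data only 'fulltext' and 'mysql' pass the 0.5 threshold for document 1. Note that any word occurring in just one document scores exactly 1 under this ratio, however rare it is, so a minimum overall count might be worth adding in practice.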