> 
> On Mon, 7 Jan 2002, Arne Mueller wrote:
> > I wonder whether one can use the full text indexes in mysql to find out
> > what words in a document are likely to be relevant key words.
> >
> > . . .
> >
> > It'd be nice to have a command like this:
> >
> > select keywords(10.0) from MyDocs where DocId = 666;
> >
> > . . .
> 
> Isn't function MATCH what you want?  Example from the manual:
> 
> mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
> +------------------------------+-------------------------------+--------+
> | a                            | b                             | x      |
> +------------------------------+-------------------------------+--------+
> | MySQL has now support        | for full-text search          | 0.3834 |
> | Full-text indexes            | are called collections        | 0.3834 |
> | Only MyISAM tables           | support collections           | 0.7668 |
> | Function MATCH ... AGAINST() | is used to do a search        |      0 |
> | Full-text search in MySQL    | implements vector space model |      0 |
> +------------------------------+-------------------------------+--------+
> 5 rows in set (0.00 sec)
> 
> The function MATCH matches a natural language query AGAINST a text
> collection (which is simply the columns that are covered by a FULLTEXT
> index). For every row in a table it returns relevance - a similarity
> measure between the text in that row (in the columns that are part of the
> collection) and the query.  When it is used in a WHERE clause (see example
> above) the rows returned are automatically sorted with relevance
> decreasing.

The problem is that I don't know what to put in the AGAINST part.
Given a document, I'd like to know what it is about without reading it.
To extract the most relevant keywords from a single document with
MATCH ... AGAINST, I'd have to do something like this:

foreach word in Document with DocId = N, do:
    SELECT MATCH text_column AGAINST (word) FROM table WHERE DocId = N;
    if relevance of match > 0.5, do:
        remember this word as a relevant keyword
print all keywords for Document with DocId = N

But I guess this is far too slow. Basically I have to implement this
myself using a table for the documents, a word table with one row per
word, and a table that links a document with its words. Each word in the
word table will have a counter that tells me how often this word occurs
in all documents of the document table, and the linker table (that links
the docs with the words) contains a counter column to count the
frequency of a word in this particular document. From this I can extract
the most relevant words for each document. The overall frequencies in
the word table have to be updated every time a new document is inserted,
but this is OK for me. The relevance of the word 'mysql' in document X
could be calculated as the frequency of 'mysql' in X (stored in the
linker table) divided by the frequency of 'mysql' in all documents
(stored in the word table). The closer this number is to 1, the better
the score ...
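
A minimal sketch of the three-table scheme described above, to make the
bookkeeping concrete. It uses Python's sqlite3 module instead of MySQL
only so that it runs self-contained; the SQL itself is plain enough to
port. The table and column names (docs, words, doc_words, total_count,
doc_count), the simple tokenizer, and the min_score cutoff are all
illustrative assumptions, not part of any existing MySQL feature.

```python
import re
import sqlite3
from collections import Counter


def create_schema(conn):
    """Create the document table, the word table, and the linker table."""
    conn.executescript("""
        CREATE TABLE docs (
            doc_id INTEGER PRIMARY KEY,
            body   TEXT NOT NULL
        );
        CREATE TABLE words (
            word_id     INTEGER PRIMARY KEY,
            word        TEXT NOT NULL UNIQUE,
            total_count INTEGER NOT NULL DEFAULT 0  -- occurrences in all docs
        );
        CREATE TABLE doc_words (
            doc_id    INTEGER NOT NULL REFERENCES docs(doc_id),
            word_id   INTEGER NOT NULL REFERENCES words(word_id),
            doc_count INTEGER NOT NULL,             -- occurrences in this doc
            PRIMARY KEY (doc_id, word_id)
        );
    """)


def add_document(conn, doc_id, body):
    """Insert a document and update both the overall and per-document counters."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    counts = Counter(re.findall(r"[a-z0-9]+", body.lower()))
    conn.execute("INSERT INTO docs (doc_id, body) VALUES (?, ?)", (doc_id, body))
    for word, n in counts.items():
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        conn.execute(
            "UPDATE words SET total_count = total_count + ? WHERE word = ?",
            (n, word),
        )
        conn.execute(
            "INSERT INTO doc_words (doc_id, word_id, doc_count) "
            "SELECT ?, word_id, ? FROM words WHERE word = ?",
            (doc_id, n, word),
        )


def keywords(conn, doc_id, min_score=0.5):
    """Return (word, score) pairs with score = doc_count / total_count."""
    rows = conn.execute(
        """
        SELECT w.word,
               CAST(dw.doc_count AS REAL) / w.total_count AS score
        FROM doc_words AS dw
        JOIN words AS w ON w.word_id = dw.word_id
        WHERE dw.doc_id = ?
          AND CAST(dw.doc_count AS REAL) / w.total_count > ?
        ORDER BY score DESC, w.word
        """,
        (doc_id, min_score),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    add_document(conn, 1, "mysql mysql index search")
    add_document(conn, 2, "search index index tree")
    print(keywords(conn, 1))
```

One consequence of this ratio worth keeping in mind: any word that
occurs in only one document scores exactly 1 there, however rare it is,
so a misspelling would rank as high as a real keyword unless something
like a minimum count is added.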

Has anyone here implemented such a text mining database? I'd be
interested in your solutions and experience.

        thanks a lot for any comments,

        Arne

-- 
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)207 2693405      | fax :+44-(0)207-269-3534
email : [EMAIL PROTECTED] | http://www.bmm.icnet.uk
