At 1:55 PM +1000 4/12/06, Taco Fleur wrote:
Hello Steve,

Your suggestion works like a charm, I am now trying to get my head around
your following statement, I can't seem to get a grip on what you mean. Is
there anyway you could elaborate a little on the following? It would be
really nice if I could get the solution to work with full text and be able
to use stemming etc.


 where document_id is a foreign key pointing at the table containing your
fulltext. This would be easier to extend to handle synonym handling too, and
you could do all the suffix handling/stemming you need (eg; to take care of
plurals). I've done something like that as well, and included an extra field
for the metaphone version of the word, to match approximate spellings.


Assume table structures something like this:

   WordTable:
word char(32) not null # or whatever your max word length is likely to be
   word_count  integer unsigned not null
   document_id integer unsigned not null

   DocTable:
   document_id integer unsigned not null auto_increment primary key
   doc_author
   doc_date
   ...
   doc_body    text

In your original post, you needed to get exact counts of words appearing in the document body; MySQL's full text search can't do that. What you could do in this case is some preprocessing on the doc_body when you insert it. You would scan the text, ignoring the insignificant words (eg; and, the, a, is, and so on), and then transform the remaining words into a canonical form (eg; glasses, glassy, glass all become glass), and then insert into the WordTable. Then, when a search is performed, you translate the search terms using the same algorithm and search the WordTable. You might also have a SynonymTable that you could use to translate all synonyms to a standard term before insertion into the WordTable and before searching.

There are various stemming algorithms around; the Porter Algorithm was one of the earlier ones, and the one I have worked with some:

   http://www.tartarus.org/martin/PorterStemmer/

It worked pretty well, but I needed to maintain an exception list for some words it incorrectly translated. Also, you have to decide whether similar terms like 'anthropologist' and 'anthropology' are identical as far as your search is concerned.

The Snowball or Porter2 algorithm is apparently an improvement on the original:

   http://snowball.tartarus.org/algorithms/english/stemmer.html

A google for 'stemming algorithms' also turned up this:

   http://www.comp.lancs.ac.uk/computing/research/stemming/

And if you're dealing with non-English words, then you'll have to look for native language stemmers or modify the rules in the above algorithms.

If you had a table set up as above, you could do either searches based on the word table (where you could return documents sorted by the number of times the search words appeared), or using a fulltext search on the doc_body. I would imagine that for many documents the order of results would be similar. The fulltext algorithm also weights words more heavily based on their uniqueness.

If you want to do searching based on approximate spellings, you could add an additional column to WordTable:

   word_approx    char(32) not null

and store either the soundex version of the word -

   http://en.wikipedia.org/wiki/Soundex

- or the more accurate (in my experience) metaphone algorithm -

   http://en.wikipedia.org/wiki/Metaphone

Then the search terms would be run through two transforms: a stemmer, and a 'metaphoner'.

Lastly, it appears that MySQL 5.1 has a new plug-in API -

   http://dev.mysql.com/doc/refman/5.1/en/plugin-api.html

- which includes the ability to modify/replace fulltext parser behavior. It looks like you might be able to create custom functions to do most or all of the above using user-defined functions, presumably with relatively high efficiency.

        steve

--
+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg                                http://pgfsun.ucdavis.edu/ |
| UC Davis Genome Center                            [EMAIL PROTECTED] |
| Bioinformatics programming/database/sysadmin             (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to