RE: counting keywords

Steve Edberg Wed, 12 Apr 2006 01:36:04 -0700

At 1:55 PM +1000 4/12/06, Taco Fleur wrote:

Hello Steve,


Your suggestion works like a charm, I am now trying to get my head around
your following statement, I can't seem to get a grip on what you mean. Is
there anyway you could elaborate a little on the following? It would be
really nice if I could get the solution to work with full text and be able
to use stemming etc.

 where document_id is a foreign key pointing at the table containing your

fulltext. This would be easier to extend to handle synonym handling too, and
you could do all the suffix handling/stemming you need (eg; to take care of
plurals). I've done something like that as well, and included an extra field
for the metaphone version of the word, to match approximate spellings.



Assume table structures something like this:

   WordTable:

word char(32) not null # or whatever your max wordlength is likely to be

   word_count  integer unsigned not null
   document_id integer unsigned not null

   DocTable:
   document_id integer unsigned not null auto_increment primary key
   doc_author
   doc_date
   ...
   doc_body    text

In your original post, you needed to get exact counts of wordsappearing in the document body; MySQL's full text search can't dothat. What you could do in this case is some preprocessing on thedoc_body when you insert it. You would scan the text, ignoring theinsignificant words (eg; and, the, a, is, and so on), and thentransform the remaining words into a canonical form (eg; glasses,glassy, glass all become glass), and then insert into the WordTable.Then, when a search is performed, you translate the search termsusing the same algorithm and search the WordTable. You might alsohave a SynonymTable that you could use to translate all synonyms to astandard term before insertion into the WordTable and beforesearching.

There are various stemming algorithms around; the Porter Algorithmwas one of the earlier ones, and the one I have worked with some:


   http://www.tartarus.org/martin/PorterStemmer/

It worked pretty well, but I needed to maintain an exception list forsome words it incorrectly translated. Also, you have to decidewhether similar terms like 'anthropologist' and 'anthropology' areidentical as far as your search is concerned.


The Snowball or Porter2 algorithm is apparently an improvement on the original:

   http://snowball.tartarus.org/algorithms/english/stemmer.html

A google for 'stemming algorithms' also turned up this:

   http://www.comp.lancs.ac.uk/computing/research/stemming/

And if you're dealing with non-English words, then you'll have tolook for native language stemmers or modify the rules in the abovealgorithms.

If you had a table set up as above, you could do either searchesbased on the word table (where you could return documents sorted bythe number of times the search words appeared), or using a fulltextsearch on the doc_body. I would imagine that for many documents theorder of results would be similar. The fulltext algorithm alsoweights words more heavily based on their uniqueness.

If you want to do searching based on approximate spellings, you couldadd an additional column to WordTable:


   word_approx    char(32) not null

and store either the soundex version of the word -

   http://en.wikipedia.org/wiki/Soundex

- or the more accurate (in my experience) metaphone algorithm -

   http://en.wikipedia.org/wiki/Metaphone

Then the search terms would be run through two transforms: a stemmer,and a 'metaphoner'.


Lastly, it appears that MySQL 5.1 has a new plug-in API -

   http://dev.mysql.com/doc/refman/5.1/en/plugin-api.html

- which includes the ability to modify/replace fulltext parserbehavior. It looks like you might be able to create custom functionsto do most or all of the above using user-defined functions,presumably with relatively high efficiency.


        steve

--
+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg                                http://pgfsun.ucdavis.edu/ |
| UC Davis Genome Center                            [EMAIL PROTECTED] |
| Bioinformatics programming/database/sysadmin             (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

RE: counting keywords

Reply via email to