At 1:55 PM +1000 4/12/06, Taco Fleur wrote:
Hello Steve,
Your suggestion works like a charm, I am now trying to get my head around
your following statement, I can't seem to get a grip on what you mean. Is
there anyway you could elaborate a little on the following? It would be
really nice if I could get the solution to work with full text and be able
to use stemming etc.
where document_id is a foreign key pointing at the table containing your
fulltext. This would be easier to extend to handle synonym handling too, and
you could do all the suffix handling/stemming you need (eg; to take care of
plurals). I've done something like that as well, and included an extra field
for the metaphone version of the word, to match approximate spellings.
Assume table structures something like this:
WordTable:
word char(32) not null # or whatever your max word
length is likely to be
word_count integer unsigned not null
document_id integer unsigned not null
DocTable:
document_id integer unsigned not null auto_increment primary key
doc_author
doc_date
...
doc_body text
In your original post, you needed to get exact counts of words
appearing in the document body; MySQL's full text search can't do
that. What you could do in this case is some preprocessing on the
doc_body when you insert it. You would scan the text, ignoring the
insignificant words (eg; and, the, a, is, and so on), and then
transform the remaining words into a canonical form (eg; glasses,
glassy, glass all become glass), and then insert into the WordTable.
Then, when a search is performed, you translate the search terms
using the same algorithm and search the WordTable. You might also
have a SynonymTable that you could use to translate all synonyms to a
standard term before insertion into the WordTable and before
searching.
There are various stemming algorithms around; the Porter Algorithm
was one of the earlier ones, and the one I have worked with some:
http://www.tartarus.org/martin/PorterStemmer/
It worked pretty well, but I needed to maintain an exception list for
some words it incorrectly translated. Also, you have to decide
whether similar terms like 'anthropologist' and 'anthropology' are
identical as far as your search is concerned.
The Snowball or Porter2 algorithm is apparently an improvement on the original:
http://snowball.tartarus.org/algorithms/english/stemmer.html
A google for 'stemming algorithms' also turned up this:
http://www.comp.lancs.ac.uk/computing/research/stemming/
And if you're dealing with non-English words, then you'll have to
look for native language stemmers or modify the rules in the above
algorithms.
If you had a table set up as above, you could do either searches
based on the word table (where you could return documents sorted by
the number of times the search words appeared), or using a fulltext
search on the doc_body. I would imagine that for many documents the
order of results would be similar. The fulltext algorithm also
weights words more heavily based on their uniqueness.
If you want to do searching based on approximate spellings, you could
add an additional column to WordTable:
word_approx char(32) not null
and store either the soundex version of the word -
http://en.wikipedia.org/wiki/Soundex
- or the more accurate (in my experience) metaphone algorithm -
http://en.wikipedia.org/wiki/Metaphone
Then the search terms would be run through two transforms: a stemmer,
and a 'metaphoner'.
Lastly, it appears that MySQL 5.1 has a new plug-in API -
http://dev.mysql.com/doc/refman/5.1/en/plugin-api.html
- which includes the ability to modify/replace fulltext parser
behavior. It looks like you might be able to create custom functions
to do most or all of the above using user-defined functions,
presumably with relatively high efficiency.
steve
--
+--------------- my people are the people of the dessert, ---------------+
| Steve Edberg http://pgfsun.ucdavis.edu/ |
| UC Davis Genome Center [EMAIL PROTECTED] |
| Bioinformatics programming/database/sysadmin (530)754-9127 |
+---------------- said t e lawrence, picking up his fork ----------------+
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]