Resending this, since the first one didn't seem to get posted on the MySQL
list.
-Original Message-
From: Erlend Hopsø Strømsvik
Sent: 7 January 2003 10:18
To: [EMAIL PROTECTED]
Subject: RE: MySQL fulltext. Question about the stopword list
What I can easily do without breaking 4.0.x gamma status is to add a
command line switch --disable-fulltext-stopwords. It can help as a
temporary solution, until a proper fix - per-index options, that is -
is implemented.
That would be helpful for me, but what about Thomas Spahni's
suggestion?
Sergei,
but then, could you also add a command line switch
--read-stopwords-from-file=filename ???
Please. That could solve half of my problem.
Best regards,
Thomas Spahni
I was merely wondering why the stopword list was 'hardcoded',
since it seems to me that it's one of those things a user
should be able to change/modify without too much hassle and on
a more frequent basis than whenever one recompiles MySQL. Also,
a stopword list is very dependent on what kind of text/data
one wants to search in, so a large system with multiple users
and databases might want different stopword lists...
I remember working on a project when I was in school where we
wrote a program using autogenerated stopword lists and N-gram
matching for the text and search string. That way the stopword
list was not hard coded.
What is N-gram matching?
I post this to the MySQL board, since maybe someone else has
something to add/say about it too :)
Don't know where I got these texts from, but it should give
you a general idea about n-grams.
n-grams are used to describe objects as vectors. This makes
it possible to apply geometric, statistical and other
mathematical techniques, which are well defined for vectors,
but not for objects in general. For example, one of the most
common uses is to define a similarity measure between textual
documents based on the application of a mathematical function
to the vector representations of the documents.
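
As a rough illustration of the vector idea (my own sketch, not taken
from the quoted text): each string is turned into a vector of character
n-gram counts, and cosine similarity plays the role of the "mathematical
function". The function names and the choice of cosine similarity are
assumptions made for the example.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=3):
    """Count every run of n consecutive characters in text (no padding)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram count vectors."""
    # Counter returns 0 for n-grams that are missing from b.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = ngram_vector("full-text search in MySQL")
doc2 = ngram_vector("MySQL fulltext searching")
doc3 = ngram_vector("recompiling the server")
print(cosine_similarity(doc1, doc2))  # related texts share many trigrams
print(cosine_similarity(doc1, doc3))  # unrelated texts share very few
```

The first pair should score noticeably higher than the second, since
the two texts about fulltext search have many trigrams in common.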
N-Grams
String-similarity approaches to conflation involve the system
calculating a measure of similarity between an input query
term and each of the distinct terms in the database. Those
database terms that have a high similarity to a query term
are then displayed to the user for possible inclusion in the query.
N-gram matching techniques are one of the most common of
these approaches (Freund & Willett, 1982). An n-gram is a set
of n consecutive characters extracted from a word. The main
idea behind this approach is that similar words will have a
high proportion of n-grams in common. Typical values for n
are 2 or 3, these corresponding to the use of digrams or
trigrams, respectively.
So if you have the word 'computer' you'll get the following digrams:
*c, co, om, mp, pu, ut, te, er, r*
and the trigrams:
**c, *co, com, omp, mpu, put, ute, ter, er*, r**
where '*' denotes a padding space. There are n+1 such digrams
and n+2 such trigrams in a word containing n characters.
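
The padding scheme and the "high proportion of n-grams in common" idea
can be sketched roughly like this. It is only an illustration: the
helper names are made up, and Dice's coefficient is just one common way
to turn shared n-grams into a score. I picked it as an assumption; the
quoted text does not name a specific measure.

```python
def padded_ngrams(word, n=2, pad="*"):
    """Return the n-grams of word, padded with n-1 pad characters per side."""
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def dice_similarity(a, b, n=2):
    """Dice's coefficient over the sets of distinct padded n-grams."""
    grams_a = set(padded_ngrams(a, n))
    grams_b = set(padded_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

def similar_terms(query, terms, threshold=0.5, n=2):
    """Return (score, term) pairs at or above threshold, best match first."""
    scored = [(dice_similarity(query, term, n), term) for term in terms]
    return sorted((pair for pair in scored if pair[0] >= threshold),
                  reverse=True)

print(padded_ngrams("computer", 2))  # the digrams listed above
print(padded_ngrams("computer", 3))  # the trigrams listed above
print(similar_terms("computer",
                    ["computers", "computing", "commuter", "table"]))
```

A real search system would compare the query term against every
distinct term in the index and show the high-scoring ones to the user,
as the text above describes. The threshold of 0.5 is arbitrary.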
Found this link after some 'googling about'
http://web.umr.edu/~tauritzd/ngram/
This is probably the original text for the first text I had:
http://web.umr.edu/~tauritzd/ngram/tutorial.html
Regards,
Sergei
--
MySQL Development Team
__ ___ ___ __
/ |/ /_ __/ __/ __ \/ / Sergei Golubchik [EMAIL PROTECTED]
/ /|_/ / // /\ \/ /_/ / /__ MySQL AB, http://www.mysql.com/
/_/ /_/\_, /___/\___\_\___/ Osnabrueck, Germany
___/