Resending this, since the first one didn't seem to get posted on the MySQL list.
> -----Original Message-----
> From: Erlend Hopsø Strømsvik
> Sent: 7 January 2003 10:18
> To: [EMAIL PROTECTED]
> Subject: RE: MySQL fulltext. Question about the stopword list
>
> > What I can easily do without breaking 4.0.x "gamma" status, is to
> > add command line switch --disable-fulltext-stopwords. It can help
> > as a temporary solution, until a proper fix - per-index options,
> > that is - will be implemented.
>
> That would be helpful for me, but what about Thomas Spahni's
> suggestion?
>
> > Sergei,
> >
> > but then, could you also add a command line switch
> >
> > --read-stopwords-from-file="filename" ???
> >
> > Please. That could solve half of my problem.
> >
> > Best regards,
> > Thomas Spahni
>
> I was merely wondering why the stopword list was 'hardcoded', since
> it seems to me that it's one of those things a user should be able
> to change/modify without too much hassle and on a more frequent
> basis than whenever one recompiles MySQL. Also, a stopword list is
> very dependent on what kind of text/data one wants to search in, so
> a large system with multiple users and databases might want
> different stopword lists...
>
> > > I remember working on a project when I was at school where we
> > > wrote this program using autogenerated stopword lists and N-gram
> > > matching for the text and search string. This way the stopword
> > > list was not hard-coded.
> >
> > What is "N-gram matching" ?
>
> I post this to the MySQL board, since maybe someone else has
> something to add/say about it too :)
> Don't know where I got these texts from, but it should give you a
> general idea about n-grams.
>
> ************************
> n-grams are used to describe objects as vectors. This makes it
> possible to apply geometric, statistical and other mathematical
> techniques, which are well defined for vectors, but not for objects
> in general.
> For example, one of the most common uses is to define a similarity
> measure between textual documents based on the application of a
> mathematical function to the vector representations of the
> documents.
> ************************
> N-Grams
> String-similarity approaches to conflation involve the system
> calculating a measure of similarity between an input query term and
> each of the distinct terms in the database. Those database terms
> that have a high similarity to a query term are then displayed to
> the user for possible inclusion in the query.
> N-gram matching techniques are one of the most common of these
> approaches (Freund & Willett, 1982). An n-gram is a sequence of n
> consecutive characters extracted from a word. The main idea behind
> this approach is that similar words will have a high proportion of
> n-grams in common. Typical values for n are 2 or 3, these
> corresponding to the use of digrams or trigrams, respectively.
>
> So if you have the word 'computer' you'll get the following digrams:
> *c, co, om, mp, pu, ut, te, er, r*
>
> and the trigrams:
> **c, *co, com, omp, mpu, put, ute, ter, er*, r**
>
> where '*' denotes a padding space. There are n+1 such digrams and
> n+2 such trigrams in a word containing n characters.
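
To make the padding rule above concrete, here is a minimal Python
sketch. The function names ngrams and dice_similarity are made up for
illustration, and the Dice coefficient is only one common way to score
the "proportion of n-grams in common"; the quoted text does not say
which measure to use, so treat that part as an assumption.

```python
def ngrams(word, n, pad="*"):
    """Return the padded character n-grams of `word`, in order.

    The word is padded with n-1 copies of `pad` on each side, which
    yields len(word) + n - 1 n-grams (n+1 digrams and n+2 trigrams
    for an n-character word, as in the quoted text).
    """
    padded = pad * (n - 1) + word + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over the sets of n-grams of two words.

    Assumption: this measure is chosen for illustration only; it is
    not taken from the quoted text.
    """
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not grams_a and not grams_b:
        return 1.0  # two empty inputs are treated as identical
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    print(ngrams("computer", 2))
    # ['*c', 'co', 'om', 'mp', 'pu', 'ut', 'te', 'er', 'r*']
    print(ngrams("computer", 3))
    # ['**c', '*co', 'com', 'omp', 'mpu', 'put', 'ute', 'ter', 'er*', 'r**']
    print(dice_similarity("computer", "computers"))
    print(dice_similarity("computer", "database"))
```

With digrams, 'computer' and 'computers' share most of their n-grams
and score close to 1, while 'computer' and 'database' share none and
score 0.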
>
> Found this link after some 'googling about'
> http://web.umr.edu/~tauritzd/ngram/
> This is probably the original text for the first text I had:
> http://web.umr.edu/~tauritzd/ngram/tutorial.html
>
> > Regards,
> > Sergei
> >
> > --
> > MySQL Development Team
> >    __  ___     ___ ____  __
> >   /  |/  /_ __/ __/ __ \/ /    Sergei Golubchik <[EMAIL PROTECTED]>
> >  / /|_/ / // /\ \/ /_/ / /__   MySQL AB, http://www.mysql.com/
> > /_/  /_/\_, /___/\___\_\___/   Osnabrueck, Germany
> >        <___/