Re: MySQL fulltext. Question about the stopword list

2003-01-10 Thread Sergei Golubchik
Hi!

On Jan 07, Erlend Hopso Stromsvik wrote:
  What I can easily do without breaking 4.0.x gamma status, is to add
  command line switch --disable-fulltext-stopwords. It can help as a
  temporary solution, untill a proper fix - per-index options, that is -
  will be implemented.
 
 That would be helpful for me, but what about Thomas Spahni's suggestion?

In todo already :)

   I remember working on a project when I was school where we wrote
   this program using autogenerated stopword lists and N-gram
   matching for the text and search string. By this the stopword list
   was not hard coded..
  
  What is N-gram matching ?
 
 
 n-grams are used to describe objects as vectors. This makes it possible to
 apply geometric, statistical and other mathematical techniques, which are
 well defined for vectors, but not for objects in general. For example, one
 of the most common uses is to define a similarity measure between textual
 documents based on the application of a mathematical function to the vector
 representations of the documents

Yep. This is how fulltext search in MySQL works (not IN BOOLEAN MODE, of
course). It is called vector space model, though words are used as
vector coordinates, not n-grams.

 
 An n-gram is a set of n consecutive characters extracted from a word.
 The main idea behind this approach is that, similar words will have a
 high proportion of n-grams in common. Typical values for n are 2 or 3,
 these corresponding to the use of digrams or trigrams, respectively.
 
 So if you have the word 'computer' you'll get the following digrams:
 *c, co, om, mp, pu, ut, te, er, r*
 
 and the trigrams:
 **c,*co,com,omp,mpu,put,ute,ter,er*,r**
 
Regards,
Sergei

-- 
MySQL Development Team
   __  ___ ___   __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik [EMAIL PROTECTED]
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, http://www.mysql.com/
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
   ___/

-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail [EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php




RE: MySQL fulltext. Question about the stopword list

2003-01-08 Thread Erlend Hopso Stromsvik
 What I can easily do without breaking 4.0.x gamma status, is to add
 command line switch --disable-fulltext-stopwords. It can help as a
 temporary solution, untill a proper fix - per-index options, that is -
 will be implemented.

That would be helpful for me, but what about Thomas Spahni's suggestion?

 
 Sergei,
 
 but then, could you also add a command line switch
 
 --read-stopwords-from-file=filename ???
 
 Please. That could solve half of my problem.
 
 Best regards,
 Thomas Spahni

I was mere wondering why the stopword list was 'hardcoded' since it seems to
me that it's one of those things a user should be able to change/modify
without to much hassle and on a more frequent basis than whenever one
recompile MySQL. Also a stopword list is very dependent on what kind of
text/data one wants to search in so a large system with multiple users and
databases might want different stopword lists...

 
  I remember working on a project when I was school where we 
 wrote this
  program using autogenerated stopword lists and N-gram 
 matching for the text
  and search string. By this the stopword list was not hard coded..
 
 What is N-gram matching ?
 

I post this to the MySQL board, since maybe someone else has something to
add/say about it too :)
Don't know where I got these texts from, but it should give you a general
idea about n-grams. 


n-grams are used to describe objects as vectors. This makes it possible to
apply geometric, statistical and other mathematical techniques, which are
well defined for vectors, but not for objects in general. For example, one
of the most common uses is to define a similarity measure between textual
documents based on the application of a mathematical function to the vector
representations of the documents

N-Grams
String-similarity approaches to conflation involve the system calculating a
measure of similarity between an input query term and each of the distinct
terms in the database. Those database terms that have a high similarity to a
query term are then displayed to the user for possible inclusion in the
query. 
N-gram matching techniques are one of the most common of these approaches
(Freund  Willett, 1982). An n-gram is a set of n consecutive characters
extracted from a word. The main idea behind this approach is that, similar
words will have a high proportion of n-grams in common. Typical values for n
are 2 or 3, these corresponding to the use of digrams or trigrams,
respectively.

So if you have the word 'computer' you'll get the following digrams:
*c, co, om, mp, pu, ut, te, er, r*

and the trigrams:
**c,*co,com,omp,mpu,put,ute,ter,er*,r**

where '*' denotes a padding space. There are n+1 such digrams and n+2 such
trigrams in a word containing n characters.


Found this link after some 'googling about' 
http://web.umr.edu/~tauritzd/ngram/
This is probably the original text for the first text I had:
http://web.umr.edu/~tauritzd/ngram/tutorial.html


 Regards,
 Sergei
 
 -- 
 MySQL Development Team
__  ___ ___   __
   /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik [EMAIL PROTECTED]
  / /|_/ / // /\ \/ /_/ / /__  MySQL AB, http://www.mysql.com/
 /_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
___/
 

-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail [EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php




Re: MySQL fulltext. Question about the stopword list

2003-01-07 Thread Thomas Spahni
On Mon, 6 Jan 2003, Sergei Golubchik wrote:

  Does the stopword list have to be 'hardcoded' in myisam/ft_static.c?

 No, of course, it has not.
 The problem is, that I cannot implement something like

CREATE TABLE ... ( ... FULLTEXT (...) WITHOUT STOPWORD LIST )
 or
CREATE TABLE ... ( ... FULLTEXT (...) WITH STOPWORD LIST=filename )

 or whatever, as there is simply no room in .frm file to store extra
 table/column/index attributes :(
 Changing .frm format is in our todo for a long time already, but it
 includes few subtasks that makes it very non-trivial. :(
 It's something that Monty is going to do himself.

 What I can easily do without breaking 4.0.x gamma status, is to add
 command line switch --disable-fulltext-stopwords. It can help as a
 temporary solution, untill a proper fix - per-index options, that is -
 will be implemented.

Sergei,

but then, could you also add a command line switch

--read-stopwords-from-file=filename ???

Please. That could solve half of my problem.

Best regards,
Thomas Spahni
-- 
filter: sql,query


-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail [EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php




Re: MySQL fulltext. Question about the stopword list

2003-01-06 Thread Sergei Golubchik
Hi!

On Jan 06, Erlend Hopsø Strømsvik wrote:
 
 Hi
 
 I'm currently using MySQL (4.0.4 recompiled) as our database for storing
 information about books.
 There are about 1.5mill book titles + other information in the database now.
 
 The fulltext search capabilities of MySQL have turned out to become very
 good with the 4+ version and with a little tweaking of the ft_min_word_len
 and myisam/ft_static.c we have gotten all we wished for feature wise :)
 ft_min_word_len is 2 and I've completely removed the stopword list.
 A fulltext search with 1.5mill book titles and sub titles takes about 0.17
 sec so there is no speed issues here. (hands down for MySQL)  :)

Thanks :)

 But... 
 Does the stopword list have to be 'hardcoded' in myisam/ft_static.c?

No, of course, it has not.
The problem is, that I cannot implement something like

   CREATE TABLE ... ( ... FULLTEXT (...) WITHOUT STOPWORD LIST )

or

   CREATE TABLE ... ( ... FULLTEXT (...) WITH STOPWORD LIST=filename )

or whatever, as there is simply no room in .frm file to store extra
table/column/index attributes :(
Changing .frm format is in our todo for a long time already, but it
includes few subtasks that makes it very non-trivial. :(
It's something that Monty is going to do himself.

What I can easily do without breaking 4.0.x gamma status, is to add
command line switch --disable-fulltext-stopwords. It can help as a
temporary solution, untill a proper fix - per-index options, that is -
will be implemented.

 I remember working on a project when I was school where we wrote this
 program using autogenerated stopword lists and N-gram matching for the text
 and search string. By this the stopword list was not hard coded..

What is N-gram matching ?

Regards,
Sergei

-- 
MySQL Development Team
   __  ___ ___   __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik [EMAIL PROTECTED]
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, http://www.mysql.com/
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
   ___/

-
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/   (the list archive)

To request this thread, e-mail [EMAIL PROTECTED]
To unsubscribe, e-mail [EMAIL PROTECTED]
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php