RE: Fulltext Simple Question
"Scott Purcell" <[EMAIL PROTECTED]> wrote on 05/25/2005 03:35:54 PM:

> Thanks Sean for the info.
>
> I see where it states the server is configured for 4 character
> indexing. I would like to try and set it to 3 and do not understand
> what an options file is:
> The documentation states the following:
>
> * The minimum and maximum length of words to be indexed is defined by
> the ft_min_word_len and ft_max_word_len system variables (available
> as of MySQL 4.0.0). See
> <http://dev.mysql.com/doc/mysql/en/server-system-variables.html>
> Section 5.3.3, "Server System Variables". The
> default minimum value is four characters. The default maximum
> depends on your version of MySQL. If you change either value, you
> must rebuild your FULLTEXT indexes. For example, if you want
> three-character words to be searchable, you can set the ft_min_word_len
> variable by putting the following lines in an option file:
>
> [mysqld]
> ft_min_word_len=3
>
> I use mysql from a binary install, and I am just learning it. How do
> I create this file, and where does it go?
>
> Thanks,
> Scott
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, May 25, 2005 2:21 PM
> To: Brian Mansell
> Cc: mysql@lists.mysql.com; Scott Purcell
> Subject: Re: Fulltext Simple Question
RE: Fulltext Simple Question
Thanks Sean for the info.

I see where it states the server is configured for 4 character indexing. I would like to try and set it to 3 and do not understand what an options file is. The documentation states the following:

* The minimum and maximum length of words to be indexed is defined by the ft_min_word_len and ft_max_word_len system variables (available as of MySQL 4.0.0). See <http://dev.mysql.com/doc/mysql/en/server-system-variables.html> Section 5.3.3, "Server System Variables". The default minimum value is four characters. The default maximum depends on your version of MySQL. If you change either value, you must rebuild your FULLTEXT indexes. For example, if you want three-character words to be searchable, you can set the ft_min_word_len variable by putting the following lines in an option file:

  [mysqld]
  ft_min_word_len=3

I use mysql from a binary install, and I am just learning it. How do I create this file, and where does it go?

Thanks,
Scott

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 25, 2005 2:21 PM
To: Brian Mansell
Cc: mysql@lists.mysql.com; Scott Purcell
Subject: Re: Fulltext Simple Question
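
A note on the option-file question in this message, offered as a sketch rather than a definitive answer: an option file is a plain text file that the server reads at startup. On Windows, MySQL of this era looks for my.ini in the Windows directory and for C:\my.cnf. Creating either one in a text editor with the two lines quoted from the manual ([mysqld] and ft_min_word_len=3) and then restarting the server should apply the setting. Assuming the restart succeeded, the value can be checked from the mysql client:

```sql
-- Check which minimum word length the running server is using.
-- A value of 3 means the option file was read; the default of 4
-- suggests the file was not found or the server was not restarted.
SHOW VARIABLES LIKE 'ft_min_word_len';
```

If the value is still 4, the file name and location are the first things to check.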
Re: Fulltext Simple Question
Brian Mansell <[EMAIL PROTECTED]> wrote on 05/25/2005 03:09:03 PM:

> Scott -
>
> Check this excerpt out (
> http://dev.mysql.com/doc/mysql/en/fulltext-search.html ) from the MySQL
> Documentation. I hope it helps!
>
> --bemansell
>
> ...
>
> "Every correct word in the collection and in the query is weighted according
> to its significance in the collection or query. This way, a word that is
> present in many documents has a lower weight (and may even have a zero
> weight), because it has lower semantic value in this particular collection.
> Conversely, if the word is rare, it receives a higher weight. The weights of
> the words are then combined to compute the relevance of the row.
>
> Such a technique works best with large collections (in fact, it was
> carefully tuned this way). For very small tables, word distribution does not
> adequately reflect their semantic value, and this model may sometimes
> produce bizarre results. For example, although the word ``MySQL'' is present
> in every row of the articles table, a search for the word produces no
> results:
>
> mysql> SELECT * FROM articles
>     -> WHERE MATCH (title,body) AGAINST ('MySQL');
> Empty set (0.00 sec)
>
> The search result is empty because the word ``MySQL'' is present in at
> least 50% of the rows. As such, it is effectively treated as a stopword. For
> large datasets, this is the most desirable behavior---a natural language
> query should not return every second row from a 1GB table. For small
> datasets, it may be less desirable.
>
> A word that matches half of rows in a table is less likely to locate
> relevant documents. In fact, it most likely finds plenty of irrelevant
> documents. We all know this happens far too often when we are trying to find
> something on the Internet with a search engine. It is with this reasoning
> that rows containing the word are assigned a low semantic value for *the
> particular dataset in which they occur*. A given word may exceed the 50%
> threshold in one dataset but not another.
>
> The 50% threshold has a significant implication when you first try full-text
> searching to see how it works: If you create a table and insert only one or
> two rows of text into it, every word in the text occurs in at least 50% of
> the rows. As a result, no search returns any results. Be sure to insert at
> least three rows, and preferably many more."
>
> On 5/25/05, Scott Purcell <[EMAIL PROTECTED]> wrote:
> >
> > Hello,
> > I am running 4.0.15 for Win95/98 and am working through the docs.
> >
> > I created a "text" type field with a 'fulltext' index. As I am
> > experimenting, I have run into a couple of questions:
> >
> > First off, I was having trouble getting results. So I added the word
> > "foobar" to one of the descriptions:
> > and that worked with this query:
> > select * from item where match(name, description) against('foobar')
> >
> > I have a word 'red' that appears 5-10 times, in a tmp table of 60 records.
> > If I run that query with 'red'
> > select * from item where match(name, description) against('red');
> > it returns empty set
> >
> > Upon reading, it looks like it is really trying to only get "unique" names
> > from the index. But in my case the 'red' is a description that I would like
> > to get back. Any way to force this to return results?
> >
> > Any info would be helpful. I have read, but it gets a little confusing
> > first time through.
> >
> > Thanks,
> > Scott

The other thing to remember is the "minimum word length". By default it is set to 4. RED has only 3 characters, so it would not have been indexed. That would explain why FT searches for RED are not returning any records.

See here for FT tuning (settings):
http://dev.mysql.com/doc/mysql/en/fulltext-fine-tuning.html
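
Following on from the minimum word length point: the MySQL manual states that after changing ft_min_word_len the FULLTEXT indexes must be rebuilt before shorter words become searchable. A minimal sketch, assuming the table is the MyISAM table named item from Scott's queries and that the server has already been restarted with ft_min_word_len=3:

```sql
-- Rebuild the FULLTEXT index so three-character words get indexed.
-- QUICK tries to repair only the index, which is enough for this change.
REPAIR TABLE item QUICK;

-- Rerun the original query; 'red' is now long enough to be indexed.
SELECT * FROM item WHERE MATCH(name, description) AGAINST('red');
```

Dropping and re-creating the FULLTEXT index would be an alternative way to rebuild it.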
Re: Fulltext Simple Question
Scott -

Check this excerpt out ( http://dev.mysql.com/doc/mysql/en/fulltext-search.html ) from the MySQL Documentation. I hope it helps!

--bemansell

...

"Every correct word in the collection and in the query is weighted according to its significance in the collection or query. This way, a word that is present in many documents has a lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are then combined to compute the relevance of the row.

Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results. For example, although the word ``MySQL'' is present in every row of the articles table, a search for the word produces no results:

mysql> SELECT * FROM articles
    -> WHERE MATCH (title,body) AGAINST ('MySQL');
Empty set (0.00 sec)

The search result is empty because the word ``MySQL'' is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large datasets, this is the most desirable behavior---a natural language query should not return every second row from a 1GB table. For small datasets, it may be less desirable.

A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for *the particular dataset in which they occur*. A given word may exceed the 50% threshold in one dataset but not another.

The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more."

On 5/25/05, Scott Purcell <[EMAIL PROTECTED]> wrote:
>
> Hello,
> I am running 4.0.15 for Win95/98 and am working through the docs.
>
> I created a "text" type field with a 'fulltext' index. As I am
> experimenting, I have run into a couple of questions:
>
> First off, I was having trouble getting results. So I added the word
> "foobar" to one of the descriptions:
> and that worked with this query:
> select * from item where match(name, description) against('foobar')
>
> I have a word 'red' that appears 5-10 times, in a tmp table of 60 records.
> If I run that query with 'red'
> select * from item where match(name, description) against('red');
> it returns empty set
>
> Upon reading, it looks like it is really trying to only get "unique" names
> from the index. But in my case the 'red' is a description that I would like
> to get back. Any way to force this to return results?
>
> Any info would be helpful. I have read, but it gets a little confusing
> first time through.
>
> Thanks,
> Scott
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
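
On Scott's question about forcing results: the 50% threshold described in the excerpt applies to natural language searches. Boolean full-text searches, which the manual lists as available from MySQL 4.0.1 and so should work on 4.0.15, do not apply that threshold. A hedged sketch against the item table from the queries above:

```sql
-- Boolean mode does not use the 50% threshold, so a word that appears
-- in many rows can still match. The leading + requires the word to be
-- present. Rows are not automatically sorted by relevance in this mode.
SELECT * FROM item
WHERE MATCH(name, description) AGAINST('+red' IN BOOLEAN MODE);
```

The minimum word length still applies in boolean mode, so a three-character word such as 'red' would only be found once ft_min_word_len has been lowered and the index rebuilt.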