Re: Fulltext In Boolean Wildcard Questions

2005-03-25 Thread Jigal van Hemert
From: Freddie Bingham [EMAIL PROTECTED]
  (1) Why is such a query as this allowed?
 
  WHERE MATCH (text)
  AGAINST ('+s*' IN BOOLEAN MODE)
 
  This returns everything that starts with an 's', as expected
  by the syntax.
  Why is it not stopped by the mysql minimum word length?  I

The minimum word length still applies to the items that are found! A record
with so very cruel in the `text` column will certainly not be found.

  have no desire to allow a search for every word that begins
  with an 's' as it can be resourceful and will return results
  with questionable usage. So I'll take care of not allowing
  these searches in a higher level

It is always wise to check user input before applying it to a query!

  (2) Why does this query also return everything that begins with an 's'
 
  WHERE MATCH (text)
  AGAINST ('[EMAIL PROTECTED]' IN BOOLEAN MODE)
 
  Mysql is obviously silently stripping the list of garbage
  characters away from the word, which I assume was also done
  with the indexed list of words.

See: http://dev.mysql.com/doc/mysql/en/fulltext-search.html
--
 MySQL uses a very simple parser to split text into words. A ``word'' is any
sequence of true word characters (letters, digits, and underscores),
optionally separated by no more than one sequential ''' character. For
example, wasn't is parsed as a single word, but wasn''t is parsed as two
words wasn and t. (And then t would be ignored as too short according to the
rules following.) Also, single quotes at the ends of words are stripped;
only embedded single quotes are retained.
---

As documented, MySQL will 'ignore' those characters while building the index
as they are not part of a 'word' ('word' as defined in the paragraph I
quoted).

As far as the search 'word' is concerned, MySQL supports a limited list of
operator, so you can assume that everything else is silently ignored...

  I now have a serious issue since I what appears to be a long
  word, but is really only one character long.  Putting quotes
  around the word seems to force it to be taken literally but
  that also takes the * literally, breaking the wildcard
  search.  Where can I find a list of these characters that are
  silently stripped out?

http://dev.mysql.com/doc/mysql/en/fulltext-boolean.html lists the operators
that are supported:
 + -   () ~ * 

  WHERE MATCH (text)
  AGAINST ('[EMAIL PROTECTED]' IN BOOLEAN MODE)
 
  This returns all results that begin with a 't', but not 's'!

This will probably be equal to : '+t* s*' or: must contain word starting
with t, possible containing word(s) starting with s. Which is reflected in
these results:

text relevance
passe0
Type error   1
type sweetheart! 1.730698
supergranny  0

  then this query:
 
  WHERE MATCH(text) AGAINST ('[EMAIL PROTECTED]' IN
  BOOLEAN MODE)

'+ qwerty* s* t*'
The first + is not leading a word, so it will be ignored: 'qwerty* s* t*'

So sweet sour set sells sins per se 1
so  0
passe   0
Type error  1
type sweetheart!!   2
supergranny 1
qwertyuiop  1
qwertyuiop same type2

  Matches all words that beginning with a 's' or a 't'!
or beginning with 'qwerty'

  Anyone have a concise explanation just exactly how the
  wildcard character works in terms of real world strings such as this?

Hopefully the examples above are clear enough.

If you want to only support searches for words longer than say three
characters it would be wise to:
- only allow alphanumeric characters, single quotes, white space and
supported operators
- remove anything else and check for length of 'words'
If the entry passes these checks you can probably apply it to a query (after
escaping quotes!!). The checks can be performed fairly easy if you use some
kind of regular expressions in the application that uses the query.

Regards Jigal.


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



Fulltext In Boolean Wildcard Questions

2005-03-24 Thread Freddie Bingham
I have a few questions about wildcard usage with Fulltext searches (Mysql
4.1.9). I have tried to find relevant answers to these questions with no
success.
 
(1) Why is such a query as this allowed?
 
WHERE MATCH (text) 
AGAINST ('+s*' IN BOOLEAN MODE)
 
This returns everything that starts with an 's', as expected by the syntax.
Why is it not stopped by the mysql minimum word length?  I have no desire to
allow a search for every word that begins with an 's' as it can be
resourceful and will return results with questionable usage. So I'll take
care of not allowing these searches in a higher level but that leads me to
the next question.
 
(2) Why does this query also return everything that begins with an 's'
 
WHERE MATCH (text) 
AGAINST ('[EMAIL PROTECTED] mailto:'[EMAIL PROTECTED]s*' s*' IN BOOLEAN 
MODE)
 
Mysql is obviously silently stripping the list of garbage characters away
from the word, which I assume was also done with the indexed list of words.
I now have a serious issue since I what appears to be a long word, but is
really only one character long.  Putting quotes around the word seems to
force it to be taken literally but that also takes the * literally, breaking
the wildcard search.  Where can I find a list of these characters that are
silently stripped out?
 
Now take this query
 
WHERE MATCH (text) 
AGAINST ('[EMAIL PROTECTED] mailto:'[EMAIL PROTECTED]s*' s*' IN BOOLEAN 
MODE)
 
This returns all results that begin with a 't', but not 's'!
 
then this query:
 
WHERE MATCH(text) AGAINST ('[EMAIL PROTECTED]' IN BOOLEAN MODE)
 
Matches all words that beginning with a 's' or a 't'!
 
Anyone have a concise explanation just exactly how the wildcard character
works in terms of real world strings such as this?
 


RE: Fulltext In Boolean Wildcard Questions

2005-03-24 Thread Freddie Bingham
It appears my email client took some liberties with converting portions of
my queries into mailto links. I've fixed them below. Sorry!

 I have a few questions about wildcard usage with Fulltext 
 searches (Mysql 4.1.9). I have tried to find relevant answers 
 to these questions with no success.
  
 (1) Why is such a query as this allowed?
  
 WHERE MATCH (text)
 AGAINST ('+s*' IN BOOLEAN MODE)
  
 This returns everything that starts with an 's', as expected 
 by the syntax.
 Why is it not stopped by the mysql minimum word length?  I 
 have no desire to allow a search for every word that begins 
 with an 's' as it can be resourceful and will return results 
 with questionable usage. So I'll take care of not allowing 
 these searches in a higher level but that leads me to the 
 next question.
  
 (2) Why does this query also return everything that begins with an 's'
  
 WHERE MATCH (text)
 AGAINST ('[EMAIL PROTECTED]' IN BOOLEAN MODE)
  
 Mysql is obviously silently stripping the list of garbage 
 characters away from the word, which I assume was also done 
 with the indexed list of words.
 I now have a serious issue since I what appears to be a long 
 word, but is really only one character long.  Putting quotes 
 around the word seems to force it to be taken literally but 
 that also takes the * literally, breaking the wildcard 
 search.  Where can I find a list of these characters that are 
 silently stripped out?
  
 Now take this query
  
 WHERE MATCH (text)
 AGAINST ('[EMAIL PROTECTED]' IN BOOLEAN MODE)
  
 This returns all results that begin with a 't', but not 's'!
  
 then this query:
  
 WHERE MATCH(text) AGAINST ('[EMAIL PROTECTED]' IN 
 BOOLEAN MODE)
  
 Matches all words that beginning with a 's' or a 't'!
  
 Anyone have a concise explanation just exactly how the 
 wildcard character works in terms of real world strings such as this?


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]