Re: How boolean full-text search finds matches?

2003-12-19 Thread Matt W
Hi Sergei!

Thanks for replying again.  I hope I'm not wasting too much of your time
with my questions! :-)

More below...

- Original Message -
From: Sergei Golubchik
Sent: Thursday, December 18, 2003 7:17 AM
Subject: Re: How boolean full-text search finds matches?


 Hi!

 On Dec 17, Matt W wrote:
  Hi,
 
  Just have a couple more full-text search inquiries here. :-)
 
  I'm not exactly clear on how matching rows are found when searching
for
  2 or more required words: '+word1 +word2'.  I understand that it
can't
  currently know which word occurs less, so that it can be searched
  first -- this optimization will come with 4.1's 2-level indexes. :-)
 
  I just want to know, when it finds a match for whichever word is
tried
  first, how does it check if the other required word(s) are present
in
  the same row?  Say that word1 and word2 are each present in 100,000
  rows.
 
  1) Surely it doesn't check the 100,000 entries for word2 for EACH
word1
  match to see if they're in the same row, does it?

 No it does not :)

Good!


  2) It *seems* the best way would be to do a lookup for (word2 +
rowid
  for word1) and see if there's a match.  Is this what's done?  I'm
not
  sure it's possible though with the way the index is structured...

 it is possible, but it is only sensible if word1 is much more rare
than
 word1. This could be done with 2-level indexes :)

I assume that should say if word2 is much more rare than word1.  I
guess that's because it would need too many [random?] index lookups
otherwise?


  3) Or, and I'm thinking *maybe* this is how it's done from what I've
  heard, does it get all the matches for word1, then for word2, and
then
  intersect them to find ones which are present in the same row?  If
so,
  how will the 2-level index optimization change things? Will it do
#2?

 Yes to both questions, without the word then.
 First, one match is found for each word. Then read_next is called for
 the word with the lowest rowid, etc.

Not completely clear on this. :-)  I get that one match is found for
each word... then whichever word occured first in the table (lowest
rowid) is... what? :-/

Oh, wait, I get what you mean! ;-)  You're saying that read_next is
called for the word with the lowest rowid until you see if the rowid
matches the rowid from the other word(s)?  Then if the rowid gets
greater than what you're looking for, you know that there's no matching
row?  (Since you say that each word is sorted by rowid -- see below
about that, though.)

Then I'm not sure what happens to find the next matching row.  Find a
match again for each word starting after the last found rowid?  I'm not
familiar enough with the MySQL code (or C) to understand what's going on
in ft_boolean_search.c. :-(


 The advantage is that matches are
 found and returned earlier - a user don't have to wait for the index
 scan to complete. Also LIMIT, if used, cuts off more work, that is
LIMIT
 is more effective.

Right, I know that LIMIT helps -- as long as there's no ORDER BY, etc.
that needs to find all rows to sort. :-)

That brings me to the main reason for asking these questions: for
searching on my forum system. There could be 5-10+ million posts, which
would put upwards of 250+ million entries in the full-text index.  I'll
probably use a LIMIT of 10-20,000 (w/o ORDER BY) to prevent searches
from taking too long (and returning way too many matches!).

However, with that many posts, I think it's possible that a search could
match 100k+ posts.  Then the search would have to be narrowed down to a
particular forum or date range.  If I add AND forumid=123 etc. to the
WHERE clause, that will make the search hit the data file for
who-knows-how-many posts that aren't in the desired forum -- thus
scanning more data rows than specified by LIMIT.

But you told list member Steven Roussey one time that he could add the
forum id to a character column and include it in the full-text index.  I
thought I'd try that too -- along with some text for the month and year
for date range searches.  Although now I'm not sure that's a good idea,
because what if there's 1 million posts in a single forum (is it bad to
have the same word for a forum id in 1 million rows??) and a search is
done that would actually find  LIMIT rows *across all forums*?  If I
include the forum id in the search, the current full-text code will look
at all 1M words for that forum id.  Much slower than just doing the
search and manually checking forum id. :-(  Of course, if the situation
is reversed (common search in a *small* forum), this method would be
faster than manually checking.

Any other ideas for improving performance when searching a small subset
of the full-text index?  Too bad we can't include numeric columns in the
full-text index (which would work as usual in the WHERE, not in MATCH
()).


 But when one word is much more common than the second one, it is
better
 to do #2, and it's what I'll probably do.

Yeah, like for my example. :-)


  Next

Re: How boolean full-text search finds matches?

2003-12-18 Thread Sergei Golubchik
Hi!

On Dec 17, Matt W wrote:
 Hi,
 
 Just have a couple more full-text search inquiries here. :-)
 
 I'm not exactly clear on how matching rows are found when searching for
 2 or more required words: '+word1 +word2'.  I understand that it can't
 currently know which word occurs less, so that it can be searched
 first -- this optimization will come with 4.1's 2-level indexes. :-)
 
 I just want to know, when it finds a match for whichever word is tried
 first, how does it check if the other required word(s) are present in
 the same row?  Say that word1 and word2 are each present in 100,000
 rows.
 
 1) Surely it doesn't check the 100,000 entries for word2 for EACH word1
 match to see if they're in the same row, does it?

No it does not :)
 
 2) It *seems* the best way would be to do a lookup for (word2 + rowid
 for word1) and see if there's a match.  Is this what's done?  I'm not
 sure it's possible though with the way the index is structured...

it is possible, but it is only sensible if word1 is much more rare than
word1. This could be done with 2-level indexes :)
 
 3) Or, and I'm thinking *maybe* this is how it's done from what I've
 heard, does it get all the matches for word1, then for word2, and then
 intersect them to find ones which are present in the same row?  If so,
 how will the 2-level index optimization change things? Will it do #2?

Yes to both questions, without the word then.
First, one match is found for each word. Then read_next is called for
the word with the lowest rowid, etc. The advantage is that matches are
found and returned earlier - a user don't have to wait for the index
scan to complete. Also LIMIT, if used, cuts off more work, that is LIMIT
is more effective.

But when one word is much more common than the second one, it is better
to do #2, and it's what I'll probably do.
 
 Next question is... a few weeks ago I was doing some test searches like
 '+word1 +word2'.  Actually, maybe I was only using 1 word, I can't
 remember, but I don't think it matters.  Anyway, I happened to try
 changing the query to '+word1* +word2*' -- e.g. adding a wild-card to
 the end of the same word(s) -- and I was amazed at how much faster the
 query was! (And no, there's no query cache; and they were both run many
 times so the index was cached. :-)) Can't remember how much faster, but
 it wasn't insignificant.  Then I tried adding a wild-card to the end of
 words in another search (the wild-card did not make more rows match as
 far as I know), but that made it a little slower (I'd expect that, if
 anything).  Is there any explanation for why adding wild-cards would
 make a search faster?

Yes :)

Looking on the description above how +word1 +word2 works, you can see
that it relies on the fact that key entries for each word are sorted by
rowid. But it is how MyISAM indexes work, they are always ordered by all
keyparts, and rowid is always the last keypart, so fulltext index is
always ORDER BY word, rowid.

But when the search is done for the word prefix only, this order breaks,
the index is, of course, not ORDER BY LEFT(word,5), rowid. Consider the
index

   aa, 5
   aaabbb, 10
   aaaccc, 3

The workaround was to exclude prefixes from index scan whenever possible
(that is in +word +pref1* +pref2* ... prefixes are not looked up in
the index at all, in +pref1* +pref2* ... only the first prefix is
looked up in the index) and to add a row scan (in the LIKE sense and as you
described in #1), to remove half-matched rows.

Thus it is sometimes slower - as more row data are read from the disk,
and sometimes faster - as there are less index scans performed.

Obvious optimization is to choose the most rare prefix for the scan, not
the first one. It is in the todo, of course :)

 P.S.  Sergei, if you see this, in one of your replies to my full-text
 suggestions back in September ( http://lists.mysql.com/mysql/149644 ),
 you said Another reply will follow...  I never saw another reply
 though. :-/  It's OK, I was just wondering what other interesting things
 you were going to say! :-)

Oops, checking...
Yes, sorry.
But I have to admit, I absolutely do not remember what I had in mind
when I wrote it and what should've followed :(
Sorry for this.

Regards,
Sergei

-- 
   __  ___ ___   __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik [EMAIL PROTECTED]
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
   ___/  www.mysql.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



How boolean full-text search finds matches?

2003-12-17 Thread Matt W
Hi,

Just have a couple more full-text search inquiries here. :-)

I'm not exactly clear on how matching rows are found when searching for
2 or more required words: '+word1 +word2'.  I understand that it can't
currently know which word occurs less, so that it can be searched
first -- this optimization will come with 4.1's 2-level indexes. :-)

I just want to know, when it finds a match for whichever word is tried
first, how does it check if the other required word(s) are present in
the same row?  Say that word1 and word2 are each present in 100,000
rows.

1) Surely it doesn't check the 100,000 entries for word2 for EACH word1
match to see if they're in the same row, does it?

2) It *seems* the best way would be to do a lookup for (word2 + rowid
for word1) and see if there's a match.  Is this what's done?  I'm not
sure it's possible though with the way the index is structured...

3) Or, and I'm thinking *maybe* this is how it's done from what I've
heard, does it get all the matches for word1, then for word2, and then
intersect them to find ones which are present in the same row?  If so,
how will the 2-level index optimization change things? Will it do #2?


Next question is... a few weeks ago I was doing some test searches like
'+word1 +word2'.  Actually, maybe I was only using 1 word, I can't
remember, but I don't think it matters.  Anyway, I happened to try
changing the query to '+word1* +word2*' -- e.g. adding a wild-card to
the end of the same word(s) -- and I was amazed at how much faster the
query was! (And no, there's no query cache; and they were both run many
times so the index was cached. :-)) Can't remember how much faster, but
it wasn't insignificant.  Then I tried adding a wild-card to the end of
words in another search (the wild-card did not make more rows match as
far as I know), but that made it a little slower (I'd expect that, if
anything).  Is there any explanation for why adding wild-cards would
make a search faster?


Thanks in advance!

Matt

P.S.  Sergei, if you see this, in one of your replies to my full-text
suggestions back in September ( http://lists.mysql.com/mysql/149644 ),
you said Another reply will follow...  I never saw another reply
though. :-/  It's OK, I was just wondering what other interesting things
you were going to say! :-)


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]



How boolean full-text search finds matches?

2003-12-17 Thread Matt W
Hi,

Just have a couple more full-text search inquiries here. :-)

I'm not exactly clear on how matching rows are found when searching for
2 or more required words: '+word1 +word2'.  I understand that it can't
currently know which word occurs less, so that it can be searched
first -- this optimization will come with 4.1's 2-level indexes. :-)

I just want to know, when it finds a match for whichever word is tried
first, how does it check if the other required word(s) are present in
the same row?  Say that word1 and word2 are each present in 100,000
rows.

1) Surely it doesn't check the 100,000 entries for word2 for EACH word1
match to see if they're in the same row, does it?

2) It *seems* the best way would be to do a lookup for (word2 + rowid
for word1) and see if there's a match.  Is this what's done?  I'm not
sure it's possible though with the way the index is structured...

3) Or, and I'm thinking *maybe* this is how it's done from what I've
heard, does it get all the matches for word1, then for word2, and then
intersect them to find ones which are present in the same row?  If so,
how will the 2-level index optimization change things? Will it do #2?


Next question is... a few weeks ago I was doing some test searches like
'+word1 +word2'.  Actually, maybe I was only using 1 word, I can't
remember, but I don't think it matters.  Anyway, I happened to try
changing the query to '+word1* +word2*' -- e.g. adding a wild-card to
the end of the same word(s) -- and I was amazed at how much faster the
query was! (And no, there's no query cache; and they were both run many
times so the index was cached. :-)) Can't remember how much faster, but
it wasn't insignificant.  Then I tried adding a wild-card to the end of
words in another search (the wild-card did not make more rows match as
far as I know), but that made it a little slower (I'd expect that, if
anything).  Is there any explanation for why adding wild-cards would
make a search faster?


Thanks in advance!

Matt

P.S.  Sergei, if you see this, in one of your replies to my full-text
suggestions back in September ( http://lists.mysql.com/mysql/149644 ),
you said Another reply will follow...  I never saw another reply
though. :-/  It's OK, I was just wondering what other interesting things
you were going to say! :-)


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]