Re: How boolean full-text search finds matches?

Sergei Golubchik Thu, 18 Dec 2003 06:19:38 -0800

Hi!

On Dec 17, Matt W wrote:
> Hi,
> 
> Just have a couple more full-text search inquiries here. :-)
> 
> I'm not exactly clear on how matching rows are found when searching for
> 2 or more required words: '+word1 +word2'.  I understand that it can't
> currently know which word occurs less, so that it can be searched
> first -- this optimization will come with 4.1's 2-level indexes. :-)
> 
> I just want to know, when it finds a match for whichever word is tried
> first, how does it check if the other required word(s) are present in
> the same row?  Say that word1 and word2 are each present in 100,000
> rows.
> 
> 1) Surely it doesn't check the 100,000 entries for word2 for EACH word1
> match to see if they're in the same row, does it?


No it does not :)
 
> 2) It *seems* the best way would be to do a lookup for (word2 + <rowid
> for word1>) and see if there's a match.  Is this what's done?  I'm not
> sure it's possible though with the way the index is structured...

it is possible, but it is only sensible if word1 is much more rare than
word1. This could be done with 2-level indexes :)
 
> 3) Or, and I'm thinking *maybe* this is how it's done from what I've
> heard, does it get all the matches for word1, then for word2, and then
> "intersect" them to find ones which are present in the same row?  If so,
> how will the 2-level index optimization change things? Will it do #2?

Yes to both questions, without the word "then".
First, one match is found for each word. Then read_next is called for
the word with the lowest rowid, etc. The advantage is that matches are
found and returned earlier - a user don't have to wait for the index
scan to complete. Also LIMIT, if used, cuts off more work, that is LIMIT
is more effective.

But when one word is much more common than the second one, it is better
to do #2, and it's what I'll probably do.
 
> Next question is... a few weeks ago I was doing some test searches like
> '+word1 +word2'.  Actually, maybe I was only using 1 word, I can't
> remember, but I don't think it matters.  Anyway, I happened to try
> changing the query to '+word1* +word2*' -- e.g. adding a wild-card to
> the end of the same word(s) -- and I was amazed at how much faster the
> query was! (And no, there's no query cache; and they were both run many
> times so the index was cached. :-)) Can't remember how much faster, but
> it wasn't insignificant.  Then I tried adding a wild-card to the end of
> words in another search (the wild-card did not make more rows match as
> far as I know), but that made it a little slower (I'd expect that, if
> anything).  Is there any explanation for why adding wild-cards would
> make a search faster?

Yes :)

Looking on the description above how "+word1 +word2" works, you can see
that it relies on the fact that key entries for each word are sorted by
rowid. But it is how MyISAM indexes work, they are always ordered by all
keyparts, and rowid is always the last keypart, so fulltext index is
always ORDER BY word, rowid.

But when the search is done for the word prefix only, this order breaks,
the index is, of course, not ORDER BY LEFT(word,5), rowid. Consider the
index

   aaaaaa, 5
   aaabbb, 10
   aaaccc, 3

The workaround was to exclude prefixes from index scan whenever possible
(that is in "+word +pref1* +pref2* ..." prefixes are not looked up in
the index at all, in "+pref1* +pref2* ..." only the first prefix is
looked up in the index) and to add a row scan (in the LIKE sense and as you
described in #1), to remove "half-matched" rows.

Thus it is sometimes slower - as more row data are read from the disk,
and sometimes faster - as there are less index scans performed.

Obvious optimization is to choose the most rare prefix for the scan, not
the first one. It is in the todo, of course :)

> P.S.  Sergei, if you see this, in one of your replies to my full-text
> suggestions back in September ( http://lists.mysql.com/mysql/149644 ),
> you said "Another reply will follow..."  I never saw another reply
> though. :-/  It's OK, I was just wondering what other interesting things
> you were going to say! :-)

Oops, checking...
Yes, sorry.
But I have to admit, I absolutely do not remember what I had in mind
when I wrote it and what should've "followed" :(
Sorry for this.

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/  www.mysql.com

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: How boolean full-text search finds matches?

Reply via email to