Re: How boolean full-text search finds matches?

Matt W Fri, 19 Dec 2003 04:28:05 -0800

Hi Sergei!

Thanks for replying again.  I hope I'm not wasting too much of your time
with my questions! :-)

<More below...>

----- Original Message -----
From: "Sergei Golubchik"
Sent: Thursday, December 18, 2003 7:17 AM
Subject: Re: How boolean full-text search finds matches?

> Hi!
>
> On Dec 17, Matt W wrote:
> > Hi,
> >
> > Just have a couple more full-text search inquiries here. :-)
> >
> > I'm not exactly clear on how matching rows are found when searching
for
> > 2 or more required words: '+word1 +word2'.  I understand that it
can't
> > currently know which word occurs less, so that it can be searched
> > first -- this optimization will come with 4.1's 2-level indexes. :-)
> >
> > I just want to know, when it finds a match for whichever word is
tried
> > first, how does it check if the other required word(s) are present
in
> > the same row?  Say that word1 and word2 are each present in 100,000
> > rows.
> >
> > 1) Surely it doesn't check the 100,000 entries for word2 for EACH
word1
> > match to see if they're in the same row, does it?
>
> No it does not :)

Good!

> > 2) It *seems* the best way would be to do a lookup for (word2 +
<rowid
> > for word1>) and see if there's a match.  Is this what's done?  I'm
not
> > sure it's possible though with the way the index is structured...
>
> it is possible, but it is only sensible if word1 is much more rare
than
> word1. This could be done with 2-level indexes :)

I assume that should say "if word2 is much more rare than word1."  I
guess that's because it would need too many [random?] index lookups
otherwise?

> > 3) Or, and I'm thinking *maybe* this is how it's done from what I've
> > heard, does it get all the matches for word1, then for word2, and
then
> > "intersect" them to find ones which are present in the same row?  If
so,
> > how will the 2-level index optimization change things? Will it do
#2?
>
> Yes to both questions, without the word "then".
> First, one match is found for each word. Then read_next is called for
> the word with the lowest rowid, etc.

Not completely clear on this. :-)  I get that one match is found for
each word... then whichever word occured first in the table (lowest
rowid) is... what? :-/

Oh, wait, I get what you mean! ;-)  You're saying that read_next is
called for the word with the lowest rowid until you see if the rowid
matches the rowid from the other word(s)?  Then if the rowid gets
greater than what you're looking for, you know that there's no matching
row?  (Since you say that each word is sorted by rowid -- see below
about that, though.)

Then I'm not sure what happens to find the next matching row.  Find a
match again for each word starting after the last found rowid?  I'm not
familiar enough with the MySQL code (or C) to understand what's going on
in ft_boolean_search.c. :-(

> The advantage is that matches are
> found and returned earlier - a user don't have to wait for the index
> scan to complete. Also LIMIT, if used, cuts off more work, that is
LIMIT
> is more effective.

Right, I know that LIMIT helps -- as long as there's no ORDER BY, etc.
that needs to find all rows to sort. :-)

That brings me to the main reason for asking these questions: for
searching on my forum system. There could be 5-10+ million posts, which
would put upwards of 250+ million entries in the full-text index.  I'll
probably use a LIMIT of 10-20,000 (w/o ORDER BY) to prevent searches
from taking too long (and returning way too many matches!).

However, with that many posts, I think it's possible that a search could
match 100k+ posts.  Then the search would have to be narrowed down to a
particular forum or date range.  If I add "AND forumid=123" etc. to the
WHERE clause, that will make the search hit the data file for
who-knows-how-many posts that aren't in the desired forum -- thus
scanning more data rows than specified by LIMIT.

But you told list member Steven Roussey one time that he could add the
forum id to a character column and include it in the full-text index.  I
thought I'd try that too -- along with some text for the month and year
for date range searches.  Although now I'm not sure that's a good idea,
because what if there's 1 million posts in a single forum (is it bad to
have the same "word" for a forum id in 1 million rows??) and a search is
done that would actually find < LIMIT rows *across all forums*?  If I
include the forum id in the search, the current full-text code will look
at all 1M "words" for that forum id.  Much slower than just doing the
search and manually checking forum id. :-(  Of course, if the situation
is reversed (common search in a *small* forum), this method would be
faster than manually checking.

Any other ideas for improving performance when searching a small subset
of the full-text index?  Too bad we can't include numeric columns in the
full-text index (which would work as usual in the WHERE, not in MATCH
()).

> But when one word is much more common than the second one, it is
better
> to do #2, and it's what I'll probably do.

Yeah, like for my example. :-)

> > Next question is... a few weeks ago I was doing some test searches
like
> > '+word1 +word2'.  Actually, maybe I was only using 1 word, I can't
> > remember, but I don't think it matters.  Anyway, I happened to try
> > changing the query to '+word1* +word2*' -- e.g. adding a wild-card
to
> > the end of the same word(s) -- and I was amazed at how much faster
the
> > query was! (And no, there's no query cache; and they were both run
many
> > times so the index was cached. :-)) Can't remember how much faster,
but
> > it wasn't insignificant.  Then I tried adding a wild-card to the end
of
> > words in another search (the wild-card did not make more rows match
as
> > far as I know), but that made it a little slower (I'd expect that,
if
> > anything).  Is there any explanation for why adding wild-cards would
> > make a search faster?
>
> Yes :)
>
> Looking on the description above how "+word1 +word2" works, you can
see
> that it relies on the fact that key entries for each word are sorted
by
> rowid. But it is how MyISAM indexes work, they are always ordered by
all
> keyparts, and rowid is always the last keypart, so fulltext index is
> always ORDER BY word, rowid.

What about weight?  I thought the index was <word, weight, rowid>.  So I
assumed each word was sorted by weight (even though boolean doesn't use
it, right? It's still in the index), then by rowid...

> But when the search is done for the word prefix only, this order
breaks,
> the index is, of course, not ORDER BY LEFT(word,5), rowid. Consider
the
> index
>
>    aaaaaa, 5
>    aaabbb, 10
>    aaaccc, 3
>
> The workaround was to exclude prefixes from index scan whenever
possible
> (that is in "+word +pref1* +pref2* ..." prefixes are not looked up in
> the index at all, in "+pref1* +pref2* ..." only the first prefix is
> looked up in the index) and to add a row scan (in the LIKE sense and
as you
> described in #1), to remove "half-matched" rows.

OK, I see. :-)  I'm surprised they're not looked up in the index.  But
after seeing the code, it looks like it has something to do with
preventing a row from being matched twice.  Not sure though.

Are you saying that for "+word +pref1* +pref2*", word and pref1 are
looked up in the index, or just word?  It sounds like you meant word and
pref1, but from a test I just did, it seems to be only word.

> Thus it is sometimes slower - as more row data are read from the disk,
> and sometimes faster - as there are less index scans performed.

Yep, that explains it!

> Obvious optimization is to choose the most rare prefix for the scan,
not
> the first one. It is in the todo, of course :)

Sounds great.  For now, if it uses the index for a wild-card and it
chooses the first one given in the query, this tells me that if I'm
using wild-cards, I should put the longest one first since it's more
probable that it would match fewer rows. :-)

> > P.S.  Sergei, if you see this, in one of your replies to my
full-text
> > suggestions back in September
( http://lists.mysql.com/mysql/149644 ),
> > you said "Another reply will follow..."  I never saw another reply
> > though. :-/  It's OK, I was just wondering what other interesting
things
> > you were going to say! :-)
>
> Oops, checking...
> Yes, sorry.
> But I have to admit, I absolutely do not remember what I had in mind
> when I wrote it and what should've "followed" :(
> Sorry for this.

Like I said, no problem.  I didn't know what more there was to say
either. :-)

Thanks very much in advance for any reply,

Matt

-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: How boolean full-text search finds matches?

Reply via email to