Hi!

On Dec 19, Matt W wrote:
> > >
> > > Just have a couple more full-text search inquiries here. :-)
> > >
> > > I'm not exactly clear on how matching rows are found when searching for
> > > 2 or more required words: '+word1 +word2'.  I understand that it can't
> > > currently know which word occurs less, so that it can be searched
> > > first -- this optimization will come with 4.1's 2-level indexes. :-)
> > >
> > > I just want to know, when it finds a match for whichever word is tried
> > > first, how does it check if the other required word(s) are present in
> > > the same row?  Say that word1 and word2 are each present in 100,000
> > > rows.
> > >
> > > 2) It *seems* the best way would be to do a lookup for (word2 + <rowid
> > > for word1>) and see if there's a match.  Is this what's done?  I'm not
> > > sure it's possible though with the way the index is structured...
> >
> > it is possible, but it is only sensible if word1 is much more rare than
> > word1. This could be done with 2-level indexes :)
> 
> I assume that should say "if word2 is much more rare than word1."  I
> guess that's because it would need too many [random?] index lookups
> otherwise?

No, if word1 is more rare. In your approach, as far as I understand, one
does an index scan over all occurrences of word1 and an index lookup on
"word2+<rowid for word1>" for each word1 found. Thus the total number of
index lookups is 2*(number_of_word1s), and for the best performance
word1 should be rarer than word2.
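
To put made-up numbers on it: if word1 occurs 100 times and word2
occurs 100,000 times, scanning word1 costs about 100 reads plus 100
probes into word2 - some 200 operations. Scanning word2 instead would
cost about 200,000.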

> > > 3) Or, and I'm thinking *maybe* this is how it's done from what I've
> > > heard, does it get all the matches for word1, then for word2, and then
> > > "intersect" them to find ones which are present in the same row?  If so,
> > > how will the 2-level index optimization change things? Will it do #2?
> >
> > Yes to both questions, without the word "then".
> > First, one match is found for each word. Then read_next is called for
> > the word with the lowest rowid, etc.
> 
> Not completely clear on this. :-)  I get that one match is found for
> each word... then whichever word occurred first in the table (lowest
> rowid) is... what? :-/
> 
> Oh, wait, I get what you mean! ;-)  You're saying that read_next is
> called for the word with the lowest rowid until you see if the rowid
> matches the rowid from the other word(s)?  Then if the rowid gets
> greater than what you're looking for, you know that there's no matching
> row?  (Since you say that each word is sorted by rowid -- see below
> about that, though.)
> 
> Then I'm not sure what happens to find the next matching row.  Find a
> match again for each word starting after the last found rowid?  I'm not
> familiar enough with the MySQL code (or C) to understand what's going on
> in ft_boolean_search.c. :-(

I'm using a "priority heap" - it's a data structure that can efficiently
find the smallest (or largest - depending on the comparison function :)
element of a list.  http://www.nist.gov/dads/HTML/priorityque.html
In our implementation of heaps the top element is the smallest one -
think of lighter elements floating up, and heavier ones sinking down.

The pseudocode is like this:

for each word[i] {
  rowid=index_read(word[i]);
  heap->add( <word[i], rowid> ); // add a struct of two elements;
                                 // rowid is the sorting key
}

while heap->is_not_empty() {
  current=heap->top->rowid; // get the smallest rowid
  ... init matching tree and counters ...
  while heap->top->rowid==current {
    word=heap->top->word;
    rowid=index_next(word);
    heap->replace_top(word, rowid); // after this replace() the word
                          // sinks down and another word is on the top;
                          // if index_next() finds no more entries for
                          // the word, it is removed from the heap
    ... update matching tree and counters accordingly ...
  }
  // no more words for rowid == current
  ... check matching tree and counters ...
  if (query is satisfied) return current; // rowid of the matching row
}
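
That is, a boolean query like this one (table and column names are
made up) walks both word lists in parallel, ordered by rowid:

  # hypothetical table and words, for illustration only
  SELECT id FROM posts
  WHERE MATCH (body) AGAINST ('+word1 +word2' IN BOOLEAN MODE);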
 
> > The advantage is that matches are
> > found and returned earlier - a user doesn't have to wait for the index
> > scan to complete. Also LIMIT, if used, cuts off more work, that is LIMIT
> > is more effective.
> 
> Right, I know that LIMIT helps -- as long as there's no ORDER BY, etc.
> that needs to find all rows to sort. :-)
> 
> That brings me to the main reason for asking these questions: for
> searching on my forum system. There could be 5-10+ million posts, which
> would put upwards of 250+ million entries in the full-text index.  I'll
> probably use a LIMIT of 10-20,000 (w/o ORDER BY) to prevent searches
> from taking too long (and returning way too many matches!).
> 
> However, with that many posts, I think it's possible that a search could
> match 100k+ posts.  Then the search would have to be narrowed down to a
> particular forum or date range.  If I add "AND forumid=123" etc. to the
> WHERE clause, that will make the search hit the data file for
> who-knows-how-many posts that aren't in the desired forum -- thus
> scanning more data rows than specified by LIMIT.
> 
> But you told list member Steven Roussey one time that he could add the
> forum id to a character column and include it in the full-text index.

Actually, it was he who wrote about this trick, not me. :)
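
For reference, the trick would look something like this (all names are
made up; the forum id just has to become something the fulltext parser
sees as an ordinary word):

  # 'posts', 'meta' and 'forum_id' are hypothetical names
  ALTER TABLE posts ADD meta VARCHAR(50) NOT NULL;
  UPDATE posts SET meta=CONCAT('forumid', forum_id);
  ALTER TABLE posts ADD FULLTEXT (meta, body);

  # search for word1 within forum 123 only
  SELECT id FROM posts
  WHERE MATCH (meta, body)
        AGAINST ('+forumid123 +word1' IN BOOLEAN MODE);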

> I thought I'd try that too -- along with some text for the month and year
> for date range searches.  Although now I'm not sure that's a good idea,
> because what if there's 1 million posts in a single forum (is it bad to
> have the same "word" for a forum id in 1 million rows??) and a search is
> done that would actually find < LIMIT rows *across all forums*?  If I
> include the forum id in the search, the current full-text code will look
> at all 1M "words" for that forum id.  Much slower than just doing the
> search and manually checking forum id. :-(  Of course, if the situation
> is reversed (common search in a *small* forum), this method would be
> faster than manually checking.
> 
> Any other ideas for improving performance when searching a small subset
> of the full-text index?  Too bad we can't include numeric columns in the
> full-text index (which would work as usual in the WHERE, not in MATCH
> ()).

I've thought about this feature - but all the ideas I've had so far
make natural language search almost impossible. It relies on global,
collection-wide statistics, and a combined index would need to know
these statistics for any subset of the tree, as in

  SELECT ... WHERE forumid=123 AND match ...
  SELECT ... WHERE forumid=123 OR forumid=124 AND match ...
  SELECT ... WHERE forumid>15 AND forumid<25 AND match ...
  SELECT ... WHERE forumid>23 AND match ...

So - I'm still thinking about it...

As for other ideas, well, if the number of forums is reasonably small,
you can create one table per forum and combine them with a MERGE table.
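
Roughly like this (made-up names; note that MATCH cannot be used on the
MERGE table itself - it serves the ordinary queries that span all
forums, while fulltext searches go to the per-forum tables):

  # one MyISAM table per forum, each with its own fulltext index
  CREATE TABLE posts_forum1 (
    id INT NOT NULL,
    body TEXT,
    FULLTEXT (body)
  ) TYPE=MyISAM;

  # the MERGE table combines them for non-fulltext queries
  CREATE TABLE posts_all (
    id INT NOT NULL,
    body TEXT
  ) TYPE=MERGE UNION=(posts_forum1,posts_forum2);

  # a forum-restricted search hits one small fulltext index
  SELECT id FROM posts_forum1
  WHERE MATCH (body) AGAINST ('+word1 +word2' IN BOOLEAN MODE);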

> > Looking on the description above how "+word1 +word2" works, you can see
> > that it relies on the fact that key entries for each word are sorted by
> > rowid. But it is how MyISAM indexes work, they are always ordered by all
> > keyparts, and rowid is always the last keypart, so fulltext index is
> > always ORDER BY word, rowid.
> 
> What about weight?  I thought the index was <word, weight, rowid>.  So I
> assumed each word was sorted by weight (even though boolean doesn't use
> it, right? It's still in the index), then by rowid...

Right.
But I found that I needed ORDER BY word, rowid, so I introduced a new
keyseg flag in MyISAM - "ignore me when sorting". That's why the
FULLTEXT index is not sorted by weight.

> > But when the search is done for the word prefix only, this order breaks,
> > the index is, of course, not ORDER BY LEFT(word,5), rowid. Consider the
> > index
> >
> >    aaaaaa, 5
> >    aaabbb, 10
> >    aaaccc, 3
> >
> > The workaround was to exclude prefixes from index scan whenever possible
> > (that is in "+word +pref1* +pref2* ..." prefixes are not looked up in
> > the index at all, in "+pref1* +pref2* ..." only the first prefix is
> > looked up in the index) and to add a row scan (in the LIKE sense and as you
> > described in #1), to remove "half-matched" rows.
> 
> OK, I see. :-)  I'm surprised they're not looked up in the index.  But
> after seeing the code, it looks like it has something to do with
> preventing a row from being matched twice.  Not sure though.

Yes, this too.
 
> Are you saying that for "+word +pref1* +pref2*", word and pref1 are
> looked up in the index, or just word?  It sounds like you meant word and
> pref1, but from a test I just did, it seems to be only word.

No, only +word. Because if the query is "+word +pref1* +pref2*",
then +pref1* and +pref2* cannot really ADD new rows to the result - they
can only REMOVE rows from what was found by +word. And as the row is
scanned anyway when trunc* operators are used, there's no need to look
+pref1* up in the index.

Of course, this trick doesn't work when a prefix actually CAN add new
rows to the result, as in "word pref1* pref2*", where every term is
optional.
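
For example (made-up words and table):

  # only +word is looked up in the index; the prefixes are
  # verified by scanning the matched rows
  SELECT id FROM posts
  WHERE MATCH (body) AGAINST ('+word +pref1* +pref2*' IN BOOLEAN MODE);

  # here every term is optional, so the prefixes can contribute
  # rows of their own and must take part in the index scan
  SELECT id FROM posts
  WHERE MATCH (body) AGAINST ('word pref1* pref2*' IN BOOLEAN MODE);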

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/  www.mysql.com
